[
  {
    "path": ".gitignore",
    "content": "*/**/__pycache__\nbuild/\ntest/logs/*\n!.gitkeep\ngds/**/*.gltf\n\n.DS_Store\nresults.xml"
  },
  {
    "path": "Makefile",
    "content": ".PHONY: test compile\n\nexport LIBPYTHON_LOC=$(shell cocotb-config --libpython)\n\ntest_%:\n\t$(MAKE) compile\n\tiverilog -o build/sim.vvp -s gpu -g2012 build/gpu.v\n\tMODULE=test.test_$* vvp -M $$(cocotb-config --prefix)/cocotb/libs -m libcocotbvpi_icarus build/sim.vvp\n\ncompile:\n\t$(MAKE) compile_alu\n\tsv2v -I src/* -w build/gpu.v\n\techo \"\" >> build/gpu.v\n\tcat build/alu.v >> build/gpu.v\n\techo '`timescale 1ns/1ns' > build/temp.v\n\tcat build/gpu.v >> build/temp.v\n\tmv build/temp.v build/gpu.v\n\ncompile_%:\n\tsv2v -w build/$*.v src/$*.sv\n\n# TODO: Get gtkwave visualization\n\nshow_%: %.vcd %.gtkw\n\tgtkwave $^\n"
  },
  {
    "path": "README.md",
    "content": "# tiny-gpu\n\nA minimal GPU implementation in Verilog optimized for learning about how GPUs work from the ground up.\n\nBuilt with <15 files of fully documented Verilog, complete documentation on architecture & ISA, working matrix addition/multiplication kernels, and full support for kernel simulation & execution traces.\n\n### Table of Contents\n\n- [Overview](#overview)\n- [Architecture](#architecture)\n  - [GPU](#gpu)\n  - [Memory](#memory)\n  - [Core](#core)\n- [ISA](#isa)\n- [Execution](#execution)\n  - [Core](#core-1)\n  - [Thread](#thread)\n- [Kernels](#kernels)\n  - [Matrix Addition](#matrix-addition)\n  - [Matrix Multiplication](#matrix-multiplication)\n- [Simulation](#simulation)\n- [Advanced Functionality](#advanced-functionality)\n- [Next Steps](#next-steps)\n\n# Overview\n\nIf you want to learn how a CPU works all the way from architecture to control signals, there are many resources online to help you.\n\nGPUs are not the same.\n\nBecause the GPU market is so competitive, low-level technical details for all modern architectures remain proprietary.\n\nWhile there are lots of resources to learn about GPU programming, there's almost nothing available to learn about how GPUs work at the hardware level.\n\nThe best option is to go through open-source GPU implementations like [Miaow](https://github.com/VerticalResearchGroup/miaow) and [VeriGPU](https://github.com/hughperkins/VeriGPU/tree/main) and try to figure out what's going on. 
This is challenging since these projects aim at being feature complete and functional, so they're quite complex.\n\nThis is why I built `tiny-gpu`!\n\n## What is tiny-gpu?\n\n> [!IMPORTANT]\n>\n> **tiny-gpu** is a minimal GPU implementation optimized for learning about how GPUs work from the ground up.\n>\n> Specifically, with the trend toward general-purpose GPUs (GPGPUs) and ML-accelerators like Google's TPU, tiny-gpu focuses on highlighting the general principles of all of these architectures, rather than on the details of graphics-specific hardware.\n\nWith this motivation in mind, we can simplify GPUs by cutting out the majority of complexity involved with building a production-grade graphics card, and focus on the core elements that are critical to all of these modern hardware accelerators.\n\nThis project is primarily focused on exploring:\n\n1. **Architecture** - What does the architecture of a GPU look like? What are the most important elements?\n2. **Parallelization** - How is the SIMD programming model implemented in hardware?\n3. **Memory** - How does a GPU work around the constraints of limited memory bandwidth?\n\nAfter understanding the fundamentals laid out in this project, you can check out the [advanced functionality section](#advanced-functionality) to understand some of the most important optimizations made in production-grade GPUs (that are more challenging to implement) which improve performance.\n\n# Architecture\n\n<p float=\"left\">\n  <img src=\"/docs/images/gpu.png\" alt=\"GPU\" width=\"48%\">\n  <img src=\"/docs/images/core.png\" alt=\"Core\" width=\"48%\">\n</p>\n\n## GPU\n\ntiny-gpu is built to execute a single kernel at a time.\n\nIn order to launch a kernel, we need to do the following:\n\n1. Load global program memory with the kernel code\n2. Load data memory with the necessary data\n3. Specify the number of threads to launch in the device control register\n4. 
Launch the kernel by setting the start signal to high.\n\nThe GPU itself consists of the following units:\n\n1. Device control register\n2. Dispatcher\n3. Variable number of compute cores\n4. Memory controllers for data memory & program memory\n5. Cache\n\n### Device Control Register\n\nThe device control register usually stores metadata specifying how kernels should be executed on the GPU.\n\nIn this case, the device control register just stores the `thread_count` - the total number of threads to launch for the active kernel.\n\n### Dispatcher\n\nOnce a kernel is launched, the dispatcher is the unit that actually manages the distribution of threads to different compute cores.\n\nThe dispatcher organizes threads into groups called **blocks**, each of which can be executed in parallel on a single core, and sends these blocks off to be processed by available cores.\n\nOnce all blocks have been processed, the dispatcher reports back that the kernel execution is done.\n\n## Memory\n\nThe GPU is built to interface with an external global memory. 
Here, data memory and program memory are separated out for simplicity.\n\n### Global Memory\n\ntiny-gpu data memory has the following specifications:\n\n- 8 bit addressability (256 total rows of data memory)\n- 8 bit data (stores values of <256 for each row)\n\ntiny-gpu program memory has the following specifications:\n\n- 8 bit addressability (256 rows of program memory)\n- 16 bit data (each instruction is 16 bits as specified by the ISA)\n\n### Memory Controllers\n\nGlobal memory has fixed read/write bandwidth, but there may be far more incoming requests across all cores to access data from memory than the external memory is actually able to handle.\n\nThe memory controllers keep track of all the outgoing requests to memory from the compute cores, throttle requests based on actual external memory bandwidth, and relay responses from external memory back to the proper resources.\n\nEach memory controller has a fixed number of channels based on the bandwidth of global memory.\n\n### Cache (WIP)\n\nThe same data is often requested from global memory by multiple cores. Repeatedly accessing global memory is expensive, and since the data has already been fetched once, it would be more efficient to store it on device in SRAM so it can be retrieved much more quickly on later requests.\n\nThis is exactly what the cache is used for. Data retrieved from external memory is stored in cache and can be retrieved from there on later requests, freeing up memory bandwidth to be used for new data.\n\n## Core\n\nEach core has a number of compute resources, often built around a certain number of threads it can support. In order to maximize parallelization, these resources need to be managed optimally to maximize resource utilization.\n\nIn this simplified GPU, each core processes one **block** at a time, and for each thread in a block, the core has a dedicated ALU, LSU, PC, and register file. 
Managing the execution of thread instructions on these resources is one of the most challenging problems in GPUs.\n\n### Scheduler\n\nEach core has a single scheduler that manages the execution of threads.\n\nThe tiny-gpu scheduler executes instructions for a single block to completion before picking up a new block, and it executes instructions for all threads in-sync and sequentially.\n\nIn more advanced schedulers, techniques like **pipelining** are used to stream the execution of multiple subsequent instructions before previous instructions are fully complete, maximizing resource utilization. Additionally, **warp scheduling** can be used to execute multiple batches of threads within a block in parallel.\n\nThe main constraint the scheduler has to work around is the latency associated with loading & storing data from global memory. While most instructions can be executed synchronously, these load-store operations are asynchronous, meaning the rest of the instruction execution has to be built around these long wait times.\n\n### Fetcher\n\nAsynchronously fetches the instruction at the current program counter from program memory (most should actually be fetching from cache after a single block is executed).\n\n### Decoder\n\nDecodes the fetched instruction into control signals for thread execution.\n\n### Register Files\n\nEach thread has its own dedicated register file. The register files hold the data that each thread is performing computations on, which enables the single-instruction, multiple-data (SIMD) pattern.\n\nImportantly, each register file contains a few read-only registers holding data about the current block & thread being executed locally, enabling kernels to be executed with different data based on the local thread id.\n\n### ALUs\n\nDedicated arithmetic-logic unit for each thread to perform computations. 
Handles the `ADD`, `SUB`, `MUL`, `DIV` arithmetic instructions.\n\nAlso handles the `CMP` comparison instruction, which outputs whether the difference between two registers is negative, zero or positive - and stores the result in the `NZP` register in the PC unit.\n\n### LSUs\n\nDedicated load-store unit for each thread to access global data memory.\n\nHandles the `LDR` & `STR` instructions - and handles async wait times for memory requests to be processed and relayed by the memory controller.\n\n### PCs\n\nDedicated program-counter for each thread to determine the next instructions to execute on each thread.\n\nBy default, the PC increments by 1 after every instruction.\n\nWith the `BRnzp` instruction, the PC unit checks whether the NZP register (set by a previous `CMP` instruction) matches some case - and if it does, it will branch to a specific line of program memory. _This is how loops and conditionals are implemented._\n\nSince threads are processed in parallel, tiny-gpu assumes that all threads \"converge\" to the same program counter after each instruction - which is a naive assumption for the sake of simplicity.\n\nIn real GPUs, individual threads can branch to different PCs, causing **branch divergence** where a group of threads initially being processed together has to split out into separate execution.\n\n# ISA\n\n![ISA](/docs/images/isa.png)\n\ntiny-gpu implements a simple 11 instruction ISA built to enable simple kernels for proof-of-concept like matrix addition & matrix multiplication (implementation further down on this page).\n\nFor these purposes, it supports the following instructions:\n\n- `BRnzp` - Branch instruction to jump to another line of program memory if the NZP register matches the `nzp` condition in the instruction.\n- `CMP` - Compare the value of two registers and store the result in the NZP register to use for a later `BRnzp` instruction.\n- `ADD`, `SUB`, `MUL`, `DIV` - Basic arithmetic operations to 
enable tensor math.\n- `LDR` - Load data from global memory.\n- `STR` - Store data into global memory.\n- `CONST` - Load a constant value into a register.\n- `RET` - Signal that the current thread has reached the end of execution.\n\nEach register is specified by 4 bits, meaning that there are 16 total registers. The first 13 registers, `R0` - `R12`, are free registers that support read/write. The last 3 registers are special read-only registers used to supply the `%blockIdx`, `%blockDim`, and `%threadIdx` values critical to SIMD.\n\n# Execution\n\n### Core\n\nEach core steps through the following stages to execute each instruction:\n\n1. `FETCH` - Fetch the next instruction at current program counter from program memory.\n2. `DECODE` - Decode the instruction into control signals.\n3. `REQUEST` - Request data from global memory if necessary (if `LDR` or `STR` instruction).\n4. `WAIT` - Wait for data from global memory if applicable.\n5. `EXECUTE` - Execute any computations on data.\n6. `UPDATE` - Update register files and NZP register.\n\nThe control flow is laid out like this for the sake of simplicity and understandability.\n\nIn practice, several of these steps could be compressed to optimize processing times, and the GPU could also use **pipelining** to stream and coordinate the execution of many instructions on a core's resources without waiting for previous instructions to finish.\n\n### Thread\n\n![Thread](/docs/images/thread.png)\n\nEach thread within each core follows the above execution path to perform computations on the data in its dedicated register file.\n\nThis resembles a standard CPU diagram, and is quite similar in functionality as well. 
The main difference is that the `%blockIdx`, `%blockDim`, and `%threadIdx` values lie in the read-only registers for each thread, enabling SIMD functionality.\n\n# Kernels\n\nI wrote a matrix addition and matrix multiplication kernel using my ISA as a proof of concept to demonstrate SIMD programming and execution with my GPU. The test files in this repository are capable of fully simulating the execution of these kernels on the GPU, producing data memory states and a complete execution trace.\n\n### Matrix Addition\n\nThis matrix addition kernel adds two 1 x 8 matrices by performing 8 element wise additions in separate threads.\n\nThis demonstration makes use of the `%blockIdx`, `%blockDim`, and `%threadIdx` registers to show SIMD programming on this GPU. It also uses the `LDR` and `STR` instructions which require async memory management.\n\n`matadd.asm`\n\n```asm\n.threads 8\n.data 0 1 2 3 4 5 6 7          ; matrix A (1 x 8)\n.data 0 1 2 3 4 5 6 7          ; matrix B (1 x 8)\n\nMUL R0, %blockIdx, %blockDim\nADD R0, R0, %threadIdx         ; i = blockIdx * blockDim + threadIdx\n\nCONST R1, #0                   ; baseA (matrix A base address)\nCONST R2, #8                   ; baseB (matrix B base address)\nCONST R3, #16                  ; baseC (matrix C base address)\n\nADD R4, R1, R0                 ; addr(A[i]) = baseA + i\nLDR R4, R4                     ; load A[i] from global memory\n\nADD R5, R2, R0                 ; addr(B[i]) = baseB + i\nLDR R5, R5                     ; load B[i] from global memory\n\nADD R6, R4, R5                 ; C[i] = A[i] + B[i]\n\nADD R7, R3, R0                 ; addr(C[i]) = baseC + i\nSTR R7, R6                     ; store C[i] in global memory\n\nRET                            ; end of kernel\n```\n\n### Matrix Multiplication\n\nThe matrix multiplication kernel multiplies two 2x2 matrices. 
It performs element-wise calculation of the dot product of the relevant row and column and uses the `CMP` and `BRnzp` instructions to demonstrate branching within the threads (notably, all branches converge so this kernel works on the current tiny-gpu implementation).\n\n`matmul.asm`\n\n```asm\n.threads 4\n.data 1 2 3 4                  ; matrix A (2 x 2)\n.data 1 2 3 4                  ; matrix B (2 x 2)\n\nMUL R0, %blockIdx, %blockDim\nADD R0, R0, %threadIdx         ; i = blockIdx * blockDim + threadIdx\n\nCONST R1, #1                   ; increment\nCONST R2, #2                   ; N (matrix inner dimension)\nCONST R3, #0                   ; baseA (matrix A base address)\nCONST R4, #4                   ; baseB (matrix B base address)\nCONST R5, #8                   ; baseC (matrix C base address)\n\nDIV R6, R0, R2                 ; row = i // N\nMUL R7, R6, R2\nSUB R7, R0, R7                 ; col = i % N\n\nCONST R8, #0                   ; acc = 0\nCONST R9, #0                   ; k = 0\n\nLOOP:\n  MUL R10, R6, R2\n  ADD R10, R10, R9\n  ADD R10, R10, R3             ; addr(A[i]) = row * N + k + baseA\n  LDR R10, R10                 ; load A[i] from global memory\n\n  MUL R11, R9, R2\n  ADD R11, R11, R7\n  ADD R11, R11, R4             ; addr(B[i]) = k * N + col + baseB\n  LDR R11, R11                 ; load B[i] from global memory\n\n  MUL R12, R10, R11\n  ADD R8, R8, R12              ; acc = acc + A[i] * B[i]\n\n  ADD R9, R9, R1               ; increment k\n\n  CMP R9, R2\n  BRn LOOP                    ; loop while k < N\n\nADD R9, R5, R0                 ; addr(C[i]) = baseC + i\nSTR R9, R8                     ; store C[i] in global memory\n\nRET                            ; end of kernel\n```\n\n# Simulation\n\ntiny-gpu is set up to simulate the execution of both of the above kernels. 
Before simulating, you'll need to install [iverilog](https://steveicarus.github.io/iverilog/usage/installation.html) and [cocotb](https://docs.cocotb.org/en/stable/install.html):\n\n- Install Verilog compilers with `brew install icarus-verilog` and `pip3 install cocotb`\n- Download the latest version of sv2v from https://github.com/zachjs/sv2v/releases, unzip it and put the binary in $PATH.\n- Run `mkdir build` in the root directory of this repository.\n\nOnce you've installed the prerequisites, you can run the kernel simulations with `make test_matadd` and `make test_matmul`.\n\nExecuting the simulations will output a log file in `test/logs` with the initial data memory state, complete execution trace of the kernel, and final data memory state.\n\nIf you look at the initial data memory state logged at the start of each log file, you should see the two input matrices for the calculation, and in the final data memory state at the end of the file you should also see the resultant matrix.\n\nBelow is a sample of the execution traces, showing on each cycle the execution of every thread within every core, including the current instruction, PC, register values, states, etc.\n\n![execution trace](docs/images/trace.png)\n\n**For anyone trying to run the simulation or play with this repo, please feel free to DM me on [twitter](https://twitter.com/majmudaradam) if you run into any issues - I want you to get this running!**\n\n# Advanced Functionality\n\nFor the sake of simplicity, tiny-gpu omits many additional features implemented in modern GPUs that heavily improve performance & functionality. We'll discuss some of the most critical of those features in this section.\n\n### Multi-layered Cache & Shared Memory\n\nIn modern GPUs, multiple different levels of caches are used to minimize the amount of data that needs to get accessed from global memory. 
tiny-gpu implements only one cache layer between individual compute units requesting memory and the memory controllers which stores recent cached data.\n\nImplementing multi-layered caches allows frequently accessed data to be cached more locally to where it's being used (with some caches within individual compute cores), minimizing load times for this data.\n\nDifferent caching algorithms are used to maximize cache-hits - this is a critical dimension that can be improved on to optimize memory access.\n\nAdditionally, GPUs often use **shared memory** for threads within the same block to access a single memory space that can be used to share results with other threads.\n\n### Memory Coalescing\n\nAnother critical memory optimization used by GPUs is **memory coalescing.** Multiple threads running in parallel often need to access sequential addresses in memory (for example, a group of threads accessing neighboring elements in a matrix) - but each of these memory requests is put in separately.\n\nMemory coalescing is used to analyze queued memory requests and combine neighboring requests into a single transaction, minimizing time spent on addressing, and making all the requests together.\n\n### Pipelining\n\nIn the control flow for tiny-gpu, cores wait for one instruction to be executed on a group of threads before starting execution of the next instruction.\n\nModern GPUs use **pipelining** to stream execution of multiple sequential instructions at once while ensuring that instructions with dependencies on each other still get executed sequentially.\n\nThis helps to maximize resource utilization within cores as resources are not sitting idle while waiting (ex: during async memory requests).\n\n### Warp Scheduling\n\nAnother strategy used to maximize resource utilization on cores is **warp scheduling.** This approach involves breaking up blocks into individual batches of threads that can be executed together.\n\nMultiple warps can be executed on a single core 
simultaneously by executing instructions from one warp while another warp is waiting. This is similar to pipelining, but dealing with instructions from different threads.\n\n### Branch Divergence\n\ntiny-gpu assumes that all threads in a single batch end up on the same PC after each instruction, meaning that threads can be executed in parallel for their entire lifetime.\n\nIn reality, individual threads could diverge from each other and branch to different lines based on their data. With different PCs, these threads would need to split into separate lines of execution, which requires managing diverging threads & paying attention to when threads converge again.\n\n### Synchronization & Barriers\n\nAnother core functionality of modern GPUs is the ability to set **barriers** so that groups of threads in a block can synchronize and wait until all other threads in the same block have gotten to a certain point before continuing execution.\n\nThis is useful for cases where threads need to exchange shared data with each other so they can ensure that the data has been fully processed.\n\n# Next Steps\n\nUpdates I want to make in the future to improve the design, anyone else is welcome to contribute as well:\n\n- [ ] Add a simple cache for instructions\n- [ ] Build an adapter to use GPU with Tiny Tapeout 7\n- [ ] Add basic branch divergence\n- [ ] Add basic memory coalescing\n- [ ] Add basic pipelining\n- [ ] Optimize control flow and use of registers to improve cycle time\n- [ ] Write a basic graphics kernel or add simple graphics hardware to demonstrate graphics functionality\n\n**For anyone curious to play around or make a contribution, feel free to put up a PR with any improvements you'd like to add 😄**\n"
  },
  {
    "path": "src/alu.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// ARITHMETIC-LOGIC UNIT\n// > Executes computations on register values\n// > In this minimal implementation, the ALU supports the 4 basic arithmetic operations\n// > Each thread in each core has its own ALU\n// > ADD, SUB, MUL, DIV instructions are all executed here\nmodule alu (\n    input wire clk,\n    input wire reset,\n    input wire enable, // If the current block has fewer threads than the block size, some ALUs will be inactive\n\n    input reg [2:0] core_state,\n\n    input reg [1:0] decoded_alu_arithmetic_mux,\n    input reg decoded_alu_output_mux,\n\n    input reg [7:0] rs,\n    input reg [7:0] rt,\n    output wire [7:0] alu_out\n);\n    localparam ADD = 2'b00,\n        SUB = 2'b01,\n        MUL = 2'b10,\n        DIV = 2'b11;\n\n    reg [7:0] alu_out_reg;\n    assign alu_out = alu_out_reg;\n\n    always @(posedge clk) begin \n        if (reset) begin \n            alu_out_reg <= 8'b0;\n        end else if (enable) begin\n            // Calculate alu_out when core_state = EXECUTE\n            if (core_state == 3'b101) begin \n                if (decoded_alu_output_mux == 1) begin \n                    // Set values to compare with NZP register in alu_out[2:0]\n                    alu_out_reg <= {5'b0, (rs - rt > 0), (rs - rt == 0), (rs - rt < 0)};\n                end else begin \n                    // Execute the specified arithmetic instruction\n                    case (decoded_alu_arithmetic_mux)\n                        ADD: begin \n                            alu_out_reg <= rs + rt;\n                        end\n                        SUB: begin \n                            alu_out_reg <= rs - rt;\n                        end\n                        MUL: begin \n                            alu_out_reg <= rs * rt;\n                        end\n                        DIV: begin \n                            alu_out_reg <= rs / rt;\n                        end\n                    endcase\n      
          end\n            end\n        end\n    end\nendmodule\n"
  },
  {
    "path": "src/controller.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// MEMORY CONTROLLER\n// > Receives memory requests from all cores\n// > Throttles requests based on limited external memory bandwidth\n// > Waits for responses from external memory and distributes them back to cores\nmodule controller #(\n    parameter ADDR_BITS = 8,\n    parameter DATA_BITS = 16,\n    parameter NUM_CONSUMERS = 4, // The number of consumers accessing memory through this controller\n    parameter NUM_CHANNELS = 1,  // The number of concurrent channels available to send requests to global memory\n    parameter WRITE_ENABLE = 1   // Whether this memory controller can write to memory (program memory is read-only)\n) (\n    input wire clk,\n    input wire reset,\n\n    // Consumer Interface (Fetchers / LSUs)\n    input reg [NUM_CONSUMERS-1:0] consumer_read_valid,\n    input reg [ADDR_BITS-1:0] consumer_read_address [NUM_CONSUMERS-1:0],\n    output reg [NUM_CONSUMERS-1:0] consumer_read_ready,\n    output reg [DATA_BITS-1:0] consumer_read_data [NUM_CONSUMERS-1:0],\n    input reg [NUM_CONSUMERS-1:0] consumer_write_valid,\n    input reg [ADDR_BITS-1:0] consumer_write_address [NUM_CONSUMERS-1:0],\n    input reg [DATA_BITS-1:0] consumer_write_data [NUM_CONSUMERS-1:0],\n    output reg [NUM_CONSUMERS-1:0] consumer_write_ready,\n\n    // Memory Interface (Data / Program)\n    output reg [NUM_CHANNELS-1:0] mem_read_valid,\n    output reg [ADDR_BITS-1:0] mem_read_address [NUM_CHANNELS-1:0],\n    input reg [NUM_CHANNELS-1:0] mem_read_ready,\n    input reg [DATA_BITS-1:0] mem_read_data [NUM_CHANNELS-1:0],\n    output reg [NUM_CHANNELS-1:0] mem_write_valid,\n    output reg [ADDR_BITS-1:0] mem_write_address [NUM_CHANNELS-1:0],\n    output reg [DATA_BITS-1:0] mem_write_data [NUM_CHANNELS-1:0],\n    input reg [NUM_CHANNELS-1:0] mem_write_ready\n);\n    localparam IDLE = 3'b000, \n        READ_WAITING = 3'b010, \n        WRITE_WAITING = 3'b011,\n        READ_RELAYING = 3'b100,\n        WRITE_RELAYING = 
3'b101;\n\n    // Keep track of state for each channel and which jobs each channel is handling\n    reg [2:0] controller_state [NUM_CHANNELS-1:0];\n    reg [$clog2(NUM_CONSUMERS)-1:0] current_consumer [NUM_CHANNELS-1:0]; // Which consumer is each channel currently serving\n    reg [NUM_CONSUMERS-1:0] channel_serving_consumer; // Which consumers are currently being served? Prevents multiple channels from picking up the same request.\n\n    always @(posedge clk) begin\n        if (reset) begin \n            mem_read_valid <= 0;\n            mem_read_address <= 0;\n\n            mem_write_valid <= 0;\n            mem_write_address <= 0;\n            mem_write_data <= 0;\n\n            consumer_read_ready <= 0;\n            consumer_read_data <= 0;\n            consumer_write_ready <= 0;\n\n            current_consumer <= 0;\n            controller_state <= 0;\n\n            channel_serving_consumer = 0;\n        end else begin \n            // For each channel, we handle processing concurrently\n            for (int i = 0; i < NUM_CHANNELS; i = i + 1) begin \n                case (controller_state[i])\n                    IDLE: begin\n                        // While this channel is idle, cycle through consumers looking for one with a pending request\n                        for (int j = 0; j < NUM_CONSUMERS; j = j + 1) begin \n                            if (consumer_read_valid[j] && !channel_serving_consumer[j]) begin \n                                channel_serving_consumer[j] = 1;\n                                current_consumer[i] <= j;\n\n                                mem_read_valid[i] <= 1;\n                                mem_read_address[i] <= consumer_read_address[j];\n                                controller_state[i] <= READ_WAITING;\n\n                                // Once we find a pending request, pick it up with this channel and stop looking for requests\n                                break;\n                            end else if (consumer_write_valid[j] && 
!channel_serving_consumer[j]) begin \n                                channel_serving_consumer[j] = 1;\n                                current_consumer[i] <= j;\n\n                                mem_write_valid[i] <= 1;\n                                mem_write_address[i] <= consumer_write_address[j];\n                                mem_write_data[i] <= consumer_write_data[j];\n                                controller_state[i] <= WRITE_WAITING;\n\n                                // Once we find a pending request, pick it up with this channel and stop looking for requests\n                                break;\n                            end\n                        end\n                    end\n                    READ_WAITING: begin\n                        // Wait for response from memory for pending read request\n                        if (mem_read_ready[i]) begin \n                            mem_read_valid[i] <= 0;\n                            consumer_read_ready[current_consumer[i]] <= 1;\n                            consumer_read_data[current_consumer[i]] <= mem_read_data[i];\n                            controller_state[i] <= READ_RELAYING;\n                        end\n                    end\n                    WRITE_WAITING: begin \n                        // Wait for response from memory for pending write request\n                        if (mem_write_ready[i]) begin \n                            mem_write_valid[i] <= 0;\n                            consumer_write_ready[current_consumer[i]] <= 1;\n                            controller_state[i] <= WRITE_RELAYING;\n                        end\n                    end\n                    // Wait until consumer acknowledges it received response, then reset\n                    READ_RELAYING: begin\n                        if (!consumer_read_valid[current_consumer[i]]) begin \n                            channel_serving_consumer[current_consumer[i]] = 0;\n                            
consumer_read_ready[current_consumer[i]] <= 0;\n                            controller_state[i] <= IDLE;\n                        end\n                    end\n                    WRITE_RELAYING: begin \n                        if (!consumer_write_valid[current_consumer[i]]) begin \n                            channel_serving_consumer[current_consumer[i]] = 0;\n                            consumer_write_ready[current_consumer[i]] <= 0;\n                            controller_state[i] <= IDLE;\n                        end\n                    end\n                endcase\n            end\n        end\n    end\nendmodule\n"
  },
  {
    "path": "src/core.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// COMPUTE CORE\n// > Handles processing 1 block at a time\n// > The core also has it's own scheduler to manage control flow\n// > Each core contains 1 fetcher & decoder, and register files, ALUs, LSUs, PC for each thread\nmodule core #(\n    parameter DATA_MEM_ADDR_BITS = 8,\n    parameter DATA_MEM_DATA_BITS = 8,\n    parameter PROGRAM_MEM_ADDR_BITS = 8,\n    parameter PROGRAM_MEM_DATA_BITS = 16,\n    parameter THREADS_PER_BLOCK = 4\n) (\n    input wire clk,\n    input wire reset,\n\n    // Kernel Execution\n    input wire start,\n    output wire done,\n\n    // Block Metadata\n    input wire [7:0] block_id,\n    input wire [$clog2(THREADS_PER_BLOCK):0] thread_count,\n\n    // Program Memory\n    output reg program_mem_read_valid,\n    output reg [PROGRAM_MEM_ADDR_BITS-1:0] program_mem_read_address,\n    input reg program_mem_read_ready,\n    input reg [PROGRAM_MEM_DATA_BITS-1:0] program_mem_read_data,\n\n    // Data Memory\n    output reg [THREADS_PER_BLOCK-1:0] data_mem_read_valid,\n    output reg [DATA_MEM_ADDR_BITS-1:0] data_mem_read_address [THREADS_PER_BLOCK-1:0],\n    input reg [THREADS_PER_BLOCK-1:0] data_mem_read_ready,\n    input reg [DATA_MEM_DATA_BITS-1:0] data_mem_read_data [THREADS_PER_BLOCK-1:0],\n    output reg [THREADS_PER_BLOCK-1:0] data_mem_write_valid,\n    output reg [DATA_MEM_ADDR_BITS-1:0] data_mem_write_address [THREADS_PER_BLOCK-1:0],\n    output reg [DATA_MEM_DATA_BITS-1:0] data_mem_write_data [THREADS_PER_BLOCK-1:0],\n    input reg [THREADS_PER_BLOCK-1:0] data_mem_write_ready\n);\n    // State\n    reg [2:0] core_state;\n    reg [2:0] fetcher_state;\n    reg [15:0] instruction;\n\n    // Intermediate Signals\n    reg [7:0] current_pc;\n    wire [7:0] next_pc[THREADS_PER_BLOCK-1:0];\n    reg [7:0] rs[THREADS_PER_BLOCK-1:0];\n    reg [7:0] rt[THREADS_PER_BLOCK-1:0];\n    reg [1:0] lsu_state[THREADS_PER_BLOCK-1:0];\n    reg [7:0] lsu_out[THREADS_PER_BLOCK-1:0];\n    wire [7:0] 
alu_out[THREADS_PER_BLOCK-1:0];\n    \n    // Decoded Instruction Signals\n    reg [3:0] decoded_rd_address;\n    reg [3:0] decoded_rs_address;\n    reg [3:0] decoded_rt_address;\n    reg [2:0] decoded_nzp;\n    reg [7:0] decoded_immediate;\n\n    // Decoded Control Signals\n    reg decoded_reg_write_enable;           // Enable writing to a register\n    reg decoded_mem_read_enable;            // Enable reading from memory\n    reg decoded_mem_write_enable;           // Enable writing to memory\n    reg decoded_nzp_write_enable;           // Enable writing to NZP register\n    reg [1:0] decoded_reg_input_mux;        // Select input to register\n    reg [1:0] decoded_alu_arithmetic_mux;   // Select arithmetic operation\n    reg decoded_alu_output_mux;             // Select operation in ALU\n    reg decoded_pc_mux;                     // Select source of next PC\n    reg decoded_ret;\n\n    // Fetcher\n    fetcher #(\n        .PROGRAM_MEM_ADDR_BITS(PROGRAM_MEM_ADDR_BITS),\n        .PROGRAM_MEM_DATA_BITS(PROGRAM_MEM_DATA_BITS)\n    ) fetcher_instance (\n        .clk(clk),\n        .reset(reset),\n        .core_state(core_state),\n        .current_pc(current_pc),\n        .mem_read_valid(program_mem_read_valid),\n        .mem_read_address(program_mem_read_address),\n        .mem_read_ready(program_mem_read_ready),\n        .mem_read_data(program_mem_read_data),\n        .fetcher_state(fetcher_state),\n        .instruction(instruction) \n    );\n\n    // Decoder\n    decoder decoder_instance (\n        .clk(clk),\n        .reset(reset),\n        .core_state(core_state),\n        .instruction(instruction),\n        .decoded_rd_address(decoded_rd_address),\n        .decoded_rs_address(decoded_rs_address),\n        .decoded_rt_address(decoded_rt_address),\n        .decoded_nzp(decoded_nzp),\n        .decoded_immediate(decoded_immediate),\n        .decoded_reg_write_enable(decoded_reg_write_enable),\n        .decoded_mem_read_enable(decoded_mem_read_enable),\n        
.decoded_mem_write_enable(decoded_mem_write_enable),\n        .decoded_nzp_write_enable(decoded_nzp_write_enable),\n        .decoded_reg_input_mux(decoded_reg_input_mux),\n        .decoded_alu_arithmetic_mux(decoded_alu_arithmetic_mux),\n        .decoded_alu_output_mux(decoded_alu_output_mux),\n        .decoded_pc_mux(decoded_pc_mux),\n        .decoded_ret(decoded_ret)\n    );\n\n    // Scheduler\n    scheduler #(\n        .THREADS_PER_BLOCK(THREADS_PER_BLOCK)\n    ) scheduler_instance (\n        .clk(clk),\n        .reset(reset),\n        .start(start),\n        .fetcher_state(fetcher_state),\n        .core_state(core_state),\n        .decoded_mem_read_enable(decoded_mem_read_enable),\n        .decoded_mem_write_enable(decoded_mem_write_enable),\n        .decoded_ret(decoded_ret),\n        .lsu_state(lsu_state),\n        .current_pc(current_pc),\n        .next_pc(next_pc),\n        .done(done)\n    );\n\n    // Dedicated ALU, LSU, registers, & PC unit for each thread this core has capacity for\n    genvar i;\n    generate\n        for (i = 0; i < THREADS_PER_BLOCK; i = i + 1) begin : threads\n            // ALU\n            alu alu_instance (\n                .clk(clk),\n                .reset(reset),\n                .enable(i < thread_count),\n                .core_state(core_state),\n                .decoded_alu_arithmetic_mux(decoded_alu_arithmetic_mux),\n                .decoded_alu_output_mux(decoded_alu_output_mux),\n                .rs(rs[i]),\n                .rt(rt[i]),\n                .alu_out(alu_out[i])\n            );\n\n            // LSU\n            lsu lsu_instance (\n                .clk(clk),\n                .reset(reset),\n                .enable(i < thread_count),\n                .core_state(core_state),\n                .decoded_mem_read_enable(decoded_mem_read_enable),\n                .decoded_mem_write_enable(decoded_mem_write_enable),\n                .mem_read_valid(data_mem_read_valid[i]),\n                
.mem_read_address(data_mem_read_address[i]),\n                .mem_read_ready(data_mem_read_ready[i]),\n                .mem_read_data(data_mem_read_data[i]),\n                .mem_write_valid(data_mem_write_valid[i]),\n                .mem_write_address(data_mem_write_address[i]),\n                .mem_write_data(data_mem_write_data[i]),\n                .mem_write_ready(data_mem_write_ready[i]),\n                .rs(rs[i]),\n                .rt(rt[i]),\n                .lsu_state(lsu_state[i]),\n                .lsu_out(lsu_out[i])\n            );\n\n            // Register File\n            registers #(\n                .THREADS_PER_BLOCK(THREADS_PER_BLOCK),\n                .THREAD_ID(i),\n                .DATA_BITS(DATA_MEM_DATA_BITS)\n            ) register_instance (\n                .clk(clk),\n                .reset(reset),\n                .enable(i < thread_count),\n                .block_id(block_id),\n                .core_state(core_state),\n                .decoded_reg_write_enable(decoded_reg_write_enable),\n                .decoded_reg_input_mux(decoded_reg_input_mux),\n                .decoded_rd_address(decoded_rd_address),\n                .decoded_rs_address(decoded_rs_address),\n                .decoded_rt_address(decoded_rt_address),\n                .decoded_immediate(decoded_immediate),\n                .alu_out(alu_out[i]),\n                .lsu_out(lsu_out[i]),\n                .rs(rs[i]),\n                .rt(rt[i])\n            );\n\n            // Program Counter\n            pc #(\n                .DATA_MEM_DATA_BITS(DATA_MEM_DATA_BITS),\n                .PROGRAM_MEM_ADDR_BITS(PROGRAM_MEM_ADDR_BITS)\n            ) pc_instance (\n                .clk(clk),\n                .reset(reset),\n                .enable(i < thread_count),\n                .core_state(core_state),\n                .decoded_nzp(decoded_nzp),\n                .decoded_immediate(decoded_immediate),\n                
.decoded_nzp_write_enable(decoded_nzp_write_enable),\n                .decoded_pc_mux(decoded_pc_mux),\n                .alu_out(alu_out[i]),\n                .current_pc(current_pc),\n                .next_pc(next_pc[i])\n            );\n        end\n    endgenerate\nendmodule\n"
  },
  {
    "path": "src/dcr.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// DEVICE CONTROL REGISTER\n// > Used to configure high-level settings\n// > In this minimal example, the DCR is used to configure the number of threads to run for the kernel\nmodule dcr (\n    input wire clk,\n    input wire reset,\n\n    input wire device_control_write_enable,\n    input wire [7:0] device_control_data,\n    output wire [7:0] thread_count,\n);\n    // Store device control data in dedicated register\n    reg [7:0] device_conrol_register;\n    assign thread_count = device_conrol_register[7:0];\n\n    always @(posedge clk) begin\n        if (reset) begin\n            device_conrol_register <= 8'b0;\n        end else begin\n            if (device_control_write_enable) begin \n                device_conrol_register <= device_control_data;\n            end\n        end\n    end\nendmodule"
  },
  {
    "path": "src/decoder.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// INSTRUCTION DECODER\n// > Decodes an instruction into the control signals necessary to execute it\n// > Each core has it's own decoder\nmodule decoder (\n    input wire clk,\n    input wire reset,\n\n    input reg [2:0] core_state,\n    input reg [15:0] instruction,\n    \n    // Instruction Signals\n    output reg [3:0] decoded_rd_address,\n    output reg [3:0] decoded_rs_address,\n    output reg [3:0] decoded_rt_address,\n    output reg [2:0] decoded_nzp,\n    output reg [7:0] decoded_immediate,\n    \n    // Control Signals\n    output reg decoded_reg_write_enable,           // Enable writing to a register\n    output reg decoded_mem_read_enable,            // Enable reading from memory\n    output reg decoded_mem_write_enable,           // Enable writing to memory\n    output reg decoded_nzp_write_enable,           // Enable writing to NZP register\n    output reg [1:0] decoded_reg_input_mux,        // Select input to register\n    output reg [1:0] decoded_alu_arithmetic_mux,   // Select arithmetic operation\n    output reg decoded_alu_output_mux,             // Select operation in ALU\n    output reg decoded_pc_mux,                     // Select source of next PC\n\n    // Return (finished executing thread)\n    output reg decoded_ret\n);\n    localparam NOP = 4'b0000,\n        BRnzp = 4'b0001,\n        CMP = 4'b0010,\n        ADD = 4'b0011,\n        SUB = 4'b0100,\n        MUL = 4'b0101,\n        DIV = 4'b0110,\n        LDR = 4'b0111,\n        STR = 4'b1000,\n        CONST = 4'b1001,\n        RET = 4'b1111;\n\n    always @(posedge clk) begin \n        if (reset) begin \n            decoded_rd_address <= 0;\n            decoded_rs_address <= 0;\n            decoded_rt_address <= 0;\n            decoded_immediate <= 0;\n            decoded_nzp <= 0;\n            decoded_reg_write_enable <= 0;\n            decoded_mem_read_enable <= 0;\n            decoded_mem_write_enable <= 0;\n            
decoded_nzp_write_enable <= 0;\n            decoded_reg_input_mux <= 0;\n            decoded_alu_arithmetic_mux <= 0;\n            decoded_alu_output_mux <= 0;\n            decoded_pc_mux <= 0;\n            decoded_ret <= 0;\n        end else begin \n            // Decode when core_state = DECODE\n            if (core_state == 3'b010) begin \n                // Get instruction signals from instruction every time\n                decoded_rd_address <= instruction[11:8];\n                decoded_rs_address <= instruction[7:4];\n                decoded_rt_address <= instruction[3:0];\n                decoded_immediate <= instruction[7:0];\n                decoded_nzp <= instruction[11:9];\n\n                // Control signals reset on every decode and set conditionally by instruction\n                decoded_reg_write_enable <= 0;\n                decoded_mem_read_enable <= 0;\n                decoded_mem_write_enable <= 0;\n                decoded_nzp_write_enable <= 0;\n                decoded_reg_input_mux <= 0;\n                decoded_alu_arithmetic_mux <= 0;\n                decoded_alu_output_mux <= 0;\n                decoded_pc_mux <= 0;\n                decoded_ret <= 0;\n\n                // Set the control signals for each instruction\n                case (instruction[15:12])\n                    NOP: begin \n                        // no-op\n                    end\n                    BRnzp: begin \n                        decoded_pc_mux <= 1;\n                    end\n                    CMP: begin \n                        decoded_alu_output_mux <= 1;\n                        decoded_nzp_write_enable <= 1;\n                    end\n                    ADD: begin \n                        decoded_reg_write_enable <= 1;\n                        decoded_reg_input_mux <= 2'b00;\n                        decoded_alu_arithmetic_mux <= 2'b00;\n                    end\n                    SUB: begin \n                        decoded_reg_write_enable <= 1;\n    
                    decoded_reg_input_mux <= 2'b00;\n                        decoded_alu_arithmetic_mux <= 2'b01;\n                    end\n                    MUL: begin \n                        decoded_reg_write_enable <= 1;\n                        decoded_reg_input_mux <= 2'b00;\n                        decoded_alu_arithmetic_mux <= 2'b10;\n                    end\n                    DIV: begin \n                        decoded_reg_write_enable <= 1;\n                        decoded_reg_input_mux <= 2'b00;\n                        decoded_alu_arithmetic_mux <= 2'b11;\n                    end\n                    LDR: begin \n                        decoded_reg_write_enable <= 1;\n                        decoded_reg_input_mux <= 2'b01;\n                        decoded_mem_read_enable <= 1;\n                    end\n                    STR: begin \n                        decoded_mem_write_enable <= 1;\n                    end\n                    CONST: begin \n                        decoded_reg_write_enable <= 1;\n                        decoded_reg_input_mux <= 2'b10;\n                    end\n                    RET: begin \n                        decoded_ret <= 1;\n                    end\n                endcase\n            end\n        end\n    end\nendmodule\n"
  },
  {
    "path": "src/dispatch.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// BLOCK DISPATCH\n// > The GPU has one dispatch unit at the top level\n// > Manages processing of threads and marks kernel execution as done\n// > Sends off batches of threads in blocks to be executed by available compute cores\nmodule dispatch #(\n    parameter NUM_CORES = 2,\n    parameter THREADS_PER_BLOCK = 4\n) (\n    input wire clk,\n    input wire reset,\n    input wire start,\n\n    // Kernel Metadata\n    input wire [7:0] thread_count,\n\n    // Core States\n    input reg [NUM_CORES-1:0] core_done,\n    output reg [NUM_CORES-1:0] core_start,\n    output reg [NUM_CORES-1:0] core_reset,\n    output reg [7:0] core_block_id [NUM_CORES-1:0],\n    output reg [$clog2(THREADS_PER_BLOCK):0] core_thread_count [NUM_CORES-1:0],\n\n    // Kernel Execution\n    output reg done\n);\n    // Calculate the total number of blocks based on total threads & threads per block\n    wire [7:0] total_blocks;\n    assign total_blocks = (thread_count + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;\n\n    // Keep track of how many blocks have been processed\n    reg [7:0] blocks_dispatched; // How many blocks have been sent to cores?\n    reg [7:0] blocks_done; // How many blocks have finished processing?\n    reg start_execution; // EDA: Unimportant hack used because of EDA tooling\n\n    always @(posedge clk) begin\n        if (reset) begin\n            done <= 0;\n            blocks_dispatched = 0;\n            blocks_done = 0;\n            start_execution <= 0;\n\n            for (int i = 0; i < NUM_CORES; i++) begin\n                core_start[i] <= 0;\n                core_reset[i] <= 1;\n                core_block_id[i] <= 0;\n                core_thread_count[i] <= THREADS_PER_BLOCK;\n            end\n        end else if (start) begin    \n            // EDA: Indirect way to get @(posedge start) without driving from 2 different clocks\n            if (!start_execution) begin \n                start_execution <= 1;\n   
             for (int i = 0; i < NUM_CORES; i++) begin\n                    core_reset[i] <= 1;\n                end\n            end\n\n            // If the last block has finished processing, mark this kernel as done executing\n            if (blocks_done == total_blocks) begin \n                done <= 1;\n            end\n\n            for (int i = 0; i < NUM_CORES; i++) begin\n                if (core_reset[i]) begin \n                    core_reset[i] <= 0;\n\n                    // If this core was just reset, check if there are more blocks to be dispatched\n                    if (blocks_dispatched < total_blocks) begin \n                        core_start[i] <= 1;\n                        core_block_id[i] <= blocks_dispatched;\n                        core_thread_count[i] <= (blocks_dispatched == total_blocks - 1) \n                            ? thread_count - (blocks_dispatched * THREADS_PER_BLOCK)\n                            : THREADS_PER_BLOCK;\n\n                        blocks_dispatched = blocks_dispatched + 1;\n                    end\n                end\n            end\n\n            for (int i = 0; i < NUM_CORES; i++) begin\n                if (core_start[i] && core_done[i]) begin\n                    // If a core just finished executing its current block, reset it\n                    core_reset[i] <= 1;\n                    core_start[i] <= 0;\n                    blocks_done = blocks_done + 1;\n                end\n            end\n        end\n    end\nendmodule"
  },
  {
    "path": "src/fetcher.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// INSTRUCTION FETCHER\n// > Retrieves the instruction at the current PC from global data memory\n// > Each core has it's own fetcher\nmodule fetcher #(\n    parameter PROGRAM_MEM_ADDR_BITS = 8,\n    parameter PROGRAM_MEM_DATA_BITS = 16\n) (\n    input wire clk,\n    input wire reset,\n    \n    // Execution State\n    input reg [2:0] core_state,\n    input reg [7:0] current_pc,\n\n    // Program Memory\n    output reg mem_read_valid,\n    output reg [PROGRAM_MEM_ADDR_BITS-1:0] mem_read_address,\n    input reg mem_read_ready,\n    input reg [PROGRAM_MEM_DATA_BITS-1:0] mem_read_data,\n\n    // Fetcher Output\n    output reg [2:0] fetcher_state,\n    output reg [PROGRAM_MEM_DATA_BITS-1:0] instruction,\n);\n    localparam IDLE = 3'b000, \n        FETCHING = 3'b001, \n        FETCHED = 3'b010;\n    \n    always @(posedge clk) begin\n        if (reset) begin\n            fetcher_state <= IDLE;\n            mem_read_valid <= 0;\n            mem_read_address <= 0;\n            instruction <= {PROGRAM_MEM_DATA_BITS{1'b0}};\n        end else begin\n            case (fetcher_state)\n                IDLE: begin\n                    // Start fetching when core_state = FETCH\n                    if (core_state == 3'b001) begin\n                        fetcher_state <= FETCHING;\n                        mem_read_valid <= 1;\n                        mem_read_address <= current_pc;\n                    end\n                end\n                FETCHING: begin\n                    // Wait for response from program memory\n                    if (mem_read_ready) begin\n                        fetcher_state <= FETCHED;\n                        instruction <= mem_read_data; // Store the instruction when received\n                        mem_read_valid <= 0;\n                    end\n                end\n                FETCHED: begin\n                    // Reset when core_state = DECODE\n                    if (core_state 
== 3'b010) begin \n                        fetcher_state <= IDLE;\n                    end\n                end\n            endcase\n        end\n    end\nendmodule\n"
  },
  {
    "path": "src/gpu.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// GPU\n// > Built to use an external async memory with multi-channel read/write\n// > Assumes that the program is loaded into program memory, data into data memory, and threads into\n//   the device control register before the start signal is triggered\n// > Has memory controllers to interface between external memory and its multiple cores\n// > Configurable number of cores and thread capacity per core\nmodule gpu #(\n    parameter DATA_MEM_ADDR_BITS = 8,        // Number of bits in data memory address (256 rows)\n    parameter DATA_MEM_DATA_BITS = 8,        // Number of bits in data memory value (8 bit data)\n    parameter DATA_MEM_NUM_CHANNELS = 4,     // Number of concurrent channels for sending requests to data memory\n    parameter PROGRAM_MEM_ADDR_BITS = 8,     // Number of bits in program memory address (256 rows)\n    parameter PROGRAM_MEM_DATA_BITS = 16,    // Number of bits in program memory value (16 bit instruction)\n    parameter PROGRAM_MEM_NUM_CHANNELS = 1,  // Number of concurrent channels for sending requests to program memory\n    parameter NUM_CORES = 2,                 // Number of cores to include in this GPU\n    parameter THREADS_PER_BLOCK = 4          // Number of threads to handle per block (determines the compute resources of each core)\n) (\n    input wire clk,\n    input wire reset,\n\n    // Kernel Execution\n    input wire start,\n    output wire done,\n\n    // Device Control Register\n    input wire device_control_write_enable,\n    input wire [7:0] device_control_data,\n\n    // Program Memory\n    output wire [PROGRAM_MEM_NUM_CHANNELS-1:0] program_mem_read_valid,\n    output wire [PROGRAM_MEM_ADDR_BITS-1:0] program_mem_read_address [PROGRAM_MEM_NUM_CHANNELS-1:0],\n    input wire [PROGRAM_MEM_NUM_CHANNELS-1:0] program_mem_read_ready,\n    input wire [PROGRAM_MEM_DATA_BITS-1:0] program_mem_read_data [PROGRAM_MEM_NUM_CHANNELS-1:0],\n\n    // Data Memory\n    output wire 
[DATA_MEM_NUM_CHANNELS-1:0] data_mem_read_valid,\n    output wire [DATA_MEM_ADDR_BITS-1:0] data_mem_read_address [DATA_MEM_NUM_CHANNELS-1:0],\n    input wire [DATA_MEM_NUM_CHANNELS-1:0] data_mem_read_ready,\n    input wire [DATA_MEM_DATA_BITS-1:0] data_mem_read_data [DATA_MEM_NUM_CHANNELS-1:0],\n    output wire [DATA_MEM_NUM_CHANNELS-1:0] data_mem_write_valid,\n    output wire [DATA_MEM_ADDR_BITS-1:0] data_mem_write_address [DATA_MEM_NUM_CHANNELS-1:0],\n    output wire [DATA_MEM_DATA_BITS-1:0] data_mem_write_data [DATA_MEM_NUM_CHANNELS-1:0],\n    input wire [DATA_MEM_NUM_CHANNELS-1:0] data_mem_write_ready\n);\n    // Control\n    wire [7:0] thread_count;\n\n    // Compute Core State\n    reg [NUM_CORES-1:0] core_start;\n    reg [NUM_CORES-1:0] core_reset;\n    reg [NUM_CORES-1:0] core_done;\n    reg [7:0] core_block_id [NUM_CORES-1:0];\n    reg [$clog2(THREADS_PER_BLOCK):0] core_thread_count [NUM_CORES-1:0];\n\n    // LSU <> Data Memory Controller Channels\n    localparam NUM_LSUS = NUM_CORES * THREADS_PER_BLOCK;\n    reg [NUM_LSUS-1:0] lsu_read_valid;\n    reg [DATA_MEM_ADDR_BITS-1:0] lsu_read_address [NUM_LSUS-1:0];\n    reg [NUM_LSUS-1:0] lsu_read_ready;\n    reg [DATA_MEM_DATA_BITS-1:0] lsu_read_data [NUM_LSUS-1:0];\n    reg [NUM_LSUS-1:0] lsu_write_valid;\n    reg [DATA_MEM_ADDR_BITS-1:0] lsu_write_address [NUM_LSUS-1:0];\n    reg [DATA_MEM_DATA_BITS-1:0] lsu_write_data [NUM_LSUS-1:0];\n    reg [NUM_LSUS-1:0] lsu_write_ready;\n\n    // Fetcher <> Program Memory Controller Channels\n    localparam NUM_FETCHERS = NUM_CORES;\n    reg [NUM_FETCHERS-1:0] fetcher_read_valid;\n    reg [PROGRAM_MEM_ADDR_BITS-1:0] fetcher_read_address [NUM_FETCHERS-1:0];\n    reg [NUM_FETCHERS-1:0] fetcher_read_ready;\n    reg [PROGRAM_MEM_DATA_BITS-1:0] fetcher_read_data [NUM_FETCHERS-1:0];\n    \n    // Device Control Register\n    dcr dcr_instance (\n        .clk(clk),\n        .reset(reset),\n\n        .device_control_write_enable(device_control_write_enable),\n        
.device_control_data(device_control_data),\n        .thread_count(thread_count)\n    );\n\n    // Data Memory Controller\n    controller #(\n        .ADDR_BITS(DATA_MEM_ADDR_BITS),\n        .DATA_BITS(DATA_MEM_DATA_BITS),\n        .NUM_CONSUMERS(NUM_LSUS),\n        .NUM_CHANNELS(DATA_MEM_NUM_CHANNELS)\n    ) data_memory_controller (\n        .clk(clk),\n        .reset(reset),\n\n        .consumer_read_valid(lsu_read_valid),\n        .consumer_read_address(lsu_read_address),\n        .consumer_read_ready(lsu_read_ready),\n        .consumer_read_data(lsu_read_data),\n        .consumer_write_valid(lsu_write_valid),\n        .consumer_write_address(lsu_write_address),\n        .consumer_write_data(lsu_write_data),\n        .consumer_write_ready(lsu_write_ready),\n\n        .mem_read_valid(data_mem_read_valid),\n        .mem_read_address(data_mem_read_address),\n        .mem_read_ready(data_mem_read_ready),\n        .mem_read_data(data_mem_read_data),\n        .mem_write_valid(data_mem_write_valid),\n        .mem_write_address(data_mem_write_address),\n        .mem_write_data(data_mem_write_data),\n        .mem_write_ready(data_mem_write_ready)\n    );\n\n    // Program Memory Controller\n    controller #(\n        .ADDR_BITS(PROGRAM_MEM_ADDR_BITS),\n        .DATA_BITS(PROGRAM_MEM_DATA_BITS),\n        .NUM_CONSUMERS(NUM_FETCHERS),\n        .NUM_CHANNELS(PROGRAM_MEM_NUM_CHANNELS),\n        .WRITE_ENABLE(0)\n    ) program_memory_controller (\n        .clk(clk),\n        .reset(reset),\n\n        .consumer_read_valid(fetcher_read_valid),\n        .consumer_read_address(fetcher_read_address),\n        .consumer_read_ready(fetcher_read_ready),\n        .consumer_read_data(fetcher_read_data),\n\n        .mem_read_valid(program_mem_read_valid),\n        .mem_read_address(program_mem_read_address),\n        .mem_read_ready(program_mem_read_ready),\n        .mem_read_data(program_mem_read_data)\n    );\n\n    // Dispatcher\n    dispatch #(\n        .NUM_CORES(NUM_CORES),\n      
  .THREADS_PER_BLOCK(THREADS_PER_BLOCK)\n    ) dispatch_instance (\n        .clk(clk),\n        .reset(reset),\n        .start(start),\n        .thread_count(thread_count),\n        .core_done(core_done),\n        .core_start(core_start),\n        .core_reset(core_reset),\n        .core_block_id(core_block_id),\n        .core_thread_count(core_thread_count),\n        .done(done)\n    );\n\n    // Compute Cores\n    genvar i;\n    generate\n        for (i = 0; i < NUM_CORES; i = i + 1) begin : cores\n            // EDA: We create separate signals here to pass to cores because of a requirement\n            // by the OpenLane EDA flow (uses Verilog 2005) that prevents slicing the top-level signals\n            reg [THREADS_PER_BLOCK-1:0] core_lsu_read_valid;\n            reg [DATA_MEM_ADDR_BITS-1:0] core_lsu_read_address [THREADS_PER_BLOCK-1:0];\n            reg [THREADS_PER_BLOCK-1:0] core_lsu_read_ready;\n            reg [DATA_MEM_DATA_BITS-1:0] core_lsu_read_data [THREADS_PER_BLOCK-1:0];\n            reg [THREADS_PER_BLOCK-1:0] core_lsu_write_valid;\n            reg [DATA_MEM_ADDR_BITS-1:0] core_lsu_write_address [THREADS_PER_BLOCK-1:0];\n            reg [DATA_MEM_DATA_BITS-1:0] core_lsu_write_data [THREADS_PER_BLOCK-1:0];\n            reg [THREADS_PER_BLOCK-1:0] core_lsu_write_ready;\n\n            // Pass through signals between LSUs and data memory controller\n            genvar j;\n            for (j = 0; j < THREADS_PER_BLOCK; j = j + 1) begin\n                localparam lsu_index = i * THREADS_PER_BLOCK + j;\n                always @(posedge clk) begin \n                    lsu_read_valid[lsu_index] <= core_lsu_read_valid[j];\n                    lsu_read_address[lsu_index] <= core_lsu_read_address[j];\n\n                    lsu_write_valid[lsu_index] <= core_lsu_write_valid[j];\n                    lsu_write_address[lsu_index] <= core_lsu_write_address[j];\n                    lsu_write_data[lsu_index] <= core_lsu_write_data[j];\n                    \n       
             core_lsu_read_ready[j] <= lsu_read_ready[lsu_index];\n                    core_lsu_read_data[j] <= lsu_read_data[lsu_index];\n                    core_lsu_write_ready[j] <= lsu_write_ready[lsu_index];\n                end\n            end\n\n            // Compute Core\n            core #(\n                .DATA_MEM_ADDR_BITS(DATA_MEM_ADDR_BITS),\n                .DATA_MEM_DATA_BITS(DATA_MEM_DATA_BITS),\n                .PROGRAM_MEM_ADDR_BITS(PROGRAM_MEM_ADDR_BITS),\n                .PROGRAM_MEM_DATA_BITS(PROGRAM_MEM_DATA_BITS),\n                .THREADS_PER_BLOCK(THREADS_PER_BLOCK)\n            ) core_instance (\n                .clk(clk),\n                .reset(core_reset[i]),\n                .start(core_start[i]),\n                .done(core_done[i]),\n                .block_id(core_block_id[i]),\n                .thread_count(core_thread_count[i]),\n                \n                .program_mem_read_valid(fetcher_read_valid[i]),\n                .program_mem_read_address(fetcher_read_address[i]),\n                .program_mem_read_ready(fetcher_read_ready[i]),\n                .program_mem_read_data(fetcher_read_data[i]),\n\n                .data_mem_read_valid(core_lsu_read_valid),\n                .data_mem_read_address(core_lsu_read_address),\n                .data_mem_read_ready(core_lsu_read_ready),\n                .data_mem_read_data(core_lsu_read_data),\n                .data_mem_write_valid(core_lsu_write_valid),\n                .data_mem_write_address(core_lsu_write_address),\n                .data_mem_write_data(core_lsu_write_data),\n                .data_mem_write_ready(core_lsu_write_ready)\n            );\n        end\n    endgenerate\nendmodule\n"
  },
  {
    "path": "src/lsu.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// LOAD-STORE UNIT\n// > Handles asynchronous memory load and store operations and waits for response\n// > Each thread in each core has it's own LSU\n// > LDR, STR instructions are executed here\nmodule lsu (\n    input wire clk,\n    input wire reset,\n    input wire enable, // If current block has less threads then block size, some LSUs will be inactive\n\n    // State\n    input reg [2:0] core_state,\n\n    // Memory Control Sgiansl\n    input reg decoded_mem_read_enable,\n    input reg decoded_mem_write_enable,\n\n    // Registers\n    input reg [7:0] rs,\n    input reg [7:0] rt,\n\n    // Data Memory\n    output reg mem_read_valid,\n    output reg [7:0] mem_read_address,\n    input reg mem_read_ready,\n    input reg [7:0] mem_read_data,\n    output reg mem_write_valid,\n    output reg [7:0] mem_write_address,\n    output reg [7:0] mem_write_data,\n    input reg mem_write_ready,\n\n    // LSU Outputs\n    output reg [1:0] lsu_state,\n    output reg [7:0] lsu_out\n);\n    localparam IDLE = 2'b00, REQUESTING = 2'b01, WAITING = 2'b10, DONE = 2'b11;\n\n    always @(posedge clk) begin\n        if (reset) begin\n            lsu_state <= IDLE;\n            lsu_out <= 0;\n            mem_read_valid <= 0;\n            mem_read_address <= 0;\n            mem_write_valid <= 0;\n            mem_write_address <= 0;\n            mem_write_data <= 0;\n        end else if (enable) begin\n            // If memory read enable is triggered (LDR instruction)\n            if (decoded_mem_read_enable) begin \n                case (lsu_state)\n                    IDLE: begin\n                        // Only read when core_state = REQUEST\n                        if (core_state == 3'b011) begin \n                            lsu_state <= REQUESTING;\n                        end\n                    end\n                    REQUESTING: begin \n                        mem_read_valid <= 1;\n                        
mem_read_address <= rs;\n                        lsu_state <= WAITING;\n                    end\n                    WAITING: begin\n                        if (mem_read_ready == 1) begin\n                            mem_read_valid <= 0;\n                            lsu_out <= mem_read_data;\n                            lsu_state <= DONE;\n                        end\n                    end\n                    DONE: begin \n                        // Reset when core_state = UPDATE\n                        if (core_state == 3'b110) begin \n                            lsu_state <= IDLE;\n                        end\n                    end\n                endcase\n            end\n\n            // If memory write enable is triggered (STR instruction)\n            if (decoded_mem_write_enable) begin \n                case (lsu_state)\n                    IDLE: begin\n                        // Only write when core_state = REQUEST\n                        if (core_state == 3'b011) begin \n                            lsu_state <= REQUESTING;\n                        end\n                    end\n                    REQUESTING: begin \n                        mem_write_valid <= 1;\n                        mem_write_address <= rs;\n                        mem_write_data <= rt;\n                        lsu_state <= WAITING;\n                    end\n                    WAITING: begin\n                        if (mem_write_ready) begin\n                            mem_write_valid <= 0;\n                            lsu_state <= DONE;\n                        end\n                    end\n                    DONE: begin \n                        // Reset when core_state = UPDATE\n                        if (core_state == 3'b110) begin \n                            lsu_state <= IDLE;\n                        end\n                    end\n                endcase\n            end\n        end\n    end\nendmodule\n"
  },
  {
    "path": "src/pc.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// PROGRAM COUNTER\n// > Calculates the next PC for each thread to update to (but currently we assume all threads\n//   update to the same PC and don't support branch divergence)\n// > Currently, each thread in each core has it's own calculation for next PC\n// > The NZP register value is set by the CMP instruction (based on >/=/< comparison) to \n//   initiate the BRnzp instruction for branching\nmodule pc #(\n    parameter DATA_MEM_DATA_BITS = 8,\n    parameter PROGRAM_MEM_ADDR_BITS = 8\n) (\n    input wire clk,\n    input wire reset,\n    input wire enable, // If current block has less threads then block size, some PCs will be inactive\n\n    // State\n    input reg [2:0] core_state,\n\n    // Control Signals\n    input reg [2:0] decoded_nzp,\n    input reg [DATA_MEM_DATA_BITS-1:0] decoded_immediate,\n    input reg decoded_nzp_write_enable,\n    input reg decoded_pc_mux, \n\n    // ALU Output - used for alu_out[2:0] to compare with NZP register\n    input reg [DATA_MEM_DATA_BITS-1:0] alu_out,\n\n    // Current & Next PCs\n    input reg [PROGRAM_MEM_ADDR_BITS-1:0] current_pc,\n    output reg [PROGRAM_MEM_ADDR_BITS-1:0] next_pc\n);\n    reg [2:0] nzp;\n\n    always @(posedge clk) begin\n        if (reset) begin\n            nzp <= 3'b0;\n            next_pc <= 0;\n        end else if (enable) begin\n            // Update PC when core_state = EXECUTE\n            if (core_state == 3'b101) begin \n                if (decoded_pc_mux == 1) begin \n                    if (((nzp & decoded_nzp) != 3'b0)) begin \n                        // On BRnzp instruction, branch to immediate if NZP case matches previous CMP\n                        next_pc <= decoded_immediate;\n                    end else begin \n                        // Otherwise, just update to PC + 1 (next line)\n                        next_pc <= current_pc + 1;\n                    end\n                end else begin \n                    // By 
default update to PC + 1 (next line)\n                    next_pc <= current_pc + 1;\n                end\n            end   \n\n            // Store NZP when core_state = UPDATE   \n            if (core_state == 3'b110) begin \n                // Write to NZP register on CMP instruction\n                if (decoded_nzp_write_enable) begin\n                    nzp[2] <= alu_out[2];\n                    nzp[1] <= alu_out[1];\n                    nzp[0] <= alu_out[0];\n                end\n            end      \n        end\n    end\n\nendmodule\n"
  },
  {
    "path": "src/registers.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// REGISTER FILE\n// > Each thread within each core has it's own register file with 13 free registers and 3 read-only registers\n// > Read-only registers hold the familiar %blockIdx, %blockDim, and %threadIdx values critical to SIMD\nmodule registers #(\n    parameter THREADS_PER_BLOCK = 4,\n    parameter THREAD_ID = 0,\n    parameter DATA_BITS = 8\n) (\n    input wire clk,\n    input wire reset,\n    input wire enable, // If current block has less threads then block size, some registers will be inactive\n\n    // Kernel Execution\n    input reg [7:0] block_id,\n\n    // State\n    input reg [2:0] core_state,\n\n    // Instruction Signals\n    input reg [3:0] decoded_rd_address,\n    input reg [3:0] decoded_rs_address,\n    input reg [3:0] decoded_rt_address,\n\n    // Control Signals\n    input reg decoded_reg_write_enable,\n    input reg [1:0] decoded_reg_input_mux,\n    input reg [DATA_BITS-1:0] decoded_immediate,\n\n    // Thread Unit Outputs\n    input reg [DATA_BITS-1:0] alu_out,\n    input reg [DATA_BITS-1:0] lsu_out,\n\n    // Registers\n    output reg [7:0] rs,\n    output reg [7:0] rt\n);\n    localparam ARITHMETIC = 2'b00,\n        MEMORY = 2'b01,\n        CONSTANT = 2'b10;\n\n    // 16 registers per thread (13 free registers and 3 read-only registers)\n    reg [7:0] registers[15:0];\n\n    always @(posedge clk) begin\n        if (reset) begin\n            // Empty rs, rt\n            rs <= 0;\n            rt <= 0;\n            // Initialize all free registers\n            registers[0] <= 8'b0;\n            registers[1] <= 8'b0;\n            registers[2] <= 8'b0;\n            registers[3] <= 8'b0;\n            registers[4] <= 8'b0;\n            registers[5] <= 8'b0;\n            registers[6] <= 8'b0;\n            registers[7] <= 8'b0;\n            registers[8] <= 8'b0;\n            registers[9] <= 8'b0;\n            registers[10] <= 8'b0;\n            registers[11] <= 8'b0;\n            
registers[12] <= 8'b0;\n            // Initialize read-only registers\n            registers[13] <= 8'b0;              // %blockIdx\n            registers[14] <= THREADS_PER_BLOCK; // %blockDim\n            registers[15] <= THREAD_ID;         // %threadIdx\n        end else if (enable) begin \n            // [Bad Solution] Shouldn't need to set this every cycle\n            registers[13] <= block_id; // Update the block_id when a new block is issued from dispatcher\n            \n            // Fill rs/rt when core_state = REQUEST\n            if (core_state == 3'b011) begin \n                rs <= registers[decoded_rs_address];\n                rt <= registers[decoded_rt_address];\n            end\n\n            // Store rd when core_state = UPDATE\n            if (core_state == 3'b110) begin \n                // Only allow writing to R0 - R12\n                if (decoded_reg_write_enable && decoded_rd_address < 13) begin\n                    case (decoded_reg_input_mux)\n                        ARITHMETIC: begin \n                            // ADD, SUB, MUL, DIV\n                            registers[decoded_rd_address] <= alu_out;\n                        end\n                        MEMORY: begin \n                            // LDR\n                            registers[decoded_rd_address] <= lsu_out;\n                        end\n                        CONSTANT: begin \n                            // CONST\n                            registers[decoded_rd_address] <= decoded_immediate;\n                        end\n                    endcase\n                end\n            end\n        end\n    end\nendmodule\n"
  },
  {
    "path": "src/scheduler.sv",
    "content": "`default_nettype none\n`timescale 1ns/1ns\n\n// SCHEDULER\n// > Manages the entire control flow of a single compute core processing 1 block\n// 1. FETCH - Retrieve instruction at current program counter (PC) from program memory\n// 2. DECODE - Decode the instruction into the relevant control signals\n// 3. REQUEST - If we have an instruction that accesses memory, trigger the async memory requests from LSUs\n// 4. WAIT - Wait for all async memory requests to resolve (if applicable)\n// 5. EXECUTE - Execute computations on retrieved data from registers / memory\n// 6. UPDATE - Update register values (including NZP register) and program counter\n// > Each core has it's own scheduler where multiple threads can be processed with\n//   the same control flow at once.\n// > Technically, different instructions can branch to different PCs, requiring \"branch divergence.\" In\n//   this minimal implementation, we assume no branch divergence (naive approach for simplicity)\nmodule scheduler #(\n    parameter THREADS_PER_BLOCK = 4,\n) (\n    input wire clk,\n    input wire reset,\n    input wire start,\n    \n    // Control Signals\n    input reg decoded_mem_read_enable,\n    input reg decoded_mem_write_enable,\n    input reg decoded_ret,\n\n    // Memory Access State\n    input reg [2:0] fetcher_state,\n    input reg [1:0] lsu_state [THREADS_PER_BLOCK-1:0],\n\n    // Current & Next PC\n    output reg [7:0] current_pc,\n    input reg [7:0] next_pc [THREADS_PER_BLOCK-1:0],\n\n    // Execution State\n    output reg [2:0] core_state,\n    output reg done\n);\n    localparam IDLE = 3'b000, // Waiting to start\n        FETCH = 3'b001,       // Fetch instructions from program memory\n        DECODE = 3'b010,      // Decode instructions into control signals\n        REQUEST = 3'b011,     // Request data from registers or memory\n        WAIT = 3'b100,        // Wait for response from memory if necessary\n        EXECUTE = 3'b101,     // Execute ALU and PC 
calculations\n        UPDATE = 3'b110,      // Update registers, NZP, and PC\n        DONE = 3'b111;        // Done executing this block\n    \n    always @(posedge clk) begin \n        if (reset) begin\n            current_pc <= 0;\n            core_state <= IDLE;\n            done <= 0;\n        end else begin \n            case (core_state)\n                IDLE: begin\n                    // Here after reset (before kernel is launched, or after previous block has been processed)\n                    if (start) begin \n                        // Start by fetching the next instruction for this block based on PC\n                        core_state <= FETCH;\n                    end\n                end\n                FETCH: begin \n                    // Move on once fetcher_state = FETCHED\n                    if (fetcher_state == 3'b010) begin \n                        core_state <= DECODE;\n                    end\n                end\n                DECODE: begin\n                    // Decode is synchronous so we move on after one cycle\n                    core_state <= REQUEST;\n                end\n                REQUEST: begin \n                    // Request is synchronous so we move on after one cycle\n                    core_state <= WAIT;\n                end\n                WAIT: begin\n                    // Wait for all LSUs to finish their request before continuing\n                    reg any_lsu_waiting = 1'b0;\n                    for (int i = 0; i < THREADS_PER_BLOCK; i++) begin\n                        // Make sure no lsu_state = REQUESTING or WAITING\n                        if (lsu_state[i] == 2'b01 || lsu_state[i] == 2'b10) begin\n                            any_lsu_waiting = 1'b1;\n                            break;\n                        end\n                    end\n\n                    // If no LSU is waiting for a response, move onto the next stage\n                    if (!any_lsu_waiting) begin\n                        
core_state <= EXECUTE;\n                    end\n                end\n                EXECUTE: begin\n                    // Execute is synchronous so we move on after one cycle\n                    core_state <= UPDATE;\n                end\n                UPDATE: begin \n                    if (decoded_ret) begin \n                        // If we reach a RET instruction, this block is done executing\n                        done <= 1;\n                        core_state <= DONE;\n                    end else begin \n                        // TODO: Branch divergence. For now assume all next_pc converge\n                        current_pc <= next_pc[THREADS_PER_BLOCK-1];\n\n                        // Update is synchronous so we move on after one cycle\n                        core_state <= FETCH;\n                    end\n                end\n                DONE: begin \n                    // no-op\n                end\n            endcase\n        end\n    end\nendmodule\n"
  },
  {
    "path": "test/__init__.py",
    "content": ""
  },
  {
    "path": "test/helpers/format.py",
    "content": "from typing import List, Optional\nfrom .logger import logger\n\ndef format_register(register: int) -> str:\n    if register < 13:\n        return f\"R{register}\"\n    if register == 13:\n        return f\"%blockIdx\"\n    if register == 14:\n        return f\"%blockDim\"\n    if register == 15:\n        return f\"%threadIdx\"\n    \ndef format_instruction(instruction: str) -> str:\n    opcode = instruction[0:4]\n    rd = format_register(int(instruction[4:8], 2))\n    rs = format_register(int(instruction[8:12], 2))\n    rt = format_register(int(instruction[12:16], 2))\n    n = \"N\" if instruction[4] == 1 else \"\"\n    z = \"Z\" if instruction[5] == 1 else \"\"\n    p = \"P\" if instruction[6] == 1 else \"\"\n    imm = f\"#{int(instruction[8:16], 2)}\"\n\n    if opcode == \"0000\":\n        return \"NOP\"\n    elif opcode == \"0001\":\n        return f\"BRnzp {n}{z}{p}, {imm}\"\n    elif opcode == \"0010\":\n        return f\"CMP {rs}, {rt}\"\n    elif opcode == \"0011\":\n        return f\"ADD {rd}, {rs}, {rt}\"\n    elif opcode == \"0100\":\n        return f\"SUB {rd}, {rs}, {rt}\"\n    elif opcode == \"0101\":\n        return f\"MUL {rd}, {rs}, {rt}\"\n    elif opcode == \"0110\":\n        return f\"DIV {rd}, {rs}, {rt}\"\n    elif opcode == \"0111\":\n        return f\"LDR {rd}, {rs}\"\n    elif opcode == \"1000\":\n        return f\"STR {rs}, {rt}\"\n    elif opcode == \"1001\":\n        return f\"CONST {rd}, {imm}\"\n    elif opcode == \"1111\":\n        return \"RET\"\n    return \"UNKNOWN\"\n\ndef format_core_state(core_state: str) -> str:\n    core_state_map = {\n        \"000\": \"IDLE\",\n        \"001\": \"FETCH\",\n        \"010\": \"DECODE\",\n        \"011\": \"REQUEST\",\n        \"100\": \"WAIT\",\n        \"101\": \"EXECUTE\",\n        \"110\": \"UPDATE\",\n        \"111\": \"DONE\"\n    }\n    return core_state_map[core_state]\n\ndef format_fetcher_state(fetcher_state: str) -> str:\n    fetcher_state_map = {\n        \"000\": 
\"IDLE\",\n        \"001\": \"FETCHING\",\n        \"010\": \"FETCHED\"\n    }\n    return fetcher_state_map[fetcher_state]\n\ndef format_lsu_state(lsu_state: str) -> str:\n    lsu_state_map = {\n        \"00\": \"IDLE\",\n        \"01\": \"REQUESTING\",\n        \"10\": \"WAITING\",\n        \"11\": \"DONE\"\n    }\n    return lsu_state_map[lsu_state]\n\ndef format_memory_controller_state(controller_state: str) -> str:\n    controller_state_map = {\n        \"000\": \"IDLE\",\n        \"010\": \"READ_WAITING\",\n        \"011\": \"WRITE_WAITING\",\n        \"100\": \"READ_RELAYING\",\n        \"101\": \"WRITE_RELAYING\"\n    }\n    return controller_state_map[controller_state]\n\ndef format_registers(registers: List[str]) -> str:\n    formatted_registers = []\n    for i, reg_value in enumerate(registers):\n        decimal_value = int(reg_value, 2)  # Convert binary string to decimal\n        reg_idx = 15 - i # Register data is provided in reverse order\n        formatted_registers.append(f\"{format_register(reg_idx)} = {decimal_value}\")\n    formatted_registers.reverse()\n    return ', '.join(formatted_registers)\n\ndef format_cycle(dut, cycle_id: int, thread_id: Optional[int] = None):\n    logger.debug(f\"\\n================================== Cycle {cycle_id} ==================================\")\n\n    for core in dut.cores:\n        # Not exactly accurate, but good enough for now\n        if int(str(dut.thread_count.value), 2) <= core.i.value * dut.THREADS_PER_BLOCK.value:\n            continue\n\n        logger.debug(f\"\\n+--------------------- Core {core.i.value} ---------------------+\")\n\n        instruction = str(core.core_instance.instruction.value)\n        for thread in core.core_instance.threads:\n            if int(thread.i.value) < int(str(core.core_instance.thread_count.value), 2): # if enabled\n                block_idx = core.core_instance.block_id.value\n                block_dim = int(core.core_instance.THREADS_PER_BLOCK)\n                
thread_idx = thread.register_instance.THREAD_ID.value\n                idx = block_idx * block_dim + thread_idx\n\n                rs = int(str(thread.register_instance.rs.value), 2)\n                rt = int(str(thread.register_instance.rt.value), 2)\n\n                reg_input_mux = int(str(core.core_instance.decoded_reg_input_mux.value), 2)\n                alu_out = int(str(thread.alu_instance.alu_out.value), 2)\n                lsu_out = int(str(thread.lsu_instance.lsu_out.value), 2)\n                constant = int(str(core.core_instance.decoded_immediate.value), 2)\n\n                if (thread_id is None or thread_id == idx):\n                    logger.debug(f\"\\n+-------- Thread {idx} --------+\")\n\n                    logger.debug(\"PC:\", int(str(core.core_instance.current_pc.value), 2))\n                    logger.debug(\"Instruction:\", format_instruction(instruction))\n                    logger.debug(\"Core State:\", format_core_state(str(core.core_instance.core_state.value)))\n                    logger.debug(\"Fetcher State:\", format_fetcher_state(str(core.core_instance.fetcher_state.value)))\n                    logger.debug(\"LSU State:\", format_lsu_state(str(thread.lsu_instance.lsu_state.value)))\n                    logger.debug(\"Registers:\", format_registers([str(item.value) for item in thread.register_instance.registers]))\n                    logger.debug(f\"RS = {rs}, RT = {rt}\")\n\n                    if reg_input_mux == 0:\n                        logger.debug(\"ALU Out:\", alu_out)\n                    if reg_input_mux == 1:\n                        logger.debug(\"LSU Out:\", lsu_out)\n                    if reg_input_mux == 2:\n                        logger.debug(\"Constant:\", constant)\n\n        logger.debug(\"Core Done:\", str(core.core_instance.done.value))"
  },
  {
    "path": "test/helpers/logger.py",
    "content": "import datetime\n\nclass Logger:\n    def __init__(self, level=\"debug\"):\n        self.filename = f\"test/logs/log_{datetime.datetime.now().strftime('%Y%m%d%H%M%S')}.txt\"\n        self.level = level\n\n    def debug(self, *messages):\n        if self.level == \"debug\":\n            self.info(*messages)\n\n    def info(self, *messages):\n        full_message = ' '.join(str(message) for message in messages)\n        with open(self.filename, \"a\") as log_file:\n            log_file.write(full_message + \"\\n\")\n\nlogger = Logger(level=\"debug\")"
  },
  {
    "path": "test/helpers/memory.py",
    "content": "from typing import List\nfrom .logger import logger\n\nclass Memory:\n    def __init__(self, dut, addr_bits, data_bits, channels, name):\n        self.dut = dut\n        self.addr_bits = addr_bits\n        self.data_bits = data_bits\n        self.memory = [0] * (2**addr_bits)\n        self.channels = channels\n        self.name = name\n\n        self.mem_read_valid = getattr(dut, f\"{name}_mem_read_valid\")\n        self.mem_read_address = getattr(dut, f\"{name}_mem_read_address\")\n        self.mem_read_ready = getattr(dut, f\"{name}_mem_read_ready\")\n        self.mem_read_data = getattr(dut, f\"{name}_mem_read_data\")\n\n        if name != \"program\":\n            self.mem_write_valid = getattr(dut, f\"{name}_mem_write_valid\")\n            self.mem_write_address = getattr(dut, f\"{name}_mem_write_address\")\n            self.mem_write_data = getattr(dut, f\"{name}_mem_write_data\")\n            self.mem_write_ready = getattr(dut, f\"{name}_mem_write_ready\")\n\n    def run(self):\n        mem_read_valid = [\n            int(str(self.mem_read_valid.value)[i:i+1], 2)\n            for i in range(0, len(str(self.mem_read_valid.value)), 1)\n        ]\n\n        mem_read_address = [\n            int(str(self.mem_read_address.value)[i:i+self.addr_bits], 2)\n            for i in range(0, len(str(self.mem_read_address.value)), self.addr_bits)\n        ]\n        mem_read_ready = [0] * self.channels\n        mem_read_data = [0] * self.channels\n\n        for i in range(self.channels):\n            if mem_read_valid[i] == 1:\n                mem_read_data[i] = self.memory[mem_read_address[i]]\n                mem_read_ready[i] = 1\n            else:\n                mem_read_ready[i] = 0\n\n        self.mem_read_data.value = int(''.join(format(d, '0' + str(self.data_bits) + 'b') for d in mem_read_data), 2)\n        self.mem_read_ready.value = int(''.join(format(r, '01b') for r in mem_read_ready), 2)\n\n        if self.name != \"program\":\n            
mem_write_valid = [\n                int(str(self.mem_write_valid.value)[i:i+1], 2)\n                for i in range(0, len(str(self.mem_write_valid.value)), 1)\n            ]\n            mem_write_address = [\n                int(str(self.mem_write_address.value)[i:i+self.addr_bits], 2)\n                for i in range(0, len(str(self.mem_write_address.value)), self.addr_bits)\n            ]\n            mem_write_data = [\n                int(str(self.mem_write_data.value)[i:i+self.data_bits], 2)\n                for i in range(0, len(str(self.mem_write_data.value)), self.data_bits)\n            ]\n            mem_write_ready = [0] * self.channels\n\n            for i in range(self.channels):\n                if mem_write_valid[i] == 1:\n                    self.memory[mem_write_address[i]] = mem_write_data[i]\n                    mem_write_ready[i] = 1\n                else:\n                    mem_write_ready[i] = 0\n\n            self.mem_write_ready.value = int(''.join(format(w, '01b') for w in mem_write_ready), 2)\n\n    def write(self, address, data):\n        if address < len(self.memory):\n            self.memory[address] = data\n\n    def load(self, rows: List[int]):\n        for address, data in enumerate(rows):\n            self.write(address, data)\n\n    def display(self, rows, decimal=True):\n        logger.info(\"\\n\")\n        logger.info(f\"{self.name.upper()} MEMORY\")\n        \n        table_size = (8 * 2) + 3\n        logger.info(\"+\" + \"-\" * (table_size - 3) + \"+\")\n\n        header = \"| Addr | Data \"\n        logger.info(header + \" \" * (table_size - len(header) - 1) + \"|\")\n\n        logger.info(\"+\" + \"-\" * (table_size - 3) + \"+\")\n        for i, data in enumerate(self.memory):\n            if i < rows:\n                if decimal:\n                    row = f\"| {i:<4} | {data:<4}\"\n                    logger.info(row + \" \" * (table_size - len(row) - 1) + \"|\")\n                else:\n                    data_bin = 
format(data, f'0{16}b')\n                    row = f\"| {i:<4} | {data_bin} |\"\n                    logger.info(row + \" \" * (table_size - len(row) - 1) + \"|\")\n        logger.info(\"+\" + \"-\" * (table_size - 3) + \"+\")"
  },
  {
    "path": "test/helpers/setup.py",
    "content": "from typing import List\nimport cocotb\nfrom cocotb.clock import Clock\nfrom cocotb.triggers import RisingEdge\nfrom .memory import Memory\n\nasync def setup(\n    dut, \n    program_memory: Memory, \n    program: List[int],\n    data_memory: Memory,\n    data: List[int],\n    threads: int\n):\n    # Setup Clock\n    clock = Clock(dut.clk, 25, units=\"us\")\n    cocotb.start_soon(clock.start())\n\n    # Reset\n    dut.reset.value = 1\n    await RisingEdge(dut.clk)\n    dut.reset.value = 0\n\n    # Load Program Memory\n    program_memory.load(program)\n\n    # Load Data Memory\n    data_memory.load(data)\n\n    # Device Control Register\n    dut.device_control_write_enable.value = 1\n    dut.device_control_data.value = threads\n    await RisingEdge(dut.clk)\n    dut.device_control_write_enable.value = 0\n\n    # Start\n    dut.start.value = 1\n"
  },
  {
    "path": "test/logs/.gitkeep",
    "content": ""
  },
  {
    "path": "test/test_matadd.py",
    "content": "import cocotb\nfrom cocotb.triggers import RisingEdge\nfrom .helpers.setup import setup\nfrom .helpers.memory import Memory\nfrom .helpers.format import format_cycle\nfrom .helpers.logger import logger\n\n@cocotb.test()\nasync def test_matadd(dut):\n    # Program Memory\n    program_memory = Memory(dut=dut, addr_bits=8, data_bits=16, channels=1, name=\"program\")\n    program = [\n        0b0101000011011110, # MUL R0, %blockIdx, %blockDim\n        0b0011000000001111, # ADD R0, R0, %threadIdx         ; i = blockIdx * blockDim + threadIdx\n        0b1001000100000000, # CONST R1, #0                   ; baseA (matrix A base address)\n        0b1001001000001000, # CONST R2, #8                   ; baseB (matrix B base address)\n        0b1001001100010000, # CONST R3, #16                  ; baseC (matrix C base address)\n        0b0011010000010000, # ADD R4, R1, R0                 ; addr(A[i]) = baseA + i\n        0b0111010001000000, # LDR R4, R4                     ; load A[i] from global memory\n        0b0011010100100000, # ADD R5, R2, R0                 ; addr(B[i]) = baseB + i\n        0b0111010101010000, # LDR R5, R5                     ; load B[i] from global memory\n        0b0011011001000101, # ADD R6, R4, R5                 ; C[i] = A[i] + B[i]\n        0b0011011100110000, # ADD R7, R3, R0                 ; addr(C[i]) = baseC + i\n        0b1000000001110110, # STR R7, R6                     ; store C[i] in global memory\n        0b1111000000000000, # RET                            ; end of kernel\n    ]\n\n    # Data Memory\n    data_memory = Memory(dut=dut, addr_bits=8, data_bits=8, channels=4, name=\"data\")\n    data = [\n        0, 1, 2, 3, 4, 5, 6, 7, # Matrix A (1 x 8)\n        0, 1, 2, 3, 4, 5, 6, 7  # Matrix B (1 x 8)\n    ]\n\n    # Device Control\n    threads = 8\n\n    await setup(\n        dut=dut,\n        program_memory=program_memory,\n        program=program,\n        data_memory=data_memory,\n        data=data,\n        
threads=threads\n    )\n\n    data_memory.display(24)\n\n    cycles = 0\n    while dut.done.value != 1:\n        data_memory.run()\n        program_memory.run()\n\n        await cocotb.triggers.ReadOnly()\n        format_cycle(dut, cycles)\n        \n        await RisingEdge(dut.clk)\n        cycles += 1\n\n    logger.info(f\"Completed in {cycles} cycles\")\n    data_memory.display(24)\n\n    expected_results = [a + b for a, b in zip(data[0:8], data[8:16])]\n    for i, expected in enumerate(expected_results):\n        result = data_memory.memory[i + 16]\n        assert result == expected, f\"Result mismatch at index {i}: expected {expected}, got {result}\""
  },
  {
    "path": "test/test_matmul.py",
    "content": "import cocotb\nfrom cocotb.triggers import RisingEdge\nfrom .helpers.setup import setup\nfrom .helpers.memory import Memory\nfrom .helpers.format import format_cycle\nfrom .helpers.logger import logger\n\n@cocotb.test()\nasync def test_matadd(dut):\n    # Program Memory\n    program_memory = Memory(dut=dut, addr_bits=8, data_bits=16, channels=1, name=\"program\")\n    program = [\n        0b0101000011011110, # MUL R0, %blockIdx, %blockDim\n        0b0011000000001111, # ADD R0, R0, %threadIdx         ; i = blockIdx * blockDim + threadIdx\n        0b1001000100000001, # CONST R1, #1                   ; increment\n        0b1001001000000010, # CONST R2, #2                   ; N (matrix inner dimension)\n        0b1001001100000000, # CONST R3, #0                   ; baseA (matrix A base address)\n        0b1001010000000100, # CONST R4, #4                   ; baseB (matrix B base address)\n        0b1001010100001000, # CONST R5, #8                   ; baseC (matrix C base address)\n        0b0110011000000010, # DIV R6, R0, R2                 ; row = i // N\n        0b0101011101100010, # MUL R7, R6, R2\n        0b0100011100000111, # SUB R7, R0, R7                 ; col = i % N\n        0b1001100000000000, # CONST R8, #0                   ; acc = 0\n        0b1001100100000000, # CONST R9, #0                   ; k = 0\n                            # LOOP:\n        0b0101101001100010, #   MUL R10, R6, R2\n        0b0011101010101001, #   ADD R10, R10, R9\n        0b0011101010100011, #   ADD R10, R10, R3             ; addr(A[i]) = row * N + k + baseA\n        0b0111101010100000, #   LDR R10, R10                 ; load A[i] from global memory\n        0b0101101110010010, #   MUL R11, R9, R2\n        0b0011101110110111, #   ADD R11, R11, R7\n        0b0011101110110100, #   ADD R11, R11, R4             ; addr(B[i]) = k * N + col + baseB\n        0b0111101110110000, #   LDR R11, R11                 ; load B[i] from global memory\n        0b0101110010101011, #   MUL 
R12, R10, R11\n        0b0011100010001100, #   ADD R8, R8, R12              ; acc = acc + A[i] * B[i]\n        0b0011100110010001, #   ADD R9, R9, R1               ; increment k\n        0b0010000010010010, #   CMP R9, R2\n        0b0001100000001100, #   BRn LOOP                     ; loop while k < N\n        0b0011100101010000, # ADD R9, R5, R0                 ; addr(C[i]) = baseC + i \n        0b1000000010011000, # STR R9, R8                     ; store C[i] in global memory\n        0b1111000000000000  # RET                            ; end of kernel\n    ]\n\n    # Data Memory\n    data_memory = Memory(dut=dut, addr_bits=8, data_bits=8, channels=4, name=\"data\")\n    data = [\n        1, 2, 3, 4, # Matrix A (2 x 2)\n        1, 2, 3, 4, # Matrix B (2 x 2)\n    ]\n\n    # Device Control\n    threads = 4\n\n    await setup(\n        dut=dut,\n        program_memory=program_memory,\n        program=program,\n        data_memory=data_memory,\n        data=data,\n        threads=threads\n    )\n\n    data_memory.display(12)\n\n    cycles = 0\n    while dut.done.value != 1:\n        data_memory.run()\n        program_memory.run()\n\n        await cocotb.triggers.ReadOnly()\n        format_cycle(dut, cycles, thread_id=1)\n        \n        await RisingEdge(dut.clk)\n        cycles += 1\n\n    logger.info(f\"Completed in {cycles} cycles\")\n    data_memory.display(12)\n\n\n    # Assuming the matrices are 2x2 and the result is stored starting at address 8 (baseC)\n    matrix_a = [data[0:2], data[2:4]]  # First matrix (2x2)\n    matrix_b = [data[4:6], data[6:8]]  # Second matrix (2x2)\n    expected_results = [\n        matrix_a[0][0] * matrix_b[0][0] + matrix_a[0][1] * matrix_b[1][0],  # C[0,0]\n        matrix_a[0][0] * matrix_b[0][1] + matrix_a[0][1] * matrix_b[1][1],  # C[0,1]\n        matrix_a[1][0] * matrix_b[0][0] + matrix_a[1][1] * matrix_b[1][0],  # C[1,0]\n        matrix_a[1][0] * matrix_b[0][1] + matrix_a[1][1] * matrix_b[1][1],  # C[1,1]\n    ]\n    for i, expected in 
enumerate(expected_results):\n        result = data_memory.memory[i + 8]  # Results start at address 8 (baseC)\n        assert result == expected, f\"Result mismatch at index {i}: expected {expected}, got {result}\"\n"
  }
]