Full Code of cea-wind/SimpleTPU for AI

master b81f32e563ca cached
21 files
45.1 KB
17.8k tokens
25 symbols
1 requests
Download .txt
Repository: cea-wind/SimpleTPU
Branch: master
Commit: b81f32e563ca
Files: 21
Total size: 45.1 KB

Directory structure:
gitextract_zygrtfz8/

├── README.md
├── data/
│   └── golden_result.txt
├── lab1/
│   ├── README.md
│   ├── refcode/
│   │   ├── conv3d.m
│   │   ├── convmxu.m
│   │   └── saveparam.m
│   ├── run_hls.tcl
│   └── src/
│       ├── mxu.cpp
│       ├── tb_mxu.cpp
│       └── tpu.h
├── lab2/
│   ├── README.md
│   ├── run_hls.tcl
│   └── src/
│       ├── relu_norm_pool.cpp
│       ├── tb_pool.cpp
│       └── tpu.h
└── src/
    ├── ctrl.cpp
    ├── mxu.cpp
    ├── norm_relu_pool.cpp
    ├── tb_tpu.cpp
    ├── tpu.cpp
    └── tpu.h

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# SimpleTPU

A Tensor Processing Unit is designed to accelerate the matrix multiplication, especially for Multilayer perceptron and Convolution Nerual Network.    
This implmentaion is mainly following the Google TPU Version 1, which architecture is introduced in [https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf](https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf "In-Datacenter Performance Analysis of a Tensor Processing Unit").

It may cost a lot of time to implementation TPU using Hardware Description Language (such as VHDL or Verilog HDL), even if I had tried to simplify it. So I try to use the Xilinx HLS ToolKit to complete it. 

The plan is divided into three phases.

- Phase 1: Completing the main computing module,including
    - Lab1:Systolic Array
    - Lab2:Relu, Normalization & Pooling 
- Phase 2: Finish the full design of simpleTPU.
- Phase 3: Testing the simpleTPU through some real network, such as MLP and CNN.

# Key Features

The key features of Simple TPU including
- Int8 mulitply & Int32 accumulators
- VLIW based instruction parallel
- Vector Architecture based data parallel

Here are some operate which Simple TPU can support. 

Operate | Support
-|-
Conv3d | in_channels: Resource Constrained  <br> out_channels: Resource Constrained<br>kerner_size: Support<br>stride: support<br>padding: Support<br>dilation:Support<br>groups: Architecture Constrained<br>bias    :Support
ConvTranspose3d | The same as above
Maxpool2d | kernel_size: Support <br>stride: Support<br>padding: Support    
Avgpool2d | The same as above
Relu | Only support Relu as nonlinear function
BatchNorm2d | BatchNorm2d is merge with Conv or Pool when inference
Linear | Resource Constrained 
UpscalingNearest2D | Support (calling Avgpool2d multiple times.)
UpscalingBilinear2D | Support (calling Avgpool2d multiple times.)


# Performance
The size of mac array in SimpleTPU is 32*32, the clock frequency is 500MHz (timing closure when using Xilinx ultrascale+ FPGA, Speed -2).  
$$32\times 32 \times 500 \times 2 = 1 Tops(int8)$$

# Installation
 **env** :   
 - Vivado HLS 2018.2

 **run** :  
 - step1: `vivado_hls -f run_hls.tcl`
 - step2: lanch vivado HLS and open the project  
 - step3: Run C synthesis, C/RTL cosimulation e.t.c

**Synthesis Result**:    
![result](./pictures/syn.png)    
**Simulation Result**:    
![result](./pictures/sim.png)
# Examlpes
## 1. MLP
The network structure of mlp is defined as follow.
```
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.hidden = nn.Linear(784,64)
        self.fc = nn.Linear(64,10)

    def forward(self, x):
        x = x.view(-1,784)
        x = self.hidden(x)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)
```

Work efficiency of SimpleTPU is about 84%.


|LOC| Layers | Nonlinear function | Weights | Batch Size | % of Deployed|
|---|---|---|----|----|----|
|10 | 2 FC | Relu | 5M | 512 | 16%|

Classfication Result in MNIST.

![result](./pictures/cla_result.png)
## 2. CNN
Because there is no compiler to generate instruction, this plan was suspended.
If you want to kown how to calculate convolution using SimpleTPU, lab1  provides a simple example.


# Relative Link  
https://www.cnblogs.com/sea-wind/p/10993958.html


================================================
FILE: data/golden_result.txt
================================================
7
2
1
0
4
1
4
9
6
9
0
6
9
0
1
5
9
7
6
4
9
6
6
5
4
0
7
4
0
1
3
1
3
6
7
2
7
1
2
1
1
7
4
2
6
5
1
2
4
4
6
3
5
5
6
0
4
1
9
5
7
8
4
2
7
4
6
4
3
0
7
0
2
9
1
7
3
7
9
7
9
6
2
7
8
4
7
5
6
1
3
6
4
3
1
4
1
1
6
9
6
0
5
4
9
9
2
1
4
4
8
1
3
9
7
4
4
4
9
2
5
4
7
6
4
9
0
5
8
5
6
6
5
2
8
1
0
1
6
4
6
7
3
1
9
1
8
2
0
9
9
9
5
5
1
5
6
0
3
4
4
6
5
4
6
5
4
5
1
4
4
7
2
3
2
1
1
8
1
8
1
8
5
0
8
9
2
5
0
1
1
1
0
4
0
5
1
6
4
2
3
6
1
1
1
3
9
5
2
9
4
5
9
3
9
0
3
6
5
5
7
2
2
7
1
2
8
4
1
7
3
3
8
9
7
9
2
2
4
1
5
8
8
4
2
6
0
6
4
2
4
1
9
5
7
7
2
8
2
0
8
1
7
7
9
1
8
1
8
0
3
0
1
9
9
4
1
8
2
1
2
9
7
5
9
2
6
4
1
5
4
2
9
2
0
4
0
0
2
8
6
2
1
2
4
0
2
9
4
3
3
0
0
5
1
9
6
4
0
5
1
7
9
3
0
4
2
0
7
1
1
2
1
5
3
3
4
7
8
6
6
4
1
3
5
1
0
5
1
9
1
5
0
6
1
8
5
1
9
4
4
6
7
1
5
0
6
5
6
3
7
2
0
8
8
5
4
1
1
4
0
7
3
7
6
1
6
2
1
4
2
8
6
1
9
5
2
5
4
4
2
8
3
9
2
4
6
0
3
1
7
7
3
7
9
7
1
9
2
1
4
2
9
2
0
4
9
1
4
8
1
8
4
4
9
8
8
3
7
6
0
0
3
0
8
0
6
4
8
5
3
3
2
3
9
1
2
6
8
0
5
6
6
6
9
8
8
2
2
5
8
9
6
1
8
4
1
2
8
3
1
9
7
5
4
0
8
9
9
1
0
5
2
3
7
8
9
4
0
6
3
9
1
2
1
8
1
5
6
5
2
1


================================================
FILE: lab1/README.md
================================================
# Systolic Array 

Systolic Array implement in FPGA using Xilinx HLS.

## 1.Env & Build  
 **env** :   
 - Vivado HLS 2018.2 or 2016.3 , MATLAB 2014a(for matlabcode)  
 
 **run** :  
 - step1: `vivado_hls -f run_hls.tcl`
 - step2: lanch vivado HLS and open the project  
 - step3: Run C synthesis, C/RTL cosimulation e.t.c

## 2.Relative Link  
https://www.cnblogs.com/sea-wind/p/10995360.html

================================================
FILE: lab1/refcode/conv3d.m
================================================

rng(0);
feature = randi([-128,127],14,14,32);
weight = randi([-128,127],32,3,3,32);
bias = randi([-1024,1023],1,32);
output = zeros(14,14,32);

saveparam(feature,weight,bias)

out1 = convmxu(weight,feature,bias,2,2);
out2 = convmxu(weight,feature,zeros(1,32),1,1);
out3 = convmxu(weight,feature,zeros(1,32),1,2);
out4 = convmxu(weight,feature,zeros(1,32),1,3);
out5 = convmxu(weight,feature,zeros(1,32),2,1);
out6 = convmxu(weight,feature,zeros(1,32),2,3);
out7 = convmxu(weight,feature,zeros(1,32),3,1);
out8 = convmxu(weight,feature,zeros(1,32),3,2);
out9 = convmxu(weight,feature,zeros(1,32),3,3);

output = out1;
output(2:end,2:end,:) = output(2:end,2:end,:) + out2(1:end-1,1:end-1,:);
output(2:end,:,:) = output(2:end,:,:) + out3(1:end-1,:,:);
output(2:end,1:end-1,:) = output(2:end,1:end-1,:) + out4(1:end-1,2:end,:);
output(:,2:end,:) = output(:,2:end,:) + out5(:,1:end-1,:);
output(:,1:end-1,:) = output(:,1:end-1,:) + out6(:,2:end,:);
output(1:end-1,2:end,:) = output(1:end-1,2:end,:) + out7(2:end,1:end-1,:);
output(1:end-1,:,:) = output(1:end-1,:,:) + out8(2:end,:,:);
output(1:end-1,1:end-1,:) = output(1:end-1,1:end-1,:) + out9(2:end,2:end,:);

golden = zeros(14,14,32);
for k = 1:32
   wk = reshape(weight(k,:,:,:),3,3,32);
   wk = wk(end:-1:1,end:-1:1,end:-1:1);
   tmp = convn(feature,wk,'same');
   golden(:,:,k) = tmp(:,:,16)+bias(k);
end
golden = int32(golden);
fid = fopen('golden.dat','wb');
for i=1:14
    for j=1:14
        fwrite(fid,golden(i,j,:),'int32');
    end
end
fclose(fid);

================================================
FILE: lab1/refcode/convmxu.m
================================================
function [out1] = convmxu(weight,feature,bias,index1,index2)
%UNTITLED3 Summary of this function goes here
%   Detailed explanation goes here

out1 = zeros(14,14,32);
for i = 1:14
    for j = 1:14
        for k = 1:32
            for c = 1:32
                if(c==1)
                    out1(i,j,k) = bias(k) + weight(k,index1,index2,c)*feature(i,j,c);
                else
                    out1(i,j,k) = out1(i,j,k) + weight(k,index1,index2,c)*feature(i,j,c);
                end
            end
        end
    end
end

end



================================================
FILE: lab1/refcode/saveparam.m
================================================
function [] = saveparam(feature,weight,bias)
%UNTITLED2 Summary of this function goes here
%   Detailed explanation goes here

feature = int8(feature);
weight = int8(weight);
bias = int32(bias);
bias4 = bitand(bitshift(bias,-24),int32(255));
bias3 = bitand(bitshift(bias,-16),int32(255));
bias2 = bitand(bitshift(bias,-8),int32(255));
bias1 = bitand(bias,int32(255));
fid = fopen('feature.dat','wb');
for i=1:14
    for j=1:14
        fwrite(fid,feature(i,j,:),'int8');
    end
end
fclose(fid);

fid = fopen('weight.dat','wb');
for k=1:32
    fwrite(fid,weight(:,2,2,k),'int8');
end
fwrite(fid,uint8(bias4),'uint8');
fwrite(fid,uint8(bias3),'uint8');
fwrite(fid,uint8(bias2),'uint8');
fwrite(fid,uint8(bias1),'uint8');
for i=1:3
    for j=1:3
        for k=1:32
            if(~(i==2&&j==2))
                fwrite(fid,weight(:,i,j,k),'int8');
            end
        end
        if(~(i==2&&j==2))
            for k=1:32
                fwrite(fid,0,'int32');
            end
        end
    end
end
fclose(fid);

end



================================================
FILE: lab1/run_hls.tcl
================================================
open_project -reset mxu_conv_prj
set_top MXU
add_files src/tpu.h
add_files src/mxu.cpp
add_files -tb data/feature.dat
add_files -tb data/golden.dat
add_files -tb data/weight.dat
add_files -tb src/tb_mxu.cpp

open_solution -reset "solution1"
set_part {xczu7cg-fbvb900-2-i} -tool vivado
create_clock -period 2.5 -name default

csim_design
# Do not perform any other steps
# - The basic project will be opened in the GUI 
exit

================================================
FILE: lab1/src/mxu.cpp
================================================

#include "tpu.h"

void SetWeight(WEIGHTDTYPE weight[512][MXU_COLNUM],WEIGHTDTYPE weightreg[MXU_ROWNUM+4][MXU_COLNUM],
		short weight_raddr, bool enable){
	if(!enable)
		return;
	for(short i=weight_raddr;i<weight_raddr+4+MXU_ROWNUM;i++){
#pragma HLS PIPELINE
		for(int j=0;j<MXU_ROWNUM+4;j++){
			for(int k=0;k<MXU_COLNUM;k++){
				if(j!=MXU_ROWNUM+3)
					weightreg[j][k] = weightreg[j+1][k];
				else
					weightreg[j][k] = weight[i][k];
			}
		}
	}
}

void MacArray(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weightreg[4+MXU_ROWNUM][MXU_COLNUM],
		PSUMDTYPE psum[512][MXU_COLNUM],MXU_PARAM mxuparam,bool enable){
	if(!enable)
		return;
	FEATDTYPE featreg[MXU_ROWNUM][MXU_COLNUM+MXU_ROWNUM-1];
	PSUMDTYPE psumreg[MXU_ROWNUM][MXU_COLNUM];
    short ubuf_raddr_p1=0;
    short ubuf_raddr_p2=0;
    short ubuf_raddr_p3=0;
    short psum_addr_p1[MXU_COLNUM];
    short psum_addr_p2[MXU_COLNUM];
    for(int i=0;i<MXU_COLNUM;i++){
#pragma HLS UNROLL
    	psum_addr_p1[i] = 0;
    	psum_addr_p2[i] = 0;
    }
	for(short i=0;i<mxuparam.ubuf_raddr_num+MXU_ROWNUM+MXU_COLNUM-2;i++){
#pragma HLS PIPELINE
    short ubuf_raddr = mxuparam.ubuf_raddr_start + ubuf_raddr_p1 + ubuf_raddr_p2 + ubuf_raddr_p3;
    if(ubuf_raddr_p1==mxuparam.ubuf_raddr_end1){
        ubuf_raddr_p1 = 0;
        if(ubuf_raddr_p2==mxuparam.ubuf_raddr_end2){
            ubuf_raddr_p2 = 0;
            ubuf_raddr_p3 = ubuf_raddr_p3 +  mxuparam.ubuf_raddr_step3;
        }
        else{
            ubuf_raddr_p2 = ubuf_raddr_p2 +  mxuparam.ubuf_raddr_step2;
        }
    }
    else{
        ubuf_raddr_p1 = ubuf_raddr_p1 + mxuparam.ubuf_raddr_step1;
    }

		for(int j=0;j<MXU_ROWNUM;j++){
			for(int k=MXU_ROWNUM+MXU_COLNUM-2;k>=0;k--){
				if(k>0)
					featreg[j][k] = featreg[j][k-1];
				else
					if(i<mxuparam.ubuf_raddr_num)
						featreg[j][k] = ubuf[ubuf_raddr][j];
					else
						featreg[j][k] = 0;
			}
		}

		for(int j=MXU_ROWNUM-1;j>=0;j--){
			for(int k=0;k<MXU_COLNUM;k++){
				ap_int<32> biasreg;
				biasreg(31,24)=weightreg[MXU_ROWNUM+0][k];
				biasreg(23,16)=weightreg[MXU_ROWNUM+1][k];
				biasreg(15, 8)=weightreg[MXU_ROWNUM+2][k];
				biasreg( 7, 0)=weightreg[MXU_ROWNUM+3][k];
				if(j==0)
					psumreg[j][k] = featreg[j][k+j]*weightreg[j][k] + biasreg;
				else
					psumreg[j][k] = featreg[j][k+j]*weightreg[j][k] + psumreg[j-1][k];
			}
		}
#pragma HLS DEPENDENCE variable=psum inter false
#pragma HLS DEPENDENCE variable=psum intra false
		for(int j=0;j<MXU_COLNUM;j++){
			if(i>=j+MXU_ROWNUM-1&&i<mxuparam.ubuf_raddr_num+j+MXU_ROWNUM-1){
				short psum_raddr = mxuparam.psum_start + psum_addr_p1[j] + psum_addr_p2[j];
				if(psum_addr_p1[j]==mxuparam.psum_end1){
					psum_addr_p1[j] = 0;
					psum_addr_p2[j] = psum_addr_p2[j] + mxuparam.psum_step2;
				}
				else{
					psum_addr_p1[j] = psum_addr_p1[j] + mxuparam.psum_step1;
				}
				if(mxuparam.isfirstpsum)
					psum[psum_raddr][j] = psumreg[MXU_ROWNUM-1][j];
				else
					psum[psum_raddr][j] = psumreg[MXU_ROWNUM-1][j] + psum[psum_raddr][j];
			}
		}
	}
}

void MXU(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weight[512][MXU_COLNUM],PSUMDTYPE psum[512][MXU_COLNUM],MXU_PARAM mxuparam){
#pragma HLS INTERFACE bram port=ubuf
#pragma HLS INTERFACE bram port=weight
#pragma HLS INTERFACE bram port=psum
#pragma HLS DATA_PACK variable=mxuparam
#pragma HLS ARRAY_PARTITION variable=ubuf complete dim=2
#pragma HLS ARRAY_PARTITION variable=weight complete dim=2
#pragma HLS ARRAY_PARTITION variable=psum complete dim=2

	static WEIGHTDTYPE weightreg1[4+MXU_ROWNUM][MXU_COLNUM];
	static WEIGHTDTYPE weightreg2[4+MXU_ROWNUM][MXU_COLNUM];
#pragma HLS ARRAY_PARTITION variable=weightreg1 complete dim=0
#pragma HLS ARRAY_PARTITION variable=weightreg2 complete dim=0

	if(mxuparam.isping){
		SetWeight(weight,weightreg1,mxuparam.weight_raddr,mxuparam.isload);
		MacArray(ubuf,weightreg2,psum,mxuparam,mxuparam.iscalc);
	}
	else{
		SetWeight(weight,weightreg2,mxuparam.weight_raddr,mxuparam.isload);
		MacArray(ubuf,weightreg1,psum,mxuparam,mxuparam.iscalc);
	}
}


================================================
FILE: lab1/src/tb_mxu.cpp
================================================
#include "tpu.h"
#include "stdio.h"
#include "stdlib.h"

int main(){
	FEATDTYPE ubuf[16384][MXU_ROWNUM];
	WEIGHTDTYPE weight[512][MXU_COLNUM];
	int psum[512][MXU_COLNUM];
	FILE *fid;
	fid = fopen("feature.dat","rb");
	for(int i=0;i<14*14;i++){
		for(int j=0;j<32;j++){
			char a;
			fread(&a,sizeof(char),1,fid);
			ubuf[i][j] = a;
		}
	}
	fclose(fid);
	fid = fopen("weight.dat","rb");
	for(int i=0;i<3*3*32+32;i++){
		for(int j=0;j<32;j++){
			char a;
			fread(&a,sizeof(char),1,fid);
			weight[i][j] = a;
		}
	}
	fclose(fid);
	MXU_PARAM mxuparam;
	struct MXU_PARAM{
		bool isload;
		bool iscalc;
		bool isping;
		bool isfirstpsum;

		short weight_raddr;
		short ubuf_raddr_start;
		short ubuf_raddr_step1;
		short ubuf_raddr_step2;
		short ubuf_raddr_step3;
		short ubuf_raddr_end1;
		short ubuf_raddr_end2;
		short ubuf_raddr_end3;
		short ubuf_raddr_num;
		short psum_start;

	};

	mxuparam.isload = true;
	mxuparam.iscalc = false;
	mxuparam.isping = true;
	mxuparam.weight_raddr = 0;
	MXU(ubuf,weight,psum,mxuparam);
// 2,2
	mxuparam.weight_raddr = 36;
	mxuparam.iscalc = true;
	mxuparam.isping = false;
	mxuparam.isfirstpsum = true;
	mxuparam.ubuf_raddr_start = 0;
	mxuparam.ubuf_raddr_num = 14*14;
	mxuparam.ubuf_raddr_step1 = 1;
	mxuparam.ubuf_raddr_step2 = 14;
	mxuparam.ubuf_raddr_step3 = 1;
	mxuparam.ubuf_raddr_end1 = 14-1;
	mxuparam.ubuf_raddr_end2 = 14-1;
	mxuparam.psum_start = 0;
	mxuparam.psum_step1 = 1;
	mxuparam.psum_end1 = 13;
	mxuparam.psum_step2 = 14;
	MXU(ubuf,weight,psum,mxuparam);

//1,1
	mxuparam.weight_raddr = 36*2;
	mxuparam.iscalc = true;
	mxuparam.isping = true;
	mxuparam.isfirstpsum = false;
	mxuparam.ubuf_raddr_start = 0;
	mxuparam.ubuf_raddr_num = 13*13;
	mxuparam.ubuf_raddr_step1 = 1;
	mxuparam.ubuf_raddr_step2 = 14;
	mxuparam.ubuf_raddr_step3 = 1;
	mxuparam.ubuf_raddr_end1 = 13-1;
	mxuparam.ubuf_raddr_end2 = 14*13-1;
	mxuparam.psum_start = 15;
	mxuparam.psum_step1 = 1;
	mxuparam.psum_end1 = 12;
	mxuparam.psum_step2 = 14;
	MXU(ubuf,weight,psum,mxuparam);
//1,2
	mxuparam.weight_raddr = 36*3;
	mxuparam.iscalc = true;
	mxuparam.isping = false;
	mxuparam.isfirstpsum = false;
	mxuparam.ubuf_raddr_start = 0;
	mxuparam.ubuf_raddr_num = 14*13;
	mxuparam.ubuf_raddr_step1 = 1;
	mxuparam.ubuf_raddr_step2 = 14;
	mxuparam.ubuf_raddr_step3 = 1;
	mxuparam.ubuf_raddr_end1 = 14-1;
	mxuparam.ubuf_raddr_end2 = 14*13-1;
	mxuparam.psum_start = 14;
	mxuparam.psum_step1 = 1;
	mxuparam.psum_end1 = 13;
	mxuparam.psum_step2 = 14;
	MXU(ubuf,weight,psum,mxuparam);
//1,3
	mxuparam.weight_raddr = 36*4;
	mxuparam.iscalc = true;
	mxuparam.isping = !mxuparam.isping;
	mxuparam.isfirstpsum = false;
	mxuparam.ubuf_raddr_start = 1;
	mxuparam.ubuf_raddr_num = 13*13;
	mxuparam.ubuf_raddr_step1 = 1;
	mxuparam.ubuf_raddr_step2 = 14;
	mxuparam.ubuf_raddr_step3 = 1;
	mxuparam.ubuf_raddr_end1 = 13-1;
	mxuparam.ubuf_raddr_end2 = 13*13-1;
	mxuparam.psum_start = 14;
	mxuparam.psum_step1 = 1;
	mxuparam.psum_end1 = 12;
	mxuparam.psum_step2 = 14;
	MXU(ubuf,weight,psum,mxuparam);
//2,1
	mxuparam.weight_raddr = 36*5;
	mxuparam.iscalc = true;
	mxuparam.isping = !mxuparam.isping;
	mxuparam.isfirstpsum = false;
	mxuparam.ubuf_raddr_start = 0;
	mxuparam.ubuf_raddr_num = 14*13;
	mxuparam.ubuf_raddr_step1 = 1;
	mxuparam.ubuf_raddr_step2 = 14;
	mxuparam.ubuf_raddr_step3 = 1;
	mxuparam.ubuf_raddr_end1 = 13-1;
	mxuparam.ubuf_raddr_end2 = 14*13-1;
	mxuparam.psum_start = 1;
	mxuparam.psum_step1 = 1;
	mxuparam.psum_end1 = 12;
	mxuparam.psum_step2 = 14;
	MXU(ubuf,weight,psum,mxuparam);


//2,3
	mxuparam.weight_raddr = 36*6;
	mxuparam.iscalc = true;
	mxuparam.isping = !mxuparam.isping;
	mxuparam.isfirstpsum = false;
	mxuparam.ubuf_raddr_start = 1;
	mxuparam.ubuf_raddr_num = 14*13;
	mxuparam.ubuf_raddr_step1 = 1;
	mxuparam.ubuf_raddr_step2 = 14;
	mxuparam.ubuf_raddr_step3 = 1;
	mxuparam.ubuf_raddr_end1 = 13-1;
	mxuparam.ubuf_raddr_end2 = 14*13-1;
	mxuparam.psum_start = 0;
	mxuparam.psum_step1 = 1;
	mxuparam.psum_end1 = 12;
	mxuparam.psum_step2 = 14;
	MXU(ubuf,weight,psum,mxuparam);

//3,1
	mxuparam.weight_raddr = 36*7;
	mxuparam.iscalc = true;
	mxuparam.isping = !mxuparam.isping;
	mxuparam.isfirstpsum = false;
	mxuparam.ubuf_raddr_start = 14;
	mxuparam.ubuf_raddr_num = 13*13;
	mxuparam.ubuf_raddr_step1 = 1;
	mxuparam.ubuf_raddr_step2 = 14;
	mxuparam.ubuf_raddr_step3 = 1;
	mxuparam.ubuf_raddr_end1 = 13-1;
	mxuparam.ubuf_raddr_end2 = 13*13-1;
	mxuparam.psum_start = 1;
	mxuparam.psum_step1 = 1;
	mxuparam.psum_end1 = 12;
	mxuparam.psum_step2 = 14;
	MXU(ubuf,weight,psum,mxuparam);

//3,2
	mxuparam.weight_raddr = 36*8;
	mxuparam.iscalc = true;
	mxuparam.isping = !mxuparam.isping;
	mxuparam.isfirstpsum = false;
	mxuparam.ubuf_raddr_start = 14;
	mxuparam.ubuf_raddr_num = 14*13;
	mxuparam.ubuf_raddr_step1 = 1;
	mxuparam.ubuf_raddr_step2 = 14;
	mxuparam.ubuf_raddr_step3 = 1;
	mxuparam.ubuf_raddr_end1 = 14-1;
	mxuparam.ubuf_raddr_end2 = 14*13-1;
	mxuparam.psum_start = 0;
	mxuparam.psum_step1 = 1;
	mxuparam.psum_end1 = 13;
	mxuparam.psum_step2 = 14;
	MXU(ubuf,weight,psum,mxuparam);

//3,3
	mxuparam.isload = false;
	mxuparam.iscalc = true;
	mxuparam.isping = !mxuparam.isping;
	mxuparam.isfirstpsum = false;
	mxuparam.ubuf_raddr_start = 15;
	mxuparam.ubuf_raddr_num = 13*13;
	mxuparam.ubuf_raddr_step1 = 1;
	mxuparam.ubuf_raddr_step2 = 14;
	mxuparam.ubuf_raddr_step3 = 1;
	mxuparam.ubuf_raddr_end1 = 13-1;
	mxuparam.ubuf_raddr_end2 = 13*13-1;
	mxuparam.psum_start = 0;
	mxuparam.psum_step1 = 1;
	mxuparam.psum_end1 = 12;
	mxuparam.psum_step2 = 14;
	MXU(ubuf,weight,psum,mxuparam);

	int err = 0;
	fid = fopen("golden.dat","rb");
	for(int i=0;i<14*14;i++){
		for(int j=0;j<32;j++){
			int a;
			fread(&a,sizeof(int),1,fid);
			if(psum[i][j] != a)
				err++;
		}
	}
	return 0;
}


================================================
FILE: lab1/src/tpu.h
================================================
#include "ap_int.h"

#define MXU_COLNUM 32
#define MXU_ROWNUM 32
#define WEIGHTDTYPE char
#define FEATDTYPE char
#define PSUMDTYPE int


struct MXU_PARAM{
	bool isload;
	bool iscalc;
	bool isping;
	bool isfirstpsum;

	short weight_raddr;
	short ubuf_raddr_start;
	short ubuf_raddr_step1;
	short ubuf_raddr_step2;
	short ubuf_raddr_step3;
	short ubuf_raddr_end1;
	short ubuf_raddr_end2;
	short ubuf_raddr_end3;
	short ubuf_raddr_num;
	short psum_start;
	short psum_step1;
	short psum_end1;
	short psum_step2;
};
struct ACCREL_PARAM{
	bool isrelu;
	short psum_raddr_start;
	short psum_raddr_num;

	short ubuf_waddr_start;
	short ubuf_waddr_step1;
	short ubuf_waddr_step2;
	short ubuf_waddr_step3;
	short ubuf_waddr_end1;
	short ubuf_waddr_end2;
	short ubuf_waddr_end3;
};

void MXU(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weight[512][MXU_COLNUM],
		PSUMDTYPE psum[512][MXU_COLNUM],MXU_PARAM mxuparam);


================================================
FILE: lab2/README.md
================================================

# Relu, Normalization & Pooling 

Basic Module of Tensor Processing Unit

## 1.Env & Build  
 **env** :   
 - Vivado HLS 2018.2 or 2016.3 , MATLAB 2014a(for matlabcode)  
 
 **run** :  
 - step1: `vivado_hls -f run_hls.tcl`
 - step2: lanch vivado HLS and open the project  
 - step3: Run C synthesis, C/RTL cosimulation e.t.c

## 2.Relative Link  

================================================
FILE: lab2/run_hls.tcl
================================================
open_project -reset relu_norm_pool_prj
set_top relu_norm_pool
add_files src/tpu.h
add_files src/relu_norm_pool.cpp
add_files -tb src/tb_pool.cpp

open_solution -reset "solution1"
set_part {xczu7cg-fbvb900-2-i} -tool vivado
create_clock -period 2.5 -name default

csim_design
# Do not perform any other steps
# - The basic project will be opened in the GUI 
exit


================================================
FILE: lab2/src/relu_norm_pool.cpp
================================================

#include "tpu.h"

void relu_norm_pool(PSUMDTYPE psum_buffer[512][MXU_COLNUM],FEATDTYPE unified_buffer[16384][MXU_ROWNUM],
		int norm_coef[MXU_COLNUM],RELPOOL_PARAM param){
#pragma HLS INTERFACE bram port=unified_buffer
#pragma HLS INTERFACE bram port=psum_buffer
#pragma HLS ARRAY_PARTITION variable=norm_coef complete dim=1
#pragma HLS ARRAY_PARTITION variable=unified_buffer complete dim=2
#pragma HLS ARRAY_PARTITION variable=psum_buffer complete dim=2

	PSUMDTYPE psumreg[MXU_COLNUM];
	PSUMDTYPE psumrelu[MXU_COLNUM];
	PSUMDTYPE psumpool[MXU_COLNUM];
	FEATDTYPE res[MXU_COLNUM];
	FEATDTYPE relu[MXU_COLNUM];
	short pool[MXU_COLNUM];
#pragma HLS ARRAY_PARTITION variable=psumreg complete dim=1
#pragma HLS ARRAY_PARTITION variable=psumsht complete dim=1
#pragma HLS ARRAY_PARTITION variable=res complete dim=1
#pragma HLS ARRAY_PARTITION variable=relu complete dim=1
#pragma HLS ARRAY_PARTITION variable=pool complete dim=1

	char pool_kw_cnt = 0;
	char pool_kh_cnt = 0;
	char pool_w_cnt = 0;
	char pool_h_cnt = 0;
	short ubuf_waddr_p1=0;
	short ubuf_waddr_p2=0;
	short ubuf_waddr_p3=0;

	for(short i=0;i<param.pool_cnt;i++){
#pragma HLS PIPELINE

		short raddr = param.psum_raddr_start + (pool_h_cnt+pool_kh_cnt)*param.pool_h_step
				+ (pool_w_cnt+pool_kw_cnt);
		for(int j=0;j<MXU_COLNUM;j++){
			psumreg[j] = psum_buffer[raddr][j];
			if(psumreg[j]<0&&param.isrelu)
				psumrelu[j] = 0;
			else
				psumrelu[j] = psumreg[j];
			if(pool_kw_cnt==0&&pool_kh_cnt==0){
				psumpool[j] = psumrelu[j];
			}
			else if(param.maxpool){
				if(psumrelu[j]>psumpool[j])
					psumpool[j] = psumrelu[j];
			}
			else{
				psumpool[j] = psumpool[j] + psumrelu[j];
			}
		}

		if(pool_kw_cnt==param.pool_kw&&pool_kh_cnt==param.pool_kh){
			short ubuf_waddr = param.ubuf_waddr_start + ubuf_waddr_p1 + ubuf_waddr_p2 + ubuf_waddr_p3;
			if(ubuf_waddr_p1==param.ubuf_waddr_end1){
				if(ubuf_waddr_p2==param.ubuf_waddr_end2){
					ubuf_waddr_p2 = 0;
					ubuf_waddr_p3 = ubuf_waddr_p3 + param.ubuf_waddr_step3;
				}
				else{
		            ubuf_waddr_p2 = ubuf_waddr_p2 +  param.ubuf_waddr_step2;
				}
			}
			else{
				ubuf_waddr_p1 = ubuf_waddr_p1 + param.ubuf_waddr_step1;
			}
			for(int j=0;j<MXU_COLNUM;j++){
				long tmp;
				tmp = long(psumpool[j])*long(norm_coef[j]);
				int tmpcut = tmp>>32;
				ap_int<8> res;
				if(tmpcut>127)
					res = 127;
				else if(tmpcut<-128)
					res = -128;
				else
					res = tmpcut;
				unified_buffer[ubuf_waddr][j] = res;
			}
		}

		if(pool_kw_cnt==param.pool_kw){
			pool_kw_cnt = 0;
			if(pool_kh_cnt==param.pool_kh){
				pool_kh_cnt = 0;
				if(pool_w_cnt==param.pool_w){
					pool_w_cnt = 0;
					pool_h_cnt = pool_h_cnt + param.pool_sh;
				}
				else{
					pool_w_cnt = pool_w_cnt + param.pool_sw;
				}
			}
			else{
				pool_kh_cnt = pool_kh_cnt + 1;
			}
		}
		else{
			pool_kw_cnt = pool_kw_cnt + 1;
		}
	}
}


================================================
FILE: lab2/src/tb_pool.cpp
================================================
#include "tpu.h"
#include "stdio.h"
#include "stdlib.h"

int main(){
	PSUMDTYPE psum_buffer[512][MXU_COLNUM];
	FEATDTYPE unified_buffer[16384][MXU_ROWNUM];
	int norm_coef[MXU_COLNUM];
	RELPOOL_PARAM param;
	for(int i=0;i<14;i++){
		for(int j=0;j<14;j++){
			for(int c=0;c<32;c++){
				psum_buffer[i*14+j][c] = (i*14+j+c)*512;
			}
		}
	}
	for(int c=0;c<32;c++)
		norm_coef[c] = 1<<23;

	// no pooling
	param.isrelu = true;
	param.psum_raddr_start = 0;
	param.maxpool = true;
	param.pool_kw = 0;
	param.pool_kh = 0;
	param.pool_w = 14-1;
	param.pool_sw = 1;
	param.pool_sh = 1;
	param.pool_cnt = 14*14;
	param.pool_h_step = 14;
	param.ubuf_waddr_start = 0;
	param.ubuf_waddr_step1 = 1;
	param.ubuf_waddr_end1 = 14*14-1;
	relu_norm_pool(psum_buffer,unified_buffer,norm_coef,param);

	FEATDTYPE golden[14*14][MXU_ROWNUM];
	for(int i=0;i<14;i++){
		for(int j=0;j<14;j++){
			for (int k=0;k<32;k++){
				int tmp = psum_buffer[i*14+j][k]/512;
				tmp = tmp>127?127:tmp;
				tmp = tmp<-128?-128:tmp;
				golden[i*14+j][k] = tmp;
			}
		}
	}
	int err=0;
	for(int i=0;i<14*14;i++){
		for(int k=0;k<32;k++){
			if(golden[i][k]!=unified_buffer[i][k])
				err ++;
		}
	}

	// max pooling 2,2
	for(int c=0;c<32;c++)
		norm_coef[c] = 1<<23;
	param.isrelu = true;
	param.psum_raddr_start = 0;
	param.maxpool = true;
	param.pool_kw = 1;
	param.pool_kh = 1;
	param.pool_w = 12;
	param.pool_sw = 2;
	param.pool_sh = 2;
	param.pool_cnt = 14*14;
	param.pool_h_step = 14;
	param.ubuf_waddr_start = 0;
	param.ubuf_waddr_step1 = 1;
	param.ubuf_waddr_end1 = 7*7-1;
	relu_norm_pool(psum_buffer,unified_buffer,norm_coef,param);

	for(int i=0;i<7;i++){
		for(int j=0;j<7;j++){
			for (int k=0;k<32;k++){
				int tmp = -128;
				for(int i1=0;i1<2;i1++){
					for(int j1=0;j1<2;j1++){
						if(tmp<psum_buffer[(2*i+i1)*14+2*j+j1][k]/512)
							tmp = psum_buffer[(2*i+i1)*14+2*j+j1][k]/512;
					}
				}
				tmp = tmp>127?127:tmp;
				tmp = tmp<-128?-128:tmp;
				golden[i*7+j][k] = tmp;
			}
		}
	}
	for(int i=0;i<7*7;i++){
		for(int k=0;k<32;k++){
			if(golden[i][k]!=unified_buffer[i][k])
				err ++;
		}
	}

	for(int c=0;c<32;c++)
		norm_coef[c] = 171196;
	// avg pooling 7,7
	param.isrelu = true;
	param.psum_raddr_start = 0;
	param.maxpool = false;
	param.pool_kw = 6;
	param.pool_kh = 6;
	param.pool_w = 7;
	param.pool_sw = 7;
	param.pool_sh = 7;
	param.pool_cnt = 14*14;
	param.pool_h_step = 14;
	param.ubuf_waddr_start = 0;
	param.ubuf_waddr_step1 = 1;
	param.ubuf_waddr_end1 = 14*14-1;
	relu_norm_pool(psum_buffer,unified_buffer,norm_coef,param);

	for(int i=0;i<2;i++){
		for(int j=0;j<2;j++){
			for (int k=0;k<32;k++){
				int tmp = 0;
				for(int i1=0;i1<7;i1++){
					for(int j1=0;j1<7;j1++){
						tmp += psum_buffer[(i*7+i1)*14+7*j+j1][k];
					}
				}
				tmp = (long(tmp)*long(171196))>>32;
				tmp = tmp>127?127:tmp;
				tmp = tmp<-128?-128:tmp;
				golden[i*2+j][k] = tmp;
			}
		}
	}
	for(int i=0;i<2*2;i++){
		for(int k=0;k<32;k++){
			if(golden[i][k]!=unified_buffer[i][k])
				err ++;
		}
	}
	return err;
}


================================================
FILE: lab2/src/tpu.h
================================================
#include "ap_int.h"

#define MXU_COLNUM 32
#define MXU_ROWNUM 32
#define WEIGHTDTYPE char
#define FEATDTYPE char
#define PSUMDTYPE int


struct MXU_PARAM{
	bool isload;
	bool iscalc;
	bool isping;
	bool isfirstpsum;

	short weight_raddr;
	short ubuf_raddr_start;
	short ubuf_raddr_step1;
	short ubuf_raddr_step2;
	short ubuf_raddr_step3;
	short ubuf_raddr_end1;
	short ubuf_raddr_end2;
	short ubuf_raddr_end3;
	short ubuf_raddr_num;
	short psum_start;
	short psum_step1;
	short psum_end1;
	short psum_step2;
};
struct RELPOOL_PARAM{
	bool isrelu;
	short psum_raddr_start;

	bool maxpool; // max pool or average pool
	char pool_kw;
	char pool_kh;
	char pool_w;
	char pool_sw;
	char pool_sh;
	short pool_cnt; // output_num*pool_kw*pool_kh
	short pool_h_step;

	short ubuf_waddr_start;
	short ubuf_waddr_step1;
	short ubuf_waddr_step2;
	short ubuf_waddr_step3;
	short ubuf_waddr_end1;
	short ubuf_waddr_end2;
	short ubuf_waddr_end3;
};

void MXU(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weight[512][MXU_COLNUM],
		PSUMDTYPE psum[512][MXU_COLNUM],MXU_PARAM mxuparam);
void relu_norm_pool(PSUMDTYPE psum_buffer[512][MXU_COLNUM],FEATDTYPE unified_buffer[16384][MXU_ROWNUM],
		int norm_coef[MXU_COLNUM],RELPOOL_PARAM param);


================================================
FILE: src/ctrl.cpp
================================================
#include "tpu.h"

void loadWeight(ap_uint<256> *ddr,WEIGHTDTYPE weight_buffer[512][MXU_COLNUM],
		unsigned offset,short addr, short len, bool enable){
	if(!enable)
		return;
	for(int i=0;i<len;i++){
#pragma HLS PIPELINE
		ap_uint<256> tmp = ddr[offset+i];
		for(int j=0;j<32;j++){
			weight_buffer[addr+i][j] = tmp(j*8+7,j*8);
		}
	}
}

void loadFeature(ap_uint<256> *ddr,FEATDTYPE unified_buffer[512][MXU_ROWNUM],
		unsigned offset,short addr, short len, bool enable){
	if(!enable)
		return;
	for(int i=0;i<len;i++){
#pragma HLS PIPELINE
		ap_uint<256> tmp = ddr[offset+i];
		for(int j=0;j<32;j++){
			unified_buffer[addr+i][j] = tmp(j*8+7,j*8);
		}
	}
}
void storeFeature(ap_uint<256> *ddr,FEATDTYPE unified_buffer[512][MXU_COLNUM],
		unsigned offset,short addr, short len, bool enable){
	if(!enable)
		return;
	for(int i=0;i<len;i++){
#pragma HLS PIPELINE
		ap_uint<256> tmp;
		for(int j=0;j<32;j++){
			tmp(j*8+7,j*8) = unified_buffer[addr+i][j];;
		}
		ddr[offset+i] = tmp;
	}
}
//set instr. set register
//run instr. run process
//eop instr. end of process
//

void instr(ap_uint<64> *ddr,unsigned &offset,ap_int<16> reggroup[96],ap_int<8> &runmode,bool enable){
#pragma HLS INTERFACE m_axi depth=8192 port=ddr
#pragma HLS ARRAY_PARTITION variable=reggroup complete dim=1
	if(!enable)
		return;
	bool isRunInstr = false;
	while(!isRunInstr){
		ap_uint<64> tmp = ddr[offset];
		offset++;
		if(tmp[63]==0){
			switch(tmp(52,48)){
			case( 0):reggroup[ 0] = tmp(15, 0);reggroup[ 1] = tmp(31,16);reggroup[ 2] = tmp(47,32);break;
			case( 1):reggroup[ 3] = tmp(15, 0);reggroup[ 4] = tmp(31,16);reggroup[ 5] = tmp(47,32);break;
			case( 2):reggroup[ 6] = tmp(15, 0);reggroup[ 7] = tmp(31,16);reggroup[ 8] = tmp(47,32);break;
			case( 3):reggroup[ 9] = tmp(15, 0);reggroup[10] = tmp(31,16);reggroup[11] = tmp(47,32);break;
			case( 4):reggroup[12] = tmp(15, 0);reggroup[13] = tmp(31,16);reggroup[14] = tmp(47,32);break;
			case( 5):reggroup[15] = tmp(15, 0);reggroup[16] = tmp(31,16);reggroup[17] = tmp(47,32);break;
			case( 6):reggroup[18] = tmp(15, 0);reggroup[19] = tmp(31,16);reggroup[20] = tmp(47,32);break;
			case( 7):reggroup[21] = tmp(15, 0);reggroup[22] = tmp(31,16);reggroup[23] = tmp(47,32);break;
			case( 8):reggroup[24] = tmp(15, 0);reggroup[25] = tmp(31,16);reggroup[26] = tmp(47,32);break;
			case( 9):reggroup[27] = tmp(15, 0);reggroup[28] = tmp(31,16);reggroup[29] = tmp(47,32);break;
			case(10):reggroup[30] = tmp(15, 0);reggroup[31] = tmp(31,16);reggroup[32] = tmp(47,32);break;
			case(11):reggroup[33] = tmp(15, 0);reggroup[34] = tmp(31,16);reggroup[35] = tmp(47,32);break;
			case(12):reggroup[36] = tmp(15, 0);reggroup[37] = tmp(31,16);reggroup[38] = tmp(47,32);break;
			case(13):reggroup[39] = tmp(15, 0);reggroup[40] = tmp(31,16);reggroup[41] = tmp(47,32);break;
			case(14):reggroup[42] = tmp(15, 0);reggroup[43] = tmp(31,16);reggroup[44] = tmp(47,32);break;
			case(15):reggroup[45] = tmp(15, 0);reggroup[46] = tmp(31,16);reggroup[47] = tmp(47,32);break;
			case(16):reggroup[48] = tmp(15, 0);reggroup[49] = tmp(31,16);reggroup[50] = tmp(47,32);break;
			case(17):reggroup[51] = tmp(15, 0);reggroup[52] = tmp(31,16);reggroup[53] = tmp(47,32);break;
			case(18):reggroup[54] = tmp(15, 0);reggroup[55] = tmp(31,16);reggroup[56] = tmp(47,32);break;
			case(19):reggroup[57] = tmp(15, 0);reggroup[58] = tmp(31,16);reggroup[59] = tmp(47,32);break;
			case(20):reggroup[60] = tmp(15, 0);reggroup[61] = tmp(31,16);reggroup[62] = tmp(47,32);break;
			case(21):reggroup[63] = tmp(15, 0);reggroup[64] = tmp(31,16);reggroup[65] = tmp(47,32);break;
			case(22):reggroup[66] = tmp(15, 0);reggroup[67] = tmp(31,16);reggroup[68] = tmp(47,32);break;
			case(23):reggroup[69] = tmp(15, 0);reggroup[70] = tmp(31,16);reggroup[71] = tmp(47,32);break;
			case(24):reggroup[72] = tmp(15, 0);reggroup[73] = tmp(31,16);reggroup[74] = tmp(47,32);break;
			case(25):reggroup[75] = tmp(15, 0);reggroup[76] = tmp(31,16);reggroup[77] = tmp(47,32);break;
			case(26):reggroup[78] = tmp(15, 0);reggroup[79] = tmp(31,16);reggroup[80] = tmp(47,32);break;
			case(27):reggroup[81] = tmp(15, 0);reggroup[82] = tmp(31,16);reggroup[83] = tmp(47,32);break;
			case(28):reggroup[84] = tmp(15, 0);reggroup[85] = tmp(31,16);reggroup[86] = tmp(47,32);break;
			case(29):reggroup[87] = tmp(15, 0);reggroup[88] = tmp(31,16);reggroup[89] = tmp(47,32);break;
			case(30):reggroup[90] = tmp(15, 0);reggroup[91] = tmp(31,16);reggroup[92] = tmp(47,32);break;
			case(31):reggroup[93] = tmp(15, 0);reggroup[94] = tmp(31,16);reggroup[95] = tmp(47,32);break;
			}
		}
		else{
			runmode = tmp(55,48);
			isRunInstr = true;
		}
	}
}


void config(ap_int<16> reggroup[96],MXU_PARAM &mxuparam,RELPOOL_PARAM &poolparam,LDST_PARAM &lsdtparam, ap_int<32> norm_coef[32]){
#pragma HLS INLINE
    mxuparam.isload          = reggroup[ 0].range(0,0);
	mxuparam.iscalc          = reggroup[ 0].range(1,1);
	mxuparam.isping          = reggroup[ 0].range(2,2);
	mxuparam.isfirstpsum     = reggroup[ 0].range(3,3);
	mxuparam.weight_raddr    = reggroup[ 1];
	mxuparam.ubuf_raddr_start= reggroup[ 2];
	mxuparam.ubuf_raddr_step1= reggroup[ 3];
	mxuparam.ubuf_raddr_step2= reggroup[ 4];
	mxuparam.ubuf_raddr_step3= reggroup[ 5];
	mxuparam.ubuf_raddr_end1 = reggroup[ 6];
	mxuparam.ubuf_raddr_end2 = reggroup[ 7];
	mxuparam.ubuf_raddr_end3 = reggroup[ 8];
	mxuparam.ubuf_raddr_num  = reggroup[ 9];
	mxuparam.psum_start      = reggroup[10];
	mxuparam.psum_step1      = reggroup[11];
	mxuparam.psum_end1       = reggroup[12];
	mxuparam.psum_step2      = reggroup[13];

	poolparam.isrelu    = reggroup[14].range( 0,0);
	poolparam.maxpool   = reggroup[14].range( 1,1);
	poolparam.avg_shift = reggroup[14].range( 7,4);
	poolparam.pool_kw   = reggroup[14].range(15,8);
	poolparam.pool_kh   = reggroup[15].range( 7,0);
	poolparam.pool_w    = reggroup[15].range(15,8);
	poolparam.pool_sw   = reggroup[16].range( 7,0);
	poolparam.pool_sh   = reggroup[16].range(15,8);
	poolparam.psum_raddr_start = reggroup[17];
	poolparam.pool_cnt         = reggroup[18];
	poolparam.pool_h_step      = reggroup[19];
	poolparam.avg_val          = reggroup[20];
	poolparam.ubuf_waddr_start = reggroup[21];
	poolparam.ubuf_waddr_step1 = reggroup[22];
	poolparam.ubuf_waddr_step2 = reggroup[23];
	poolparam.ubuf_waddr_step3 = reggroup[24];
	poolparam.ubuf_waddr_end1  = reggroup[25];
	poolparam.ubuf_waddr_end2  = reggroup[26];
	poolparam.ubuf_waddr_end3  = reggroup[27];

	lsdtparam.weight_addr = reggroup[28];
	lsdtparam.weight_ldlen = reggroup[29];
	ap_uint<32> tmp = (reggroup[31],reggroup[30]);
	lsdtparam.weight_offset = tmp;
	for(int i=0;i<32;i++){
#pragma HLS UNROLL
		norm_coef[i] = (reggroup[33+2*i],reggroup[32+2*i]);
	}
	return;
}


================================================
FILE: src/mxu.cpp
================================================

#include "tpu.h"

void SetWeight(WEIGHTDTYPE weight[512][MXU_COLNUM],WEIGHTDTYPE weightreg[MXU_ROWNUM+4][MXU_COLNUM],
		short weight_raddr, bool enable){
	if(!enable)
		return;
	for(short i=weight_raddr;i<weight_raddr+4+MXU_ROWNUM;i++){
#pragma HLS PIPELINE
		for(int j=0;j<MXU_ROWNUM+4;j++){
			for(int k=0;k<MXU_COLNUM;k++){
				if(j!=MXU_ROWNUM+3)
					weightreg[j][k] = weightreg[j+1][k];
				else
					weightreg[j][k] = weight[i][k];
			}
		}
	}
}

void MacArray(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weightreg[4+MXU_ROWNUM][MXU_COLNUM],
		PSUMDTYPE psum[512][MXU_COLNUM],MXU_PARAM mxuparam,bool enable){
	if(!enable)
		return;
	FEATDTYPE featreg[MXU_ROWNUM][MXU_COLNUM+MXU_ROWNUM-1];
	PSUMDTYPE psumreg[MXU_ROWNUM][MXU_COLNUM];
    short ubuf_raddr_p1=0;
    short ubuf_raddr_p2=0;
    short ubuf_raddr_p3=0;
    short psum_addr_p1[MXU_COLNUM];
    short psum_addr_p2[MXU_COLNUM];
    for(int i=0;i<MXU_COLNUM;i++){
#pragma HLS UNROLL
    	psum_addr_p1[i] = 0;
    	psum_addr_p2[i] = 0;
    }
	for(short i=0;i<mxuparam.ubuf_raddr_num+MXU_ROWNUM+MXU_COLNUM-2;i++){
#pragma HLS PIPELINE
    short ubuf_raddr = mxuparam.ubuf_raddr_start + ubuf_raddr_p1 + ubuf_raddr_p2 + ubuf_raddr_p3;
    if(ubuf_raddr_p1==mxuparam.ubuf_raddr_end1){
        ubuf_raddr_p1 = 0;
        if(ubuf_raddr_p2==mxuparam.ubuf_raddr_end2){
            ubuf_raddr_p2 = 0;
            ubuf_raddr_p3 = ubuf_raddr_p3 +  mxuparam.ubuf_raddr_step3;
        }
        else{
            ubuf_raddr_p2 = ubuf_raddr_p2 +  mxuparam.ubuf_raddr_step2;
        }
    }
    else{
        ubuf_raddr_p1 = ubuf_raddr_p1 + mxuparam.ubuf_raddr_step1;
    }

		for(int j=0;j<MXU_ROWNUM;j++){
			for(int k=MXU_ROWNUM+MXU_COLNUM-2;k>=0;k--){
				if(k>0)
					featreg[j][k] = featreg[j][k-1];
				else
					if(i<mxuparam.ubuf_raddr_num)
						featreg[j][k] = ubuf[ubuf_raddr][j];
					else
						featreg[j][k] = 0;
			}
		}

		for(int j=MXU_ROWNUM-1;j>=0;j--){
			for(int k=0;k<MXU_COLNUM;k++){
				ap_int<32> biasreg;
				biasreg(31,24)=weightreg[MXU_ROWNUM+0][k];
				biasreg(23,16)=weightreg[MXU_ROWNUM+1][k];
				biasreg(15, 8)=weightreg[MXU_ROWNUM+2][k];
				biasreg( 7, 0)=weightreg[MXU_ROWNUM+3][k];
				if(j==0)
					psumreg[j][k] = featreg[j][k+j]*weightreg[j][k] + biasreg;
				else
					psumreg[j][k] = featreg[j][k+j]*weightreg[j][k] + psumreg[j-1][k];
			}
		}
#pragma HLS DEPENDENCE variable=psum inter false
#pragma HLS DEPENDENCE variable=psum intra false
		for(int j=0;j<MXU_COLNUM;j++){
			if(i>=j+MXU_ROWNUM-1&&i<mxuparam.ubuf_raddr_num+j+MXU_ROWNUM-1){
				short psum_raddr = mxuparam.psum_start%512 + psum_addr_p1[j] + psum_addr_p2[j];
				if(psum_addr_p1[j]==mxuparam.psum_end1){
					psum_addr_p1[j] = 0;
					psum_addr_p2[j] = psum_addr_p2[j] + mxuparam.psum_step2;
				}
				else{
					psum_addr_p1[j] = psum_addr_p1[j] + mxuparam.psum_step1;
				}
				if(mxuparam.isfirstpsum)
					psum[psum_raddr][j] = psumreg[MXU_ROWNUM-1][j];
				else
					psum[psum_raddr][j] = psumreg[MXU_ROWNUM-1][j] + psum[psum_raddr][j];
			}
		}
	}
}

void MXU(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weight[512][MXU_COLNUM],
		PSUMDTYPE psum[512][MXU_COLNUM],MXU_PARAM mxuparam, bool enable){
//#pragma HLS INTERFACE bram port=ubuf
//#pragma HLS INTERFACE bram port=weight
//#pragma HLS INTERFACE bram port=psum
//#pragma HLS DATA_PACK variable=mxuparam
//#pragma HLS ARRAY_PARTITION variable=ubuf complete dim=2
//#pragma HLS ARRAY_PARTITION variable=weight complete dim=2
//#pragma HLS ARRAY_PARTITION variable=psum complete dim=2

	static WEIGHTDTYPE weightreg1[4+MXU_ROWNUM][MXU_COLNUM];
	static WEIGHTDTYPE weightreg2[4+MXU_ROWNUM][MXU_COLNUM];
#pragma HLS ARRAY_PARTITION variable=weightreg1 complete dim=0
#pragma HLS ARRAY_PARTITION variable=weightreg2 complete dim=0

	if(!enable)
		return;
	if(mxuparam.isping){
		SetWeight(weight,weightreg1,mxuparam.weight_raddr,mxuparam.isload);
		MacArray(ubuf,weightreg2,psum,mxuparam,mxuparam.iscalc);
	}
	else{
		SetWeight(weight,weightreg2,mxuparam.weight_raddr,mxuparam.isload);
		MacArray(ubuf,weightreg1,psum,mxuparam,mxuparam.iscalc);
	}
}


================================================
FILE: src/norm_relu_pool.cpp
================================================

#include "tpu.h"

void relu_norm_pool(PSUMDTYPE psum_buffer[512][MXU_COLNUM],FEATDTYPE unified_buffer[16384][MXU_ROWNUM],
		ap_int<32> norm_coef[MXU_COLNUM],RELPOOL_PARAM param, bool enable){
//#pragma HLS INTERFACE bram port=unified_buffer
//#pragma HLS INTERFACE bram port=psum_buffer
//#pragma HLS ARRAY_PARTITION variable=norm_coef complete dim=1
//#pragma HLS ARRAY_PARTITION variable=unified_buffer complete dim=2
//#pragma HLS ARRAY_PARTITION variable=psum_buffer complete dim=2

	PSUMDTYPE psumreg[MXU_COLNUM];
	PSUMDTYPE psumrelu[MXU_COLNUM];
	PSUMDTYPE psumpool[MXU_COLNUM];
	FEATDTYPE relu[MXU_COLNUM];
	short pool[MXU_COLNUM];
#pragma HLS ARRAY_PARTITION variable=psumreg complete dim=1
#pragma HLS ARRAY_PARTITION variable=psumsht complete dim=1
#pragma HLS ARRAY_PARTITION variable=relu complete dim=1
#pragma HLS ARRAY_PARTITION variable=pool complete dim=1

	char pool_kw_cnt = 0;
	char pool_kh_cnt = 0;
	char pool_w_cnt = 0;
	char pool_h_cnt = 0;
	short ubuf_waddr_p1=0;
	short ubuf_waddr_p2=0;
	short ubuf_waddr_p3=0;

	if(!enable)
		return;
	for(short i=0;i<param.pool_cnt;i++){
#pragma HLS PIPELINE

		short raddr = param.psum_raddr_start%512 + (pool_h_cnt+pool_kh_cnt)*param.pool_h_step
				+ (pool_w_cnt+pool_kw_cnt);
		for(int j=0;j<MXU_COLNUM;j++){
			psumreg[j] = psum_buffer[raddr][j];
			if(psumreg[j]<0&&param.isrelu)
				psumrelu[j] = 0;
			else
				psumrelu[j] = psumreg[j];
			if(pool_kw_cnt==0&&pool_kh_cnt==0){
				psumpool[j] = psumrelu[j];
			}
			else if(param.maxpool){
				if(psumrelu[j]>psumpool[j])
					psumpool[j] = psumrelu[j];
			}
			else{
				psumpool[j] = psumpool[j] + psumrelu[j];
			}
		}

		if(pool_kw_cnt==param.pool_kw&&pool_kh_cnt==param.pool_kh){
			short ubuf_waddr = param.ubuf_waddr_start + ubuf_waddr_p1 + ubuf_waddr_p2 + ubuf_waddr_p3;
			if(ubuf_waddr_p1==param.ubuf_waddr_end1){
				if(ubuf_waddr_p2==param.ubuf_waddr_end2){
					ubuf_waddr_p2 = 0;
					ubuf_waddr_p3 = ubuf_waddr_p3 + param.ubuf_waddr_step3;
				}
				else{
		            ubuf_waddr_p2 = ubuf_waddr_p2 +  param.ubuf_waddr_step2;
				}
			}
			else{
				ubuf_waddr_p1 = ubuf_waddr_p1 + param.ubuf_waddr_step1;
			}
			for(int j=0;j<MXU_COLNUM;j++){
				long tmp;
				tmp = long(psumpool[j])*long(norm_coef[j]);
				int tmpcut = tmp>>32;
				ap_int<8> res;
				if(tmpcut>127)
					res = 127;
				else if(tmpcut<-128)
					res = -128;
				else
					res = tmpcut;
				unified_buffer[ubuf_waddr][j] = res;
			}
		}

		if(pool_kw_cnt==param.pool_kw){
			pool_kw_cnt = 0;
			if(pool_kh_cnt==param.pool_kh){
				pool_kh_cnt = 0;
				if(pool_w_cnt==param.pool_w){
					pool_w_cnt = 0;
					pool_h_cnt = pool_h_cnt + param.pool_sh;
				}
				else{
					pool_w_cnt = pool_w_cnt + param.pool_sw;
				}
			}
			else{
				pool_kh_cnt = pool_kh_cnt + 1;
			}
		}
		else{
			pool_kw_cnt = pool_kw_cnt + 1;
		}
	}
}


================================================
FILE: src/tb_tpu.cpp
================================================
#include "tpu.h"
#include "stdio.h"
int main(){
	ap_uint<256> *ddr;
	ap_uint<64> *ddr_instr;
	ddr = (ap_uint<256> *)malloc(sizeof(ap_uint<256>)*(16384));
	//512*25+72*25+72+512
	ddr_instr = (ap_uint<64> *)malloc(sizeof(ap_uint<64>)*3300);
	FILE *fid;
	fid = fopen("mlp_img.bin","rb");
	fread(ddr,32,25*512,fid);
	fclose(fid);
	fid = fopen("mlp_param.bin","rb");
	fread(ddr+512*25,32,25*72+72,fid);
	fclose(fid);
	fid = fopen("mlp_instr.bin","rb");
	ap_uint<64> *ddr_instr_r = ddr_instr;
	int cnt = 0;
	while(1==1){
		fread(ddr_instr_r,8,1,fid);
		ap_uint<64> tmp = *ddr_instr_r;
		if(tmp.range(55,55)==1)
			break;
		ddr_instr_r++;
		cnt++;
	}
	fclose(fid);
	tpu(ddr,ddr_instr);
	fid = fopen("golden_result.txt","r");
	int err = 0;
	for(int i=0;i<512;i++){
		ap_uint<256> val = ddr[512*25+72*25+72+i];
		int maxcof = -255;
		int idx = -1;
		int ref = -1;
		for(int j=0;j<16;j++){
			int cof = val(j*8+7,j*8);
			if(cof>127)
				cof = cof-256;
			if(cof>maxcof){
				maxcof = cof;
				idx = j;
			}
		}
		fscanf(fid,"%d",&ref);
		if(idx!=ref)
			err++;
	}
	return err;
}


================================================
FILE: src/tpu.cpp
================================================

#include "tpu.h"

void ex_module(FEATDTYPE unified_buffer[16384][MXU_ROWNUM],WEIGHTDTYPE weight_buffer[512][MXU_COLNUM],
		ap_int<32> norm_coef[MXU_COLNUM],MXU_PARAM mxuparam,RELPOOL_PARAM poolparam,
		bool is_MXU,bool is_relu_norm_pool){
#pragma HLS INLINE off
#pragma HLS DEPENDENCE variable=unified_buffer inter false
#pragma HLS DEPENDENCE variable=unified_buffer intra false
	static PSUMDTYPE psum_buffer1[512][MXU_COLNUM];
	static PSUMDTYPE psum_buffer2[512][MXU_COLNUM];
#pragma HLS ARRAY_PARTITION variable=psum_buffer1 complete dim=2
#pragma HLS ARRAY_PARTITION variable=psum_buffer2 complete dim=2
	if((is_MXU&&mxuparam.psum_start<512) || (is_relu_norm_pool&&poolparam.psum_raddr_start>=512) )
	{
		MXU(unified_buffer,weight_buffer,psum_buffer1,mxuparam,is_MXU);
		relu_norm_pool(psum_buffer2,unified_buffer,norm_coef,poolparam,is_relu_norm_pool);
	}
	else{
		MXU(unified_buffer,weight_buffer,psum_buffer2,mxuparam,is_MXU);
		relu_norm_pool(psum_buffer1,unified_buffer,norm_coef,poolparam,is_relu_norm_pool);
	}

}

void tpu(ap_uint<256> *ddr,ap_uint<64> *ddr_instr){
#pragma HLS INTERFACE m_axi depth=16384 port=ddr
#pragma HLS INTERFACE m_axi depth=3300 port=ddr_instr

	static FEATDTYPE unified_buffer[16384][MXU_ROWNUM];
#pragma HLS RESOURCE variable=unified_buffer core=RAM_S2P_BRAM
	static WEIGHTDTYPE weight_buffer[512][MXU_COLNUM];
#pragma HLS RESOURCE variable=weight_buffer core=RAM_S2P_BRAM
	static ap_int<32> norm_coef[MXU_COLNUM];
#pragma HLS ARRAY_PARTITION variable=unified_buffer complete dim=2
#pragma HLS ARRAY_PARTITION variable=weight_buffer complete dim=2
#pragma HLS ARRAY_PARTITION variable=norm_coef complete dim=0

	ap_int<16> reggroup[96];
#pragma HLS ARRAY_PARTITION variable=reggroup complete dim=0
	MXU_PARAM mxuparam;
	RELPOOL_PARAM poolparam;
	LDST_PARAM ldstparam;
	unsigned instr_offset = 0;
	bool is_load_weight;
	bool is_MXU;
	bool is_relu_norm_pool;
	// load img
	loadFeature(ddr,unified_buffer, 0,0, 512*25, true);
	bool eop = false;
	ap_int<8> runmode = 0;	//0 nop, bit[0] loadweight;bit[1] mxu; bit[2] pool; bit[7] eop;
	instr(ddr_instr,instr_offset,reggroup,runmode,true);
	while(runmode[7]==0)
	{
#pragma HLS DEPENDENCE variable=unified_buffer inter false
#pragma HLS DEPENDENCE variable=unified_buffer intra false
#pragma HLS DEPENDENCE variable=weight_buffer inter false
#pragma HLS DEPENDENCE variable=weight_buffer intra false

		config(reggroup,mxuparam,poolparam,ldstparam,norm_coef);
		is_load_weight = runmode[0]==1;
		is_MXU = runmode[1]==1;
		is_relu_norm_pool = runmode[2]==1;
		instr(ddr_instr,instr_offset,reggroup,runmode,true);
		loadWeight(ddr,weight_buffer,ldstparam.weight_offset,ldstparam.weight_addr,
					ldstparam.weight_ldlen,is_load_weight);
		ex_module(unified_buffer,weight_buffer,norm_coef,mxuparam,poolparam,is_MXU,is_relu_norm_pool);
	}

	storeFeature(ddr,unified_buffer, 512*25+72*25+72,14000, 512, true);
}


================================================
FILE: src/tpu.h
================================================
#include "ap_int.h"

#define MXU_COLNUM 32
#define MXU_ROWNUM 32
#define WEIGHTDTYPE char
#define FEATDTYPE char
#define PSUMDTYPE ap_int<32>


struct MXU_PARAM{
	bool isload;
	bool iscalc;
	bool isping;
	bool isfirstpsum;

	short weight_raddr;
	short ubuf_raddr_start;
	short ubuf_raddr_step1;
	short ubuf_raddr_step2;
	short ubuf_raddr_step3;
	short ubuf_raddr_end1;
	short ubuf_raddr_end2;
	short ubuf_raddr_end3;
	short ubuf_raddr_num;
	short psum_start;
	short psum_step1;
	short psum_end1;
	short psum_step2;
};
struct RELPOOL_PARAM{
	bool isrelu;
	short psum_raddr_start;

	bool maxpool; // max pool or average pool
	char pool_kw;
	char pool_kh;
	char pool_w;
	char pool_sw;
	char pool_sh;
	short pool_cnt; // output_num*pool_kw*pool_kh
	short pool_h_step;

	short avg_val;
	ap_uint<4> avg_shift;

	short ubuf_waddr_start;
	short ubuf_waddr_step1;
	short ubuf_waddr_step2;
	short ubuf_waddr_step3;
	short ubuf_waddr_end1;
	short ubuf_waddr_end2;
	short ubuf_waddr_end3;
};

struct LDST_PARAM{
	unsigned weight_offset;
	short weight_addr;
	short weight_ldlen;
};

void MXU(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weight[512][MXU_COLNUM],
		PSUMDTYPE psum[512][MXU_COLNUM],MXU_PARAM mxuparam, bool enable);
void relu_norm_pool(PSUMDTYPE psum_buffer[512][MXU_COLNUM],FEATDTYPE unified_buffer[16384][MXU_ROWNUM],
		ap_int<32> norm_coef[MXU_COLNUM],RELPOOL_PARAM param, bool enable);
void loadWeight(ap_uint<256> *ddr,WEIGHTDTYPE weight_buffer[512][MXU_COLNUM],
		unsigned offset,short addr, short len, bool enable);
void loadFeature(ap_uint<256> *ddr,FEATDTYPE unified_buffer[512][MXU_ROWNUM],
		unsigned offset,short addr, short len, bool enable);
void storeFeature(ap_uint<256> *ddr,FEATDTYPE unified_buffer[512][MXU_COLNUM],
		unsigned offset,short addr, short len, bool enable);
void instr(ap_uint<64> *ddr,unsigned &offset,ap_int<16> reggroup[96],ap_int<8> &runmode,bool enable);
void config(ap_int<16> reggroup[96],MXU_PARAM &mxuparam,RELPOOL_PARAM &poolparam,
		LDST_PARAM &lsdtparam, ap_int<32> norm_coef[32]);
void tpu(ap_uint<256> *ddr,ap_uint<64> *ddr_instr);
Download .txt
gitextract_zygrtfz8/

├── README.md
├── data/
│   └── golden_result.txt
├── lab1/
│   ├── README.md
│   ├── refcode/
│   │   ├── conv3d.m
│   │   ├── convmxu.m
│   │   └── saveparam.m
│   ├── run_hls.tcl
│   └── src/
│       ├── mxu.cpp
│       ├── tb_mxu.cpp
│       └── tpu.h
├── lab2/
│   ├── README.md
│   ├── run_hls.tcl
│   └── src/
│       ├── relu_norm_pool.cpp
│       ├── tb_pool.cpp
│       └── tpu.h
└── src/
    ├── ctrl.cpp
    ├── mxu.cpp
    ├── norm_relu_pool.cpp
    ├── tb_tpu.cpp
    ├── tpu.cpp
    └── tpu.h
Download .txt
SYMBOL INDEX (25 symbols across 12 files)

FILE: lab1/src/mxu.cpp
  function SetWeight (line 4) | void SetWeight(WEIGHTDTYPE weight[512][MXU_COLNUM],WEIGHTDTYPE weightreg...
  function MacArray (line 21) | void MacArray(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weightreg[4+...
  function MXU (line 100) | void MXU(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weight[512][MXU_C...

FILE: lab1/src/tb_mxu.cpp
  function main (line 5) | int main(){

FILE: lab1/src/tpu.h
  type MXU_PARAM (line 10) | struct MXU_PARAM{
  type ACCREL_PARAM (line 30) | struct ACCREL_PARAM{

FILE: lab2/src/relu_norm_pool.cpp
  function relu_norm_pool (line 4) | void relu_norm_pool(PSUMDTYPE psum_buffer[512][MXU_COLNUM],FEATDTYPE uni...

FILE: lab2/src/tb_pool.cpp
  function main (line 5) | int main(){

FILE: lab2/src/tpu.h
  type MXU_PARAM (line 10) | struct MXU_PARAM{
  type RELPOOL_PARAM (line 30) | struct RELPOOL_PARAM{

FILE: src/ctrl.cpp
  function loadWeight (line 3) | void loadWeight(ap_uint<256> *ddr,WEIGHTDTYPE weight_buffer[512][MXU_COL...
  function loadFeature (line 16) | void loadFeature(ap_uint<256> *ddr,FEATDTYPE unified_buffer[512][MXU_ROW...
  function storeFeature (line 28) | void storeFeature(ap_uint<256> *ddr,FEATDTYPE unified_buffer[512][MXU_CO...
  function instr (line 46) | void instr(ap_uint<64> *ddr,unsigned &offset,ap_int<16> reggroup[96],ap_...
  function config (line 99) | void config(ap_int<16> reggroup[96],MXU_PARAM &mxuparam,RELPOOL_PARAM &p...

FILE: src/mxu.cpp
  function SetWeight (line 4) | void SetWeight(WEIGHTDTYPE weight[512][MXU_COLNUM],WEIGHTDTYPE weightreg...
  function MacArray (line 21) | void MacArray(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weightreg[4+...
  function MXU (line 100) | void MXU(FEATDTYPE ubuf[16384][MXU_ROWNUM],WEIGHTDTYPE weight[512][MXU_C...

FILE: src/norm_relu_pool.cpp
  function relu_norm_pool (line 4) | void relu_norm_pool(PSUMDTYPE psum_buffer[512][MXU_COLNUM],FEATDTYPE uni...

FILE: src/tb_tpu.cpp
  function main (line 3) | int main(){

FILE: src/tpu.cpp
  function ex_module (line 4) | void ex_module(FEATDTYPE unified_buffer[16384][MXU_ROWNUM],WEIGHTDTYPE w...
  function tpu (line 26) | void tpu(ap_uint<256> *ddr,ap_uint<64> *ddr_instr){

FILE: src/tpu.h
  type MXU_PARAM (line 10) | struct MXU_PARAM{
  type RELPOOL_PARAM (line 30) | struct RELPOOL_PARAM{
  type LDST_PARAM (line 55) | struct LDST_PARAM{
Condensed preview — 21 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (51K chars).
[
  {
    "path": "README.md",
    "chars": 3260,
    "preview": "# SimpleTPU\n\nA Tensor Processing Unit is designed to accelerate the matrix multiplication, especially for Multilayer per"
  },
  {
    "path": "data/golden_result.txt",
    "chars": 1024,
    "preview": "7\n2\n1\n0\n4\n1\n4\n9\n6\n9\n0\n6\n9\n0\n1\n5\n9\n7\n6\n4\n9\n6\n6\n5\n4\n0\n7\n4\n0\n1\n3\n1\n3\n6\n7\n2\n7\n1\n2\n1\n1\n7\n4\n2\n6\n5\n1\n2\n4\n4\n6\n3\n5\n5\n6\n0\n4\n1\n9\n5\n"
  },
  {
    "path": "lab1/README.md",
    "chars": 393,
    "preview": "# Systolic Array \n\nSystolic Array implement in FPGA using Xilinx HLS.\n\n## 1.Env & Build  \n **env** :   \n - Vivado HLS 20"
  },
  {
    "path": "lab1/refcode/conv3d.m",
    "chars": 1507,
    "preview": "\nrng(0);\nfeature = randi([-128,127],14,14,32);\nweight = randi([-128,127],32,3,3,32);\nbias = randi([-1024,1023],1,32);\nou"
  },
  {
    "path": "lab1/refcode/convmxu.m",
    "chars": 531,
    "preview": "function [out1] = convmxu(weight,feature,bias,index1,index2)\n%UNTITLED3 Summary of this function goes here\n%   Detailed "
  },
  {
    "path": "lab1/refcode/saveparam.m",
    "chars": 1019,
    "preview": "function [] = saveparam(feature,weight,bias)\n%UNTITLED2 Summary of this function goes here\n%   Detailed explanation goes"
  },
  {
    "path": "lab1/run_hls.tcl",
    "chars": 423,
    "preview": "open_project -reset mxu_conv_prj\nset_top MXU\nadd_files src/tpu.h\nadd_files src/mxu.cpp\nadd_files -tb data/feature.dat\nad"
  },
  {
    "path": "lab1/src/mxu.cpp",
    "chars": 4012,
    "preview": "\n#include \"tpu.h\"\n\nvoid SetWeight(WEIGHTDTYPE weight[512][MXU_COLNUM],WEIGHTDTYPE weightreg[MXU_ROWNUM+4][MXU_COLNUM],\n\t"
  },
  {
    "path": "lab1/src/tb_mxu.cpp",
    "chars": 5704,
    "preview": "#include \"tpu.h\"\n#include \"stdio.h\"\n#include \"stdlib.h\"\n\nint main(){\n\tFEATDTYPE ubuf[16384][MXU_ROWNUM];\n\tWEIGHTDTYPE we"
  },
  {
    "path": "lab1/src/tpu.h",
    "chars": 906,
    "preview": "#include \"ap_int.h\"\n\n#define MXU_COLNUM 32\n#define MXU_ROWNUM 32\n#define WEIGHTDTYPE char\n#define FEATDTYPE char\n#define"
  },
  {
    "path": "lab2/README.md",
    "chars": 348,
    "preview": "\n# Relu, Normalization & Pooling \n\nBasic Module of Tensor Processing Unit\n\n## 1.Env & Build  \n **env** :   \n - Vivado HL"
  },
  {
    "path": "lab2/run_hls.tcl",
    "chars": 362,
    "preview": "open_project -reset relu_norm_pool_prj\nset_top relu_norm_pool\nadd_files src/tpu.h\nadd_files src/relu_norm_pool.cpp\nadd_f"
  },
  {
    "path": "lab2/src/relu_norm_pool.cpp",
    "chars": 2857,
    "preview": "\n#include \"tpu.h\"\n\nvoid relu_norm_pool(PSUMDTYPE psum_buffer[512][MXU_COLNUM],FEATDTYPE unified_buffer[16384][MXU_ROWNUM"
  },
  {
    "path": "lab2/src/tb_pool.cpp",
    "chars": 3003,
    "preview": "#include \"tpu.h\"\n#include \"stdio.h\"\n#include \"stdlib.h\"\n\nint main(){\n\tPSUMDTYPE psum_buffer[512][MXU_COLNUM];\n\tFEATDTYPE"
  },
  {
    "path": "lab2/src/tpu.h",
    "chars": 1223,
    "preview": "#include \"ap_int.h\"\n\n#define MXU_COLNUM 32\n#define MXU_ROWNUM 32\n#define WEIGHTDTYPE char\n#define FEATDTYPE char\n#define"
  },
  {
    "path": "src/ctrl.cpp",
    "chars": 6672,
    "preview": "#include \"tpu.h\"\n\nvoid loadWeight(ap_uint<256> *ddr,WEIGHTDTYPE weight_buffer[512][MXU_COLNUM],\n\t\tunsigned offset,short "
  },
  {
    "path": "src/mxu.cpp",
    "chars": 4069,
    "preview": "\n#include \"tpu.h\"\n\nvoid SetWeight(WEIGHTDTYPE weight[512][MXU_COLNUM],WEIGHTDTYPE weightreg[MXU_ROWNUM+4][MXU_COLNUM],\n\t"
  },
  {
    "path": "src/norm_relu_pool.cpp",
    "chars": 2830,
    "preview": "\n#include \"tpu.h\"\n\nvoid relu_norm_pool(PSUMDTYPE psum_buffer[512][MXU_COLNUM],FEATDTYPE unified_buffer[16384][MXU_ROWNUM"
  },
  {
    "path": "src/tb_tpu.cpp",
    "chars": 1070,
    "preview": "#include \"tpu.h\"\n#include \"stdio.h\"\nint main(){\n\tap_uint<256> *ddr;\n\tap_uint<64> *ddr_instr;\n\tddr = (ap_uint<256> *)mall"
  },
  {
    "path": "src/tpu.cpp",
    "chars": 2890,
    "preview": "\n#include \"tpu.h\"\n\nvoid ex_module(FEATDTYPE unified_buffer[16384][MXU_ROWNUM],WEIGHTDTYPE weight_buffer[512][MXU_COLNUM]"
  },
  {
    "path": "src/tpu.h",
    "chars": 2080,
    "preview": "#include \"ap_int.h\"\n\n#define MXU_COLNUM 32\n#define MXU_ROWNUM 32\n#define WEIGHTDTYPE char\n#define FEATDTYPE char\n#define"
  }
]

About this extraction

This page contains the full source code of the cea-wind/SimpleTPU GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 21 files (45.1 KB), approximately 17.8k tokens, and a symbol index with 25 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Copied to clipboard!