Repository: 0xD4rky/Vision-Transformers
Branch: main
Commit: a4a489cfad34
Files: 13
Total size: 41.6 KB
Directory structure:
gitextract_fego_pme/
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
└── vit/
├── readme.md
├── src/
│ ├── ViT.py
│ ├── base.py
│ ├── data.py
│ ├── requirements.txt
│ ├── trainer.py
│ ├── utils.py
│ └── vit_with_lora.py
└── visualize/
└── vis.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2024 Ishaan
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Zero-to-Hero: ViT🚀
I have tried to cover all the bases for understanding and implementing Vision Transformers (ViT) and their evolution into Video Vision Transformers (ViViT).
The main focus is on modeling spatio-temporal relations with visual transformers.

## 1. Vision Transformer (ViT) Fundamentals
### Surveys and Overviews:
* [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169)
* [A Survey of Visual Transformers](https://arxiv.org/abs/2111.06091)
### Key Papers:
* An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: [Paper](https://arxiv.org/abs/2010.11929) | [Code](https://github.com/google-research/vision_transformer)
* Training data-efficient image transformers & distillation through attention (DeiT): [Paper](https://arxiv.org/abs/2012.12877) | [Code](https://github.com/facebookresearch/deit)
### Concepts and Tutorials:
* "Attention Is All You Need": [Paper](https://arxiv.org/abs/1706.03762)
* "The Illustrated Transformers": [Blog Post](http://jalammar.github.io/illustrated-transformer/)
* "Vision Transformer Explained" [Blog Post](https://theaisummer.com/vision-transformer/)
## 2. Convolutional ViT and Hybrid Models:
* CvT: Introducing Convolutions to Vision Transformers: [Paper](https://arxiv.org/abs/2103.15808) | [Code](https://github.com/microsoft/CvT)
* CoAtNet: Marrying Convolution and Attention for All Data Sizes: [Paper](https://arxiv.org/abs/2106.04803)
* ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases: [Paper](https://arxiv.org/abs/2103.10697) | [Code](https://github.com/facebookresearch/convit)
## 3. Efficient Transformers and Swin Transformer:
* Swin Transformer: Hierarchical Vision Transformer using Shifted Windows: [Paper](https://arxiv.org/abs/2103.14030) | [Code](https://github.com/microsoft/Swin-Transformer)
* Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions: [Paper](https://arxiv.org/abs/2102.12122) | [Code](https://github.com/whai362/PVT)
* Efficient Transformers: A Survey: [Paper](https://arxiv.org/abs/2009.06732)
## 4. Space-Time Attention and Video Transformers:
* TimeSformer: Is Space-Time Attention All You Need for Video Understanding? [Paper](https://arxiv.org/abs/2102.05095) | [Code](https://github.com/facebookresearch/TimeSformer)
* Space-Time Mixing Attention for Video Transformer: [Paper](https://arxiv.org/abs/2106.05968)
* MViT: Multiscale Vision Transformers: [Paper](https://arxiv.org/abs/2104.11227) | [Code](https://github.com/facebookresearch/SlowFast)
## 5. Video Vision Transformer (ViViT):
* ViViT: A Video Vision Transformer: [Paper](https://arxiv.org/abs/2103.15691) | [Code](https://github.com/google-research/scenic/tree/main/scenic/projects/vivit)
* Video Transformer Network: [Paper](https://arxiv.org/abs/2102.00719) | [Code](https://github.com/mx-mark/VideoTransformer-pytorch)
## How to use this Repo?
* Start by reading the survey papers to get a broad understanding of the field.
* For each key paper, read the abstract and introduction, then skim through the methodology and results sections.
* Implement key concepts using the provided GitHub repositories or your own code (a minimal starter example follows this list).
* Experiment with different architectures and datasets to solidify your understanding.
* Use the additional resources to dive deeper into specific topics or applications.
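
For a quick hands-on start, the sketch below loads a pretrained ViT with `timm` (already listed in `requirements.txt`) and runs a forward pass on a dummy batch; the model name `vit_base_patch16_224` is just one standard choice, not a requirement of this repo.
```
import timm
import torch

# load a standard pretrained ViT-Base/16 and classify a dummy 224x224 image
model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)          # shape: (1, 1000) ImageNet logits
print(logits.argmax(dim=-1))
```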
================================================
FILE: requirements.txt
================================================
torch
torchvision
transformers
timm
matplotlib
opencv-python
plotly
streamlit
gradio
flask
================================================
FILE: vit/readme.md
================================================
# Building ViT from scratch
## INFO:
This project implements a Vision Transformer (ViT) from scratch using Python and PyTorch. The implementation is based on the original paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. The model is trained and evaluated on the CIFAR-10 dataset.
## Project Structure:
The project consists of the following main files:
* `base.py`: Contains the GELU activation function implementation. [Paper](https://arxiv.org/abs/1606.08415) | [Reference implementation](https://github.com/huggingface/transformers/blob/main/src/transformers/activations.py)
* `data.py`: Handles data preparation using the CIFAR-10 dataset.
* `ViT.py`: Contains the Vision Transformer model implemented from scratch.
* `trainer.py`: Implements the entire training and evaluation pipeline.
* `utils.py`: Contains utility functions for model and checkpoint management.
* `visualize/vis.py`: Visualizes image patches and feature maps.
## Requirements:
```
cd vit/src
pip install -r requirements.txt
```
## Inference:
1. Clone the repo:
```
git clone https://github.com/0xD4rky/Vision-Transformers.git
cd Vision-Transformers/vit/src
```
2. Prepare the data: The `data.py` script handles the CIFAR-10 dataset preparation. You don't need to run this separately as it will be called by the trainer.
3. Training:
```
python trainer.py --exp-name <experiment-name>
```
This script will train the Vision Transformer on the CIFAR-10 dataset and evaluate its performance.
## Model Architecture
The Vision Transformer (ViT) architecture is implemented in `ViT.py` (see the construction sketch after the list below). It follows the original paper's design, including:
* Patch embedding
* Positional embedding
* Transformer encoder with multi-head self-attention and feed-forward layers
* Classification head
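A minimal construction sketch is shown below; the config values mirror the CIFAR-10 configuration in `trainer.py` and are one working example, not the only valid setting:
```
from ViT import Classification
import torch

config = {
    "image_size": 32, "patch_size": 4, "num_channels": 3,
    "vector_dim": 48, "num_attention_heads": 4, "num_hidden_layers": 4,
    "hidden_size": 4 * 48, "hidden_dropout_prob": 0.0,
    "attention_probs_dropout_prob": 0.0, "qkv_bias": True,
    "initializer_range": 0.02, "num_classes": 10,
}
model = Classification(config)
logits, _ = model(torch.randn(2, 3, 32, 32))   # logits: (2, 10)
```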
## Training and Evaluation:
The `trainer.py` script handles both training and evaluation (see the example command after the list below). It includes:
* Data loading and preprocessing
* Model initialization
* Training loop with gradient updates
* Evaluation on the test set
* Logging of training progress and results
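For example (flags as defined in `trainer.py`; the experiment name and checkpoint interval are example choices, the other values are the script defaults):
```
python trainer.py --exp-name vit-cifar10 --batch-size 256 --epochs 100 --lr 1e-2 --save-model-every 10
```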
## Utility Functions:
The `utils.py` file contains helper functions (see the usage example after the list below) for:
* Saving and loading model checkpoints
* Logging training progress
* Any other utility functions used across the project
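For instance, a finished run can be reloaded with `load_experiment`; the experiment name `vit-cifar10` below is just an example:
```
from utils import load_experiment

config, model, train_losses, test_losses, accuracies = load_experiment("vit-cifar10")
print(f"final test accuracy: {accuracies[-1]:.4f}")
```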
## Results:
(You can add information about the performance of your model on the CIFAR-10 dataset, including accuracy, training time, and any comparisons with baseline models.)
## Visualization:
### 1. Image Patches:

### 2. Feature Maps:

================================================
FILE: vit/src/ViT.py
================================================
from base import *
class PatchEmbeddings(nn.Module):
"""
Convert the image into patches and then project them into a vector space.
"""
def __init__(self, config):
super().__init__()
self.image_size = config["image_size"]
self.patch_size = config["patch_size"]
self.num_channels = config["num_channels"]
self.vector_dim = config["vector_dim"]
self.num_patches = (self.image_size // self.patch_size) ** 2
self.projection = nn.Conv2d(self.num_channels, self.vector_dim, kernel_size=self.patch_size, stride=self.patch_size)
def forward(self, x):
# {batch_size, num_channels, image_size, image_size}-> {batch_size, num_patches, vector_dim}
x = self.projection(x)
x = x.flatten(2).transpose(1, 2)
return x
class Embeddings(nn.Module):
"""
adding positional information to extracted patch embeddings
"""
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.patch_emb = PatchEmbeddings(config)
        self.cls_token = nn.Parameter(torch.randn(1, 1, config["vector_dim"]))
        # create learnable positional encodings, with +1 position for the [CLS] token
        self.positional_encoding = nn.Parameter(torch.randn(1, self.patch_emb.num_patches + 1, config["vector_dim"]))
        self.dropout = nn.Dropout(config["hidden_dropout_prob"])
def forward(self,x):
x = self.patch_emb(x)
batch_size, _, _ = x.size()
# expand the [cls] token to batch size
#{1,1,vector_dim} -> (batch_size,1,hidden_size)
cls_tokens = self.cls_token.expand(batch_size,-1,-1)
"""
concatenating cls token to inputn sequence
size : {num_patches + 1}
"""
x = torch.cat((cls_tokens,x),dim = 1)
x = x + self.positional_encoding
return x
class Attention(nn.Module):
"""
Attention module
Will be used in:
Multi-headed-attention Module
"""
def __init__(self,vector_dim,attention_head_size,dropout,bias = True):
super().__init__()
self.vector_dim = vector_dim
self.attention_head_size = attention_head_size
self.dropout = nn.Dropout(dropout)
# {query,key,value}
self.query = nn.Linear(vector_dim,attention_head_size, bias = bias)
self.key = nn.Linear(vector_dim, attention_head_size,bias = bias)
self.value = nn.Linear(vector_dim,attention_head_size,bias = bias)
def forward(self,x):
query = self.query(x)
key = self.key(x)
value = self.value(x)
# i have them in matrix form
similarity = torch.matmul(query,key.transpose(-1,-2))
        attention_probs = nn.functional.softmax(similarity / math.sqrt(self.attention_head_size), dim=-1)
attention_probs = self.dropout(attention_probs)
output = torch.matmul(attention_probs,value)
return output,attention_probs
class MultiheadAttention(nn.Module):
"""
Multi-headed-attention module
Will be used in:
Transformer Encoder
"""
def __init__(self,config):
        super().__init__()
        self.vector_dim = config["vector_dim"]
        self.num_attention_heads = config["num_attention_heads"]
        self.attention_head_size = self.vector_dim // self.num_attention_heads
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.qkv_bias = config["qkv_bias"]
#creating a list of attention heads
self.heads = nn.ModuleList([])
for _ in range(self.num_attention_heads):
head = Attention(
self.vector_dim,
self.attention_head_size,
config["attention_probs_dropout_prob"],
self.qkv_bias
)
self.heads.append(head)
# project attention output back to vector dim
self.output_projection = nn.Linear(self.all_head_size,self.vector_dim)
self.output_dropout = nn.Dropout(config["hidden_dropout_prob"])
def forward(self,x,output_attentions = False):
attention_outputs = [head(x) for head in self.heads] # for each attention head
attention_output = torch.cat([attention_output for attention_output, _ in attention_outputs],dim=-1)
# Project the concatenated attention output back to the hidden size
attention_output = self.output_projection(attention_output)
attention_output = self.output_dropout(attention_output)
# Return the attention output and the attention probabilities (optional)
if not output_attentions:
return (attention_output, None)
else:
attention_probs = torch.stack([attention_probs for _, attention_probs in attention_outputs], dim=1)
return (attention_output, attention_probs)
class MLP(nn.Module):
"""
Multi-Layer Perceptron Module
"""
def __init__(self,config):
super().__init__()
self.dense_1 = nn.Linear(config["vector_dim"],config["hidden_size"])
self.act = NewGELUActivation()
self.dense_2 = nn.Linear(config["hidden_size"],config["vector_dim"])
self.dropout = nn.Dropout(config["hidden_dropout_prob"])
def forward(self,x):
x = self.dense_1(x)
x = self.act(x)
x = self.dense_2(x)
x = self.dropout(x)
return x
class Block(nn.Module):
"""
Single transformer block
"""
def __init__(self,config):
super().__init__()
self.attention = MultiheadAttention(config)
self.layer_norm1 = nn.LayerNorm(config["vector_dim"])
self.mlp = MLP(config)
        self.layer_norm2 = nn.LayerNorm(config["vector_dim"])
def forward(self,x,output_attentions = False):
# {self-attention after normalizing layers}
attention_output, attention_prob = self.attention(self.layer_norm1(x),output_attentions=output_attentions)
        x = x + attention_output  # skip connection
mlp_output = self.mlp(self.layer_norm2(x)) #{ffn}
x = x + mlp_output
if not output_attentions:
return (x,None)
else:
return (x,attention_prob)
class Encoder(nn.Module):
def __init__(self,config):
super().__init__()
self.blocks = nn.ModuleList([])
for _ in range(config["num_hidden_layers"]):
block = Block(config)
self.blocks.append(block)
def forward(self, x, output_attentions=False):
# Calculate the transformer block's output for each block
all_attentions = []
for block in self.blocks:
x, attention_probs = block(x, output_attentions=output_attentions)
if output_attentions:
all_attentions.append(attention_probs)
# Return the encoder's output and the attention probabilities (optional)
if not output_attentions:
return (x, None)
else:
return (x, all_attentions)
class Classification(nn.Module):
"""
ViT model for classification
"""
def __init__(self,config):
super().__init__()
self.config = config
        self.img_size = config["image_size"]
self.vector_dim = config["vector_dim"]
self.num_classes = config["num_classes"]
# follow the below pipepline :)
self.embeddings = Embeddings(config)
self.encoder = Encoder(config)
self.classifier = nn.Linear(self.vector_dim,self.num_classes)
self.apply(self._init_weights)
def forward(self, x, output_attentions=False):
# Calculate the embedding output
        embedding_output = self.embeddings(x)
# Calculate the encoder's output
encoder_output, all_attentions = self.encoder(embedding_output, output_attentions=output_attentions)
# Calculate the logits, take the [CLS] token's output as features for classification
logits = self.classifier(encoder_output[:, 0, :])
# Return the logits and the attention probabilities (optional)
if not output_attentions:
return (logits, None)
else:
return (logits, all_attentions)
def _init_weights(self, module):
if isinstance(module, (nn.Linear, nn.Conv2d)):
torch.nn.init.normal_(module.weight, mean=0.0, std=self.config["initializer_range"])
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
        elif isinstance(module, Embeddings):
            module.positional_encoding.data = nn.init.trunc_normal_(
                module.positional_encoding.data.to(torch.float32),
                mean=0.0,
                std=self.config["initializer_range"],
            ).to(module.positional_encoding.dtype)
================================================
FILE: vit/src/base.py
================================================
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.datasets as datasets
import torch.optim as optim
from torch.utils.data import DataLoader
import numpy as np
import matplotlib.pyplot as plt
import os
from PIL import Image
import math
class NewGELUActivation(nn.Module):
"""
Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also see
the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
Taken from https://github.com/huggingface/transformers/blob/main/src/transformers/activations.py
"""
def forward(self, input):
return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
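# Illustrative sanity check (sketch): this tanh formulation should closely match
# PyTorch's built-in GELU with approximate="tanh" (available in torch >= 1.12).
if __name__ == "__main__":
    x = torch.randn(4, 8)
    gelu = NewGELUActivation()
    print(torch.allclose(gelu(x), F.gelu(x, approximate="tanh"), atol=1e-6))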
================================================
FILE: vit/src/data.py
================================================
# Import libraries
import torch
import torchvision
import torchvision.transforms as transforms
def prepare_data(batch_size=4, num_workers=2, train_sample_size=None, test_sample_size=None):
train_transform = transforms.Compose(
[transforms.ToTensor(),
transforms.Resize((32, 32)),
transforms.RandomHorizontalFlip(p=0.5),
         transforms.RandomResizedCrop((32, 32), scale=(0.8, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=transforms.InterpolationMode.BILINEAR),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
download=True, transform=train_transform)
if train_sample_size is not None:
# Randomly sample a subset of the training set
indices = torch.randperm(len(trainset))[:train_sample_size]
trainset = torch.utils.data.Subset(trainset, indices)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
shuffle=True, num_workers=num_workers)
test_transform = transforms.Compose(
[transforms.ToTensor(),
transforms.Resize((32, 32)),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
download=True, transform=test_transform)
if test_sample_size is not None:
# Randomly sample a subset of the test set
indices = torch.randperm(len(testset))[:test_sample_size]
testset = torch.utils.data.Subset(testset, indices)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
shuffle=False, num_workers=num_workers)
classes = ('plane', 'car', 'bird', 'cat',
'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
return trainloader, testloader, classes
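# Illustrative usage sketch: the batch size and sample size below are example
# values chosen for a quick smoke test, not project defaults.
if __name__ == "__main__":
    trainloader, testloader, classes = prepare_data(batch_size=64, train_sample_size=1024)
    images, labels = next(iter(trainloader))
    print(images.shape)                               # torch.Size([64, 3, 32, 32])
    print([classes[i] for i in labels[:4].tolist()])  # e.g. ['cat', 'ship', ...]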
================================================
FILE: vit/src/requirements.txt
================================================
python-apt==2.7.7+ubuntu3
python-dateutil==2.8.2
python-debian==0.1.49+ubuntu2
python-multipart==0.0.12
torch==2.4.1
torchaudio==2.4.1
torchvision==0.19.1
numpy==1.26.4
================================================
FILE: vit/src/trainer.py
================================================
import torch
import torch.nn as nn
import argparse
from data import *
from utils import save_checkpoint, save_experiment
from ViT import Classification
config = {
"patch_size": 4, # Input image size: 32x32 -> 8x8 patches
"vector_dim": 48,
"num_hidden_layers": 4,
"num_attention_heads": 4,
"hidden_size": 4 * 48, # 4 * hidden_size
"hidden_dropout_prob": 0.0,
"attention_probs_dropout_prob": 0.0,
"initializer_range": 0.02,
"image_size": 32,
"num_classes": 10, # num_classes of CIFAR10
"num_channels": 3,
"qkv_bias": True,
}
assert config["vector_dim"] % config["num_attention_heads"] == 0
assert config['hidden_size'] == 4 * config['vector_dim']
assert config['image_size'] % config['patch_size'] == 0
class Trainer:
"""
simple trainer block
"""
def __init__(self,model,optimizer,loss_fn,exp_name,device):
self.model = model.to(device)
        self.optimizer = optimizer
self.loss = loss_fn
self.exp_name = exp_name
self.device = device
def train(self,train_loader,test_loader,epochs,save_exp_every_n_epochs = 0):
train_losses, test_losses, accuracies = [],[],[]
for i in range(epochs):
train_loss = self.train_epoch(train_loader)
            accuracy, test_loss = self.evaluate(test_loader)
train_losses.append(train_loss)
test_losses.append(test_loss)
accuracies.append(accuracy)
print(f"Epoch {i+1}, Train loss: {train_loss:.4f}, Test loss: {test_loss:.4f}, Accuracy: {accuracy:.4f}")
            if save_exp_every_n_epochs > 0 and (i + 1) % save_exp_every_n_epochs == 0 and i + 1 != epochs:
                print('\tSave checkpoint at epoch', i + 1)
                save_checkpoint(self.exp_name, self.model, i + 1)
        # save the final experiment state (config, metrics, and final checkpoint) once training ends
        save_experiment(self.exp_name, config, self.model, train_losses, test_losses, accuracies)
def train_epoch(self,train_loader):
self.model.train()
total_loss = 0
for batch in train_loader:
            batch = [t.to(self.device) for t in batch]
images, labels = batch
self.optimizer.zero_grad()
loss = self.loss(self.model(images)[0], labels)
loss.backward()
self.optimizer.step()
total_loss += loss.item()*len(images)
return total_loss/ len(train_loader.dataset)
@torch.no_grad()
def evaluate(self,test_loader):
self.model.eval()
total_loss = 0
correct = 0
with torch.no_grad():
            for batch in test_loader:
batch = [t.to(self.device) for t in batch]
images, labels = batch
logits,_ = self.model(images)
loss = self.loss(logits,labels)
total_loss += loss.item() * len(images)
predictions = torch.argmax(logits, dim = 1)
                correct += torch.sum(predictions == labels).item()
accuracy = correct/ len(test_loader.dataset)
avg_loss = total_loss / len(test_loader.dataset)
return accuracy, avg_loss
def parse_args():
    parser = argparse.ArgumentParser()
parser.add_argument("--exp-name", type = str, required = True)
parser.add_argument("--batch-size", type = int, default = 256)
parser.add_argument("--epochs", type=int, default=100)
parser.add_argument("--lr", type=float, default=1e-2)
parser.add_argument("--device", type=str)
parser.add_argument("--save-model-every", type=int, default=0)
args = parser.parse_args()
if args.device is None:
args.device = "cuda" if torch.cuda.is_available() else "cpu"
return args
def main():
args = parse_args()
batch_size = args.batch_size
epochs = args.epochs
lr = args.lr
device = args.device
save_exp_every_n_epochs = args.save_model_every
    trainloader, testloader, _ = prepare_data(batch_size = batch_size)
model = Classification(config)
"""
IF YOU WANT TO USE LORA TRAINING, UNCOMMENT THE BELOW LINES
def create_model():
model = Classification(config)
if config["use_lora"]:
model = prepare_model_for_lora_training(model)
return model
"""
optimizer = torch.optim.AdamW(model.parameters(), lr = lr, weight_decay = 1e-2)
loss_fn = nn.CrossEntropyLoss()
trainer = Trainer(model, optimizer, loss_fn, args.exp_name, device = device)
trainer.train(trainloader, testloader, epochs, save_exp_every_n_epochs = save_exp_every_n_epochs)
if __name__ == "__main__":
main()
================================================
FILE: vit/src/utils.py
================================================
import json, os, math
import matplotlib.pyplot as plt
import numpy as np
import torch
from torch.nn import functional as F
import torchvision
import torchvision.transforms as transforms
from ViT import Classification
def save_experiment(experiment_name,config,model,train_losses,test_losses,accuracies,base_dir = "experiments"):
outdir = os.path.join(base_dir,experiment_name)
os.makedirs(outdir, exist_ok = True)
configfile = os.path.join(outdir,'config.json')
with open(configfile, 'w') as f:
json.dump(config,f,sort_keys = True,indent = 4)
jsonfile = os.path.join(outdir, 'metrics.json')
with open(jsonfile, 'w') as f:
data = {
'train_losses': train_losses,
'test_losses': test_losses,
'accuracies': accuracies,
}
json.dump(data, f, sort_keys=True, indent=4)
save_checkpoint(experiment_name,model,"final",base_dir = base_dir)
def save_checkpoint(experiment_name, model, epoch, base_dir="experiments"):
outdir = os.path.join(base_dir, experiment_name)
os.makedirs(outdir, exist_ok=True)
cpfile = os.path.join(outdir, f'model_{epoch}.pt')
torch.save(model.state_dict(), cpfile)
def load_experiment(experiment_name,checkpoint_name="model_final.pt",base_dir = "experiments"):
outdir = os.path.join(base_dir,experiment_name)
configfile = os.path.join(outdir,'config.json')
with open(configfile,'r') as f:
config = json.load(f)
    jsonfile = os.path.join(outdir, 'metrics.json')
with open(jsonfile,'r') as f:
data = json.load(f)
train_losses = data['train_losses']
test_losses = data['test_losses']
accuracies = data['accuracies']
# Load the model
    model = Classification(config)
cpfile = os.path.join(outdir, checkpoint_name)
model.load_state_dict(torch.load(cpfile))
return config, model, train_losses, test_losses, accuracies
================================================
FILE: vit/src/vit_with_lora.py
================================================
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from base import *
class LoRALayer(nn.Module):
"""Low-Rank Adaptation layer"""
def __init__(self, in_features, out_features, rank=4, alpha=16):
super(LoRALayer,self).__init__()
self.rank = rank
self.scaling = alpha / rank
self.lora_A = nn.Parameter(torch.zeros(in_features, rank))
self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
nn.init.zeros_(self.lora_B)
def forward(self, x):
x = x @ (self.lora_A @ self.lora_B) * self.scaling
return x
class PatchEmbeddings(nn.Module):
"""
Convert the image into patches and then project them into a vector space.
"""
def __init__(self, config):
super().__init__()
self.image_size = config["image_size"]
self.patch_size = config["patch_size"]
self.num_channels = config["num_channels"]
self.vector_dim = config["vector_dim"]
self.num_patches = (self.image_size // self.patch_size) ** 2
self.projection = nn.Conv2d(self.num_channels, self.vector_dim, kernel_size=self.patch_size, stride=self.patch_size)
def forward(self, x):
# {batch_size, num_channels, image_size, image_size}-> {batch_size, num_patches, vector_dim}
x = self.projection(x)
x = x.flatten(2).transpose(1, 2)
return x
class Embeddings(nn.Module):
"""
adding positional information to extracted patch embeddings
"""
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.patch_emb = PatchEmbeddings(config)
        self.cls_token = nn.Parameter(torch.randn(1, 1, config["vector_dim"]))
        # learnable positional encodings, with +1 position for the [CLS] token
        self.positional_encoding = nn.Parameter(torch.randn(1, self.patch_emb.num_patches + 1, config["vector_dim"]))
        self.dropout = nn.Dropout(config["hidden_dropout_prob"])
def forward(self,x):
x = self.patch_emb(x)
batch_size, _, _ = x.size()
# expand the [cls] token to batch size
#{1,1,vector_dim} -> (batch_size,1,hidden_size)
cls_tokens = self.cls_token.expand(batch_size,-1,-1)
"""
concatenating cls token to inputn sequence
size : {num_patches + 1}
"""
x = torch.cat((cls_tokens,x),dim = 1)
x = x + self.positional_encoding
return x
class Attention(nn.Module):
"""
Attention module with LoRA Support
"""
def __init__(self,vector_dim,attention_head_size,dropout,bias=True, use_lora=False, lora_rank=8, lora_alpha=16):
super().__init__()
self.vector_dim = vector_dim
self.attention_head_size = attention_head_size
self.dropout = nn.Dropout(dropout)
self.use_lora = use_lora
self.query = nn.Linear(vector_dim, attention_head_size, bias = bias)
self.key = nn.Linear(vector_dim, attention_head_size, bias = bias)
self.value = nn.Linear(vector_dim, attention_head_size, bias = bias)
if use_lora:
self.lora_q = LoRALayer(vector_dim, attention_head_size, lora_rank, lora_alpha)
self.lora_v = LoRALayer(vector_dim, attention_head_size, lora_rank, lora_alpha)
    def forward(self, x):
        query = self.query(x)
        key = self.key(x)
        value = self.value(x)
        if self.use_lora:
            # add the low-rank updates to the query and value projections
            query = query + self.lora_q(x)
            value = value + self.lora_v(x)
similarity = torch.matmul(query, key.transpose(-1,-2))
        attention_probs = F.softmax(similarity / math.sqrt(self.attention_head_size), dim=-1)
attention_probs = self.dropout(attention_probs)
output = torch.matmul(attention_probs, value)
return output, attention_probs
class MultiheadAttention(nn.Module):
"""
Multi-headed-attention module with LoRA support
"""
def __init__(self, config):
super().__init__()
self.vector_dim = config["vector_dim"]
self.num_attention_heads = config["num_attention_heads"]
self.attention_head_size = self.vector_dim // self.num_attention_heads
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.qkv_bias = config["qkv_bias"]
self.use_lora = config.get("use_lora", False)
self.lora_rank = config.get("lora_rank", 8)
self.lora_alpha = config.get("lora_alpha", 16)
self.heads = nn.ModuleList([
Attention(
self.vector_dim,
self.attention_head_size,
config["attention_probs_dropout_prob"],
self.qkv_bias,
self.use_lora,
self.lora_rank,
self.lora_alpha
)
for _ in range(self.num_attention_heads)
])
self.output_projection = nn.Linear(self.all_head_size, self.vector_dim)
self.output_dropout = nn.Dropout(config["hidden_dropout_prob"])
def forward(self, x, output_attentions=False):
attention_outputs = [head(x) for head in self.heads]
attention_output = torch.cat(
[attention_output for attention_output, _ in attention_outputs],
dim=-1
)
attention_output = self.output_projection(attention_output)
attention_output = self.output_dropout(attention_output)
if not output_attentions:
return (attention_output, None)
attention_probs = torch.stack(
[attention_probs for _, attention_probs in attention_outputs],
dim=1
)
return (attention_output, attention_probs)
class MLP(nn.Module):
"""
Multi-Layer Perceptron Module with LoRA support
"""
def __init__(self, config):
super().__init__()
self.use_lora = config.get("use_lora", False)
self.lora_rank = config.get("lora_rank", 8)
self.lora_alpha = config.get("lora_alpha", 16)
self.dense_1 = nn.Linear(config["vector_dim"], config["hidden_size"])
self.dense_2 = nn.Linear(config["hidden_size"], config["vector_dim"])
if self.use_lora:
self.lora_1 = LoRALayer(
config["vector_dim"],
config["hidden_size"],
self.lora_rank,
self.lora_alpha
)
self.lora_2 = LoRALayer(
config["hidden_size"],
config["vector_dim"],
self.lora_rank,
self.lora_alpha
)
self.act = NewGELUActivation()
self.dropout = nn.Dropout(config["hidden_dropout_prob"])
def forward(self, x):
hidden = self.dense_1(x)
if self.use_lora:
hidden = hidden + self.lora_1(x)
hidden = self.act(hidden)
output = self.dense_2(hidden)
if self.use_lora:
output = output + self.lora_2(hidden)
output = self.dropout(output)
return output
def prepare_mlp_for_lora_training(model):
"""Freeze all parameters except LoRA parameters"""
for name, param in model.named_parameters():
if 'lora' not in name:
param.requires_grad = False
else:
param.requires_grad = True
return model
class Block(nn.Module):
"single transformer block with LoRA support"
def __init__(self, config):
        super().__init__()
self.attention = MultiheadAttention(config)
self.layer_norm1 = nn.LayerNorm(config["vector_dim"])
self.mlp = MLP(config)
self.layer_norm2 = nn.LayerNorm(config["vector_dim"])
def forward(self, x, output_attentions = False):
attention_output, attention_probs = self.attention(self.layer_norm1(x), output_attentions=output_attentions)
x = x + attention_output
mlp_output = self.mlp(self.layer_norm2(x))
x = x + mlp_output
if not output_attentions:
return (x, None)
else:
return (x, attention_probs)
class Encoder(nn.Module):
"""
Transformer encoder with LoRA support
"""
def __init__(self, config):
super().__init__()
self.blocks = nn.ModuleList([
Block(config) for _ in range(config["num_hidden_layers"])
])
def forward(self, x, output_attentions=False):
all_attentions = []
for block in self.blocks:
x, attention_probs = block(x, output_attentions=output_attentions)
if output_attentions:
all_attentions.append(attention_probs)
if not output_attentions:
return (x, None)
else:
return (x, all_attentions)
class LoRALinear(nn.Module):
"""
Linear layer with LoRA support for classification head
"""
def __init__(self, in_features, out_features, rank=8, alpha=16):
super().__init__()
self.linear = nn.Linear(in_features, out_features)
self.lora = LoRALayer(in_features, out_features, rank, alpha)
def forward(self, x):
return self.linear(x) + self.lora(x)
class Classification(nn.Module):
"""
ViT model for classification with LoRA support
"""
def __init__(self, config):
super().__init__()
self.config = config
        self.img_size = config["image_size"]
self.vector_dim = config["vector_dim"]
self.num_classes = config["num_classes"]
# Initialize components
self.embeddings = Embeddings(config)
self.encoder = Encoder(config)
# Use LoRA for classifier if enabled
if config.get("use_lora", False):
self.classifier = LoRALinear(
self.vector_dim,
self.num_classes,
config.get("lora_rank", 8),
config.get("lora_alpha", 16)
)
else:
self.classifier = nn.Linear(self.vector_dim, self.num_classes)
self.apply(self._init_weights)
def forward(self, x, output_attentions=False):
embedding_output = self.embeddings(x)
encoder_output, all_attentions = self.encoder(
embedding_output,
output_attentions=output_attentions
)
# Use CLS token for classification
logits = self.classifier(encoder_output[:, 0, :])
if not output_attentions:
return (logits, None)
else:
return (logits, all_attentions)
def _init_weights(self, module):
if isinstance(module, (nn.Linear, nn.Conv2d)):
torch.nn.init.normal_(
module.weight,
mean=0.0,
std=self.config["initializer_range"]
)
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
elif isinstance(module, Embeddings):
            module.positional_encoding.data = nn.init.trunc_normal_(
                module.positional_encoding.data.to(torch.float32),
                mean=0.0,
                std=self.config["initializer_range"],
            ).to(module.positional_encoding.dtype)
def prepare_model_for_lora_training(model):
"""
Prepare the model for LoRA training by freezing non-LoRA parameters
"""
for name, param in model.named_parameters():
if 'lora' not in name:
param.requires_grad = False
else:
param.requires_grad = True
return model
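# Illustrative LoRA fine-tuning setup (sketch): the config mirrors trainer.py's
# CIFAR-10 settings plus example LoRA hyperparameters; values are assumptions,
# not fixed defaults of this project.
if __name__ == "__main__":
    example_config = {
        "image_size": 32, "patch_size": 4, "num_channels": 3, "vector_dim": 48,
        "num_attention_heads": 4, "num_hidden_layers": 4, "hidden_size": 4 * 48,
        "hidden_dropout_prob": 0.0, "attention_probs_dropout_prob": 0.0,
        "initializer_range": 0.02, "num_classes": 10, "qkv_bias": True,
        "use_lora": True, "lora_rank": 8, "lora_alpha": 16,
    }
    model = prepare_model_for_lora_training(Classification(example_config))
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable (LoRA) parameters: {trainable} / {total}")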
================================================
FILE: vit/visualize/vis.py
================================================
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import torchvision.transforms as T
from torchvision.utils import make_grid
class PatchEmbedding(nn.Module):
def __init__(self, num_patches, vector_dim, patch_size):
super(PatchEmbedding, self).__init__()
self.conv = nn.Conv2d(3, vector_dim, kernel_size=patch_size, stride=patch_size)
def forward(self, x):
x = self.conv(x)
return x
input_image = torch.randn(1, 3, 224, 224) # {creating a dummy image to vis}
vector_dim = 256
patch_size = 16
patch_embedding = PatchEmbedding(3, vector_dim, patch_size)
output = patch_embedding(input_image)
def visualize_patches(input_image, patch_size):
"""
visualizing patches and attention maps
"""
input_image = input_image.squeeze(0).permute(1, 2, 0).numpy()
fig, ax = plt.subplots()
ax.imshow(input_image)
for i in range(0, input_image.shape[0], patch_size):
ax.axhline(i, color='red')
for j in range(0, input_image.shape[1], patch_size):
ax.axvline(j, color='red')
plt.title("Input Image with Patches")
plt.show()
visualize_patches(input_image, patch_size)
def visualize_feature_maps(feature_maps, num_maps_to_show=8):
maps_to_show = feature_maps[0, :num_maps_to_show, :, :]
grid = make_grid(maps_to_show.unsqueeze(1), nrow=4, normalize=True, scale_each=True)
plt.figure(figsize=(15, 15))
plt.imshow(grid.permute(1, 2, 0).cpu().numpy())
plt.title("Feature Maps")
plt.axis('off')
plt.show()
visualize_feature_maps(output, num_maps_to_show=8)