[
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# poetry\n#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.\n#   This is especially recommended for binary packages to ensure reproducibility, and is more\n#   commonly ignored for libraries.\n#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control\n#poetry.lock\n\n# pdm\n#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.\n#pdm.lock\n#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it\n#   in version control.\n#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control\n.pdm.toml\n.pdm-python\n.pdm-build/\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# PyCharm\n#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can\n#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore\n#  and can be added to the global gitignore or merged into this file.  For a more nuclear\n#  option (not recommended) you can uncomment the following to ignore the entire idea folder.\n#.idea/\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2024 Ishaan\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# Zero-to-Hero: ViT🚀\n\nI have tried to cover all the bases for understanding and implementing Vision Transformers (ViT) and their evolution into Video Vision Transformers (ViViT).\nThe main focus is on dealing with the spatio-temporal relations using visual transformers.\n\n![image](https://github.com/user-attachments/assets/bc8a2727-b33a-4681-aee6-c6b617e7ad81)\n\n\n## 1. Vision Transformer (ViT) Fundamentals\n\n### Surveys and Overviews:\n\n* [Transformers in Vision: A Survey](https://arxiv.org/abs/2101.01169)\n* [A Survey of Visual Transformers](https://arxiv.org/abs/2111.06091)\n* [Transformers in Vision](https://arxiv.org/abs/2101.01169)\n\n### Key Papers:\n\n* An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: [Paper](https://arxiv.org/abs/2010.11929) | [Code](https://github.com/google-research/vision_transformer)\n* Training data-efficient image transformers & distillation through attention (DeiT): [Paper](https://arxiv.org/abs/2012.12877) | [Code](https://github.com/facebookresearch/deit)\n\n\n### Concepts and Tutorials:\n\n* \"Attention Is All You Need\": [Paper](https://arxiv.org/abs/1706.03762)\n* \"The Illustrated Transformers\": [Blog Post](http://jalammar.github.io/illustrated-transformer/)\n* \"Vision Transformer Explained\" [Blog Post](https://theaisummer.com/vision-transformer/)\n\n## 2. Convolutional ViT and Hybrid Models:\n\n* CvT: Introducing Convolutions to Vision Transformers: [Paper](https://arxiv.org/abs/2103.15808) | [Code](https://github.com/microsoft/CvT)\n* CoAtNet: Marrying Convolution and Attention for All Data Sizes: [Paper](https://arxiv.org/abs/2106.04803)\n* ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases: [Paper](https://arxiv.org/abs/2103.10697) | [Code](https://github.com/facebookresearch/convit)\n\n\n## 3. Efficient Transformers and Swin Transformer:\n\n* Swin Transformer: Hierarchical Vision Transformer using Shifted Windows: [Paper](https://arxiv.org/abs/2103.14030) | [Code](https://github.com/microsoft/Swin-Transformer)\n* Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions: [Paper](https://arxiv.org/abs/2102.12122) | [Code](https://github.com/whai362/PVT)\n* Efficient Transformers: A Survey: [Paper](https://arxiv.org/abs/2009.06732)\n\n\n## 4. Space-Time Attention and Video Transformers:\n\n* TimeSformer: Is Space-Time Attention All You Need for Video Understanding? [Paper](https://arxiv.org/abs/2102.05095) | [Code](https://github.com/facebookresearch/TimeSformer)\n* Space-Time Mixing Attention for Video Transformer: [Paper](https://arxiv.org/abs/2106.05968)\n* MViT: Multiscale Vision Transformers: [Paper](https://arxiv.org/abs/2104.11227) | [Code](https://github.com/facebookresearch/SlowFast)\n\n\n## 5. 
Video Vision Transformer (ViViT): \n\n*  ViViT: A Video Vision Transformer: [Paper](https://arxiv.org/abs/2103.15691) | [Code](https://github.com/google-research/scenic/tree/main/scenic/projects/vivit)\n*  Video Transformer Network: [Paper](https://arxiv.org/abs/2102.00719) | [Code](https://github.com/mx-mark/VideoTransformer-pytorch)\n\n\n\n## How to use this Repo?\n\n* Start by reading the survey papers to get a broad understanding of the field.\n* For each key paper, read the abstract and introduction, then skim through the methodology and results sections.\n* Implement key concepts using the provided GitHub repositories or your own code.\n* Experiment with different architectures and datasets to solidify your understanding.\n* Use the additional resources to dive deeper into specific topics or applications.\n \n \n"
  },
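  {
    "path": "examples/divided_space_time_attention.py",
    "content": "\"\"\"\nMinimal, illustrative sketch of the 'divided' space-time attention idea surveyed\nin Section 4 of the README (the TimeSformer / ViViT style factorisation).\n\nThis is a simplification meant only to make the factorisation concrete, not a\nreproduction of the published models: there is no [CLS] token, no positional or\ntubelet embedding, and all class and variable names here are illustrative only.\nVideo tokens of shape (batch, frames, patches, dim) first attend across frames\nat each spatial location (temporal attention), then across patches within each\nframe (spatial attention).\n\"\"\"\n\nimport torch\nimport torch.nn as nn\n\n\nclass DividedSpaceTimeAttention(nn.Module):\n    def __init__(self, dim, num_heads=4):\n        super().__init__()\n        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)\n        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)\n        self.norm_t = nn.LayerNorm(dim)\n        self.norm_s = nn.LayerNorm(dim)\n\n    def forward(self, x):\n        # x: (B, T, N, D) -- batch, frames, patches per frame, embedding dim\n        b, t, n, d = x.shape\n\n        # Temporal attention: every spatial location attends across the T frames.\n        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)\n        xt_norm = self.norm_t(xt)\n        xt = xt + self.temporal_attn(xt_norm, xt_norm, xt_norm)[0]\n        x = xt.reshape(b, n, t, d).permute(0, 2, 1, 3)\n\n        # Spatial attention: every frame attends across its own N patches.\n        xs = x.reshape(b * t, n, d)\n        xs_norm = self.norm_s(xs)\n        xs = xs + self.spatial_attn(xs_norm, xs_norm, xs_norm)[0]\n        return xs.reshape(b, t, n, d)\n\n\nif __name__ == '__main__':\n    tokens = torch.randn(2, 8, 49, 64)  # 2 clips, 8 frames, 7x7 patches, dim 64\n    out = DividedSpaceTimeAttention(dim=64, num_heads=4)(tokens)\n    print(out.shape)  # torch.Size([2, 8, 49, 64])\n"
  },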
  {
    "path": "requirements.txt",
    "content": "torch \ntorchvision \ntransformers \ntimm \nmatplotlib \nopencv-python \nplotly \nstreamlit \ngradio\nflask"
  },
  {
    "path": "vit/readme.md",
    "content": "# Building ViT from scratch\n\n## INFO:\n\nThis project implements a Vision Transformer (ViT) from scratch using Python and PyTorch. The implementation is based on the original paper \"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale\" by Dosovitskiy et al. The model is trained and evaluated on the CIFAR-10 dataset.\n\n## Project Structure:\n\n\nThe project consists of the following main files:\n\n* `base.py`: Contains the GELU activation function implementation. [Paper](https://arxiv.org/abs/1606.08415) | [Code](https://github.com/karfaoui/gelu)\n* `data.py`: Handles data preparation using the CIFAR-10 dataset.\n* `ViT.py`: Contains the Vision Transformer model implemented from scratch.\n* `trainer.py`: Implements the entire training and evaluation pipeline.\n* `utils.py`: Contains utility functions for model and checkpoint management.\n* visualization contains `vis.py` to visualize image patches and attention maps.\n\n## Requirements:\n\n```\ncd proj/src\npip install -r requirements.txt\n```\n\n## Inference:\n\n1. Clone the repo:\n\n   ```\n   git clone https://github.com/0xD4rky/Vision-Transformer.git\n   cd proj/src\n   ```\n2. Prepare the data: The `data.py` script handles the CIFAR-10 dataset preparation. You don't need to run this separately as it will be called by the trainer.\n\n3. Training:\n   ```\n   python trainer.py\n   ```\n   This script will train the Vision Transformer on the CIFAR-10 dataset and evaluate its performance.\n\n\n## Model Architecture\nThe Vision Transformer (ViT) architecture is implemented in `ViT.py`. It follows the original paper's design, including:\n\n* Patch embedding\n* Positional embedding\n* Transformer encoder with multi-head self-attention and feed-forward layers\n* Classification head\n\n\n## Training and Evaluation:\n\nThe `trainer.py` script handles both training and evaluation. It includes:\n\n* Data loading and preprocessing\n* Model initialization\n* Training loop with gradient updates\n* Evaluation on the test set\n* Logging of training progress and results\n\n## Utility Functions:\n\nThe `utils.py` file contains helper functions for:\n\n* Saving and loading model checkpoints\n* Logging training progress\n* Any other utility functions used across the project\n\n## Results:\n\n(You can add information about the performance of your model on the CIFAR-10 dataset, including accuracy, training time, and any comparisons with baseline models.)\n\n`VISUALIZATION`:\n\n### 1. Image Patches:\n\n![Screenshot from 2024-10-11 22-51-58](https://github.com/user-attachments/assets/fa1673ca-b0fe-46ca-917b-ce9268eb4510)\n\n### 2. Feature Maps:\n\n![Screenshot from 2024-10-11 22-52-21](https://github.com/user-attachments/assets/ea7bac63-a9c7-4d7f-88b3-fab45a84d766)\n\n\n\n\n"
  },
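  {
    "path": "vit/src/example_forward_pass.py",
    "content": "\"\"\"\nMinimal usage sketch for the from-scratch ViT described in vit/readme.md.\n\nIt builds `Classification` from ViT.py with the same hyperparameters as\ntrainer.py and runs a forward pass on a dummy CIFAR-10-sized batch; the shapes\nin the comments assume exactly this config. The file is illustrative only and\nis not needed for training.\n\"\"\"\n\nimport torch\n\nfrom ViT import Classification\n\n# Same hyperparameters as trainer.py\nconfig = {\n    'patch_size': 4,\n    'vector_dim': 48,\n    'num_hidden_layers': 4,\n    'num_attention_heads': 4,\n    'hidden_size': 4 * 48,\n    'hidden_dropout_prob': 0.0,\n    'attention_probs_dropout_prob': 0.0,\n    'initializer_range': 0.02,\n    'image_size': 32,\n    'num_classes': 10,\n    'num_channels': 3,\n    'qkv_bias': True,\n}\n\nif __name__ == '__main__':\n    model = Classification(config)\n    images = torch.randn(8, 3, 32, 32)  # dummy batch of CIFAR-10-sized images\n    logits, attentions = model(images, output_attentions=True)\n    print(logits.shape)         # torch.Size([8, 10])\n    print(len(attentions))      # one entry per encoder block: 4\n    print(attentions[0].shape)  # (8, num_heads, 65, 65): 64 patches + [CLS]\n"
  },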
  {
    "path": "vit/src/ViT.py",
    "content": "from base import *\n\nclass PatchEmbeddings(nn.Module):\n    \"\"\"\n    Convert the image into patches and then project them into a vector space.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.image_size = config[\"image_size\"]\n        self.patch_size = config[\"patch_size\"]\n        self.num_channels = config[\"num_channels\"]\n        self.vector_dim = config[\"vector_dim\"]\n        self.num_patches = (self.image_size // self.patch_size) ** 2\n        self.projection = nn.Conv2d(self.num_channels, self.vector_dim, kernel_size=self.patch_size, stride=self.patch_size)\n\n    def forward(self, x):\n        # {batch_size, num_channels, image_size, image_size}-> {batch_size, num_patches, vector_dim}\n        x = self.projection(x)\n        x = x.flatten(2).transpose(1, 2)\n        return x\n\nclass Embeddings(nn.Module):\n    \n    \"\"\"\n    adding positional information to extracted patch embeddings\n    \"\"\"\n    \n    def __init__(self,config):\n        self.config = config\n        self.patch_emb = PatchEmbeddings(config)\n        \n        self.cls_token = nn.Parameter(torch.randn(1,1,config[\"vector_dim\"]))\n        \n        # create learnable positional encoding and add +1 dim for [CLS]\n        self.positional_encoding = nn.Parameter(torch.randn(1,self.patch_emb.num_patches + 1, config[\"vector_dim\"]))\n        self.droput = nn.Dropout(config[\"droput_prob\"])\n        \n    def forward(self,x):\n        x = self.patch_emb(x)\n        batch_size, _, _ = x.size()\n        # expand the [cls] token to batch size\n        #{1,1,vector_dim} -> (batch_size,1,hidden_size)\n        cls_tokens = self.cls_token.expand(batch_size,-1,-1)\n        \"\"\"\n        concatenating cls token to inputn sequence\n        size : {num_patches + 1}\n        \"\"\"\n        x = torch.cat((cls_tokens,x),dim = 1)\n        x = x + self.positional_encoding\n        return x   \n\nclass Attention(nn.Module):\n    \"\"\"\n    Attention module\n\n    Will be used in:\n        Multi-headed-attention Module\n    \"\"\"\n    def __init__(self,vector_dim,attention_head_size,dropout,bias = True):\n        \n        super().__init__()\n        self.vector_dim = vector_dim\n        self.attention_head_size = attention_head_size\n        self.dropout = nn.Dropout(dropout)\n        \n        # {query,key,value}\n        self.query = nn.Linear(vector_dim,attention_head_size, bias = bias)\n        self.key = nn.Linear(vector_dim, attention_head_size,bias = bias)\n        self.value = nn.Linear(vector_dim,attention_head_size,bias = bias)\n        \n    def forward(self,x):\n        query = self.query(x)\n        key = self.key(x)\n        value = self.value(x)\n        # i have them in matrix form\n        \n        similarity = torch.matmul(query,key.transpose(-1,-2))\n        attention_probs = nn.functional.softmax((similarity/math.sqrt(self.attention_head_size)),dim = 1)\n        attention_probs = self.dropout(attention_probs)\n        output = torch.matmul(attention_probs,value)\n        return output,attention_probs\n        \nclass MultiheadAttention(nn.Module):\n    \"\"\"\n    Multi-headed-attention module\n\n    Will be used in:\n        Transformer Encoder\n    \"\"\"\n    \n    def __init__(self,config):\n        super().__init()\n        self.vector_dim = config[\"vector_dim\"]\n        self.num_attention_heads = config[\"num_attention_heads\"]\n        \n        self.attention_head_size =  self.vector_sim // self.num_attention_heads\n        
self.all_head_size = self.num_attention_heads * self.attention_head_size\n        \n        self.qkv_bias = config[\"qkv_bias\"]\n        #creating a list of attention heads\n        self.heads = nn.ModuleList([])\n        for _ in range(self.num_attention_heads):\n            head = Attention(\n                self.vector_dim,\n                self.attention_head_size,\n                config[\"attention_probs_dropout_prob\"],\n                self.qkv_bias\n                )\n            self.heads.append(head)\n        \n        # project attention output back to vector dim\n        self.output_projection = nn.Linear(self.all_head_size,self.vector_dim)\n        self.output_dropout = nn.Dropout(config[\"hidden_dropout_prob\"])\n        \n    def forward(self,x,output_attentions = False):\n        attention_outputs = [head(x) for head in self.heads] # for each attention head\n        attention_output = torch.cat([attention_output for attention_output, _ in attention_outputs],dim=-1)\n        # Project the concatenated attention output back to the hidden size\n        attention_output = self.output_projection(attention_output)\n        attention_output = self.output_dropout(attention_output)\n        # Return the attention output and the attention probabilities (optional)\n        if not output_attentions:\n            return (attention_output, None)\n        else:\n            attention_probs = torch.stack([attention_probs for _, attention_probs in attention_outputs], dim=1)\n            return (attention_output, attention_probs)\n\nclass MLP(nn.Module):\n    \"\"\"\n    Multi-Layer Perceptron Module\n    \"\"\"\n    \n    def __init__(self,config):\n        super().__init__()\n        self.dense_1 = nn.Linear(config[\"vector_dim\"],config[\"hidden_size\"])\n        self.act = NewGELUActivation()\n        self.dense_2 = nn.Linear(config[\"hidden_size\"],config[\"vector_dim\"])\n        self.dropout = nn.Dropout(config[\"hidden_dropout_prob\"])\n        \n    def forward(self,x):\n        x = self.dense_1(x)\n        x = self.act(x)\n        x = self.dense_2(x)\n        x = self.dropout(x)\n        return x\n    \nclass Block(nn.Module):\n    \n    \"\"\"\n    Single transformer block\n    \"\"\"\n    \n    def __init__(self,config):\n        super().__init__()\n        self.attention = MultiheadAttention(config)\n        self.layer_norm1 = nn.LayerNorm(config[\"vector_dim\"])\n        self.mlp = MLP(config)\n        self.layernorm_2 = nn.LayerNorm(config[\"vector_dim\"])\n        \n    def forward(self,x,output_attentions = False):\n        # {self-attention after normalizing layers}\n        attention_output, attention_prob = self.attention(self.layer_norm1(x),output_attentions=output_attentions)\n        x = x + attention_output # {skip-connections}\\\n        mlp_output = self.mlp(self.layer_norm2(x)) #{ffn}\n        x = x + mlp_output\n        if not output_attentions:\n            return (x,None)\n        else: \n            return (x,attention_prob)\n\nclass Encoder(nn.Module):\n    \n    def __init__(self,config):\n        super().__init__()\n        \n        self.blocks = nn.ModuleList([])\n        for _ in range(config[\"num_hidden_layers\"]):\n            block = Block(config)\n            self.blocks.append(block)\n\n    def forward(self, x, output_attentions=False):\n        # Calculate the transformer block's output for each block\n        all_attentions = []\n        for block in self.blocks:\n            x, attention_probs = block(x, output_attentions=output_attentions)\n  
          if output_attentions:\n                all_attentions.append(attention_probs)\n        # Return the encoder's output and the attention probabilities (optional)\n        if not output_attentions:\n            return (x, None)\n        else:\n            return (x, all_attentions)\n\nclass Classification(nn.Module):\n    \n    \"\"\"\n    ViT model for classification\n    \"\"\"       \n        \n    def __init__(self,config):\n        super().__init__()\n        self.config = config\n        self.img_size = config[\"img_size\"]\n        self.vector_dim = config[\"vector_dim\"]\n        self.num_classes = config[\"num_classes\"]\n        \n        # follow the below pipepline :)\n        \n        self.embeddings = Embeddings(config)\n        self.encoder = Encoder(config)\n        self.classifier = nn.Linear(self.vector_dim,self.num_classes)\n        self.apply(self._init_weights)\n        \n    def forward(self, x, output_attentions=False):\n        # Calculate the embedding output\n        embedding_output = self.embedding(x)\n        # Calculate the encoder's output\n        encoder_output, all_attentions = self.encoder(embedding_output, output_attentions=output_attentions)\n        # Calculate the logits, take the [CLS] token's output as features for classification\n        logits = self.classifier(encoder_output[:, 0, :])\n        # Return the logits and the attention probabilities (optional)\n        if not output_attentions:\n            return (logits, None)\n        else:\n            return (logits, all_attentions)\n    \n    def _init_weights(self, module):\n        if isinstance(module, (nn.Linear, nn.Conv2d)):\n            torch.nn.init.normal_(module.weight, mean=0.0, std=self.config[\"initializer_range\"])\n            if module.bias is not None:\n                torch.nn.init.zeros_(module.bias)\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        elif isinstance(module, Embeddings):\n            module.position_embeddings.data = nn.init.trunc_normal_(\n                module.position_embeddings.data.to(torch.float32),\n                mean=0.0,\n                std=self.config[\"initializer_range\"],\n            ).to(module.position_embeddings.dtype)\n\n            \n            \n        "
  },
  {
    "path": "vit/src/base.py",
    "content": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torchvision\nimport torchvision.datasets as datasets\nimport torch.optim as optim\nfrom torch.utils.data import DataLoader\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport os\nfrom PIL import Image\nimport math\n\nclass NewGELUActivation(nn.Module):\n    \"\"\"\n    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also see\n    the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415\n\n    Taken from https://github.com/huggingface/transformers/blob/main/src/transformers/activations.py\n    \"\"\"\n\n    def forward(self, input):\n        return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))\n\n"
  },
  {
    "path": "vit/src/data.py",
    "content": "# Import libraries\nimport torch\nimport torchvision\nimport torchvision.transforms as transforms\n\n\ndef prepare_data(batch_size=4, num_workers=2, train_sample_size=None, test_sample_size=None):\n    train_transform = transforms.Compose(\n        [transforms.ToTensor(),\n        transforms.Resize((32, 32)),\n        transforms.RandomHorizontalFlip(p=0.5),\n        transforms.RandomResizedCrop((32, 32), scale=(0.8, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=2),\n        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])\n\n    trainset = torchvision.datasets.CIFAR10(root='./data', train=True,\n                                            download=True, transform=train_transform)\n    if train_sample_size is not None:\n        # Randomly sample a subset of the training set\n        indices = torch.randperm(len(trainset))[:train_sample_size]\n        trainset = torch.utils.data.Subset(trainset, indices)\n    \n\n\n    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,\n                                            shuffle=True, num_workers=num_workers)\n    \n    test_transform = transforms.Compose(\n        [transforms.ToTensor(),\n        transforms.Resize((32, 32)),\n        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])\n\n    testset = torchvision.datasets.CIFAR10(root='./data', train=False,\n                                        download=True, transform=test_transform)\n    if test_sample_size is not None:\n        # Randomly sample a subset of the test set\n        indices = torch.randperm(len(testset))[:test_sample_size]\n        testset = torch.utils.data.Subset(testset, indices)\n    \n    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,\n                                            shuffle=False, num_workers=num_workers)\n\n    classes = ('plane', 'car', 'bird', 'cat',\n            'deer', 'dog', 'frog', 'horse', 'ship', 'truck')\n    return trainloader, testloader, classes"
  },
  {
    "path": "vit/src/requirements.txt",
    "content": "python-apt==2.7.7+ubuntu3\npython-dateutil==2.8.2\npython-debian==0.1.49+ubuntu2\npython-multipart==0.0.12\ntorch==2.4.1\ntorchaudio==2.4.1\ntorchvision==0.19.1\nnumpy==1.26.4"
  },
  {
    "path": "vit/src/trainer.py",
    "content": "import torch\nimport torch.nn as nn\nimport argparse\n\nfrom data import *\nfrom utils import save_checkpoint, save_experiment\nfrom vit_base import Classification\n\nconfig = {\n    \"patch_size\": 4,  # Input image size: 32x32 -> 8x8 patches\n    \"vector_dim\": 48,\n    \"num_hidden_layers\": 4,\n    \"num_attention_heads\": 4,\n    \"hidden_size\": 4 * 48, # 4 * hidden_size\n    \"hidden_dropout_prob\": 0.0,\n    \"attention_probs_dropout_prob\": 0.0,\n    \"initializer_range\": 0.02,\n    \"image_size\": 32,\n    \"num_classes\": 10, # num_classes of CIFAR10\n    \"num_channels\": 3,\n    \"qkv_bias\": True,\n}\n\nassert config[\"vector_dim\"] % config[\"num_attention_heads\"] == 0\nassert config['hidden_size'] == 4 * config['vector_dim']\nassert config['image_size'] % config['patch_size'] == 0\n\nclass Trainer:\n    \"\"\"\n    simple trainer block\n    \"\"\"\n    \n    def __init__(self,model,optimizer,loss_fn,exp_name,device):\n        self.model = model.to(device)\n        self.optim = optimizer\n        self.loss  = loss_fn\n        self.exp_name = exp_name\n        self.device = device\n        \n    def train(self,train_loader,test_loader,epochs,save_exp_every_n_epochs = 0):\n        \n        train_losses, test_losses, accuracies = [],[],[]\n        \n        for i in range(epochs):\n            train_loss = self.train_epoch(train_loader)\n            accuracy = test_loss = self.evaluate(test_loader)\n            train_losses.append(train_loss)\n            test_losses.append(test_loss)\n            accuracies.append(accuracy)\n            print(f\"Epoch {i+1}, Train loss: {train_loss:.4f}, Test loss: {test_loss:.4f}, Accuracy: {accuracy:.4f}\")\n            if save_exp_every_n_epochs > 0 and (i+1) % save_exp_every_n_epochs == 0 and i+1 != epochs:\n                print('\\tSave checkpoint at epoch',i+1)\n                save_checkpoint(self.exp_name, self.model, i+1)\n        \n        save_experiment(self.exp_name, self.model, i+1)\n    \n    def train_epoch(self,train_loader):\n        \n        self.model.train()\n        total_loss = 0\n        for batch in train_loader:\n            \n            batch = [t.tp(self.device) for t in batch]\n            images, labels = batch\n            self.optimizer.zero_grad()\n            loss = self.loss(self.model(images)[0], labels)\n            loss.backward()\n            self.optimizer.step()\n            total_loss += loss.item()*len(images)\n            \n        return total_loss/ len(train_loader.dataset)\n    \n    \n    @torch.no_grad()\n    def evaluate(self,test_loader):\n        self.model.eval()\n        total_loss = 0\n        correct = 0\n        \n        with torch.no_grad():\n            \n            for batch in test_loader():\n                batch = [t.to(self.device) for t in batch]\n                images, labels = batch\n                \n                logits,_ = self.model(images)\n                \n                loss = self.loss(logits,labels)\n                total_loss += loss.item() * len(images)\n                \n                predictions = torch.argmax(logits, dim = 1)\n                correct = torch.sum(predictions == labels).item()\n                \n        accuracy = correct/ len(test_loader.dataset)\n        avg_loss = total_loss / len(test_loader.dataset)\n        return accuracy, avg_loss\n\n\ndef parse_args():\n    \n    import argparse\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--exp-name\", type = str, required = True)\n    
parser.add_argument(\"--batch-size\", type = int, default = 256)\n    parser.add_argument(\"--epochs\", type=int, default=100)\n    parser.add_argument(\"--lr\", type=float, default=1e-2)\n    parser.add_argument(\"--device\", type=str)\n    parser.add_argument(\"--save-model-every\", type=int, default=0)\n    \n    args = parser.parse_args()\n    if args.device is None:\n        args.device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n    return args\n\ndef main():\n    \n    args = parse_args()\n    \n    batch_size = args.batch_size\n    epochs = args.epochs\n    lr = args.lr\n    device = args.device\n    save_exp_every_n_epochs = args.save_model_every\n    \n    trainloader, testloader = prepare_data(batch_size = batch_size)\n    model = Classification(config)\n    \"\"\"\n    IF YOU WANT TO USE LORA TRAINING, UNCOMMENT THE BELOW LINES\n    def create_model():\n        model = Classification(config)\n        if config[\"use_lora\"]:\n            model = prepare_model_for_lora_training(model)\n        return model\n    \"\"\"\n    optimizer = torch.optim.AdamW(model.parameters(), lr = lr, weight_decay = 1e-2)\n    loss_fn = nn.CrossEntropyLoss()\n    trainer = Trainer(model, optimizer, loss_fn, args.exp_name, device = device)\n    trainer.train(trainloader, testloader, epochs, save_exp_every_n_epochs = save_exp_every_n_epochs)\n    \n    \nif __name__ == \"__main__\":\n    \n    main()"
  },
  {
    "path": "vit/src/utils.py",
    "content": "import json, os, math\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport torch\nfrom torch.nn import functional as F\nimport torchvision\nimport torchvision.transforms as transforms\n\nfrom ViT import Classification\n\ndef save_experiment(experiment_name,config,model,train_losses,test_losses,accuracies,base_dir = \"experiments\"):\n    outdir = os.path.join(base_dir,experiment_name)\n    os.makedirs(outdir, exist_ok = True)\n    \n    configfile = os.path.join(outdir,'config.json')\n    with open(configfile, 'w') as f:\n        json.dump(config,f,sort_keys = True,indent = 4)\n    \n    jsonfile = os.path.join(outdir, 'metrics.json')\n    with open(jsonfile, 'w') as f:\n        data = {\n            'train_losses': train_losses,\n            'test_losses': test_losses,\n            'accuracies': accuracies,\n        }\n        json.dump(data, f, sort_keys=True, indent=4)\n\n    save_checkpoint(experiment_name,model,\"final\",base_dir = base_dir)\n    \ndef save_checkpoint(experiment_name, model, epoch, base_dir=\"experiments\"):\n    outdir = os.path.join(base_dir, experiment_name)\n    os.makedirs(outdir, exist_ok=True)\n    cpfile = os.path.join(outdir, f'model_{epoch}.pt')\n    torch.save(model.state_dict(), cpfile)\n    \ndef load_experiment(experiment_name,checkpoint_name=\"model_final.pt\",base_dir = \"experiments\"):\n    outdir = os.path.join(base_dir,experiment_name)\n    configfile = os.path.join(outdir,'config.json')\n    with open(configfile,'r') as f:\n        config = json.load(f)\n        \n    jsonfile = os.path.join(outdir, 'config.json')\n    with open(jsonfile,'r') as f:\n        data = json.load(f)\n    train_losses = data['train_losses']\n    test_losses = data['test_losses']\n    accuracies = data['accuracies']\n    # Load the model\n    model = Classfication(config)\n    cpfile = os.path.join(outdir, checkpoint_name)\n    model.load_state_dict(torch.load(cpfile))\n    return config, model, train_losses, test_losses, accuracies\n\n"
  },
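  {
    "path": "vit/src/example_load_experiment.py",
    "content": "\"\"\"\nSketch of how the helpers in utils.py fit together after a training run:\n`load_experiment` reads back config.json, metrics.json and the final checkpoint\nwritten by `save_experiment` / `save_checkpoint`, so the training curves can be\ninspected offline. 'vit-cifar10' is a placeholder experiment name; this file is\nillustrative only.\n\"\"\"\n\nimport matplotlib.pyplot as plt\n\nfrom utils import load_experiment\n\nif __name__ == '__main__':\n    config, model, train_losses, test_losses, accuracies = load_experiment('vit-cifar10')\n\n    # Plot the loss curves that trainer.py logged via save_experiment.\n    plt.plot(train_losses, label='train loss')\n    plt.plot(test_losses, label='test loss')\n    plt.xlabel('epoch')\n    plt.legend()\n    plt.show()\n\n    print('final accuracy:', accuracies[-1])\n"
  },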
  {
    "path": "vit/src/vit_with_lora.py",
    "content": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport math\nfrom base import *\n\nclass LoRALayer(nn.Module):\n    \"\"\"Low-Rank Adaptation layer\"\"\"\n    def __init__(self, in_features, out_features, rank=4, alpha=16):\n        super(LoRALayer,self).__init__()\n        self.rank = rank\n        self.scaling = alpha / rank\n        \n        self.lora_A = nn.Parameter(torch.zeros(in_features, rank))\n        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))\n        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))\n        nn.init.zeros_(self.lora_B)\n\n    def forward(self, x):\n        x = x @ (self.lora_A @ self.lora_B) * self.scaling\n        return x\n    \nclass PatchEmbeddings(nn.Module):\n    \"\"\"\n    Convert the image into patches and then project them into a vector space.\n    \"\"\"\n\n    def __init__(self, config):\n        super().__init__()\n        self.image_size = config[\"image_size\"]\n        self.patch_size = config[\"patch_size\"]\n        self.num_channels = config[\"num_channels\"]\n        self.vector_dim = config[\"vector_dim\"]\n        self.num_patches = (self.image_size // self.patch_size) ** 2\n        self.projection = nn.Conv2d(self.num_channels, self.vector_dim, kernel_size=self.patch_size, stride=self.patch_size)\n\n    def forward(self, x):\n        # {batch_size, num_channels, image_size, image_size}-> {batch_size, num_patches, vector_dim}\n        x = self.projection(x)\n        x = x.flatten(2).transpose(1, 2)\n        return x\n\nclass Embeddings(nn.Module):\n    \n    \"\"\"\n    adding positional information to extracted patch embeddings\n    \"\"\"\n    \n    def __init__(self,config):\n        self.config = config\n        self.patch_emb = PatchEmbeddings(config)\n        self.cls_token = nn.Parameter(torch.randn(1,1,config[\"vector_dim\"]))\n        self.positional_encoding = nn.Parameter(torch.randn(1,self.patch_emb.num_patches + 1, config[\"vector_dim\"]))\n        self.droput = nn.Dropout(config[\"droput_prob\"])\n        \n    def forward(self,x):\n        x = self.patch_emb(x)\n        batch_size, _, _ = x.size()\n        # expand the [cls] token to batch size\n        #{1,1,vector_dim} -> (batch_size,1,hidden_size)\n        cls_tokens = self.cls_token.expand(batch_size,-1,-1)\n        \"\"\"\n        concatenating cls token to inputn sequence\n        size : {num_patches + 1}\n        \"\"\"\n        x = torch.cat((cls_tokens,x),dim = 1)\n        x = x + self.positional_encoding\n        return x\n    \nclass Attention(nn.Module):\n    \"\"\"\n    Attention module with LoRA Support\n    \"\"\"\n    def __init__(self,vector_dim,attention_head_size,dropout,bias=True, use_lora=False, lora_rank=8, lora_alpha=16):\n        super().__init__()\n        self.vector_dim = vector_dim\n        self.attention_head_size = attention_head_size\n        self.dropout = nn.Dropout(dropout)\n        self.use_lora = use_lora\n        self.query = nn.Linear(vector_dim, attention_head_size, bias = bias)\n        self.key = nn.Linear(vector_dim, attention_head_size, bias = bias)\n        self.value = nn.Linear(vector_dim, attention_head_size, bias = bias)\n\n        if use_lora:\n            self.lora_q = LoRALayer(vector_dim, attention_head_size, lora_rank, lora_alpha)\n            self.lora_v = LoRALayer(vector_dim, attention_head_size, lora_rank, lora_alpha)\n        \n    def forward(self, x):\n\n        q = self.query(x)\n        key = self.key(x)\n        v = self.value(x)\n\n        if 
self.use_lora:\n            query = q + self.lora_q(x)\n            value = v + self.lora_v(x)\n        \n        similarity = torch.matmul(query, key.transpose(-1,-2))\n        attention_probs = F.softmax((similarity/math.sqrt(self.attention_head_size)),dim = 1)\n        attention_probs = self.dropout(attention_probs)\n        output = torch.matmul(attention_probs, value)\n        return output, attention_probs\n    \nclass MultiheadAttention(nn.Module):\n    \"\"\"\n    Multi-headed-attention module with LoRA support\n    \"\"\"\n    def __init__(self, config):\n        super().__init__()\n        self.vector_dim = config[\"vector_dim\"]\n        self.num_attention_heads = config[\"num_attention_heads\"]\n        \n        self.attention_head_size = self.vector_dim // self.num_attention_heads\n        self.all_head_size = self.num_attention_heads * self.attention_head_size\n        \n        self.qkv_bias = config[\"qkv_bias\"]\n        self.use_lora = config.get(\"use_lora\", False)  \n        self.lora_rank = config.get(\"lora_rank\", 8)    \n        self.lora_alpha = config.get(\"lora_alpha\", 16) \n        \n        self.heads = nn.ModuleList([\n            Attention(\n                self.vector_dim,\n                self.attention_head_size,\n                config[\"attention_probs_dropout_prob\"],\n                self.qkv_bias,\n                self.use_lora,\n                self.lora_rank,\n                self.lora_alpha\n            )\n            for _ in range(self.num_attention_heads)\n        ])\n        \n        self.output_projection = nn.Linear(self.all_head_size, self.vector_dim)\n        self.output_dropout = nn.Dropout(config[\"hidden_dropout_prob\"])\n    \n    def forward(self, x, output_attentions=False):\n        attention_outputs = [head(x) for head in self.heads]\n        attention_output = torch.cat(\n            [attention_output for attention_output, _ in attention_outputs],\n            dim=-1\n        )\n                \n        attention_output = self.output_projection(attention_output)\n        attention_output = self.output_dropout(attention_output)\n        \n        if not output_attentions:\n            return (attention_output, None)\n        \n        attention_probs = torch.stack(\n            [attention_probs for _, attention_probs in attention_outputs],\n            dim=1\n        )\n        return (attention_output, attention_probs)\n\nclass MLP(nn.Module):\n    \"\"\"\n    Multi-Layer Perceptron Module with LoRA support\n    \"\"\"\n    \n    def __init__(self, config):\n        super().__init__()\n        self.use_lora = config.get(\"use_lora\", False)\n        self.lora_rank = config.get(\"lora_rank\", 8)\n        self.lora_alpha = config.get(\"lora_alpha\", 16)\n        self.dense_1 = nn.Linear(config[\"vector_dim\"], config[\"hidden_size\"])\n        self.dense_2 = nn.Linear(config[\"hidden_size\"], config[\"vector_dim\"])\n        \n        if self.use_lora:\n            self.lora_1 = LoRALayer(\n                config[\"vector_dim\"],\n                config[\"hidden_size\"],\n                self.lora_rank,\n                self.lora_alpha\n            )\n            self.lora_2 = LoRALayer(\n                config[\"hidden_size\"],\n                config[\"vector_dim\"],\n                self.lora_rank,\n                self.lora_alpha\n            )\n        \n        self.act = NewGELUActivation()\n        self.dropout = nn.Dropout(config[\"hidden_dropout_prob\"])\n        \n    def forward(self, x):\n        hidden = 
self.dense_1(x)\n        if self.use_lora:\n            hidden = hidden + self.lora_1(x)\n        hidden = self.act(hidden)\n        output = self.dense_2(hidden)\n        if self.use_lora:\n            output = output + self.lora_2(hidden)\n        output = self.dropout(output)\n        return output\n\ndef prepare_mlp_for_lora_training(model):\n    \"\"\"Freeze all parameters except LoRA parameters\"\"\"\n    for name, param in model.named_parameters():\n        if 'lora' not in name:\n            param.requires_grad = False\n        else:\n            param.requires_grad = True\n    return model\n\nclass Block(nn.Module):\n\n    \"single transformer block with LoRA support\"\n\n    def __init__(self, config):\n        super().__init_()\n        self.attention = MultiheadAttention(config)\n        self.layer_norm1 = nn.LayerNorm(config[\"vector_dim\"])\n        self.mlp = MLP(config)\n        self.layer_norm2 = nn.LayerNorm(config[\"vector_dim\"])\n    \n    def forward(self, x, output_attentions = False):\n        attention_output, attention_probs = self.attention(self.layer_norm1(x), output_attentions=output_attentions)\n        x = x + attention_output\n        mlp_output = self.mlp(self.layer_norm2(x))\n        x = x + mlp_output\n        if not output_attentions:\n            return (x, None)\n        else:\n            return (x, attention_probs)\n        \nclass Encoder(nn.Module):\n    \"\"\"\n    Transformer encoder with LoRA support\n    \"\"\"\n    \n    def __init__(self, config):\n        super().__init__()\n        self.blocks = nn.ModuleList([\n            Block(config) for _ in range(config[\"num_hidden_layers\"])\n        ])\n\n    def forward(self, x, output_attentions=False):\n        all_attentions = []\n        \n        for block in self.blocks:\n            x, attention_probs = block(x, output_attentions=output_attentions)\n            if output_attentions:\n                all_attentions.append(attention_probs)\n                \n        if not output_attentions:\n            return (x, None)\n        else:\n            return (x, all_attentions)\n        \nclass LoRALinear(nn.Module):\n    \"\"\"\n    Linear layer with LoRA support for classification head\n    \"\"\"\n    def __init__(self, in_features, out_features, rank=8, alpha=16):\n        super().__init__()\n        self.linear = nn.Linear(in_features, out_features)\n        self.lora = LoRALayer(in_features, out_features, rank, alpha)\n        \n    def forward(self, x):\n        return self.linear(x) + self.lora(x)\n    \nclass Classification(nn.Module):\n    \"\"\"\n    ViT model for classification with LoRA support\n    \"\"\"\n    \n    def __init__(self, config):\n        super().__init__()\n        self.config = config\n        self.img_size = config[\"img_size\"]\n        self.vector_dim = config[\"vector_dim\"]\n        self.num_classes = config[\"num_classes\"]\n        \n        # Initialize components\n        self.embeddings = Embeddings(config)\n        self.encoder = Encoder(config)\n        \n        # Use LoRA for classifier if enabled\n        if config.get(\"use_lora\", False):\n            self.classifier = LoRALinear(\n                self.vector_dim,\n                self.num_classes,\n                config.get(\"lora_rank\", 8),\n                config.get(\"lora_alpha\", 16)\n            )\n        else:\n            self.classifier = nn.Linear(self.vector_dim, self.num_classes)\n            \n        self.apply(self._init_weights)\n        \n    def forward(self, x, 
output_attentions=False):\n        embedding_output = self.embeddings(x)\n        encoder_output, all_attentions = self.encoder(\n            embedding_output,\n            output_attentions=output_attentions\n        )\n        \n        # Use CLS token for classification\n        logits = self.classifier(encoder_output[:, 0, :])\n        \n        if not output_attentions:\n            return (logits, None)\n        else:\n            return (logits, all_attentions)\n    \n    def _init_weights(self, module):\n        if isinstance(module, (nn.Linear, nn.Conv2d)):\n            torch.nn.init.normal_(\n                module.weight,\n                mean=0.0,\n                std=self.config[\"initializer_range\"]\n            )\n            if module.bias is not None:\n                torch.nn.init.zeros_(module.bias)\n        elif isinstance(module, nn.LayerNorm):\n            module.bias.data.zero_()\n            module.weight.data.fill_(1.0)\n        elif isinstance(module, Embeddings):\n            module.position_embeddings.data = nn.init.trunc_normal_(\n                module.position_embeddings.data.to(torch.float32),\n                mean=0.0,\n                std=self.config[\"initializer_range\"],\n            ).to(module.position_embeddings.dtype)\n\ndef prepare_model_for_lora_training(model):\n    \"\"\"\n    Prepare the model for LoRA training by freezing non-LoRA parameters\n    \"\"\"\n    for name, param in model.named_parameters():\n        if 'lora' not in name:\n            param.requires_grad = False\n        else:\n            param.requires_grad = True\n    return model"
  },
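  {
    "path": "vit/src/example_lora_finetune.py",
    "content": "\"\"\"\nMinimal usage sketch for the LoRA variant in vit_with_lora.py, following the\nhint left in trainer.py: enable `use_lora` in the config, freeze everything\nexcept the LoRA parameters with `prepare_model_for_lora_training`, and build an\noptimizer over the parameters that are still trainable. The config values and\nthis file are illustrative only.\n\"\"\"\n\nimport torch\n\nfrom vit_with_lora import Classification, prepare_model_for_lora_training\n\nconfig = {\n    'patch_size': 4,\n    'vector_dim': 48,\n    'num_hidden_layers': 4,\n    'num_attention_heads': 4,\n    'hidden_size': 4 * 48,\n    'hidden_dropout_prob': 0.0,\n    'attention_probs_dropout_prob': 0.0,\n    'initializer_range': 0.02,\n    'image_size': 32,\n    'num_classes': 10,\n    'num_channels': 3,\n    'qkv_bias': True,\n    # LoRA-specific keys, read via config.get(...) in vit_with_lora.py\n    'use_lora': True,\n    'lora_rank': 8,\n    'lora_alpha': 16,\n}\n\nif __name__ == '__main__':\n    model = prepare_model_for_lora_training(Classification(config))\n\n    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n    total = sum(p.numel() for p in model.parameters())\n    print(f'trainable params: {trainable} / {total}')\n\n    # Only the LoRA matrices receive gradients, so the optimizer only needs them.\n    optimizer = torch.optim.AdamW(\n        (p for p in model.parameters() if p.requires_grad), lr=1e-3\n    )\n    logits, _ = model(torch.randn(2, 3, 32, 32))\n    print(logits.shape)  # torch.Size([2, 10])\n"
  },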
  {
    "path": "vit/visualize/vis.py",
    "content": "import torch\nimport torch.nn as nn\nimport matplotlib.pyplot as plt\nimport torchvision.transforms as T\nfrom torchvision.utils import make_grid\n\nclass PatchEmbedding(nn.Module):\n    def __init__(self, num_patches, vector_dim, patch_size):\n        super(PatchEmbedding, self).__init__()\n        self.conv = nn.Conv2d(3, vector_dim, kernel_size=patch_size, stride=patch_size)\n    \n    def forward(self, x):\n        x = self.conv(x)\n        return x\n\ninput_image = torch.randn(1, 3, 224, 224)  # {creating a dummy image to vis}\n\nvector_dim = 256\npatch_size = 16\n\npatch_embedding = PatchEmbedding(3, vector_dim, patch_size)\n\noutput = patch_embedding(input_image)\n\ndef visualize_patches(input_image, patch_size):\n    \"\"\"\n    visualizing patches and attention maps\n    \"\"\"\n    input_image = input_image.squeeze(0).permute(1, 2, 0).numpy()\n    \n    fig, ax = plt.subplots()\n    ax.imshow(input_image)\n    \n    for i in range(0, input_image.shape[0], patch_size):\n        ax.axhline(i, color='red')\n    for j in range(0, input_image.shape[1], patch_size):\n        ax.axvline(j, color='red')\n    \n    plt.title(\"Input Image with Patches\")\n    plt.show()\n\nvisualize_patches(input_image, patch_size)\n\ndef visualize_feature_maps(feature_maps, num_maps_to_show=8):\n    maps_to_show = feature_maps[0, :num_maps_to_show, :, :]\n    \n    grid = make_grid(maps_to_show.unsqueeze(1), nrow=4, normalize=True, scale_each=True)\n    \n    plt.figure(figsize=(15, 15))\n    plt.imshow(grid.permute(1, 2, 0).cpu().numpy())\n    plt.title(\"Feature Maps\")\n    plt.axis('off')\n    plt.show()\n\nvisualize_feature_maps(output, num_maps_to_show=8)\n"
  }
]