[ { "path": "README-CN.md", "content": "# ComfyUI-Long-CLIP\nThis project is a ComfyUI implementation of Long-CLIP. It currently supports replacing clip-l. For SD1.5, load Long-CLIP with the SeaArtLongClip node and use it in place of the model's original CLIP; this extends the token length from 77 to 248. In our tests, Long-CLIP improved the quality of the generated images. For SDXL, no Long-CLIP version of clip-g has been released yet, so we process prompts as follows: the shorter token sequence is padded in integer multiples of its original max_len, and because the extra positions at the end are pad tokens, the excess is trimmed. Because clip-g features account for a larger share in SDXL, the main effect you are likely to see is more detail in the image. If you like this project, please give us a thumbs up.\n\n## Start\n```\ngit clone https://github.com/comfyanonymous/ComfyUI.git\ncd ComfyUI/custom_nodes\ngit clone git@github.com:SeaArtLab/ComfyUI-Long-CLIP.git\n```\nDownload [LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L) to models/clip (the loader nodes list files from ComfyUI's clip model folder). Thanks to [Long-CLIP](https://github.com/beichenzbc/Long-CLIP/tree/main) for releasing the weights. Once LongCLIP-G weights are released, we will support them as well!\n\n## Workflow\nWe have prepared example workflows for SD1.5 and SDXL. To keep the demonstration simple, the examples are minimal and require no additional plugins.\n![SD1.5](./image/SD1-5-long.png)\n![SDXL](./image/SDXL-long.png)" }, { "path": "README.md", "content": "# ComfyUI-Long-CLIP (Now with Flux Support)\nThis project is a ComfyUI implementation of Long-CLIP. It currently supports replacing clip-l. For SD1.5, the SeaArtLongClip node loads Long-CLIP in place of the model's original CLIP, extending the token length from 77 to 248. In our tests, Long-CLIP improved the quality of the generated images. For SDXL, no Long-CLIP version of clip-g has been released yet, so we process prompts as follows: the shorter token sequence is padded in integer multiples of its original max_len, and because the extra positions at the end are pad tokens, the excess is trimmed. Because clip-g features account for a larger share in SDXL, you may mainly notice more detail in the generated images. 
Finally, if you like this project, please give us a thumbs up.\n\nThanks to the work of [zer0int](https://github.com/zer0int), Long-CLIP now also supports Flux.\n\n## Start\n```\ngit clone https://github.com/comfyanonymous/ComfyUI.git\ncd ComfyUI/custom_nodes\ngit clone git@github.com:SeaArtLab/ComfyUI-Long-CLIP.git\n```\nDownload [LongCLIP-L](https://huggingface.co/BeichenZhang/LongCLIP-L) to models/clip (the loader nodes list files from ComfyUI's clip model folder). Thanks to [Long-CLIP](https://github.com/beichenzbc/Long-CLIP/tree/main) for releasing the weights. Once LongCLIP-G weights are released, we will support them as well!\n\n## Workflow\nWe have prepared example workflows for SD1.5, SDXL, and Flux. To keep the demonstration simple, the examples are minimal and require no additional plugins. This plugin also supports operations such as clip-skip.\n![SD1.5](./image/SD1-5-long.png)\n![SDXL](./image/SDXL-long.png)\n![FLUX.1](./image/Flux.1-long.png)" }, { "path": "__init__.py", "content": "from . 
import long_clip as long_clip\n\nNODE_CLASS_MAPPINGS = {\n \"SeaArtLongClip\": long_clip.SeaArtLongClip,\n \"SeaArtLongXLClipMerge\": long_clip.SeaArtLongXLClipMerge,\n \"LongCLIPTextEncodeFlux\": long_clip.LongCLIPTextEncodeFlux,\n}\n\nNODE_DISPLAY_NAME_MAPPINGS = {\n \"SeaArtLongClip\": \"SeaArtLongClip\",\n \"SeaArtLongXLClipMerge\": \"SeaArtLongXLClipMerge\",\n \"LongCLIPTextEncodeFlux\": \"LongCLIPTextEncodeFlux\",\n}\n" }, { "path": "long_clip.py", "content": "from .long_clip_model import longclip\nimport os\nimport torch\nfrom comfy.sd import CLIP\nimport folder_paths\nfrom comfy.sd1_clip import load_embed,ClipTokenWeightEncoder\nfrom comfy.model_management import get_torch_device\nfrom comfy import model_management\nimport comfy\n\n\nclass SDLongClipModel(torch.nn.Module, ClipTokenWeightEncoder):\n \"\"\"Uses the CLIP transformer encoder for text (from huggingface)\"\"\"\n LAYERS = [\n \"last\",\n \"pooled\",\n \"hidden\"\n ]\n def __init__(self, version=\"openai/clip-vit-large-patch14\", device=\"cpu\", max_length=77,\n freeze=True, layer=\"last\", layer_idx=None, dtype=None,\n special_tokens={\"start\": 49406, \"end\": 49407, \"pad\": 49407}, layer_norm_hidden_state=True, enable_attention_masks=False, return_projected_pooled=True, **kwargs): # clip-vit-base-patch32\n super().__init__()\n\n assert layer in self.LAYERS\n\n self.transformer, _ = longclip.load(version, device=device)\n\n self.num_layers = self.transformer.transformer_layers\n\n self.max_length = max_length\n if freeze:\n self.freeze()\n self.layer = layer\n self.layer_idx = None\n self.special_tokens = special_tokens\n self.text_projection = torch.nn.Parameter(torch.eye(self.transformer.get_input_embeddings().weight.shape[1]))\n self.logit_scale = torch.nn.Parameter(torch.tensor(4.6055))\n self.enable_attention_masks = enable_attention_masks\n\n self.layer_norm_hidden_state = layer_norm_hidden_state\n self.return_projected_pooled = return_projected_pooled\n if layer == \"hidden\":\n assert 
layer_idx is not None\n assert abs(layer_idx) < self.num_layers\n self.clip_layer(layer_idx)\n self.layer_default = (self.layer, self.layer_idx)\n self.options_default = (self.layer, self.layer_idx, self.return_projected_pooled)\n\n self.dtypes = [param.dtype for param in self.parameters()]\n\n def freeze(self):\n self.transformer = self.transformer.eval()\n #self.train = disabled_train\n for param in self.parameters():\n param.requires_grad = False\n\n def clip_layer(self, layer_idx):\n if abs(layer_idx) > self.num_layers:\n self.layer = \"last\"\n else:\n self.layer = \"hidden\"\n self.layer_idx = layer_idx\n\n def reset_clip_layer(self):\n self.layer = self.layer_default[0]\n self.layer_idx = self.layer_default[1]\n\n def set_clip_options(self, options):\n layer_idx = options.get(\"layer\", self.layer_idx)\n self.return_projected_pooled = options.get(\"projected_pooled\", self.return_projected_pooled)\n if layer_idx is None or abs(layer_idx) > self.num_layers:\n self.layer = \"last\"\n else:\n self.layer = \"hidden\"\n self.layer_idx = layer_idx\n\n def reset_clip_options(self):\n self.layer = self.options_default[0]\n self.layer_idx = self.options_default[1]\n self.return_projected_pooled = self.options_default[2]\n\n def set_up_textual_embeddings(self, tokens, current_embeds):\n out_tokens = []\n next_new_token = token_dict_size = current_embeds.weight.shape[0] - 1\n embedding_weights = []\n\n for x in tokens:\n tokens_temp = []\n for y in x:\n if isinstance(y, int):\n if y == token_dict_size: #EOS token\n y = -1\n tokens_temp += [y]\n else:\n if y.shape[0] == current_embeds.weight.shape[1]:\n embedding_weights += [y]\n tokens_temp += [next_new_token]\n next_new_token += 1\n else:\n print(\"WARNING: shape mismatch when trying to apply embedding, embedding will be ignored\", y.shape[0], current_embeds.weight.shape[1])\n while len(tokens_temp) < len(x):\n tokens_temp += [self.special_tokens[\"pad\"]]\n out_tokens += [tokens_temp]\n\n n = token_dict_size\n if 
len(embedding_weights) > 0:\n new_embedding = torch.nn.Embedding(next_new_token + 1, current_embeds.weight.shape[1], device=current_embeds.weight.device, dtype=current_embeds.weight.dtype)\n new_embedding.weight[:token_dict_size] = current_embeds.weight[:-1]\n for x in embedding_weights:\n new_embedding.weight[n] = x\n n += 1\n new_embedding.weight[n] = current_embeds.weight[-1] #EOS embedding\n self.transformer.set_input_embeddings(new_embedding)\n\n processed_tokens = []\n for x in out_tokens:\n processed_tokens += [list(map(lambda a: n if a == -1 else a, x))] #The EOS token should always be the largest one\n\n return processed_tokens\n\n def forward(self, tokens):\n backup_embeds = self.transformer.get_input_embeddings()\n device = backup_embeds.weight.device\n tokens = self.set_up_textual_embeddings(tokens, backup_embeds)\n tokens = torch.LongTensor(tokens).to(device)\n\n attention_mask = None\n if self.enable_attention_masks:\n attention_mask = torch.zeros_like(tokens)\n max_token = self.transformer.get_input_embeddings().weight.shape[0] - 1\n for x in range(attention_mask.shape[0]):\n for y in range(attention_mask.shape[1]):\n attention_mask[x, y] = 1\n if tokens[x, y] == max_token:\n break\n\n outputs = self.transformer(tokens, attention_mask, intermediate_output=self.layer_idx, final_layer_norm_intermediate=self.layer_norm_hidden_state)\n self.transformer.set_input_embeddings(backup_embeds)\n\n if self.layer == \"last\":\n z = outputs[0]\n else:\n z = outputs[1]\n\n pooled_output = None\n if len(outputs) >= 3:\n if not self.return_projected_pooled and len(outputs) >= 4 and outputs[3] is not None:\n pooled_output = outputs[3].float()\n elif outputs[2] is not None:\n pooled_output = outputs[2].float()\n\n return z.float(), pooled_output\n\n def encode(self, tokens):\n return self(tokens)\n\n def load_sd(self, sd):\n if \"text_projection\" in sd:\n self.text_projection[:] = sd.pop(\"text_projection\")\n if \"text_projection.weight\" in sd:\n 
self.text_projection[:] = sd.pop(\"text_projection.weight\").transpose(0, 1)\n return self.transformer.load_state_dict(sd, strict=False)\n \nclass SDLongTokenizer:\n def __init__(self, max_length=248, pad_with_end=True, embedding_directory=None, tokenizer_data=None, embedding_size=768, embedding_key='clip_l', has_start_token=True, pad_to_max_length=True):\n self.tokenizer = longclip.only_tokenize ##tokenizer_class.from_pretrained(tokenizer_path)\n self.max_length = max_length\n empty = self.tokenizer('')[0]\n if has_start_token:\n self.tokens_start = 1\n self.start_token = empty[0]\n self.end_token = empty[1]\n else:\n self.tokens_start = 0\n self.start_token = None\n self.end_token = empty[0]\n self.pad_with_end = pad_with_end\n self.pad_to_max_length = pad_to_max_length\n\n ##vocab = self.tokenizer.get_vocab()\n ##self.inv_vocab = {v: k for k, v in vocab.items()}\n self.embedding_directory = embedding_directory\n self.max_word_length = 8\n self.embedding_identifier = \"embedding:\"\n self.embedding_size = embedding_size\n self.embedding_key = embedding_key\n self.tokenizer_data = tokenizer_data\n\n def _try_get_embedding(self, embedding_name:str):\n '''\n Takes a potential embedding name and tries to retrieve it.\n Returns a Tuple consisting of the embedding and any leftover string, embedding can be None.\n '''\n embed = load_embed(embedding_name, self.embedding_directory, self.embedding_size, self.embedding_key)\n if embed is None:\n stripped = embedding_name.strip(',')\n if len(stripped) < len(embedding_name):\n embed = load_embed(stripped, self.embedding_directory, self.embedding_size, self.embedding_key)\n return (embed, embedding_name[len(stripped):])\n return (embed, \"\")\n\n\n def tokenize_with_weights(self, text:str, return_word_ids=False):\n '''\n Takes a prompt and converts it to a list of (token, weight, word id) elements.\n Tokens can both be integer tokens and pre computed CLIP tensors.\n Word id values are unique per word and embedding, where the 
id 0 is reserved for non word tokens.\n Returned list has the dimensions NxM where M is the input size of CLIP\n '''\n if self.pad_with_end:\n pad_token = self.end_token\n else:\n pad_token = 0\n from comfy.sd1_clip import token_weights,escape_important,unescape_important\n\n text = escape_important(text)\n parsed_weights = token_weights(text, 1.0)\n\n tokens = []\n for weighted_segment, weight in parsed_weights:\n to_tokenize = unescape_important(weighted_segment).replace(\"\\n\", \" \").split(' ')\n to_tokenize = [x for x in to_tokenize if x != \"\"]\n for word in to_tokenize:\n #if we find an embedding, deal with the embedding\n if word.startswith(self.embedding_identifier) and self.embedding_directory is not None:\n embedding_name = word[len(self.embedding_identifier):].strip('\\n')\n embed, leftover = self._try_get_embedding(embedding_name)\n if embed is None:\n print(f\"warning, embedding:{embedding_name} does not exist, ignoring\")\n else:\n if len(embed.shape) == 1:\n tokens.append([(embed, weight)])\n else:\n tokens.append([(embed[x], weight) for x in range(embed.shape[0])])\n #if we accidentally have leftover text, continue parsing using leftover, else move on to next word\n if leftover != \"\":\n word = leftover\n else:\n continue\n #parse word\n tokens.append([(t, weight) for t in self.tokenizer(word)[0][self.tokens_start:-1]])\n\n #reshape token array to CLIP input size\n batched_tokens = []\n batch = []\n if self.start_token is not None:\n batch.append((self.start_token, 1.0, 0))\n batched_tokens.append(batch)\n for i, t_group in enumerate(tokens):\n #determine if we're going to try and keep the tokens in a single batch\n is_large = len(t_group) >= self.max_word_length\n\n while len(t_group) > 0:\n if len(t_group) + len(batch) > self.max_length - 1:\n remaining_length = self.max_length - len(batch) - 1\n #break word in two and add end token\n if is_large:\n batch.extend([(t,w,i+1) for t,w in t_group[:remaining_length]])\n batch.append((self.end_token, 
1.0, 0))\n t_group = t_group[remaining_length:]\n #add end token and pad\n else:\n batch.append((self.end_token, 1.0, 0))\n if self.pad_to_max_length:\n batch.extend([(pad_token, 1.0, 0)] * (remaining_length))\n #start new batch\n batch = []\n if self.start_token is not None:\n batch.append((self.start_token, 1.0, 0))\n batched_tokens.append(batch)\n else:\n batch.extend([(t,w,i+1) for t,w in t_group])\n t_group = []\n\n #fill last batch\n batch.append((self.end_token, 1.0, 0))\n if self.pad_to_max_length:\n batch.extend([(pad_token, 1.0, 0)] * (self.max_length - len(batch)))\n\n if not return_word_ids:\n batched_tokens = [[(t, w) for t, w,_ in x] for x in batched_tokens]\n\n return batched_tokens\n\n\n def untokenize(self, token_weight_pair):\n return list(map(lambda a: (a, self.inv_vocab[a[0]]), token_weight_pair))\n\ndef pad_tokens(tokens,clip,add_token_num):\n if clip.pad_with_end:\n pad_token = clip.end_token\n else:\n pad_token = 0\n while add_token_num > 0:\n batch = []\n batch.append((clip.end_token, 1.0, 0))\n add_pad = clip.max_length - 1\n batch.extend([(pad_token, 1.0, 0)] * add_pad)\n tokens.append(batch)\n add_token_num -= (add_pad+1)\n return tokens\n\ndef token_num(tokens):\n n = 0\n for token in tokens:\n n += len(token)\n return n\n\nclass SDXLLongClipModel(torch.nn.Module):\n def __init__(self):\n super().__init__()\n self.clip_l = None\n self.clip_g = None\n\n def set_clip_options(self, options):\n self.clip_l.set_clip_options(options)\n self.clip_g.set_clip_options(options)\n\n def reset_clip_options(self):\n self.clip_g.reset_clip_options()\n self.clip_l.reset_clip_options()\n\n def encode_token_weights(self, token_weight_pairs):\n token_weight_pairs_g = token_weight_pairs[\"g\"]\n token_weight_pairs_l = token_weight_pairs[\"l\"]\n g_out, g_pooled = self.clip_g.encode_token_weights(token_weight_pairs_g)\n l_out, l_pooled = self.clip_l.encode_token_weights(token_weight_pairs_l)\n g_tokens = g_out.shape[1]\n l_tokens = l_out.shape[1]\n 
min_tokens = min(g_tokens,l_tokens)\n g_out = g_out[:,:min_tokens,:]\n l_out = l_out[:,:min_tokens,:]\n return torch.cat([l_out, g_out], dim=-1), g_pooled\n\n def load_sd(self, sd):\n if \"text_model.encoder.layers.30.mlp.fc1.weight\" in sd:\n return self.clip_g.load_sd(sd)\n else:\n return self.clip_l.load_sd(sd)\n\nclass SDXLLongTokenizer:\n def __init__(self):\n self.clip_l = None\n self.clip_g = None\n\n def tokenize_with_weights(self, text:str, return_word_ids=False):\n out = {}\n out[\"g\"] = self.clip_g.tokenize_with_weights(text, return_word_ids)\n out[\"l\"] = self.clip_l.tokenize_with_weights(text, return_word_ids)\n g_tokens = token_num(out[\"g\"])\n l_tokens = token_num(out[\"l\"])\n if g_tokens > l_tokens:\n out[\"l\"] = pad_tokens(out[\"l\"],self.clip_l,g_tokens-l_tokens)\n elif l_tokens > g_tokens:\n out[\"g\"] = pad_tokens(out[\"g\"],self.clip_g,l_tokens-g_tokens)\n return out\n\n def untokenize(self, token_weight_pair):\n return self.clip_g.untokenize(token_weight_pair)\n\nclass LongCLIPFluxModel(torch.nn.Module):\n def __init__(self):\n super().__init__()\n self.clip_l = None\n self.t5xxl = None\n\n def set_clip_options(self, options):\n self.clip_l.set_clip_options(options)\n self.t5xxl.set_clip_options(options)\n\n def reset_clip_options(self):\n self.clip_l.reset_clip_options()\n self.t5xxl.reset_clip_options()\n\n def encode_token_weights(self, token_weight_pairs):\n token_weight_pairs_l = token_weight_pairs[\"l\"]\n token_weight_pairs_t5 = token_weight_pairs[\"t5xxl\"]\n\n # Encode using Long-CLIP\n l_out, l_pooled = self.clip_l.encode_token_weights(token_weight_pairs_l)\n # Encode using T5XXL\n t5_out, t5_pooled = self.t5xxl.encode_token_weights(token_weight_pairs_t5)\n\n return t5_out, l_pooled\n\n def load_sd(self, sd):\n if \"text_model.encoder.layers.1.mlp.fc1.weight\" in sd:\n return self.clip_l.load_sd(sd)\n else:\n return self.t5xxl.load_sd(sd)\n\nclass LongCLIPFluxTokenizer:\n def __init__(self):\n self.clip_l = None\n self.t5xxl = 
None\n\n def tokenize_with_weights(self, text: str, return_word_ids=False):\n # Tokenize with both Long-CLIP and T5XXL\n out = {}\n out[\"l\"] = self.clip_l.tokenize_with_weights(text, return_word_ids) # Long-CLIP tokenization\n out[\"t5xxl\"] = self.t5xxl.tokenize_with_weights(text, return_word_ids) # T5XXL tokenization\n\n # Check the number of tokens\n l_tokens = token_num(out[\"l\"])\n t5_tokens = token_num(out[\"t5xxl\"])\n\n # Leaving this here as a reminder: Do NOT pad T5XXL!\n if l_tokens > t5_tokens:\n pass # Do not pad T5XXL\n\n return out\n\n def untokenize(self, token_weight_pair):\n # Untokenize using Long-CLIP tokenizer\n return self.clip_l.untokenize(token_weight_pair)\n\n def state_dict(self):\n return {}\n\nclass SeaArtLongXLClipMerge:\n @classmethod\n def INPUT_TYPES(cls):\n return {\"required\": { \"clip_name\": (folder_paths.get_filename_list(\"clip\"), ),\n \"clip\": (\"CLIP\", ),\n }}\n\n CATEGORY = \"SeaArt\"\n RETURN_TYPES = (\"CLIP\",)\n FUNCTION = \"do\"\n\n def do(self, clip_name, clip):\n clip_clone = clip.clone()\n clip_path = folder_paths.get_full_path(\"clip\", clip_name)\n load_device = model_management.text_encoder_device()\n device = model_management.text_encoder_offload_device()\n dtype = model_management.text_encoder_dtype(load_device)\n clip_l = SDLongClipModel(version=clip_path,layer=\"hidden\", layer_idx=-2, device=device, dtype=dtype, layer_norm_hidden_state=False)\n sdxl_long_clip_model = SDXLLongClipModel()\n sdxl_long_clip_model.clip_l = clip_l\n sdxl_long_clip_model.clip_g = clip_clone.cond_stage_model.clip_g\n clip_clone.cond_stage_model = sdxl_long_clip_model\n embedding_directory = folder_paths.get_folder_paths(\"embeddings\")\n long_tokenizer = SDXLLongTokenizer()\n tokenizer_clip_l = SDLongTokenizer(embedding_directory=embedding_directory)\n long_tokenizer.clip_l = tokenizer_clip_l\n long_tokenizer.clip_g = clip_clone.tokenizer.clip_g\n clip_clone.tokenizer = long_tokenizer\n return (clip_clone,)\n\nclass 
SeaArtLongClip:\n @classmethod\n def INPUT_TYPES(cls):\n return {\"required\": { \"clip_name\": (folder_paths.get_filename_list(\"clip\"), ),\n }}\n\n CATEGORY = \"SeaArt\"\n RETURN_TYPES = (\"CLIP\",)\n FUNCTION = \"do\"\n\n def do(self, clip_name):\n class EmptyClass:\n pass\n clip_target = EmptyClass()\n clip_path = folder_paths.get_full_path(\"clip\", clip_name)\n clip_target.params = {\"version\":clip_path}\n clip_target.clip = SDLongClipModel\n clip_target.tokenizer = SDLongTokenizer\n embedding_directory = folder_paths.get_folder_paths(\"embeddings\")\n clip = CLIP(clip_target, embedding_directory=embedding_directory)\n return (clip,)\n \nclass LongCLIPTextEncodeFlux:\n @classmethod\n def INPUT_TYPES(cls):\n return {\"required\": {\n \"clip_name\": (folder_paths.get_filename_list(\"clip\"), ),\n \"clip\": (\"CLIP\", ),\n }}\n\n CATEGORY = \"SeaArt\"\n RETURN_TYPES = (\"CLIP\",)\n FUNCTION = \"do\"\n\n def do(self, clip_name, clip):\n clip_clone = clip.clone()\n clip_path = folder_paths.get_full_path(\"clip\", clip_name)\n load_device = model_management.text_encoder_device()\n device = model_management.text_encoder_offload_device()\n dtype = model_management.text_encoder_dtype(load_device)\n longclip_model = SDLongClipModel(version=clip_path, layer=\"hidden\", layer_idx=-2, device=device, dtype=dtype, max_length=248)\n flux_clip_model = LongCLIPFluxModel()\n flux_clip_model.clip_l = longclip_model\n flux_clip_model.t5xxl = clip_clone.cond_stage_model.t5xxl\n clip_clone.cond_stage_model = flux_clip_model\n long_tokenizer = LongCLIPFluxTokenizer()\n long_tokenizer.clip_l = SDLongTokenizer(embedding_directory=clip_clone.tokenizer.clip_l.embedding_directory, max_length=248)\n long_tokenizer.t5xxl = clip_clone.tokenizer.t5xxl\n clip_clone.tokenizer = long_tokenizer\n return (clip_clone,)\n" }, { "path": "long_clip_model/longclip.py", "content": "import hashlib\nimport os\nimport urllib\nimport warnings\nfrom typing import Any, Union, List\nfrom torch import 
nn\nimport torch\nfrom PIL import Image\nfrom torchvision.transforms import Compose, Resize, CenterCrop, ToTensor, Normalize\nfrom tqdm import tqdm\nfrom safetensors.torch import load_file\n\nfrom .model_longclip import build_model\nfrom .simple_tokenizer import SimpleTokenizer as _Tokenizer\n\ntry:\n    # import the submodule explicitly so that packaging.version is always available\n    import packaging.version\nexcept ImportError:\n    from pkg_resources import packaging\n\ntry:\n    from torchvision.transforms import InterpolationMode\n    BICUBIC = InterpolationMode.BICUBIC\nexcept ImportError:\n    BICUBIC = Image.BICUBIC\n\n\nif packaging.version.parse(torch.__version__) < packaging.version.parse(\"1.7.1\"):\n    warnings.warn(\"PyTorch version 1.7.1 or higher is recommended\")\n\n\n__all__ = [\"load\", \"tokenize\"]\n_tokenizer = _Tokenizer()\n\n\ndef _convert_image_to_rgb(image):\n    return image.convert(\"RGB\")\n\n\ndef _transform(n_px):\n    return Compose([\n        Resize(n_px, interpolation=BICUBIC),\n        CenterCrop(n_px),\n        _convert_image_to_rgb,\n        ToTensor(),\n        Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),\n    ])\n\n\n\ndef load(name: str, device: Union[str, torch.device] = \"cuda\" if torch.cuda.is_available() else \"cpu\", download_root: str = None):\n    \"\"\"Load a long CLIP model\n\n    Parameters\n    ----------\n    name : str\n        The path to a Long-CLIP model checkpoint (a `.safetensors` file or a file readable by `torch.load`) containing the state_dict\n\n    device : Union[str, torch.device]\n        The device to put the loaded model\n\n    Returns\n    -------\n    model : torch.nn.Module\n        The CLIP model\n\n    preprocess : Callable[[PIL.Image], torch.Tensor]\n        A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input\n    \"\"\"\n\n    model_path = name\n\n    if model_path.endswith(\".safetensors\"):\n        state_dict = load_file(model_path, device=\"cpu\")\n    else:\n        state_dict = torch.load(model_path, map_location=\"cpu\")\n\n    # build directly from the checkpoint's state_dict; no model object exists yet to fall back on\n    model = build_model(state_dict, load_from_clip=False).to(device)\n\n    if str(device) 
== \"cpu\":\n        model.float()\n\n    return model, _transform(model.visual.input_resolution)\n\n\ndef load_from_clip(name: str, device: Union[str, torch.device] = \"cuda\" if torch.cuda.is_available() else \"cpu\", jit: bool = False, download_root: str = None):\n    \"\"\"Load from CLIP model for 
fine-tuning \n\n Parameters\n ----------\n name : str\n A model name listed by `clip.available_models()`, or the path to a model checkpoint containing the state_dict\n\n device : Union[str, torch.device]\n The device to put the loaded model\n\n jit : bool\n Whether to load the optimized JIT model or more hackable non-JIT model (default).\n\n download_root: str\n path to download the model files; by default, it uses \"~/.cache/clip\"\n\n Returns\n -------\n model : torch.nn.Module\n The CLIP model\n\n preprocess : Callable[[PIL.Image], torch.Tensor]\n A torchvision transform that converts a PIL image into a tensor that the returned model can take as its input\n \"\"\"\n\n _MODELS = {\n \"RN50\": \"https://openaipublic.azureedge.net/clip/models/afeb0e10f9e5a86da6080e35cf09123aca3b358a0c3e3b6c78a7b63bc04b6762/RN50.pt\",\n \"RN101\": \"https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt\",\n \"RN50x4\": \"https://openaipublic.azureedge.net/clip/models/7e526bd135e493cef0776de27d5f42653e6b4c8bf9e0f653bb11773263205fdd/RN50x4.pt\",\n \"RN50x16\": \"https://openaipublic.azureedge.net/clip/models/52378b407f34354e150460fe41077663dd5b39c54cd0bfd2b27167a4a06ec9aa/RN50x16.pt\",\n \"RN50x64\": \"https://openaipublic.azureedge.net/clip/models/be1cfb55d75a9666199fb2206c106743da0f6468c9d327f3e0d0a543a9919d9c/RN50x64.pt\",\n \"ViT-B/32\": \"https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt\",\n \"ViT-B/16\": \"https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt\",\n \"ViT-L/14\": \"https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt\",\n \"ViT-L/14@336px\": \"https://openaipublic.azureedge.net/clip/models/3035c92b350959924f9f00213499208652fc7ea050643e8b385c2dac08641f02/ViT-L-14-336px.pt\",\n }\n\n def 
available_models() -> List[str]:\n        \"\"\"Returns the names of available CLIP models\"\"\"\n        return list(_MODELS.keys())\n\n    def _download(url: str, root: str):\n        os.makedirs(root, exist_ok=True)\n        filename = os.path.basename(url)\n\n        # the second-to-last URL path component is the file's expected SHA256 digest\n        expected_sha256 = url.split(\"/\")[-2]\n        download_target = os.path.join(root, filename)\n\n        if os.path.exists(download_target) and not os.path.isfile(download_target):\n            raise RuntimeError(f\"{download_target} exists and is not a regular file\")\n\n        if os.path.isfile(download_target):\n            if hashlib.sha256(open(download_target, \"rb\").read()).hexdigest() == expected_sha256:\n                return download_target\n            else:\n                warnings.warn(f\"{download_target} exists, but the SHA256 checksum does not match; re-downloading the file\")\n\n        with urllib.request.urlopen(url) as source, open(download_target, \"wb\") as output:\n            with tqdm(total=int(source.info().get(\"Content-Length\")), ncols=80, unit='iB', unit_scale=True, unit_divisor=1024) as loop:\n                while True:\n                    buffer = source.read(8192)\n                    if not buffer:\n                        break\n\n                    output.write(buffer)\n                    loop.update(len(buffer))\n\n        if hashlib.sha256(open(download_target, \"rb\").read()).hexdigest() != expected_sha256:\n            raise RuntimeError(\"Model has been downloaded but the SHA256 checksum does not match\")\n\n        return download_target\n\n    if name in _MODELS:\n        model_path = _download(_MODELS[name], download_root or os.path.expanduser(\"~/.cache/clip\"))\n    elif os.path.isfile(name):\n        model_path = name\n    else:\n        raise RuntimeError(f\"Model {name} not found; available models = {available_models()}\")\n\n    with open(model_path, 'rb') as opened_file:\n        try:\n            # loading JIT archive\n            model = torch.jit.load(opened_file, map_location=device if jit else \"cpu\").eval()\n            state_dict = None\n        except RuntimeError:\n            # loading saved state dict\n            if jit:\n                warnings.warn(f\"File {model_path} is not a JIT archive. 
Loading as a state dict instead\")\n jit = False\n state_dict = torch.load(opened_file, map_location=\"cpu\")\n\n model = build_model(state_dict or model.state_dict(), load_from_clip = True).to(device)\n \n positional_embedding_pre = model.positional_embedding.type(model.dtype)\n \n length, dim = positional_embedding_pre.shape\n keep_len = 20\n posisitonal_embedding_new = torch.zeros([4*length-3*keep_len, dim], dtype=model.dtype)\n for i in range(keep_len):\n posisitonal_embedding_new[i] = positional_embedding_pre[i]\n for i in range(length-1-keep_len):\n posisitonal_embedding_new[4*i + keep_len] = positional_embedding_pre[i + keep_len]\n posisitonal_embedding_new[4*i + 1 + keep_len] = 3*positional_embedding_pre[i + keep_len]/4 + 1*positional_embedding_pre[i+1+keep_len]/4\n posisitonal_embedding_new[4*i + 2+keep_len] = 2*positional_embedding_pre[i+keep_len]/4 + 2*positional_embedding_pre[i+1+keep_len]/4\n posisitonal_embedding_new[4*i + 3+keep_len] = 1*positional_embedding_pre[i+keep_len]/4 + 3*positional_embedding_pre[i+1+keep_len]/4\n\n posisitonal_embedding_new[4*length -3*keep_len - 4] = positional_embedding_pre[length-1] + 0*(positional_embedding_pre[length-1] - positional_embedding_pre[length-2])/4\n posisitonal_embedding_new[4*length -3*keep_len - 3] = positional_embedding_pre[length-1] + 1*(positional_embedding_pre[length-1] - positional_embedding_pre[length-2])/4\n posisitonal_embedding_new[4*length -3*keep_len - 2] = positional_embedding_pre[length-1] + 2*(positional_embedding_pre[length-1] - positional_embedding_pre[length-2])/4\n posisitonal_embedding_new[4*length -3*keep_len - 1] = positional_embedding_pre[length-1] + 3*(positional_embedding_pre[length-1] - positional_embedding_pre[length-2])/4\n \n positional_embedding_res = posisitonal_embedding_new.clone()\n \n model.positional_embedding = nn.Parameter(posisitonal_embedding_new, requires_grad=False)\n model.positional_embedding_res = nn.Parameter(positional_embedding_res, requires_grad=True)\n\n if 
str(device) == \"cpu\":\n        model.float()\n    return model, _transform(model.visual.input_resolution)\n\n\ndef only_tokenize(texts: Union[str, List[str]]) -> List[List[int]]:\n    # returns plain Python lists of token ids (no padding, no tensor conversion)\n    if isinstance(texts, str):\n        texts = [texts]\n\n    sot_token = _tokenizer.encoder[\"<|startoftext|>\"]\n    
eot_token = _tokenizer.encoder[\"<|endoftext|>\"]\n all_tokens = [[sot_token] + _tokenizer.encode(text) + [eot_token] for text in texts]\n return all_tokens\n\ndef tokenize(texts: Union[str, List[str]], context_length: int = 77*4-60, truncate: bool = False) -> Union[torch.IntTensor, torch.LongTensor]:\n \"\"\"\n Returns the tokenized representation of given input string(s)\n\n Parameters\n ----------\n texts : Union[str, List[str]]\n An input string or a list of input strings to tokenize\n\n context_length : int\n The context length to use; defaults to 248 (77*4-60) for Long-CLIP, whereas the original CLIP models use 77\n\n truncate: bool\n Whether to truncate the text in case its encoding is longer than the context length\n\n Returns\n -------\n A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length].\n We return LongTensor when torch version is <1.8.0, since older index_select requires indices to be long.\n \"\"\"\n if isinstance(texts, str):\n texts = [texts]\n\n sot_token = _tokenizer.encoder[\"<|startoftext|>\"]\n eot_token = _tokenizer.encoder[\"<|endoftext|>\"]\n all_tokens = [[sot_token] + _tokenizer.encode(text) + [eot_token] for text in texts]\n if packaging.version.parse(torch.__version__) < packaging.version.parse(\"1.8.0\"):\n result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)\n else:\n result = torch.zeros(len(all_tokens), context_length, dtype=torch.int)\n\n for i, tokens in enumerate(all_tokens):\n if len(tokens) > context_length:\n if truncate:\n tokens = tokens[:context_length]\n tokens[-1] = eot_token\n else:\n raise RuntimeError(f\"Input {texts[i]} is too long for context length {context_length}\")\n result[i, :len(tokens)] = torch.tensor(tokens)\n\n return result\n" }, { "path": "long_clip_model/model_longclip.py", "content": "from collections import OrderedDict\nfrom typing import Tuple, Union\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\n\n\nclass
Bottleneck(nn.Module):\n expansion = 4\n\n def __init__(self, inplanes, planes, stride=1):\n super().__init__()\n\n # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1\n self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)\n self.bn1 = nn.BatchNorm2d(planes)\n self.relu1 = nn.ReLU(inplace=True)\n\n self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)\n self.bn2 = nn.BatchNorm2d(planes)\n self.relu2 = nn.ReLU(inplace=True)\n\n self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()\n\n self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)\n self.bn3 = nn.BatchNorm2d(planes * self.expansion)\n self.relu3 = nn.ReLU(inplace=True)\n\n self.downsample = None\n self.stride = stride\n\n if stride > 1 or inplanes != planes * Bottleneck.expansion:\n # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1\n self.downsample = nn.Sequential(OrderedDict([\n (\"-1\", nn.AvgPool2d(stride)),\n (\"0\", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),\n (\"1\", nn.BatchNorm2d(planes * self.expansion))\n ]))\n\n def forward(self, x: torch.Tensor):\n identity = x\n\n out = self.relu1(self.bn1(self.conv1(x)))\n out = self.relu2(self.bn2(self.conv2(out)))\n out = self.avgpool(out)\n out = self.bn3(self.conv3(out))\n\n if self.downsample is not None:\n identity = self.downsample(x)\n\n out += identity\n out = self.relu3(out)\n return out\n\n\nclass AttentionPool2d(nn.Module):\n def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):\n super().__init__()\n self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)\n self.k_proj = nn.Linear(embed_dim, embed_dim)\n self.q_proj = nn.Linear(embed_dim, embed_dim)\n self.v_proj = nn.Linear(embed_dim, embed_dim)\n self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)\n self.num_heads = num_heads\n\n def 
forward(self, x):\n x = x.flatten(start_dim=2).permute(2, 0, 1) # NCHW -> (HW)NC\n x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0) # (HW+1)NC\n x = x + self.positional_embedding[:, None, :].to(x.dtype) # (HW+1)NC\n x, _ = F.multi_head_attention_forward(\n query=x[:1], key=x, value=x,\n embed_dim_to_check=x.shape[-1],\n num_heads=self.num_heads,\n q_proj_weight=self.q_proj.weight,\n k_proj_weight=self.k_proj.weight,\n v_proj_weight=self.v_proj.weight,\n in_proj_weight=None,\n in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),\n bias_k=None,\n bias_v=None,\n add_zero_attn=False,\n dropout_p=0,\n out_proj_weight=self.c_proj.weight,\n out_proj_bias=self.c_proj.bias,\n use_separate_proj_weight=True,\n training=self.training,\n need_weights=False\n )\n return x.squeeze(0)\n\n\nclass ModifiedResNet(nn.Module):\n \"\"\"\n A ResNet class that is similar to torchvision's but contains the following changes:\n - There are now 3 \"stem\" convolutions as opposed to 1, with an average pool instead of a max pool.\n - Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1\n - The final pooling layer is a QKV attention instead of an average pool\n \"\"\"\n\n def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):\n super().__init__()\n self.output_dim = output_dim\n self.input_resolution = input_resolution\n\n # the 3-layer stem\n self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)\n self.bn1 = nn.BatchNorm2d(width // 2)\n self.relu1 = nn.ReLU(inplace=True)\n self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)\n self.bn2 = nn.BatchNorm2d(width // 2)\n self.relu2 = nn.ReLU(inplace=True)\n self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)\n self.bn3 = nn.BatchNorm2d(width)\n self.relu3 = nn.ReLU(inplace=True)\n self.avgpool = nn.AvgPool2d(2)\n\n # residual layers\n self._inplanes 
= width # this is a *mutable* variable used during construction\n self.layer1 = self._make_layer(width, layers[0])\n self.layer2 = self._make_layer(width * 2, layers[1], stride=2)\n self.layer3 = self._make_layer(width * 4, layers[2], stride=2)\n self.layer4 = self._make_layer(width * 8, layers[3], stride=2)\n\n embed_dim = width * 32 # the ResNet feature dimension\n self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)\n\n def _make_layer(self, planes, blocks, stride=1):\n layers = [Bottleneck(self._inplanes, planes, stride)]\n\n self._inplanes = planes * Bottleneck.expansion\n for _ in range(1, blocks):\n layers.append(Bottleneck(self._inplanes, planes))\n\n return nn.Sequential(*layers)\n\n def forward(self, x):\n def stem(x):\n x = self.relu1(self.bn1(self.conv1(x)))\n x = self.relu2(self.bn2(self.conv2(x)))\n x = self.relu3(self.bn3(self.conv3(x)))\n x = self.avgpool(x)\n return x\n\n x = x.type(self.conv1.weight.dtype)\n x = stem(x)\n x = self.layer1(x)\n x = self.layer2(x)\n x = self.layer3(x)\n x = self.layer4(x)\n x = self.attnpool(x)\n\n return x\n\n\nclass LayerNorm(nn.LayerNorm):\n \"\"\"Subclass torch's LayerNorm to handle fp16.\"\"\"\n\n def forward(self, x: torch.Tensor):\n orig_type = x.dtype\n ret = super().forward(x.type(torch.float32))\n return ret.type(orig_type)\n\n\nclass QuickGELU(nn.Module):\n def forward(self, x: torch.Tensor):\n return x * torch.sigmoid(1.702 * x)\n\n\nclass ResidualAttentionBlock(nn.Module):\n def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):\n super().__init__()\n\n self.attn = nn.MultiheadAttention(d_model, n_head)\n self.ln_1 = LayerNorm(d_model)\n self.mlp = nn.Sequential(OrderedDict([\n (\"c_fc\", nn.Linear(d_model, d_model * 4)),\n (\"gelu\", QuickGELU()),\n (\"c_proj\", nn.Linear(d_model * 4, d_model))\n ]))\n self.ln_2 = LayerNorm(d_model)\n self.attn_mask = attn_mask\n\n def attention(self, x: torch.Tensor):\n self.attn_mask = 
self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None\n return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]\n\n def forward(self, x: torch.Tensor):\n x = x + self.attention(self.ln_1(x))\n x = x + self.mlp(self.ln_2(x))\n return x\n\n\nclass Transformer(nn.Module):\n def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None):\n super().__init__()\n self.width = width\n self.layers = layers\n self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])\n\n def forward(self, x: torch.Tensor, intermediate_output=None, attn_mask: torch.Tensor = None):\n if intermediate_output is not None:\n if intermediate_output < 0:\n intermediate_output = self.layers + intermediate_output\n\n intermediate = None\n for i, l in enumerate(self.resblocks):\n l.attn_mask = attn_mask\n x = l(x)\n if i == intermediate_output:\n intermediate = x.clone()\n return x, intermediate\n\nclass VisionTransformer(nn.Module):\n def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int):\n super().__init__()\n self.input_resolution = input_resolution\n self.output_dim = output_dim\n self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)\n\n scale = width ** -0.5\n self.class_embedding = nn.Parameter(scale * torch.randn(width))\n self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))\n self.ln_pre = LayerNorm(width)\n\n self.transformer = Transformer(width, layers, heads)\n\n self.ln_post = LayerNorm(width)\n self.proj = nn.Parameter(scale * torch.randn(width, output_dim))\n\n def forward(self, x: torch.Tensor):\n x = self.conv1(x) # shape = [*, width, grid, grid]\n x = x.reshape(x.shape[0], x.shape[1], -1) # shape = [*, width, grid ** 2]\n x = x.permute(0, 2, 1) # shape = [*, grid ** 2, width]\n x = 
torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1) # shape = [*, grid ** 2 + 1, width]\n x = x + self.positional_embedding.to(x.dtype)\n x = self.ln_pre(x)\n\n x = x.permute(1, 0, 2) # NLD -> LND\n x, _ = self.transformer(x) # Transformer.forward returns (output, intermediate)\n x = x.permute(1, 0, 2) # LND -> NLD\n\n x = self.ln_post(x[:, 0, :])\n\n if self.proj is not None:\n x = x @ self.proj\n\n return x\n\n\nclass CLIP(nn.Module):\n def __init__(self,\n embed_dim: int,\n # vision\n image_resolution: int,\n vision_layers: Union[Tuple[int, int, int, int], int],\n vision_width: int,\n vision_patch_size: int,\n # text\n context_length: int,\n vocab_size: int,\n transformer_width: int,\n transformer_heads: int,\n transformer_layers: int, \n load_from_clip: bool\n ):\n super().__init__()\n\n self.context_length = 248\n\n if isinstance(vision_layers, (tuple, list)):\n vision_heads = vision_width * 32 // 64\n self.visual = ModifiedResNet(\n layers=vision_layers,\n output_dim=embed_dim,\n heads=vision_heads,\n input_resolution=image_resolution,\n width=vision_width\n )\n else:\n vision_heads = vision_width // 64\n self.visual = VisionTransformer(\n input_resolution=image_resolution,\n patch_size=vision_patch_size,\n width=vision_width,\n layers=vision_layers,\n heads=vision_heads,\n output_dim=embed_dim\n )\n self.transformer_layers = transformer_layers\n self.transformer = Transformer(\n width=transformer_width,\n layers=transformer_layers,\n heads=transformer_heads,\n attn_mask=self.build_attention_mask()\n )\n\n self.vocab_size = vocab_size\n self.token_embedding = nn.Embedding(vocab_size, transformer_width)\n\n if load_from_clip == False:\n self.positional_embedding = nn.Parameter(torch.empty(248, transformer_width))\n self.positional_embedding_res = nn.Parameter(torch.empty(248, transformer_width))\n\n else:\n self.positional_embedding = nn.Parameter(torch.empty(77, transformer_width))\n\n self.ln_final = LayerNorm(transformer_width)\n\n
self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))\n self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))\n\n self.initialize_parameters()\n self.mask1 = torch.zeros([248, 1])\n self.mask1[:20, :] = 1\n self.mask2 = torch.zeros([248, 1])\n self.mask2[20:, :] = 1\n\n \n def initialize_parameters(self):\n nn.init.normal_(self.token_embedding.weight, std=0.02)\n nn.init.normal_(self.positional_embedding, std=0.01)\n\n if isinstance(self.visual, ModifiedResNet):\n if self.visual.attnpool is not None:\n std = self.visual.attnpool.c_proj.in_features ** -0.5\n nn.init.normal_(self.visual.attnpool.q_proj.weight, std=std)\n nn.init.normal_(self.visual.attnpool.k_proj.weight, std=std)\n nn.init.normal_(self.visual.attnpool.v_proj.weight, std=std)\n nn.init.normal_(self.visual.attnpool.c_proj.weight, std=std)\n\n for resnet_block in [self.visual.layer1, self.visual.layer2, self.visual.layer3, self.visual.layer4]:\n for name, param in resnet_block.named_parameters():\n if name.endswith(\"bn3.weight\"):\n nn.init.zeros_(param)\n\n proj_std = (self.transformer.width ** -0.5) * ((2 * self.transformer.layers) ** -0.5)\n attn_std = self.transformer.width ** -0.5\n fc_std = (2 * self.transformer.width) ** -0.5\n for block in self.transformer.resblocks:\n nn.init.normal_(block.attn.in_proj_weight, std=attn_std)\n nn.init.normal_(block.attn.out_proj.weight, std=proj_std)\n nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)\n nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)\n\n if self.text_projection is not None:\n nn.init.normal_(self.text_projection, std=self.transformer.width ** -0.5)\n\n def build_attention_mask(self):\n # lazily create causal attention mask, with full attention between the vision tokens\n # pytorch uses additive attention mask; fill with -inf\n mask = torch.empty(self.context_length, self.context_length)\n mask.fill_(float(\"-inf\"))\n mask.triu_(1) # zero out the lower diagonal\n return mask\n\n @property\n def 
dtype(self):\n return self.visual.conv1.weight.dtype\n\n def get_input_embeddings(self):\n return self.token_embedding\n\n def set_input_embeddings(self, embeddings):\n self.token_embedding = embeddings\n\n def encode_image(self, image):\n return self.visual(image.type(self.dtype))\n\n def encode_text(self, text): \n x = self.token_embedding(text).type(self.dtype) # [batch_size, n_ctx, d_model]\n \n x = x + (self.positional_embedding.to(x.device) * self.mask1.to(x.device)).type(self.dtype).to(x.device) + (self.positional_embedding_res.to(x.device) * self.mask2.to(x.device)).type(self.dtype).to(x.device) \n \n x = x.permute(1, 0, 2) # NLD -> LND\n x, _ = self.transformer(x) # Transformer.forward returns (output, intermediate)\n x = x.permute(1, 0, 2) # LND -> NLD\n x = self.ln_final(x).type(self.dtype)\n\n # x.shape = [batch_size, n_ctx, transformer.width]\n # take features from the eot embedding (eot_token is the highest number in each sequence)\n x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection\n\n return x\n\n def encode_text_full(self, text): \n x = self.token_embedding(text).type(self.dtype) # [batch_size, n_ctx, d_model]\n \n x = x + (self.positional_embedding.to(x.device) * self.mask1.to(x.device)).type(self.dtype).to(x.device) + (self.positional_embedding_res.to(x.device) * self.mask2.to(x.device)).type(self.dtype).to(x.device) \n \n x = x.permute(1, 0, 2) # NLD -> LND\n x, _ = self.transformer(x) # Transformer.forward returns (output, intermediate)\n x = x.permute(1, 0, 2) # LND -> NLD\n x = self.ln_final(x).type(self.dtype)\n\n # x.shape = [batch_size, n_ctx, transformer.width]\n # take features from the eot embedding (eot_token is the highest number in each sequence)\n #x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection\n\n return x\n\n def forward(self, input_tokens, attention_mask=None, intermediate_output=None, final_layer_norm_intermediate=True): \n x = self.token_embedding(input_tokens).type(self.dtype) # [batch_size, n_ctx, d_model]\n x = x + (self.positional_embedding.to(x.device) *
self.mask1.to(x.device)).type(self.dtype).to(x.device) + (self.positional_embedding_res.to(x.device) * self.mask2.to(x.device)).type(self.dtype).to(x.device) \n \n mask = None\n if attention_mask is not None:\n mask = 1.0 - attention_mask.to(x.dtype).reshape((attention_mask.shape[0], 1, -1, attention_mask.shape[-1])).expand(attention_mask.shape[0], 1, attention_mask.shape[-1], attention_mask.shape[-1])\n mask = mask.masked_fill(mask.to(torch.bool), float(\"-inf\"))\n\n causal_mask = torch.empty(x.shape[1], x.shape[1], dtype=x.dtype, device=x.device).fill_(float(\"-inf\")).triu_(1)\n if mask is not None:\n mask += causal_mask\n else:\n mask = causal_mask\n\n x = x.permute(1, 0, 2) # NLD -> LND\n x,i = self.transformer(x,intermediate_output=intermediate_output, attn_mask= mask)\n x = x.permute(1, 0, 2) # LND -> NLD\n x = self.ln_final(x).type(self.dtype)\n if i is not None and final_layer_norm_intermediate:\n i = self.ln_final(i).type(self.dtype)\n if i is not None:\n i = i.permute(1, 0, 2) # LND -> NLD\n\n pooled_output = x[torch.arange(x.shape[0], device=x.device), input_tokens.to(dtype=torch.int, device=x.device).argmax(dim=-1),]\n\n return x,i,pooled_output\n\n\ndef convert_weights(model: nn.Module):\n \"\"\"Convert applicable model parameters to fp16\"\"\"\n\n def _convert_weights_to_fp16(l):\n if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):\n l.weight.data = l.weight.data.half()\n if l.bias is not None:\n l.bias.data = l.bias.data.half()\n\n if isinstance(l, nn.MultiheadAttention):\n for attr in [*[f\"{s}_proj_weight\" for s in [\"in\", \"q\", \"k\", \"v\"]], \"in_proj_bias\", \"bias_k\", \"bias_v\"]:\n tensor = getattr(l, attr)\n if tensor is not None:\n tensor.data = tensor.data.half()\n\n for name in [\"text_projection\", \"proj\"]:\n if hasattr(l, name):\n attr = getattr(l, name)\n if attr is not None:\n attr.data = attr.data.half()\n\n model.apply(_convert_weights_to_fp16)\n\n\ndef build_model(state_dict: dict, load_from_clip: bool):\n vit = 
\"visual.proj\" in state_dict\n\n if vit:\n vision_width = state_dict[\"visual.conv1.weight\"].shape[0]\n vision_layers = len([k for k in state_dict.keys() if k.startswith(\"visual.\") and k.endswith(\".attn.in_proj_weight\")])\n vision_patch_size = state_dict[\"visual.conv1.weight\"].shape[-1]\n grid_size = round((state_dict[\"visual.positional_embedding\"].shape[0] - 1) ** 0.5)\n image_resolution = vision_patch_size * grid_size\n else:\n counts: list = [len(set(k.split(\".\")[2] for k in state_dict if k.startswith(f\"visual.layer{b}\"))) for b in [1, 2, 3, 4]]\n vision_layers = tuple(counts)\n vision_width = state_dict[\"visual.layer1.0.conv1.weight\"].shape[0]\n output_width = round((state_dict[\"visual.attnpool.positional_embedding\"].shape[0] - 1) ** 0.5)\n vision_patch_size = None\n assert output_width ** 2 + 1 == state_dict[\"visual.attnpool.positional_embedding\"].shape[0]\n image_resolution = output_width * 32\n\n embed_dim = state_dict[\"text_projection\"].shape[1]\n context_length = state_dict[\"positional_embedding\"].shape[0]\n vocab_size = state_dict[\"token_embedding.weight\"].shape[0]\n transformer_width = state_dict[\"ln_final.weight\"].shape[0]\n transformer_heads = transformer_width // 64\n transformer_layers = len(set(k.split(\".\")[2] for k in state_dict if k.startswith(\"transformer.resblocks\")))\n\n model = CLIP(\n embed_dim,\n image_resolution, vision_layers, vision_width, vision_patch_size,\n context_length, vocab_size, transformer_width, transformer_heads, transformer_layers, load_from_clip\n )\n\n for key in [\"input_resolution\", \"context_length\", \"vocab_size\"]:\n if key in state_dict:\n del state_dict[key]\n\n convert_weights(model)\n model.load_state_dict(state_dict)\n return model.eval()\n" }, { "path": "long_clip_model/simple_tokenizer.py", "content": "import gzip\nimport html\nimport os\nfrom functools import lru_cache\n\nimport ftfy\nimport regex as re\n\n\n@lru_cache()\ndef default_bpe():\n return 
os.path.join(os.path.dirname(os.path.abspath(__file__)), \"bpe_simple_vocab_16e6.txt.gz\")\n\n\n@lru_cache()\ndef bytes_to_unicode():\n \"\"\"\n Returns a mapping from utf-8 bytes to corresponding unicode strings.\n The reversible bpe codes work on unicode strings.\n This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.\n When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.\n This is a significant percentage of your normal, say, 32K bpe vocab.\n To avoid that, we want lookup tables between utf-8 bytes and unicode strings.\n This also avoids mapping to whitespace/control characters the bpe code barfs on.\n \"\"\"\n bs = list(range(ord(\"!\"), ord(\"~\")+1))+list(range(ord(\"¡\"), ord(\"¬\")+1))+list(range(ord(\"®\"), ord(\"ÿ\")+1))\n cs = bs[:]\n n = 0\n for b in range(2**8):\n if b not in bs:\n bs.append(b)\n cs.append(2**8+n)\n n += 1\n cs = [chr(n) for n in cs]\n return dict(zip(bs, cs))\n\n\ndef get_pairs(word):\n \"\"\"Return set of symbol pairs in a word.\n Word is represented as tuple of symbols (symbols being variable-length strings).\n \"\"\"\n pairs = set()\n prev_char = word[0]\n for char in word[1:]:\n pairs.add((prev_char, char))\n prev_char = char\n return pairs\n\n\ndef basic_clean(text):\n text = ftfy.fix_text(text)\n text = html.unescape(html.unescape(text))\n return text.strip()\n\n\ndef whitespace_clean(text):\n text = re.sub(r'\\s+', ' ', text)\n text = text.strip()\n return text\n\n\nclass SimpleTokenizer(object):\n def __init__(self, bpe_path: str = default_bpe()):\n self.byte_encoder = bytes_to_unicode()\n self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}\n merges = gzip.open(bpe_path).read().decode(\"utf-8\").split('\\n')\n merges = merges[1:49152-256-2+1]\n merges = [tuple(merge.split()) for merge in merges]\n vocab = list(bytes_to_unicode().values())\n vocab = vocab + [v+'</w>' for v in vocab]\n for merge in merges:\n vocab.append(''.join(merge))\n
vocab.extend(['<|startoftext|>', '<|endoftext|>'])\n self.encoder = dict(zip(vocab, range(len(vocab))))\n self.decoder = {v: k for k, v in self.encoder.items()}\n self.bpe_ranks = dict(zip(merges, range(len(merges))))\n self.cache = {'<|startoftext|>': '<|startoftext|>', '<|endoftext|>': '<|endoftext|>'}\n self.pat = re.compile(r\"\"\"<\\|startoftext\\|>|<\\|endoftext\\|>|'s|'t|'re|'ve|'m|'ll|'d|[\\p{L}]+|[\\p{N}]|[^\\s\\p{L}\\p{N}]+\"\"\", re.IGNORECASE)\n\n def bpe(self, token):\n if token in self.cache:\n return self.cache[token]\n word = tuple(token[:-1]) + ( token[-1] + '</w>',)\n pairs = get_pairs(word)\n\n if not pairs:\n return token+'</w>'\n\n while True:\n bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))\n if bigram not in self.bpe_ranks:\n break\n first, second = bigram\n new_word = []\n i = 0\n while i < len(word):\n try:\n j = word.index(first, i)\n new_word.extend(word[i:j])\n i = j\n except ValueError:\n new_word.extend(word[i:])\n break\n\n if word[i] == first and i < len(word)-1 and word[i+1] == second:\n new_word.append(first+second)\n i += 2\n else:\n new_word.append(word[i])\n i += 1\n new_word = tuple(new_word)\n word = new_word\n if len(word) == 1:\n break\n else:\n pairs = get_pairs(word)\n word = ' '.join(word)\n self.cache[token] = word\n return word\n\n def encode(self, text):\n bpe_tokens = []\n text = whitespace_clean(basic_clean(text)).lower()\n for token in re.findall(self.pat, text):\n token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))\n bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))\n return bpe_tokens\n\n def decode(self, tokens):\n text = ''.join([self.decoder[token] for token in tokens])\n text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=\"replace\").replace('</w>', ' ')\n return text\n" }, { "path": "workflow/flux-long.json", "content": "{\n \"last_node_id\": 38,\n \"last_link_id\": 118,\n \"nodes\": [\n {\n \"id\": 17,\n \"type\":
\"BasicScheduler\",\n \"pos\": [\n 480,\n 1008\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 106\n },\n \"flags\": {},\n \"order\": 13,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"model\",\n \"type\": \"MODEL\",\n \"link\": 55,\n \"slot_index\": 0\n }\n ],\n \"outputs\": [\n {\n \"name\": \"SIGMAS\",\n \"type\": \"SIGMAS\",\n \"links\": [\n 20\n ],\n \"shape\": 3\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"BasicScheduler\"\n },\n \"widgets_values\": [\n \"simple\",\n 20,\n 1\n ]\n },\n {\n \"id\": 16,\n \"type\": \"KSamplerSelect\",\n \"pos\": [\n 480,\n 912\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 58\n },\n \"flags\": {},\n \"order\": 0,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"SAMPLER\",\n \"type\": \"SAMPLER\",\n \"links\": [\n 19\n ],\n \"shape\": 3\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"KSamplerSelect\"\n },\n \"widgets_values\": [\n \"euler\"\n ]\n },\n {\n \"id\": 26,\n \"type\": \"FluxGuidance\",\n \"pos\": [\n 480,\n 144\n ],\n \"size\": {\n \"0\": 317.4000244140625,\n \"1\": 58\n },\n \"flags\": {},\n \"order\": 14,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"conditioning\",\n \"type\": \"CONDITIONING\",\n \"link\": 41\n }\n ],\n \"outputs\": [\n {\n \"name\": \"CONDITIONING\",\n \"type\": \"CONDITIONING\",\n \"links\": [\n 42\n ],\n \"shape\": 3,\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"FluxGuidance\"\n },\n \"widgets_values\": [\n 3.5\n ],\n \"color\": \"#233\",\n \"bgcolor\": \"#355\"\n },\n {\n \"id\": 22,\n \"type\": \"BasicGuider\",\n \"pos\": [\n 576,\n 48\n ],\n \"size\": {\n \"0\": 222.3482666015625,\n \"1\": 46\n },\n \"flags\": {},\n \"order\": 15,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"model\",\n \"type\": \"MODEL\",\n \"link\": 54,\n \"slot_index\": 0\n },\n {\n \"name\": \"conditioning\",\n \"type\": \"CONDITIONING\",\n \"link\": 42,\n \"slot_index\": 1\n }\n ],\n \"outputs\": [\n {\n \"name\": \"GUIDER\",\n \"type\": \"GUIDER\",\n \"links\": [\n 30\n 
],\n \"shape\": 3,\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"BasicGuider\"\n }\n },\n {\n \"id\": 13,\n \"type\": \"SamplerCustomAdvanced\",\n \"pos\": [\n 864,\n 192\n ],\n \"size\": {\n \"0\": 272.3617858886719,\n \"1\": 124.53733825683594\n },\n \"flags\": {},\n \"order\": 16,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"noise\",\n \"type\": \"NOISE\",\n \"link\": 37,\n \"slot_index\": 0\n },\n {\n \"name\": \"guider\",\n \"type\": \"GUIDER\",\n \"link\": 30,\n \"slot_index\": 1\n },\n {\n \"name\": \"sampler\",\n \"type\": \"SAMPLER\",\n \"link\": 19,\n \"slot_index\": 2\n },\n {\n \"name\": \"sigmas\",\n \"type\": \"SIGMAS\",\n \"link\": 20,\n \"slot_index\": 3\n },\n {\n \"name\": \"latent_image\",\n \"type\": \"LATENT\",\n \"link\": 116,\n \"slot_index\": 4\n }\n ],\n \"outputs\": [\n {\n \"name\": \"output\",\n \"type\": \"LATENT\",\n \"links\": [\n 24\n ],\n \"shape\": 3,\n \"slot_index\": 0\n },\n {\n \"name\": \"denoised_output\",\n \"type\": \"LATENT\",\n \"links\": null,\n \"shape\": 3\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"SamplerCustomAdvanced\"\n }\n },\n {\n \"id\": 25,\n \"type\": \"RandomNoise\",\n \"pos\": [\n 480,\n 768\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 82\n },\n \"flags\": {},\n \"order\": 1,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"NOISE\",\n \"type\": \"NOISE\",\n \"links\": [\n 37\n ],\n \"shape\": 3\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"RandomNoise\"\n },\n \"widgets_values\": [\n 219670278747233,\n \"randomize\"\n ],\n \"color\": \"#2a363b\",\n \"bgcolor\": \"#3f5159\"\n },\n {\n \"id\": 8,\n \"type\": \"VAEDecode\",\n \"pos\": [\n 866,\n 367\n ],\n \"size\": {\n \"0\": 210,\n \"1\": 46\n },\n \"flags\": {},\n \"order\": 17,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"samples\",\n \"type\": \"LATENT\",\n \"link\": 24\n },\n {\n \"name\": \"vae\",\n \"type\": \"VAE\",\n \"link\": 12\n }\n ],\n \"outputs\": [\n {\n \"name\": \"IMAGE\",\n \"type\": 
\"IMAGE\",\n \"links\": [\n 9\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"VAEDecode\"\n }\n },\n {\n \"id\": 6,\n \"type\": \"CLIPTextEncode\",\n \"pos\": [\n 384,\n 240\n ],\n \"size\": {\n \"0\": 422.84503173828125,\n \"1\": 164.31304931640625\n },\n \"flags\": {},\n \"order\": 12,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"clip\",\n \"type\": \"CLIP\",\n \"link\": 118\n }\n ],\n \"outputs\": [\n {\n \"name\": \"CONDITIONING\",\n \"type\": \"CONDITIONING\",\n \"links\": [\n 41\n ],\n \"slot_index\": 0\n }\n ],\n \"title\": \"CLIP Text Encode (Positive Prompt)\",\n \"properties\": {\n \"Node name for S&R\": \"CLIPTextEncode\"\n },\n \"widgets_values\": [\n \"cute anime girl with massive fluffy fennec ears and a big fluffy tail blonde messy long hair blue eyes wearing a maid outfit with a long black gold leaf pattern dress and a white apron mouth open holding a fancy black forest cake with candles on top in the kitchen of an old dark Victorian mansion lit by candlelight with a bright window to the foggy forest and very expensive stuff everywhere\"\n ],\n \"color\": \"#232\",\n \"bgcolor\": \"#353\"\n },\n {\n \"id\": 30,\n \"type\": \"ModelSamplingFlux\",\n \"pos\": [\n 480,\n 1152\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 130\n },\n \"flags\": {},\n \"order\": 11,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"model\",\n \"type\": \"MODEL\",\n \"link\": 56,\n \"slot_index\": 0\n },\n {\n \"name\": \"width\",\n \"type\": \"INT\",\n \"link\": 115,\n \"widget\": {\n \"name\": \"width\"\n },\n \"slot_index\": 1\n },\n {\n \"name\": \"height\",\n \"type\": \"INT\",\n \"link\": 114,\n \"widget\": {\n \"name\": \"height\"\n },\n \"slot_index\": 2\n }\n ],\n \"outputs\": [\n {\n \"name\": \"MODEL\",\n \"type\": \"MODEL\",\n \"links\": [\n 54,\n 55\n ],\n \"shape\": 3,\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"ModelSamplingFlux\"\n },\n \"widgets_values\": [\n 1.15,\n 0.5,\n 1024,\n 1024\n ]\n 
},\n {\n \"id\": 27,\n \"type\": \"EmptySD3LatentImage\",\n \"pos\": [\n 480,\n 624\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 106\n },\n \"flags\": {},\n \"order\": 9,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"width\",\n \"type\": \"INT\",\n \"link\": 112,\n \"widget\": {\n \"name\": \"width\"\n }\n },\n {\n \"name\": \"height\",\n \"type\": \"INT\",\n \"link\": 113,\n \"widget\": {\n \"name\": \"height\"\n }\n }\n ],\n \"outputs\": [\n {\n \"name\": \"LATENT\",\n \"type\": \"LATENT\",\n \"links\": [\n 116\n ],\n \"shape\": 3,\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"EmptySD3LatentImage\"\n },\n \"widgets_values\": [\n 1024,\n 1024,\n 1\n ]\n },\n {\n \"id\": 34,\n \"type\": \"PrimitiveNode\",\n \"pos\": [\n 432,\n 480\n ],\n \"size\": {\n \"0\": 210,\n \"1\": 82\n },\n \"flags\": {},\n \"order\": 2,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"INT\",\n \"type\": \"INT\",\n \"links\": [\n 112,\n 115\n ],\n \"slot_index\": 0,\n \"widget\": {\n \"name\": \"width\"\n }\n }\n ],\n \"title\": \"width\",\n \"properties\": {\n \"Run widget replace on values\": false\n },\n \"widgets_values\": [\n 1024,\n \"fixed\"\n ],\n \"color\": \"#323\",\n \"bgcolor\": \"#535\"\n },\n {\n \"id\": 35,\n \"type\": \"PrimitiveNode\",\n \"pos\": [\n 672,\n 480\n ],\n \"size\": {\n \"0\": 210,\n \"1\": 82\n },\n \"flags\": {},\n \"order\": 3,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"INT\",\n \"type\": \"INT\",\n \"links\": [\n 113,\n 114\n ],\n \"widget\": {\n \"name\": \"height\"\n },\n \"slot_index\": 0\n }\n ],\n \"title\": \"height\",\n \"properties\": {\n \"Run widget replace on values\": false\n },\n \"widgets_values\": [\n 1024,\n \"fixed\"\n ],\n \"color\": \"#323\",\n \"bgcolor\": \"#535\"\n },\n {\n \"id\": 9,\n \"type\": \"SaveImage\",\n \"pos\": [\n 1155,\n 196\n ],\n \"size\": {\n \"0\": 985.3012084960938,\n \"1\": 1060.3828125\n },\n \"flags\": {},\n \"order\": 18,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": 
\"images\",\n \"type\": \"IMAGE\",\n \"link\": 9\n }\n ],\n \"properties\": {},\n \"widgets_values\": [\n \"ComfyUI\"\n ]\n },\n {\n \"id\": 37,\n \"type\": \"Note\",\n \"pos\": [\n 480,\n 1344\n ],\n \"size\": {\n \"0\": 314.99755859375,\n \"1\": 117.98363494873047\n },\n \"flags\": {},\n \"order\": 4,\n \"mode\": 0,\n \"properties\": {\n \"text\": \"\"\n },\n \"widgets_values\": [\n \"The reference sampling implementation auto adjusts the shift value based on the resolution, if you don't want this you can just bypass (CTRL-B) this ModelSamplingFlux node.\\n\"\n ],\n \"color\": \"#432\",\n \"bgcolor\": \"#653\"\n },\n {\n \"id\": 10,\n \"type\": \"VAELoader\",\n \"pos\": [\n 48,\n 432\n ],\n \"size\": {\n \"0\": 311.81634521484375,\n \"1\": 60.429901123046875\n },\n \"flags\": {},\n \"order\": 5,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"VAE\",\n \"type\": \"VAE\",\n \"links\": [\n 12\n ],\n \"shape\": 3,\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"VAELoader\"\n },\n \"widgets_values\": [\n \"ae.safetensors\"\n ]\n },\n {\n \"id\": 28,\n \"type\": \"Note\",\n \"pos\": [\n 48,\n 576\n ],\n \"size\": {\n \"0\": 336,\n \"1\": 288\n },\n \"flags\": {},\n \"order\": 6,\n \"mode\": 0,\n \"properties\": {\n \"text\": \"\"\n },\n \"widgets_values\": [\n \"If you get an error in any of the nodes above make sure the files are in the correct directories.\\n\\nSee the top of the examples page for the links : https://comfyanonymous.github.io/ComfyUI_examples/flux/\\n\\nflux1-dev.safetensors goes in: ComfyUI/models/unet/\\n\\nt5xxl_fp16.safetensors and clip_l.safetensors go in: ComfyUI/models/clip/\\n\\nae.safetensors goes in: ComfyUI/models/vae/\\n\\n\\nTip: You can set the weight_dtype above to one of the fp8 types if you have memory issues.\"\n ],\n \"color\": \"#432\",\n \"bgcolor\": \"#653\"\n },\n {\n \"id\": 11,\n \"type\": \"DualCLIPLoader\",\n \"pos\": [\n 48,\n 275\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 106\n },\n \"flags\": 
{},\n \"order\": 7,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"CLIP\",\n \"type\": \"CLIP\",\n \"links\": [\n 117\n ],\n \"shape\": 3,\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"DualCLIPLoader\"\n },\n \"widgets_values\": [\n \"t5xxl_fp16.safetensors\",\n \"clip_l.safetensors\",\n \"flux\"\n ]\n },\n {\n \"id\": 38,\n \"type\": \"LongCLIPTextEncodeFlux\",\n \"pos\": [\n 56,\n 174\n ],\n \"size\": [\n 305.3363800289681,\n 58\n ],\n \"flags\": {},\n \"order\": 10,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"clip\",\n \"type\": \"CLIP\",\n \"link\": 117\n }\n ],\n \"outputs\": [\n {\n \"name\": \"CLIP\",\n \"type\": \"CLIP\",\n \"links\": [\n 118\n ],\n \"shape\": 3,\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"LongCLIPTextEncodeFlux\"\n },\n \"widgets_values\": [\n \"longclip-L.pt\"\n ]\n },\n {\n \"id\": 12,\n \"type\": \"UNETLoader\",\n \"pos\": [\n 61,\n 46\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 82\n },\n \"flags\": {},\n \"order\": 8,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"MODEL\",\n \"type\": \"MODEL\",\n \"links\": [\n 56\n ],\n \"shape\": 3,\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"UNETLoader\"\n },\n \"widgets_values\": [\n \"flux1-dev.safetensors\",\n \"default\"\n ],\n \"color\": \"#223\",\n \"bgcolor\": \"#335\"\n }\n ],\n \"links\": [\n [\n 9,\n 8,\n 0,\n 9,\n 0,\n \"IMAGE\"\n ],\n [\n 12,\n 10,\n 0,\n 8,\n 1,\n \"VAE\"\n ],\n [\n 19,\n 16,\n 0,\n 13,\n 2,\n \"SAMPLER\"\n ],\n [\n 20,\n 17,\n 0,\n 13,\n 3,\n \"SIGMAS\"\n ],\n [\n 24,\n 13,\n 0,\n 8,\n 0,\n \"LATENT\"\n ],\n [\n 30,\n 22,\n 0,\n 13,\n 1,\n \"GUIDER\"\n ],\n [\n 37,\n 25,\n 0,\n 13,\n 0,\n \"NOISE\"\n ],\n [\n 41,\n 6,\n 0,\n 26,\n 0,\n \"CONDITIONING\"\n ],\n [\n 42,\n 26,\n 0,\n 22,\n 1,\n \"CONDITIONING\"\n ],\n [\n 54,\n 30,\n 0,\n 22,\n 0,\n \"MODEL\"\n ],\n [\n 55,\n 30,\n 0,\n 17,\n 0,\n \"MODEL\"\n ],\n [\n 56,\n 12,\n 0,\n 30,\n 0,\n \"MODEL\"\n ],\n [\n 
112,\n 34,\n 0,\n 27,\n 0,\n \"INT\"\n ],\n [\n 113,\n 35,\n 0,\n 27,\n 1,\n \"INT\"\n ],\n [\n 114,\n 35,\n 0,\n 30,\n 2,\n \"INT\"\n ],\n [\n 115,\n 34,\n 0,\n 30,\n 1,\n \"INT\"\n ],\n [\n 116,\n 27,\n 0,\n 13,\n 4,\n \"LATENT\"\n ],\n [\n 117,\n 11,\n 0,\n 38,\n 0,\n \"CLIP\"\n ],\n [\n 118,\n 38,\n 0,\n 6,\n 0,\n \"CLIP\"\n ]\n ],\n \"groups\": [],\n \"config\": {},\n \"extra\": {\n \"ds\": {\n \"scale\": 1.1,\n \"offset\": [\n 484.9445814412023,\n 165.9252922919127\n ]\n },\n \"groupNodes\": {}\n },\n \"version\": 0.4\n}" }, { "path": "workflow/sd1-5-long.json", "content": "{\n \"last_node_id\": 10,\n \"last_link_id\": 19,\n \"nodes\": [\n {\n \"id\": 7,\n \"type\": \"CLIPTextEncode\",\n \"pos\": [\n 413,\n 389\n ],\n \"size\": {\n \"0\": 425.27801513671875,\n \"1\": 180.6060791015625\n },\n \"flags\": {},\n \"order\": 4,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"clip\",\n \"type\": \"CLIP\",\n \"link\": 19\n }\n ],\n \"outputs\": [\n {\n \"name\": \"CONDITIONING\",\n \"type\": \"CONDITIONING\",\n \"links\": [\n 6\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"CLIPTextEncode\"\n },\n \"widgets_values\": [\n \"text, watermark\"\n ]\n },\n {\n \"id\": 8,\n \"type\": \"VAEDecode\",\n \"pos\": [\n 1209,\n 188\n ],\n \"size\": {\n \"0\": 210,\n \"1\": 46\n },\n \"flags\": {},\n \"order\": 6,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"samples\",\n \"type\": \"LATENT\",\n \"link\": 7\n },\n {\n \"name\": \"vae\",\n \"type\": \"VAE\",\n \"link\": 8\n }\n ],\n \"outputs\": [\n {\n \"name\": \"IMAGE\",\n \"type\": \"IMAGE\",\n \"links\": [\n 9\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"VAEDecode\"\n }\n },\n {\n \"id\": 9,\n \"type\": \"SaveImage\",\n \"pos\": [\n 1451,\n 189\n ],\n \"size\": [\n 210,\n 270\n ],\n \"flags\": {},\n \"order\": 7,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"images\",\n \"type\": \"IMAGE\",\n \"link\": 9\n }\n ],\n \"properties\": {},\n 
\"widgets_values\": [\n \"ComfyUI\"\n ]\n },\n {\n \"id\": 5,\n \"type\": \"EmptyLatentImage\",\n \"pos\": [\n 473,\n 609\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 106\n },\n \"flags\": {},\n \"order\": 0,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"LATENT\",\n \"type\": \"LATENT\",\n \"links\": [\n 2\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"EmptyLatentImage\"\n },\n \"widgets_values\": [\n 1024,\n 1024,\n 1\n ]\n },\n {\n \"id\": 3,\n \"type\": \"KSampler\",\n \"pos\": [\n 863,\n 186\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 262\n },\n \"flags\": {},\n \"order\": 5,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"model\",\n \"type\": \"MODEL\",\n \"link\": 1\n },\n {\n \"name\": \"positive\",\n \"type\": \"CONDITIONING\",\n \"link\": 4\n },\n {\n \"name\": \"negative\",\n \"type\": \"CONDITIONING\",\n \"link\": 6\n },\n {\n \"name\": \"latent_image\",\n \"type\": \"LATENT\",\n \"link\": 2\n }\n ],\n \"outputs\": [\n {\n \"name\": \"LATENT\",\n \"type\": \"LATENT\",\n \"links\": [\n 7\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"KSampler\"\n },\n \"widgets_values\": [\n 425533035839016,\n \"fixed\",\n 20,\n 7,\n \"euler_ancestral\",\n \"normal\",\n 1\n ]\n },\n {\n \"id\": 6,\n \"type\": \"CLIPTextEncode\",\n \"pos\": [\n 415,\n 186\n ],\n \"size\": {\n \"0\": 422.84503173828125,\n \"1\": 164.31304931640625\n },\n \"flags\": {},\n \"order\": 3,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"clip\",\n \"type\": \"CLIP\",\n \"link\": 18\n }\n ],\n \"outputs\": [\n {\n \"name\": \"CONDITIONING\",\n \"type\": \"CONDITIONING\",\n \"links\": [\n 4\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"CLIPTextEncode\"\n },\n \"widgets_values\": [\n \"Studio Ghibli ~ Hayao Miyazaki ~ Spirited Away ~ beautiful ultra detailed illustration of a cute witch of the sitting on a tree stump reading a book, her ornate robe reminiscent of the stars in the night sky. 
This contrast between the fantastical character and the more bold color scheme and elements gives the piece an intriguing narrative quality. painted realism, photorealistic, 8k, fantasy digital art, HDR, UHD.\"\n ]\n },\n {\n \"id\": 4,\n \"type\": \"CheckpointLoaderSimple\",\n \"pos\": [\n -252,\n 425\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 98\n },\n \"flags\": {},\n \"order\": 1,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"MODEL\",\n \"type\": \"MODEL\",\n \"links\": [\n 1\n ],\n \"slot_index\": 0\n },\n {\n \"name\": \"CLIP\",\n \"type\": \"CLIP\",\n \"links\": [],\n \"slot_index\": 1\n },\n {\n \"name\": \"VAE\",\n \"type\": \"VAE\",\n \"links\": [\n 8\n ],\n \"slot_index\": 2\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"CheckpointLoaderSimple\"\n },\n \"widgets_values\": [\n \"Dark_Sushi_Mix.safetensors\"\n ]\n },\n {\n \"id\": 10,\n \"type\": \"SeaArtLongClip\",\n \"pos\": [\n -248,\n 204\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 58\n },\n \"flags\": {},\n \"order\": 2,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"CLIP\",\n \"type\": \"CLIP\",\n \"links\": [\n 18,\n 19\n ],\n \"shape\": 3,\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"SeaArtLongClip\"\n },\n \"widgets_values\": [\n \"longclip-L.pt\"\n ]\n }\n ],\n \"links\": [\n [\n 1,\n 4,\n 0,\n 3,\n 0,\n \"MODEL\"\n ],\n [\n 2,\n 5,\n 0,\n 3,\n 3,\n \"LATENT\"\n ],\n [\n 4,\n 6,\n 0,\n 3,\n 1,\n \"CONDITIONING\"\n ],\n [\n 6,\n 7,\n 0,\n 3,\n 2,\n \"CONDITIONING\"\n ],\n [\n 7,\n 3,\n 0,\n 8,\n 0,\n \"LATENT\"\n ],\n [\n 8,\n 4,\n 2,\n 8,\n 1,\n \"VAE\"\n ],\n [\n 9,\n 8,\n 0,\n 9,\n 0,\n \"IMAGE\"\n ],\n [\n 18,\n 10,\n 0,\n 6,\n 0,\n \"CLIP\"\n ],\n [\n 19,\n 10,\n 0,\n 7,\n 0,\n \"CLIP\"\n ]\n ],\n \"groups\": [],\n \"config\": {},\n \"extra\": {},\n \"version\": 0.4\n}" }, { "path": "workflow/sdxl-long.json", "content": "{\n \"last_node_id\": 13,\n \"last_link_id\": 29,\n \"nodes\": [\n {\n \"id\": 7,\n \"type\": \"CLIPTextEncode\",\n \"pos\": [\n 
413,\n 389\n ],\n \"size\": {\n \"0\": 425.27801513671875,\n \"1\": 180.6060791015625\n },\n \"flags\": {},\n \"order\": 4,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"clip\",\n \"type\": \"CLIP\",\n \"link\": 29\n }\n ],\n \"outputs\": [\n {\n \"name\": \"CONDITIONING\",\n \"type\": \"CONDITIONING\",\n \"links\": [\n 6\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"CLIPTextEncode\"\n },\n \"widgets_values\": [\n \"text, watermark\"\n ]\n },\n {\n \"id\": 8,\n \"type\": \"VAEDecode\",\n \"pos\": [\n 1209,\n 188\n ],\n \"size\": {\n \"0\": 210,\n \"1\": 46\n },\n \"flags\": {},\n \"order\": 6,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"samples\",\n \"type\": \"LATENT\",\n \"link\": 7\n },\n {\n \"name\": \"vae\",\n \"type\": \"VAE\",\n \"link\": 8\n }\n ],\n \"outputs\": [\n {\n \"name\": \"IMAGE\",\n \"type\": \"IMAGE\",\n \"links\": [\n 9\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"VAEDecode\"\n }\n },\n {\n \"id\": 9,\n \"type\": \"SaveImage\",\n \"pos\": [\n 1451,\n 189\n ],\n \"size\": {\n \"0\": 210,\n \"1\": 270\n },\n \"flags\": {},\n \"order\": 7,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"images\",\n \"type\": \"IMAGE\",\n \"link\": 9\n }\n ],\n \"properties\": {},\n \"widgets_values\": [\n \"ComfyUI\"\n ]\n },\n {\n \"id\": 5,\n \"type\": \"EmptyLatentImage\",\n \"pos\": [\n 473,\n 609\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 106\n },\n \"flags\": {},\n \"order\": 0,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"LATENT\",\n \"type\": \"LATENT\",\n \"links\": [\n 2\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"EmptyLatentImage\"\n },\n \"widgets_values\": [\n 1024,\n 1024,\n 1\n ]\n },\n {\n \"id\": 3,\n \"type\": \"KSampler\",\n \"pos\": [\n 863,\n 186\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 262\n },\n \"flags\": {},\n \"order\": 5,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"model\",\n \"type\": \"MODEL\",\n \"link\": 1\n 
},\n {\n \"name\": \"positive\",\n \"type\": \"CONDITIONING\",\n \"link\": 4\n },\n {\n \"name\": \"negative\",\n \"type\": \"CONDITIONING\",\n \"link\": 6\n },\n {\n \"name\": \"latent_image\",\n \"type\": \"LATENT\",\n \"link\": 2\n }\n ],\n \"outputs\": [\n {\n \"name\": \"LATENT\",\n \"type\": \"LATENT\",\n \"links\": [\n 7\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"KSampler\"\n },\n \"widgets_values\": [\n 425533035839016,\n \"fixed\",\n 20,\n 7,\n \"euler_ancestral\",\n \"normal\",\n 1\n ]\n },\n {\n \"id\": 6,\n \"type\": \"CLIPTextEncode\",\n \"pos\": [\n 415,\n 186\n ],\n \"size\": {\n \"0\": 422.84503173828125,\n \"1\": 164.31304931640625\n },\n \"flags\": {},\n \"order\": 3,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"clip\",\n \"type\": \"CLIP\",\n \"link\": 28\n }\n ],\n \"outputs\": [\n {\n \"name\": \"CONDITIONING\",\n \"type\": \"CONDITIONING\",\n \"links\": [\n 4\n ],\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"CLIPTextEncode\"\n },\n \"widgets_values\": [\n \"Studio Ghibli ~ Hayao Miyazaki ~ Spirited Away ~ beautiful ultra detailed illustration of a cute witch of the sitting on a tree stump reading a book, her ornate robe reminiscent of the stars in the night sky. This contrast between the fantastical character and the more bold color scheme and elements gives the piece an intriguing narrative quality. 
painted realism, photorealistic, 8k, fantasy digital art, HDR, UHD.\"\n ]\n },\n {\n \"id\": 4,\n \"type\": \"CheckpointLoaderSimple\",\n \"pos\": [\n -387,\n 375\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 98\n },\n \"flags\": {},\n \"order\": 1,\n \"mode\": 0,\n \"outputs\": [\n {\n \"name\": \"MODEL\",\n \"type\": \"MODEL\",\n \"links\": [\n 1\n ],\n \"slot_index\": 0\n },\n {\n \"name\": \"CLIP\",\n \"type\": \"CLIP\",\n \"links\": [\n 27\n ],\n \"slot_index\": 1\n },\n {\n \"name\": \"VAE\",\n \"type\": \"VAE\",\n \"links\": [\n 8\n ],\n \"slot_index\": 2\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"CheckpointLoaderSimple\"\n },\n \"widgets_values\": [\n \"sdxl.safetensors\"\n ]\n },\n {\n \"id\": 13,\n \"type\": \"SeaArtLongXLClipMerge\",\n \"pos\": [\n -4,\n 217\n ],\n \"size\": {\n \"0\": 315,\n \"1\": 58\n },\n \"flags\": {},\n \"order\": 2,\n \"mode\": 0,\n \"inputs\": [\n {\n \"name\": \"clip\",\n \"type\": \"CLIP\",\n \"link\": 27\n }\n ],\n \"outputs\": [\n {\n \"name\": \"CLIP\",\n \"type\": \"CLIP\",\n \"links\": [\n 28,\n 29\n ],\n \"shape\": 3,\n \"slot_index\": 0\n }\n ],\n \"properties\": {\n \"Node name for S&R\": \"SeaArtLongXLClipMerge\"\n },\n \"widgets_values\": [\n \"longclip-L.pt\"\n ]\n }\n ],\n \"links\": [\n [\n 1,\n 4,\n 0,\n 3,\n 0,\n \"MODEL\"\n ],\n [\n 2,\n 5,\n 0,\n 3,\n 3,\n \"LATENT\"\n ],\n [\n 4,\n 6,\n 0,\n 3,\n 1,\n \"CONDITIONING\"\n ],\n [\n 6,\n 7,\n 0,\n 3,\n 2,\n \"CONDITIONING\"\n ],\n [\n 7,\n 3,\n 0,\n 8,\n 0,\n \"LATENT\"\n ],\n [\n 8,\n 4,\n 2,\n 8,\n 1,\n \"VAE\"\n ],\n [\n 9,\n 8,\n 0,\n 9,\n 0,\n \"IMAGE\"\n ],\n [\n 27,\n 4,\n 1,\n 13,\n 0,\n \"CLIP\"\n ],\n [\n 28,\n 13,\n 0,\n 6,\n 0,\n \"CLIP\"\n ],\n [\n 29,\n 13,\n 0,\n 7,\n 0,\n \"CLIP\"\n ]\n ],\n \"groups\": [],\n \"config\": {},\n \"extra\": {},\n \"version\": 0.4\n}" } ]