Full Code of furiousteabag/doppelganger
Repository: furiousteabag/doppelganger
Branch: master
Commit: 1143bcd9d4c8
Files: 10
Total size: 24.2 KB

Directory structure:
doppelganger/

├── .gitignore
├── LICENSE
├── Makefile
├── README.md
├── data/
│   └── .gitignore
├── finetune_full.py
├── finetune_lora.py
├── prepare_dataset.py
├── requirements.txt
└── utils/
    └── finetune_utils.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
__pycache__
.ipynb_checkpoints
.pytest_cache
.vim-arsync
*.log
result.json
weights/
wandb/
.cache/


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2023 Alexander Smirnov

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: Makefile
================================================
LINE_WIDTH=120
ISORT_FLAGS=--line-width=${LINE_WIDTH} --profile black
BLACK_FLAGS=--line-length=${LINE_WIDTH}

install:
	pip install -r requirements.txt

install-format:
	pip install black isort

format:
	isort ${ISORT_FLAGS} --check-only --diff .
	black ${BLACK_FLAGS} --check --diff .

format-fix:
	isort ${ISORT_FLAGS} .
	black ${BLACK_FLAGS} .


================================================
FILE: README.md
================================================
# Doppelganger

Fine-tuning an LLM on my Telegram chats. You can read the full story [in my blog](https://asmirnov.xyz/doppelganger).

## Dataset Preparation

First, we have to get the data. Open Telegram, go to 'Settings' -> 'Advanced' -> 'Export Telegram Data', and unselect everything except 'Personal chats' and 'Private groups' (don't select 'Only my messages there'). Choose 'Machine-readable JSON' as the output format. The export produces `result.json`.

Use `prepare_dataset.py` to transform `result.json` into a JSON file with a list of sessions:

```bash
python prepare_dataset.py "./data/result.json" "./data/messages.json"
```
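
Each session is a list of messages with `author` and `text` fields; consecutive messages from one user are merged, and every original message is prefixed with the `>>> ` delimiter. A sketch of what `messages.json` looks like (the authors and texts here are invented for illustration):

```
[
    [
        {
            "author": "John Smith",
            "text": ">>> damn, can't get around the 135 time limit\n>>> trying to do everything super optimally, but no luck"
        },
        {
            "author": "Alexander Smirnov",
            "text": ">>> yeah same\n>>> you still going with the same idea?"
        }
    ]
]
```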

The script has several flags; you can read more via `--help`:

```bash
python prepare_dataset.py --help
```

<details>
<summary>output</summary>

```
NAME
    prepare_dataset.py - Transforms chat histories from a .json Telegram export into a .json file with a list of sessions. A session is a list of messages, where each message is a dict with fields 'author' and 'text'.

SYNOPSIS
    prepare_dataset.py INPUT OUTPUT <flags>

DESCRIPTION
    Transforms chat histories from a .json Telegram export into a .json file with a list of sessions. A session is a list of messages, where each message is a dict with fields 'author' and 'text'.

POSITIONAL ARGUMENTS
    INPUT
        Type: str
        Path to the .json Telegram export, usually called result.json
    OUTPUT
        Type: str
        Path to the output .json file

FLAGS
    -t, --target_name=TARGET_NAME
        Type: Optional[str | None]
        Default: None
        The name of the person to target. This person will be present in every session. If empty, the script tries to detect it from "Saved Messages"
    -l, --last_x_months=LAST_X_MONTHS
        Type: int
        Default: 24
        Number of past months to take messages from
    -s, --session_minutes_threshold=SESSION_MINUTES_THRESHOLD
        Type: int
        Default: 10
        Gap in minutes below which consecutive messages belong to the same session
    -c, --concat_one_user_messages_delimeter=CONCAT_ONE_USER_MESSAGES_DELIMETER
        Type: str
        Default: '\n>>> '
        Users might type several messages one after another; they are concatenated using this delimiter

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS
```

</details>

If you are curious, Telegram has several message types, each of which should be handled differently:

<details>
<summary>default text message</summary>

```
{
 "id": 123,
 "type": "message",
 "date": "2023-10-31T15:23:38",
 "date_unixtime": "1698746018",
 "from": "Username",
 "from_id": "user123",
 "text": "ты где?",
 "text_entities": [
  {
   "type": "plain",
   "text": "ты где?"
  }
 ]
}
```

</details>

<details>
<summary>multiple text entities</summary>

```
{
 "id": 345,
 "type": "message",
 "date": "2023-10-25T01:56:50",
 "date_unixtime": "1698179210",
 "from": "Username",
 "from_id": "user456",
 "text": [
  "California suspends GM Cruise's autonomous vehicle deployment | Hacker News\n",
  {
   "type": "link",
   "text": "https://news.ycombinator.com/item?id=38002752"
  }
 ],
 "text_entities": [
  {
   "type": "plain",
   "text": "California suspends GM Cruise's autonomous vehicle deployment | Hacker News\n"
  },
  {
   "type": "link",
   "text": "https://news.ycombinator.com/item?id=38002752"
  }
 ]
}
```

</details>

<details>
<summary>sticker</summary>

```
{
 "id": 789,
 "type": "message",
 "date": "2023-10-30T23:24:20",
 "date_unixtime": "1698688460",
 "from": "Username",
 "from_id": "user789",
 "file": "(File not included. Change data exporting settings to download.)",
 "thumbnail": "(File not included. Change data exporting settings to download.)",
 "media_type": "sticker",
 "sticker_emoji": "🤗",
 "width": 512,
 "height": 501,
 "text": "",
 "text_entities": []
}
```

</details>
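
All three cases collapse to plain text in `prepare_dataset.py`: the text entities are concatenated and the sticker emoji, if any, is appended. A minimal sketch of that logic (it mirrors `load_chats`; the helper names are mine, not the repo's):

```python
def message_to_text(msg: dict) -> str:
    # Concatenate all text entities (plain text and links alike)
    # and append the sticker emoji, if present.
    text = "".join(entity["text"] for entity in msg.get("text_entities", []))
    return text + msg.get("sticker_emoji", "")

def is_usable(msg: dict) -> bool:
    # Skip service messages (no sender) and messages with neither text nor a sticker.
    return bool(msg.get("from")) and (bool(msg.get("text_entities")) or "sticker_emoji" in msg)
```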

## Training

The final models were trained with the parameters that are the defaults in the training scripts. Training logs can be accessed on [WandB](https://wandb.ai/furiousteabag/doppelganger).

### LoRA fine-tune

To launch a LoRA fine-tune with the default params, you will need a GPU with 20 GB of VRAM. An RTX 3090 is a good option for the money. You can reduce `micro_batch_size` or `max_seq_length` to lower the amount of VRAM required; see the example below. To get the full list of parameters, run:

```
python finetune_lora.py --help
```

To train LoRA, run:

```
python finetune_lora.py
```
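
For example, a lower-memory run might look like this (`fire` exposes the `train()` parameters as CLI flags; the values here are illustrative):

```
python finetune_lora.py --micro_batch_size 4 --max_seq_length 512
```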

### Full fine-tune

To list available params with their default values, run:

```
python finetune_full.py --help
```

To train:

```
torchrun --nnodes=1 --nproc_per_node=NUMBER_OF_GPUS finetune_full.py
```
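
For instance, on a single node with 4 GPUs (the GPU count is illustrative; script parameters can be appended as flags here too):

```
torchrun --nnodes=1 --nproc_per_node=4 finetune_full.py --micro_batch_size 1
```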

## Launching

Use [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui). If you used LoRA, clone [ehartford/dolphin-2.2.1-mistral-7b](https://huggingface.co/ehartford/dolphin-2.2.1-mistral-7b) (or whatever model you used as the base) and put the trained LoRA adapters into the `./loras/` folder within text-generation-webui. If you did a full fine-tune, copy the training result to `./models/` instead.
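
A hedged sketch of the resulting layout, assuming text-generation-webui is cloned into the current directory (the exact paths are illustrative):

```bash
# Base model goes into models/ inside text-generation-webui
git clone https://huggingface.co/ehartford/dolphin-2.2.1-mistral-7b \
    text-generation-webui/models/dolphin-2.2.1-mistral-7b

# Trained LoRA adapters (the default output_dir of finetune_lora.py) go into loras/
cp -r ./weights/LoRA text-generation-webui/loras/doppelganger
```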


================================================
FILE: data/.gitignore
================================================
*
!.gitignore


================================================
FILE: finetune_full.py
================================================
import os
from typing import Dict, List

import fire
import torch
import wandb
from datasets import Dataset
from loguru import logger
from tokenizers import AddedToken
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedModel, PreTrainedTokenizer, TrainingArguments
from trl import SFTTrainer

from utils.finetune_utils import DataCollatorForLanguageModelingChatML, prepare_dataset, print_trainable_parameters

DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_EOS_TOKEN = "<|im_end|>"
DEFAULT_UNK_TOKEN = "<unk>"
DEFAULT_PAD_TOKEN = "</s>"


def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict, tokenizer: PreTrainedTokenizer, model: PreTrainedModel, tokens_list: List = []
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict) + tokenizer.add_tokens(tokens_list)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg


def train(
    model_name_or_path: str = "mistralai/Mistral-7B-v0.1",
    data_path: str = "./data/messages.json",
    output_dir: str = "./weights/full/",
    gradient_accumulation_steps: int = 4,
    micro_batch_size: int = 2,
    num_epochs: int = 3,
    learning_rate: float = 2e-5,
    lr_scheduler_type: str = "cosine",
    warmup_ratio: float = 0.03,
    weight_decay: float = 0.0,
    max_seq_length: int = 1024,
    fsdp: str = "full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap: str = "MistralDecoderLayer",
    wandb_project: str = "doppelganger",
    logging_steps: int = 1,
):
    if int(os.environ.get("LOCAL_RANK", 0)) == 0:
        batch_size = micro_batch_size * gradient_accumulation_steps * int(os.environ.get("WORLD_SIZE", 1))
        logger.info(f"Total batch size: {batch_size}")

    model: PreTrainedModel = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16)
    model.config.use_cache = False

    print_trainable_parameters(model)

    tokenizer = AutoTokenizer.from_pretrained(
        model_name_or_path, model_max_length=max_seq_length, padding_side="right", use_fast=False
    )

    special_tokens_dict = dict()
    if tokenizer.bos_token is None:
        special_tokens_dict["bos_token"] = DEFAULT_BOS_TOKEN
    if tokenizer.pad_token is None:
        special_tokens_dict["pad_token"] = DEFAULT_PAD_TOKEN
    if tokenizer.unk_token is None:
        special_tokens_dict["unk_token"] = DEFAULT_UNK_TOKEN
    special_tokens_dict["eos_token"] = DEFAULT_EOS_TOKEN
    tokens_list = [AddedToken("<|im_start|>", normalized=False)]

    smart_tokenizer_and_embedding_resize(
        special_tokens_dict=special_tokens_dict,
        tokens_list=tokens_list,
        tokenizer=tokenizer,
        model=model,
    )

    data_collator = DataCollatorForLanguageModelingChatML(tokenizer=tokenizer)
    dataset = Dataset.from_dict({"session": prepare_dataset(data_path)})

    if int(os.environ.get("LOCAL_RANK", 0)) == 0:
        wandb.init(project=wandb_project, job_type="train", anonymous="allow")

    training_arguments = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        optim="adamw_torch",
        save_steps=500,
        logging_steps=logging_steps,
        logging_first_step=True,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        bf16=True,
        max_steps=-1,
        warmup_ratio=warmup_ratio,
        group_by_length=True,
        lr_scheduler_type=lr_scheduler_type,
        fsdp=fsdp,
        fsdp_transformer_layer_cls_to_wrap=fsdp_transformer_layer_cls_to_wrap,
        report_to=["wandb"] if int(os.environ.get("LOCAL_RANK", 0)) == 0 else [],
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        data_collator=data_collator,
        dataset_text_field="session",
        max_seq_length=max_seq_length,
        packing=False,
        args=training_arguments,
    )

    trainer.train()
    trainer.save_state()
    trainer.save_model(output_dir)
    wandb.finish()


if __name__ == "__main__":
    fire.Fire(train)


================================================
FILE: finetune_lora.py
================================================
import fire
import torch
import wandb
from datasets import Dataset
from loguru import logger
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, MistralForCausalLM, TrainingArguments
from trl import SFTTrainer

from utils.finetune_utils import DataCollatorForLanguageModelingChatML, prepare_dataset, print_trainable_parameters


def train(
    model_name_or_path: str = "ehartford/dolphin-2.2.1-mistral-7b",
    data_path: str = "./data/messages.json",
    output_dir: str = "./weights/LoRA/",
    batch_size: int = 16,
    micro_batch_size: int = 8,
    num_epochs: int = 3,
    lora_r: int = 32,
    lora_alpha: int = 16,
    lora_dropout: float = 0.05,
    learning_rate: float = 2e-4,
    lr_scheduler_type: str = "cosine",
    warmup_ratio: float = 0.03,
    weight_decay: float = 0.001,
    max_seq_length: int = 1024,
    wandb_project: str = "doppelganger",
):
    gradient_accumulation_steps = batch_size // micro_batch_size

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    data_collator = DataCollatorForLanguageModelingChatML(tokenizer=tokenizer)

    dataset = Dataset.from_dict({"session": prepare_dataset(data_path)})

    # Debug: uncomment to inspect how the collator masks labels for one sample session
    # print(dataset[2500]["session"])
    # collator_res = data_collator([tokenizer(dataset["session"][2500], return_tensors="pt")["input_ids"][0]])
    # print(tokenizer.decode(collator_res["labels"][0][collator_res["labels"][0] != -100]))

    model: MistralForCausalLM = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=False,
        ),
        device_map={"": 0},
    )
    model.config.use_cache = False  # caching is incompatible with gradient checkpointing below
    model.gradient_checkpointing_enable()

    peft_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
    )

    model = prepare_model_for_kbit_training(model)
    logger.info(model)

    model = get_peft_model(model, peft_config)
    logger.info(model)
    print_trainable_parameters(model)

    wandb.init(project=wandb_project, job_type="train", anonymous="allow")

    training_arguments = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        optim="paged_adamw_8bit",
        save_steps=500,
        logging_steps=10,
        logging_first_step=True,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        fp16=False,
        bf16=False,
        max_grad_norm=0.3,
        max_steps=-1,
        warmup_ratio=warmup_ratio,
        group_by_length=True,
        lr_scheduler_type=lr_scheduler_type,
        report_to=["wandb"],
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        peft_config=peft_config,
        max_seq_length=max_seq_length,
        dataset_text_field="session",
        tokenizer=tokenizer,
        args=training_arguments,
        packing=False,
        data_collator=data_collator,
    )

    trainer.train()
    trainer.model.save_pretrained(output_dir)
    wandb.finish()


if __name__ == "__main__":
    fire.Fire(train)


================================================
FILE: prepare_dataset.py
================================================
import json
from datetime import datetime, timedelta
from typing import List, Literal, Optional, Tuple

import fire
from loguru import logger
from pydantic import BaseModel
from tqdm import tqdm


class Message(BaseModel):
    date: datetime
    author: str
    text: str


class Chat(BaseModel):
    name: str
    type: Literal["personal_chat", "private_group", "private_supergroup"]
    messages: List[Message]
    sessions: Optional[List[List[Message]]] = []


def load_chats(path: str) -> Tuple[List[Chat], Tuple[int | None, str | None]]:
    chats: List[Chat] = []
    target_id, target_name = None, None
    logger.info(f"Loading chats from '{path}'...")
    with open(path, "r") as f:
        for chat in json.load(f)["chats"]["list"]:
            # A chat without a "name" field is 'Saved Messages', from which
            # we can extract the id and name of the target person
            if "name" not in chat:
                target_id = int(chat["id"])
                target_name = str(next(msg for msg in chat["messages"] if msg["from_id"] == f"user{target_id}")["from"])
            # An empty name means we encountered a "Deleted Account",
            # which we skip; otherwise we parse the chat's messages
            elif chat["name"]:
                messages = [
                    Message(
                        date=msg["date"],
                        author=msg["from"],
                        text="".join([text_entity["text"] for text_entity in msg["text_entities"]])
                        + msg.get("sticker_emoji", ""),
                    )
                    for msg in chat["messages"]
                    if "from" in msg and msg["from"] and (msg["text_entities"] or "sticker_emoji" in msg)
                ]
                if messages:
                    chat = Chat(name=chat["name"], type=chat["type"], messages=messages)
                    chats.append(chat)
    logger.info(f"Found {len(chats)} chats in file '{path}'")
    if not target_name:
        logger.warning(f"Was not able to detect target name from 'Saved Messages'!")
    return chats, (target_id, target_name)


def transform_chats(
    input: str,
    output: str,
    target_name: str | None = None,
    last_x_months: int = 60,
    session_minutes_threshold: int = 10,
    concat_one_user_messages_delimeter: str = "\n>>> ",
):
    """
    Transforms chat histories from a .json Telegram export into a .json file with a list of sessions.
    A session is a list of messages, where each message is a dict with fields 'author' and 'text'.

    :param input: Path to the .json Telegram export, usually called result.json
    :param output: Path to the output .json file
    :param target_name: The name of the person to target. This person will be present in every session. If empty, the script tries to detect it from "Saved Messages"
    :param last_x_months: Number of past months to take messages from
    :param session_minutes_threshold: Gap in minutes below which consecutive messages belong to the same session
    :param concat_one_user_messages_delimeter: Users might type several messages one after another; they are concatenated using this delimiter
    """
    chats, (_, extracted_target_name) = load_chats(input)
    if not target_name:
        target_name = extracted_target_name
    logger.info(f"Preparing dataset for user with name '{target_name}'...")

    # Dropping all messages which are older than given date
    for chat in chats:
        chat.messages = [msg for msg in chat.messages if msg.date > datetime.now() - timedelta(days=last_x_months * 30)]
    chats = [chat for chat in chats if chat.messages]
    logger.info(f"After filtering by date, there are {len(chats)} chats left")

    # Create sessions for each chat by combining messages within
    # session_minutes_threshold into one session
    for chat in chats:
        sessions = []
        current_session = []
        for msg in chat.messages:
            # total_seconds() rather than .seconds, which wraps at 24h and would merge multi-day gaps
            if not current_session or (msg.date - current_session[-1].date).total_seconds() / 60 < session_minutes_threshold:
                current_session.append(msg)
            else:
                sessions.append(current_session)
                current_session = [msg]
        if current_session:
            sessions.append(current_session)
        chat.sessions = sessions
    logger.info(f"Combined messages into sessions")

    # Combine consecutive messages from single user into one message
    for chat in chats:
        sessions = []
        for session in chat.sessions:
            current_session = []
            current_message = session[0]
            current_message.text = concat_one_user_messages_delimeter.lstrip() + current_message.text
            for msg in session[1:]:
                if msg.author == current_message.author:
                    current_message.text += concat_one_user_messages_delimeter + msg.text
                else:
                    current_session.append(current_message)
                    current_message = msg
                    current_message.text = concat_one_user_messages_delimeter.lstrip() + current_message.text
            current_session.append(current_message)
            sessions.append(current_session)
        chat.sessions = sessions
    logger.info(f"Combined consecutive messages from single user into one message")

    # Only keep sessions that contain messages from target_name
    # (the 1st message does not count, as we can't use it for training)
    for chat in chats:
        chat.sessions = [session for session in chat.sessions if any(msg.author == target_name for msg in session[1:])]

    # # Cut off messages in each session by last message from target_name
    # for chat in chats:
    #     for session in chat.sessions:
    #         session[:] = session[: max(i for i, msg in enumerate(session) if msg.author == target_name) + 1]

    # Remove date from messages
    for chat in chats:
        for session in chat.sessions:
            for msg in session:
                del msg.date

    # Write all sessions to a single .json file
    all_sessions = []
    for chat in chats:
        for session in chat.sessions:
            all_sessions.append(session)
    with open(output, "w") as f:
        json.dump(
            [[{"author": msg.author, "text": msg.text} for msg in session] for session in all_sessions],
            f,
            indent=4,
            ensure_ascii=False,
        )
    logger.info(
        f"Took {len(all_sessions)} sessions from {len(chats)} chats and wrote them to '{output}'. Average session length is {round(sum(len(session) for session in all_sessions) / len(all_sessions), 2)} messages"
    )


if __name__ == "__main__":
    fire.Fire(transform_chats)


================================================
FILE: requirements.txt
================================================
scipy
torch
transformers
peft
accelerate
bitsandbytes
trl
datasets
sentencepiece
wandb
loguru
tqdm
fire
pydantic


================================================
FILE: utils/finetune_utils.py
================================================
import json
from typing import Any, Dict, List, Union

from loguru import logger
from transformers import DataCollatorForLanguageModeling


def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    logger.info(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {round(100 * trainable_params / all_param, 2)}"
    )


def prepare_dataset(path: str) -> List[str]:
    with open(path, "r") as f:
        sessions = json.load(f)

    final_sessions = []
    for session in sessions:
        session_str = "\n".join([f"<|im_start|>{msg['author']}\n{msg['text']}<|im_end|>" for msg in session])
        final_sessions.append(session_str)

    return final_sessions


class DataCollatorForLanguageModelingChatML(DataCollatorForLanguageModeling):
    """
    Data collator for [ChatML](https://github.com/openai/openai-python/blob/284c1799070c723c6a553337134148a7ab088dd8/chatml.md) format, like:
    ```
    <|im_start|>John Smith
    >>> damn, can't get around the 135 time limit
    >>> trying to do everything super optimally, but no luck<|im_end|>
    <|im_start|>Alexander Smirnov
    >>> yeah same
    >>> you still going with the same idea?<|im_end|>
    ```
    It sets labels to `ignore_index` for every header line, i.e. tokens from '<|im_start|>'
    through the following newline: exactly the lines that do not start with '>>>'. Loss is
    therefore computed only on the message text.

    Reference data collator implementation: [DataCollatorForCompletionOnlyLM](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L56)

    Args:
        mlm (`bool`, *optional*, defaults to `False`): Whether or not to use masked language modeling in the underlying
            `DataCollatorForLanguageModeling` class. Note that this option currently has no effect but is present
             for flexibility and backwards-compatibility.
        ignore_index (`int`, *optional*, defaults to `-100`):
            The label value assigned to masked tokens so that the loss ignores them
    """

    def __init__(
        self,
        *args,
        mlm: bool = False,
        ignore_index: int = -100,
        **kwargs,
    ):
        super().__init__(*args, mlm=mlm, **kwargs)
        self.ignore_index = ignore_index
        self.start_token = self.tokenizer.encode("<|im_start|>", add_special_tokens=False)[0]
        self.end_token = self.tokenizer.encode("<|im_end|>", add_special_tokens=False)[0]
        self.new_line_token = self.tokenizer.encode("\n", add_special_tokens=False)[-1]
        self.bos_token = self.tokenizer.bos_token_id

    def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
        batch = super().torch_call(examples)

        for i in range(len(examples)):
            if_start = False  # True while we are inside a '<|im_start|>Author' header line
            for j in range(len(batch["labels"][i])):
                token = batch["labels"][i][j].item()

                if token == self.start_token:
                    if_start = True

                # Mask header tokens (and BOS) so no loss is computed on them
                if if_start or token == self.bos_token:
                    batch["labels"][i][j] = self.ignore_index

                # The newline closes the header; message text keeps its labels
                if token == self.new_line_token:
                    if_start = False

        return batch
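

================================================
USAGE SKETCH: DataCollatorForLanguageModelingChatML
================================================
A short sanity check, not part of the repository: it mirrors the commented-out debug lines in finetune_lora.py and shows that the collator masks ChatML headers while keeping message text. The tokenizer name is an assumption; any tokenizer that knows the ChatML tokens works.

from transformers import AutoTokenizer

from utils.finetune_utils import DataCollatorForLanguageModelingChatML

# Assumed base model; finetune_lora.py uses the same one by default.
tokenizer = AutoTokenizer.from_pretrained("ehartford/dolphin-2.2.1-mistral-7b")
collator = DataCollatorForLanguageModelingChatML(tokenizer=tokenizer)

session = (
    "<|im_start|>John Smith\n"
    ">>> damn, can't get around the 135 time limit<|im_end|>\n"
    "<|im_start|>Alexander Smirnov\n"
    ">>> yeah same<|im_end|>"
)
batch = collator([tokenizer(session, return_tensors="pt")["input_ids"][0]])

# Decoding only the unmasked labels should drop the '<|im_start|>Author' header lines.
labels = batch["labels"][0]
print(tokenizer.decode(labels[labels != -100]))
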
SYMBOL INDEX (12 symbols across 4 files)

FILE: finetune_full.py
  function smart_tokenizer_and_embedding_resize (line 21) | def smart_tokenizer_and_embedding_resize(
  function train (line 42) | def train(

FILE: finetune_lora.py
  function train (line 13) | def train(

FILE: prepare_dataset.py
  class Message (line 11) | class Message(BaseModel):
  class Chat (line 17) | class Chat(BaseModel):
  function load_chats (line 24) | def load_chats(path: str) -> Tuple[List[Chat], Tuple[int | None, str | N...
  function transform_chats (line 57) | def transform_chats(

FILE: utils/finetune_utils.py
  function print_trainable_parameters (line 8) | def print_trainable_parameters(model):
  function prepare_dataset (line 20) | def prepare_dataset(path: str) -> List[str]:
  class DataCollatorForLanguageModelingChatML (line 32) | class DataCollatorForLanguageModelingChatML(DataCollatorForLanguageModel...
    method __init__ (line 55) | def __init__(
    method torch_call (line 69) | def torch_call(self, examples: List[Union[List[int], Any, Dict[str, An...