Full Code of AstraBert/everything-ai for AI

main d8cc4f2092b6 cached

36 files

128.7 KB

32.7k tokens

72 symbols

1 requests

Download .txt

Repository: AstraBert/everything-ai
Branch: main
Commit: d8cc4f2092b6
Files: 36
Total size: 128.7 KB

Directory structure:
gitextract_dl7fgpnu/

├── .github/
│   └── FUNDING.yml
├── .gitignore
├── .v0_1_1/
│   ├── README.md
│   ├── docker/
│   │   ├── Dockerfile
│   │   ├── build_command.sh
│   │   ├── chat.py
│   │   ├── requirements.txt
│   │   └── utils.py
│   └── scripts/
│       ├── gemma-for-datasciences.ipynb
│       └── gemma_for_datasciences.py
├── LICENSE
├── README.md
├── _config.yml
├── compose.yaml
└── docker/
    ├── Dockerfile
    ├── agnostic_text_generation.py
    ├── audio_classification.py
    ├── autotrain_interface.py
    ├── build_your_llm.py
    ├── chat_your_llm.py
    ├── fal_img2img.py
    ├── image_classification.py
    ├── image_generation.py
    ├── image_generation_pollinations.py
    ├── image_to_text.py
    ├── llama_cpp_int.py
    ├── protein_folding_with_esm.py
    ├── requirements.txt
    ├── retrieval_image_search.py
    ├── retrieval_text_generation.py
    ├── select_and_run.py
    ├── spaces_api_supabase.py
    ├── speech_recognition.py
    ├── text_summarization.py
    ├── utils.py
    └── video_generation.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/FUNDING.yml
================================================
# These are supported funding model platforms

github: [AstraBert]


================================================
FILE: .gitignore
================================================
flagged/
docker/__pycache__
docker/flagged
qdrant_storage/

================================================
FILE: .v0_1_1/README.md
================================================
# everything-rag

>_How was this README generated? Levearaging the power of AI with **reAIdme**, an HuggingChat assistant based on meta-llama/Llama-2-70b-chat-hf._
_Go and give it a try [here](https://hf.co/chat/assistant/660d9a4f590a7924eed02a32)!_ 🤖

<div align="center">
    <img src="https://img.shields.io/github/languages/top/AstraBert/everything-rag" alt="GitHub top language">
   <img src="https://img.shields.io/github/commit-activity/t/AstraBert/everything-rag" alt="GitHub commit activity">
   <img src="https://img.shields.io/badge/everything_rag-stable-green" alt="Static Badge">
   <img src="https://img.shields.io/badge/Release-v0.1.1-purple" alt="Static Badge">
   <img src="https://img.shields.io/badge/Docker_image_size-6.6GB-red" alt="Static Badge">
   <img src="https://img.shields.io/badge/Supported_platforms-linux/amd64-brown" alt="Static Badge">
   <div>
        <a href="https://huggingface.co/spaces/as-cle-bert/everything-rag"><img src="./data/example_chat.png" alt="Example chat" align="center"></a>
        <p><i>Example chat with everything-rag, mediated by google/flan-t5-base</i></p>
   </div>
</div>


### Table of Contents

0. [TL;DR](#tldr)
1. [Introduction](#introduction)
2. [Inspiration](#inspiration)
2. [Getting Started](#getting-started)
3. [Using the Chatbot](#using-the-chatbot)
4. [Troubleshooting](#troubleshooting)
5. [Contributing](#contributing)
6. [Upcoming features](#upcoming-features) 
7. [References](#reference)

## TL;DR

* This documentation is soooooo long, I want to get my hands dirty!!!
    >You can try out everything-rag the [dedicated HuggingFace space](https://huggingface.co/spaces/as-cle-bert/everything-rag), based on google/flan-t5-large.

<div align="center">
    <iframe
        src="https://as-cle-bert-everything-rag.hf.space"
        frameborder="0"
        width="850"
        height="450"
    ></iframe>
</div>

## Introduction

Introducing **everything-rag**, your fully customizable and local chatbot assistant! 🤖

With everything-rag, you can:

1. Use virtually any LLM you want: Switch between different LLMs like _gemma-7b_ or _llama-7b_ to suit your needs.
2. Use your own data: everything-rag can work with any data you provide, whether it's a PDF about data sciences or a document about pallas' cats!🐈
3. Enjoy 100% local and 100% free functionality: No need for hosted APIs or pay-as-you-go services. everything-rag is completely free to use and runs on your desktop. Plus, with the chat_history functionality in ConversationalRetrievalChain, you can easily retrieve and review previous conversations with your chatbot, making it even more convenient to use.

While everything-rag offers many benefits, there are a couple of limitations to keep in mind:

1. Performance-critical tasks: Loading large models (>1~2 GB) and generating text can be resource-intensive, so it's recommended to have at least 16GB RAM and 4 CPU cores for optimal performance.
2. Small LLMs can still allucinate: While large LLMs like _gemma-7b_ and _llama-7b_ tend to produce better results, smaller models like _openai-community/gpt2_ can still produce suboptimal responses in certain situations.

In summary, everything-rag is a simple, customizable, and local chatbot assistant that offers a wide range of features and capabilities. By leveraging the power of RAG, everything-rag offers a unique and flexible chatbot experience that can be tailored to your specific needs and preferences. Whether you're looking for a simple chatbot to answer basic questions or a more advanced conversational AI to engage with your users, everything-rag has got you covered.😊

## Inspiration

This project is a humble and modest carbon-copy of its main and true inspirations, i.e. [Jan.ai](https://jan.ai/), [Cheshire Cat AI](https://cheshirecat.ai/), [privateGPT](https://privategpt.io/) and many other projects that focus on making LLMs (and AI in general) open-source and easily accessible to everyone. 

## Getting Started

You can do two things:

- Play with generation on [Kaggle](https://www.kaggle.com/code/astrabertelli/gemma-for-datasciences)
- Clone this repository, head over to [the python script](./scripts/gemma_for_datasciences.py) and modify everything to your needs!
- Docker installation (🥳**FULLY IMPLEMENTED**): you can install everything-rag through docker image and running it thanks do Docker by following these really simple commands:

```bash
docker pull ghcr.io/astrabert/everything-rag:latest
docker run -p 7860:7860 everything-rag:latest -m microsoft/phi-2 -t text-generation
```
- **IMPORTANT NOTE**: running the script within `docker run` does not log the port on which the app is running until you press `Ctrl+C`, but in that moment it also interrupt the execution! The app will run on port `0.0.0.0:7860` (or `localhost:7860` if your browser is Windows-based), so just make sure to open your browser on that port and to refresh it after 30s to 1 or 2 mins, when the model and the tokenizer should be loaded and the app should be ready to work!

- As you can see, you just need to specify the LLM model and its task (this is mandatory). Keep in mind that, for what concerns v0.1.1, everything-rag supports only text-generation and text2text-generation. For these two tasks, you can use virtually *any* model from HuggingFace Hub: the sole recommendation is to watch out for your disk space, RAM and CPU power, LLMs can be quite resource-consuming!

## Using the Chatbot

### GUI

The chatbot has a brand-new GradIO-based interface that runs on local server. You can interact by uploading directly your pdf files and/or sending messages, all by running:

```bash
python3 scripts/chat.py -m provider/modelname -t task
```

The suggested workflow is, nevertheless, the one that exploits Docker.

### Code breakdown - notebook

Everything is explained in [the dedicated notebook](./scripts/gemma-for-datasciences.ipynb), but here's a brief breakdown of the code:

1. The first section imports the necessary libraries, including Hugging Face Transformers, langchain-community, and tkinter.
2. The next section installs the necessary dependencies, including the gemma-2b model, and defines some useful functions for making the LLM-based data science assistant work.
3. The create_a_persistent_db function creates a persistent database from a PDF file, using the PyPDFLoader to split the PDF into smaller chunks and the Hugging Face embeddings to transform the text into numerical vectors. The resulting database is stored in a LocalFileStore.
4. The just_chatting function implements a chat system using the Hugging Face model and the persistent database. It takes a query, tokenizes it, and passes it to the model to generate a response. The response is then returned as a dictionary of strings.
5. The chat_gui class defines a simple chat GUI that displays the chat history and allows the user to input queries. The send_message function is called when the user presses the "Send" button, and it sends the user's message to the just_chatting function to get a response.
6. The script then creates a root Tk object and instantiates a ChatGUI object, which starts the main loop.

Et voilà, your chatbot is up and running!🦿

## Troubleshooting

### Common Issues Q&A

* Q: The chatbot is not responding😭
    > A: Make sure that the PDF document is in the specified path and that the database has been created successfully. 
* Q: The chatbot is taking soooo long🫠
    > A: This is quite common with resource-limited environments that deal with too large or too small models: large models require **at least** 32 GB RAM and >8 core CPU, whereas small model can easily be allucinating and producing responses that are endless repetitions of the same thing! Check *penalty_score* parameter to avoid this. **try rephrasing the query and be as specific as possible**
* Q: My model is allucinating and/or repeating the same sentence over and over again😵‍💫
    > A: This is quite common with small or old models: check *penalty_score* and *temperature* parameter to avoid this. 
* Q: The chatbot is giving incorrect/non-meaningful answers🤥
    >A: Check that the PDF document is relevant and up-to-date. Also, **try rephrasing the query and be as specific as possible**
* Q: An error occurred while generating the answer💔
    >A: This frequently occurs when your (small) LLM has a limited maximum hidden size (generally 512 or 1024) and the context that the retrieval-augmented chain produces goes beyond that maximum. You could, potentially, modify the configuration of the model, but this would mean dramatically increase its resource consumption, and your small laptop is not prepared to take it, trust me!!! A solution, if you have enough RAM and CPU power, is to switch to larger LLMs: they do not have problems in this sense.

## Upcoming features🚀

- [ ] Multi-lingual support (expected for **version 0.2.0**)

- [ ] More text-based tasks: question answering, summarisation (expected for **version 0.3.0**)

- [ ] Computer vision: Image-to-text, image generation, image segmentation... (expected for **version 1.0.0**)

## Contributing


Contributions are welcome! If you would like to improve the chatbot's functionality or add new features, please fork the repository and submit a pull request.

## Reference


* [Hugging Face Transformers](https://github.com/huggingface/transformers)
* [Langchain-community](https://github.com/langchain-community/langchain-community)
* [Tkinter](https://docs.python.org/3/library/tkinter.html)
* [PDF document about data science](https://www.kaggle.com/datasets/astrabertelli/what-is-datascience-docs)
* [GradIO](https://www.gradio.app/)

## License

This project is licensed under the Apache 2.0 License.

If you use this work for your projects, please consider citing the author [Astra Bertelli](http://astrabert.vercel.app).


================================================
FILE: .v0_1_1/docker/Dockerfile
================================================
# Use an official Python runtime as a parent image
FROM python:3.10-slim-bookworm

# Set the working directory in the container to /app
WORKDIR /app

# Add the current directory contents into the container at /app
ADD . /app

# Update and install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    libpq-dev \
    libffi-dev \
    libssl-dev \
    musl-dev \
    libxml2-dev \
    libxslt1-dev \
    zlib1g-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN python3 -m pip cache purge
RUN python3 -m pip install --no-cache-dir -r requirements.txt


# Expose the port that the application will run on
EXPOSE 7860

# Set the entrypoint with a default command and allow the user to override it
ENTRYPOINT ["python3", "chat.py"]

================================================
FILE: .v0_1_1/docker/build_command.sh
================================================
docker buildx build \
--label org.opencontainers.image.title=everything-rag \
--label org.opencontainers.image.description='Introducing everything-rag, your fully customizable and local chatbot assistant!' \
--label org.opencontainers.image.url=https://github.com/AstraBert/everything-rag \
--label org.opencontainers.image.source=https://github.com/AstraBert/everything-rag --label org.opencontainers.image.version=0.1.7 \
--label org.opencontainers.image.created=2024-04-07T12:39:11.393Z \
--label org.opencontainers.image.licenses=Apache-2.0 \
--platform linux/amd64 \
--tag ghcr.io/astrabert/everything-rag:latest \
--tag ghcr.io/astrabert/everything-rag:0.1.1 \
--push .

================================================
FILE: .v0_1_1/docker/chat.py
================================================
import gradio as gr
import os
import time
from utils import *

vectordb = ""

def generate_welcome_message():
    return (None, "Hello! Welcome to the chatbot. You can enter a message or upload a file.")

def print_like_dislike(x: gr.LikeData):
    print(x.index, x.value, x.liked)

def add_message(history, message):
    if len(message["files"]) > 0:
        history.append((message["files"], None))
    if message["text"] is not None and message["text"] != "":
        history.append((message["text"], None))
    return history, gr.MultimodalTextbox(value=None, interactive=False)


def bot(history):
    global vectordb
    global tsk
    if type(history[-1][0]) != tuple:
        if vectordb == "":
            pipe = pipeline(tsk, tokenizer=tokenizer, model=model)
            response = pipe(history[-1][0])[0]
            response = response["generated_text"]
            history[-1][1] = ""
            for character in response:
                history[-1][1] += character
                time.sleep(0.05)
                yield history
        else:
            try:
                response = just_chatting(task=tsk, model=model, tokenizer=tokenizer, query=history[-1][0], vectordb=vectordb, chat_history=[convert_none_to_str(his) for his in history])["answer"]
                history[-1][1] = ""
                for character in response:
                    history[-1][1] += character
                    time.sleep(0.05)
                    yield history
            except Exception as e:
                response = f"Sorry, the error '{e}' occured while generating the response; check [troubleshooting documentation](https://astrabert.github.io/everything-rag/#troubleshooting) for more"
    if type(history[-1][0]) == tuple:
        filelist = []
        for i in history[-1][0]:
            filelist.append(i)
        if len(filelist) > 1:
            finalpdf = merge_pdfs(filelist)
        else:
            finalpdf = filelist[0]
        vectordb = create_a_persistent_db(finalpdf, os.path.dirname(finalpdf)+"_localDB", os.path.dirname(finalpdf)+"_embcache")
        response = "VectorDB was successfully created, now you can ask me anything about the document you uploaded!😊"
        history[-1][1] = ""
        for character in response:
            history[-1][1] += character
            time.sleep(0.05)
            yield history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(
        [[None, "Hi, I'm **everything-rag**🤖.\nI'm here to assist you and let you chat with _your_ pdfs!\nCheck [my website](https://astrabert.github.io/everything-rag/) for troubleshooting and documentation reference\nHave fun!😊"]],
        label="everything-rag",
        elem_id="chatbot",
        bubble_full_width=False,
    )

    chat_input = gr.MultimodalTextbox(interactive=True, file_types=["pdf"], placeholder="Enter message or upload file...", show_label=False)

    chat_msg = chat_input.submit(add_message, [chatbot, chat_input], [chatbot, chat_input])
    bot_msg = chat_msg.then(bot, chatbot, chatbot, api_name="bot_response")
    bot_msg.then(lambda: gr.MultimodalTextbox(interactive=True), None, [chat_input])

    chatbot.like(print_like_dislike, None, None)
    clear = gr.ClearButton(chatbot)

demo.queue()
if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", share=False)

	

================================================
FILE: .v0_1_1/docker/requirements.txt
================================================
langchain-community==0.0.13 
langchain==0.1.1 
pypdf==3.17.4
sentence_transformers==2.2.2
chromadb==0.4.22
cryptography>=3.1
gradio
transformers
trl 
peft

================================================
FILE: .v0_1_1/docker/utils.py
================================================
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM, pipeline
import time
from langchain_community.llms import HuggingFacePipeline
from langchain.storage import LocalFileStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
import os
from pypdf import PdfMerger
from argparse import ArgumentParser


argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

argparse.add_argument(
    "-t",
    "--task",
    help="Task for the model: for now supported task are ['text-generation', 'text2text-generation']",
    required=True,
)

args = argparse.parse_args()


mod = args.model
tsk = args.task

mod = mod.replace("\"", "").replace("'", "")
tsk = tsk.replace("\"", "").replace("'", "")

TASK_TO_MODEL = {"text-generation": AutoModelForCausalLM, "text2text-generation": AutoModelForSeq2SeqLM}

if tsk not in TASK_TO_MODEL:
    raise Exception("Unsopported task! Supported task are ['text-generation', 'text2text-generation']")

def merge_pdfs(pdfs: list):
    merger = PdfMerger()
    for pdf in pdfs:
        merger.append(pdf)
    merger.write(f"{pdfs[-1].split('.')[0]}_results.pdf")
    merger.close()
    return f"{pdfs[-1].split('.')[0]}_results.pdf"

def create_a_persistent_db(pdfpath, dbpath, cachepath) -> None:
    """
    Creates a persistent database from a PDF file.

    Args:
        pdfpath (str): The path to the PDF file.
        dbpath (str): The path to the storage folder for the persistent LocalDB.
        cachepath (str): The path to the storage folder for the embeddings cache.
    """
    print("Started the operation...")
    a = time.time()
    loader = PyPDFLoader(pdfpath)
    documents = loader.load()

    ### Split the documents into smaller chunks for processing
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)

    ### Use HuggingFace embeddings for transforming text into numerical vectors
    ### This operation can take a while the first time but, once you created your local database with
    ### cached embeddings, it should be a matter of seconds to load them!
    embeddings = HuggingFaceEmbeddings()
    store = LocalFileStore(
        os.path.join(
            cachepath, os.path.basename(pdfpath).split(".")[0] + "_cache"
        )
    )
    cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
        underlying_embeddings=embeddings,
        document_embedding_cache=store,
        namespace=os.path.basename(pdfpath).split(".")[0],
    )

    b = time.time()
    print(
        f"Embeddings successfully created and stored at {os.path.join(cachepath, os.path.basename(pdfpath).split('.')[0]+'_cache')} under namespace: {os.path.basename(pdfpath).split('.')[0]}"
    )
    print(f"To load and embed, it took: {b - a}")

    persist_directory = os.path.join(
        dbpath, os.path.basename(pdfpath).split(".")[0] + "_localDB"
    )
    vectordb = Chroma.from_documents(
        documents=texts,
        embedding=cached_embeddings,
        persist_directory=persist_directory,
    )
    c = time.time()
    print(
        f"Persistent database successfully created and stored at {os.path.join(dbpath, os.path.basename(pdfpath).split('.')[0] + '_localDB')}"
    )
    print(f"To create a persistent database, it took: {c - b}")
    return vectordb

def convert_none_to_str(l: list):
    newlist = []
    for i in range(len(l)):
        if l[i] is None or type(l[i])==tuple:
            newlist.append("")
        else:
            newlist.append(l[i])
    return tuple(newlist)

def just_chatting(
    task,
    model,
    tokenizer,
    query,
    vectordb,
    chat_history=[]
):
    """
    Implements a chat system using Hugging Face models and a persistent database.

    Args:
        task (str): Task for the pipeline; for now supported task are ['text-generation', 'text2text-generation']
        model (AutoModelForCausalLM): Hugging Face model, already loaded and prepared.
        tokenizer (AutoTokenizer): Hugging Face tokenizer, already loaded and prepared.
        model_task (str): Task for the Hugging Face model.
        persistent_db_dir (str): Directory for the persistent database.
        embeddings_cache (str): Path to cache Hugging Face embeddings.
        pdfpath (str): Path to the PDF file.
        query (str): Question by the user
        vectordb (ChromaDB): vectorstorer variable for retrieval.
        chat_history (list): A list with previous questions and answers, serves as context; by default it is empty (it may make the model allucinate)
    """
    ### Create a text-generation pipeline and connect it to a ConversationalRetrievalChain
    pipe = pipeline(task,
                    model=model,
                    tokenizer=tokenizer,
                    max_new_tokens = 2048,
                    repetition_penalty = float(1.2),
    )

    local_llm = HuggingFacePipeline(pipeline=pipe)
    llm_chain = ConversationalRetrievalChain.from_llm(
        llm=local_llm,
        chain_type="stuff",
        retriever=vectordb.as_retriever(search_kwargs={"k": 1}),
        return_source_documents=False,
    )
    rst = llm_chain({"question": query, "chat_history": chat_history})
    return rst


try:
    tokenizer = AutoTokenizer.from_pretrained(
        mod,
    )


    model = TASK_TO_MODEL[tsk].from_pretrained(
        mod,
    )
except Exception as e:
    import sys
    print(f"The error {e} occured while handling model and tokenizer loading: please ensure that the model you provided was correct and suitable for the specified task. Be also sure that the HF repository for the loaded model contains all the necessary files.", file=sys.stderr)
    sys.exit(1)




================================================
FILE: .v0_1_1/scripts/gemma-for-datasciences.ipynb
================================================
{"metadata":{"colab":{"provenance":[]},"kernelspec":{"name":"python3","display_name":"Python 3","language":"python"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":8018995,"sourceType":"datasetVersion","datasetId":4724972},{"sourceId":11270,"sourceType":"modelInstanceVersion","isSourceIdPinned":true,"modelInstanceId":6216}],"dockerImageVersionId":30673,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# gemma-2b AS A DATA SCIENCE TEACHER\n\n## 100% LOCAL, WITHOUT FINETUNING, WITH _YOUR OWN DATA_\n\n> _Information is the oil of the 21st century, and analytics is the combustion engine – Peter Sondergaard (Senior Vice President and the Global Head of Research at Gartner Inc)_\n\nIn a world where data are becoming more important with each day passing, data science is a fundamental discipline to master in order to understand and solve the upcoming challenges of the Big Data World.\n\nUnfortunately, data science is generally available to University-level students only, making it difficult for other people to access its concepts. This obstacle can be removed with the help of Large Language Models, such as _gemma-2b_.\n\nIn this notebook, we'll make our way through the jungle of data science thanks to _gemma-2b_, a simple pdf file titled **\"What is data science?\"** <a name=\"cite_ref-1\"></a>[<sup>[1]</sup>](#cite_note-1), ChromaDB vectorstores and Langchain, all elengatly written in python.\n\nThe final goal is to implement a simple, yet powerful, pipeline to generate a 100% local and fully-customizable LLM-based assistant that works with the user's data.\n\nLet's dive in!🛫\n\n\n<a name=\"cite_note-1\"></a>[<sup>[1]</sup>](#cite_ref-1) Brodie, Michael. (2019). What Is Data Science?. 10.1007/978-3-030-11821-1_8.","metadata":{"id":"cY7PiUMxHqWO"}},{"cell_type":"markdown","source":"# Build the environment\n\nFirst of all, we want everything set up the right way to work properly. To do so, we need to:\n\n1. Upload the pdf file in our workspace (we can simply create a dataset in Kaggle containing the pdf and add it as `input` to the notebook): in the following notebook example, we will name it \"/kaggle/input/what-is-datascience-docs/WhatisDataScienceFinalMay162018.pdf\". \n2. Install necessary dependencies\n3. Upload _gemma-2b_ model as Kaggle input\n4. Define useful functions to make our LLM-based data science assistant work","metadata":{"id":"YpQRNdd3NDT0"}},{"cell_type":"code","source":"# INSTALL NECESSARY DEPENDENCIES\n\n## Versions provided for the packages are not strict... Still, you may encounter issues if you use different ones\n\n! python3 -m pip install langchain-community==0.0.13 langchain==0.1.1 torch==2.1.2","metadata":{"id":"mTFqzkbzPQ5o","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# INSTALL NECESSARY DEPENDENCIES (pt2)\n\n! python3 -m pip install trl peft","metadata":{"id":"qb2GLlQZZan3","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# INSTALL NECESSARY DEPENDENCIES (pt3)\n\n! pip install pypdf==3.17.4","metadata":{"id":"GV6gIH8ucztn","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# INSTALL NECESSARY DEPENDENCIES (pt4)\n\n! pip install sentence_transformers==2.2.2","metadata":{"id":"lyEB5Rc_dOec","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# INSTALL NECESSARY DEPENDENCIES (pt6)\n\n! pip install chromadb==0.4.22","metadata":{"id":"B0onWX8heNch","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from kaggle_secrets import UserSecretsClient\nuser_secrets = UserSecretsClient()\nhf_token = user_secrets.get_secret(\"HF_TOKEN\")","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# IMPORT gemma-2b MODEL FROM KAGGLE\n\n## To import the model, we'll be uploading the model directly from Kaggle input\n\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nmodel_checkpoint = \"/kaggle/input/gemma/transformers/2b/1\"\n\ntokenizer = AutoTokenizer.from_pretrained(model_checkpoint, token=hf_token)\nmodel = AutoModelForCausalLM.from_pretrained(model_checkpoint, token=hf_token)","metadata":{"id":"lcy5ImzmSLcq","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# DEFINE USEFUL FUNCTIONS\n\n## To chat, we'll need to create a vectorized database from our pdf and then build\n## a retrieval Q&A chain\n\nimport time\nfrom langchain_community.llms import HuggingFacePipeline\nfrom langchain.storage import LocalFileStore\nfrom langchain.embeddings import CacheBackedEmbeddings\nfrom langchain_community.vectorstores import Chroma\nfrom langchain.text_splitter import CharacterTextSplitter\nfrom langchain_community.document_loaders import PyPDFLoader\nfrom langchain_community.embeddings import HuggingFaceEmbeddings\nfrom langchain.chains import ConversationalRetrievalChain\nfrom transformers import pipeline, AutoTokenizer, AutoModelForCausalLM\nimport os\n\ndef create_a_persistent_db(pdfpath, dbpath, cachepath) -> None:\n    \"\"\"\n    Creates a persistent database from a PDF file.\n\n    Args:\n        pdfpath (str): The path to the PDF file.\n        dbpath (str): The path to the storage folder for the persistent LocalDB.\n        cachepath (str): The path to the storage folder for the embeddings cache.\n    \"\"\"\n    print(\"Started the operation...\")\n    a = time.time()\n    loader = PyPDFLoader(pdfpath)\n    documents = loader.load()\n\n    ### Split the documents into smaller chunks for processing\n    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n    texts = text_splitter.split_documents(documents)\n\n    ### Use HuggingFace embeddings for transforming text into numerical vectors\n    ### This operation can take a while the first time but, once you created your local database with\n    ### cached embeddings, it should be a matter of seconds to load them!\n    embeddings = HuggingFaceEmbeddings()\n    store = LocalFileStore(\n        os.path.join(\n            cachepath, os.path.basename(pdfpath).split(\".\")[0] + \"_cache\"\n        )\n    )\n    cached_embeddings = CacheBackedEmbeddings.from_bytes_store(\n        underlying_embeddings=embeddings,\n        document_embedding_cache=store,\n        namespace=os.path.basename(pdfpath).split(\".\")[0],\n    )\n\n    b = time.time()\n    print(\n        f\"Embeddings successfully created and stored at {os.path.join(cachepath, os.path.basename(pdfpath).split('.')[0]+'_cache')} under namespace: {os.path.basename(pdfpath).split('.')[0]}\"\n    )\n    print(f\"To load and embed, it took: {b - a}\")\n\n    persist_directory = os.path.join(\n        dbpath, os.path.basename(pdfpath).split(\".\")[0] + \"_localDB\"\n    )\n    vectordb = Chroma.from_documents(\n        documents=texts,\n        embedding=cached_embeddings,\n        persist_directory=persist_directory,\n    )\n    c = time.time()\n    print(\n        f\"Persistent database successfully created and stored at {os.path.join(dbpath, os.path.basename(pdfpath).split('.')[0] + '_localDB')}\"\n    )\n    print(f\"To create a persistent database, it took: {c - b}\")\n    return vectordb\n\ndef just_chatting(\n    model,\n    tokenizer,\n    query,\n    vectordb,\n    chat_history=[]\n):\n    \"\"\"\n    Implements a chat system using Hugging Face models and a persistent database.\n\n    Args:\n        model (AutoModelForCausalLM): Hugging Face model, already loaded and prepared.\n        tokenizer (AutoTokenizer): Hugging Face tokenizer, already loaded and prepared.\n        model_task (str): Task for the Hugging Face model.\n        persistent_db_dir (str): Directory for the persistent database.\n        embeddings_cache (str): Path to cache Hugging Face embeddings.\n        pdfpath (str): Path to the PDF file.\n        query (str): Question by the user\n        vectordb (ChromaDB): vectorstorer variable for retrieval.\n        chat_history (list): A list with previous questions and answers, serves as context; by default it is empty (it may make the model allucinate)\n    \"\"\"\n    ### Create a text-generation pipeline and connect it to a ConversationalRetrievalChain\n    pipe = pipeline(\"text-generation\",\n                    model=model,\n                    tokenizer=tokenizer,\n                    max_new_tokens = 2048,\n                    repetition_penalty = float(10),\n    )\n\n    local_llm = HuggingFacePipeline(pipeline=pipe)\n    llm_chain = ConversationalRetrievalChain.from_llm(\n        llm=local_llm,\n        chain_type=\"stuff\",\n        retriever=vectordb.as_retriever(search_kwargs={\"k\": 1}),\n        return_source_documents=False,\n    )\n    rst = llm_chain({\"question\": query, \"chat_history\": chat_history})\n    return rst","metadata":{"id":"_8Tt0dtkgEfv","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Chat with the model\n\nTo chat with the model, we first have to build our local, persistent, database, and also compute embeddings: after that, we'll be able to chat with the model without problems!🚀","metadata":{"id":"pospjXN1a3lW"}},{"cell_type":"code","source":"# CREATE PERSISTENT DB\n\nfilepath = \"/kaggle/input/what-is-datascience-docs/WhatisDataScienceFinalMay162018.pdf\"\ndbpath = \"/kaggle/working/\"\ncachepath = \"/kaggle/working/\"\nvectordb = create_a_persistent_db(filepath, dbpath, cachepath)","metadata":{"id":"SSd1jia8bz5s","outputId":"ecc4c317-51f9-4cd3-a589-3138ccc39d23","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# CHAT WITH MODEL\n\nchat_history = []\nquery = \"Define datascience\"\nres = just_chatting(model, tokenizer, query, vectordb, chat_history=chat_history)\nchat_history.append([query, res[\"answer\"].replace(\"\\n\",\" \")])","metadata":{"id":"LxJHt3LneuGD","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"print(\" \".join[res[\"answer\"]])","metadata":{"id":"utFaudIyitNO","trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Implement a simple chat GUI (local only)\n\nWant to interact more directly with your model, without going through that pythonic stuff? Let's implement a very simple and rudimental chat GUI, based on builtin package `tkinter`, to achieve this goal!🤯","metadata":{"id":"JlmAbwghlVrO"}},{"cell_type":"code","source":"import tkinter as tk\nfrom tkinter import scrolledtext\n\nclass ChatGUI:\n    def __init__(self, master):\n        self.master = master\n        master.title(\"DataScienceAI\")\n\n        self.chat_history = scrolledtext.ScrolledText(master, wrap=tk.WORD, width=40, height=15)\n        self.chat_history.pack(padx=10, pady=10)\n\n        self.user_input = tk.Entry(master, width=40)\n        self.user_input.pack(padx=10, pady=10)\n\n        self.send_button = tk.Button(master, text=\"Send\", command=self.send_message)\n        self.send_button.pack(pady=10)\n\n        # Set up initial conversation\n        self.display_message(\"DataScienceAI: Hello! How can I help you today?\")\n\n    def send_message(self):\n        user_message = self.user_input.get()\n        self.display_message(f\"You: {user_message}\")\n        # Replace the next line with your chatbot logic to get a response\n        chatbot_response = f\"DataScienceAI: {just_chatting(model, tokenizer, user_message, vectordb)[\"answer\"].replace(\"\\n\",\" \")}\"\n        self.display_message(chatbot_response)\n        self.user_input.delete(0, tk.END)  # Clear the input field\n\n    def display_message(self, message):\n        self.chat_history.insert(tk.END, message + '\\n')\n        self.chat_history.see(tk.END)  # Scroll to the bottom\n\nif __name__ == \"__main__\":\n    root = tk.Tk()\n    chat_gui = ChatGUI(root)\n    root.mainloop()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Conclusions\n\nThis is it!\n\nWe built a simple assistant, fully customizable in terms of both the LLM employed (you can switch to _gemma-7b_ or to your favorite LLM) and the data you can make it work with (in this case is data sciences, but you can make it work also on a pdf about pallas' cats, if you want!)🐈.\n\nAnother important thing to note is that all of this is completely local, there is no need for hosted APIs, pay-as-you-go services or other things like that: everything is free to use, on your Desktop!\n\nThere are two main disadvantages in this approach: \n\n1. Performance-critical tasks, such as loading the model and making prediction, are heavily resource-dependent: to load big models (>1~2 GB) and to make them generate text, it is useful to have more than 16GB RAM and more than 4 CPU cores.\n2. Small (and old) models, such as _openai-community/gpt2_, can easily allucinate while generating text. This is generally prompt-dependent (meaning that they tend to produce trashy results on certain prompts more frequently than on other ones) and the issue almost totally resolves when employing large LLMs (_gemma-7b_ or _llama-7b_ would not-so-easily allucinate, for instance).\n\n### TLDR😵:\n\n**Pros**:\n- Simple and customizable\n- Use virtually any LLM you want\n- Use your own data\n- 100% local, 100% free, no payments or APIs\n\n**Cons**:\n- Performance might be resource-dependent for large LLMs (if you have >16GB RAM and >4 cores it shouldn't be a great problem)\n- Small LLMs can still allucinate","metadata":{"id":"hRy5ErJ_mkfV"}},{"cell_type":"markdown","source":"# References\n\n- Paul Mooney, Ashley Chow. (2024). Google – AI Assistants for Data Tasks with Gemma. Kaggle. https://kaggle.com/competitions/data-assistants-with-gemma\n- Brodie, Michael. (2019). What Is Data Science?. 10.1007/978-3-030-11821-1_8.\n","metadata":{"id":"wHWCmEp9oFsC"}}]}

================================================
FILE: .v0_1_1/scripts/gemma_for_datasciences.py
================================================
# -*- coding: utf-8 -*-

"""# gemma-2b AS A DATA SCIENCE TEACHER

## 100% LOCAL, WITHOUT FINETUNING, WITH _YOUR OWN DATA_

> _Information is the oil of the 21st century, and analytics is the combustion engine – Peter Sondergaard (Senior Vice President and the Global Head of Research at Gartner Inc)_

In a world where data are becoming more important with each day passing, data science is a fundamental discipline to master in order to understand and solve the upcoming challenges of the Big Data World.

Unfortunately, data science is generally available to University-level students only, making it difficult for other people to access its concepts. This obstacle can be removed with the help of Large Language Models, such as _gemma-2b_.

In this notebook, we'll make our way through the jungle of data science thanks to _gemma-2b_, a simple pdf file titled **"What is data science?"**, ChromaDB vectorstores and Langchain, all elengatly written in python.

The final goal is to implement a simple, yet powerful, pipeline to generate a 100% local and fully-customizable LLM-based assistant that works with the user's data.

Let's dive in!🛫



# Build the environment

First of all, we want everything set up the right way to work properly. To do so, we need to:

1. Upload the pdf file in our workspace (we can simply create a dataset in Kaggle containing the pdf and add it as `input` to the notebook): in the following notebook example, we will name it "/kaggle/input/what-is-datascience-docs/WhatisDataScienceFinalMay162018.pdf".
2. Install necessary dependencies
3. Upload _gemma-2b_ model as Kaggle input
4. Define useful functions to make our LLM-based data science assistant work
"""

# IMPORT gemma-2b MODEL FROM KAGGLE

## To import the model, we'll be uploading the model directly from Kaggle input

from transformers import AutoTokenizer, AutoModelForCausalLM
model_checkpoint = "/kaggle/input/gemma/transformers/2b/1"

hf_token = "YOUR_TOKEN"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint, token=hf_token)

# DEFINE USEFUL FUNCTIONS

## To chat, we'll need to create a vectorized database from our pdf and then build
## a retrieval Q&A chain

import time
from langchain_community.llms import HuggingFacePipeline
from langchain.storage import LocalFileStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import os

def create_a_persistent_db(pdfpath, dbpath, cachepath) -> None:
    """
    Creates a persistent database from a PDF file.

    Args:
        pdfpath (str): The path to the PDF file.
        dbpath (str): The path to the storage folder for the persistent LocalDB.
        cachepath (str): The path to the storage folder for the embeddings cache.
    """
    print("Started the operation...")
    a = time.time()
    loader = PyPDFLoader(pdfpath)
    documents = loader.load()

    ### Split the documents into smaller chunks for processing
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)

    ### Use HuggingFace embeddings for transforming text into numerical vectors
    ### This operation can take a while the first time but, once you created your local database with
    ### cached embeddings, it should be a matter of seconds to load them!
    embeddings = HuggingFaceEmbeddings()
    store = LocalFileStore(
        os.path.join(
            cachepath, os.path.basename(pdfpath).split(".")[0] + "_cache"
        )
    )
    cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
        underlying_embeddings=embeddings,
        document_embedding_cache=store,
        namespace=os.path.basename(pdfpath).split(".")[0],
    )

    b = time.time()
    print(
        f"Embeddings successfully created and stored at {os.path.join(cachepath, os.path.basename(pdfpath).split('.')[0]+'_cache')} under namespace: {os.path.basename(pdfpath).split('.')[0]}"
    )
    print(f"To load and embed, it took: {b - a}")

    persist_directory = os.path.join(
        dbpath, os.path.basename(pdfpath).split(".")[0] + "_localDB"
    )
    vectordb = Chroma.from_documents(
        documents=texts,
        embedding=cached_embeddings,
        persist_directory=persist_directory,
    )
    c = time.time()
    print(
        f"Persistent database successfully created and stored at {os.path.join(dbpath, os.path.basename(pdfpath).split('.')[0] + '_localDB')}"
    )
    print(f"To create a persistent database, it took: {c - b}")
    return vectordb

def just_chatting(
    model,
    tokenizer,
    query,
    vectordb,
    chat_history=[]
):
    """
    Implements a chat system using Hugging Face models and a persistent database.

    Args:
        model (AutoModelForCausalLM): Hugging Face model, already loaded and prepared.
        tokenizer (AutoTokenizer): Hugging Face tokenizer, already loaded and prepared.
        model_task (str): Task for the Hugging Face model.
        persistent_db_dir (str): Directory for the persistent database.
        embeddings_cache (str): Path to cache Hugging Face embeddings.
        pdfpath (str): Path to the PDF file.
        query (str): Question by the user
        vectordb (ChromaDB): vectorstorer variable for retrieval.
        chat_history (list): A list with previous questions and answers, serves as context; by default it is empty (it may make the model allucinate)
    """
    ### Create a text-generation pipeline and connect it to a ConversationalRetrievalChain
    pipe = pipeline("text-generation",
                    model=model,
                    tokenizer=tokenizer,
                    max_new_tokens = 2048,
                    repetition_penalty = float(10),
    )

    local_llm = HuggingFacePipeline(pipeline=pipe)
    llm_chain = ConversationalRetrievalChain.from_llm(
        llm=local_llm,
        chain_type="stuff",
        retriever=vectordb.as_retriever(search_kwargs={"k": 1}),
        return_source_documents=False,
    )
    rst = llm_chain({"question": query, "chat_history": chat_history})
    return rst

"""# Chat with the model

To chat with the model, we first have to build our local, persistent, database, and also compute embeddings: after that, we'll be able to chat with the model without problems!🚀
"""

# CREATE PERSISTENT DB

filepath = "/kaggle/input/what-is-datascience-docs/WhatisDataScienceFinalMay162018.pdf"
dbpath = "/kaggle/working/"
cachepath = "/kaggle/working/"
vectordb = create_a_persistent_db(filepath, dbpath, cachepath)

# CHAT WITH MODEL

chat_history = []
query = "Define datascience"
res = just_chatting(model, tokenizer, query, vectordb, chat_history=chat_history)
chat_history.append([query, res["answer"].replace("\n"," ")])

print(" ".join[res["answer"]])

"""# Implement a simple chat GUI (local only)

Want to interact more directly with your model, without going through that pythonic stuff? Let's implement a very simple and rudimental chat GUI, based on builtin package `tkinter`, to achieve this goal!🤯
"""

import tkinter as tk
from tkinter import scrolledtext

class ChatGUI:
    def __init__(self, master):
        self.master = master
        master.title("DataScienceAI")

        self.chat_history = scrolledtext.ScrolledText(master, wrap=tk.WORD, width=40, height=15)
        self.chat_history.pack(padx=10, pady=10)

        self.user_input = tk.Entry(master, width=40)
        self.user_input.pack(padx=10, pady=10)

        self.send_button = tk.Button(master, text="Send", command=self.send_message)
        self.send_button.pack(pady=10)

        # Set up initial conversation
        self.display_message("DataScienceAI: Hello! How can I help you today?")

    def send_message(self):
        user_message = self.user_input.get()
        self.display_message(f"You: {user_message}")
        # Replace the next line with your chatbot logic to get a response
        chatbot_response = f"DataScienceAI: {just_chatting(model, tokenizer, user_message, vectordb)["answer"].replace("\n"," ")}"
        self.display_message(chatbot_response)
        self.user_input.delete(0, tk.END)  # Clear the input field

    def display_message(self, message):
        self.chat_history.insert(tk.END, message + '\n')
        self.chat_history.see(tk.END)  # Scroll to the bottom

if __name__ == "__main__":
    root = tk.Tk()
    chat_gui = ChatGUI(root)
    root.mainloop()

"""# Conclusions

This is it!

We built a simple assistant, fully customizable in terms of both the LLM employed (you can switch to _gemma-7b_ or to your favorite LLM) and the data you can make it work with (in this case is data sciences, but you can make it work also on a pdf about pallas' cats, if you want!)🐈.

Another important thing to note is that all of this is completely local, there is no need for hosted APIs, pay-as-you-go services or other things like that: everything is free to use, on your Desktop!

There are two main disadvantages in this approach:

1. Performance-critical tasks, such as loading the model and making prediction, are heavily resource-dependent: to load big models (>1~2 GB) and to make them generate text, it is useful to have more than 16GB RAM and more than 4 CPU cores.
2. Small (and old) models, such as _openai-community/gpt2_, can easily allucinate while generating text. This is generally prompt-dependent (meaning that they tend to produce trashy results on certain prompts more frequently than on other ones) and the issue almost totally resolves when employing large LLMs (_gemma-7b_ or _llama-7b_ would not-so-easily allucinate, for instance).

### TLDR😵:

**Pros**:
- Simple and customizable
- Use virtually any LLM you want
- Use your own data
- 100% local, 100% free, no payments or APIs

**Cons**:
- Performance might be resource-dependent for large LLMs (if you have >16GB RAM and >4 cores it shouldn't be a great problem)
- Small LLMs can still allucinate

# References

- Paul Mooney, Ashley Chow. (2024). Google – AI Assistants for Data Tasks with Gemma. Kaggle. https://kaggle.com/competitions/data-assistants-with-gemma
- Brodie, Michael. (2019). What Is Data Science?. 10.1007/978-3-030-11821-1_8.
"""

================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: README.md
================================================
<h1 align="center">everything-ai</h1>
<h2 align="center">Your fully proficient, AI-powered and local chatbot assistant🤖</h2>


<div align="center">
    <img src="https://img.shields.io/github/languages/top/AstraBert/everything-ai" alt="GitHub top language">
   <img src="https://img.shields.io/github/commit-activity/t/AstraBert/everything-ai" alt="GitHub commit activity">
   <img src="https://img.shields.io/badge/everything_ai-stable-green" alt="Static Badge">
   <img src="https://img.shields.io/badge/Release-v4.2.0-purple" alt="Static Badge">
   <img src="https://img.shields.io/docker/image-size/astrabert/everything-ai
   " alt="Docker image size">
   <img src="https://img.shields.io/badge/Supported_platforms-Windows/macOS-brown" alt="Static Badge">
   <div>
        <a href="https://huggingface.co/spaces/as-cle-bert/everything-rag"><img src="./imgs/everything-ai.drawio.png" alt="Flowchart" align="center"></a>
        <p><i>Flowchart for everything-ai</i></p>
   </div>
</div>

## Quickstart
### 1. Clone this repository
```bash
git clone https://github.com/AstraBert/everything-ai.git
cd everything-ai
```
### 2. Set your `.env` file
Modify:
- `VOLUME` variable in the .env file so that you can mount your local file system into Docker container.
- `MODELS_PATH` variable in the .env file so that you can tell llama.cpp where you stored the GGUF models you downloaded.
- `MODEL` variable in the .env file so that you can tell llama.cpp what model to use (use the actual name of the gguf file, and do not forget the .gguf extension!)
- `MAX_TOKENS` variable in the .env file so that you can tell llama.cpp how many new tokens it can generate as output.

An example of a `.env` file could be:
```bash
VOLUME="c:/Users/User/:/User/"
MODELS_PATH="c:/Users/User/.cache/llama.cpp/"
MODEL="stories260K.gguf"
MAX_TOKENS="512"
```
This means that now everything that is under "c:/Users/User/" on your local machine is under "/User/" in your Docker container, that llama.cpp knows where to look for models and what model to look for, along with the maximum new tokens for its output.

### 3. Pull the necessary images
```bash
docker pull astrabert/everything-ai:latest
docker pull qdrant/qdrant:latest
docker pull ghcr.io/ggerganov/llama.cpp:server
```
### 4. Run the multi-container app
```bash
docker compose up
```
### 5. Go to `localhost:8670` and choose your assistant

You will see something like this:

<div align="center">
    <img src="./imgs/select_and_run.png" alt="Task choice interface">
</div>

Choose the task among:

- *retrieval-text-generation*: use `qdrant` backend to build a retrieval-friendly knowledge base, which you can query and tune the response of your model on. You have to pass either a pdf/a bunch of pdfs specified as comma-separated paths or a directory where all the pdfs of interest are stored (**DO NOT** provide both); you can also specify the language in which the PDF is written, using [ISO nomenclature](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) - **MULTILINGUAL**
- *agnostic-text-generation*: ChatGPT-like text generation (no retrieval architecture), but supports every text-generation model on HF Hub (as long as your hardware supports it!) - **MULTILINGUAL**
- *text-summarization*: summarize text and pdfs, supports every text-summarization model on HF Hub - **ENGLISH ONLY**
- *image-generation*: stable diffusion, supports every text-to-image model on HF Hub - **MULTILINGUAL**
- *image-generation-pollinations*: stable diffusion, use Pollinations AI API; if you choose 'image-generation-pollinations', you do not need to specify anything else apart from the task - **MULTILINGUAL**
- *image-classification*: classify an image, supports every image-classification model on HF Hub - **ENGLISH ONLY**
- *image-to-text*:  describe an image, supports every image-to-text model on HF Hub - **ENGLISH ONLY**
- *audio-classification*: classify audio files or microphone recordings, supports audio-classification models on HF hub
- *speech-recognition*: transcribe audio files or microphone recordings, supports automatic-speech-recognition models on HF hub.
- *video-generation*: generate video upon text prompt, supports text-to-video models on HF hub - **ENGLISH ONLY**
- *protein-folding*: get the 3D structure of a protein from its amino-acid sequence, using ESM-2 backbone model - **GPU ONLY**
- *autotrain*: fine-tune a model on a specific downstream task with autotrain-advanced, just by specifying you HF username, HF writing token and the path to a yaml config file for the training
- *spaces-api-supabase*: use HF Spaces API in combination with Supabase PostgreSQL databases in order to unleash more powerful LLMs and larger RAG-oriented vector databases - **MULTILINGUAL**
- *llama.cpp-and-qdrant*: same as *retrieval-text-generation*, but uses **llama.cpp** as inference engine, so you MUST NOT specify a model - **MULTILINGUAL**
- *build-your-llm*: Build a customizable chat LLM combining a Qdrant database with your PDFs and the power of Anthropic, OpenAI, Cohere or Groq models: you just need an API key! To build the Qdrant database, have to pass either a pdf/a bunch of pdfs specified as comma-separated paths or a directory where all the pdfs of interest are stored (**DO NOT** provide both); you can also specify the language in which the PDF is written, using [ISO nomenclature](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) - **MULTILINGUAL**, **LANGFUSE INTEGRATION**
- *simply-chatting*: Build a customizable chat LLM with the power of Anthropic, OpenAI, Cohere or Groq models (no RAG pipeline): you just need an API key! - **MULTILINGUAL**, **LANGFUSE INTEGRATION**
- *fal-img2img*: Use [fal.ai](https://fal.ai) ComfyUI API to generate images starting from yur PNG and JPEG images: you just need an API key! You can aklso customize the generation working with prompts and seeds - **ENGLISH ONLY**
- *image-retrieval-search*: search an image database uploading a folder as database input. The folder should have the following structure:

```
./
├── test/
|   ├── label1/
|   └── label2/
└── train/
    ├── label1/
    └── label2/
```


You can query the database starting from your own pictures.

### 6. Go to `localhost:7860` and start using your assistant

Once everything is ready, you can head over to `localhost:7860` and start using your assistant:

<div align="center">
    <img src="./imgs/chatbot.png" alt="Chat interface">
</div>




================================================
FILE: _config.yml
================================================
theme: jekyll-theme-minimal

================================================
FILE: compose.yaml
================================================
networks:
  mynet:
    driver: bridge

services:
  everything-ai:
    image: astrabert/everything-ai
    volumes:
      - $VOLUME
    networks:
      - mynet
    ports:
      - "7860:7860"
      - "8760:8760"
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - "./qdrant_storage:/qdrant/storage"
    networks:
      - mynet
  llama_server:
    image: ghcr.io/ggerganov/llama.cpp:server
    ports:
      - "8000:8000"
    volumes:
      - "$MODELS_PATH:/models"
    networks:
      - mynet
    command: "-m /models/$MODEL --port 8000 --host 0.0.0.0 -n $MAX_TOKENS"
  


================================================
FILE: docker/Dockerfile
================================================
# Use an official Python runtime as a parent image
FROM astrabert/everything-ai

# Set the working directory in the container to /app
WORKDIR /app

# Add the current directory contents into the container at /app
ADD . /app

RUN pip install fal_client

# Expose the port that the application will run on
EXPOSE 8760

ENTRYPOINT [ "python3", "select_and_run.py" ]


================================================
FILE: docker/agnostic_text_generation.py
================================================
import gradio as gr
from utils import Translation
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from argparse import ArgumentParser

argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

args = argparse.parse_args()


mod = args.model
mod = mod.replace("\"", "").replace("'", "")

model_checkpoint = mod

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=2048, repetition_penalty=1.2, temperature=0.4)


def reply(message, history):
    txt = Translation(message, "en")
    if txt.original == "en":
        response = pipe(message)
        return response[0]["generated_text"]
    else:
        translation = txt.translatef()
        response = pipe(translation)
        t = Translation(response[0]["generated_text"], txt.original)
        res = t.translatef()
        return res


demo = gr.ChatInterface(fn=reply, title="Multilingual-Bloom Bot")
demo.launch(server_name="0.0.0.0", share=False)

================================================
FILE: docker/audio_classification.py
================================================
from transformers import pipeline
from argparse import ArgumentParser
import torch
import gradio as gr
import numpy as np

argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

args = argparse.parse_args()


mod = args.model
mod = mod.replace("\"", "").replace("'", "")

model_checkpoint = mod

# Audio class
classifier = pipeline(task="audio-classification", model=mod)

def classify_text(audio):
    global classifier
    sr, data = audio
    short_tensor = data.astype(np.float32)
    res = classifier(short_tensor)
    return res[0]["label"]

input_audio = gr.Audio(
    sources=["upload","microphone"],
    waveform_options=gr.WaveformOptions(
        waveform_color="#01C6FF",
        waveform_progress_color="#0066B4",
        skip_length=2,
        show_controls=False,
    ),
)
demo = gr.Interface(
    title="everything-ai-audioclass",
    fn=classify_text,
    inputs=input_audio,
    outputs="text"
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", share=False)


================================================
FILE: docker/autotrain_interface.py
================================================
import subprocess as sp
import gradio as gr
import subprocess as sp


def build_command(hf_usr, hf_token, configpath):
    sp.run(f"export HF_USERNAME=\"{hf_usr}\"", shell=True)
    sp.run(f"export HF_TOKEN=\"{hf_token}\"", shell=True)
    sp.run(f"autotrain --config {configpath}", shell=True)
    return f"export HF_USERNAME={hf_usr}\nexport HF_TOKEN={hf_token}\nautotrain --config {configpath}"
    

demo = gr.Interface(
    build_command,
    [
        gr.Textbox(
            label="HF username",
            info="Your HF username",
            lines=3,
            value=f"your-cute-name",
        ),
        gr.Textbox(
            label="HF write token",
            info="An HF token that has write permissions on your repository",
            lines=3,
            value=f"your-powerful-token",
        ),
        gr.File(label="Yaml configuration file path"
        )
    ],
    title="everything-ai-autotrain",
    outputs="textbox",
    theme=gr.themes.Base()
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860, share=False)

	

================================================
FILE: docker/build_your_llm.py
================================================
from langchain_anthropic import ChatAnthropic
from langchain_cohere import ChatCohere
from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import SQLChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
import gradio as gr
from argparse import ArgumentParser
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
from utils import *
import os
import subprocess as sp
import time
from langfuse.callback import CallbackHandler

argparse = ArgumentParser()

argparse.add_argument(
    "-pf",
    "--pdf_file",
    help="Single pdf file or N pdfs reported like this: /path/to/file1.pdf,/path/to/file2.pdf,...,/path/to/fileN.pdf (there is no strict naming, you just need to provide them comma-separated)",
    required=False,
    default="No file"
)

argparse.add_argument(
    "-d",
    "--directory",
    help="Directory where all your pdfs of interest are stored",
    required=False,
    default="No directory"
)

argparse.add_argument(
    "-l",
    "--language",
    help="Language of the written content contained in the pdfs",
    required=False,
    default="Same as query"
)

args = argparse.parse_args()


pdff = args.pdf_file
dirs = args.directory
lan = args.language


if pdff.replace("\\","").replace("'","") != "None" and dirs.replace("\\","").replace("'","") == "No directory":
    pdfs = pdff.replace("\\","/").replace("'","").split(",")
else:
    pdfs = [os.path.join(dirs.replace("\\","/").replace("'",""), f) for f in os.listdir(dirs.replace("\\","/").replace("'","")) if f.endswith(".pdf")]

client = QdrantClient(host="host.docker.internal", port="6333")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

pdfdb = PDFdatabase(pdfs, encoder, client)
pdfdb.preprocess()
pdfdb.collect_data()
pdfdb.qdrant_collection_and_upload()

sp.run("rm -rf memory.db", shell=True)

def get_session_history(session_id):
    return SQLChatMessageHistory(session_id, "sqlite:///memory.db")

NAME2CHAT = {"Cohere": ChatCohere, "claude-3-opus-20240229": ChatAnthropic, "claude-3-sonnet-20240229": ChatAnthropic, "claude-3-haiku-20240307": ChatAnthropic, "llama3-8b-8192": ChatGroq, "llama3-70b-8192": ChatGroq, "mixtral-8x7b-32768": ChatGroq, "gemma-7b-it": ChatGroq, "gpt-4o": ChatOpenAI, "gpt-3.5-turbo-0125": ChatOpenAI}
NAME2APIKEY = {"Cohere": "COHERE_API_KEY", "claude-3-opus-20240229": "ANTHROPIC_API_KEY", "claude-3-sonnet-20240229": "ANTHROPIC_API_KEY", "claude-3-haiku-20240307": "ANTHROPIC_API_KEY", "llama3-8b-8192": "GROQ_API_KEY", "llama3-70b-8192": "GROQ_API_KEY", "mixtral-8x7b-32768": "GROQ_API_KEY", "gemma-7b-it": "GROQ_API_KEY", "gpt-4o": "OPENAI_API_KEY", "gpt-3.5-turbo-0125": "OPENAI_API_KEY"}



system_template = "You are an helpful assistant that can rely on this: {context} and on the previous message history as context, and from that you build a context and history-aware reply to this user input:"

def build_langfuse_handler(langfuse_host, langfuse_pkey, langfuse_skey):
    if langfuse_host!="None" and langfuse_pkey!="None" and langfuse_skey!="None":
        langfuse_handler = CallbackHandler(
            public_key=langfuse_pkey,
            secret_key=langfuse_skey,
            host=langfuse_host
        )
        return langfuse_handler, True
    else:
        return "No langfuse", False

def reply(message, history, name, api_key, temperature, max_new_tokens,langfuse_host, langfuse_pkey, langfuse_skey, sessionid):
    global pdfdb
    os.environ[NAME2APIKEY[name]]  = api_key
    if name == "Cohere":
        model = NAME2CHAT[name](temperature=temperature, max_tokens=max_new_tokens)
    else:
        model = NAME2CHAT[name](model=name,temperature=temperature, max_tokens=max_new_tokens)
    prompt_template = ChatPromptTemplate.from_messages(
    [("system", system_template),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")]
    )
    lf_handler, truth = build_langfuse_handler(langfuse_host, langfuse_pkey, langfuse_skey)
    chain = prompt_template | model
    runnable_with_history = RunnableWithMessageHistory(
        chain,
        get_session_history,
        input_messages_key="input",
        history_messages_key="history",
    )
    txt = Translation(message, "en")
    if txt.original == "en" and lan.replace("\\","").replace("'","") == "None":
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        results = txt2txt.search(message)
        if not truth:
            response = runnable_with_history.invoke({"context": results[0]["text"], "input": message}, config={"configurable": {"session_id": sessionid}})##CONFIGURE!
        else:
            response = runnable_with_history.invoke({"context": results[0]["text"], "input": message}, config={"configurable": {"session_id": sessionid}, "callbacks": [lf_handler]})##CONFIGURE!
        llm=''
        for char in response.content:
            llm+=char
            time.sleep(0.001)
            yield llm 
    elif txt.original == "en" and lan.replace("\\","").replace("'","") != "None":
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        transl = Translation(message, lan.replace("\\","").replace("'",""))
        message = transl.translatef()
        results = txt2txt.search(message)
        t = Translation(results[0]["text"], txt.original)
        res = t.translatef()
        if not truth:
            response = runnable_with_history.invoke({"context": res, "input": message}, config={"configurable": {"session_id": sessionid}})##CONFIGURE!
        else:
            response = runnable_with_history.invoke({"context": res, "input": message}, config={"configurable": {"session_id": sessionid}, "callbacks": [lf_handler]})##CONFIGURE!
        llm = ''
        for char in response.content:
            llm+=char
            time.sleep(0.001)
            yield llm 
    elif txt.original != "en" and lan.replace("\\","").replace("'","") == "None":
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        results = txt2txt.search(message)
        transl = Translation(results[0]["text"], "en")
        translation = transl.translatef()
        if not truth:
            response = runnable_with_history.invoke({"context": translation, "input": message}, config={"configurable": {"session_id": sessionid}})##CONFIGURE!
        else:
            response = runnable_with_history.invoke({"context": translation, "input": message}, config={"configurable": {"session_id": sessionid}, "callbacks": [lf_handler]})##CONFIGURE!
        t = Translation(response.content, txt.original)
        res = t.translatef()
        llm = ''
        for char in res:
            llm+=char
            time.sleep(0.001)
            yield llm 
    else:
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        transl = Translation(message, lan.replace("\\","").replace("'",""))
        message = transl.translatef()
        results = txt2txt.search(message)
        t = Translation(results[0]["text"], txt.original)
        res = t.translatef()
        if not truth:
            response = runnable_with_history.invoke({"context": res, "input": message}, config={"configurable": {"session_id": sessionid}})##CONFIGURE!
        else:
            response = runnable_with_history.invoke({"context": res, "input": message}, config={"configurable": {"session_id": sessionid}, "callbacks": [lf_handler]})##CONFIGURE!
        tr = Translation(response.content, txt.original)
        ress = tr.translatef()
        llm = ''
        for char in ress:
            llm+=char
            time.sleep(0.001)
            yield llm 
    
chat_model = gr.Dropdown(
    [m for m in list(NAME2APIKEY)], label="Chat Model", info="Choose one of the available chat models"
    )

user_api_key = gr.Textbox(
    label="API key",
    info="Paste your API key here",
    lines=1,
    type="password",
)

user_temperature = gr.Slider(0, 1, value=0.5, label="Temperature", info="Select model temperature")

user_max_new_tokens = gr.Slider(0, 8192, value=1024, label="Max new tokens", info="Select max output tokens (higher number of tokens will result in a longer latency)")

user_lf_host = gr.Textbox(label="LangFuse Host",info="Provide LangFuse host URL, or type 'None' if you do not wish to use LangFuse",value="https://cloud.langfuse.com")

user_lf_pkey = gr.Textbox(label="LangFuse Public Key",info="Provide LangFuse Public key, or type 'None' if you do not wish to use LangFuse",value="pk-*************************", type="password")

user_lf_skey = gr.Textbox(label="LangFuse Secret Key",info="Provide LangFuse Secret key, or type 'None' if you do not wish to use LangFuse",value="sk-*************************", type="password")

user_session_id = gr.Textbox(label="Session ID",info="This alphanumeric code will link model reply to a specific message history of which the models will be aware when replying. Changing it will result in the loss of memory for your model",value="1")

additional_accordion = gr.Accordion(label="Parameters to be set before you start chatting", open=True)

demo = gr.ChatInterface(fn=reply, additional_inputs=[chat_model, user_api_key, user_temperature, user_max_new_tokens, user_lf_host, user_lf_pkey, user_lf_skey, user_session_id], additional_inputs_accordion=additional_accordion, title="everything-ai-buildyourllm")


if __name__=="__main__":
    demo.launch(server_name="0.0.0.0", share=False)


================================================
FILE: docker/chat_your_llm.py
================================================
from langchain_anthropic import ChatAnthropic
from langchain_cohere import ChatCohere
from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import SQLChatMessageHistory
from utils import Translation
import time
import os
from langfuse.callback import CallbackHandler
import gradio as gr
import subprocess as sp


NAME2CHAT = {"Cohere": ChatCohere, "claude-3-opus-20240229": ChatAnthropic, "claude-3-sonnet-20240229": ChatAnthropic, "claude-3-haiku-20240307": ChatAnthropic, "llama3-8b-8192": ChatGroq, "llama3-70b-8192": ChatGroq, "mixtral-8x7b-32768": ChatGroq, "gemma-7b-it": ChatGroq, "gpt-4o": ChatOpenAI, "gpt-3.5-turbo-0125": ChatOpenAI}
NAME2APIKEY = {"Cohere": "COHERE_API_KEY", "claude-3-opus-20240229": "ANTHROPIC_API_KEY", "claude-3-sonnet-20240229": "ANTHROPIC_API_KEY", "claude-3-haiku-20240307": "ANTHROPIC_API_KEY", "llama3-8b-8192": "GROQ_API_KEY", "llama3-70b-8192": "GROQ_API_KEY", "mixtral-8x7b-32768": "GROQ_API_KEY", "gemma-7b-it": "GROQ_API_KEY", "gpt-4o": "OPENAI_API_KEY", "gpt-3.5-turbo-0125": "OPENAI_API_KEY"}

sp.run("rm -rf memory.db", shell=True)

def build_langfuse_handler(langfuse_host, langfuse_pkey, langfuse_skey):
    if langfuse_host!="None" and langfuse_pkey!="None" and langfuse_skey!="None":
        langfuse_handler = CallbackHandler(
            public_key=langfuse_pkey,
            secret_key=langfuse_skey,
            host=langfuse_host
        )
        return langfuse_handler, True
    else:
        return "No langfuse", False

def get_session_history(session_id):
    return SQLChatMessageHistory(session_id, "sqlite:///chatmemory.db")

def reply(message, history, name, api_key, temperature, max_new_tokens,langfuse_host, langfuse_pkey, langfuse_skey, system_template, sessionid):
    os.environ[NAME2APIKEY[name]]  = api_key
    if name == "Cohere":
        model = NAME2CHAT[name](temperature=temperature, max_tokens=max_new_tokens)
    else:
        model = NAME2CHAT[name](model=name,temperature=temperature, max_tokens=max_new_tokens)
    prompt_template = ChatPromptTemplate.from_messages(
    [("system", system_template),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}")]
    )
    lf_handler, truth = build_langfuse_handler(langfuse_host, langfuse_pkey, langfuse_skey)
    chain = prompt_template | model
    runnable_with_history = RunnableWithMessageHistory(
        chain,
        get_session_history,
        input_messages_key="input",
        history_messages_key="history",
    )
    txt = Translation(message, "en")
    if txt.original == "en":
        if not truth:
            response = runnable_with_history.invoke({"input": message}, config={"configurable": {"session_id": sessionid}})##CONFIGURE!
        else:
            response = runnable_with_history.invoke({"input": message}, config={"configurable": {"session_id": sessionid}, "callbacks": [lf_handler]})
        r = ''
        for c in response.content:
            r+=c
            time.sleep(0.001)
            yield r   
    else:
        translation = txt.translatef()
        if not truth:
            response = runnable_with_history.invoke({"input": translation}, config={"configurable": {"session_id": sessionid}})##CONFIGURE!
        else:
            response = runnable_with_history.invoke({"input": translation}, config={"configurable": {"session_id": sessionid}, "callbacks": [lf_handler]})
        t = Translation(response.content, txt.original)
        res = t.translatef()
        r = ''
        for c in res:
            r+=c
            time.sleep(0.001)
            yield r
    
    
chat_model = gr.Dropdown(
    [m for m in list(NAME2APIKEY)], label="Chat Model", info="Choose one of the available chat models"
    )

user_api_key = gr.Textbox(
    label="API key",
    info="Paste your API key here",
    lines=1,
    type="password",
)

user_temperature = gr.Slider(0, 1, value=0.5, label="Temperature", info="Select model temperature")

user_max_new_tokens = gr.Slider(0, 8192, value=1024, label="Max new tokens", info="Select max output tokens (higher number of tokens will result in a longer latency)")

user_lf_host = gr.Textbox(label="LangFuse Host",info="Provide LangFuse host URL, or type 'None' if you do not wish to use LangFuse",value="https://cloud.langfuse.com")

user_lf_pkey = gr.Textbox(label="LangFuse Public Key",info="Provide LangFuse Public key, or type 'None' if you do not wish to use LangFuse",value="pk-*************************", type="password")

user_lf_skey = gr.Textbox(label="LangFuse Secret Key",info="Provide LangFuse Secret key, or type 'None' if you do not wish to use LangFuse",value="sk-*************************", type="password")

user_template = gr.Textbox(label="System Template",info="Customize your assistant with your instructions",value="You are an helpful assistant")

user_session_id = gr.Textbox(label="Session ID",info="This alphanumeric code will link model reply to a specific message history of which the models will be aware when replying. Changing it will result in the loss of memory for your model",value="1")

additional_accordion = gr.Accordion(label="Parameters to be set before you start chatting", open=True)

demo = gr.ChatInterface(fn=reply, additional_inputs=[chat_model, user_api_key, user_temperature, user_max_new_tokens, user_lf_host, user_lf_pkey, user_lf_skey, user_template, user_session_id], additional_inputs_accordion=additional_accordion, title="everything-ai-simplychatting")


if __name__=="__main__":
    demo.launch(server_name="0.0.0.0", share=False)



================================================
FILE: docker/fal_img2img.py
================================================
import asyncio
import fal_client
import os
import gradio as gr
from PIL import Image

MAP_EXTS = {"jpg": "jpeg", "jpeg": "jpeg", "png": "png"}

async def submit(image_path, prompt, seed):
    ext = image_path.split(".")[1]
    handler = await fal_client.submit_async(
        "comfy/astrabert/image2image",
        arguments={
            "ksampler_seed": seed,
            "cliptextencode_text": prompt,
            "image_load_image_path": f"data:image/{MAP_EXTS[ext]};base64,{image_path}"
        },
    )
    result = await handler.get()
    return result

def get_url(results):
    url = results['outputs'][list(results['outputs'].keys())[0]]['images'][0]['url']
    nm = results['outputs'][list(results['outputs'].keys())[0]]['images'][0]['filename']
    return f"![{nm}]({url})"


def render_image(api_key, image_path, prompt, seed):
    os.environ["FAL_KEY"] = api_key
    results = asyncio.run(submit(image_path, prompt, int(seed)))
    url = get_url(results)
    img = Image.open(image_path)
    return img, url


demo = gr.Interface(render_image, inputs=[gr.Textbox(label="API key", type="password", value="fal-******************"), gr.File(label="PNG/JPEG Image"), gr.Textbox(label="Prompt", info="Specify how you would like the image generation to be"),  gr.Textbox(label="Seed", info="Pass your seed here (if not interested, leave it as it is)", value="123498235498246")], outputs=[gr.Image(label="Your Base Image"), gr.Markdown(label="Generated Image")], title="everything-ai-img2img")

if __name__=="__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)

================================================
FILE: docker/image_classification.py
================================================
from transformers import AutoModelForImageClassification, AutoImageProcessor, pipeline
from PIL import Image
from argparse import ArgumentParser
import torch

argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

args = argparse.parse_args()


mod = args.model
mod = mod.replace("\"", "").replace("'", "")

model_checkpoint = mod

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForImageClassification.from_pretrained(model_checkpoint).to(device)
processor = AutoImageProcessor.from_pretrained(model_checkpoint)


pipe = pipeline("image-classification", model=model, image_processor=processor)

def get_results(image, ppln=pipe):
    img = Image.fromarray(image)
    result = ppln(img)
    scores = []
    labels = []
    for el in result:
        scores.append(el["score"])
        labels.append(el["label"])
    return labels[scores.index(max(scores))]

import gradio as gr
## Build interface with loaded image + ouput from the model
demo = gr.Interface(get_results, gr.Image(), "text")

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", share=False)

================================================
FILE: docker/image_generation.py
================================================
from diffusers import DiffusionPipeline
import torch
from argparse import ArgumentParser

argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

args = argparse.parse_args()


mod = args.model
mod = mod.replace("\"", "").replace("'", "")

model_checkpoint = mod

pipe = DiffusionPipeline.from_pretrained(model_checkpoint, torch_dtype=torch.float32)

import gradio as gr
from utils import Translation



def reply(message, history):
    txt = Translation(message, "en")
    if txt.original == "en":
        image = pipe(message).images[0]
        image.save("generated_image.png")
        return "Here's your image:\n![generated_image](generated_image.png)"
    else:
        translation = txt.translatef()
        image = pipe(translation).images[0]
        image.save("generated_image.png")
        t = Translation("Here's your image:", txt.original)
        res = t.translatef()
        return f"{res}:\n![generated_image](generated_image.png)"


demo = gr.ChatInterface(fn=reply, title="everything-ai-sd-imgs")
demo.launch(server_name="0.0.0.0", share=False)

================================================
FILE: docker/image_generation_pollinations.py
================================================
import gradio as gr
from utils import Translation



def reply(message, history):
    txt = Translation(message, "en")
    if txt.original == "en":
        image = f"https://pollinations.ai/p/{message.replace(' ', '_')}"
        return f"Here's your image:\n![generated_image]({image})"
    else:
        translation = txt.translatef()
        image = f"https://pollinations.ai/p/{translation.replace(' ', '_')}"
        t = Translation("Here's your image:", txt.original)
        res = t.translatef()
        return f"{res}:\n![generated_image]({image})"


demo = gr.ChatInterface(fn=reply, title="everything-ai-pollinations-imgs")
demo.launch(server_name="0.0.0.0", share=False)

================================================
FILE: docker/image_to_text.py
================================================
import torch
from transformers import pipeline
from PIL import Image
from argparse import ArgumentParser

argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

args = argparse.parse_args()


mod = args.model
mod = mod.replace("\"", "").replace("'", "")

model_checkpoint = mod

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = pipeline("image-to-text", model=model_checkpoint, device=device)

def get_results(image, ppln=pipe):
    img = Image.fromarray(image)
    result = ppln(img, prompt="", generate_kwargs={"max_new_tokens": 1024})
    return result[0]["generated_text"].capitalize()

import gradio as gr
## Build interface with loaded image + ouput from the model
demo = gr.Interface(get_results, gr.Image(), "text")

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", share=False)



================================================
FILE: docker/llama_cpp_int.py
================================================
from utils import Translation, PDFdatabase, NeuralSearcher
import gradio as gr
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
from argparse import ArgumentParser
import os

argparse = ArgumentParser()

argparse.add_argument(
    "-pf",
    "--pdf_file",
    help="Single pdf file or N pdfs reported like this: /path/to/file1.pdf,/path/to/file2.pdf,...,/path/to/fileN.pdf (there is no strict naming, you just need to provide them comma-separated)",
    required=False,
    default="No file"
)

argparse.add_argument(
    "-d",
    "--directory",
    help="Directory where all your pdfs of interest are stored",
    required=False,
    default="No directory"
)

argparse.add_argument(
    "-l",
    "--language",
    help="Language of the written content contained in the pdfs",
    required=False,
    default="Same as query"
)

args = argparse.parse_args()


pdff = args.pdf_file
dirs = args.directory
lan = args.language


if pdff.replace("\\","").replace("'","") != "None" and dirs.replace("\\","").replace("'","") == "No directory":
    pdfs = pdff.replace("\\","/").replace("'","").split(",")
else:
    pdfs = [os.path.join(dirs.replace("\\","/").replace("'",""), f) for f in os.listdir(dirs.replace("\\","/").replace("'","")) if f.endswith(".pdf")]

client = QdrantClient(host="host.docker.internal", port="6333")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

pdfdb = PDFdatabase(pdfs, encoder, client)
pdfdb.preprocess()
pdfdb.collect_data()
pdfdb.qdrant_collection_and_upload()


import requests

def llama_cpp_respond(query, max_new_tokens):
    url = "http://localhost:8000/completion"
    headers = {
        "Content-Type": "application/json"
    }
    data = {
        "prompt": query,
        "n_predict": int(max_new_tokens)
    }

    response = requests.post(url, headers=headers, json=data)

    a = response.json()
    return a["content"]


def reply(max_new_tokens, message):
    global pdfdb
    txt = Translation(message, "en")
    if txt.original == "en" and lan.replace("\\","").replace("'","") == "None":
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        results = txt2txt.search(message)
        response = llama_cpp_respond(f"Context: {results[0]["text"]}, prompt: {message}", max_new_tokens)
        return response
    elif txt.original == "en" and lan.replace("\\","").replace("'","") != "None":
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        transl = Translation(message, lan.replace("\\","").replace("'",""))
        message = transl.translatef()
        results = txt2txt.search(message)
        t = Translation(results[0]["text"], txt.original)
        res = t.translatef()
        response = llama_cpp_respond(f"Context: {res}, prompt: {message}", max_new_tokens)
        return response
    elif txt.original != "en" and lan.replace("\\","").replace("'","") == "None":
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        results = txt2txt.search(message)
        transl = Translation(results[0]["text"], "en")
        translation = transl.translatef()
        response = llama_cpp_respond(f"Context: {translation}, prompt: {message}", max_new_tokens)
        t = Translation(response, txt.original)
        res = t.translatef()
        return res
    else:
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        transl = Translation(message, lan.replace("\\","").replace("'",""))
        message = transl.translatef()
        results = txt2txt.search(message)
        t = Translation(results[0]["text"], txt.original)
        res = t.translatef()
        response = llama_cpp_respond(f"Context: {res}, prompt: {message}", max_new_tokens)
        tr = Translation(response, txt.original)
        ress = tr.translatef()
        return ress 
    
demo = gr.Interface(
    reply,
    [
        gr.Textbox(
            label="Max new tokens",
            info="The number reported should not be higher than the one specified within the .env file",
            lines=3,
            value=f"512",
        ),
        gr.Textbox(
            label="Input query",
            info="Write your input query here",
            lines=3,
            value=f"What are penguins?",
        )
    ],
    title="everything-ai-llamacpp",
    outputs="textbox"
)
demo.launch(server_name="0.0.0.0", share=False)

================================================
FILE: docker/protein_folding_with_esm.py
================================================
from transformers import AutoTokenizer, EsmForProteinFolding
from transformers.models.esm.openfold_utils.protein import to_pdb, Protein as OFProtein
from transformers.models.esm.openfold_utils.feats import atom14_to_atom37
import gradio as gr
from gradio_molecule3d import Molecule3D

reps =    [
    {
      "model": 0,
      "chain": "",
      "resname": "",
      "style": "stick",
      "color": "whiteCarbon",
      "residue_range": "",
      "around": 0,
      "byres": False,
      "visible": False
    }
]

def read_mol(molpath):
    with open(molpath, "r") as fp:
        lines = fp.readlines()
    mol = ""
    for l in lines:
        mol += l
    return mol


def molecule(input_pdb):

    mol = read_mol(input_pdb)

    x = (
        """<!DOCTYPE html>
        <html>
        <head>    
    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
    <style>
    body{
        font-family:sans-serif
    }
    .mol-container {
    width: 100%;
    height: 600px;
    position: relative;
    }
    .mol-container select{
        background-image:None;
    }
    </style>
     <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.3/jquery.min.js" integrity="sha512-STof4xm1wgkfm7heWqFJVn58Hm3EtS31XFaagaa8VMReCXAkQnJZ+jEy8PCC/iT18dFy95WcExNHFTqLyp72eQ==" crossorigin="anonymous" referrerpolicy="no-referrer"></script>
    <script src="https://3Dmol.csb.pitt.edu/build/3Dmol-min.js"></script>
    </head>
    <body>  
    <div id="container" class="mol-container"></div>
  
            <script>
               let pdb = `"""
        + mol
        + """`  
      
             $(document).ready(function () {
                let element = $("#container");
                let config = { backgroundColor: "white" };
                let viewer = $3Dmol.createViewer(element, config);
                viewer.addModel(pdb, "pdb");
                viewer.getModel(0).setStyle({}, { cartoon: { colorscheme:"whiteCarbon" } });
                viewer.zoomTo();
                viewer.render();
                viewer.zoom(0.8, 2000);
              })
        </script>
        </body></html>"""
    )

    return f"""<iframe style="width: 100%; height: 600px" name="result" allow="midi; geolocation; microphone; camera; 
    display-capture; encrypted-media;" sandbox="allow-modals allow-forms 
    allow-scripts allow-same-origin allow-popups 
    allow-top-navigation-by-user-activation allow-downloads" allowfullscreen="" 
    allowpaymentrequest="" frameborder="0" srcdoc='{x}'></iframe>"""


def convert_outputs_to_pdb(outputs):
    final_atom_positions = atom14_to_atom37(outputs["positions"][-1], outputs)
    outputs = {k: v.to("cpu").numpy() for k, v in outputs.items()}
    final_atom_positions = final_atom_positions.cpu().numpy()
    final_atom_mask = outputs["atom37_atom_exists"]
    pdbs = []
    for i in range(outputs["aatype"].shape[0]):
        aa = outputs["aatype"][i]
        pred_pos = final_atom_positions[i]
        mask = final_atom_mask[i]
        resid = outputs["residue_index"][i] + 1
        pred = OFProtein(
            aatype=aa,
            atom_positions=pred_pos,
            atom_mask=mask,
            residue_index=resid,
            b_factors=outputs["plddt"][i],
            chain_index=outputs["chain_index"][i] if "chain_index" in outputs else None,
        )
        pdbs.append(to_pdb(pred))
    return pdbs

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1", low_cpu_mem_usage=True)

model = model.cuda()

model.esm = model.esm.half()

import torch

torch.backends.cuda.matmul.allow_tf32 = True

model.trunk.set_chunk_size(64)

def fold_protein(test_protein):
    tokenized_input = tokenizer([test_protein], return_tensors="pt", add_special_tokens=False)['input_ids']
    tokenized_input = tokenized_input.cuda()
    with torch.no_grad():
        output = model(tokenized_input)
    pdb = convert_outputs_to_pdb(output)
    with open("output_structure.pdb", "w") as f:
        f.write("".join(pdb))
    html = molecule("output_structure.pdb")
    return html, "output_structure.pdb"

iface = gr.Interface(
    title="everything-ai-proteinfold",
    fn=fold_protein,
    inputs=gr.Textbox(
            label="Protein Sequence",
            info="Find sequences examples below, and complete examples with images at: https://github.com/AstraBert/proteinviz/tree/main/examples.md; if you input a sequence, you're gonna get the static image and the 3D model to explore and play with",
            lines=5,
            value=f"Paste or write amino-acidic sequence here",
        ),
    outputs=[gr.HTML(label="Protein 3D model"), Molecule3D(label="Molecular 3D model", reps=reps)], 
)

iface.launch(server_name="0.0.0.0", share=False)

================================================
FILE: docker/requirements.txt
================================================
langchain-community==0.0.13 
langchain==0.1.1 
pypdf==3.17.4
sentence_transformers==2.2.2
transformers==4.39.3
langdetect==1.0.9
deep-translator==1.11.4
torch==2.1.2
gradio==4.36.0
diffusers==0.27.2
pydantic==2.6.4
qdrant_client==1.9.0
pillow==10.2.0
datasets==2.15.0
accelerate


================================================
FILE: docker/retrieval_image_search.py
================================================
from transformers import AutoImageProcessor, AutoModel
from utils import ImageDB
from PIL import Image
from qdrant_client import QdrantClient
import gradio as gr
from argparse import ArgumentParser
import torch

argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

argparse.add_argument(
    "-id",
    "--image_dimension",
    help="Dimension of the image (e.g. 512, 758, 384...)",
    required=False,
    default=512,
    type=int
)

argparse.add_argument(
    "-d",
    "--directory",
    help="Directory where all your pdfs of interest are stored",
    required=False,
    default="No directory"
)


args = argparse.parse_args()


mod = args.model
dirs = args.directory
imd = args.image_dimension

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoImageProcessor.from_pretrained(mod)
model = AutoModel.from_pretrained(mod).to(device)

client = QdrantClient(host="host.docker.internal", port=6333)
imdb = ImageDB(dirs, processor, model, client, imd)
print(imdb.collection_name)
imdb.create_dataset()
imdb.to_collection()


def see_images(dataset, results):
    images = []
    for i in range(len(results)):
        img = dataset[results[0].id]['image']
        images.append(img)
    return images

def process_img(image):
    global imdb
    results = imdb.searchDB(Image.fromarray(image))
    images = see_images(imdb.dataset, results)
    return images


iface = gr.Interface(
    title="everything-ai-retrievalimg",
    fn=process_img,
    inputs=gr.Image(label="Input Image"),
    outputs=gr.Gallery(label="Matching Images"),  
)

iface.launch(server_name="0.0.0.0", share=False)

================================================
FILE: docker/retrieval_text_generation.py
================================================
from utils import Translation, PDFdatabase, NeuralSearcher
import gradio as gr
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer
from argparse import ArgumentParser
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
import os

argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

argparse.add_argument(
    "-pf",
    "--pdf_file",
    help="Single pdf file or N pdfs reported like this: /path/to/file1.pdf,/path/to/file2.pdf,...,/path/to/fileN.pdf (there is no strict naming, you just need to provide them comma-separated)",
    required=False,
    default="No file"
)

argparse.add_argument(
    "-d",
    "--directory",
    help="Directory where all your pdfs of interest are stored",
    required=False,
    default="No directory"
)

argparse.add_argument(
    "-l",
    "--language",
    help="Language of the written content contained in the pdfs",
    required=False,
    default="Same as query"
)

args = argparse.parse_args()


mod = args.model
pdff = args.pdf_file
dirs = args.directory
lan = args.language


if pdff.replace("\\","").replace("'","") != "None" and dirs.replace("\\","").replace("'","") == "No directory":
    pdfs = pdff.replace("\\","/").replace("'","").split(",")
else:
    pdfs = [os.path.join(dirs.replace("\\","/").replace("'",""), f) for f in os.listdir(dirs.replace("\\","/").replace("'","")) if f.endswith(".pdf")]

client = QdrantClient(host="host.docker.internal", port="6333")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

pdfdb = PDFdatabase(pdfs, encoder, client)
pdfdb.preprocess()
pdfdb.collect_data()
pdfdb.qdrant_collection_and_upload()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(mod).to(device)
tokenizer = AutoTokenizer.from_pretrained(mod)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=2048, repetition_penalty=1.2, temperature=0.4)

def reply(message, history):
    global pdfdb
    txt = Translation(message, "en")
    if txt.original == "en" and lan.replace("\\","").replace("'","") == "None":
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        results = txt2txt.search(message)
        response = pipe(f"Context: {results[0]["text"]}, prompt: {message}")
        return response[0]["generated_text"]
    elif txt.original == "en" and lan.replace("\\","").replace("'","") != "None":
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        transl = Translation(message, lan.replace("\\","").replace("'",""))
        message = transl.translatef()
        results = txt2txt.search(message)
        t = Translation(results[0]["text"], txt.original)
        res = t.translatef()
        response = pipe(f"Context: {res}, prompt: {message}")
        return response[0]["generated_text"]
    elif txt.original != "en" and lan.replace("\\","").replace("'","") == "None":
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        results = txt2txt.search(message)
        transl = Translation(results[0]["text"], "en")
        translation = transl.translatef()
        response = pipe(f"Context: {translation}, prompt: {message}")
        t = Translation(response[0]["generated_text"], txt.original)
        res = t.translatef()
        return res
    else:
        txt2txt = NeuralSearcher(pdfdb.collection_name, pdfdb.client, pdfdb.encoder)
        transl = Translation(message, lan.replace("\\","").replace("'",""))
        message = transl.translatef()
        results = txt2txt.search(message)
        t = Translation(results[0]["text"], txt.original)
        res = t.translatef()
        response = pipe(f"Context: {res}, prompt: {message}")
        tr = Translation(response[0]["generated_text"], txt.original)
        ress = tr.translatef()
        return ress 
    
demo = gr.ChatInterface(fn=reply, title="everything-ai-retrievaltext")
demo.launch(server_name="0.0.0.0", share=False)

================================================
FILE: docker/select_and_run.py
================================================
import subprocess as sp
import gradio as gr

TASK_TO_SCRIPT = {"retrieval-text-generation": "retrieval_text_generation.py", "agnostic-text-generation": "agnostic_text_generation.py", "text-summarization": "text_summarization.py", "image-generation": "image_generation.py", "image-generation-pollinations": "image_generation_pollinations.py", "image-classification": "image_classification.py", "image-to-text": "image_to_text.py", "retrieval-image-search": "retrieval_image_search.py", "protein-folding": "protein_folding_with_esm.py", "video-generation": "video_generation.py", "speech-recognition": "speech_recognition.py", "spaces-api-supabase": "spaces_api_supabase.py", "audio-classification": "audio_classification.py", "autotrain": "autotrain_interface.py", "llama.cpp-and-qdrant": "llama_cpp_int.py", "build-your-llm": "build_your_llm.py", "simply-chatting": "chat_your_llm.py", "fal-img2img": "fal_img2img.py"}


def build_command(tsk, mod="None", pdff="None", dirs="None", lan="None", imdim="512", gradioclient="None", supabaseurl="None", collectname="None", supenc="all-MiniLM-L6-v2", supdim="384"):
    if tsk != "retrieval-text-generation" and tsk != "image-generation-pollinations" and tsk != "retrieval-image-search" and tsk != "autotrain" and tsk != "protein-folding" and tsk != "spaces-api-supabase" and tsk != "llama.cpp-and-qdrant" and tsk!="build-your-llm" and tsk!="simply-chatting" and tsk!="fal-img2img":
        sp.run(f"python3 {TASK_TO_SCRIPT[tsk]} -m {mod}", shell=True)
        return f"python3 {TASK_TO_SCRIPT[tsk]} -m {mod}"
    elif tsk == "retrieval-text-generation":
        sp.run(f"python3 {TASK_TO_SCRIPT[tsk]} -m {mod} -pf '{pdff}' -d '{dirs}' -l '{lan}'", shell=True)
        return f"python3 {TASK_TO_SCRIPT[tsk]} -m {mod} -pf '{pdff}' -d '{dirs}' -l '{lan}'"
    elif tsk == "llama.cpp-and-qdrant" or tsk== "build-your-llm":
        sp.run(f"python3 {TASK_TO_SCRIPT[tsk]} -pf '{pdff}' -d '{dirs}' -l '{lan}'", shell=True)
        return f"python3 {TASK_TO_SCRIPT[tsk]} -pf '{pdff}' -d '{dirs}' -l '{lan}'"
    elif tsk == "image-generation-pollinations" or tsk == "autotrain" or tsk == "protein-folding" or tsk=="simply-chatting" or tsk=="fal-img2img":
        sp.run(f"python3 {TASK_TO_SCRIPT[tsk]}", shell=True)
        return f"python3 {TASK_TO_SCRIPT[tsk]}"
    elif tsk == "spaces-api-supabase":
        if lan == "None":
            sp.run(f"python3 {TASK_TO_SCRIPT[tsk]} -gc {gradioclient} -sdu {supabaseurl} -cn {collectname} -en {supenc} -s {supdim}", shell=True)
        else:
            sp.run(f"python3 {TASK_TO_SCRIPT[tsk]} -gc {gradioclient} -sdu {supabaseurl} -cn {collectname} -en {supenc} -s {supdim} -l {lan}", shell=True)
        return f"python3 {TASK_TO_SCRIPT[tsk]} -gc {gradioclient} -sdu {supabaseurl} -cn {collectname} -en {supenc} -s {supdim} -l {lan}"
    else:
        sp.run(f"python3 {TASK_TO_SCRIPT[tsk]} -d {dirs} -id {imdim} -m {mod}", shell=True)
        return f"python3 {TASK_TO_SCRIPT[tsk]} -d {dirs} -id {imdim} -m {mod}"

demo = gr.Interface(
    build_command,
    [
        gr.Textbox(
            label="Task",
            info="Task you want your assistant to help you with",
            lines=3,
            value=f"Choose one of the following: {','.join(list(TASK_TO_SCRIPT.keys()))}; if you choose 'image-generation-pollinations' or 'autotrain' or 'protein-folding' or 'simply-chatting' or 'fal-img2img', you do not need to specify anything else. If you choose 'spaces-api-supabase' you need to specify the Spaces API client, the database URL, the collection name, the Sentence-Transformers encoder used to upload the vectors to the Supabase database and the vectors size (optionally also the language)",
        ),
        gr.Textbox(
            label="Model",
            info="AI model you want your assistant to run with",
            lines=3,
            value="None",
        ),
        gr.Textbox(
            label="PDF file(s)",
            info="Single pdf file or N pdfs reported like this: /path/to/file1.pdf,/path/to/file2.pdf,...,/path/to/fileN.pdf (there is no strict naming, you just need to provide them comma-separated), please do not use '\\' as path separators: only available with 'retrieval-text-generation'",
            lines=3,
            value="No file",
        ),
        gr.Textbox(
            label="Directory",
            info="Directory where all your pdfs or images (.jpg, .jpeg, .png) of interest are stored (only available with 'retrieval-text-generation' for pdfs and 'retrieval-image-search' for images). Please do not use '\\' as path separators",
            lines=3,
            value="No directory",
        ),
        gr.Textbox(
            label="Language",
            info="Language of the written content contained in the pdfs",
            lines=1,
            value="None",
        ),
        gr.Textbox(
            label="Image dimension",
            info="Dimension of the image (this is generally model and/or task-dependent!)",
            lines=1,
            value=f"e.g.: 512, 384, 758...",
        ),
        gr.Textbox(
            label="Spaces API client",
            info="Client for Spaces API",
            lines=3,
            value=f"e.g.: eswardivi/Phi-3-mini-4k-instruct",
        ),
        gr.Textbox(
            label="Supabase Database URL",
            info="URL of the Supabase database (to use with Spaces API)",
            lines=3,
            value=f"e.g.: postgresql://postgres.reneogdbgdsbgdbgdsgbdlf:yourcomplexpasswordhere@aws-0-eu-central-1.pooler.supabase.com:5432/postgres",
        ),
        gr.Textbox(
            label="Supabase collection name",
            info="Name of the Supabase collectio (to use with Spaces API)",
            lines=2,
            value=f"e.g.: documents",
        ),
        gr.Textbox(
            label="Supabase Vector Encoder",
            info="Name of the sentence-transformers encoder you used to upload vectors to your supabase database",
            lines=2,
            value=f"e.g.: all-MiniLM-L6-v2",
        ),
        gr.Textbox(
            label="Supabase Vector Size",
            info="Size of vectors in you supabase database",
            lines=1,
            value=f"e.g.: 384",
        ),
    ],
    outputs="textbox",
    theme=gr.themes.Base(),
    title="everything-ai"
)
if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=8760, share=False)

	

================================================
FILE: docker/spaces_api_supabase.py
================================================
import gradio as gr
from utils import Translation, NeuralSearcheR
from gradio_client import Client
import os
import vecs
from sentence_transformers import SentenceTransformer
from argparse import ArgumentParser

argparse = ArgumentParser()

argparse.add_argument(
    "-gc",
    "--gradio_client",
    help="Spaces API to connect with",
    required=True,
)

argparse.add_argument(
    "-sdu",
    "--supabase_database_url",
    help="URL for Supabase database",
    required=True
)

argparse.add_argument(
    "-cn",
    "--collection_name",
    help="Name of the Supabase collection",
    required=True
)

argparse.add_argument(
    "-l",
    "--language",
    help="Language of the written content contained in the pdfs",
    required=False,
    default="en"
)

argparse.add_argument(
    "-en",
    "--encoder",
    help="Encoder used in text vectorization",
    required=False,
    default="all-MiniLM-L6-v2"
)

argparse.add_argument(
    "-s",
    "--size",
    help="Size of the vectors",
    required=False,
    default=384,
    type=int
)

args = argparse.parse_args()


gradcli = args.gradio_client
supdb = args.supabase_database_url
collname = args.collection_name
lan = args.language
encd = args.encoder
sz = args.size


collection_name = collname
encoder = SentenceTransformer(encd)
client = supdb
api_client = Client(gradcli)
lan = "en"
vx = vecs.create_client(client)
docs = vx.get_or_create_collection(name=collection_name, dimension=sz)

def reply(message, history):
    global docs
    global encoder
    global api_client
    global lan
    txt = Translation(message, "en")
    print(txt.original, lan)
    if txt.original == "en" and lan == "en":
        txt2txt = NeuralSearcheR(docs, encoder)
        results = txt2txt.search(message)
        response = api_client.predict(
            f"Context: {results[0][2]['Content']}; Prompt: {message}",	# str  in 'Message' Textbox component
            0.4,	# float (numeric value between 0 and 1) in 'Temperature' Slider component
            True,	# bool  in 'Sampling' Checkbox component
            512,	# float (numeric value between 128 and 4096) in 'Max new tokens' Slider component
            api_name="/chat"
        )
        return response
    elif txt.original == "en" and lan != "en":
        txt2txt = NeuralSearcheR(docs, encoder)
        transl = Translation(message, lan)
        message = transl.translatef()
        results = txt2txt.search(message)
        t = Translation(results[0][2]['Content'], txt.original)
        res = t.translatef()
        response = api_client.predict(
            f"Context: {res}; Prompt: {message}",	# str  in 'Message' Textbox component
            0.4,	# float (numeric value between 0 and 1) in 'Temperature' Slider component
            True,	# bool  in 'Sampling' Checkbox component
            512,	# float (numeric value between 128 and 4096) in 'Max new tokens' Slider component
            api_name="/chat"
        )
        response = Translation(response, txt.original)
        return response.translatef()
    elif txt.original != "en" and lan == "en":
        txt2txt = NeuralSearcheR(docs, encoder)
        results = txt2txt.search(message)
        transl = Translation(results[0][2]['Content'], "en")
        translation = transl.translatef()
        response = api_client.predict(
            f"Context: {translation}; Prompt: {message}",	# str  in 'Message' Textbox component
            0.4,	# float (numeric value between 0 and 1) in 'Temperature' Slider component
            True,	# bool  in 'Sampling' Checkbox component
            512,	# float (numeric value between 128 and 4096) in 'Max new tokens' Slider component
            api_name="/chat"
        )
        t = Translation(response, txt.original)
        res = t.translatef()
        return res
    else:
        txt2txt = NeuralSearcheR(docs, encoder)
        transl = Translation(message, lan.replace("\\","").replace("'",""))
        message = transl.translatef()
        results = txt2txt.search(message)
        t = Translation(results[0][2]['Content'], txt.original)
        res = t.translatef()
        response = api_client.predict(
            f"Context: {res}; Prompt: {message}",	# str  in 'Message' Textbox component
            0.4,	# float (numeric value between 0 and 1) in 'Temperature' Slider component
            True,	# bool  in 'Sampling' Checkbox component
            512,	# float (numeric value between 128 and 4096) in 'Max new tokens' Slider component
            api_name="/chat"
        )
        tr = Translation(response, txt.original)
        ress = tr.translatef()
        return ress


demo = gr.ChatInterface(fn=reply, title="everything-ai-supabase2spacesapi")
demo.launch(server_name="0.0.0.0", share=False)

================================================
FILE: docker/speech_recognition.py
================================================
from transformers import pipeline
from argparse import ArgumentParser
import torch
import gradio as gr
import numpy as np


argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

args = argparse.parse_args()


mod = args.model
mod = mod.replace("\"", "").replace("'", "")

model_checkpoint = mod

# Audio class
classifier = pipeline(task="automatic-speech-recognition", model=mod)

def classify_text(audio):
    global classifier
    sr, data = audio
    short_tensor = data.astype(np.float32)
    res = classifier(short_tensor)
    return res["text"]

input_audio = gr.Audio(
    sources=["upload","microphone"],
    waveform_options=gr.WaveformOptions(
        waveform_color="#01C6FF",
        waveform_progress_color="#0066B4",
        skip_length=2,
        show_controls=False,
    ),
)
demo = gr.Interface(
    title="everything-ai-speechrec",
    fn=classify_text,
    inputs=input_audio,
    outputs="text"
)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", share=False)


================================================
FILE: docker/text_summarization.py
================================================
from transformers import pipeline
from argparse import ArgumentParser
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from utils import merge_pdfs
import gradio as gr
import time
import torch

histr = [[None, "Hi, I'm **everything-ai-summarization**🤖.\nI'm here to assist you and let you summarize _your_ texts and _your_ pdfs!\nCheck [my website](https://astrabert.github.io/everything-ai/) for troubleshooting and documentation reference\nHave fun!😊"]]

argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

args = argparse.parse_args()


mod = args.model
mod = mod.replace("\"", "").replace("'", "")

model_checkpoint = mod

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
summarizer = pipeline("summarization", model=model_checkpoint, device=device)

def convert_none_to_str(l: list):
    newlist = []
    for i in range(len(l)):
        if l[i] is None or type(l[i])==tuple:
            newlist.append("")
        else:
            newlist.append(l[i])
    return tuple(newlist)

def pdf2string(pdfpath):
    loader = PyPDFLoader(pdfpath)
    documents = loader.load()

    ### Split the documents into smaller chunks for processing
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)
    fulltext = ""
    for text in texts:
        fulltext += text.page_content+"\n\n\n"
    return fulltext

def add_message(history, message):
    global histr
    if history is not None:
        if len(message["files"]) > 0:
            history.append((message["files"], None))
            histr.append([message["files"], None])
        if message["text"] is not None and message["text"] != "":
            history.append((message["text"], None))
            histr.append([message["text"], None])
    else:
        history = histr
        add_message(history, message)
    return history, gr.MultimodalTextbox(value=None, interactive=False)


def bot(history):
    global histr
    if not history is None:
        if type(history[-1][0]) != tuple:
            text = history[-1][0]
            response = summarizer(text, max_length=int(len(text.split(" "))*0.5), min_length=int(len(text.split(" "))*0.05), do_sample=False)[0]
            response = response["summary_text"]
            histr[-1][1] = response
            history[-1][1] = ""
            for character in response:
                history[-1][1] += character
                time.sleep(0.05)
                yield history
        if type(history[-1][0]) == tuple:
            filelist = []
            for i in history[-1][0]:
                filelist.append(i)
            finalpdf = merge_pdfs(filelist)
            text = pdf2string(finalpdf)
            response = summarizer(text, max_length=int(len(text.split(" "))*0.5), min_length=int(len(text.split(" "))*0.05), do_sample=False)[0]
            response = response["summary_text"]
            histr[-1][1] = response
            history[-1][1] = ""
            for character in response:
                history[-1][1] += character
                time.sleep(0.05)
                yield history
    else:
        history = histr
        bot(history)

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(
        [[None, "Hi, I'm **everything-ai-summarization**🤖.\nI'm here to assist you and let you summarize _your_ texts and _your_ pdfs!\nCheck [my website](https://astrabert.github.io/everything-ai/) for troubleshooting and documentation reference\nHave fun!😊"]],
        label="everything-rag",
        elem_id="chatbot",
        bubble_full_width=False,
    )

    chat_input = gr.MultimodalTextbox(interactive=True, file_types=["pdf"], placeholder="Enter message or upload file...", show_label=False)

    chat_msg = chat_input.submit(add_message, [chatbot, chat_input], [chatbot, chat_input])
    bot_msg = chat_msg.then(bot, chatbot, chatbot, api_name="bot_response")
    bot_msg.then(lambda: gr.MultimodalTextbox(interactive=True), None, [chat_input])


demo.queue()

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", share=False)

	

================================================
FILE: docker/utils.py
================================================
# f(x)s that now are useful for all the tasks
from langdetect import detect
from deep_translator import GoogleTranslator
from pypdf import PdfMerger
from qdrant_client import models
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
import os
from datasets import load_dataset, Dataset
import torch
import numpy as np


def remove_items(test_list, item): 
    res = [i for i in test_list if i != item] 
    return res 

def merge_pdfs(pdfs: list):
    merger = PdfMerger()
    for pdf in pdfs:
        merger.append(pdf)
    merger.write(f"{pdfs[-1].split('.')[0]}_results.pdf")
    merger.close()
    return f"{pdfs[-1].split('.')[0]}_results.pdf"

class NeuralSearcher:
    def __init__(self, collection_name, client, model):
        self.collection_name = collection_name
        # Initialize encoder model
        self.model = model
        # initialize Qdrant client
        self.qdrant_client = client
    def search(self, text: str):
        # Convert text query into vector
        vector = self.model.encode(text).tolist()

        # Use `vector` for search for closest vectors in the collection
        search_result = self.qdrant_client.search(
            collection_name=self.collection_name,
            query_vector=vector,
            query_filter=None,  # If you don't want any filters for now
            limit=1,  # 5 the most closest results is enough
        )
        # `search_result` contains found vector ids with similarity scores along with the stored payload
        # In this function you are interested in payload only
        payloads = [hit.payload for hit in search_result]
        return payloads

class PDFdatabase:
    def __init__(self, pdfs, encoder, client):
        self.finalpdf = merge_pdfs(pdfs)
        self.collection_name = os.path.basename(self.finalpdf).split(".")[0].lower()
        self.encoder = encoder
        self.client = client
    def preprocess(self):
        loader = PyPDFLoader(self.finalpdf)
        documents = loader.load()
        ### Split the documents into smaller chunks for processing
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
        self.pages = text_splitter.split_documents(documents)
    def collect_data(self):
        self.documents = []
        for text in self.pages:
            contents = text.page_content.split("\n")
            contents = remove_items(contents, "")
            for content in contents:
                self.documents.append({"text": content, "source": text.metadata["source"], "page": str(text.metadata["page"])})
    def qdrant_collection_and_upload(self):
        self.client.recreate_collection(
            collection_name=self.collection_name,
            vectors_config=models.VectorParams(
                size=self.encoder.get_sentence_embedding_dimension(),  # Vector size is defined by used model
                distance=models.Distance.COSINE,
            ),
        )
        self.client.upload_points(
            collection_name=self.collection_name,
            points=[
                models.PointStruct(
                    id=idx, vector=self.encoder.encode(doc["text"]).tolist(), payload=doc
                )
                for idx, doc in enumerate(self.documents)
            ],
        )

class Translation:
    def __init__(self, text, destination):
        self.text = text
        self.destination = destination
        try:
            self.original = detect(self.text)
        except Exception as e:
            self.original = "auto"
    def translatef(self):
        translator = GoogleTranslator(source=self.original, target=self.destination)
        translation = translator.translate(self.text)
        return translation

class ImageDB:
    def __init__(self, imagesdir, processor, model, client, dimension):
        self.imagesdir = imagesdir
        self.processor = processor
        self.model = model
        self.client = client
        self.dimension = dimension
        if os.path.basename(self.imagesdir) != "":
            self.collection_name = os.path.basename(self.imagesdir)+"_ImagesCollection"
        else:
            if "\\" in self.imagesdir:
               self.collection_name = self.imagesdir.split("\\")[-2]+"_ImagesCollection" 
            else:
                self.collection_name = self.imagesdir.split("/")[-2]+"_ImagesCollection" 
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.client.recreate_collection(
            collection_name=self.collection_name,
            vectors_config=models.VectorParams(size=self.dimension, distance=models.Distance.COSINE)
        )
    def get_embeddings(self, batch):
        inputs = self.processor(images=batch['image'], return_tensors="pt").to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs).last_hidden_state.mean(dim=1).cpu().numpy()
        batch['embeddings'] = outputs
        return batch
    def create_dataset(self):
        self.dataset = load_dataset("imagefolder", data_dir=self.imagesdir, split="train")
        self.dataset = self.dataset.map(self.get_embeddings, batched=True, batch_size=16)
    def to_collection(self):
        np.save(os.path.join(self.imagesdir, "vectors"), np.array(self.dataset['embeddings']), allow_pickle=False)

        payload = self.dataset.select_columns([
            "label"
        ]).to_pandas().fillna(0).to_dict(orient="records")

        ids = list(range(self.dataset.num_rows))
        embeddings = np.load(os.path.join(self.imagesdir, "vectors.npy")).tolist()

        batch_size = 1000

        for i in range(0, self.dataset.num_rows, batch_size):

            low_idx = min(i+batch_size, self.dataset.num_rows)

            batch_of_ids = ids[i: low_idx]
            batch_of_embs = embeddings[i: low_idx]
            batch_of_payloads = payload[i: low_idx]

            self.client.upsert(
                collection_name = self.collection_name,
                points=models.Batch(
                    ids=batch_of_ids,
                    vectors=batch_of_embs,
                    payloads=batch_of_payloads
                )
            )
    def searchDB(self, image):
        dtst = {"image": [image], "label": ["None"]}
        dtst = Dataset.from_dict(dtst)
        dtst = dtst.map(self.get_embeddings, batched=True, batch_size=1)
        img = dtst[0]
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=img['embeddings'],
            limit=4
        )
        return results

class NeuralSearcheR:
    def __init__(self, collection, encoder):
        self.collection = collection
        self.encoder = encoder
    def search(self, text):
        results = self.collection.query(
            data=self.encoder.encode(text).tolist(),  # required
            limit=1,                     # number of records to return
            filters={},                  # metadata filters
            measure="cosine_distance",   # distance measure to use
            include_value=True,         # should distance measure values be returned?
            include_metadata=True,      # should record metadata be returned?
        )
        return results

================================================
FILE: docker/video_generation.py
================================================
import gradio as gr
import moviepy.editor as mp
from diffusers import DiffusionPipeline
from argparse import ArgumentParser


argparse = ArgumentParser()
argparse.add_argument(
    "-m",
    "--model",
    help="HuggingFace Model identifier, such as 'google/flan-t5-base'",
    required=True,
)

args = argparse.parse_args()


mod = args.model
mod = mod.replace("\"", "").replace("'", "")

model_checkpoint = mod

# Load diffusion pipelines
image_pipeline = DiffusionPipeline.from_pretrained(model_checkpoint)
video_pipeline = DiffusionPipeline.from_pretrained(model_checkpoint)

def generate_images(prompt, num_images):
  """Generates images using the image pipeline."""
  images = []
  for _ in range(num_images):
    generated_image = image_pipeline(prompt=prompt).images[0]
    images.append(generated_image)
  return images


def generate_videos(images):
    """Generates videos from a list of images using the video pipeline."""
    videos = []
    for image in images:
        # Wrap the image in a list as expected by the pipeline
        generated_video = video_pipeline(images=[image]).images[0]
        videos.append(generated_video)
    return videos

def combine_videos(video_clips):
  final_clip = mp.concatenate_videoclips(video_clips)
  return final_clip

def generate(prompt):
  images = generate_images(prompt, 2)
  video_clips = generate_videos(images)
  combined_video = combine_videos(video_clips)
  return combined_video

# Gradio interface with improved formatting and video output
interface = gr.Interface(
    fn=generate,
    inputs="text",
    outputs="video",
    title="everything-ai-text2vid",
    description="Enter a prompt to generate a video using diffusion models.",
    css="""
      .output-video {
        width: 100%; /* Adjust width as needed */
        height: 400px; /* Adjust height as desired */
      }
    """,
)

# Launch the interface
interface.launch(server_name="0.0.0.0", share=False)

Download .txt

gitextract_dl7fgpnu/

├── .github/
│   └── FUNDING.yml
├── .gitignore
├── .v0_1_1/
│   ├── README.md
│   ├── docker/
│   │   ├── Dockerfile
│   │   ├── build_command.sh
│   │   ├── chat.py
│   │   ├── requirements.txt
│   │   └── utils.py
│   └── scripts/
│       ├── gemma-for-datasciences.ipynb
│       └── gemma_for_datasciences.py
├── LICENSE
├── README.md
├── _config.yml
├── compose.yaml
└── docker/
    ├── Dockerfile
    ├── agnostic_text_generation.py
    ├── audio_classification.py
    ├── autotrain_interface.py
    ├── build_your_llm.py
    ├── chat_your_llm.py
    ├── fal_img2img.py
    ├── image_classification.py
    ├── image_generation.py
    ├── image_generation_pollinations.py
    ├── image_to_text.py
    ├── llama_cpp_int.py
    ├── protein_folding_with_esm.py
    ├── requirements.txt
    ├── retrieval_image_search.py
    ├── retrieval_text_generation.py
    ├── select_and_run.py
    ├── spaces_api_supabase.py
    ├── speech_recognition.py
    ├── text_summarization.py
    ├── utils.py
    └── video_generation.py

Download .txt

SYMBOL INDEX (72 symbols across 23 files)

FILE: .v0_1_1/docker/chat.py
  function generate_welcome_message (line 8) | def generate_welcome_message():
  function print_like_dislike (line 11) | def print_like_dislike(x: gr.LikeData):
  function add_message (line 14) | def add_message(history, message):
  function bot (line 22) | def bot(history):

FILE: .v0_1_1/docker/utils.py
  function merge_pdfs (line 45) | def merge_pdfs(pdfs: list):
  function create_a_persistent_db (line 53) | def create_a_persistent_db(pdfpath, dbpath, cachepath) -> None:
  function convert_none_to_str (line 107) | def convert_none_to_str(l: list):
  function just_chatting (line 116) | def just_chatting(

FILE: .v0_1_1/scripts/gemma_for_datasciences.py
  function create_a_persistent_db (line 60) | def create_a_persistent_db(pdfpath, dbpath, cachepath) -> None:
  function just_chatting (line 114) | def just_chatting(
  class ChatGUI (line 182) | class ChatGUI:
    method __init__ (line 183) | def __init__(self, master):
    method send_message (line 199) | def send_message(self):
    method display_message (line 207) | def display_message(self, message):

FILE: docker/agnostic_text_generation.py
  function reply (line 28) | def reply(message, history):

FILE: docker/audio_classification.py
  function classify_text (line 26) | def classify_text(audio):

FILE: docker/autotrain_interface.py
  function build_command (line 6) | def build_command(hf_usr, hf_token, configpath):

FILE: docker/build_your_llm.py
  function get_session_history (line 67) | def get_session_history(session_id):
  function build_langfuse_handler (line 77) | def build_langfuse_handler(langfuse_host, langfuse_pkey, langfuse_skey):
  function reply (line 88) | def reply(message, history, name, api_key, temperature, max_new_tokens,l...

FILE: docker/chat_your_llm.py
  function build_langfuse_handler (line 22) | def build_langfuse_handler(langfuse_host, langfuse_pkey, langfuse_skey):
  function get_session_history (line 33) | def get_session_history(session_id):
  function reply (line 36) | def reply(message, history, name, api_key, temperature, max_new_tokens,l...

FILE: docker/fal_img2img.py
  function submit (line 9) | async def submit(image_path, prompt, seed):
  function get_url (line 22) | def get_url(results):
  function render_image (line 28) | def render_image(api_key, image_path, prompt, seed):

FILE: docker/image_classification.py
  function get_results (line 29) | def get_results(image, ppln=pipe):

FILE: docker/image_generation.py
  function reply (line 28) | def reply(message, history):

FILE: docker/image_generation_pollinations.py
  function reply (line 6) | def reply(message, history):

FILE: docker/image_to_text.py
  function get_results (line 25) | def get_results(image, ppln=pipe):

FILE: docker/llama_cpp_int.py
  function llama_cpp_respond (line 58) | def llama_cpp_respond(query, max_new_tokens):
  function reply (line 74) | def reply(max_new_tokens, message):

FILE: docker/protein_folding_with_esm.py
  function read_mol (line 21) | def read_mol(molpath):
  function molecule (line 30) | def molecule(input_pdb):
  function convert_outputs_to_pdb (line 84) | def convert_outputs_to_pdb(outputs):
  function fold_protein (line 119) | def fold_protein(test_protein):

FILE: docker/retrieval_image_search.py
  function see_images (line 53) | def see_images(dataset, results):
  function process_img (line 60) | def process_img(image):

FILE: docker/retrieval_text_generation.py
  function reply (line 70) | def reply(message, history):

FILE: docker/select_and_run.py
  function build_command (line 7) | def build_command(tsk, mod="None", pdff="None", dirs="None", lan="None",...

FILE: docker/spaces_api_supabase.py
  function reply (line 76) | def reply(message, history):

FILE: docker/speech_recognition.py
  function classify_text (line 27) | def classify_text(audio):

FILE: docker/text_summarization.py
  function convert_none_to_str (line 31) | def convert_none_to_str(l: list):
  function pdf2string (line 40) | def pdf2string(pdfpath):
  function add_message (line 52) | def add_message(history, message):
  function bot (line 67) | def bot(history):

FILE: docker/utils.py
  function remove_items (line 14) | def remove_items(test_list, item):
  function merge_pdfs (line 18) | def merge_pdfs(pdfs: list):
  class NeuralSearcher (line 26) | class NeuralSearcher:
    method __init__ (line 27) | def __init__(self, collection_name, client, model):
    method search (line 33) | def search(self, text: str):
  class PDFdatabase (line 49) | class PDFdatabase:
    method __init__ (line 50) | def __init__(self, pdfs, encoder, client):
    method preprocess (line 55) | def preprocess(self):
    method collect_data (line 61) | def collect_data(self):
    method qdrant_collection_and_upload (line 68) | def qdrant_collection_and_upload(self):
  class Translation (line 86) | class Translation:
    method __init__ (line 87) | def __init__(self, text, destination):
    method translatef (line 94) | def translatef(self):
  class ImageDB (line 99) | class ImageDB:
    method __init__ (line 100) | def __init__(self, imagesdir, processor, model, client, dimension):
    method get_embeddings (line 118) | def get_embeddings(self, batch):
    method create_dataset (line 124) | def create_dataset(self):
    method to_collection (line 127) | def to_collection(self):
    method searchDB (line 155) | def searchDB(self, image):
  class NeuralSearcheR (line 167) | class NeuralSearcheR:
    method __init__ (line 168) | def __init__(self, collection, encoder):
    method search (line 171) | def search(self, text):

FILE: docker/video_generation.py
  function generate_images (line 27) | def generate_images(prompt, num_images):
  function generate_videos (line 36) | def generate_videos(images):
  function combine_videos (line 45) | def combine_videos(video_clips):
  function generate (line 49) | def generate(prompt):

Download .json

Condensed preview — 36 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (142K chars).

[
  {
    "path": ".github/FUNDING.yml",
    "chars": 67,
    "preview": "# These are supported funding model platforms\n\ngithub: [AstraBert]\n"
  },
  {
    "path": ".gitignore",
    "chars": 61,
    "preview": "flagged/\r\ndocker/__pycache__\r\ndocker/flagged\r\nqdrant_storage/"
  },
  {
    "path": ".v0_1_1/README.md",
    "chars": 9985,
    "preview": "# everything-rag\r\n\r\n>_How was this README generated? Levearaging the power of AI with **reAIdme**, an HuggingChat assist"
  },
  {
    "path": ".v0_1_1/docker/Dockerfile",
    "chars": 817,
    "preview": "# Use an official Python runtime as a parent image\r\nFROM python:3.10-slim-bookworm\r\n\r\n# Set the working directory in the"
  },
  {
    "path": ".v0_1_1/docker/build_command.sh",
    "chars": 675,
    "preview": "docker buildx build \\\n--label org.opencontainers.image.title=everything-rag \\\n--label org.opencontainers.image.descripti"
  },
  {
    "path": ".v0_1_1/docker/chat.py",
    "chars": 3400,
    "preview": "import gradio as gr\r\nimport os\r\nimport time\r\nfrom utils import *\r\n\r\nvectordb = \"\"\r\n\r\ndef generate_welcome_message():\r\n  "
  },
  {
    "path": ".v0_1_1/docker/requirements.txt",
    "chars": 163,
    "preview": "langchain-community==0.0.13 \r\nlangchain==0.1.1 \r\npypdf==3.17.4\r\nsentence_transformers==2.2.2\r\nchromadb==0.4.22\r\ncryptogr"
  },
  {
    "path": ".v0_1_1/docker/utils.py",
    "chars": 6261,
    "preview": "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM, pipeline\r\nimport time\r\nfrom langcha"
  },
  {
    "path": ".v0_1_1/scripts/gemma-for-datasciences.ipynb",
    "chars": 14129,
    "preview": "{\"metadata\":{\"colab\":{\"provenance\":[]},\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"la"
  },
  {
    "path": ".v0_1_1/scripts/gemma_for_datasciences.py",
    "chars": 10534,
    "preview": "# -*- coding: utf-8 -*-\n\n\"\"\"# gemma-2b AS A DATA SCIENCE TEACHER\n\n## 100% LOCAL, WITHOUT FINETUNING, WITH _YOUR OWN DATA"
  },
  {
    "path": "LICENSE",
    "chars": 11558,
    "preview": "                                 Apache License\r\n                           Version 2.0, January 2004\r\n                 "
  },
  {
    "path": "README.md",
    "chars": 6546,
    "preview": "<h1 align=\"center\">everything-ai</h1>\r\n<h2 align=\"center\">Your fully proficient, AI-powered and local chatbot assistant🤖"
  },
  {
    "path": "_config.yml",
    "chars": 27,
    "preview": "theme: jekyll-theme-minimal"
  },
  {
    "path": "compose.yaml",
    "chars": 634,
    "preview": "networks:\r\n  mynet:\r\n    driver: bridge\r\n\r\nservices:\r\n  everything-ai:\r\n    image: astrabert/everything-ai\r\n    volumes:"
  },
  {
    "path": "docker/Dockerfile",
    "chars": 377,
    "preview": "# Use an official Python runtime as a parent image\r\nFROM astrabert/everything-ai\r\n\r\n# Set the working directory in the c"
  },
  {
    "path": "docker/agnostic_text_generation.py",
    "chars": 1194,
    "preview": "import gradio as gr\nfrom utils import Translation\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, pipeline"
  },
  {
    "path": "docker/audio_classification.py",
    "chars": 1159,
    "preview": "from transformers import pipeline\r\nfrom argparse import ArgumentParser\r\nimport torch\r\nimport gradio as gr\r\nimport numpy "
  },
  {
    "path": "docker/autotrain_interface.py",
    "chars": 1114,
    "preview": "import subprocess as sp\r\nimport gradio as gr\r\nimport subprocess as sp\r\n\r\n\r\ndef build_command(hf_usr, hf_token, configpat"
  },
  {
    "path": "docker/build_your_llm.py",
    "chars": 9806,
    "preview": "from langchain_anthropic import ChatAnthropic\r\nfrom langchain_cohere import ChatCohere\r\nfrom langchain_groq import ChatG"
  },
  {
    "path": "docker/chat_your_llm.py",
    "chars": 5901,
    "preview": "from langchain_anthropic import ChatAnthropic\r\nfrom langchain_cohere import ChatCohere\r\nfrom langchain_groq import ChatG"
  },
  {
    "path": "docker/fal_img2img.py",
    "chars": 1621,
    "preview": "import asyncio\r\nimport fal_client\r\nimport os\r\nimport gradio as gr\r\nfrom PIL import Image\r\n\r\nMAP_EXTS = {\"jpg\": \"jpeg\", \""
  },
  {
    "path": "docker/image_classification.py",
    "chars": 1223,
    "preview": "from transformers import AutoModelForImageClassification, AutoImageProcessor, pipeline\nfrom PIL import Image\nfrom argpar"
  },
  {
    "path": "docker/image_generation.py",
    "chars": 1176,
    "preview": "from diffusers import DiffusionPipeline\nimport torch\nfrom argparse import ArgumentParser\n\nargparse = ArgumentParser()\nar"
  },
  {
    "path": "docker/image_generation_pollinations.py",
    "chars": 680,
    "preview": "import gradio as gr\nfrom utils import Translation\n\n\n\ndef reply(message, history):\n    txt = Translation(message, \"en\")\n "
  },
  {
    "path": "docker/image_to_text.py",
    "chars": 948,
    "preview": "import torch\nfrom transformers import pipeline\nfrom PIL import Image\nfrom argparse import ArgumentParser\n\nargparse = Arg"
  },
  {
    "path": "docker/llama_cpp_int.py",
    "chars": 4572,
    "preview": "from utils import Translation, PDFdatabase, NeuralSearcher\r\nimport gradio as gr\r\nfrom qdrant_client import QdrantClient\r"
  },
  {
    "path": "docker/protein_folding_with_esm.py",
    "chars": 4922,
    "preview": "from transformers import AutoTokenizer, EsmForProteinFolding\r\nfrom transformers.models.esm.openfold_utils.protein import"
  },
  {
    "path": "docker/requirements.txt",
    "chars": 279,
    "preview": "langchain-community==0.0.13 \nlangchain==0.1.1 \npypdf==3.17.4\nsentence_transformers==2.2.2\ntransformers==4.39.3\nlangdetec"
  },
  {
    "path": "docker/retrieval_image_search.py",
    "chars": 1813,
    "preview": "from transformers import AutoImageProcessor, AutoModel\r\nfrom utils import ImageDB\r\nfrom PIL import Image\r\nfrom qdrant_cl"
  },
  {
    "path": "docker/retrieval_text_generation.py",
    "chars": 4157,
    "preview": "from utils import Translation, PDFdatabase, NeuralSearcher\nimport gradio as gr\nfrom qdrant_client import QdrantClient\nfr"
  },
  {
    "path": "docker/select_and_run.py",
    "chars": 6405,
    "preview": "import subprocess as sp\nimport gradio as gr\n\nTASK_TO_SCRIPT = {\"retrieval-text-generation\": \"retrieval_text_generation.p"
  },
  {
    "path": "docker/spaces_api_supabase.py",
    "chars": 4885,
    "preview": "import gradio as gr\r\nfrom utils import Translation, NeuralSearcheR\r\nfrom gradio_client import Client\r\nimport os\r\nimport "
  },
  {
    "path": "docker/speech_recognition.py",
    "chars": 1164,
    "preview": "from transformers import pipeline\r\nfrom argparse import ArgumentParser\r\nimport torch\r\nimport gradio as gr\r\nimport numpy "
  },
  {
    "path": "docker/text_summarization.py",
    "chars": 4260,
    "preview": "from transformers import pipeline\nfrom argparse import ArgumentParser\nfrom langchain.text_splitter import CharacterTextS"
  },
  {
    "path": "docker/utils.py",
    "chars": 7264,
    "preview": "# f(x)s that now are useful for all the tasks\nfrom langdetect import detect\nfrom deep_translator import GoogleTranslator"
  },
  {
    "path": "docker/video_generation.py",
    "chars": 2007,
    "preview": "import gradio as gr\r\nimport moviepy.editor as mp\r\nfrom diffusers import DiffusionPipeline\r\nfrom argparse import Argument"
  }
]

About this extraction

This page contains the full source code of the AstraBert/everything-ai GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 36 files (128.7 KB), approximately 32.7k tokens, and a symbol index with 72 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo