[
  {
    "path": ".github/workflows/sync-with-huggingface.yml",
    "content": "name: Sync with Hugging Face\n\non:\n  push:\n    branches:\n      - main\n    paths:\n      - .github/workflows/sync-with-huggingface.yml\n      - app/**\n\njobs:\n  build:\n    runs-on: ubuntu-latest\n    steps:\n    - name: Sync with Hugging Face\n      uses: nateraw/huggingface-sync-action@v0.0.5\n      with:\n        # The github repo you are syncing from. Required.\n        github_repo_id: 'myscale/ChatData'\n\n        # The Hugging Face repo id you want to sync to. (ex. 'username/reponame')\n        # A repo with this name will be created if it doesn't exist. Required.\n        huggingface_repo_id: 'myscale/ChatData'\n\n        # Hugging Face token with write access. Required.\n        # Here, we provide a token that we called `HF_TOKEN` when we added the secret to our GitHub repo.\n        hf_token: ${{ secrets.HF_TOKEN }}\n\n        # The type of repo you are syncing to: model, dataset, or space.\n        # Defaults to space.\n        repo_type: 'space'\n        \n        # If true and the Hugging Face repo doesn't already exist, it will be created\n        # as a private repo.\n        #\n        # Note: this param has no effect if the repo already exists.\n        private: false\n\n        # If repo type is space, specify a space_sdk. One of: streamlit, gradio, or static\n        #\n        # This option is especially important if the repo has not been created yet.\n        # It won't really be used if the repo already exists.\n        space_sdk: 'streamlit'\n        \n        # If provided, subdirectory will determine which directory of the repo will be synced.\n        # By default, this action syncs the entire GitHub repo.\n        #\n        # An example using this option can be seen here:\n        # https://github.com/huggingface/fuego/blob/830ed98/.github/workflows/sync-with-huggingface.yml\n        subdirectory: app\n"
  },
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\n.idea/\n\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# poetry\n#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.\n#   This is especially recommended for binary packages to ensure reproducibility, and is more\n#   commonly ignored for libraries.\n#   
https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control\n#poetry.lock\n\n# pdm\n#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.\n#pdm.lock\n#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it\n#   in version control.\n#   https://pdm.fming.dev/#use-with-ide\n.pdm.toml\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# PyCharm\n#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can\n#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore\n#  and can be added to the global gitignore or merged into this file.  For a more nuclear\n#  option (not recommended) you can uncomment the following to ignore the entire idea folder.\n#.idea/\n\n\n# dataset files\ndata/\n.streamlit/\n#*.ipynb\n.DS_Store"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2023 MyScale\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# ChatData 🔍 📖\n\n***We are constantly improving LangChain's self-query retriever. Some of the features are not merged yet.***\n\n[![](https://dcbadge.vercel.app/api/server/D2qpkqc4Jq?compact=true&style=flat)](https://discord.gg/D2qpkqc4Jq)\n[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/myscaledb.svg?style=social&label=Follow%20%40MyScaleDB)](https://twitter.com/myscaledb)\n<a href=\"https://huggingface.co/spaces/myscale/ChatData\"  style=\"padding-left: 0.5rem;\"><img src=\"https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-orange\"></a>\n\n<br>\n<div style=\"text-align: center\">\n<img src=\"assets/logo.png\" width=60%>\n</div>\n\nYet another chat-with-documents app, but supporting query over millions of files with [MyScale](https://myscale.com) and [LangChain](https://github.com/hwchase17/langchain/).\n\n## Introduction 📖\n\n### Overview\n\nChatData is a robust chat-with-documents application designed to extract information and provide answers by querying the MyScale free knowledge base or your uploaded documents.\n\nPowered by the Retrieval Augmented Generation (RAG) framework, ChatData leverages millions of Wikipedia pages and arXiv papers as its external knowledge base, with MyScale managing all data hosting tasks. Simply input your questions in natural language, and ChatData takes care of generating SQL, querying the data, and presenting the results.\n\nEnhancing your chat experience, ChatData introduces three key features. Let's delve into each of them in detail.\n\n#### Feature 1: Retriever Type\n\nMyScale works closely with LangChain, providing the easiest interface to build complex queries with LLM.\n\n**Self-querying retriever:** MyScale augmented LangChain's Self Querying Retriever, where the LLM can use more data types, for instance timestamps and array of strings, to build filters for the query.\n\n**VectorSQL:** SQL is powerful and can be used to construct complex search queries. 
Vector Structured Query Language (Vector SQL) is designed to teach LLMs how to query SQL vector databases. Besides the general data types and functions, Vector SQL adds functions such as `DISTANCE(column, query_vector)` and `NeuralArray(entity)`, which extend standard SQL for vector search.\n\n#### Feature 2: Session Management\n\nTo let you pick up existing sessions where you left off, ChatData provides a Session Management feature. You can customize your session ID and modify your prompt to guide how ChatData addresses your queries. With just a few clicks, you can enjoy smooth and personalized session interactions.\n\n#### Feature 3: Building Your Own Knowledge Base\n\nIn addition to tapping into ChatData's external knowledge base powered by MyScale for answers, you can upload your own files and build a personalized knowledge base. We use the Unstructured API for this purpose, so that only the processed text from your documents is stored, prioritizing your data privacy.\n\nIn conclusion, with ChatData, you can effortlessly navigate vast amounts of data and access precisely what you need. Whether you're a researcher, a student, or a knowledge enthusiast, ChatData empowers you to explore academic papers and research documents like never before. 
Unlock the true potential of information retrieval with ChatData and discover a world of knowledge at your fingertips.\n\n➡️ Dive in and experience ChatData on [Hugging Face](https://huggingface.co/spaces/myscale/ChatData)🤗\n\n![ChatData Homepage](assets/home.png)\n\n### Data schema\n\nCredentials for the public account:\n\n```toml\nMYSCALE_HOST = \"msc-950b9f1f.us-east-1.aws.myscale.com\"\nMYSCALE_PORT = 443\nMYSCALE_USER = \"chatdata\"\nMYSCALE_PASSWORD = \"myscale_rocks\"\n```\n\n#### *[NEW]* Table `wiki.Wikipedia`\n\nChatData also gives you access to Wikipedia, a large knowledge base that contains about 36 million paragraphs from 5 million wiki pages. The knowledge base is a snapshot from 2022-12.\n\nYou can query this table with the public account [here](#data-schema).\n\n```sql\nCREATE TABLE wiki.Wikipedia (\n    -- Record ID\n    `id` String, \n    -- Title of the page containing this paragraph\n    `title` String, \n    -- Paragraph text\n    `text` String,\n    -- Page URL\n    `url` String,\n    -- Wiki page ID\n    `wiki_id` UInt64,\n    -- View statistics\n    `views` Float32,\n    -- Paragraph ID\n    `paragraph_id` UInt64,\n    -- Language ID\n    `langs` UInt32, \n    -- Feature vector of this paragraph\n    `emb` Array(Float32), \n    -- Vector Index\n    VECTOR INDEX emb_idx emb TYPE MSTG('metric_type=Cosine'), \n    CONSTRAINT emb_len CHECK length(emb) = 768) \nENGINE = ReplacingMergeTree ORDER BY id SETTINGS index_granularity = 8192\n```\n\n#### Table `default.ChatArXiv`\n\nChatData brings millions of papers into your knowledge base. We imported 2.2 million papers with metadata, stored in the following columns:\n\n1. `id`: paper's arXiv ID\n2. `abstract`: paper's abstract, used as the ranking criterion (with InstructorXL)\n3. `vector`: column that contains the vector array in `Array(Float32)`\n4. `metadata`: LangChain VectorStore Compatible Columns\n    1. `metadata.authors`: paper's authors in *list of strings*\n    2. 
`metadata.abstract`: paper's abstract, used as the ranking criterion (with InstructorXL)\n    3. `metadata.titles`: paper's title\n    4. `metadata.categories`: paper's categories in *list of strings* like [\"cs.CV\"]\n    5. `metadata.pubdate`: paper's date of publication in *ISO 8601 formatted strings*\n    6. `metadata.primary_category`: paper's primary category in *strings* defined by arXiv\n    7. `metadata.comment`: additional comments on the paper\n  \n*Columns below are native columns in MyScale and can only be used with SQLDatabase*\n\n5. `authors`: paper's authors in *list of strings*\n6. `titles`: paper's title\n7. `categories`: paper's categories in *list of strings* like [\"cs.CV\"]\n8. `pubdate`: paper's date of publication in *Date32 data type* (faster)\n9. `primary_category`: paper's primary category in *strings* defined by arXiv\n10. `comment`: additional comments on the paper\n\nFor the overall table schema, please refer to the [table creation section in docs/self-query.md](docs/self-query.md#table-creation).\n\nIf you want to use this database with `langchain.chains.sql_database.base.SQLDatabaseChain` or `langchain.retrievers.SQLDatabaseRetriever`, please follow the [data preparation section](docs/vector-sql.md#prepare-the-database) and the [chain creation section](docs/vector-sql.md#create-the-sqldatabasechain) in docs/vector-sql.md.\n\n### Where can I get those arXiv data?\n\n- [From parquet files on S3](docs/self-query.md#insert-data)\n- <a name=\"data-service\"></a>Or directly use the MyScale database as a service... 
for **FREE** ✨\n\n    ```python\n    import clickhouse_connect\n\n    client = clickhouse_connect.get_client(\n        host='msc-950b9f1f.us-east-1.aws.myscale.com',\n        port=443,\n        username='chatdata',\n        password='myscale_rocks'\n    )\n    ```\n\n## Monthly Updates 🔥 (November-2023)\n\n- 🚀 Upload your documents and chat with your own knowledge bases with MyScale!\n- 💬 Chat with RAG-enabled agents on both the ArXiv and Wikipedia knowledge bases!\n- 📖 Wikipedia is available as a knowledge base!! Feel FREE 💰 to ask across 36 million paragraphs under 5 million titles! 💫\n- 🤖 LLMs are now capable of writing **Vector SQL** - an extended SQL with vector search! Vector SQL allows you to **access MyScale faster and stronger**! This will **be added to LangChain** soon! ([PR 7454](https://github.com/hwchase17/langchain/pull/7454))\n- 🌏 Customized Retrieval QA Chain that gives you **more information** on each PDF and **answers questions in your native language**!\n- 🔧 Our contribution to LangChain that helps self-query retrievers [**filter with more types and functions**](https://python.langchain.com/docs/modules/data_connection/retrievers/how_to/self_query/myscale_self_query)\n- 🌟 **We just opened a FREE pod hosting data for ArXiv papers.** Anyone can try their own SQL with vector search!!! Feel the power when SQL meets vector search! See how to access the pod [here](#data-service).\n- 📚 We collected about **2 million papers on arXiv**! We are collecting more and we need your advice!\n- More coming...\n\n## How to build your own app from scratch 🧱\n\n### Quickstart\n\n1. Enter directory `app/`\n\n```bash\ncd app/\n```\n\n2. Create a virtual environment\n\n```bash\npython3 -m venv venv\nsource venv/bin/activate\n```\n\n3. Install dependencies\n\n```bash\npython3 -m pip install -r requirements.txt\n```\n\n4. 
Run the app!\n\n```bash\n# copy the example secrets file, then fill in your OpenAI key in .streamlit/secrets.toml\ncp .streamlit/secrets.example.toml .streamlit/secrets.toml\n# start the app\npython3 -m streamlit run app.py\n```\n\n### With LangChain SQLDatabaseRetrievers\n\n [*Read the full article*](https://myscale.com/blog/teach-your-llm-vector-sql/)\n\n- [Why Vector SQL?](https://myscale.com/blog/teach-your-llm-vector-sql/#automate-the-whole-process-with-sql-and-vector-search)\n- [How did LangChain and MyScale convert natural language to structured filters?](https://myscale.com/docs/en/advanced-applications/chatdata/#selfqueryretriever)\n- [How to make chain execution more responsive in LangChain?](https://myscale.com/docs/en/advanced-applications/chatdata/#add-callbacks)\n\n### With LangChain Self-Query Retrievers\n\n[*Read the full article*](https://myscale.com/docs/en/advanced-applications/chatdata/)\n\n- [How is this app built?](https://docs.myscale.com/en/advanced-applications/chatdata)\n- [What does the overall pipeline look like?](https://docs.myscale.com/en/advanced-applications/chatdata/#design-the-query-pipeline)\n- [How did LangChain and MyScale convert natural language to structured filters?](https://docs.myscale.com/en/advanced-applications/chatdata/#selfqueryretriever)\n- [How to make chain execution more responsive in LangChain?](https://docs.myscale.com/en/advanced-applications/chatdata/#add-callbacks)\n\n## Community 🌍\n\n- Join our #ChatData channel on [Discord](https://discord.gg/jGCq2yZH) to discuss anything about ChatData.\n- Feel free to file an issue or open a PR against this repository.\n\n## Special Thanks 👏 (Ordered Alphabetically)\n\n- [arXiv API](https://info.arxiv.org/help/api/index.html) for its open access interoperability to pre-printed papers.\n- [InstructorXL](https://huggingface.co/hkunlp/instructor-xl) for its promptable embeddings that improve retrieval performance.\n- [LangChain🦜️🔗](https://github.com/hwchase17/langchain/) for its easy-to-use and 
composable API designs and prompts.\n- [OpenChatPaper](https://github.com/liuyixin-louis/OpenChatPaper) for prompt design reference.\n- [The Alexandria Index](https://alex.macrocosm.so/download) for providing arXiv data index to the public.\n"
  },
  {
    "path": "app/app.py",
    "content": "import os\nimport time\n\nimport streamlit as st\n\nfrom backend.constants.streamlit_keys import DATA_INITIALIZE_NOT_STATED, DATA_INITIALIZE_COMPLETED, \\\n    DATA_INITIALIZE_STARTED\nfrom backend.constants.variables import DATA_INITIALIZE_STATUS, JUMP_QUERY_ASK, CHAINS_RETRIEVERS_MAPPING, \\\n    TABLE_EMBEDDINGS_MAPPING, RETRIEVER_TOOLS, USER_NAME, GLOBAL_CONFIG, update_global_config\nfrom backend.construct.build_all import build_chains_and_retrievers, load_embedding_models, update_retriever_tools\nfrom backend.types.global_config import GlobalConfig\nfrom logger import logger\nfrom ui.chat_page import chat_page\nfrom ui.home import render_home\nfrom ui.retrievers import render_retrievers\n\n\n# warnings.filterwarnings(\"ignore\", category=UserWarning)\n\ndef prepare_environment():\n    os.environ['TOKENIZERS_PARALLELISM'] = 'true'\n    os.environ[\"LANGCHAIN_TRACING_V2\"] = \"false\"\n    # os.environ[\"LANGCHAIN_API_KEY\"] = \"\"\n    os.environ[\"OPENAI_API_BASE\"] = st.secrets['OPENAI_API_BASE']\n    os.environ[\"OPENAI_API_KEY\"] = st.secrets['OPENAI_API_KEY']\n    os.environ[\"AUTH0_CLIENT_ID\"] = st.secrets['AUTH0_CLIENT_ID']\n    os.environ[\"AUTH0_DOMAIN\"] = st.secrets['AUTH0_DOMAIN']\n\n    update_global_config(GlobalConfig(\n        openai_api_base=st.secrets['OPENAI_API_BASE'],\n        openai_api_key=st.secrets['OPENAI_API_KEY'],\n        auth0_client_id=st.secrets['AUTH0_CLIENT_ID'],\n        auth0_domain=st.secrets['AUTH0_DOMAIN'],\n        myscale_user=st.secrets['MYSCALE_USER'],\n        myscale_password=st.secrets['MYSCALE_PASSWORD'],\n        myscale_host=st.secrets['MYSCALE_HOST'],\n        myscale_port=st.secrets['MYSCALE_PORT'],\n        query_model=\"gpt-3.5-turbo-0125\",\n        chat_model=\"gpt-3.5-turbo-0125\",\n        untrusted_api=st.secrets['UNSTRUCTURED_API'],\n        myscale_enable_https=st.secrets.get('MYSCALE_ENABLE_HTTPS', True),\n    ))\n\n\n# when refresh browser, all session keys will be cleaned.\ndef 
initialize_session_state():\n    if DATA_INITIALIZE_STATUS not in st.session_state:\n        st.session_state[DATA_INITIALIZE_STATUS] = DATA_INITIALIZE_NOT_STATED\n        logger.info(f\"Initialize session state key: {DATA_INITIALIZE_STATUS}\")\n    if JUMP_QUERY_ASK not in st.session_state:\n        st.session_state[JUMP_QUERY_ASK] = False\n        logger.info(f\"Initialize session state key: {JUMP_QUERY_ASK}\")\n\n\ndef initialize_chat_data():\n    if st.session_state[DATA_INITIALIZE_STATUS] != DATA_INITIALIZE_COMPLETED:\n        start_time = time.time()\n        st.session_state[DATA_INITIALIZE_STATUS] = DATA_INITIALIZE_STARTED\n        st.session_state[TABLE_EMBEDDINGS_MAPPING] = load_embedding_models()\n        st.session_state[CHAINS_RETRIEVERS_MAPPING] = build_chains_and_retrievers()\n        st.session_state[RETRIEVER_TOOLS] = update_retriever_tools()\n        # mark data initialization finished.\n        st.session_state[DATA_INITIALIZE_STATUS] = DATA_INITIALIZE_COMPLETED\n        end_time = time.time()\n        logger.info(f\"ChatData initialized finished in {round(end_time - start_time, 3)} seconds, \"\n                    f\"session state keys: {list(st.session_state.keys())}\")\n\n\nst.set_page_config(\n    page_title=\"ChatData\",\n    page_icon=\"https://myscale.com/favicon.ico\",\n    initial_sidebar_state=\"expanded\",\n    layout=\"wide\",\n)\n\nprepare_environment()\ninitialize_session_state()\ninitialize_chat_data()\n\nif USER_NAME in st.session_state:\n    chat_page()\nelse:\n    if st.session_state[JUMP_QUERY_ASK]:\n        render_retrievers()\n    else:\n        render_home()\n"
  },
  {
    "path": "app/backend/__init__.py",
    "content": ""
  },
  {
    "path": "app/backend/callbacks/__init__.py",
    "content": ""
  },
  {
    "path": "app/backend/callbacks/arxiv_callbacks.py",
    "content": "import json\nimport textwrap\nfrom typing import Dict, Any, List\n\nfrom langchain.callbacks.streamlit.streamlit_callback_handler import (\n    LLMThought,\n    StreamlitCallbackHandler,\n)\n\n\nclass LLMThoughtWithKnowledgeBase(LLMThought):\n    def on_tool_end(\n        self,\n        output: str,\n        color=None,\n        observation_prefix=None,\n        llm_prefix=None,\n        **kwargs: Any,\n    ) -> None:\n        try:\n            self._container.markdown(\n                \"\\n\\n\".join(\n                    [\"### Retrieved Documents:\"]\n                    + [\n                        f\"**{i+1}**: {textwrap.shorten(r['page_content'], width=80)}\"\n                        for i, r in enumerate(json.loads(output))\n                    ]\n                )\n            )\n        except Exception as e:\n            super().on_tool_end(output, color, observation_prefix, llm_prefix, **kwargs)\n\n\nclass ChatDataAgentCallBackHandler(StreamlitCallbackHandler):\n    def on_llm_start(\n        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any\n    ) -> None:\n        if self._current_thought is None:\n            self._current_thought = LLMThoughtWithKnowledgeBase(\n                parent_container=self._parent_container,\n                expanded=self._expand_new_thoughts,\n                collapse_on_complete=self._collapse_completed_thoughts,\n                labeler=self._thought_labeler,\n            )\n\n        self._current_thought.on_llm_start(serialized, prompts)\n"
  },
  {
    "path": "app/backend/callbacks/llm_thought_with_table.py",
    "content": "from typing import Any, Dict, List\n\nimport streamlit as st\nfrom langchain_core.outputs import LLMResult\nfrom streamlit.external.langchain import StreamlitCallbackHandler\n\n\nclass ChatDataSelfQueryCallBack(StreamlitCallbackHandler):\n    def __init__(self):\n        super().__init__(st.container())\n        self._current_thought = None\n        self.progress_bar = st.progress(value=0.0, text=\"Executing ChatData SelfQuery CallBack...\")\n\n    def on_llm_start(\n            self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any\n    ) -> None:\n        self.progress_bar.progress(value=0.35, text=\"Communicate with LLM...\")\n        pass\n\n    def on_chain_end(self, outputs, **kwargs) -> None:\n        if len(kwargs['tags']) == 0:\n            self.progress_bar.progress(value=0.75, text=\"Searching in DB...\")\n\n    def on_chain_start(self, serialized, inputs, **kwargs) -> None:\n\n        pass\n\n    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:\n        st.markdown(\"### Generate filter by LLM \\n\"\n                    \"> Here we get `query_constructor` results \\n\\n\")\n\n        self.progress_bar.progress(value=0.5, text=\"Generate filter by LLM...\")\n        for item in response.generations:\n            st.markdown(f\"{item[0].text}\")\n\n        pass\n"
  },
  {
    "path": "app/backend/callbacks/self_query_callbacks.py",
    "content": "from typing import Dict, Any, List\n\nimport streamlit as st\nfrom langchain.callbacks.streamlit.streamlit_callback_handler import (\n    StreamlitCallbackHandler,\n)\nfrom langchain.schema.output import LLMResult\n\n\nclass CustomSelfQueryRetrieverCallBackHandler(StreamlitCallbackHandler):\n    def __init__(self):\n        super().__init__(st.container())\n        self._current_thought = None\n        self.progress_bar = st.progress(value=0.0, text=\"Executing ChatData SelfQuery...\")\n\n    def on_llm_start(\n            self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any\n    ) -> None:\n        self.progress_bar.progress(value=0.35, text=\"Communicate with LLM...\")\n        pass\n\n    def on_chain_end(self, outputs, **kwargs) -> None:\n        if len(kwargs['tags']) == 0:\n            self.progress_bar.progress(value=0.75, text=\"Searching in DB...\")\n        pass\n\n    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:\n        st.markdown(\"### Generate filter by LLM \\n\"\n                    \"> Here we get `query_constructor` results \\n\\n\")\n        self.progress_bar.progress(value=0.5, text=\"Generate filter by LLM...\")\n        for item in response.generations:\n            st.markdown(f\"{item[0].text}\")\n        pass\n\n\nclass ChatDataSelfAskCallBackHandler(StreamlitCallbackHandler):\n    def __init__(self) -> None:\n        super().__init__(st.container())\n        self.progress_bar = st.progress(value=0.2, text=\"Executing ChatData SelfQuery Chain...\")\n\n    def on_llm_start(self, serialized, prompts, **kwargs) -> None:\n        pass\n\n    def on_llm_end(self, response: LLMResult, **kwargs: Any) -> None:\n\n        if len(kwargs['tags']) != 0:\n            self.progress_bar.progress(value=0.5, text=\"We got filter info from LLM...\")\n            st.markdown(\"### Generate filter by LLM \\n\"\n                        \"> Here we get `query_constructor` results \\n\\n\")\n            for item 
in response.generations:\n                st.markdown(f\"{item[0].text}\")\n        pass\n\n    def on_chain_start(self, serialized, inputs, **kwargs) -> None:\n        cid = \".\".join(serialized[\"id\"])\n        if cid.endswith(\".CustomStuffDocumentChain\"):\n            self.progress_bar.progress(value=0.7, text=\"Asking LLM with related documents...\")\n"
  },
  {
    "path": "app/backend/callbacks/vector_sql_callbacks.py",
    "content": "import streamlit as st\nfrom langchain.callbacks.streamlit.streamlit_callback_handler import (\n    StreamlitCallbackHandler,\n)\nfrom langchain.schema.output import LLMResult\nfrom sql_formatter.core import format_sql\n\n\nclass VectorSQLSearchDBCallBackHandler(StreamlitCallbackHandler):\n    def __init__(self) -> None:\n        self.progress_bar = st.progress(value=0.0, text=\"Writing SQL...\")\n        self.status_bar = st.empty()\n        self.prog_value = 0\n        self.prog_interval = 0.2\n\n    def on_llm_start(self, serialized, prompts, **kwargs) -> None:\n        pass\n\n    def on_llm_end(\n            self,\n            response: LLMResult,\n            *args,\n            **kwargs,\n    ):\n        text = response.generations[0][0].text\n        if text.replace(\" \", \"\").upper().startswith(\"SELECT\"):\n            st.markdown(\"### Generated Vector Search SQL Statement \\n\"\n                        \"> This sql statement is generated by LLM \\n\\n\")\n            st.markdown(f\"\"\"```sql\\n{format_sql(text, max_len=80)}\\n```\"\"\")\n            self.prog_value += self.prog_interval\n            self.progress_bar.progress(\n                value=self.prog_value, text=\"Searching in DB...\")\n\n    def on_chain_start(self, serialized, inputs, **kwargs) -> None:\n        cid = \".\".join(serialized[\"id\"])\n        self.prog_value += self.prog_interval\n        self.progress_bar.progress(\n            value=self.prog_value, text=f\"Running Chain `{cid}`...\"\n        )\n\n    def on_chain_end(self, outputs, **kwargs) -> None:\n        pass\n\n\nclass VectorSQLSearchLLMCallBackHandler(VectorSQLSearchDBCallBackHandler):\n    def __init__(self, table: str) -> None:\n        self.progress_bar = st.progress(value=0.0, text=\"Writing SQL...\")\n        self.status_bar = st.empty()\n        self.prog_value = 0\n        self.prog_interval = 0.1\n        self.table = table\n\n\n"
  },
  {
    "path": "app/backend/chains/__init__.py",
    "content": ""
  },
  {
    "path": "app/backend/chains/retrieval_qa_with_sources.py",
    "content": "import inspect\nfrom typing import Dict, Any, Optional, List\n\nfrom langchain.callbacks.manager import (\n    AsyncCallbackManagerForChainRun,\n    CallbackManagerForChainRun,\n)\nfrom langchain.chains.qa_with_sources.retrieval import RetrievalQAWithSourcesChain\nfrom langchain.docstore.document import Document\n\nfrom logger import logger\n\n\nclass CustomRetrievalQAWithSourcesChain(RetrievalQAWithSourcesChain):\n    \"\"\"QA with source chain for Chat ArXiv app with references\n\n    This chain will automatically assign reference number to the article,\n    Then parse it back to titles or anything else.\n    \"\"\"\n\n    def _call(\n        self,\n        inputs: Dict[str, Any],\n        run_manager: Optional[CallbackManagerForChainRun] = None,\n    ) -> Dict[str, str]:\n        logger.info(f\"\\033[91m\\033[1m{self._chain_type}\\033[0m\")\n        _run_manager = run_manager or CallbackManagerForChainRun.get_noop_manager()\n        accepts_run_manager = (\n            \"run_manager\" in inspect.signature(self._get_docs).parameters\n        )\n        if accepts_run_manager:\n            docs: List[Document] = self._get_docs(inputs, run_manager=_run_manager)\n        else:\n            docs: List[Document] = self._get_docs(inputs)  # type: ignore[call-arg]\n\n        answer = self.combine_documents_chain.run(\n            input_documents=docs, callbacks=_run_manager.get_child(), **inputs\n        )\n        # parse source with ref_id\n        sources = []\n        ref_cnt = 1\n        for d in docs:\n            ref_id = d.metadata['ref_id']\n            if f\"Doc #{ref_id}\" in answer:\n                answer = answer.replace(f\"Doc #{ref_id}\", f\"#{ref_id}\")\n            if f\"#{ref_id}\" in answer:\n                title = d.metadata['title'].replace('\\n', '')\n                d.metadata['ref_id'] = ref_cnt\n                answer = answer.replace(f\"#{ref_id}\", f\"{title} [{ref_cnt}]\")\n                sources.append(d)\n                
ref_cnt += 1\n\n        result: Dict[str, Any] = {\n            self.answer_key: answer,\n            self.sources_answer_key: sources,\n        }\n        if self.return_source_documents:\n            result[\"source_documents\"] = docs\n        return result\n\n    async def _acall(\n        self,\n        inputs: Dict[str, Any],\n        run_manager: Optional[AsyncCallbackManagerForChainRun] = None,\n    ) -> Dict[str, Any]:\n        raise NotImplementedError\n\n    @property\n    def _chain_type(self) -> str:\n        return \"custom_retrieval_qa_with_sources_chain\"\n"
  },
  {
    "path": "app/backend/chains/stuff_documents.py",
    "content": "from typing import Any, List, Tuple\n\nfrom langchain.callbacks.manager import Callbacks\nfrom langchain.chains.combine_documents.stuff import StuffDocumentsChain\nfrom langchain.docstore.document import Document\nfrom langchain.schema.prompt_template import format_document\n\n\nclass CustomStuffDocumentChain(StuffDocumentsChain):\n    \"\"\"Combine arxiv documents with PDF reference number\"\"\"\n\n    def _get_inputs(self, docs: List[Document], **kwargs: Any) -> dict:\n        \"\"\"Construct inputs from kwargs and docs.\n\n        Format and the join all the documents together into one input with name\n        `self.document_variable_name`. The pluck any additional variables\n        from **kwargs.\n\n        Args:\n            docs: List of documents to format and then join into single input\n            **kwargs: additional inputs to chain, will pluck any other required\n                arguments from here.\n\n        Returns:\n            dictionary of inputs to LLMChain\n        \"\"\"\n        # Format each document according to the prompt\n        doc_strings = []\n        for doc_id, doc in enumerate(docs):\n            # add temp reference number in metadata\n            doc.metadata.update({'ref_id': doc_id})\n            doc.page_content = doc.page_content.replace('\\n', ' ')\n            doc_strings.append(format_document(doc, self.document_prompt))\n        # Join the documents together to put them in the prompt.\n        inputs = {\n            k: v\n            for k, v in kwargs.items()\n            if k in self.llm_chain.prompt.input_variables\n        }\n        inputs[self.document_variable_name] = self.document_separator.join(\n            doc_strings)\n        return inputs\n\n    def combine_docs(\n            self, docs: List[Document], callbacks: Callbacks = None, **kwargs: Any\n    ) -> Tuple[str, dict]:\n        \"\"\"Stuff all documents into one prompt and pass to LLM.\n\n        Args:\n            docs: List of 
documents to join together into one variable\n            callbacks: Optional callbacks to pass along\n            **kwargs: additional parameters to use to get inputs to LLMChain.\n\n        Returns:\n            The first element returned is the single string output. The second\n            element returned is a dictionary of other keys to return.\n        \"\"\"\n        inputs = self._get_inputs(docs, **kwargs)\n        # Call predict on the LLM.\n        output = self.llm_chain.predict(callbacks=callbacks, **inputs)\n        return output, {}\n\n    @property\n    def _chain_type(self) -> str:\n        return \"custom_stuff_document_chain\"\n"
  },
  {
    "path": "app/backend/chat_bot/__init__.py",
    "content": ""
  },
  {
    "path": "app/backend/chat_bot/chat.py",
    "content": "import time\n\nfrom os import environ\nfrom time import sleep\nimport streamlit as st\n\nfrom backend.constants.prompts import DEFAULT_SYSTEM_PROMPT\nfrom backend.constants.streamlit_keys import CHAT_KNOWLEDGE_TABLE, CHAT_SESSION_MANAGER, \\\n    CHAT_CURRENT_USER_SESSIONS, EL_SESSION_SELECTOR, USER_PRIVATE_FILES, \\\n    EL_BUILD_KB_WITH_FILES, \\\n    EL_PERSONAL_KB_NAME, EL_PERSONAL_KB_DESCRIPTION, \\\n    USER_PERSONAL_KNOWLEDGE_BASES, AVAILABLE_RETRIEVAL_TOOLS, EL_PERSONAL_KB_NEEDS_REMOVE, \\\n    EL_UPLOAD_FILES_STATUS, EL_SELECTED_KBS, EL_UPLOAD_FILES\nfrom backend.constants.variables import USER_INFO, USER_NAME, JUMP_QUERY_ASK, RETRIEVER_TOOLS\nfrom backend.construct.build_agents import build_agents\nfrom backend.chat_bot.session_manager import SessionManager\nfrom backend.callbacks.arxiv_callbacks import ChatDataAgentCallBackHandler\n\nfrom logger import logger\n\nenviron[\"OPENAI_API_BASE\"] = st.secrets[\"OPENAI_API_BASE\"]\n\nTOOL_NAMES = {\n    \"langchain_retriever_tool\": \"Self-querying retriever\",\n    \"vecsql_retriever_tool\": \"Vector SQL\",\n}\n\n\ndef on_chat_submit():\n    with st.session_state.next_round.container():\n        with st.chat_message(\"user\"):\n            st.write(st.session_state.chat_input)\n        with st.chat_message(\"assistant\"):\n            container = st.container()\n        st_callback = ChatDataAgentCallBackHandler(\n            container, collapse_completed_thoughts=False\n        )\n        ret = st.session_state.agent(\n            {\"input\": st.session_state.chat_input}, callbacks=[st_callback]\n        )\n        logger.info(f\"ret:{ret}\")\n\n\ndef clear_history():\n    if \"agent\" in st.session_state:\n        st.session_state.agent.memory.clear()\n\n\ndef back_to_main():\n    if USER_INFO in st.session_state:\n        del st.session_state[USER_INFO]\n    if USER_NAME in st.session_state:\n        del st.session_state[USER_NAME]\n    if JUMP_QUERY_ASK in st.session_state:\n        del 
st.session_state[JUMP_QUERY_ASK]\n    if EL_SESSION_SELECTOR in st.session_state:\n        del st.session_state[EL_SESSION_SELECTOR]\n    if CHAT_CURRENT_USER_SESSIONS in st.session_state:\n        del st.session_state[CHAT_CURRENT_USER_SESSIONS]\n\n\ndef refresh_sessions():\n    chat_session_manager: SessionManager = st.session_state[CHAT_SESSION_MANAGER]\n    current_user_name = st.session_state[USER_NAME]\n    current_user_sessions = chat_session_manager.list_sessions(current_user_name)\n\n    # `list_sessions` returns a list; create a default session only when it is empty.\n    if not isinstance(current_user_sessions, list) or not current_user_sessions:\n        # generate a default session for current user.\n        chat_session_manager.add_session(\n            user_id=current_user_name,\n            session_id=f\"{current_user_name}?default\",\n            system_prompt=DEFAULT_SYSTEM_PROMPT,\n        )\n        st.session_state[CHAT_CURRENT_USER_SESSIONS] = chat_session_manager.list_sessions(current_user_name)\n        current_user_sessions = st.session_state[CHAT_CURRENT_USER_SESSIONS]\n    else:\n        st.session_state[CHAT_CURRENT_USER_SESSIONS] = current_user_sessions\n\n    # load current user files.\n    st.session_state[USER_PRIVATE_FILES] = st.session_state[CHAT_KNOWLEDGE_TABLE].list_files(\n        current_user_name\n    )\n    # load current user private knowledge bases.\n    st.session_state[USER_PERSONAL_KNOWLEDGE_BASES] = \\\n        st.session_state[CHAT_KNOWLEDGE_TABLE].list_private_knowledge_bases(current_user_name)\n    logger.info(f\"current user name: {current_user_name}, \"\n                f\"user private knowledge bases: {st.session_state[USER_PERSONAL_KNOWLEDGE_BASES]}, \"\n                f\"user private files: {st.session_state[USER_PRIVATE_FILES]}\")\n    st.session_state[AVAILABLE_RETRIEVAL_TOOLS] = {\n        # public retrieval tools\n        **st.session_state[RETRIEVER_TOOLS],\n        # private retrieval tools\n        **st.session_state[CHAT_KNOWLEDGE_TABLE].as_retrieval_tools(current_user_name),\n    }\n    
# print(f\"sel_session is {st.session_state.sel_session}, current_user_sessions is {current_user_sessions}\")\n    logger.info(f\"current_user_sessions is {current_user_sessions}\")\n    st.session_state[EL_SESSION_SELECTOR] = current_user_sessions[0]\n\n\n# process for session add and delete.\ndef on_session_change_submit():\n    if \"session_manager\" in st.session_state and \"session_editor\" in st.session_state:\n        try:\n            for elem in st.session_state.session_editor[\"added_rows\"]:\n                if len(elem) > 0 and \"system_prompt\" in elem and \"session_id\" in elem:\n                    if elem[\"session_id\"] != \"\" and \"?\" not in elem[\"session_id\"]:\n                        st.session_state.session_manager.add_session(\n                            user_id=st.session_state.user_name,\n                            session_id=f\"{st.session_state.user_name}?{elem['session_id']}\",\n                            system_prompt=elem[\"system_prompt\"],\n                        )\n                    else:\n                        st.toast(\"`session_id` must be non-empty and must not contain the character `?`.\", icon=\"❌\")\n                        raise KeyError(\n                            \"`session_id` must be non-empty and must not contain the character `?`.\"\n                        )\n                else:\n                    st.toast(\"You should fill in both `session_id` and `system_prompt` to add a row!\", icon=\"❌\")\n                    raise KeyError(\n                        \"You should fill in both `session_id` and `system_prompt` to add a row!\"\n                    )\n            for elem in st.session_state.session_editor[\"deleted_rows\"]:\n                user_name = st.session_state[USER_NAME]\n                session_id = st.session_state[CHAT_CURRENT_USER_SESSIONS][elem]['session_id']\n                user_with_session_id = f\"{user_name}?{session_id}\"\n                
st.session_state.session_manager.remove_session(session_id=user_with_session_id)\n                st.toast(f\"session `{user_with_session_id}` removed.\", icon=\"✅\")\n\n            refresh_sessions()\n        except Exception as e:\n            sleep(2)\n            st.error(f\"{type(e)}: {str(e)}\")\n        finally:\n            st.session_state.session_editor[\"added_rows\"] = []\n            st.session_state.session_editor[\"deleted_rows\"] = []\n        refresh_agent()\n\n\ndef create_private_knowledge_base_as_tool():\n    current_user_name = st.session_state[USER_NAME]\n\n    if (\n            EL_PERSONAL_KB_NAME in st.session_state\n            and EL_PERSONAL_KB_DESCRIPTION in st.session_state\n            and EL_BUILD_KB_WITH_FILES in st.session_state\n            and len(st.session_state[EL_PERSONAL_KB_NAME]) > 0\n            and len(st.session_state[EL_PERSONAL_KB_DESCRIPTION]) > 0\n            and len(st.session_state[EL_BUILD_KB_WITH_FILES]) > 0\n    ):\n        st.session_state[CHAT_KNOWLEDGE_TABLE].create_private_knowledge_base(\n            user_id=current_user_name,\n            tool_name=st.session_state[EL_PERSONAL_KB_NAME],\n            tool_description=st.session_state[EL_PERSONAL_KB_DESCRIPTION],\n            files=[f[\"file_name\"] for f in st.session_state[EL_BUILD_KB_WITH_FILES]],\n        )\n        refresh_sessions()\n    else:\n        st.session_state[EL_UPLOAD_FILES_STATUS].error(\n            \"You should fill all fields to build up a tool!\"\n        )\n        sleep(2)\n\n\ndef remove_private_knowledge_bases():\n    if EL_PERSONAL_KB_NEEDS_REMOVE in st.session_state and st.session_state[EL_PERSONAL_KB_NEEDS_REMOVE]:\n        private_knowledge_bases_needs_remove = st.session_state[EL_PERSONAL_KB_NEEDS_REMOVE]\n        private_knowledge_base_names = [item[\"tool_name\"] for item in private_knowledge_bases_needs_remove]\n        # remove these private knowledge bases.\n        
st.session_state[CHAT_KNOWLEDGE_TABLE].remove_private_knowledge_bases(\n            user_id=st.session_state[USER_NAME],\n            private_knowledge_bases=private_knowledge_base_names\n        )\n        refresh_sessions()\n    else:\n        st.session_state[EL_UPLOAD_FILES_STATUS].error(\n            \"You should specify at least one private knowledge base to delete!\"\n        )\n        time.sleep(2)\n\n\ndef refresh_agent():\n    with st.spinner(\"Initializing session...\"):\n        user_name = st.session_state[USER_NAME]\n        session_id = st.session_state[EL_SESSION_SELECTOR]['session_id']\n        user_with_session_id = f\"{user_name}?{session_id}\"\n\n        if EL_SELECTED_KBS in st.session_state:\n            selected_knowledge_bases = st.session_state[EL_SELECTED_KBS]\n        else:\n            selected_knowledge_bases = [\"Wikipedia + Vector SQL\"]\n\n        logger.info(f\"selected_knowledge_bases: {selected_knowledge_bases}\")\n        if EL_SESSION_SELECTOR in st.session_state:\n            system_prompt = st.session_state[EL_SESSION_SELECTOR][\"system_prompt\"]\n        else:\n            system_prompt = DEFAULT_SYSTEM_PROMPT\n\n        st.session_state[\"agent\"] = build_agents(\n            session_id=user_with_session_id,\n            tool_names=selected_knowledge_bases,\n            system_prompt=system_prompt\n        )\n\n\ndef add_file():\n    user_name = st.session_state[USER_NAME]\n    if EL_UPLOAD_FILES not in st.session_state or len(st.session_state[EL_UPLOAD_FILES]) == 0:\n        st.session_state[EL_UPLOAD_FILES_STATUS].error(\"Please upload files!\", icon=\"⚠️\")\n        sleep(2)\n        return\n    try:\n        st.session_state[EL_UPLOAD_FILES_STATUS].info(\"Uploading...\")\n        st.session_state[CHAT_KNOWLEDGE_TABLE].add_by_file(\n            user_id=user_name,\n            files=st.session_state[EL_UPLOAD_FILES]\n        )\n        refresh_sessions()\n    except ValueError as e:\n        
st.session_state[EL_UPLOAD_FILES_STATUS].error(\"Failed to upload! \" + str(e))\n        sleep(2)\n\n\ndef clear_files():\n    st.session_state[CHAT_KNOWLEDGE_TABLE].clear(user_id=st.session_state[USER_NAME])\n    refresh_sessions()\n"
  },
  {
    "path": "app/backend/chat_bot/json_decoder.py",
    "content": "import json\nimport datetime\n\n\nclass CustomJSONEncoder(json.JSONEncoder):\n    def default(self, obj):\n        if isinstance(obj, datetime.datetime):\n            return obj.isoformat()\n        return json.JSONEncoder.default(self, obj)\n\n\nclass CustomJSONDecoder(json.JSONDecoder):\n    def __init__(self, *args, **kwargs):\n        json.JSONDecoder.__init__(\n            self, object_hook=self.object_hook, *args, **kwargs)\n\n    def object_hook(self, source):\n        for k, v in source.items():\n            if isinstance(v, str):\n                try:\n                    source[k] = datetime.datetime.fromisoformat(v)\n                except ValueError:\n                    # not an ISO-8601 timestamp; keep the original string\n                    pass\n        return source\n"
  },
  {
    "path": "app/backend/chat_bot/message_converter.py",
    "content": "import hashlib\nimport json\nimport time\nfrom typing import Any\n\nfrom langchain.memory.chat_message_histories.sql import DefaultMessageConverter\nfrom langchain.schema import BaseMessage, HumanMessage, AIMessage, SystemMessage, ChatMessage, FunctionMessage\nfrom langchain.schema.messages import ToolMessage\nfrom sqlalchemy.orm import declarative_base\n\nfrom backend.chat_bot.tools import create_message_history_table\n\n\ndef _message_from_dict(message: dict) -> BaseMessage:\n    _type = message[\"type\"]\n    if _type == \"human\":\n        return HumanMessage(**message[\"data\"])\n    elif _type == \"ai\":\n        return AIMessage(**message[\"data\"])\n    elif _type == \"system\":\n        return SystemMessage(**message[\"data\"])\n    elif _type == \"chat\":\n        return ChatMessage(**message[\"data\"])\n    elif _type == \"function\":\n        return FunctionMessage(**message[\"data\"])\n    elif _type == \"tool\":\n        return ToolMessage(**message[\"data\"])\n    elif _type == \"AIMessageChunk\":\n        message[\"data\"][\"type\"] = \"ai\"\n        return AIMessage(**message[\"data\"])\n    else:\n        raise ValueError(f\"Got unexpected message type: {_type}\")\n\n\nclass DefaultClickhouseMessageConverter(DefaultMessageConverter):\n    \"\"\"The default message converter for SQLChatMessageHistory.\"\"\"\n\n    def __init__(self, table_name: str):\n        super().__init__(table_name)\n        self.model_class = create_message_history_table(table_name, declarative_base())\n\n    def to_sql_model(self, message: BaseMessage, session_id: str) -> Any:\n        time_stamp = time.time()\n        msg_id = hashlib.sha256(\n            f\"{session_id}_{message}_{time_stamp}\".encode('utf-8')).hexdigest()\n        user_id, _ = session_id.split(\"?\")\n        return self.model_class(\n            id=time_stamp,\n            msg_id=msg_id,\n            user_id=user_id,\n            session_id=session_id,\n            type=message.type,\n    
        addtionals=json.dumps(message.additional_kwargs),\n            message=json.dumps({\n                \"type\": message.type,\n                \"additional_kwargs\": {\"timestamp\": time_stamp},\n                \"data\": message.dict()})\n        )\n\n    def from_sql_model(self, sql_message: Any) -> BaseMessage:\n        msg_dump = json.loads(sql_message.message)\n        msg = _message_from_dict(msg_dump)\n        msg.additional_kwargs = msg_dump[\"additional_kwargs\"]\n        return msg\n\n    def get_sql_model_class(self) -> Any:\n        return self.model_class\n"
  },
  {
    "path": "app/backend/chat_bot/private_knowledge_base.py",
    "content": "import hashlib\nfrom datetime import datetime\nfrom typing import List, Optional\n\nimport pandas as pd\nfrom clickhouse_connect import get_client\nfrom langchain.schema.embeddings import Embeddings\nfrom langchain.vectorstores.myscale import MyScaleWithoutJSON, MyScaleSettings\nfrom streamlit.runtime.uploaded_file_manager import UploadedFile\n\nfrom backend.chat_bot.tools import parse_files, extract_embedding\nfrom backend.construct.build_retriever_tool import create_retriever_tool\nfrom logger import logger\n\n\nclass ChatBotKnowledgeTable:\n    def __init__(self, host, port, username, password,\n                 embedding: Embeddings, parser_api_key: str, db=\"chat\",\n                 kb_table=\"private_kb\", tool_table=\"private_tool\") -> None:\n        super().__init__()\n        personal_files_schema_ = f\"\"\"\n            CREATE TABLE IF NOT EXISTS {db}.{kb_table}(\n                entity_id String,\n                file_name String,\n                text String,\n                user_id String,\n                created_by DateTime,\n                vector Array(Float32),\n                CONSTRAINT cons_vec_len CHECK length(vector) = 768,\n                VECTOR INDEX vidx vector TYPE MSTG('metric_type=Cosine')\n            ) ENGINE = ReplacingMergeTree ORDER BY entity_id\n        \"\"\"\n\n        # `tool_name` represent private knowledge database name.\n        private_knowledge_base_schema_ = f\"\"\"\n            CREATE TABLE IF NOT EXISTS {db}.{tool_table}(\n                tool_id String,\n                tool_name String,\n                file_names Array(String),\n                user_id String,\n                created_by DateTime,\n                tool_description String\n            ) ENGINE = ReplacingMergeTree ORDER BY tool_id\n        \"\"\"\n        self.personal_files_table = kb_table\n        self.private_knowledge_base_table = tool_table\n        config = MyScaleSettings(\n            host=host,\n            port=port,\n  
          username=username,\n            password=password,\n            database=db,\n            table=kb_table,\n        )\n        self.client = get_client(\n            host=config.host,\n            port=config.port,\n            username=config.username,\n            password=config.password,\n        )\n        self.client.command(\"SET allow_experimental_object_type=1\")\n        self.client.command(personal_files_schema_)\n        self.client.command(private_knowledge_base_schema_)\n        self.parser_api_key = parser_api_key\n        self.vector_store = MyScaleWithoutJSON(\n            embedding=embedding,\n            config=config,\n            must_have_cols=[\"file_name\", \"text\", \"created_by\"],\n        )\n\n    # List all files with given `user_id`\n    def list_files(self, user_id: str):\n        query = f\"\"\"\n        SELECT DISTINCT file_name, COUNT(entity_id) AS num_paragraph, \n            arrayMax(arrayMap(x->length(x), groupArray(text))) AS max_chars\n        FROM {self.vector_store.config.database}.{self.personal_files_table}\n        WHERE user_id = '{user_id}' GROUP BY file_name\n        \"\"\"\n        return [r for r in self.vector_store.client.query(query).named_results()]\n\n    # Parse and embedding files\n    def add_by_file(self, user_id, files: List[UploadedFile]):\n        data = parse_files(self.parser_api_key, user_id, files)\n        data = extract_embedding(self.vector_store.embeddings, data)\n        self.vector_store.client.insert_df(\n            table=self.personal_files_table,\n            df=pd.DataFrame(data),\n            database=self.vector_store.config.database,\n        )\n\n    # Remove all files and private_knowledge_bases with given `user_id`\n    def clear(self, user_id: str):\n        self.vector_store.client.command(\n            f\"DELETE FROM {self.vector_store.config.database}.{self.personal_files_table} \"\n            f\"WHERE user_id='{user_id}'\"\n        )\n        query = f\"\"\"DELETE FROM 
{self.vector_store.config.database}.{self.private_knowledge_base_table} \n                    WHERE user_id  = '{user_id}'\"\"\"\n        self.vector_store.client.command(query)\n\n    def create_private_knowledge_base(\n            self, user_id: str, tool_name: str, tool_description: str, files: Optional[List[str]] = None\n    ):\n        self.vector_store.client.insert_df(\n            self.private_knowledge_base_table,\n            pd.DataFrame(\n                [\n                    {\n                        \"tool_id\": hashlib.sha256(\n                            (user_id + tool_name).encode(\"utf-8\")\n                        ).hexdigest(),\n                        \"tool_name\": tool_name,  # tool_name represent user's private knowledge base.\n                        \"file_names\": files,\n                        \"user_id\": user_id,\n                        \"created_by\": datetime.now(),\n                        \"tool_description\": tool_description,\n                    }\n                ]\n            ),\n            database=self.vector_store.config.database,\n        )\n\n    # Show all private knowledge bases with given `user_id`\n    def list_private_knowledge_bases(self, user_id: str, private_knowledge_base=None):\n        extended_where = f\"AND tool_name = '{private_knowledge_base}'\" if private_knowledge_base else \"\"\n        query = f\"\"\"\n        SELECT tool_name, tool_description, length(file_names) \n        FROM {self.vector_store.config.database}.{self.private_knowledge_base_table}\n        WHERE user_id = '{user_id}' {extended_where}\n        \"\"\"\n        return [r for r in self.vector_store.client.query(query).named_results()]\n\n    def remove_private_knowledge_bases(self, user_id: str, private_knowledge_bases: List[str]):\n        unique_list = list(set(private_knowledge_bases))\n        unique_list = \",\".join([f\"'{t}'\" for t in unique_list])\n        query = f\"\"\"DELETE FROM 
{self.vector_store.config.database}.{self.private_knowledge_base_table}\n                    WHERE user_id  = '{user_id}' AND tool_name IN [{unique_list}]\"\"\"\n        self.vector_store.client.command(query)\n\n    def as_retrieval_tools(self, user_id, tool_name=None):\n        logger.info(f\"building retrieval tools for user_id {user_id}\")\n        private_knowledge_bases = self.list_private_knowledge_bases(user_id=user_id, private_knowledge_base=tool_name)\n        retrievers = {}\n        for private_kb in private_knowledge_bases:\n            # query the configured database and table instead of a hard-coded `chat.private_tool`\n            file_names_sql = f\"\"\"\n            SELECT arrayJoin(file_names) FROM (\n                SELECT file_names \n                FROM {self.vector_store.config.database}.{self.private_knowledge_base_table}\n                WHERE user_id = '{user_id}' AND tool_name = '{private_kb[\"tool_name\"]}'\n            )\n            \"\"\"\n            logger.info(f\"user_id is {user_id}, file_names_sql is {file_names_sql}\")\n            res = self.client.query(file_names_sql)\n            file_names = []\n            for line in res.result_rows:\n                file_names.append(line[0])\n            file_names = ', '.join(f\"'{item}'\" for item in file_names)\n            logger.info(f\"user_id is {user_id}, file_names is {file_names}\")\n            retrievers[private_kb[\"tool_name\"]] = create_retriever_tool(\n                self.vector_store.as_retriever(\n                    search_kwargs={\"where_str\": f\"user_id='{user_id}' AND file_name IN ({file_names})\"},\n                ),\n                tool_name=private_kb[\"tool_name\"],\n                description=private_kb[\"tool_description\"],\n            )\n        return retrievers\n\n"
  },
  {
    "path": "app/backend/chat_bot/session_manager.py",
    "content": "import json\n\nfrom backend.chat_bot.tools import create_session_table, create_message_history_table\nfrom backend.constants.variables import GLOBAL_CONFIG\n\ntry:\n    from sqlalchemy.orm import declarative_base\nexcept ImportError:\n    from sqlalchemy.ext.declarative import declarative_base\nfrom datetime import datetime\nfrom sqlalchemy import orm, create_engine\nfrom logger import logger\n\n\ndef get_sessions(engine, model_class, user_id):\n    with orm.sessionmaker(engine)() as session:\n        result = (\n            session.query(model_class)\n            .where(\n                model_class.session_id == user_id\n            )\n            .order_by(model_class.create_by.desc())\n        )\n    return json.loads(result)\n\n\nclass SessionManager:\n    def __init__(\n            self,\n            session_state,\n            host,\n            port,\n            username,\n            password,\n            db='chat',\n            session_table='sessions',\n            msg_table='chat_memory'\n    ) -> None:\n        if GLOBAL_CONFIG.myscale_enable_https == False:\n            conn_str = f'clickhouse://{username}:{password}@{host}:{port}/{db}?protocol=http'\n        else:\n            conn_str = f'clickhouse://{username}:{password}@{host}:{port}/{db}?protocol=https'\n        self.engine = create_engine(conn_str, echo=False)\n        self.session_model_class = create_session_table(\n            session_table, declarative_base())\n        self.session_model_class.metadata.create_all(self.engine)\n        self.msg_model_class = create_message_history_table(msg_table, declarative_base())\n        self.msg_model_class.metadata.create_all(self.engine)\n        self.session_orm = orm.sessionmaker(self.engine)\n        self.session_state = session_state\n\n    def list_sessions(self, user_id: str):\n        with self.session_orm() as session:\n            result = (\n                session.query(self.session_model_class)\n                .where(\n 
                   self.session_model_class.user_id == user_id\n                )\n                .order_by(self.session_model_class.create_by.desc())\n            )\n            sessions = []\n            for r in result:\n                sessions.append({\n                    \"session_id\": r.session_id.split(\"?\")[-1],\n                    \"system_prompt\": r.system_prompt,\n                })\n            return sessions\n\n    # Update sys_prompt with given session_id\n    def modify_system_prompt(self, session_id, sys_prompt):\n        with self.session_orm() as session:\n            obj = session.query(self.session_model_class).where(\n                self.session_model_class.session_id == session_id).first()\n            if obj:\n                obj.system_prompt = sys_prompt\n                session.commit()\n            else:\n                logger.warning(f\"Session {session_id} not found\")\n\n    # Add a session(session_id, sys_prompt)\n    def add_session(self, user_id: str, session_id: str, system_prompt: str, **kwargs):\n        with self.session_orm() as session:\n            elem = self.session_model_class(\n                user_id=user_id, session_id=session_id, system_prompt=system_prompt,\n                create_by=datetime.now(), additionals=json.dumps(kwargs)\n            )\n            session.add(elem)\n            session.commit()\n\n    # Remove a session and related chat history.\n    def remove_session(self, session_id: str):\n        with self.session_orm() as session:\n            # remove session\n            session.query(self.session_model_class).where(self.session_model_class.session_id == session_id).delete()\n            # remove related chat history.\n            session.query(self.msg_model_class).where(self.msg_model_class.session_id == session_id).delete()\n"
  },
  {
    "path": "app/backend/chat_bot/tools.py",
    "content": "import hashlib\nfrom datetime import datetime\nfrom multiprocessing.pool import ThreadPool\nfrom typing import List\n\nimport requests\nfrom clickhouse_sqlalchemy import types, engines\nfrom langchain.schema.embeddings import Embeddings\nfrom sqlalchemy import Column, Text\nfrom streamlit.runtime.uploaded_file_manager import UploadedFile\n\n\ndef parse_files(api_key, user_id, files: List[UploadedFile]):\n    def parse_file(file: UploadedFile):\n        headers = {\n            \"accept\": \"application/json\",\n            \"unstructured-api-key\": api_key,\n        }\n        data = {\"strategy\": \"auto\", \"ocr_languages\": [\"eng\"]}\n        file_hash = hashlib.sha256(file.read()).hexdigest()\n        file_data = {\"files\": (file.name, file.getvalue(), file.type)}\n        response = requests.post(\n            url=\"https://api.unstructured.io/general/v0/general\",\n            headers=headers,\n            data=data,\n            files=file_data\n        )\n        json_response = response.json()\n        if response.status_code != 200:\n            raise ValueError(str(json_response))\n        texts = [\n            {\n                \"text\": t[\"text\"],\n                \"file_name\": t[\"metadata\"][\"filename\"],\n                \"entity_id\": hashlib.sha256(\n                    (file_hash + t[\"text\"]).encode()\n                ).hexdigest(),\n                \"user_id\": user_id,\n                \"created_by\": datetime.now(),\n            }\n            for t in json_response\n            if t[\"type\"] == \"NarrativeText\" and len(t[\"text\"].split(\" \")) > 10\n        ]\n        return texts\n\n    with ThreadPool(8) as p:\n        rows = []\n        for r in p.imap_unordered(parse_file, files):\n            rows.extend(r)\n        return rows\n\n\ndef extract_embedding(embeddings: Embeddings, texts):\n    if len(texts) > 0:\n        embeddings = embeddings.embed_documents(\n            [t[\"text\"] for _, t in 
enumerate(texts)])\n        for i, _ in enumerate(texts):\n            texts[i][\"vector\"] = embeddings[i]\n        return texts\n    raise ValueError(\"No texts extracted!\")\n\n\ndef create_message_history_table(table_name: str, base_class):\n    class Message(base_class):\n        __tablename__ = table_name\n        id = Column(types.Float64)\n        session_id = Column(Text)\n        user_id = Column(Text)\n        msg_id = Column(Text, primary_key=True)\n        type = Column(Text)\n        # should be `additionals`; a former developer misspelled it.\n        addtionals = Column(Text)\n        message = Column(Text)\n        __table_args__ = (\n            engines.MergeTree(\n                partition_by='session_id',\n                order_by=('id', 'msg_id')\n            ),\n            {'comment': 'Store Chat History'}\n        )\n\n    return Message\n\n\ndef create_session_table(table_name: str, DynamicBase):\n    class Session(DynamicBase):\n        __tablename__ = table_name\n        user_id = Column(Text)\n        session_id = Column(Text, primary_key=True)\n        system_prompt = Column(Text)\n        # creation time of the session.\n        create_by = Column(types.DateTime)\n        # extra session attributes, stored as a JSON string.\n        additionals = Column(Text)\n        __table_args__ = (\n            engines.MergeTree(order_by=session_id),\n            {'comment': 'Store Session and Prompts'}\n        )\n\n    return Session\n"
  },
  {
    "path": "app/backend/constants/__init__.py",
    "content": ""
  },
  {
    "path": "app/backend/constants/myscale_tables.py",
    "content": "from typing import Dict, List\nimport streamlit as st\nfrom langchain.chains.query_constructor.schema import AttributeInfo\nfrom langchain_community.embeddings import SentenceTransformerEmbeddings, HuggingFaceInstructEmbeddings\nfrom langchain.prompts import PromptTemplate\n\nfrom backend.types.table_config import TableConfig\n\n\ndef hint_arxiv():\n    st.markdown(\"Here we provide some query samples.\")\n    st.markdown(\"- If you want to search papers with filters\")\n    st.markdown(\"1. ```What is a Bayesian network? Please use articles published later than Feb 2018 and with more \"\n                \"than 2 categories and whose title like `computer` and must have `cs.CV` in its category. ```\")\n    st.markdown(\"2. ```What is a Bayesian network? Please use articles published later than Feb 2018```\")\n    st.markdown(\"- If you want to ask questions based on arxiv papers stored in MyScaleDB\")\n    st.markdown(\"1. ```Did Geoffrey Hinton write a paper about Capsule Neural Networks?```\")\n    st.markdown(\"2. ```Introduce some applications of GANs published around 2019.```\")\n    st.markdown(\"3. ```请根据 2019 年左右的文章介绍一下 GAN 的应用都有哪些```\")\n\n\ndef hint_sql_arxiv():\n    st.markdown('''```sql\nCREATE TABLE default.ChatArXiv (\n    `abstract` String, \n    `id` String, \n    `vector` Array(Float32), \n    `metadata` Object('JSON'), \n    `pubdate` DateTime,\n    `title` String,\n    `categories` Array(String),\n    `authors` Array(String), \n    `comment` String,\n    `primary_category` String,\n    VECTOR INDEX vec_idx vector TYPE MSTG('fp16_storage=1', 'metric_type=Cosine', 'disk_mode=3'), \n    CONSTRAINT vec_len CHECK length(vector) = 768) \nENGINE = ReplacingMergeTree ORDER BY id\n```''')\n\n\ndef hint_wiki():\n    st.markdown(\"Here we provide some query samples.\")\n    st.markdown(\"1. ```Which company did Elon Musk found?```\")\n    st.markdown(\"2. ```What is Iron Gwazi?```\")\n    st.markdown(\"3. 
```苹果的发源地是哪里？```\")\n    st.markdown(\"4. ```What is a Ring in mathematics?```\")\n    st.markdown(\"5. ```The producer of Rick and Morty.```\")\n    st.markdown(\"6. ```How low is the temperature on Pluto?```\")\n\n\ndef hint_sql_wiki():\n    st.markdown('''```sql\nCREATE TABLE wiki.Wikipedia (\n    `id` String, \n    `title` String, \n    `text` String, \n    `url` String, \n    `wiki_id` UInt64, \n    `views` Float32, \n    `paragraph_id` UInt64, \n    `langs` UInt32, \n    `emb` Array(Float32), \n    VECTOR INDEX vec_idx emb TYPE MSTG('fp16_storage=1', 'metric_type=Cosine', 'disk_mode=3'), \n    CONSTRAINT emb_len CHECK length(emb) = 768) \nENGINE = ReplacingMergeTree ORDER BY id\n```''')\n\n\nMYSCALE_TABLES: Dict[str, TableConfig] = {\n    'Wikipedia': TableConfig(\n        database=\"wiki\",\n        table=\"Wikipedia\",\n        table_contents=\"Snapshot from Wikipedia for 2022. All in English.\",\n        hint=hint_wiki,\n        hint_sql=hint_sql_wiki,\n        # doc_prompt is used by the QA-with-sources chain\n        doc_prompt=PromptTemplate(\n            input_variables=[\"page_content\", \"url\", \"title\", \"ref_id\", \"views\"],\n            template=\"Title for Doc #{ref_id}: {title}\\n\\tviews: {views}\\n\\tcontent: {page_content}\\nSOURCE: {url}\"\n        ),\n        metadata_col_attributes=[\n            AttributeInfo(name=\"title\", description=\"title of the wikipedia page\", type=\"string\"),\n            AttributeInfo(name=\"text\", description=\"paragraph from this wiki page\", type=\"string\"),\n            AttributeInfo(name=\"views\", description=\"number of views\", type=\"float\")\n        ],\n        must_have_col_names=['id', 'title', 'url', 'text', 'views'],\n        vector_col_name=\"emb\",\n        text_col_name=\"text\",\n        metadata_col_name=\"metadata\",\n        emb_model=lambda: SentenceTransformerEmbeddings(\n            model_name='sentence-transformers/paraphrase-multilingual-mpnet-base-v2'\n        ),\n        
tool_desc=(\"search_among_wikipedia\", \"Searches among Wikipedia and returns related wiki pages\")\n    ),\n    'ArXiv Papers': TableConfig(\n        database=\"default\",\n        table=\"ChatArXiv\",\n        table_contents=\"Snapshot of scientific papers from ArXiv.\",\n        hint=hint_arxiv,\n        hint_sql=hint_sql_arxiv,\n        doc_prompt=PromptTemplate(\n            input_variables=[\"page_content\", \"id\", \"title\", \"ref_id\", \"authors\", \"pubdate\", \"categories\"],\n            template=\"Title for Doc #{ref_id}: {title}\\n\\tAbstract: {page_content}\\n\\tAuthors: {authors}\\n\\t\"\n                     \"Date of Publication: {pubdate}\\n\\tCategories: {categories}\\nSOURCE: {id}\"\n        ),\n        metadata_col_attributes=[\n            AttributeInfo(name=\"pubdate\", description=\"The year the paper is published\", type=\"timestamp\"),\n            AttributeInfo(name=\"authors\", description=\"List of author names\", type=\"list[string]\"),\n            AttributeInfo(name=\"title\", description=\"Title of the paper\", type=\"string\"),\n            AttributeInfo(name=\"categories\", description=\"arxiv categories to this paper\", type=\"list[string]\"),\n            AttributeInfo(name=\"length(categories)\", description=\"length of arxiv categories to this paper\", type=\"int\")\n        ],\n        must_have_col_names=['title', 'id', 'categories', 'abstract', 'authors', 'pubdate'],\n        vector_col_name=\"vector\",\n        text_col_name=\"abstract\",\n        metadata_col_name=\"metadata\",\n        emb_model=lambda: HuggingFaceInstructEmbeddings(\n            model_name='hkunlp/instructor-xl',\n            embed_instruction=\"Represent the question for retrieving supporting scientific papers: \"\n        ),\n        tool_desc=(\n            \"search_among_scientific_papers\",\n            \"Searches among scientific papers from ArXiv and returns research papers\"\n        )\n    )\n}\n\nALL_TABLE_NAME: List[str] = 
[config.table for config in MYSCALE_TABLES.values()]\n"
  },
  {
    "path": "app/backend/constants/prompts.py",
    "content": "from langchain.prompts import ChatPromptTemplate, \\\n    SystemMessagePromptTemplate, HumanMessagePromptTemplate\n\nDEFAULT_SYSTEM_PROMPT = (\n    \"Do your best to answer the questions. \"\n    \"Feel free to use any tools available to look up \"\n    \"relevant information. Please keep all details in query \"\n    \"when calling search functions.\"\n)\n\nCOMBINE_PROMPT_TEMPLATE = (\n        \"You are a helpful document assistant. \"\n        \"Your task is to provide information and answer any questions related to documents given below. \"\n        \"You should use the sections, title and abstract of the selected documents as your source of information \"\n        \"and try to provide concise and accurate answers to any questions asked by the user. \"\n        \"If you are unable to find relevant information in the given sections, \"\n        \"you will need to let the user know that the source does not contain relevant information but still try to \"\n        \"provide an answer based on your general knowledge. You must refer to the corresponding section name and page \"\n        \"that you refer to when answering. \"\n        \"The following is the related information about the document that will help you answer users' questions, \"\n        \"you MUST answer it using question's language:\\n\\n {summaries} \"\n        \"Now you should answer user's question. Remember you must use `Doc #` to refer papers:\\n\\n\"\n)\n\nCOMBINE_PROMPT = ChatPromptTemplate.from_strings(\n    string_messages=[(SystemMessagePromptTemplate, COMBINE_PROMPT_TEMPLATE),\n                     (HumanMessagePromptTemplate, '{question}')])\n\nMYSCALE_PROMPT = \"\"\"\nYou are a MyScale expert. 
Given an input question, first create a syntactically correct MyScale query to run, then look at the results of the query and return the answer to the input question.\nMyScale has a vector distance function called `DISTANCE(column, array)` to compute relevance to the user's question and sort the feature array column by the relevance. \nWhen the query is asking for {top_k} closest rows, you have to use this distance function to calculate distance to entity's array on vector column and order by the distance to retrieve relevant rows.\n\n*NOTICE*: `DISTANCE(column, array)` only accepts an array column as its first argument and a `NeuralArray(entity)` as its second argument. You also need a user-defined function called `NeuralArray(entity)` to retrieve the entity's array. \n\nUnless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per MyScale. You should only order according to the distance function.\nNever query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (\") to denote them as delimited identifiers.\nPay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.\nPay attention to use today() function to get the current date, if the question involves \"today\". `ORDER BY` clause should always be after `WHERE` clause. DO NOT add semicolon to the end of SQL. Pay attention to the comment in table schema.\nPay attention to the data type when using functions. 
Always use `AND` to connect conditions in `WHERE` and never use comma.\nMake sure you never write an isolated `WHERE` keyword and never use undesired conditions to constrain the query.\n\nUse the following format:\n\n======== table info ========\n<some table infos>\n\nQuestion: \"Question here\"\nSQLQuery: \"SQL Query to run\"\n\n\nHere are some examples:\n\n======== table info ========\nCREATE TABLE \"ChatPaper\" (\n\tabstract String, \n\tid String, \n\tvector Array(Float32), \n) ENGINE = ReplicatedReplacingMergeTree()\n ORDER BY id\n PRIMARY KEY id\n \nQuestion: What is Feature Pyramid Network?\nSQLQuery: SELECT ChatPaper.abstract, ChatPaper.id FROM ChatPaper ORDER BY DISTANCE(vector, NeuralArray(Feature Pyramid Network)) LIMIT {top_k}\n\n\n======== table info ========\nCREATE TABLE \"ChatPaper\" (\n\tabstract String, \n\tid String, \n\tvector Array(Float32), \n\tcategories Array(String), \n\tpubdate DateTime, \n\ttitle String, \n\tauthors Array(String), \n\tprimary_category String\n) ENGINE = ReplicatedReplacingMergeTree()\n ORDER BY id\n PRIMARY KEY id\n \nQuestion: What is PaperRank? What is the contribution of those works? Use papers with more than 2 categories.\nSQLQuery: SELECT ChatPaper.title, ChatPaper.id, ChatPaper.authors FROM ChatPaper WHERE length(categories) > 2 ORDER BY DISTANCE(vector, NeuralArray(PaperRank contribution)) LIMIT {top_k}\n\n\n======== table info ========\nCREATE TABLE \"ChatArXiv\" (\n\tprimary_category String, \n\tcategories Array(String), \n\tpubdate DateTime, \n\tabstract String, \n\ttitle String, \n\tpaper_id String, \n\tvector Array(Float32), \n\tauthors Array(String), \n) ENGINE = MergeTree()\n ORDER BY paper_id\n PRIMARY KEY paper_id\n \nQuestion: Did Geoffrey Hinton write about Capsule Neural Networks? 
Please use articles published later than 2021.\nSQLQuery: SELECT ChatArXiv.title, ChatArXiv.paper_id, ChatArXiv.authors FROM ChatArXiv WHERE has(authors, 'Geoffrey Hinton') AND pubdate > parseDateTimeBestEffort('2021-01-01') ORDER BY DISTANCE(vector, NeuralArray(Capsule Neural Networks)) LIMIT {top_k}\n\n\n======== table info ========\nCREATE TABLE \"PaperDatabase\" (\n\tabstract String, \n\tcategories Array(String), \n\tvector Array(Float32), \n\tpubdate DateTime, \n\tid String, \n\tcomments String,\n\ttitle String, \n\tauthors Array(String), \n\tprimary_category String\n) ENGINE = MergeTree()\n ORDER BY id\n PRIMARY KEY id\n \nQuestion: Find papers whose abstract has Mutual Information in it.\nSQLQuery: SELECT PaperDatabase.title, PaperDatabase.id FROM PaperDatabase WHERE abstract ILIKE '%Mutual Information%' ORDER BY DISTANCE(vector, NeuralArray(Mutual Information)) LIMIT {top_k}\n\n \nLet's begin:\n\n======== table info ========\n{table_info}\n\nQuestion: {input}\nSQLQuery: \"\"\"\n"
  },
  {
    "path": "app/backend/constants/streamlit_keys.py",
    "content": "DATA_INITIALIZE_NOT_STATED = \"data_initialize_not_started\"\nDATA_INITIALIZE_STARTED = \"data_initialize_started\"\nDATA_INITIALIZE_COMPLETED = \"data_initialize_completed\"\n\n\nCHAT_SESSION = \"sel_sess\"\nCHAT_KNOWLEDGE_TABLE = \"private_kb\"\n\nCHAT_SESSION_MANAGER = \"session_manager\"\nCHAT_CURRENT_USER_SESSIONS = \"current_sessions\"\n\nEL_SESSION_SELECTOR = \"el_session_selector\"\n\n# all personal knowledge bases that belong to a specific user.\nUSER_PERSONAL_KNOWLEDGE_BASES = \"user_tools\"\n# all personal files that belong to a specific user.\nUSER_PRIVATE_FILES = \"user_files\"\n# public and personal knowledge bases.\nAVAILABLE_RETRIEVAL_TOOLS = \"tools_with_users\"\n\nEL_PERSONAL_KB_NEEDS_REMOVE = \"el_personal_kb_needs_remove\"\n\n# files that need to be uploaded.\nEL_UPLOAD_FILES = \"el_upload_files\"\nEL_UPLOAD_FILES_STATUS = \"el_upload_files_status\"\n\n# files used to build a private knowledge base.\nEL_BUILD_KB_WITH_FILES = \"el_build_kb_with_files\"\n# name given to the personal kb being built.\nEL_PERSONAL_KB_NAME = \"el_personal_kb_name\"\n# description given to the personal kb being built.\nEL_PERSONAL_KB_DESCRIPTION = \"el_personal_kb_description\"\n\n# knowledge bases selected by the user.\nEL_SELECTED_KBS = \"el_selected_kbs\"\n"
  },
  {
    "path": "app/backend/constants/variables.py",
    "content": "from backend.types.global_config import GlobalConfig\n\n# ***** str variables ***** #\nEMBEDDING_MODEL_PREFIX = \"embedding_model\"\nCHAINS_RETRIEVERS_MAPPING = \"sel_map_obj\"\nLANGCHAIN_RETRIEVER = \"langchain_retriever\"\nVECTOR_SQL_RETRIEVER = \"vecsql_retriever\"\nTABLE_EMBEDDINGS_MAPPING = \"embeddings\"\nRETRIEVER_TOOLS = \"tools\"\nDATA_INITIALIZE_STATUS = \"data_initialized\"\nUI_INITIALIZED = \"ui_initialized\"\nJUMP_QUERY_ASK = \"jump_query_ask\"\nUSER_NAME = \"user_name\"\nUSER_INFO = \"user_info\"\n\nDIVIDER_HTML = \"\"\"\n    <div style=\"\n        height: 4px;\n        background: linear-gradient(to right, red, orange, yellow, green, blue, indigo, violet);\n        margin-top: 20px;\n        margin-bottom: 20px;\n    \"></div>\n\"\"\"\n\nDIVIDER_THIN_HTML = \"\"\"\n    <div style=\"\n        height: 2px;\n        background: linear-gradient(to right, blue, darkslateblue, indigo, violet);\n        margin-top: 20px;\n        margin-bottom: 20px;\n    \"></div>\n\"\"\"\n\n\nclass RetrieverButtons:\n    vector_sql_query_from_db = \"vector_sql_query_from_db\"\n    vector_sql_query_with_llm = \"vector_sql_query_with_llm\"\n    self_query_from_db = \"self_query_from_db\"\n    self_query_with_llm = \"self_query_with_llm\"\n\n\nGLOBAL_CONFIG = GlobalConfig()\n\n\ndef update_global_config(new_config: GlobalConfig):\n    global GLOBAL_CONFIG\n    GLOBAL_CONFIG.openai_api_base = new_config.openai_api_base\n    GLOBAL_CONFIG.openai_api_key = new_config.openai_api_key\n    GLOBAL_CONFIG.auth0_client_id = new_config.auth0_client_id\n    GLOBAL_CONFIG.auth0_domain = new_config.auth0_domain\n    GLOBAL_CONFIG.myscale_user = new_config.myscale_user\n    GLOBAL_CONFIG.myscale_password = new_config.myscale_password\n    GLOBAL_CONFIG.myscale_host = new_config.myscale_host\n    GLOBAL_CONFIG.myscale_port = new_config.myscale_port\n    GLOBAL_CONFIG.query_model = new_config.query_model\n    GLOBAL_CONFIG.chat_model = new_config.chat_model\n    
GLOBAL_CONFIG.untrusted_api = new_config.untrusted_api\n    GLOBAL_CONFIG.myscale_enable_https = new_config.myscale_enable_https\n"
  },
  {
    "path": "app/backend/construct/__init__.py",
    "content": ""
  },
  {
    "path": "app/backend/construct/build_agents.py",
    "content": "import os\nfrom typing import Sequence, List\n\nimport streamlit as st\nfrom langchain.agents import AgentExecutor\nfrom langchain.schema.language_model import BaseLanguageModel\nfrom langchain.tools import BaseTool\n\nfrom backend.chat_bot.message_converter import DefaultClickhouseMessageConverter\nfrom backend.constants.prompts import DEFAULT_SYSTEM_PROMPT\nfrom backend.constants.streamlit_keys import AVAILABLE_RETRIEVAL_TOOLS\nfrom backend.constants.variables import GLOBAL_CONFIG, RETRIEVER_TOOLS\nfrom logger import logger\n\ntry:\n    from sqlalchemy.orm import declarative_base\nexcept ImportError:\n    from sqlalchemy.ext.declarative import declarative_base\nfrom langchain.chat_models import ChatOpenAI\nfrom langchain.prompts.chat import MessagesPlaceholder\nfrom langchain.agents.openai_functions_agent.agent_token_buffer_memory import AgentTokenBufferMemory\nfrom langchain.agents.openai_functions_agent.base import OpenAIFunctionsAgent\nfrom langchain.schema.messages import SystemMessage\nfrom langchain.memory import SQLChatMessageHistory\n\n\ndef create_agent_executor(\n        agent_name: str,\n        session_id: str,\n        llm: BaseLanguageModel,\n        tools: Sequence[BaseTool],\n        system_prompt: str,\n        **kwargs\n) -> AgentExecutor:\n    agent_name = agent_name.replace(\" \", \"_\")\n    conn_str = f'clickhouse://{os.environ[\"MYSCALE_USER\"]}:{os.environ[\"MYSCALE_PASSWORD\"]}@{os.environ[\"MYSCALE_HOST\"]}:{os.environ[\"MYSCALE_PORT\"]}'\n    chat_memory = SQLChatMessageHistory(\n        session_id,\n        connection_string=f'{conn_str}/chat?protocol=http' if GLOBAL_CONFIG.myscale_enable_https == False else f'{conn_str}/chat?protocol=https',\n        custom_message_converter=DefaultClickhouseMessageConverter(agent_name))\n    memory = AgentTokenBufferMemory(llm=llm, chat_memory=chat_memory)\n\n    prompt = OpenAIFunctionsAgent.create_prompt(\n        system_message=SystemMessage(content=system_prompt),\n        
extra_prompt_messages=[MessagesPlaceholder(variable_name=\"history\")],\n    )\n    agent = OpenAIFunctionsAgent(llm=llm, tools=tools, prompt=prompt)\n    return AgentExecutor(\n        agent=agent,\n        tools=tools,\n        memory=memory,\n        verbose=True,\n        return_intermediate_steps=True,\n        **kwargs\n    )\n\n\ndef build_agents(\n        session_id: str,\n        tool_names: List[str],\n        model: str = \"gpt-3.5-turbo-0125\",\n        temperature: float = 0.6,\n        system_prompt: str = DEFAULT_SYSTEM_PROMPT\n):\n    chat_llm = ChatOpenAI(\n        model_name=model,\n        temperature=temperature,\n        base_url=GLOBAL_CONFIG.openai_api_base,\n        api_key=GLOBAL_CONFIG.openai_api_key,\n        streaming=True\n    )\n    tools = st.session_state.get(AVAILABLE_RETRIEVAL_TOOLS, st.session_state.get(RETRIEVER_TOOLS))\n    selected_tools = [tools[k] for k in tool_names]\n    logger.info(f\"create agent, use tools: {selected_tools}\")\n    agent = create_agent_executor(\n        agent_name=\"chat_memory\",\n        session_id=session_id,\n        llm=chat_llm,\n        tools=selected_tools,\n        system_prompt=system_prompt\n    )\n    return agent\n"
  },
  {
    "path": "app/backend/construct/build_all.py",
    "content": "from logger import logger\nfrom typing import Dict, Any, Union\n\nimport streamlit as st\n\nfrom backend.constants.myscale_tables import MYSCALE_TABLES\nfrom backend.constants.variables import CHAINS_RETRIEVERS_MAPPING\nfrom backend.construct.build_chains import build_retrieval_qa_with_sources_chain\nfrom backend.construct.build_retriever_tool import create_retriever_tool\nfrom backend.construct.build_retrievers import build_self_query_retriever, build_vector_sql_db_chain_retriever\nfrom backend.types.chains_and_retrievers import ChainsAndRetrievers, MetadataColumn\n\nfrom langchain_community.embeddings import HuggingFaceEmbeddings, HuggingFaceInstructEmbeddings, \\\n    SentenceTransformerEmbeddings\n\n\n@st.cache_resource\ndef load_embedding_model_for_table(table_name: str) -> \\\n        Union[SentenceTransformerEmbeddings, HuggingFaceInstructEmbeddings]:\n    with st.spinner(f\"Loading embedding models for [{table_name}] ...\"):\n        embeddings = MYSCALE_TABLES[table_name].emb_model()\n    return embeddings\n\n\n@st.cache_resource\ndef load_embedding_models() -> Dict[str, Union[HuggingFaceEmbeddings, HuggingFaceInstructEmbeddings]]:\n    embedding_models = {}\n    for table in MYSCALE_TABLES:\n        embedding_models[table] = load_embedding_model_for_table(table)\n    return embedding_models\n\n\n@st.cache_resource\ndef update_retriever_tools():\n    retrievers_tools = {}\n    for table in MYSCALE_TABLES:\n        logger.info(f\"Updating retriever tools [<retriever>, <sql_retriever>] for table {table}\")\n        retrievers_tools.update(\n            {\n                f\"{table} + Self Querying\": create_retriever_tool(\n                    st.session_state[CHAINS_RETRIEVERS_MAPPING][table][\"retriever\"],\n                    *MYSCALE_TABLES[table].tool_desc\n                ),\n                f\"{table} + Vector SQL\": create_retriever_tool(\n                    st.session_state[CHAINS_RETRIEVERS_MAPPING][table][\"sql_retriever\"],\n    
                *MYSCALE_TABLES[table].tool_desc\n                ),\n            })\n    return retrievers_tools\n\n\n@st.cache_resource\ndef build_chains_retriever_for_table(table_name: str) -> ChainsAndRetrievers:\n    metadata_col_attributes = MYSCALE_TABLES[table_name].metadata_col_attributes\n\n    self_query_retriever = build_self_query_retriever(table_name)\n    self_query_chain = build_retrieval_qa_with_sources_chain(\n        table_name=table_name,\n        retriever=self_query_retriever,\n        chain_name=\"Self Query Retriever\"\n    )\n\n    vector_sql_retriever = build_vector_sql_db_chain_retriever(table_name)\n    vector_sql_chain = build_retrieval_qa_with_sources_chain(\n        table_name=table_name,\n        retriever=vector_sql_retriever,\n        chain_name=\"Vector SQL DB Retriever\"\n    )\n\n    metadata_columns = [\n        MetadataColumn(\n            name=attribute.name,\n            desc=attribute.description,\n            type=attribute.type\n        )\n        for attribute in metadata_col_attributes\n    ]\n    return ChainsAndRetrievers(\n        metadata_columns=metadata_columns,\n        # for self query\n        retriever=self_query_retriever,\n        chain=self_query_chain,\n        # for vector sql\n        sql_retriever=vector_sql_retriever,\n        sql_chain=vector_sql_chain\n    )\n\n\n@st.cache_resource\ndef build_chains_and_retrievers() -> Dict[str, Dict[str, Any]]:\n    chains_and_retrievers = {}\n    for table in MYSCALE_TABLES:\n        logger.info(f\"Building chains, retrievers for table {table}\")\n        chains_and_retrievers[table] = build_chains_retriever_for_table(table).to_dict()\n    return chains_and_retrievers\n"
  },
  {
    "path": "app/backend/construct/build_chains.py",
    "content": "from langchain.chains import LLMChain\nfrom langchain.chat_models import ChatOpenAI\nfrom langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate\nfrom langchain.schema import BaseRetriever\nimport streamlit as st\n\nfrom backend.chains.retrieval_qa_with_sources import CustomRetrievalQAWithSourcesChain\nfrom backend.chains.stuff_documents import CustomStuffDocumentChain\nfrom backend.constants.myscale_tables import MYSCALE_TABLES\nfrom backend.constants.prompts import COMBINE_PROMPT\nfrom backend.constants.variables import GLOBAL_CONFIG\n\n\ndef build_retrieval_qa_with_sources_chain(\n        table_name: str,\n        retriever: BaseRetriever,\n        chain_name: str = \"<chain_name>\"\n) -> CustomRetrievalQAWithSourcesChain:\n    with st.spinner(f'Building QA source chain named `{chain_name}` for MyScaleDB/{table_name} ...'):\n        # Assign ref_id for documents\n        custom_stuff_document_chain = CustomStuffDocumentChain(\n            llm_chain=LLMChain(\n                prompt=COMBINE_PROMPT,\n                llm=ChatOpenAI(\n                    model_name=GLOBAL_CONFIG.chat_model,\n                    openai_api_key=GLOBAL_CONFIG.openai_api_key,\n                    temperature=0.6\n                ),\n            ),\n            document_prompt=MYSCALE_TABLES[table_name].doc_prompt,\n            document_variable_name=\"summaries\",\n        )\n        chain = CustomRetrievalQAWithSourcesChain(\n            retriever=retriever,\n            combine_documents_chain=custom_stuff_document_chain,\n            return_source_documents=True,\n            max_tokens_limit=12000,\n        )\n    return chain\n"
  },
  {
    "path": "app/backend/construct/build_chat_bot.py",
    "content": "from backend.chat_bot.private_knowledge_base import ChatBotKnowledgeTable\nfrom backend.constants.streamlit_keys import CHAT_KNOWLEDGE_TABLE, CHAT_SESSION, CHAT_SESSION_MANAGER\nimport streamlit as st\n\nfrom backend.constants.variables import GLOBAL_CONFIG, TABLE_EMBEDDINGS_MAPPING\nfrom backend.constants.prompts import DEFAULT_SYSTEM_PROMPT\nfrom backend.chat_bot.session_manager import SessionManager\n\n\ndef build_chat_knowledge_table():\n    if CHAT_KNOWLEDGE_TABLE not in st.session_state:\n        st.session_state[CHAT_KNOWLEDGE_TABLE] = ChatBotKnowledgeTable(\n            host=GLOBAL_CONFIG.myscale_host,\n            port=GLOBAL_CONFIG.myscale_port,\n            username=GLOBAL_CONFIG.myscale_user,\n            password=GLOBAL_CONFIG.myscale_password,\n            # embedding=st.session_state[TABLE_EMBEDDINGS_MAPPING][\"Wikipedia\"],\n            embedding=st.session_state[TABLE_EMBEDDINGS_MAPPING][\"ArXiv Papers\"],\n            parser_api_key=GLOBAL_CONFIG.untrusted_api,\n        )\n\n\ndef initialize_session_manager():\n    if CHAT_SESSION not in st.session_state:\n        st.session_state[CHAT_SESSION] = {\n            \"session_id\": \"default\",\n            \"system_prompt\": DEFAULT_SYSTEM_PROMPT,\n        }\n    if CHAT_SESSION_MANAGER not in st.session_state:\n        st.session_state[CHAT_SESSION_MANAGER] = SessionManager(\n            st.session_state,\n            host=GLOBAL_CONFIG.myscale_host,\n            port=GLOBAL_CONFIG.myscale_port,\n            username=GLOBAL_CONFIG.myscale_user,\n            password=GLOBAL_CONFIG.myscale_password,\n        )\n"
  },
  {
    "path": "app/backend/construct/build_retriever_tool.py",
    "content": "import json\nfrom typing import List\n\nfrom langchain.pydantic_v1 import BaseModel, Field\nfrom langchain.schema import BaseRetriever, Document\nfrom langchain.tools import Tool\n\nfrom backend.chat_bot.json_decoder import CustomJSONEncoder\n\n\nclass RetrieverInput(BaseModel):\n    query: str = Field(description=\"query to look up in retriever\")\n\n\ndef create_retriever_tool(\n        retriever: BaseRetriever,\n        tool_name: str,\n        description: str\n) -> Tool:\n    \"\"\"Create a tool to do retrieval of documents.\n\n    Args:\n        retriever: The retriever to use for the retrieval\n        tool_name: The name for the tool. This will be passed to the language model,\n            so should be unique and somewhat descriptive.\n        description: The description for the tool. This will be passed to the language\n            model, so should be descriptive.\n\n    Returns:\n        Tool class to pass to an agent\n    \"\"\"\n    def wrap(func):\n        def wrapped_retrieve(*args, **kwargs):\n            docs: List[Document] = func(*args, **kwargs)\n            return json.dumps([d.dict() for d in docs], cls=CustomJSONEncoder)\n\n        return wrapped_retrieve\n\n    return Tool(\n        name=tool_name,\n        description=description,\n        func=wrap(retriever.get_relevant_documents),\n        coroutine=retriever.aget_relevant_documents,\n        args_schema=RetrieverInput,\n    )\n"
  },
  {
    "path": "app/backend/construct/build_retrievers.py",
    "content": "import streamlit as st\nfrom langchain.chat_models import ChatOpenAI\nfrom langchain.prompts.prompt import PromptTemplate\nfrom langchain.retrievers.self_query.base import SelfQueryRetriever\nfrom langchain.retrievers.self_query.myscale import MyScaleTranslator\nfrom langchain.utilities.sql_database import SQLDatabase\nfrom langchain.vectorstores import MyScaleSettings\nfrom langchain_experimental.retrievers.vector_sql_database import VectorSQLDatabaseChainRetriever\nfrom langchain_experimental.sql.vector_sql import VectorSQLDatabaseChain\nfrom sqlalchemy import create_engine, MetaData\n\nfrom backend.constants.myscale_tables import MYSCALE_TABLES\nfrom backend.constants.prompts import MYSCALE_PROMPT\nfrom backend.constants.variables import TABLE_EMBEDDINGS_MAPPING, GLOBAL_CONFIG\nfrom backend.retrievers.vector_sql_output_parser import VectorSQLRetrieveOutputParser\nfrom backend.vector_store.myscale_without_metadata import MyScaleWithoutMetadataJson\nfrom logger import logger\n\n\n@st.cache_resource\ndef build_self_query_retriever(table_name: str) -> SelfQueryRetriever:\n    with st.spinner(f\"Building VectorStore for MyScaleDB/{table_name} ...\"):\n        myscale_connection = {\n            \"host\": GLOBAL_CONFIG.myscale_host,\n            \"port\": GLOBAL_CONFIG.myscale_port,\n            \"username\": GLOBAL_CONFIG.myscale_user,\n            \"password\": GLOBAL_CONFIG.myscale_password,\n        }\n        myscale_settings = MyScaleSettings(\n            **myscale_connection,\n            database=MYSCALE_TABLES[table_name].database,\n            table=MYSCALE_TABLES[table_name].table,\n            column_map={\n                \"id\": \"id\",\n                \"text\": MYSCALE_TABLES[table_name].text_col_name,\n                \"vector\": MYSCALE_TABLES[table_name].vector_col_name,\n                # TODO refine MyScaleDB metadata in langchain.\n                \"metadata\": MYSCALE_TABLES[table_name].metadata_col_name\n            }\n        
)\n        myscale_vector_store = MyScaleWithoutMetadataJson(\n            embedding=st.session_state[TABLE_EMBEDDINGS_MAPPING][table_name],\n            config=myscale_settings,\n            must_have_cols=MYSCALE_TABLES[table_name].must_have_col_names\n        )\n\n    with st.spinner(f\"Building SelfQueryRetriever for MyScaleDB/{table_name} ...\"):\n        retriever: SelfQueryRetriever = SelfQueryRetriever.from_llm(\n            llm=ChatOpenAI(\n                model_name=GLOBAL_CONFIG.query_model,\n                base_url=GLOBAL_CONFIG.openai_api_base,\n                api_key=GLOBAL_CONFIG.openai_api_key,\n                temperature=0\n            ),\n            vectorstore=myscale_vector_store,\n            document_contents=MYSCALE_TABLES[table_name].table_contents,\n            metadata_field_info=MYSCALE_TABLES[table_name].metadata_col_attributes,\n            use_original_query=False,\n            structured_query_translator=MyScaleTranslator()\n        )\n    return retriever\n\n\n@st.cache_resource\ndef build_vector_sql_db_chain_retriever(table_name: str) -> VectorSQLDatabaseChainRetriever:\n    \"\"\"Get a group of relative docs from MyScaleDB\"\"\"\n    with st.spinner(f'Building Vector SQL Database Retriever for MyScaleDB/{table_name}...'):\n        if GLOBAL_CONFIG.myscale_enable_https == False:\n            engine = create_engine(\n                f'clickhouse://{GLOBAL_CONFIG.myscale_user}:{GLOBAL_CONFIG.myscale_password}@'\n                f'{GLOBAL_CONFIG.myscale_host}:{GLOBAL_CONFIG.myscale_port}'\n                f'/{MYSCALE_TABLES[table_name].database}?protocol=http'\n            )\n        else:\n            engine = create_engine(\n                f'clickhouse://{GLOBAL_CONFIG.myscale_user}:{GLOBAL_CONFIG.myscale_password}@'\n                f'{GLOBAL_CONFIG.myscale_host}:{GLOBAL_CONFIG.myscale_port}'\n                f'/{MYSCALE_TABLES[table_name].database}?protocol=https'\n            )\n        metadata = MetaData(bind=engine)\n      
  logger.info(f\"{table_name} metadata is : {metadata}\")\n        prompt = PromptTemplate(\n            input_variables=[\"input\", \"table_info\", \"top_k\"],\n            template=MYSCALE_PROMPT,\n        )\n        # Custom `out_put_parser` rewrite search SQL, make it's possible to query custom column.\n        output_parser = VectorSQLRetrieveOutputParser.from_embeddings(\n            model=st.session_state[TABLE_EMBEDDINGS_MAPPING][table_name],\n            # rewrite columns needs be searched.\n            must_have_columns=MYSCALE_TABLES[table_name].must_have_col_names\n        )\n\n        # `db_chain` will generate a SQL\n        vector_sql_db_chain: VectorSQLDatabaseChain = VectorSQLDatabaseChain.from_llm(\n            llm=ChatOpenAI(\n                model_name=GLOBAL_CONFIG.query_model,\n                base_url=GLOBAL_CONFIG.openai_api_base,\n                api_key=GLOBAL_CONFIG.openai_api_key,\n                temperature=0\n            ),\n            prompt=prompt,\n            top_k=10,\n            return_direct=True,\n            db=SQLDatabase(\n                engine,\n                None,\n                metadata,\n                include_tables=[MYSCALE_TABLES[table_name].table],\n                max_string_length=1024\n            ),\n            sql_cmd_parser=output_parser,  # TODO needs update `langchain`, fix return type.\n            native_format=True\n        )\n\n        # `retriever` can search a group of documents with `db_chain`\n        vector_sql_db_chain_retriever = VectorSQLDatabaseChainRetriever(\n            sql_db_chain=vector_sql_db_chain,\n            page_content_key=MYSCALE_TABLES[table_name].text_col_name\n        )\n    return vector_sql_db_chain_retriever\n"
  },
  {
    "path": "app/backend/retrievers/__init__.py",
    "content": ""
  },
  {
    "path": "app/backend/retrievers/self_query.py",
    "content": "from typing import List\n\nimport pandas as pd\nimport streamlit as st\nfrom langchain.retrievers import SelfQueryRetriever\nfrom langchain_core.documents import Document\nfrom langchain_core.runnables import RunnableConfig\n\nfrom backend.chains.retrieval_qa_with_sources import CustomRetrievalQAWithSourcesChain\nfrom backend.constants.myscale_tables import MYSCALE_TABLES\nfrom backend.constants.variables import CHAINS_RETRIEVERS_MAPPING, DIVIDER_HTML, RetrieverButtons\nfrom backend.callbacks.self_query_callbacks import ChatDataSelfAskCallBackHandler, CustomSelfQueryRetrieverCallBackHandler\nfrom ui.utils import display\nfrom logger import logger\n\n\ndef process_self_query(selected_table, query_type):\n    place_holder = st.empty()\n    logger.info(\n        f\"button-1: {RetrieverButtons.self_query_from_db}, \"\n        f\"button-2: {RetrieverButtons.self_query_with_llm}, \"\n        f\"content: {st.session_state.query_self}\"\n    )\n    with place_holder.expander('🪵 Chat Log', expanded=True):\n        try:\n            if query_type == RetrieverButtons.self_query_from_db:\n                callback = CustomSelfQueryRetrieverCallBackHandler()\n                retriever: SelfQueryRetriever = \\\n                    st.session_state[CHAINS_RETRIEVERS_MAPPING][selected_table][\"retriever\"]\n                config: RunnableConfig = {\"callbacks\": [callback]}\n\n                relevant_docs = retriever.invoke(\n                    input=st.session_state.query_self,\n                    config=config\n                )\n\n                callback.progress_bar.progress(\n                    value=1.0, text=\"[Question -> LLM -> Query filter -> MyScaleDB -> Results] Done!✅\")\n\n                st.markdown(f\"### Self Query Results from `{selected_table}` \\n\"\n                            f\"> Here we get documents from MyScaleDB by `SelfQueryRetriever` \\n\\n\")\n                display(\n                    dataframe=pd.DataFrame(\n                  
      [{**d.metadata, 'abstract': d.page_content} for d in relevant_docs]\n                    ),\n                    columns_=MYSCALE_TABLES[selected_table].must_have_col_names\n                )\n            elif query_type == RetrieverButtons.self_query_with_llm:\n                # callback = CustomSelfQueryRetrieverCallBackHandler()\n                callback = ChatDataSelfAskCallBackHandler()\n                chain: CustomRetrievalQAWithSourcesChain = \\\n                    st.session_state[CHAINS_RETRIEVERS_MAPPING][selected_table][\"chain\"]\n                chain_results = chain(st.session_state.query_self, callbacks=[callback])\n                callback.progress_bar.progress(\n                    value=1.0,\n                    text=\"[Question -> LLM -> Query filter -> MyScaleDB -> Related Results -> LLM -> LLM Answer] Done!✅\"\n                )\n\n                documents_reference: List[Document] = chain_results[\"source_documents\"]\n                st.markdown(f\"### SelfQueryRetriever Results from `{selected_table}` \\n\"\n                            f\"> Here we get documents from MyScaleDB by `SelfQueryRetriever` \\n\\n\")\n                display(\n                    pd.DataFrame(\n                        [{**d.metadata, 'abstract': d.page_content} for d in documents_reference]\n                    )\n                )\n                st.markdown(\n                    f\"### Answer from LLM \\n\"\n                    f\"> The response of the LLM when given the `SelfQueryRetriever` results. 
\\n\\n\"\n                )\n                st.write(chain_results['answer'])\n                st.markdown(\n                    f\"### References from `{selected_table}`\\n\"\n                    f\"> The documents used by the LLM \\n\\n\"\n                )\n                if len(chain_results['sources']) == 0:\n                    st.write(\"No documents were used by the LLM.\")\n                else:\n                    display(\n                        dataframe=pd.DataFrame(\n                            [{**d.metadata, 'abstract': d.page_content} for d in chain_results['sources']]\n                        ),\n                        columns_=['ref_id'] + MYSCALE_TABLES[selected_table].must_have_col_names,\n                        index='ref_id'\n                    )\n            st.markdown(DIVIDER_HTML, unsafe_allow_html=True)\n        except Exception as e:\n            st.write('Oops 😵 Something bad happened...')\n            raise e\n"
  },
  {
    "path": "app/backend/retrievers/vector_sql_output_parser.py",
    "content": "from typing import Dict, Any, List\n\nfrom langchain_experimental.sql.vector_sql import VectorSQLOutputParser\n\n\nclass VectorSQLRetrieveOutputParser(VectorSQLOutputParser):\n    \"\"\"Based on VectorSQLOutputParser\n    It also modify the SQL to get all columns\n    \"\"\"\n    must_have_columns: List[str]\n\n    @property\n    def _type(self) -> str:\n        return \"vector_sql_retrieve_custom\"\n\n    def parse(self, text: str) -> Dict[str, Any]:\n        text = text.strip()\n        start = text.upper().find(\"SELECT\")\n        if start >= 0:\n            end = text.upper().find(\"FROM\")\n            text = text.replace(\n                text[start + len(\"SELECT\") + 1: end - 1], \", \".join(self.must_have_columns))\n        return super().parse(text)\n"
  },
  {
    "path": "app/backend/retrievers/vector_sql_query.py",
    "content": "from typing import List\n\nimport pandas as pd\nimport streamlit as st\nfrom langchain.schema import Document\nfrom langchain_experimental.retrievers.vector_sql_database import VectorSQLDatabaseChainRetriever\n\nfrom backend.chains.retrieval_qa_with_sources import CustomRetrievalQAWithSourcesChain\nfrom backend.constants.myscale_tables import MYSCALE_TABLES\nfrom backend.constants.variables import CHAINS_RETRIEVERS_MAPPING, DIVIDER_HTML, RetrieverButtons\nfrom backend.callbacks.vector_sql_callbacks import VectorSQLSearchDBCallBackHandler, VectorSQLSearchLLMCallBackHandler\nfrom ui.utils import display\nfrom logger import logger\n\n\ndef process_sql_query(selected_table: str, query_type: str):\n    place_holder = st.empty()\n    logger.info(\n        f\"button-1: {st.session_state[RetrieverButtons.vector_sql_query_from_db]}, \"\n        f\"button-2: {st.session_state[RetrieverButtons.vector_sql_query_with_llm]}, \"\n        f\"table: {selected_table}, \"\n        f\"content: {st.session_state.query_sql}\"\n    )\n    with place_holder.expander('🪵 Query Log', expanded=True):\n        try:\n            if query_type == RetrieverButtons.vector_sql_query_from_db:\n                callback = VectorSQLSearchDBCallBackHandler()\n                vector_sql_retriever: VectorSQLDatabaseChainRetriever = \\\n                    st.session_state[CHAINS_RETRIEVERS_MAPPING][selected_table][\"sql_retriever\"]\n                relevant_docs: List[Document] = vector_sql_retriever.get_relevant_documents(\n                    query=st.session_state.query_sql,\n                    callbacks=[callback]\n                )\n\n                callback.progress_bar.progress(\n                    value=1.0,\n                    text=\"[Question -> LLM -> SQL Statement -> MyScaleDB -> Results] Done! 
✅\"\n                )\n\n                st.markdown(f\"### Vector Search Results from `{selected_table}` \\n\"\n                            f\"> Here we get documents from MyScaleDB with given sql statement \\n\\n\")\n                display(\n                    pd.DataFrame(\n                        [{**d.metadata, 'abstract': d.page_content} for d in relevant_docs]\n                    )\n                )\n            elif query_type == RetrieverButtons.vector_sql_query_with_llm:\n                callback = VectorSQLSearchLLMCallBackHandler(table=selected_table)\n                vector_sql_chain: CustomRetrievalQAWithSourcesChain = \\\n                    st.session_state[CHAINS_RETRIEVERS_MAPPING][selected_table][\"sql_chain\"]\n                chain_results = vector_sql_chain(\n                    inputs=st.session_state.query_sql,\n                    callbacks=[callback]\n                )\n\n                callback.progress_bar.progress(\n                    value=1.0,\n                    text=\"[Question -> LLM -> SQL Statement -> MyScaleDB -> \"\n                         \"(Question,Results) -> LLM -> Results] Done! ✅\"\n                )\n\n                documents_reference: List[Document] = chain_results[\"source_documents\"]\n                st.markdown(f\"### Vector Search Results from `{selected_table}` \\n\"\n                            f\"> Here we get documents from MyScaleDB with given sql statement \\n\\n\")\n                display(\n                    pd.DataFrame(\n                        [{**d.metadata, 'abstract': d.page_content} for d in documents_reference]\n                    )\n                )\n                st.markdown(\n                    f\"### Answer from LLM \\n\"\n                    f\"> The response of the LLM when given the vector search results. 
\\n\\n\"\n                )\n                st.write(chain_results['answer'])\n                st.markdown(\n                    f\"### References from `{selected_table}`\\n\"\n                    f\"> The documents used by the LLM \\n\\n\"\n                )\n                if len(chain_results['sources']) == 0:\n                    st.write(\"No documents were used by the LLM.\")\n                else:\n                    display(\n                        dataframe=pd.DataFrame(\n                            [{**d.metadata, 'abstract': d.page_content} for d in chain_results['sources']]\n                        ),\n                        columns_=['ref_id'] + MYSCALE_TABLES[selected_table].must_have_col_names,\n                        index='ref_id'\n                    )\n            else:\n                raise NotImplementedError(f\"Unsupported query type: {query_type}\")\n            st.markdown(DIVIDER_HTML, unsafe_allow_html=True)\n        except Exception as e:\n            st.write('Oops 😵 Something bad happened...')\n            raise e\n\n"
  },
  {
    "path": "app/backend/types/__init__.py",
    "content": ""
  },
  {
    "path": "app/backend/types/chains_and_retrievers.py",
    "content": "from typing import Dict\nfrom dataclasses import dataclass\nfrom typing import List, Any\nfrom langchain.retrievers import SelfQueryRetriever\nfrom langchain_experimental.retrievers.vector_sql_database import VectorSQLDatabaseChainRetriever\n\nfrom backend.chains.retrieval_qa_with_sources import CustomRetrievalQAWithSourcesChain\n\n\n@dataclass\nclass MetadataColumn:\n    name: str\n    desc: str\n    type: str\n\n\n@dataclass\nclass ChainsAndRetrievers:\n    metadata_columns: List[MetadataColumn]\n    retriever: SelfQueryRetriever\n    chain: CustomRetrievalQAWithSourcesChain\n    sql_retriever: VectorSQLDatabaseChainRetriever\n    sql_chain: CustomRetrievalQAWithSourcesChain\n\n    def to_dict(self) -> Dict[str, Any]:\n        return {\n            \"metadata_columns\": self.metadata_columns,\n            \"retriever\": self.retriever,\n            \"chain\": self.chain,\n            \"sql_retriever\": self.sql_retriever,\n            \"sql_chain\": self.sql_chain\n        }\n\n\n"
  },
  {
    "path": "app/backend/types/global_config.py",
    "content": "from dataclasses import dataclass\nfrom typing import Optional\n\n\n@dataclass\nclass GlobalConfig:\n    openai_api_base: Optional[str] = \"\"\n    openai_api_key: Optional[str] = \"\"\n\n    auth0_client_id: Optional[str] = \"\"\n    auth0_domain: Optional[str] = \"\"\n\n    myscale_user: Optional[str] = \"\"\n    myscale_password: Optional[str] = \"\"\n    myscale_host: Optional[str] = \"\"\n    myscale_port: Optional[int] = 443\n\n    query_model: Optional[str] = \"\"\n    chat_model: Optional[str] = \"\"\n\n    untrusted_api: Optional[str] = \"\"\n    myscale_enable_https: Optional[bool] = True\n"
  },
  {
    "path": "app/backend/types/table_config.py",
    "content": "from typing import Callable\nfrom langchain.chains.query_constructor.schema import AttributeInfo\nfrom langchain.prompts import PromptTemplate\nfrom dataclasses import dataclass\nfrom typing import List\n\n\n@dataclass\nclass TableConfig:\n    database: str\n    table: str\n    table_contents: str\n    # column names\n    must_have_col_names: List[str]\n    vector_col_name: str\n    text_col_name: str\n    metadata_col_name: str\n    # hint for UI\n    hint: Callable\n    hint_sql: Callable\n    # for langchain\n    doc_prompt: PromptTemplate\n    metadata_col_attributes: List[AttributeInfo]\n    emb_model: Callable\n    tool_desc: tuple\n"
  },
  {
    "path": "app/backend/vector_store/__init__.py",
    "content": ""
  },
  {
    "path": "app/backend/vector_store/myscale_without_metadata.py",
    "content": "from typing import Any, Optional, List\n\nfrom langchain.docstore.document import Document\nfrom langchain.embeddings.base import Embeddings\nfrom langchain.vectorstores.myscale import MyScale, MyScaleSettings\n\nfrom logger import logger\n\n\nclass MyScaleWithoutMetadataJson(MyScale):\n    def __init__(self, embedding: Embeddings, config: Optional[MyScaleSettings] = None, must_have_cols: List[str] = [],\n                 **kwargs: Any) -> None:\n        try:\n            super().__init__(embedding, config, **kwargs)\n        except Exception as e:\n            logger.error(e)\n        self.must_have_cols: List[str] = must_have_cols\n\n    def _build_qstr(\n            self, q_emb: List[float], topk: int, where_str: Optional[str] = None\n    ) -> str:\n        q_emb_str = \",\".join(map(str, q_emb))\n        if where_str:\n            where_str = f\"PREWHERE {where_str}\"\n        else:\n            where_str = \"\"\n\n        q_str = f\"\"\"\n            SELECT {self.config.column_map['text']}, dist, {','.join(self.must_have_cols)}\n            FROM {self.config.database}.{self.config.table}\n            {where_str}\n            ORDER BY distance({self.config.column_map['vector']}, [{q_emb_str}]) \n                AS dist {self.dist_order}\n            LIMIT {topk}\n            \"\"\"\n        return q_str\n\n    def similarity_search_by_vector(self, embedding: List[float], k: int = 4, where_str: Optional[str] = None,\n                                    **kwargs: Any) -> List[Document]:\n        q_str = self._build_qstr(embedding, k, where_str)\n        try:\n            return [\n                Document(\n                    page_content=r[self.config.column_map[\"text\"]],\n                    metadata={k: r[k] for k in self.must_have_cols},\n                )\n                for r in self.client.query(q_str).named_results()\n            ]\n        except Exception as e:\n            logger.error(\n                
f\"\\033[91m\\033[1m{type(e)}\\033[0m \\033[95m{str(e)}\\033[0m\")\n            return []\n"
  },
  {
    "path": "app/logger.py",
    "content": "import logging\n\n\ndef setup_logger():\n    logger_ = logging.getLogger('chat-data')\n    logger_.setLevel(logging.INFO)\n    if not logger_.handlers:\n        console_handler = logging.StreamHandler()\n        console_handler.setLevel(logging.INFO)\n        formatter = logging.Formatter(\n            '%(asctime)s - %(filename)s - %(funcName)s - %(levelname)s - %(message)s - [Thread ID: %(thread)d]'\n        )\n        console_handler.setFormatter(formatter)\n        logger_.addHandler(console_handler)\n    return logger_\n\n\nlogger = setup_logger()\n"
  },
  {
    "path": "app/requirements.txt",
    "content": "langchain==0.2.1\nlangchain-community==0.2.1\nlangchain-core==0.2.1\nlangchain-experimental==0.0.59\nlangchain-openai==0.1.7\nsentence-transformers==2.2.2\nInstructorEmbedding\npandas\nstreamlit==1.36.0\nstreamlit-extras\nstreamlit-auth0-component\naltair==4.2.2\nclickhouse-connect\nopenai==1.35.3\nlark\ntiktoken\nsql-formatter\nsqlalchemy==1.4.48\nclickhouse-sqlalchemy\n"
  },
  {
    "path": "app/ui/__init__.py",
    "content": ""
  },
  {
    "path": "app/ui/chat_page.py",
    "content": "import datetime\nimport json\n\nimport pandas as pd\nimport streamlit as st\nfrom langchain_core.messages import HumanMessage, FunctionMessage\nfrom streamlit.delta_generator import DeltaGenerator\n\nfrom backend.chat_bot.json_decoder import CustomJSONDecoder\nfrom backend.constants.streamlit_keys import CHAT_CURRENT_USER_SESSIONS, EL_SESSION_SELECTOR, \\\n    EL_UPLOAD_FILES_STATUS, USER_PRIVATE_FILES, EL_BUILD_KB_WITH_FILES, \\\n    EL_PERSONAL_KB_NAME, EL_PERSONAL_KB_DESCRIPTION, \\\n    USER_PERSONAL_KNOWLEDGE_BASES, AVAILABLE_RETRIEVAL_TOOLS, EL_PERSONAL_KB_NEEDS_REMOVE, \\\n    CHAT_KNOWLEDGE_TABLE, EL_UPLOAD_FILES, EL_SELECTED_KBS\nfrom backend.constants.variables import DIVIDER_HTML, USER_NAME, RETRIEVER_TOOLS\nfrom backend.construct.build_chat_bot import build_chat_knowledge_table, initialize_session_manager\nfrom backend.chat_bot.chat import refresh_sessions, on_session_change_submit, refresh_agent, \\\n    create_private_knowledge_base_as_tool, \\\n    remove_private_knowledge_bases, add_file, clear_files, clear_history, back_to_main, on_chat_submit\n\n\ndef render_session_manager():\n    with st.expander(\"🤖 Session Management\"):\n        if CHAT_CURRENT_USER_SESSIONS not in st.session_state:\n            refresh_sessions()\n        st.markdown(\"Here you can update `session_id` and `system_prompt`\")\n        st.markdown(\"- Click empty row to add a new item\")\n        st.markdown(\"- If needs to delete an item, just click it and press `DEL` key\")\n        st.markdown(\"- Don't forget to submit your change.\")\n\n        st.data_editor(\n            data=st.session_state[CHAT_CURRENT_USER_SESSIONS],\n            num_rows=\"dynamic\",\n            key=\"session_editor\",\n            use_container_width=True,\n        )\n        st.button(\"⏫ Submit\", on_click=on_session_change_submit, type=\"primary\")\n\n\ndef render_session_selection():\n    with st.expander(\"✅ Session Selection\", expanded=True):\n        st.selectbox(\n         
   \"Choose a `session` to chat\",\n            options=st.session_state[CHAT_CURRENT_USER_SESSIONS],\n            index=None,\n            key=EL_SESSION_SELECTOR,\n            format_func=lambda x: x[\"session_id\"],\n            on_change=refresh_agent,\n        )\n\n\ndef render_files_manager():\n    with st.expander(\"📃 **Upload your personal files**\", expanded=False):\n        st.markdown(\"- Files will be parsed by [Unstructured API](https://unstructured.io/api-key).\")\n        st.markdown(\"- All files will be converted into vectors and stored in [MyScaleDB](https://myscale.com/).\")\n        st.file_uploader(label=\"⏫ **Upload files**\", key=EL_UPLOAD_FILES, accept_multiple_files=True)\n        # st.markdown(\"### Uploaded Files\")\n        st.dataframe(\n            data=st.session_state[CHAT_KNOWLEDGE_TABLE].list_files(st.session_state[USER_NAME]),\n            use_container_width=True,\n        )\n        st.session_state[EL_UPLOAD_FILES_STATUS] = st.empty()\n        col_1, col_2 = st.columns(2)\n        with col_1:\n            st.button(label=\"Upload files\", on_click=add_file)\n        with col_2:\n            st.button(label=\"Clear all files and tools\", on_click=clear_files)\n\n\ndef _render_create_personal_knowledge_bases(div: DeltaGenerator):\n    with div:\n        st.markdown(\"- If you haven't upload your personal files, please upload them first.\")\n        st.markdown(\"- Select some **files** to build your `personal knowledge base`.\")\n        st.markdown(\"- Once the your `personal knowledge base` is built, \"\n                    \"it will answer your questions using information from your personal **files**.\")\n        st.multiselect(\n            label=\"⚡️Select some files to build a **personal knowledge base**\",\n            options=st.session_state[USER_PRIVATE_FILES],\n            placeholder=\"You should upload some files first\",\n            key=EL_BUILD_KB_WITH_FILES,\n            format_func=lambda x: x[\"file_name\"],\n  
      )\n        st.text_input(\n            label=\"⚡️Personal knowledge base name\",\n            value=\"get_relevant_documents\",\n            key=EL_PERSONAL_KB_NAME\n        )\n        st.text_input(\n            label=\"⚡️Personal knowledge base description\",\n            value=\"Searches from some personal files.\",\n            key=EL_PERSONAL_KB_DESCRIPTION,\n        )\n        st.button(\n            label=\"Build 🔧\",\n            on_click=create_private_knowledge_base_as_tool\n        )\n\n\ndef _render_remove_personal_knowledge_bases(div: DeltaGenerator):\n    with div:\n        st.markdown(\"> Here are all your personal knowledge bases.\")\n        if USER_PERSONAL_KNOWLEDGE_BASES in st.session_state and len(st.session_state[USER_PERSONAL_KNOWLEDGE_BASES]) > 0:\n            st.dataframe(st.session_state[USER_PERSONAL_KNOWLEDGE_BASES])\n        else:\n            st.warning(\"You don't have any personal knowledge bases yet. Please create one.\")\n        st.multiselect(\n            label=\"Choose a personal knowledge base to delete\",\n            placeholder=\"Choose a personal knowledge base to delete\",\n            options=st.session_state[USER_PERSONAL_KNOWLEDGE_BASES],\n            format_func=lambda x: x[\"tool_name\"],\n            key=EL_PERSONAL_KB_NEEDS_REMOVE,\n        )\n        st.button(\"Delete\", on_click=remove_private_knowledge_bases, type=\"primary\")\n\n\ndef render_personal_tools_build():\n    with st.expander(\"🔨 **Build your personal knowledge base**\", expanded=True):\n        create_new_kb, kb_manager = st.tabs([\"Create personal knowledge base\", \"Personal knowledge base management\"])\n        _render_create_personal_knowledge_bases(create_new_kb)\n        _render_remove_personal_knowledge_bases(kb_manager)\n\n\ndef render_knowledge_base_selector():\n    with st.expander(\"🙋 **Select some knowledge bases to query**\", expanded=True):\n        st.markdown(\"- Knowledge bases come in two types: `public` and 
`private`.\")\n        st.markdown(\"- All users can access our `public` knowledge bases.\")\n        st.markdown(\"- Only you can access your `personal` knowledge bases.\")\n        options = st.session_state[RETRIEVER_TOOLS].keys()\n        if AVAILABLE_RETRIEVAL_TOOLS in st.session_state:\n            options = st.session_state[AVAILABLE_RETRIEVAL_TOOLS]\n        st.multiselect(\n            label=\"Select some knowledge base tool\",\n            placeholder=\"Please select some knowledge bases to query\",\n            options=options,\n            default=[\"Wikipedia + Self Querying\"],\n            key=EL_SELECTED_KBS,\n            on_change=refresh_agent,\n        )\n\n\ndef chat_page():\n    # initialize resources\n    build_chat_knowledge_table()\n    initialize_session_manager()\n\n    # render sidebar\n    with st.sidebar:\n        left, middle, right = st.columns([1, 1, 2])\n        with left:\n            st.button(label=\"↩️ Log Out\", help=\"log out and back to main page\", on_click=back_to_main)\n        with right:\n            st.markdown(f\"👤 `{st.session_state[USER_NAME]}`\")\n        st.markdown(DIVIDER_HTML, unsafe_allow_html=True)\n        render_session_manager()\n        render_session_selection()\n        render_files_manager()\n        render_personal_tools_build()\n        render_knowledge_base_selector()\n\n    # render chat history\n    if \"agent\" not in st.session_state:\n        refresh_agent()\n    for msg in st.session_state.agent.memory.chat_memory.messages:\n        speaker = \"user\" if isinstance(msg, HumanMessage) else \"assistant\"\n        if isinstance(msg, FunctionMessage):\n            with st.chat_message(name=\"from knowledge base\", avatar=\"📚\"):\n                st.write(\n                    f\"*{datetime.datetime.fromtimestamp(msg.additional_kwargs['timestamp']).isoformat()}*\"\n                )\n                st.write(\"Retrieved from knowledge base:\")\n                try:\n                    
st.dataframe(\n                        pd.DataFrame.from_records(\n                            json.loads(msg.content, cls=CustomJSONDecoder)\n                        ),\n                        use_container_width=True,\n                    )\n                except Exception as e:\n                    st.warning(e)\n                    st.write(msg.content)\n        else:\n            if len(msg.content) > 0:\n                with st.chat_message(speaker):\n                    # print(type(msg), msg.dict())\n                    st.write(\n                        f\"*{datetime.datetime.fromtimestamp(msg.additional_kwargs['timestamp']).isoformat()}*\"\n                    )\n                    st.write(f\"{msg.content}\")\n    st.session_state[\"next_round\"] = st.empty()\n    from streamlit import _bottom\n    with _bottom:\n        col1, col2 = st.columns([1, 16])\n        with col1:\n            st.button(\"🗑️\", help=\"Clean chat history\", on_click=clear_history, type=\"secondary\")\n        with col2:\n            st.chat_input(\"Input Message\", on_submit=on_chat_submit, key=\"chat_input\")\n"
  },
  {
    "path": "app/ui/home.py",
    "content": "import base64\n\nfrom streamlit_extras.add_vertical_space import add_vertical_space\nfrom streamlit_extras.card import card\nfrom streamlit_extras.colored_header import colored_header\nfrom streamlit_extras.mention import mention\nfrom streamlit_extras.tags import tagger_component\n\nfrom logger import logger\nimport os\n\nimport streamlit as st\nfrom auth0_component import login_button\n\nfrom backend.constants.variables import JUMP_QUERY_ASK, USER_INFO, USER_NAME, DIVIDER_HTML, DIVIDER_THIN_HTML\nfrom streamlit_extras.let_it_rain import rain\n\n\ndef render_home():\n    render_home_header()\n    # st.divider()\n    # st.markdown(DIVIDER_THIN_HTML, unsafe_allow_html=True)\n    add_vertical_space(5)\n    render_home_content()\n    # st.divider()\n    st.markdown(DIVIDER_THIN_HTML, unsafe_allow_html=True)\n    render_home_footer()\n\n\ndef render_home_header():\n    logger.info(\"render home header\")\n    st.header(\"ChatData - Your Intelligent Assistant\")\n    st.markdown(DIVIDER_THIN_HTML, unsafe_allow_html=True)\n    st.markdown(\"> [ChatData](https://github.com/myscale/ChatData) \\\n                     is developed by [MyScale](https://myscale.com/), \\\n                     it's an integration of [LangChain](https://www.langchain.com/) \\\n                     and [MyScaleDB](https://github.com/myscale/myscaledb)\")\n\n    tagger_component(\n        \"Keywords:\",\n        [\"MyScaleDB\", \"LangChain\", \"VectorSearch\", \"ChatBot\", \"GPT\", \"arxiv\", \"wikipedia\", \"Personal Knowledge Base 📚\"],\n        color_name=[\"darkslateblue\", \"green\", \"orange\", \"darkslategrey\", \"red\", \"crimson\", \"darkcyan\", \"darkgrey\"],\n    )\n    text, col1, col2, col3, _ = st.columns([1, 1, 1, 1, 4])\n    with text:\n        st.markdown(\"Related:\")\n    with col1.container():\n        mention(\n            label=\"streamlit\",\n            icon=\"streamlit\",\n            url=\"https://streamlit.io/\",\n            write=True\n        )\n    
with col2.container():\n        mention(\n            label=\"langchain\",\n            icon=\"🦜🔗\",\n            url=\"https://www.langchain.com/\",\n            write=True\n        )\n    with col3.container():\n        mention(\n            label=\"streamlit-extras\",\n            icon=\"🪢\",\n            url=\"https://github.com/arnaudmiribel/streamlit-extras\",\n            write=True\n        )\n\n\ndef _render_self_query_chain_content():\n    col1, col2 = st.columns([1, 1], gap='large')\n    with col1.container():\n        st.image(image='./assets/home_page_background_1.png',\n                 caption=None,\n                 width=None,\n                 use_column_width=True,\n                 clamp=False,\n                 channels=\"RGB\",\n                 output_format=\"PNG\")\n    with col2.container():\n        st.header(\"VectorSearch & SelfQuery with Sources\")\n        st.info(\"In this sample, you will learn how **LangChain** integrates with **MyScaleDB**.\")\n        st.markdown(\"\"\"This example demonstrates two methods for integrating MyScale into LangChain: [Vector SQL](https://api.python.langchain.com/en/latest/sql/langchain_experimental.sql.vector_sql.VectorSQLDatabaseChain.html) and [Self-querying retriever](https://python.langchain.com/v0.2/docs/integrations/retrievers/self_query/myscale_self_query/). For each method, you can choose one of the following options:\n\n1. `Retrieve from MyScaleDB ➡️` - The LLM (GPT) converts user queries into SQL statements with vector search, executes these searches in MyScaleDB, and retrieves relevant content.\n   \n2. 
`Retrieve and answer with LLM ➡️` - After retrieving relevant content from MyScaleDB, the user query along with the retrieved content is sent to the LLM (GPT), which then provides a comprehensive answer.\"\"\")\n        add_vertical_space(3)\n        _, middle, _ = st.columns([2, 1, 2], gap='small')\n        with middle.container():\n            st.session_state[JUMP_QUERY_ASK] = st.button(\"Try sample\", use_container_width=False, type=\"secondary\")\n\n\ndef _render_chat_bot_content():\n    col1, col2 = st.columns(2, gap='large')\n    with col1.container():\n        st.image(image='./assets/home_page_background_2.png',\n                 caption=None,\n                 width=None,\n                 use_column_width=True,\n                 clamp=False,\n                 channels=\"RGB\",\n                 output_format=\"PNG\")\n    with col2.container():\n        st.header(\"Chat Bot\")\n        st.info(\"Now you can try our chatbot, this chatbot is built with MyScale and LangChain.\")\n        st.markdown(\"- You need to go to [https://myscale-chatdata.hf.space/](https://myscale-chatdata.hf.space/) \"\n                    \"to log in successfully, otherwise the auth service will not work.\")\n        st.markdown(\"- You can upload your own PDF files and build your own knowledge base. \\\n                     (This is just a sample application. Please do not upload important or confidential files.)\")\n        st.markdown(\"- A default session will be assigned as your initial chat session. 
\\\n                     You can create and switch to other sessions to jump between different chat conversations.\")\n        add_vertical_space(1)\n        _, middle, _ = st.columns([1, 2, 1], gap='small')\n        with middle.container():\n            if USER_NAME not in st.session_state:\n                login_button(clientId=os.environ[\"AUTH0_CLIENT_ID\"],\n                             domain=os.environ[\"AUTH0_DOMAIN\"],\n                             key=\"auth0\")\n                # if user_info:\n                #     user_name = user_info.get(\"nickname\", \"default\") + \"_\" + user_info.get(\"email\", \"null\")\n                #     st.session_state[USER_NAME] = user_name\n                #     print(user_info)\n\n\ndef render_home_content():\n    logger.info(\"render home content\")\n    _render_self_query_chain_content()\n    add_vertical_space(3)\n    _render_chat_bot_content()\n\n\ndef render_home_footer():\n    logger.info(\"render home footer\")\n    st.write(\n        \"Please follow us on [Twitter](https://x.com/myscaledb) and [Discord](https://discord.gg/D2qpkqc4Jq)!\"\n    )\n    st.write(\n        \"For more details, please refer to [our repository on GitHub](https://github.com/myscale/ChatData)!\")\n    st.write(\"Our [privacy policy](https://myscale.com/privacy/), [terms of service](https://myscale.com/terms/)\")\n\n    # st.write(\n    #     \"Recommended to use the standalone version of Chat-Data, \"\n    #     \"available [here](https://myscale-chatdata.hf.space/).\"\n    # )\n\n    if st.session_state.auth0 is not None:\n        st.session_state[USER_INFO] = dict(st.session_state.auth0)\n        if 'email' in st.session_state[USER_INFO]:\n            email = st.session_state[USER_INFO][\"email\"]\n        else:\n            email = f\"{st.session_state[USER_INFO]['nickname']}@{st.session_state[USER_INFO]['sub']}\"\n        st.session_state[\"user_name\"] = email\n        del st.session_state.auth0\n        st.rerun()\n    if 
st.session_state.jump_query_ask:\n        st.rerun()\n"
  },
  {
    "path": "app/ui/retrievers.py",
    "content": "import streamlit as st\nfrom streamlit_extras.add_vertical_space import add_vertical_space\n\nfrom backend.constants.myscale_tables import MYSCALE_TABLES\nfrom backend.constants.variables import CHAINS_RETRIEVERS_MAPPING, RetrieverButtons\nfrom backend.retrievers.self_query import process_self_query\nfrom backend.retrievers.vector_sql_query import process_sql_query\nfrom backend.constants.variables import JUMP_QUERY_ASK, USER_NAME, USER_INFO\n\n\ndef back_to_main():\n    if USER_INFO in st.session_state:\n        del st.session_state[USER_INFO]\n    if USER_NAME in st.session_state:\n        del st.session_state[USER_NAME]\n    if JUMP_QUERY_ASK in st.session_state:\n        del st.session_state[JUMP_QUERY_ASK]\n\n\ndef _render_table_selector() -> str:\n    col1, col2 = st.columns(2)\n    with col1:\n        selected_table = st.selectbox(\n            label='Each public knowledge base is stored in a MyScaleDB table, which is read-only.',\n            options=MYSCALE_TABLES.keys(),\n        )\n        MYSCALE_TABLES[selected_table].hint()\n    with col2:\n        add_vertical_space(1)\n        st.info(f\"Here is your selected public knowledge base schema in MyScaleDB\",\n                icon='📚')\n        MYSCALE_TABLES[selected_table].hint_sql()\n\n    return selected_table\n\n\ndef render_retrievers():\n    st.button(\"⬅️ Back\", key=\"back_sql\", on_click=back_to_main)\n    st.subheader('Please choose a public knowledge base to search.')\n    selected_table = _render_table_selector()\n\n    tab_sql, tab_self_query = st.tabs(\n        tabs=['Vector SQL', 'Self-querying Retriever']\n    )\n\n    with tab_sql:\n        render_tab_sql(selected_table)\n\n    with tab_self_query:\n        render_tab_self_query(selected_table)\n\n\ndef render_tab_sql(selected_table: str):\n    st.warning(\n        \"When you input a query with filtering conditions, you need to ensure that your filters are applied only to \"\n        \"the metadata we provide. 
This table allows filters to be established on the following metadata fields:\",\n        icon=\"⚠️\")\n    st.dataframe(st.session_state[CHAINS_RETRIEVERS_MAPPING][selected_table][\"metadata_columns\"])\n\n    cols = st.columns([8, 3, 3, 2])\n    cols[0].text_input(\"Input your question:\", key='query_sql')\n    with cols[1].container():\n        add_vertical_space(2)\n        st.button(\"Retrieve from MyScaleDB ➡️\", key=RetrieverButtons.vector_sql_query_from_db)\n    with cols[2].container():\n        add_vertical_space(2)\n        st.button(\"Retrieve and answer with LLM ➡️\", key=RetrieverButtons.vector_sql_query_with_llm)\n\n    if st.session_state[RetrieverButtons.vector_sql_query_from_db]:\n        process_sql_query(selected_table, RetrieverButtons.vector_sql_query_from_db)\n\n    if st.session_state[RetrieverButtons.vector_sql_query_with_llm]:\n        process_sql_query(selected_table, RetrieverButtons.vector_sql_query_with_llm)\n\n\ndef render_tab_self_query(selected_table):\n    st.warning(\n        \"When you input a query with filtering conditions, you need to ensure that your filters are applied only to \"\n        \"the metadata we provide. This table allows filters to be established on the following metadata fields:\",\n        icon=\"⚠️\")\n    st.dataframe(st.session_state[CHAINS_RETRIEVERS_MAPPING][selected_table][\"metadata_columns\"])\n\n    cols = st.columns([8, 3, 3, 2])\n    cols[0].text_input(\"Input your question:\", key='query_self')\n\n    with cols[1].container():\n        add_vertical_space(2)\n        st.button(\"Retrieve from MyScaleDB ➡️\", key='search_self')\n    with cols[2].container():\n        add_vertical_space(2)\n        st.button(\"Retrieve and answer with LLM ➡️\", key='ask_self')\n\n    if st.session_state.search_self:\n        process_self_query(selected_table, RetrieverButtons.self_query_from_db)\n\n    if st.session_state.ask_self:\n        process_self_query(selected_table, RetrieverButtons.self_query_with_llm)\n"
  },
  {
    "path": "app/ui/utils.py",
    "content": "import streamlit as st\n\n\ndef display(dataframe, columns_=None, index=None):\n    if len(dataframe) > 0:\n        if index:\n            # set_index returns a new DataFrame, so keep the result\n            dataframe = dataframe.set_index(index)\n        if columns_:\n            # the index column is no longer a regular column after set_index\n            st.dataframe(dataframe[[c for c in columns_ if c in dataframe.columns]])\n        else:\n            st.dataframe(dataframe)\n    else:\n        st.write(\n            \"Sorry 😵 we didn't find any articles related to your query.\\n\\n\"\n            \"Maybe the LLM was too naughty and did not follow our instructions... \\n\\n\"\n            \"Please try again and use verbs that may match the datatype.\",\n            unsafe_allow_html=True\n        )\n"
  },
  {
    "path": "docs/self-query.md",
    "content": "# HOW-TO: Build a ChatPDF App over millions of documents with LangChain and MyScale in 30 Minutes\n\nChatting with GPT about a single academic paper is relatively straightforward by providing the document as the language model context. Chatting with millions of research papers is also simple... as long as you choose the right vector database.\n\n<br>\n<div style=\"text-align: center\">\n<img src=\"../assets/logo.png\" width=60%>\n</div>\n\nLarge language models (LLM) are powerful NLP tools. One of the most significant benefits of LLMs, such as ChatGPT, is that you can use them to build tools that allow you to interact (or chat) with documents, such as PDF copies of research or academic papers, based on their topics and content.\n\nMany implementations of chat-with-document apps already exist, like [ChatPaper](https://github.com/kaixindelele/ChatPaper), [OpenChatPaper](https://github.com/liuyixin-louis/OpenChatPaper), and [DocsMind](https://github.com/3Alan/DocsMind). But, many of these implementations seem complicated, with simplistic search utilities only featuring elementary keyword searches filtering on basic metadata such as year and subject.\n\nTherefore, developing a ChatPDF-like app to interact with millions of academic/research papers makes sense. 
You can chat with the data in your natural language, combining both semantic and structural attributes, asking questions such as \"What is a neural network?\" and providing additional qualifiers like \"Please use articles published by Geoffrey Hinton after 2018.\"\n\nThe primary purpose of this article is to help you build your own ChatPDF app that allows you to interact (chat) with millions of academic/research papers using LangChain and MyScale.\n\n> This app should take about 30 minutes to create.\n\nBut before we start, let’s look at the following diagrammatic workflow of the whole process:\n\n<br>\n<div style=\"text-align: center\">\n<img src=\"../assets/chatapp-workflow.png\" width=60%>\n</div>\n\n> Even though we describe how to develop this LLM-based chat app, we have a sample app on [GitHub](https://github.com/myscale/ChatData), including access to a [read-only vector database](../app/.streamlit/secrets.example.toml), further simplifying the app-creation process.\n\n## Prepare the Data\n\nAs described in this image, the first step is to prepare the data.\n\n> We recommend you to use our open database for this app. The credentials are in the example configuration: `$PROJECT_DIR/app/.streamlit/secrets.toml`. Or you can follow the instruction below to create your own database.  It takes about 20 minutes to create the database.\n\nWe have sourced our data: a usable list of abstracts and arXiv IDs from the Alexandria Index through the [Macrocosm website](https://alex.macrocosm.so/). Using this data and interrogating the arXiv Open API, we can significantly enhance the query experience by retrieving a much richer set of metadata, including year, subject, release date, category, and author.\n\n> We have prepared the data using the [arXiv Open API](https://info.arxiv.org/help/api/index.html) to simplify the app-creation process.\n\nNow that we have the data, let's dive into the next steps:\n\n## Create the Table\n\nThe first step is to create a table schema. 
Sign in to [myscale.com](http://myscale.com/) and create a free [cluster](http://console.myscale.com/).\n\nAfter creating the cluster, the next step is to create a table that is compatible with our MyScale VectorStore.\n\n> There will be a workspace sidebar on your left. This is the place where you can play with pure SQL.\n\nUse the following script to create a MyScale VectorStore table:\n\n```sql\nCREATE TABLE default.langchain (\n    `abstract` String,\n    `id` String,\n    `vector` Array(Float32),\n    `metadata` Object('JSON'),\n    CONSTRAINT vec_len CHECK length(vector) = 768)\nENGINE = ReplacingMergeTree ORDER BY id\n```\n\nYou can also create the table by initializing a MyScale VectorStore in LangChain with Python, as the following code sample shows:\n\n```python\nfrom langchain.vectorstores import MyScale, MyScaleSettings\n# fill in the remaining connection settings for your own cluster\nconfig = MyScaleSettings(host=\"<your-backend-url>\", port=8443, ...)\n# `embedding_function` is the embedding model you use to encode documents\ndoc_search = MyScale(embedding_function, config)\n```\n\nBoth these methods do the same job.\n\nGreat. Let’s move on to the next step.\n\n## Insert the Data\n\n> We have appended additional metadata, like the publication date and authors, to each ArXiv entry.\n\nOur data is hosted on [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html), supported by [Clickhouse table functions](https://clickhouse.com/docs/en/sql-reference/table-functions/s3).\nThe compressed `jsonl` file is available [here](https://myscale-demo.s3.ap-southeast-1.amazonaws.com/chat_arxiv/full.json.zst). 
You can also import data via partitioned dataset (113 parts) from our AWS S3 bucket (the URL will look like `https://myscale-demo.s3.ap-southeast-1.amazonaws.com/chat_arxiv/data.part*.0.jsonl.zst`) to your MyScale Cloud with [S3 table function](https://clickhouse.com/docs/en/sql-reference/table-functions/s3).\n\n> You can also upload the data onto Google Cloud Platform and use the same SQL insert query to import this data.\n\nTo insert data into this table, you still have the same options as when creating a VectorStore table: MyScale SQL workspace or LangChain.\n\n### MyScale SQL Workspace\n\nPaste the following SQL statement into the MyScale SQL workspace and click **Run**:\n\n```sql\nINSERT INTO langchain\nSELECT\n  *\nFROM\n  s3(\n    'https://myscale-demo.s3.ap-southeast-1.amazonaws.com/chat_arxiv/data.*.jsonl.zst',\n    'JSONEachRow',\n    'abstract String, id String, vector Array(Float32), metadata Object(''JSON'')',\n    'zstd'\n  )\n```\n\nThen you need to build the vector index with this SQL:\n\n```sql\nALTER TABLE langchain ADD VECTOR INDEX vec_idx vector TYPE MSTG('metric_type=Cosine')\n```\n\n### LangChain\n\nThe second option is to insert the data into the table using LangChain for better control over the data insertion process.\n\nAdd the following code snippet to your app’s code:\n\n```python\n# ! 
unzstd data-*.jsonl.zst\nimport json\nfrom langchain.docstore.document import Document\n\ndef str2doc(_str):\n    j = json.loads(_str)\n    return Document(page_content=j['abstract'], metadata=j['metadata'])\n\nwith open('func_call_data.jsonl') as f:\n    docs = [str2doc(l) for l in f.readlines()]\n```\n\n## Design the Query Pipeline\n\nMost LLM-based applications need an automated pipeline for querying and returning an answer to the query.\n\n> Chat-with-LLM apps must generally retrieve reference documents before querying their models (LLMs).\n\nLet's look at the step-by-step workflow describing how the app answers the user's queries, as illustrated in the following diagram:\n\n<br>\n<div style=\"text-align: center\">\n<img src=\"../asserts/../assets/overview.jpg\" width=60%>\n</div>\n\n1. **Ask for the user's input/questions.**\n   This input must be as concise as possible. In most cases, it should be at most several sentences.\n\n2. **Construct a DB query from the user's input.**\n   The query is simple for vector databases. All you need to do is to extract the relevant embedding from the vector database. However, for enhanced accuracy, it is advisable to filter your query.\n\n   For instance, let's assume the user only wants the latest papers rather than all the papers in the returned embedding, but the returned embedding includes all the research papers. By way of solving this challenge, you can add metadata filters to the query to filter out the correct information.\n\n3. **Parse the retrieved documents from VectorStore.**\n   The data returned from the vector store is not in a native format that the LLM understands. You must parse it and insert it into your prompt templates. Sometimes you need to add more metadata to these templates, like the date created, authors, or document categories. This metadata will help LLM improve the quality of its answer.\n\n4. 
**Ask the LLM.**\n   This process is straightforward as long as you are familiar with the LLM's API and have properly designed prompts.\n\n5. **Fetch the answer**\n  Returning the answer is straightforward for simple applications. But, if the question is complex, additional effort is required to provide more information to the user; for example, adding the LLM's source data. Additionally, adding reference numbers to the prompt can help you find the source and reduce your prompt's size by avoiding repeating content, such as the document's title.\n\nIn practice, LangChain has a good framework to work with. We used the following functions to build this pipeline:\n\n* `RetrievalQAWithSourcesChain`\n* `SelfQueryRetriever`\n\n### `SelfQueryRetriever`\n\nThis function defines the interaction between the VectorStore and your app. Let’s dive deeper into how a self-query retriever works, as illustrated in the following diagram:\n\n<br>\n<div style=\"text-align: center\">\n<img src=\"../assets/self-query.jpg\" width=60%>\n</div>\n\nLangChain’s `SelfQueryRetriever` defines a universal filter for every VectorStore, including several `comparators` for comparing values and `operators`, combining these conditions to form a filter. The LLM will generate a filter rule based on these `comparators` and `operators`. All VectorStore providers will implement a `FilterTranslator` to translate the given universal filter to the correct arguments that call the VectorStore.\n\nLangChain's universal solution provides a complete package for new operators, comparators, and vector store providers. However, you are limited to the pre-defined elements inside it.\n\n<!-- > Why don't you try and build a prompt filter yourself? This will help you eliminate LangChain's default filter translator, allowing you to implement customized filter comparisons. -->\n\nIn the prompt filter context, MyScale includes more powerful and flexible filters. 
We have added more data types, like lists and timestamps, and more functions, like string pattern matching and `CONTAIN` comparators for lists, offering more options for data storage and query design.\n\n> We contributed to LangChain's Self-Query retrievers to make them more powerful, resulting in self-query retrievers that provide more freedom to the LLM when designing the query.\n\nLook at [what else MyScale can do with metadata filters](https://myscale.com/blog/why-integrated-database-solution-can-boost-your-llm-apps/#filter-on-anything-without-constraints).\n\nHere is the code for it, written using LangChain:\n\n```python\nimport streamlit as st\nfrom langchain.chains.query_constructor.base import AttributeInfo\nfrom langchain.embeddings import HuggingFaceInstructEmbeddings\nfrom langchain.llms import OpenAI\nfrom langchain.retrievers.self_query.base import SelfQueryRetriever\nfrom langchain.vectorstores import MyScale\n\n# Assuming your data is ready on MyScale Cloud\nembeddings = HuggingFaceInstructEmbeddings()\ndoc_search = MyScale(embeddings)\n\n# Define metadata fields and their types\n# Descriptions are important: they tell the LLM how to use each metadata field.\nmetadata_field_info=[\n    AttributeInfo(\n        name=\"pubdate\",\n        description=\"The year the paper is published\",\n        type=\"timestamp\",\n    ),\n    AttributeInfo(\n        name=\"authors\",\n        description=\"List of author names\",\n        type=\"list[string]\",\n    ),\n    AttributeInfo(\n        name=\"title\",\n        description=\"Title of the paper\",\n        type=\"string\",\n    ),\n    AttributeInfo(\n        name=\"categories\",\n        description=\"arxiv categories to this paper\",\n        type=\"list[string]\"\n    ),\n    AttributeInfo(\n        name=\"length(categories)\",\n        description=\"length of arxiv categories to this paper\",\n        type=\"int\"\n    ),\n]\n\n# Now build a retriever with LLM, a vector store and your metadata info\nretriever = SelfQueryRetriever.from_llm(\n    OpenAI(openai_api_key=st.secrets['OPENAI_API_KEY'], temperature=0),\n    doc_search, \"Scientific papers indexes with abstracts\", metadata_field_info,\n    
use_original_query=True)\n```\n\n### `RetrievalQAWithSourcesChain`\n\nThis function constructs the prompts containing the documents.\n\nData should be formatted into LLM-readable strings, like JSON or Markdown, that contain the document’s information.\n\n<br>\n<div style=\"text-align: center\">\n<img src=\"../assets/chain.jpg\" width=60%>\n</div>\n\nAs highlighted above, once the document data has been retrieved from the vector store, it must be formatted into LLM-readable strings, like JSON or Markdown.\n\nLangChain uses the following chains to build these LLM-readable strings:\n\n* `MapReduceDocumentsChain`\n* `StuffDocumentsChain`\n\n`MapReduceDocumentsChain` gathers all the documents the vector store returns and normalizes them into a standard format. It maps the documents to a prompt template and concatenates them together. `StuffDocumentsChain` works on those formatted documents, inserting them as context with task descriptions as prefixes and examples as suffixes.\n\nAdd the following code snippet to your app’s code so that your app will format the vector store data into LLM-readable documents.\n\n```python\nchain = RetrievalQAWithSourcesChain.from_llm(\n        llm=OpenAI(openai_api_key=st.secrets['OPENAI_API_KEY'], temperature=0),\n        retriever=retriever,\n        return_source_documents=True,)\n```\n\n## Run the Chain\n\nWith these components, we can now search and answer the user's questions with a scalable vector store.\n\nTry it yourself!\n\n```python\nret = st.session_state.chain(st.session_state.query, callbacks=[callback])\n# You can find the answer from LLM in the field `answer`\nst.markdown(f\"### Answer from LLM\\n{ret['answer']}\\n### References\")\n# and source documents in `sources` and `source_documents`\ndocs = ret['source_documents']\n```\n\nNot responsive?\n\n### Add Callbacks\n\nThe chain works just fine, but you might have a complaint: It needs to be faster!\n\nYes, the chain will be slow as it will construct a filtered vector query (one 
LLM call), retrieve data from VectorStore and ask LLM (another LLM call). Consequently, the total execution time will be about 10~20 seconds.\n\nDon't worry; LangChain has your back. It includes [Callbacks](https://python.langchain.com/en/latest/modules/callbacks/getting_started.html?highlight=Callbacks) that you can use to increase your app's responsiveness. In our example, we added several callback functions to update a progress bar:\n\n```python\nclass ChatArXivAskCallBackHandler(StreamlitCallbackHandler):\n    def __init__(self) -> None:\n        # You will have a progress bar when this callback is initialized\n        self.progress_bar = st.progress(value=0.0, text='Searching DB...')\n        self.status_bar = st.empty()\n        self.prog_value = 0.0\n        # You can use chain names to control the progress\n        self.prog_map = {\n            'langchain.chains.qa_with_sources.retrieval.RetrievalQAWithSourcesChain': 0.2,\n            'langchain.chains.combine_documents.map_reduce.MapReduceDocumentsChain': 0.4,\n            'langchain.chains.combine_documents.stuff.StuffDocumentsChain': 0.8\n        }\n\n    def on_llm_start(self, serialized, prompts, **kwargs) -> None:\n        pass\n\n    def on_text(self, text: str, **kwargs) -> None:\n        pass\n\n    def on_chain_start(self, serialized, inputs, **kwargs) -> None:\n        # the name is in list, so you can join them in strings.\n        cid = '.'.join(serialized['id'])\n        if cid != 'langchain.chains.llm.LLMChain':\n            self.progress_bar.progress(value=self.prog_map[cid], text=f'Running Chain `{cid}`...')\n            self.prog_value = self.prog_map[cid]\n        else:\n            self.prog_value += 0.1\n            self.progress_bar.progress(value=self.prog_value, text=f'Running Chain `{cid}`...')\n\n    def on_chain_end(self, outputs, **kwargs) -> None:\n        pass\n```\n\nNow your app will have a pretty progress bar just like ours.\n\n<br>\n<div style=\"text-align: center\">\n<img 
src=\"../assets/demo.jpg\" width=60%>\n</div>\n\n## In Conclusion\n\nThis is how an LLM app should be built with LangChain!\n\nToday we provided a brief overview of how to build a simple LLM app that chats with the MyScale VectorStore, and also explained how to use chains in the query pipeline.\n\nWe hope this article helps you when you design your LLM-based app architecture from the ground up.\n\nYou can also ask for help on our [Discord server](https://discord.gg/D2qpkqc4Jq). We are happy to help, whether on vector databases, LLM apps, or other fantastic stuff. You are also welcome to use our open database to build your own apps! We believe you can build more awesome apps with this self-query retriever and MyScale! Happy Coding!\n\nSee you in the next article!\n"
  },
  {
    "path": "docs/vector-sql.md",
    "content": "# Teach Your LLM to Search Using Vector SQL and Answer With Facts From Database\n\nA vector database that supports Structured Query Language can store more than vectors. Common data types like timestamps and arrays can be accessed and filtered within the database, which improves the accuracy and efficiency of vector search queries. Accurate results from the database can teach LLMs to speak with facts, which reduces hallucination and enhances the quality and credibility of the LLM's answers.\n\n## What is Hallucination?\n\nLarge Language Models are advanced AI systems that can answer a wide range of questions. Although they provide informative responses on topics they know, they are not always accurate on unfamiliar topics. This phenomenon is known as **hallucination**.\n\nBefore we look at an example of an LLM hallucination, let's consider a definition of the term \"hallucination\" as described by [Wikipedia](https://en.wikipedia.org/wiki/Hallucination):\n\n> \"A hallucination is a perception in the absence of an external stimulus that has the qualities of a real perception.\"\n\nMoreover:\n\n> \"Hallucinations are vivid, substantial, and are perceived to be located in external objective space.\"\n\nIn other words, a hallucination is an error in (or a false) perception of something real or concrete. For example, a Large Language Model was asked what LLM hallucinations are, with the answer being:\n\n![LLM Hallucinations. source: [aruna-x](https://dev.to/aruna/how-to-minimize-llm-hallucinations-2el7)](../assets/hallucination.png)\n\nSo the question arises: how do we improve on (or fix) this result? The concise answer is to add facts to your question, such as providing the LLM definition before or after you ask the question.\n\nFor instance:\n\n> An LLM is a Large Language Model, an artificial neural network that models how humans talk and write. 
Please tell me, what is LLM hallucination?\n\nThe public domain answer to this question, provided by ChatGPT, is:\n\n![ChatGPT LLM Hallucinations Response](../assets/chatgpt-hallucination-response.png)\n\n**Note:** The reason for the first sentence, \"Apologies for the confusion in my earlier response,\" is that we asked ChatGPT our first question, what LLM hallucinations are, before giving it our second prompt: \"An LLM...\"\n\nThese additions have improved the quality of the answer. At least it no longer thinks an LLM hallucination is a \"Late-Life Migraine Accompaniment!\" 😆\n\n## External Knowledge Reduces Hallucinations\n\nAt this juncture, it is absolutely crucial to note that an LLM is not infallible nor the ultimate authority on all knowledge. LLMs are trained on large amounts of data and learn patterns in language, but they may not always have access to the most up-to-date information or have a comprehensive understanding of complex topics.\n\nWhat now? How do you increase the chance of reducing LLM hallucinations?\n\nThe solution to this problem is to include supporting documents to the query (or prompt) to guide the LLM toward a more accurate and informed response. Like humans, it needs to learn from these documents to answer your question accurately and correctly.\n\nHelpful documents can come from many sources, including a search engine like Google or Bing and a digital library like Arxiv, among others, providing an interface to search for relevant passages. Using a database is also a good choice, providing a more flexible and private query interface.\n\nKnowledge retrieved from sources must be relevant to the question/prompt. 
There are several ways to retrieve relevant documents, including:\n\n* **Keyword-based:** Searching for keywords in plain text, suitable for an exact match on terms.\n* **Vector search-based:** Searching for records closer to embeddings, helpful in searching for appropriate paraphrases or general documents.\n\nNowadays, vector searches are popular since they can solve paraphrase problems and calculate paragraph meanings. Vector search is not a one-size-fits-all solution; it should be paired with specific filters to maintain its performance, especially when searching massive volumes of records. For example, should you only want to retrieve knowledge about physics (as a subject), you must filter out all information about any other subjects. Thus, the LLM will not be confused by knowledge from other disciplines.\n\n## Automate the Whole Process with SQL... and Vector Search\n\nThe LLM should also learn to query data from its data sources before answering the questions, automating the whole process. Actually, LLMs are already capable of writing SQL queries and following instructions.\n\n![Vector Pipeline](../assets/pipeline.png)\n\nSQL is powerful and can be used to construct complex search queries. It supports many different data types and functions. And it allows us to write a vector search in SQL with `ORDER BY` and `LIMIT`, treating the similarity score between embeddings as a column `distance`. 
Pretty straightforward, isn't it?\n\n> See the next section, [What Vector SQL Looks Like](#what-vector-sql-looks-like), for more information on structuring a vector SQL query.\n\nThere are significant benefits to using vector SQL to build complex search queries, including:\n\n* Increased flexibility for data type and function support\n* Improved efficiency because SQL is highly optimized and executed inside the database\n* Is human-readable and easy to learn as it is an extension of standard SQL\n* Is LLM-friendly\n\n**Note:** Many SQL examples and tutorials are available on the Internet. LLMs are familiar with standard SQL as well as some of its dialects.\n\nApart from MyScale, many SQL database solutions like Clickhouse and PostgreSQL are adding vector search to their existing functionality, allowing users to use vector SQL and LLMs to answer questions on complex topics. Similarly, an increasing number of application developers are starting to integrate vector searches with SQL into their applications.\n\n## What Vector SQL Looks Like\n\nVector Structured Query Language (Vector SQL) is designed to teach LLMs how to query vector SQL databases and contains the following extra functions:\n\n* `DISTANCE(column, query_vector)`: This function compares the distance between the column of vectors and the query vector either exactly or approximately.\n* `NeuralArray(entity)`: This function converts an entity (for example, an image or a piece of text) into an embedding.\n\nWith these two functions, we can extend the standard SQL for vector search. 
For example, if you want to search for 10 relevant records to word `flower`, you can use the following SQL statement:\n\n```sql\nSELECT * FROM table \nORDER BY DISTANCE(vector, NeuralArray(flower))\nLIMIT 10\n```\n\nThe `DISTANCE` function comprises the following:\n\n* The inner function, `NeuralArray(flower)`, converts the word `flower` into an embedding.\n* This embedding is then serialized and injected into the `DISTANCE` function.\n\nVector SQL is an extended version of SQL that needs further translation based on the vector database used. For instance, many implementations have different names for the `DISTANCE` function. It is called `distance` in MyScale, and `L2Distance` or `CosineDistance` in Clickhouse. Additionally, based on the database, this function name will be translated differently.\n\n## How to teach an LLM to write Vector SQL\n\nNow that we understand the basic principles of vector SQL and its unique functions, let's use an LLM to help us to write a vector SQL query.\n\n### 1. Teach an LLM What Standard Vector SQL is\n\nFirst, we need to teach our LLM what standard vector SQL is. We aim to ensure that the LLM will do the following three things spontaneously when writing a vector SQL query:\n\n* Extract the keywords from our question/prompt. It could be an object, a concept, or a topic.\n* Decide which column to use to perform the similarity search. It should always choose a vector column for similarity.\n* Translate the rest of our question's constraints into valid SQL.\n\n### 2. Design the LLM Prompt\n\nHaving determined exactly what information the LLM requires to construct a vector SQL query, we can design the prompt as follows:\n\n```python\n# Here is an example of a vector SQL prompt\n_prompt = f\"\"\"\nYou are a {dialect} expert. 
Given an input question, first, create a syntactically correct MyScale query to run, look at the query results, and return the answer to the input question.\nThe {dialect} query has a vector distance function called `DISTANCE(column, array)` to compute relevance to the user's question and sort the feature array column by this relevance. \nWhen the query asks for {top_k} closest row, you must use this distance function to calculate the distance to the entity's array on the vector column and order by the distance to retrieve the relevant rows.\n\n*NOTICE*: `DISTANCE(column, array)` only accepts an array column as its first argument and a `NeuralArray(entity)` as its second argument. You also need a user-defined function called `NeuralArray(entity)` to retrieve the entity's array. \n\"\"\"\n```\n\nThis prompt should do its job. But the more examples you add, the better it will be, like using the following vector SQL-to-text pair as a prompt:\n\n**The SQL table create statement:**\n\n```sql\n------ table schema ------\nCREATE TABLE \"ChatPaper\" (\n abstract String, \n id String, \n vector Array(Float32), \n        categories Array(String), \n pubdate DateTime, \n title String, \n authors Array(String), \n primary_category String\n) ENGINE = ReplicatedReplacingMergeTree()\n ORDER BY id\n PRIMARY KEY id\n```\n\n**The question and answer:**\n\n```text\nQuestion: What is PaperRank? What is the contribution of these works? 
Use papers with more than 2 categories.\nSQLQuery: SELECT ChatPaper.title, ChatPaper.id, ChatPaper.authors FROM ChatPaper WHERE length(categories) > 2 ORDER BY DISTANCE(vector, NeuralArray(PaperRank contribution)) LIMIT {top_k}\n```\n\nThe more relevant examples you add to your prompt, the better the LLM becomes at building the correct vector SQL query.\n\nLastly, here are several extra tips to help you when designing your prompt:\n\n* Cover all possible functions that might appear in any questions asked.\n* Avoid monotonous (repetitive) questions.\n* Vary the table schema across examples, for instance by adding, removing, or modifying names and data types.\n* Keep the prompt's format consistent.\n\n## A Real-World Example: Using MyScale\n\nLet's now build [**a real-world example**](https://huggingface.co/spaces/myscale/ChatData), set out in the following steps:\n\n![A Real-World Example: Using MyScale](../assets/myscale-example.png)\n\n### Prepare the Database\n\nWe have prepared a playground for you with more than 2 million papers ready to query. 
You can access this data by adding the following Python code to your app.\n\n```python\nfrom sqlalchemy import create_engine\nMYSCALE_HOST = \"msc-950b9f1f.us-east-1.aws.myscale.com\"\nMYSCALE_PORT = 443\nMYSCALE_USER = \"chatdata\"\nMYSCALE_PASSWORD = \"myscale_rocks\"\n\nengine = create_engine(f'clickhouse://{MYSCALE_USER}:{MYSCALE_PASSWORD}@{MYSCALE_HOST}:{MYSCALE_PORT}/default?protocol=https')\n```\n\nIf you like, you can skip the following steps, where we create the table and insert its data using the MyScale console, and jump to where we play with vector SQL and [create the `SQLDatabaseChain`](#create-sqldatabasechain) to query the database.\n\n**Create the database table:**\n\n```sql\nCREATE TABLE default.ChatArXiv (\n    `abstract` String, \n    `id` String, \n    `vector` Array(Float32), \n    `metadata` Object('JSON'), \n    `pubdate` DateTime,\n    `title` String,\n    `categories` Array(String),\n    `authors` Array(String), \n    `comment` String,\n    `primary_category` String,\n    CONSTRAINT vec_len CHECK length(vector) = 768) \nENGINE = ReplacingMergeTree ORDER BY id SETTINGS index_granularity = 8192\n```\n\n**Insert the data:**\n\n```sql\nINSERT INTO ChatArXiv\nSELECT\n  abstract, id, vector, metadata,\n  parseDateTimeBestEffort(JSONExtractString(toJSONString(metadata), 'pubdate')) AS pubdate,\n  JSONExtractString(toJSONString(metadata), 'title') AS title,\n  arrayMap(x->trim(BOTH '\"' FROM x), JSONExtractArrayRaw(toJSONString(metadata), 'categories')) AS categories,\n  arrayMap(x->trim(BOTH '\"' FROM x), JSONExtractArrayRaw(toJSONString(metadata), 'authors')) AS authors,\n  JSONExtractString(toJSONString(metadata), 'comment') AS comment,\n  JSONExtractString(toJSONString(metadata), 'primary_category') AS primary_category\nFROM\n  s3(\n    'https://myscale-demo.s3.ap-southeast-1.amazonaws.com/chat_arxiv/data.part*.zst',\n    'JSONEachRow',\n    'abstract String, id String, vector Array(Float32), metadata Object(''JSON'')',\n    'zstd'\n  );\nALTER 
TABLE ChatArXiv ADD VECTOR INDEX vec_idx vector TYPE MSTG('metric_type=Cosine');\n```\n\n### Create the `SQLDatabaseChain`\n\nThis LangChain feature is currently under [MyScale tech preview](https://github.com/myscale/langchain/tree/preview). You can install it by executing the following installation script:\n\n```bash\npython3 -m venv .venv\nsource .venv/bin/activate\n# This is a technical preview of langchain from MyScale\npip3 install langchain@git+https://github.com/myscale/langchain.git@preview\n```\n\nOnce you have installed this feature, the next step is to use it to query the database, as the following Python code demonstrates:\n\n```python\nfrom sqlalchemy import create_engine, MetaData\nMYSCALE_HOST = \"msc-950b9f1f.us-east-1.aws.myscale.com\"\nMYSCALE_PORT = 443\nMYSCALE_USER = \"chatdata\"\nMYSCALE_PASSWORD = \"myscale_rocks\"\nOPENAI_API_KEY = \"sk-***\"  # replace with your own OpenAI API key\n\n# create connection to database\nengine = create_engine(f'clickhouse://{MYSCALE_USER}:{MYSCALE_PASSWORD}@{MYSCALE_HOST}:{MYSCALE_PORT}/default?protocol=https')\n\nfrom langchain.embeddings import HuggingFaceInstructEmbeddings\nfrom langchain.callbacks import StdOutCallbackHandler\nfrom langchain.chains.sql_database.parser import VectorSQLRetrieveAllOutputParser\nfrom langchain.sql_database import SQLDatabase\nfrom langchain.chains.sql_database.prompt import MYSCALE_PROMPT\nfrom langchain.llms import OpenAI\nfrom langchain.chains.sql_database.base import SQLDatabaseChain\nfrom langchain.prompts import PromptTemplate\n\n# this parser converts `NeuralArray()` into embeddings\noutput_parser = VectorSQLRetrieveAllOutputParser(\n    model=HuggingFaceInstructEmbeddings(model_name='hkunlp/instructor-xl')\n)\n\n# use the prompt above\nPROMPT = PromptTemplate(\n    input_variables=[\"input\", \"table_info\", \"top_k\"],\n    template=_prompt,\n)\n\n# bind the metadata to SQLAlchemy engine\nmetadata = MetaData(bind=engine)\n\n# create SQLDatabaseChain\nquery_chain = SQLDatabaseChain.from_llm(\n    # GPT-3.5 generates valid SQL better\n    llm=OpenAI(openai_api_key=OPENAI_API_KEY, 
temperature=0), \n    # use the predefined prompt, change it to your own prompt\n    prompt=MYSCALE_PROMPT, \n    # returns top 10 relevant documents\n    top_k=10,\n    # use result directly from DB\n    return_direct=True,\n    # use our database for retrieval\n    db=SQLDatabase(engine, None, metadata),\n    # convert `NeuralArray()` into embeddings\n    sql_cmd_parser=output_parser)\n\n# launch the chain and trace all chain calls in standard output\nquery_chain.run(\"Introduce some papers that use Generative Adversarial Networks published around 2019.\", \n                callbacks=[StdOutCallbackHandler()])\n```\n\n### Ask with `RetrievalQAWithSourcesChain`\n\nYou can also use this `SQLDatabaseChain` as a retriever. You can plug it into retrieval QA chains just like any other retriever in LangChain.\n\n```python\nfrom langchain.retrievers import SQLDatabaseChainRetriever\nfrom langchain.chains import RetrievalQAWithSourcesChain\nfrom langchain.chains.qa_with_sources.map_reduce_prompt import combine_prompt_template\nfrom langchain.chat_models import ChatOpenAI\nfrom langchain.prompts import PromptTemplate\n\nOPENAI_API_KEY = \"sk-***\"\n\n# define how to serialize the structured data from the database\ndocument_with_metadata_prompt = PromptTemplate(\n    input_variables=[\"page_content\", \"id\", \"title\", \"authors\", \"pubdate\", \"categories\"],\n    template=\"Content:\\n\\tTitle: {title}\\n\\tAbstract: {page_content}\\n\\t\" +\n             \"Authors: {authors}\\n\\tDate of Publication: {pubdate}\\n\\tCategories: {categories}\\nSOURCE: {id}\"\n)\n# define the prompt you use to ask the LLM\nCOMBINE_PROMPT = PromptTemplate(\n    template=combine_prompt_template, input_variables=[\"summaries\", \"question\"])\n\n# define a retriever with a SQLDatabaseChain\nretriever = SQLDatabaseChainRetriever(\n            sql_db_chain=query_chain, page_content_key=\"abstract\")\n\n# finally, the ask chain to organize all of these\nask_chain = RetrievalQAWithSourcesChain.from_chain_type(\n    ChatOpenAI(model_name='gpt-3.5-turbo-16k',\n                openai_api_key=OPENAI_API_KEY, temperature=0.6),\n    
retriever=retriever,\n    chain_type='stuff',\n    chain_type_kwargs={\n        'prompt': COMBINE_PROMPT,\n        'document_prompt': document_with_metadata_prompt,\n    }, return_source_documents=True)\n\n# run the chain and get the result from the LLM\nask_chain(\"Introduce some papers that use Generative Adversarial Networks published around 2019.\", \n    callbacks=[StdOutCallbackHandler()])\n```\n\nWe also provide a live demo on [**Hugging Face**](https://huggingface.co/spaces/myscale/ChatData), and the code is available on [**GitHub**](https://github.com/myscale/ChatData)! We used [**a customized Retrieval QA chain**](https://github.com/myscale/ChatData/blob/main/chains/arxiv_chains.py) to maximize the performance of our search and ask pipeline with LangChain!\n\n## In Conclusion\n\nIn reality, most LLMs hallucinate. The most practical way to reduce hallucination is to add extra facts (external knowledge) to your question. External knowledge is crucial to improving the performance of LLM systems, allowing for the efficient and accurate retrieval of answers. Every word counts, and you don't want to waste your money on unused information that is retrieved by inaccurate queries.\n\nHow?\n\nEnter Vector SQL, which lets you execute fine-grained vector searches to target and retrieve the required information.\n\nVector SQL is powerful and easy to learn for humans and machines. You can use many data types and functions to create complex queries. LLMs also handle Vector SQL well, because their training data includes many SQL references.\n\nLastly, Vector SQL can be translated to work with many vector databases and different embedding models. We believe that is the future of vector databases.\n\nAre you interested in what we are doing? Join us on [Discord](https://discord.gg/D2qpkqc4Jq) today!\n"
  }
]