[
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# poetry\n#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.\n#   This is especially recommended for binary packages to ensure reproducibility, and is more\n#   commonly ignored for libraries.\n#   
https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control\n#poetry.lock\n\n# pdm\n#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.\n#pdm.lock\n#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it\n#   in version control.\n#   https://pdm.fming.dev/#use-with-ide\n.pdm.toml\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# PyCharm\n#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can\n#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore\n#  and can be added to the global gitignore or merged into this file.  For a more nuclear\n#  option (not recommended) you can uncomment the following to ignore the entire idea folder.\n.idea/\n*.pkl\nmodels/"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2023 Jinghao Zhao\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# GPT-Code-Learner\nLearn a repo interactively with GPT. Ask questions and let GPT explain the code to you.\n\n![GPT-Code-Learner.jpg](docs%2FGPT-Code-Learner.jpg)\n\n### Try it out in your browser\n![GUI.jpg](docs%2FGUI.jpg)\n\n\n## Local LLM Support (Experimental)\nGPT-Code-Learner supports running LLM models locally. In general, GPT-Code-Learner uses [LocalAI](https://github.com/go-skynet/LocalAI) for a local private LLM and [Sentence Transformers](https://huggingface.co/sentence-transformers) for local embeddings.\n\nPlease refer to [Local LLM](docs/LocalLLM.md) for more details.\n\n<span style=\"color:red\">Note: Due to the current capability of local LLMs, the performance of GPT-Code-Learner is not as good as the online version.</span>\n\n## Installation\n\n1. Clone this repository and install the required packages:\n```\ngit clone https://github.com/JinghaoZhao/GPT-Code-Learner.git\npip install -r requirements.txt\n```\n2. Create a `.env` file to hold your API key:\n```\nOPENAI_API_KEY=sk-xxxxxx\nLLM_TYPE=\"OpenAI\"\nEMBEDDING_TYPE=\"OpenAI\"\n```\nIf you want to run the whole program locally, change the following lines in the `.env` file:\n```\nLLM_TYPE=\"local\"\nEMBEDDING_TYPE=\"local\"\n```\n3. Put the repo URL (e.g., a GitHub link) in the `Repo Link` textbox and click the `Analyze Code Repo` button in the GUI. Or manually clone the repo you want to learn into the `code_repo` folder:\n```\ncd code_repo\ngit clone <repo_url>\n```\n4. Run GPT-Code-Learner. If you use local LLM models, start the local model before running GPT-Code-Learner. Please refer to [Local LLM](docs/LocalLLM.md) for more details.\n```\npython run.py\n```\n\n5. Open your web browser at http://127.0.0.1:7860 and ask any questions about your repo.\n\n\n## Knowledge Base\nGPT-Code-Learner generates a vector database from the code repo as a knowledge base for answering repo-related questions. By default, it uses the source code as the knowledge base. More details can be found in [Knowledge Base](docs/KnowledgeBase.md).\n\n## Tool Planner\nThe core of GPT-Code-Learner is the tool planner. It leverages available tools to process the input and provide context.\n\nCurrently, the tool planner supports the following tools:\n\n- **Code_Searcher**: This tool searches the code repository for keywords (e.g., specific functions or variables) extracted from the user query.\n\n- **Repo_Parser**: This tool performs a fuzzy search over the vector database of the code repo. It provides context for questions about the general procedures in the repo.\n\nMore tools are under development. Feel free to contribute to this project!"
  },
  {
    "path": "code_learner.py",
    "content": "import gradio as gr\nimport json\nimport requests\nimport os\nfrom termcolor import colored\nfrom repo_parser import clone_repo, generate_or_load_knowledge_from_repo\nimport tool_planner\n\nllm_type = os.environ.get('LLM_TYPE', \"local\")\nOPENAI_API_KEY = os.environ.get(\"OPENAI_API_KEY\", \"null\")\nif llm_type == \"local\":\n    API_URL = \"http://localhost:8080/v1/chat/completions\"\n    model = \"ggml-gpt4all-j\"\nelse:\n    API_URL = \"https://api.openai.com/v1/chat/completions\"\n    model = \"gpt-3.5-turbo\"\n\ncode_repo_path = \"./code_repo\"\n\ninit_system_prompt = \"\"\"Now you are an expert programmer and teacher of a code repository. \n    You will be asked to explain the code for a specific task in the repo.\n    You will be provided with some code snippets or documents related to the question.\n    Please think through the explanation step-by-step.\n    Please answer the questions based on your knowledge, and you can also refer to the provided related code snippets.\n    The README.md file and the repo structure are also available for your reference.\n    If you need any details clarified, please ask questions until all issues are clarified.
\\n\\n\n\"\"\"\nsystem_prompt = init_system_prompt\n\n\ndef generate_response(system_msg, inputs, top_p, temperature, chat_counter, chatbot=[], history=[]):\n    orig_inputs = inputs\n\n    # Inputs are pre-processed with extra tools\n    inputs = tool_planner.user_input_handler(inputs)\n\n    print(\"Inputs Length: \", len(inputs))\n    # Truncate the input so it fits in the GPT model window size\n    if llm_type == \"local\":\n        token_limit = 2000\n    else:\n        token_limit = 8000\n    if len(inputs) > token_limit:\n        inputs = inputs[:token_limit]\n\n    headers = {\n        \"Content-Type\": \"application/json\",\n        \"Authorization\": f\"Bearer {OPENAI_API_KEY}\"\n    }\n\n    if system_msg.strip() == '':\n        initial_message = [{\"role\": \"user\", \"content\": f\"{inputs}\"}]\n        multi_turn_message = []\n    else:\n        initial_message = [{\"role\": \"system\", \"content\": system_msg},\n                           {\"role\": \"user\", \"content\": f\"{inputs}\"}]\n        multi_turn_message = [{\"role\": \"system\", \"content\": init_system_prompt}]\n\n    if chat_counter == 0:\n        payload = {\n            \"model\": model,\n            \"messages\": initial_message,\n            \"temperature\": temperature,\n            \"top_p\": top_p,\n            \"n\": 1,\n            \"stream\": True,\n            \"presence_penalty\": 0,\n            \"frequency_penalty\": 0,\n        }\n    else:\n        messages = multi_turn_message\n        for data in chatbot:\n            user = {\"role\": \"user\", \"content\": data[0]}\n            assistant = {\"role\": \"assistant\", \"content\": data[1]}\n            messages.extend([user, assistant])\n        temp = {\"role\": \"user\", \"content\": inputs}\n        messages.append(temp)\n\n        payload = {\n            \"model\": model,\n            \"messages\": messages,\n            \"temperature\": temperature,\n            \"top_p\": top_p,\n            \"n\": 1,\n
            \"stream\": True,\n            \"presence_penalty\": 0,\n            \"frequency_penalty\": 0,\n        }\n\n    chat_counter += 1\n    history.append(orig_inputs)\n    print(colored(\"Orig input from the user: \", \"green\"), colored(orig_inputs, \"green\"))\n    print(colored(\"Input with tools: \", \"blue\"), colored(inputs, \"blue\"))\n    response = requests.post(API_URL, headers=headers, json=payload, stream=True)\n    token_counter = 0\n    partial_words = \"\"\n\n    response_complete = False\n\n    counter = 0\n    for chunk in response.iter_lines():\n        if counter == 0:\n            counter += 1\n            continue\n\n        if response_complete:\n            print(colored(\"Response: \", \"yellow\"), colored(partial_words, \"yellow\"))\n\n        if chunk.decode():\n            chunk = chunk.decode()\n            if chunk.startswith(\"error:\"):\n                print(colored(\"Chunk: \", \"red\"), colored(chunk, \"red\"))\n\n            # Check if the chatbot is done generating the response\n            try:\n                if len(chunk) > 12 and \"finish_reason\" in json.loads(chunk[6:])['choices'][0]:\n                    response_complete = json.loads(chunk[6:])['choices'][0].get(\"finish_reason\", None) == \"stop\"\n            except Exception:\n                print(\"Error in response_complete check\")\n\n            try:\n                if len(chunk) > 12 and \"content\" in json.loads(chunk[6:])['choices'][0]['delta']:\n                    partial_words = partial_words + json.loads(chunk[6:])['choices'][0][\"delta\"][\"content\"]\n                    if token_counter == 0:\n                        history.append(\" \" + partial_words)\n                    else:\n                        history[-1] = partial_words\n                    chat = [(history[i], history[i + 1]) for i in range(0, len(history) - 1, 2)]\n                    token_counter += 1\n                    yield chat, history, chat_counter, response\n            except Exception:\n                print(\"Error in partial_words check\")\n\n\ndef reset_textbox():\n    return gr.update(value='')\n\n\ndef set_visible_false():\n    return gr.update(visible=False)\n\n\ndef set_visible_true():\n    return gr.update(visible=True)\n\n\ndef analyze_repo(repo_url, progress=gr.Progress()):\n    progress(0, desc=\"Starting\")\n    repo_information = clone_repo(repo_url, progress)\n\n    progress(0.6, desc=\"Building Knowledge Base\")\n    generate_or_load_knowledge_from_repo()\n\n    if repo_information is not None:\n        return init_system_prompt + repo_information, \"Analysis completed\"\n    else:\n        return init_system_prompt, \"Analysis failed\"\n\n\ndef main():\n    title = \"\"\"<h1 align=\"center\">GPT-Code-Learner</h1>\"\"\"\n\n    system_msg_info = \"\"\"A conversation could begin with a system message to gently instruct the assistant.\"\"\"\n\n    theme = gr.themes.Soft(text_size=gr.themes.sizes.text_md)\n\n    with gr.Blocks(\n            css=\"\"\"#col_container { margin-left: auto; margin-right: auto;} #chatbot {height: 520px; overflow: auto;}\"\"\",\n            theme=theme,\n            title=\"GPT-Code-Learner\",\n    ) as demo:\n        gr.HTML(title)\n\n        with gr.Column(elem_id=\"col_container\"):\n            with gr.Accordion(label=\"System message:\", open=False):\n                system_msg = gr.Textbox(\n                    label=\"Instruct the AI Assistant to set its behaviour\",\n                    info=system_msg_info,\n                    value=system_prompt\n                )\n                accordion_msg = gr.HTML(\n                    value=\"Refresh the app to reset the system message\",\n                    visible=False\n                )\n            # Text box for the repo link with a submit button\n            with gr.Row():\n                with gr.Column(scale=6):\n                    repo_url = gr.Textbox(\n                        placeholder=\"Repo Link\",\n
                        lines=1,\n                        label=\"Repo Link\"\n                    )\n                with gr.Column(scale=2):\n                    repo_link_btn = gr.Button(\"Analyze Code Repo\").style(full_width=True)\n                with gr.Column(scale=2):\n                    analyze_progress = gr.Textbox(label=\"Status\")\n\n            repo_link_btn.click(analyze_repo, [repo_url], [system_msg, analyze_progress])\n\n            with gr.Row():\n                with gr.Column(scale=10):\n                    chatbot = gr.Chatbot(\n                        label='GPT-Code-Learner',\n                        elem_id=\"chatbot\"\n                    )\n\n            state = gr.State([])\n            with gr.Row():\n                with gr.Column(scale=8):\n                    inputs = gr.Textbox(\n                        placeholder=\"What questions do you have for the repo?\",\n                        lines=1,\n                        label=\"Type an input and press Enter\"\n                    )\n                with gr.Column(scale=2):\n                    b1 = gr.Button(\"Run\").style(full_width=True)\n\n            with gr.Accordion(label=\"Examples\", open=True):\n                gr.Examples(\n                    examples=[\n                        [\"What is the usage of this repo?\"],\n                        [\"Which function launches the application in the repo?\"],\n                    ],\n                    inputs=inputs)\n\n            with gr.Accordion(\"Parameters\", open=False):\n                top_p = gr.Slider(minimum=0, maximum=1.0, value=0.5, step=0.05, interactive=True,\n                                  label=\"Top-p (nucleus sampling)\", )\n                temperature = gr.Slider(minimum=0, maximum=2.0, value=0.5, step=0.1, interactive=True,\n                                        label=\"Temperature\", )\n                chat_counter = gr.Number(value=0, visible=True, precision=0)\n\n        inputs.submit(generate_response, [system_msg, 
inputs, top_p, temperature, chat_counter, chatbot, state],\n                      [chatbot, state, chat_counter], )\n        b1.click(generate_response, [system_msg, inputs, top_p, temperature, chat_counter, chatbot, state],\n                 [chatbot, state, chat_counter], )\n\n        inputs.submit(set_visible_false, [], [system_msg])\n        b1.click(set_visible_false, [], [system_msg])\n        inputs.submit(set_visible_true, [], [accordion_msg])\n        b1.click(set_visible_true, [], [accordion_msg])\n\n        b1.click(reset_textbox, [], [inputs])\n        inputs.submit(reset_textbox, [], [inputs])\n\n    demo.queue(max_size=99, concurrency_count=20).launch(debug=True)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "code_searcher.py",
    "content": "import re\nimport subprocess\n\n\ndef extract_grep_output(line):\n    # Regular expressions to match the grep output lines\n    regex_colon = r'(.*):(\\d+):(.*)'\n    regex_dash = r'(.*?)-(\\d+)-(.*)'\n    match_colon = re.match(regex_colon, line)\n    match_dash = re.match(regex_dash, line)\n\n    if match_colon:\n        filename, line_number, line_content = match_colon.groups()\n        return [filename, line_number, line_content]\n    elif match_dash:\n        filename, line_number, line_content = match_dash.groups()\n        return [filename, line_number, line_content]\n    else:\n        return [\"\", \"\", line]\n\n\ndef search_function_with_context(function_name, before_lines=5, after_lines=10, search_dir=\"./code_repo\"):\n    command = [\n        \"grep\",\n        \"-r\",  # Recursive search\n        \"-n\",  # Print line numbers\n        f\"-B{before_lines}\",  # Show context before the match\n        f\"-A{after_lines}\",  # Show context after the match\n        f\"{function_name}\",  # The search pattern\n        search_dir\n    ]\n\n    # Run the command and capture the output\n    result = subprocess.run(command, capture_output=True, text=True)\n\n    # Split the output by lines\n    output_lines = result.stdout.splitlines()\n\n    # Group the lines by occurrence\n    occurrences = []\n    current_filename = None\n    current_start_line = None\n    current_lines = []\n    for line in output_lines:\n        if line.startswith(\"--\"):  # This line separates occurrences\n            if current_filename is not None:\n                occurrences.append((current_filename, current_start_line, \"\\n\".join(current_lines)))\n            current_lines = []\n        else:\n            current_filename, line_number, line_text = extract_grep_output(line)\n            if function_name in line_text:\n                current_start_line = line_number + \":\" + line_text\n            current_lines.append(line_text)\n\n    # Add the last occurrence if 
there is one\n    if current_filename is not None:\n        occurrences.append((current_filename, current_start_line, \"\\n\".join(current_lines)))\n\n    return occurrences\n\n\ndef get_function_context(function_name):\n    results = search_function_with_context(function_name)\n    output = \"\"\n    for filename, start_line, context in results:\n        output += f\"Filename: {filename}\\n\"\n        output += f\"Start line: {start_line}\\n\"\n        output += \"Context:\\n\"\n        output += context\n        output += \"\\n\\n\"\n    return output\n\n\nif __name__ == \"__main__\":\n    function_name = \"set_visible_true\"\n    results = search_function_with_context(function_name)\n\n    for filename, start_line, context in results:\n        print(f\"Filename: {filename}\")\n        print(f\"Start line: {start_line}\")\n        print(\"Context:\")\n        print(context)\n        print()\n"
  },
  {
    "path": "docs/KnowledgeBase.md",
    "content": "# Knowledge Base\n\nGPT-Code-Learner supports using a knowledge base to answer questions. By default, it will use the codebase as the knowledge base. \n\nThe knowledge base is powered by a vector database. GPT-Code-Learner supports two types of vector databases: local or cloud. By default, it will use the local version.\n\nThe local version uses [FAISS](https://github.com/facebookresearch/faiss), while the cloud version utilizes [Supabase](https://app.supabase.com/).\n\n## Supabase Setup\n\nFor the Supabase version, create a Supabase account and project at https://app.supabase.com/sign-in. Next, add your Supabase URL and key to the `.env` file. You can find them in the portal under Project/API.\n\n```\nSUPABASE_URL=https://xxxxxx.supabase.co\nSUPABASE_KEY=xxxxxx\n```\n\nCreate the default document table using the following SQL, which follows the format of the [langchain example](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/supabase.html).\n\n```postgresql\n-- Enable the pgvector extension to work with embedding vectors\ncreate extension vector;\n\n-- Create a table to store your documents\ncreate table documents (\nid bigserial primary key,\ncontent text, -- corresponds to Document.pageContent\nmetadata jsonb, -- corresponds to Document.metadata\nembedding vector(1536) -- 1536 works for OpenAI embeddings, change if needed\n);\n\nCREATE FUNCTION match_documents(query_embedding vector(1536), match_count int)\n   RETURNS TABLE(\n       id bigint,\n       content text,\n       metadata jsonb,\n       -- we return matched vectors to enable maximal marginal relevance searches\n       embedding vector(1536),\n       similarity float)\n   LANGUAGE plpgsql\n   AS $$\n   # variable_conflict use_column\nBEGIN\n   RETURN query\n   SELECT\n       id,\n       content,\n       metadata,\n       embedding,\n       1 -(documents.embedding <=> query_embedding) AS similarity\n   FROM\n       documents\n   ORDER BY\n       
documents.embedding <=> query_embedding\n   LIMIT match_count;\nEND;\n$$;\n```\n\nThe [knowledge_base.py](..%2Fknowledge_base.py) provides examples of how to use the knowledge base."
  },
  {
    "path": "docs/LocalLLM.md",
    "content": "# Using Local LLM Models\n\nGPT-Code-Learner uses [LocalAI](https://github.com/go-skynet/LocalAI) to run LLM models locally.\n\n## Installation\nHere are the general installation steps for macOS. Please refer to [LocalAI](https://github.com/go-skynet/LocalAI) for more details.\n\n```shell\n# install build dependencies\nbrew install cmake\nbrew install go\n\n# clone the repo\ngit clone https://github.com/go-skynet/LocalAI.git\n\ncd LocalAI\n\n# build the binary\nmake build\n\n# Download gpt4all-j to models/\nwget https://gpt4all.io/models/ggml-gpt4all-j.bin -O models/ggml-gpt4all-j\n\n# Use a template from the examples\ncp -rf prompt-templates/ggml-gpt4all-j.tmpl models/\n\n# Run LocalAI\n./local-ai --models-path ./models/ --debug\n\n# The API is now accessible at localhost:8080\ncurl http://localhost:8080/v1/models\n\ncurl http://localhost:8080/v1/chat/completions -H \"Content-Type: application/json\" -d '{\n     \"model\": \"ggml-gpt4all-j\",\n     \"messages\": [{\"role\": \"user\", \"content\": \"How are you?\"}],\n     \"temperature\": 0.9 \n   }'\n```\n\n## Running GPT-Code-Learner with Local LLM Models\nBefore running GPT-Code-Learner, please make sure LocalAI is running at localhost:8080.\n\n```shell\n./local-ai --models-path ./models/ --debug\n```\n\nThen, change the following line in the `.env` file:\n```\nLLM_TYPE=\"local\"\n```\n\nFinally, run GPT-Code-Learner:\n```\npython run.py\n```\n\n## Known Issues\n- The accuracy of local LLM models is not as good as the online version. We are still working on improving their performance.\n- The first message of the conversation is usually blocked in the local LLM models. Restarting GPT-Code-Learner may solve this issue."
  },
  {
    "path": "knowledge_base.py",
    "content": "import openai\nfrom dotenv import load_dotenv, find_dotenv\nimport os\nfrom supabase import create_client, Client\nfrom langchain.embeddings.openai import OpenAIEmbeddings\nfrom langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter\nfrom langchain.vectorstores import FAISS, SupabaseVectorStore\nfrom langchain.document_loaders import TextLoader, PyPDFLoader\nimport requests\nfrom bs4 import BeautifulSoup\nimport pickle\nfrom langchain import OpenAI\nfrom langchain.chains import VectorDBQAWithSourcesChain\nfrom langchain.embeddings.base import Embeddings\nfrom sentence_transformers import SentenceTransformer\nfrom termcolor import colored\n\n\nclass LocalHuggingFaceEmbeddings(Embeddings):\n    def __init__(self, model_id=\"all-mpnet-base-v2\"):\n        self.model = SentenceTransformer(model_id)\n\n    def embed_documents(self, texts):\n        embeddings = self.model.encode(texts)\n        return embeddings\n\n    def embed_query(self, text):\n        embedding = self.model.encode(text)\n        return list(map(float, embedding))\n\n\ndef load_documents(filenames):\n    text_splitter = RecursiveCharacterTextSplitter(\n        chunk_size=1500,\n        chunk_overlap=200,\n        length_function=len,\n    )\n    docs = []\n    for filename in filenames:\n        if filename.endswith(\".pdf\"):\n            loader = PyPDFLoader(filename)\n        else:\n            loader = TextLoader(filename)\n        documents = loader.load()\n        splits = text_splitter.split_documents(documents)\n        docs.extend(splits)\n        print(f\"Split {filename} into {len(splits)} chunks\")\n    return docs\n\n\ndef load_urls(urls):\n    text_splitter = CharacterTextSplitter(chunk_size=1500, separator=\"\\n\")\n    docs, metadatas = [], []\n    for url in urls:\n        html = requests.get(url).text\n        soup = BeautifulSoup(html, features=\"html.parser\")\n        text = soup.get_text()\n        lines = (line.strip() for line in text.splitlines())\n        page_content = '\\n'.join(line for line in lines if line)\n\n        splits = text_splitter.split_text(page_content)\n        docs.extend(splits)\n        metadatas.extend([{\"source\": url}] * len(splits))\n        print(f\"Split {url} into {len(splits)} chunks\")\n    return docs, metadatas\n\n\ndef load_code_chunks(chunks, filepath):\n    text_splitter = CharacterTextSplitter(chunk_size=1500, separator=\"\\n\")\n    docs, metadatas = [], []\n    for chunk in chunks:\n        splits = text_splitter.split_text(chunk)\n        docs.extend(splits)\n        metadatas.extend([{\"source\": filepath}] * len(splits))\n    print(f\"Split {filepath} into {len(docs)} pieces\")\n    return docs, metadatas\n\n\ndef local_vdb(knowledge, vdb_path=None):\n    embedding_type = os.environ.get('EMBEDDING_TYPE', \"local\")\n    if embedding_type == \"local\":\n        embedding = LocalHuggingFaceEmbeddings()\n    else:\n        embedding = OpenAIEmbeddings(disallowed_special=())\n    print(colored(\"Embedding documents...\", \"green\"))\n    faiss_store = FAISS.from_documents(knowledge[\"known_docs\"], embedding=embedding)\n    if vdb_path is not None:\n        with open(vdb_path, \"wb\") as f:\n            pickle.dump(faiss_store, f)\n\n    return faiss_store\n\n\ndef load_local_vdb(vdb_path):\n    with open(vdb_path, \"rb\") as f:\n        faiss_store = pickle.load(f)\n\n    return faiss_store\n\n\ndef supabase_vdb(knowledge):\n    supabase_url = os.environ.get(\"SUPABASE_URL\")\n    supabase_key = os.environ.get(\"SUPABASE_KEY\")\n    supabase: Client = create_client(supabase_url, supabase_key)\n\n    vector_store = SupabaseVectorStore(client=supabase, embedding=OpenAIEmbeddings(), table_name=\"documents\")\n    vector_store.add_documents(knowledge[\"known_docs\"])\n    vector_store.add_texts(knowledge[\"known_text\"][\"pages\"], metadatas=knowledge[\"known_text\"][\"metadatas\"])\n\n    return vector_store\n\n\nif __name__ == \"__main__\":\n    load_dotenv(find_dotenv())\n    openai.api_key = os.environ.get(\"OPENAI_API_KEY\", \"null\")\n\n    query = \"What is the usage of this repo?\"\n    files = [\"./README.md\"]\n    urls = [\"https://github.com/JinghaoZhao/GPT-Code-Learner\"]\n\n    known_docs = load_documents(files)\n    known_pages, metadatas = load_urls(urls)\n\n    knowledge_base = {\"known_docs\": known_docs, \"known_text\": {\"pages\": known_pages, \"metadatas\": metadatas}}\n\n    faiss_store = local_vdb(knowledge_base)\n    matched_docs = faiss_store.similarity_search(query)\n    for doc in matched_docs:\n        print(\"------------------------\\n\", doc)\n\n    supabase_store = supabase_vdb(knowledge_base)\n    matched_docs = supabase_store.similarity_search(query)\n    for doc in matched_docs:\n        print(\"------------------------\\n\", doc)\n\n    chain = VectorDBQAWithSourcesChain.from_llm(llm=OpenAI(temperature=0), vectorstore=faiss_store)\n    result = chain({\"question\": query})\n    print(\"FAISS result\", result)\n\n    chain = VectorDBQAWithSourcesChain.from_llm(llm=OpenAI(temperature=0), vectorstore=supabase_store)\n    result = chain({\"question\": query})\n    print(\"Supabase result\", result)\n"
  },
  {
    "path": "repo_parser.py",
    "content": "import os\nimport json\nimport openai\nfrom termcolor import colored\nfrom dotenv import load_dotenv, find_dotenv\nfrom knowledge_base import load_documents, load_code_chunks, supabase_vdb, local_vdb, load_local_vdb\nfrom collections import deque\nfrom pathlib import Path\nimport util\nimport subprocess\nimport gradio as gr\n\n\ndef clone_repo(git_url, progress=gr.Progress(), code_repo_path=\"./code_repo\"):\n    progress(0.1, desc=\"Cloning the repo...\")\n    print(\"Cloning the repo: \", git_url)\n    # Create the directory if it does not exist\n    if not os.path.exists(code_repo_path):\n        os.makedirs(code_repo_path)\n    try:\n        subprocess.check_call(['git', 'clone', git_url], cwd=code_repo_path)\n        print(f\"Successfully cloned {git_url} into {code_repo_path}\")\n    except subprocess.CalledProcessError as e:\n        print(f\"Error: {e.output}\")\n\n    progress(0.3, desc=\"Summarizing the repo...\")\n    readme_info = get_readme(code_repo_path)\n    if readme_info is not None:\n        readme_info = \"\"\"The README.md file is as follows: \"\"\" + readme_info + \"\\n\\n\"\n\n    progress(0.4, desc=\"Parsing repo structure...\")\n    repo_structure = get_repo_structure(code_repo_path)\n    if repo_structure is not None:\n        repo_structure = \"\"\"The repo structure is as follows: \"\"\" + repo_structure + \"\\n\\n\"\n\n    return readme_info + repo_structure\n\n\ndef generate_knowledge_from_repo(dir_path, ignore_list):\n    knowledge = {\"known_docs\": [], \"known_text\": {\"pages\": [], \"metadatas\": []}}\n    for root, dirs, files in os.walk(dir_path):\n        dirs[:] = [d for d in dirs if d not in ignore_list]  # modify dirs in-place\n        for file in files:\n            if file in ignore_list:\n                continue\n            filepath = os.path.join(root, file)\n            try:\n                # Use a general loader for code file parsing\n                knowledge[\"known_docs\"].extend(load_documents([filepath]))\n\n            except Exception as e:\n                print(f\"Failed to process {filepath} due to error: {str(e)}\")\n\n    return knowledge\n\n\n# Locate the cloned repo folder and its README in the code_repo folder\ndef find_repo_folder(directory):\n    # Find the name of the folder in the specified directory\n    folder_name = None\n    for item in os.listdir(directory):\n        item_path = os.path.join(directory, item)\n        if os.path.isdir(item_path):\n            folder_name = item\n            break\n    return os.path.join(directory, folder_name)\n\n\ndef find_readme(repo_folder):\n    # Search for the README file within the found folder\n    for filename in os.listdir(repo_folder):\n        if filename.lower().startswith('readme'):\n            readme_path = os.path.join(repo_folder, filename)\n            print(\"README found in folder:\", repo_folder)\n            return readme_path\n\n    print(\"README not found in folder:\", repo_folder)\n    return None\n\n\n# summarize the README file\ndef summarize_readme(readme_path):\n    if readme_path:\n        print(colored(\"Summarizing README...\", \"green\"))\n\n        system_prompt = \"\"\"You are an expert developer and programmer. \n            Please infer the programming languages from the README.\n            You are asked to summarize the README file of the code repository in detail.
\n            Provide enough information about the code repository.\n            Please also mention the framework used in the code repository.\n            \"\"\"\n        # Use a context manager so the file handle is closed promptly\n        with open(readme_path, \"r\", encoding=\"utf-8\", errors=\"ignore\") as readme_file:\n            readme_content = readme_file.read()\n        user_prompt = f'Here is the README content: {readme_content}'\n        return util.get_chat_response(system_prompt, user_prompt)\n    return None\n\n\ndef bfs_folder_search(text_length_limit=4000, folder_path=\"./code_repo\"):\n    if not Path(folder_path).is_dir():\n        return \"Invalid directory path\"\n\n    root = Path(folder_path).resolve()\n    file_structure = {str(root): {}}\n    queue = deque([(root, file_structure[str(root)])])\n\n    while queue:\n        current_dir, parent_node = queue.popleft()\n        try:\n            for path in current_dir.iterdir():\n                if path.is_dir():\n                    if str(path.name) == \".git\":\n                        continue\n                    parent_node[str(path.name)] = {\"files\": []}\n                    queue.append((path, parent_node[str(path.name)]))\n                else:\n                    if \"files\" not in parent_node:\n                        parent_node[\"files\"] = []\n                    parent_node[\"files\"].append(str(path.name))\n\n                # Check if we've exceeded the text length limit\n                file_structure_text = json.dumps(file_structure)\n                if len(file_structure_text) >= text_length_limit:\n                    return file_structure_text\n\n        except PermissionError:\n            # This can happen in directories the user doesn't have permission to read.\n            continue\n\n    return json.dumps(file_structure)\n\n\ndef get_readme(code_repo_path=\"./code_repo\"):\n    repo_folder = find_repo_folder(code_repo_path)\n    print(colored(\"Repo folder: \" + repo_folder, \"green\"))\n    readme_path = find_readme(repo_folder)\n    if readme_path is None:\n        return \"README not found\"\n    else:\n        summary = 
summarize_readme(readme_path)\n        print(colored(\"README Summary: \", \"green\"), colored(summary, \"green\"))\n        return summary\n\n\ndef get_repo_structure(code_repo_path=\"./code_repo\"):\n    return bfs_folder_search(4000, code_repo_path)\n\n\ndef get_repo_names(dir_path):\n    folder_names = [name for name in os.listdir(dir_path) if os.path.isdir(os.path.join(dir_path, name))]\n    concatenated_names = \"-\".join(folder_names)\n    return concatenated_names\n\n\ndef generate_or_load_knowledge_from_repo(dir_path=\"./code_repo\"):\n    vdb_path = \"./vdb-\" + get_repo_names(dir_path) + \".pkl\"\n    # check if vdb_path exists\n    if os.path.isfile(vdb_path):\n        print(colored(\"Local VDB found! Loading VDB from file...\", \"green\"))\n        vdb = load_local_vdb(vdb_path)\n    else:\n        print(colored(\"Generating VDB from repo...\", \"green\"))\n        ignore_list = ['.git', 'node_modules', '__pycache__', '.idea',\n                       '.vscode']\n        knowledge = generate_knowledge_from_repo(dir_path, ignore_list)\n        vdb = local_vdb(knowledge, vdb_path=vdb_path)\n    print(colored(\"VDB generated!\", \"green\"))\n    return vdb\n\n\ndef get_repo_context(query, vdb):\n    matched_docs = vdb.similarity_search(query, k=10)\n    output = \"\"\n    for idx, docs in enumerate(matched_docs):\n        output += f\"Context {idx}:\\n\"\n        output += str(docs)\n        output += \"\\n\\n\"\n    return output\n\n\nif __name__ == '__main__':\n    code_repo_path = \"./code_repo\"\n    load_dotenv(find_dotenv())\n    openai.api_key = os.environ.get(\"OPENAI_API_KEY\", \"null\")\n\n    print(get_repo_names(code_repo_path))\n\n    # Basic repo information\n    get_readme(code_repo_path)\n    print(colored(bfs_folder_search(4000, code_repo_path), \"yellow\"))\n\n    # Generate knowledge base\n    vdb = generate_or_load_knowledge_from_repo(\"./code_repo\")\n\n    # Search the knowledge base\n    query = \"How to use the knowledge base?\"\n   
 context = get_repo_context(query, vdb)\n    print(context)\n"
  },
  {
    "path": "requirements.txt",
    "content": "python-dotenv\nhupper\ngradio\ntermcolor\nopenai\ntenacity\nsupabase\nlangchain\ntiktoken\nbeautifulsoup4\nfaiss-cpu\npypdf\nchardet\nsentence-transformers"
  },
  {
    "path": "run.py",
    "content": "import hupper\nfrom dotenv import load_dotenv, find_dotenv\n\nif __name__ == '__main__':\n    load_dotenv(find_dotenv())\n    reloader = hupper.start_reloader('code_learner.main')"
  },
  {
    "path": "tool_planner.py",
    "content": "from code_searcher import get_function_context\nfrom repo_parser import generate_or_load_knowledge_from_repo, get_repo_context\nfrom termcolor import colored\nimport util\n\n\ndef tool_selection(input):\n    system_prompt = \"\"\"You are an expert developer and programmer. \"\"\"\n    user_prompt = \"\"\"\n        You need to act as a tool recommender according to the user's questions.\n        You are given a user question about the code repository.\n        You choose one of the following tools to help you answer the question.\n        Your answer should be only the name of the tool; no other words or symbols are allowed.\n\n        The tools are defined as follows:\n\n        - Code_Searcher: This module is designed to search for specific keywords in a code repository that are derived from a user's query. It is particularly beneficial when the user's question pertains to particular functions or variables. As an illustration, this tool could answer queries such as \"How do I utilize the function named 'extract_function_name'?\" or \"How should I apply the function 'def supabase_vdb()?'\".\n\n        - Repo_Parser: This module conducts a fuzzy search within a code repository, offering context for inquiries concerning general procedures and operations in the repository. The inquiries may be high-level, potentially involving multiple source code files and documents. For instance, this tool could handle queries like \"Which function is in charge of processing incoming messages?\" or \"How does the code manage the knowledge base?\".\n\n        - No_Tool: This is the default module that comes into play when the user's query doesn't have a direct connection to the code repository or when other tools can't provide a suitable answer. This module is particularly useful for handling generic programming queries that aren't specific to the codebase in question. 
For instance, it could address questions like \"How is the 'asyncio' library used in Python?\" or \"Can you explain the workings of smart pointers in C++?\".\n\n\n        Below are some example questions and answers:\n\n        - Question: How to use the function extract_function_name?\n        - Code_Searcher\n\n        - Question: How to use the function def supabase_vdb(knowledge_base):?\n        - Code_Searcher \n\n        - Question: How to create a knowledge base?\n        - Repo_Parser \n\n        - Question: How to use the knowledge base?\n        - Repo_Parser \n\n        - Question: How does this repo generate the UI interface?\n        - Repo_Parser \n        \n        - Question: How to use Text Splitters in this repo?\n        - Repo_Parser\n\n        - Question: How to use the python asyncio library?\n        - No_Tool\n\n        \"\"\" + f'Here is the user input: {input}'\n    return util.get_chat_response(system_prompt, user_prompt)\n\n\ndef extract_function_name(input):\n    system_prompt = \"\"\"You are an expert developer and programmer. 
\"\"\"\n    user_prompt = \"\"\"\n        You will handle user questions about the code repository.\n        Please extract the function or variable name that appears in the question.\n        Respond with only the name, without the parameters or any other words.\n        If both function and variable names are mentioned, only extract the function name.\n\n        Below are three examples:\n        - Question: How to use the function extract_function_name?\n        - Answer: extract_function_name\n\n        - Question: How to use the function def supabase_vdb(query, knowledge_base):?\n        - Answer: supabase_vdb \n\n        - Question: What is the usage of vdb?\n        - Answer: vdb \n\n        \"\"\" + f'Here is the user input: {input}'\n    return util.get_chat_response(system_prompt, user_prompt)\n\n\ndef user_input_handler(input):\n    tool = tool_selection(input).strip()  # the model may pad its answer with whitespace\n    print(colored(f\"Tool selected: {tool}\", \"green\"))\n    if tool == \"Code_Searcher\":\n        # extract the function or variable name from the input\n        function_name = extract_function_name(input)\n        print(function_name)\n        if function_name:\n            # search the function with context\n            context = get_function_context(function_name)\n            prompt = input + \"\\n\\n\" + \\\n                     f\"Here are some contexts of the function or variable {function_name}: \\n\\n\" + context\n            return prompt\n    elif tool == \"Repo_Parser\":\n        vdb = generate_or_load_knowledge_from_repo()\n        context = get_repo_context(input, vdb)\n        prompt = input + \"\\n\\n\" + \\\n                 f\"Here are some contexts about the question, which are ranked by the relevance to the question: \\n\\n\" + context\n        return prompt\n    else:\n        print(\"No tool is selected.\")\n    # Fall back to the raw input if no tool produced a prompt\n    return input\n\n\nif __name__ == \"__main__\":\n    # results = user_input_handler(\"What is the usage of the function traffic_interval?\")\n    # 
print(results)\n\n    results = user_input_handler(\"How to build a knowledge base?\")\n    print(results)\n"
  },
  {
    "path": "util.py",
    "content": "import os\nfrom dotenv import load_dotenv, find_dotenv\nfrom termcolor import colored\nfrom langchain.llms import OpenAI\nfrom langchain.chat_models import ChatOpenAI\nfrom langchain.schema import HumanMessage, SystemMessage\n\nload_dotenv(find_dotenv())\n\n\ndef get_chat_response(system_prompt, user_prompt):\n    # By default, use the local LLM\n    llm_type = os.environ.get('LLM_TYPE', \"local\")\n    if llm_type == \"local\":\n        return get_local_llm_response(system_prompt, user_prompt)\n    else:\n        return get_openai_response(system_prompt, user_prompt)\n\n\ndef get_local_llm_response(system_prompt, user_prompt, model=\"ggml-gpt4all-j\", temperature=0.9):\n    base_path = os.environ.get('OPENAI_API_BASE', 'http://localhost:8080/v1')\n    model_name = os.environ.get('MODEL_NAME', model)\n    llm = OpenAI(temperature=temperature, openai_api_base=base_path, model_name=model_name, openai_api_key=\"null\")\n    text = system_prompt + \"\\n\\n\" + user_prompt + \"\\n\\n\"\n    response = llm(text)\n    print(response)\n    return response\n\n\ndef get_openai_response(system_prompt, user_prompt, model=\"gpt-3.5-turbo\", temperature=0):\n    chat = ChatOpenAI(model_name=model, temperature=temperature)\n    messages = [\n        SystemMessage(content=system_prompt),\n        HumanMessage(content=user_prompt)\n    ]\n    response = chat(messages)\n    print(response)\n    return response.content\n"
  }
]