Full Code of upstash/degree-guru for AI

Repository: upstash/degree-guru
Branch: master
Commit: f063f69e0071
Files: 38
Total size: 50.0 KB

Directory structure:
degree-guru/

├── .eslintrc.json
├── .gitignore
├── .prettierignore
├── .prettierrc
├── README.md
├── degreegurucrawler/
│   ├── .gitignore
│   ├── Dockerfile
│   ├── degreegurucrawler/
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   ├── spiders/
│   │   │   ├── __init__.py
│   │   │   └── configurable.py
│   │   └── utils/
│   │       ├── config.py
│   │       ├── crawler.yaml
│   │       └── upstash_vector_store.py
│   ├── docker-compose.yml
│   ├── requirements.txt
│   └── scrapy.cfg
├── next.config.js
├── package.json
├── postcss.config.js
├── src/
│   ├── app/
│   │   ├── api/
│   │   │   └── guru/
│   │   │       └── route.tsx
│   │   ├── globals.css
│   │   ├── layout.tsx
│   │   ├── page.tsx
│   │   └── vectorstore/
│   │       ├── UpstashVectorStore.d.ts
│   │       └── UpstashVectorStore.js
│   ├── components/
│   │   ├── form.tsx
│   │   ├── message-loading.tsx
│   │   ├── message.tsx
│   │   ├── powered-by.tsx
│   │   └── upstash-logo.tsx
│   └── utils/
│       ├── const.ts
│       └── cx.ts
├── tailwind.config.ts
└── tsconfig.json

================================================
FILE CONTENTS
================================================

================================================
FILE: .eslintrc.json
================================================
{
  "extends": "next/core-web-vitals"
}


================================================
FILE: .gitignore
================================================
# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.

# dependencies
/node_modules
/.pnp
.pnp.js
.yarn/install-state.gz

# testing
/coverage

# next.js
/.next/
/out/

# production
/build

# misc
.DS_Store
*.pem
.idea

# debug
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# local env files
.env*.local

# vercel
.vercel

# typescript
*.tsbuildinfo
next-env.d.ts


================================================
FILE: .prettierignore
================================================
# Ignore artifacts:
degreegurucrawler
figs


================================================
FILE: .prettierrc
================================================
{
  "arrowParens": "always",
  "bracketSameLine": false,
  "bracketSpacing": true,
  "semi": true,
  "singleQuote": false,
  "jsxSingleQuote": false,
  "quoteProps": "as-needed",
  "trailingComma": "all",
  "singleAttributePerLine": false,
  "htmlWhitespaceSensitivity": "css",
  "vueIndentScriptAndStyle": false,
  "proseWrap": "preserve",
  "insertPragma": false,
  "printWidth": 80,
  "requirePragma": false,
  "tabWidth": 2,
  "useTabs": false,
  "embeddedLanguageFormatting": "auto",
  "jsxBracketSameLine": false,
  "fluid": false,
  "importOrderSeparation": true,
  "importOrderSortSpecifiers": true,
  "importOrderBuiltinModulesToTop": true,
  "importOrderParserPlugins": ["typescript", "jsx"]
}


================================================
FILE: README.md
================================================
# DegreeGuru

## Build a RAG Chatbot using Vercel AI SDK, Langchain, Upstash Vector and OpenAI

[![Deploy with Vercel](https://vercel.com/button)](https://vercel.com/new/clone?repository-url=https%3A%2F%2Fgithub.com%2Fupstash%2Fdegreeguru&env=UPSTASH_REDIS_REST_URL,UPSTASH_REDIS_REST_TOKEN,UPSTASH_VECTOR_REST_URL,UPSTASH_VECTOR_REST_TOKEN,OPENAI_API_KEY&demo-title=DegreeGuru%20Demo&demo-description=A%20Demo%20Showcasing%20the%20DegreeGuru%20App&demo-url=https%3A%2F%2Fdegreeguru.vercel.app%2F&demo-image=https%3A%2F%2Fupstash.com%2Ficons%2Ffavicon-32x32.png)

![overview](figs/overview.gif)

> [!NOTE]  
> **This project is a Community Project.**
>
> The project is maintained and supported by the community. Upstash may contribute but does not officially support or assume responsibility for it.

**DegreeGuru** is a project designed to teach you how to build your own AI RAG chatbot on custom data. Some of our favorite features:

- 🕷️ Built-in crawler that scrapes the website you point it to, automatically making this data available for the AI
- ⚡ Fast answers using Upstash Vector and real-time data streaming
- 🛡️ Includes rate limiting to prevent API abuse

This chatbot is trained on data from Stanford University as an example, but it is completely domain-agnostic. We've created this project so you can turn it into a chatbot with your very own data by simply modifying the `crawler.yaml` file.

## Overview

1. [Stack](#stack)
2. [Quickstart](#quickstart)
   1. [Crawler](#step-1-crawler)
   2. [Chatbot](#step-2-chatbot)
3. [Conclusion](#conclusion)
4. [Limitations](#limitations)

## Stack

- Crawler: [scrapy](https://scrapy.org/)
- Chatbot App: [Next.js](https://nextjs.org/)
- Vector DB: [Upstash](https://upstash.com/)
- LLM Orchestration: [Langchain.js](https://js.langchain.com)
- Generative Model: [OpenAI](https://openai.com/), [gpt-3.5-turbo-1106](https://platform.openai.com/docs/models)
- Embedding Model: [OpenAI](https://openai.com/), [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings)
- Text Streaming: [Vercel AI](https://vercel.com/ai)
- Rate Limiting: [Upstash](https://upstash.com/)

## Quickstart

For local development, we recommend forking this project and cloning the forked repository to your local machine by running the following command:

```bash
git clone git@github.com:[YOUR_GITHUB_ACCOUNT]/DegreeGuru.git
```

This project contains two primary components: the crawler and the chatbot. First, we'll take a look at how the crawler extracts information from any website you point it to. This data is automatically stored in an Upstash Vector database. If you already have a vector database available, the crawling stage can be skipped.

### Step 1: Crawler

![crawler-diagram](figs/how-this-project-works.png)

The crawler is developed using Python, by [initializing a Scrapy project](https://docs.scrapy.org/en/latest/intro/tutorial.html#creating-a-project) and implementing a [custom spider](https://github.com/upstash/degreeguru/blob/master/degreegurucrawler/degreegurucrawler/spiders/configurable.py). The spider is equipped with [the `parse_page` function](https://github.com/upstash/degreeguru/blob/master/degreegurucrawler/degreegurucrawler/spiders/configurable.py#L42), invoked each time the spider visits a webpage. This callback function splits the text on the webpage into chunks, generates vector embeddings for each chunk, and upserts those vectors into your Upstash Vector Database. Each vector stored in our database includes the original text and website URL as metadata.
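
In pure Python, the chunk-embed-upsert flow of `parse_page` can be sketched as follows. Everything here is illustrative: `extract_paragraphs`, `chunk`, `fake_embed`, and `build_upserts` are hypothetical stand-ins (the real spider uses Scrapy's XPath selectors, LangChain's `RecursiveCharacterTextSplitter`, and OpenAI embeddings):

```python
import re
import uuid
from typing import List, Tuple

def extract_paragraphs(html: str) -> str:
    # Stand-in for response.xpath('//p').getall(): pull out <p>...</p> blocks
    return "\n".join(re.findall(r"<p>(.*?)</p>", html, re.DOTALL))

def chunk(text: str, size: int = 1000, overlap: int = 100) -> List[str]:
    # Simplified sliding-window splitter; the project uses LangChain's
    # RecursiveCharacterTextSplitter with the same two parameters
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def fake_embed(chunks: List[str], dim: int = 1536) -> List[List[float]]:
    # Stand-in for the OpenAI embeddings call; real ada-002 vectors have 1536 dims
    return [[float(len(c))] * dim for c in chunks]

def build_upserts(html: str, url: str) -> List[Tuple[str, List[float], dict]]:
    chunks = chunk(extract_paragraphs(html), size=50, overlap=10)
    vectors = fake_embed(chunks)
    # Each vector keeps the chunk text and source URL as metadata,
    # mirroring what UpstashVectorStore.add stores per vector
    return [
        (str(uuid.uuid4())[:8], vec, {"text": c, "url": url})
        for c, vec in zip(chunks, vectors)
    ]

upserts = build_upserts("<p>Stanford offers many degree programs.</p>", "https://example.edu")
```

The tuples produced here have the same shape as those passed to `index.upsert` in `upstash_vector_store.py`: an id, an embedding, and a metadata dictionary.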

<br/>

To run the crawler, follow these steps:

> [!TIP]
> If you have Docker installed, you can skip the "Configure Environment Variables" and "Install Required Python Libraries" sections. Instead, you can simply update the environment variables in [docker-compose.yml](https://github.com/upstash/DegreeGuru/blob/master/degreegurucrawler/docker-compose.yml) and run `docker-compose up`. This will create a container running our crawler. Don't forget to configure the crawler as explained in the following sections!

<details>

<summary>Configure Environment Variables</summary>
Before we can run our crawler, we need to configure environment variables. They let us securely store sensitive information, such as the API keys we need to communicate with OpenAI or Upstash Vector.

If you don't already have an Upstash Vector Database, create one [here](https://console.upstash.com/vector) and set the vector dimensions to 1536. We use 1536 because that is the output dimension of the `text-embedding-ada-002` model we will use for embeddings.

![vector-db-create](figs/vector-db-create.png)

The following environment variables should be set:

```
# Upstash Vector credentials retrieved here: https://console.upstash.com/vector
UPSTASH_VECTOR_REST_URL=****
UPSTASH_VECTOR_REST_TOKEN=****

# OpenAI key retrieved here: https://platform.openai.com/api-keys
OPENAI_API_KEY=****
```

</details>

<details>
<summary>Install Required Python Libraries</summary>

To install the libraries, we suggest setting up a virtual Python environment. Before starting the installation, navigate to the `degreegurucrawler` directory.

To set up a virtual environment, first install the `virtualenv` package:

```bash
pip install virtualenv
```

Then, create a new virtual environment and activate it:

```bash
# create environment
python3 -m venv venv

# activate environment
source venv/bin/activate
```

Finally, use [the `requirements.txt`](https://github.com/upstash/degreeguru/blob/master/degreegurucrawler/requirements.txt) to install the required libraries:

```bash
pip install -r requirements.txt
```

</details>



<br/>

After setting these environment variables, we are almost ready to run the crawler. The next step is configuring the crawler itself, mainly through the `crawler.yaml` file in the `degreegurucrawler/utils` directory. There is also one important setting to check in the `settings.py` file.

<details>
<summary>Configuring the crawler in `crawler.yaml`</summary>

The crawler.yaml has two main sections: `crawler` and `index`:

```yaml
crawler:
  start_urls:
    - https://www.some.domain.com
  link_extractor:
    allow: '.*some\.domain.*'
    deny:
      - "#"
      - '\?'
      - about
index:
  openAI_embedding_model: text-embedding-ada-002
  text_splitter:
    chunk_size: 1000
    chunk_overlap: 100
```

In the `crawler` section, there are two subsections:

- `start_urls`: the entrypoints our crawler will start searching from
- `link_extractor`: a dictionary passed as arguments to [`scrapy.linkextractors.LinkExtractor`](https://docs.scrapy.org/en/latest/topics/link-extractors.html). Some important parameters are:
  - `allow`: only extract links matching the given regex(es)
  - `allow_domains`: only extract links from the given domain(s)
  - `deny`: skip links matching the given regex(es)
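
As a rough illustration of how the `allow` and `deny` patterns interact, here is a hypothetical `keep_link` helper that mimics the filtering with plain `re` (the real Scrapy `LinkExtractor` is more involved):

```python
import re

def keep_link(url: str, allow: str, deny: list) -> bool:
    # Loosely mirrors LinkExtractor semantics: the URL must match the
    # allow pattern and must not match any of the deny patterns
    if not re.search(allow, url):
        return False
    return not any(re.search(pattern, url) for pattern in deny)

allow = r".*some\.domain.*"
deny = ["#", r"\?", "about"]

print(keep_link("https://www.some.domain.com/programs", allow, deny))  # True: allowed, no deny hit
print(keep_link("https://www.some.domain.com/about", allow, deny))     # False: matches deny "about"
print(keep_link("https://other.site.com/programs", allow, deny))       # False: allow pattern misses
```
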

In the `index` section, there are two subsections:

- `openAI_embedding_model`: The embedding model to use
- `text_splitter`: a dictionary passed as arguments to [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)

</details>

<details>
<summary>Configuring crawl depth via `settings.py`</summary>

The `settings.py` file has an important setting called `DEPTH_LIMIT`, which determines how many links deep our spider can crawl. A high value lets our crawler visit the deepest corners of a website, taking longer to finish with possibly diminishing returns. A low value could end the crawl before relevant information is extracted.
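
The effect of `DEPTH_LIMIT` can be illustrated with a toy breadth-first traversal over an in-memory link graph (a sketch only; `crawl` and the `site` graph are hypothetical, and Scrapy tracks depth per request internally):

```python
from collections import deque

def crawl(graph: dict, start: str, depth_limit: int) -> set:
    # Breadth-first traversal that stops following links beyond depth_limit,
    # analogous in spirit to Scrapy's DEPTH_LIMIT setting
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= depth_limit:
            continue  # links on this page are skipped, like Scrapy's depth filter
        for link in graph.get(url, []):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return visited

site = {
    "/": ["/a", "/b"],
    "/a": ["/a/deep"],
    "/a/deep": ["/a/deep/deeper"],
}
print(sorted(crawl(site, "/", depth_limit=2)))  # ['/', '/a', '/a/deep', '/b']
```

With `depth_limit=2`, the page at depth 3 (`/a/deep/deeper`) is never visited; raising the limit to 3 would include it.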

If pages are skipped due to the `DEPTH_LIMIT`, Scrapy logs the skipped URLs. Because this usually produces a lot of log output, we've disabled this logger in our project. If you'd like to keep it enabled, remove [the `"scrapy.spidermiddlewares.depth"` entry from the `disable_loggers` list in `degreegurucrawler/spiders/configurable.py`](https://github.com/upstash/degreeguru/blob/master/degreegurucrawler/degreegurucrawler/spiders/configurable.py#L22).

</details>

<br/>

That's it! 🎉 We've configured our crawler and are ready to run it using the following command:

```bash
scrapy crawl configurable --logfile degreegurucrawl.log
```

Note that running this might take time. You can monitor the progress by looking at the log file `degreegurucrawl.log` or the metrics of your Upstash Vector Database dashboard as shown below.

![vector-db](figs/vector-db.png)

> [!TIP]
> If you want to do a dry run (without creating embeddings or a vector database), simply comment out [the line where we pass the `callback` parameter to the `Rule` object in `ConfigurableSpider`](https://github.com/upstash/degreeguru/blob/master/degreegurucrawler/degreegurucrawler/spiders/configurable.py#L38)

### Step 2: Chatbot

In this section, we'll explore how to chat with the data we've just crawled and stored in our vector database. Here's an overview of what this will look like architecturally:

![chatbot-diagram](figs/infrastructure.png)

Before we can run the chatbot locally, we need to set the environment variables as shown in the [`.env.local.example`](https://github.com/upstash/degreeguru/blob/master/.env.local.example) file. Rename this file and remove the `.example` ending, leaving us with `.env.local`. 

Your `.env.local` file should look like this:
```
# Redis tokens retrieved here: https://console.upstash.com/
UPSTASH_REDIS_REST_URL=
UPSTASH_REDIS_REST_TOKEN=

# Vector database tokens retrieved here: https://console.upstash.com/vector
UPSTASH_VECTOR_REST_URL=
UPSTASH_VECTOR_REST_TOKEN=

# OpenAI key retrieved here: https://platform.openai.com/api-keys
OPENAI_API_KEY=
```

The first four variables are provided by Upstash; visit the links in the comments to retrieve these tokens. You can find the vector database tokens here:

![vector-db-read-only](figs/vector-db-read-only.png)

The `UPSTASH_REDIS_REST_URL` and `UPSTASH_REDIS_REST_TOKEN` variables are needed for rate limiting based on IP address. To get these secrets, go to the Upstash dashboard and create a Redis database.
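
On the Next.js side, the `@upstash/ratelimit` library handles this. Conceptually, a fixed-window limiter keyed by IP looks like the sketch below (`FixedWindowRateLimiter` is a hypothetical in-memory illustration, not the library's implementation):

```python
import time
from collections import defaultdict
from typing import Optional

class FixedWindowRateLimiter:
    """Allow at most `limit` requests per `window` seconds for each key (e.g. an IP)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Requests in the same time window share one counter per key
        bucket = (key, int(now // self.window))
        self.counts[bucket] += 1
        return self.counts[bucket] <= self.limit

limiter = FixedWindowRateLimiter(limit=2, window=10.0)
print([limiter.allow("1.2.3.4", now=t) for t in (0.0, 1.0, 2.0, 11.0)])
# → [True, True, False, True]: the third request in the same 10s window is rejected
```

A production limiter keeps these counters in a shared store like Redis so that all serverless instances see the same counts, which is exactly why the Redis credentials are needed here.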

![redis-create](figs/redis-create.png)

Finally, set the `OPENAI_API_KEY` environment variable, which you can get [here](https://platform.openai.com/api-keys). It allows us to vectorize user queries and generate responses.

That's the setup done! 🎉 We've configured our crawler and set all necessary environment variables. After running `npm install` to install the packages needed to run the app, we can start our chatbot using the command:

```bash
npm run dev
```

Visit `http://localhost:3000` to see your chatbot live in action!

### Step 3: Optional tweaking

You can use this chatbot in two different modes:

- Streaming Mode: Model responses are streamed to the web application in real time as the model generates them, making interaction with the app more fluid.
- Non-Streaming Mode: Model responses are shown to the user once they are fully generated. In this mode, DegreeGuru can explicitly provide the URLs of the web pages it uses as context.

<details>
<summary>Changing streaming mode</summary>

To turn streaming on or off, open the `route.tsx` file in `src/app/api/guru`. Setting [`returnIntermediateSteps`](https://github.com/upstash/degreeguru/blob/master/src/app/api/guru/route.tsx#L64) to `true` disables streaming; setting it to `false` enables streaming.

</details>

To customize the chatbot further, you can update the [AGENT_SYSTEM_TEMPLATE in your route.tsx file](https://github.com/upstash/DegreeGuru/blob/master/src/app/api/guru/route.tsx#L101) to better match your specific use case.

<br/>

## Conclusion

Congratulations on setting up your own AI chatbot! We hope you learned a lot by following along and seeing how the different parts of this app, namely the crawler, vector database, and LLM, play together. A major focus in developing this project was user-friendly design and adaptable settings, so you can tailor it to your own use case.

## Limitations

The above implementation works well for a variety of use cases. There are a few limitations we'd like to mention:

- Because the Upstash LangChain integration is a work-in-progress, the [`UpstashVectorStore`](https://github.com/upstash/degreeguru/blob/master/src/app/vectorstore/UpstashVectorStore.js) used with LangChain currently only implements the `similaritySearchVectorWithScore` method needed for our agent. Once we're done developing our native LangChain integration, we'll update this project accordingly.
- When the non-streaming mode is enabled, the message history can cause an error after the user enters another query.
- Our sources are available as URLs in the Upstash Vector Database, but we cannot show the sources explicitly when streaming. Instead, we provide the links to the chatbot as context and expect the bot to include the links in the response.


================================================
FILE: degreegurucrawler/.gitignore
================================================
*.log
__pycache__
degreegurudb

================================================
FILE: degreegurucrawler/Dockerfile
================================================

FROM python:3.8-slim

# copy directory into dockerfile
COPY . crawler
WORKDIR crawler

# install requirements
RUN pip install -r requirements.txt

CMD ["scrapy", "crawl", "configurable"]


================================================
FILE: degreegurucrawler/degreegurucrawler/__init__.py
================================================


================================================
FILE: degreegurucrawler/degreegurucrawler/items.py
================================================
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DegreegurucrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


================================================
FILE: degreegurucrawler/degreegurucrawler/middlewares.py
================================================
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class DegreegurucrawlerSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


class DegreegurucrawlerDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


================================================
FILE: degreegurucrawler/degreegurucrawler/pipelines.py
================================================
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class DegreegurucrawlerPipeline:
    def process_item(self, item, spider):
        return item


================================================
FILE: degreegurucrawler/degreegurucrawler/settings.py
================================================
# Scrapy settings for degreegurucrawler project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "degreegurucrawler"

SPIDER_MODULES = ["degreegurucrawler.spiders"]
NEWSPIDER_MODULE = "degreegurucrawler.spiders"

DEPTH_LIMIT = 3

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "degreegurucrawler (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "degreegurucrawler.middlewares.DegreegurucrawlerSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "degreegurucrawler.middlewares.DegreegurucrawlerDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    "degreegurucrawler.pipelines.DegreegurucrawlerPipeline": 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"


================================================
FILE: degreegurucrawler/degreegurucrawler/spiders/__init__.py
================================================
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


================================================
FILE: degreegurucrawler/degreegurucrawler/spiders/configurable.py
================================================

import os
import uuid
import logging

from ..utils.upstash_vector_store import UpstashVectorStore
from ..utils.config import text_splitter_config, crawler_config

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from langchain.text_splitter import RecursiveCharacterTextSplitter


class ConfigurableSpider(CrawlSpider):

    name = "configurable"
    start_urls = crawler_config["start_urls"]
    rules = (
        Rule(
            LinkExtractor(
                **crawler_config["link_extractor"]
            ),
            callback="parse_page",
            follow=True # to enable following links on each page when callback is provided
        ),
    )

    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)

        self.vectorstore = UpstashVectorStore(
            url=os.environ.get("UPSTASH_VECTOR_REST_URL"),
            token=os.environ.get("UPSTASH_VECTOR_REST_TOKEN")
        )

        print(
            f"Creating a vector index at {os.environ.get('UPSTASH_VECTOR_REST_URL')}.\n"
            f" Vector store info before crawl: {self.vectorstore.index.info()}"
        )

        self.text_splitter = RecursiveCharacterTextSplitter(
            **text_splitter_config
        )

        self._disable_loggers()

    def _disable_loggers(self):
        """
        disables some of the loggers to keep the log clean
        """

        disable_loggers = [
            "scrapy.spidermiddlewares.depth",
            "protego",
            "httpcore.http11",
            "httpx",
            "openai._base_client",
            "urllib3.connectionpool"
        ]
        for logger in disable_loggers:
            logging.getLogger(logger).setLevel(logging.WARNING)

    def parse_page(self, response):
        """
        Creates chunks out of the crawled webpage and adds them to the vector
        store.
        """

        # extract text content
        text_content = response.xpath('//p').getall()
        text_content = '\n'.join(text_content)

        # split documents
        documents = self.text_splitter.split_text(text_content)

        if len(documents) == 0:
            return

        # get source url
        link = response.url

        # add documents to vector store
        self.vectorstore.add(
            ids=[str(uuid.uuid4())[:8] for doc in documents],
            documents=documents,
            link=link
        )


================================================
FILE: degreegurucrawler/degreegurucrawler/utils/config.py
================================================
import os
import yaml

config_path = "degreegurucrawler/utils/crawler.yaml"
with open(config_path, 'r') as file:
    config = yaml.load(file, Loader=yaml.FullLoader)

embedding_function_config = {
    "api_key": os.environ.get('OPENAI_API_KEY'),
    "model_name": config["index"]["openAI_embedding_model"]
}

crawler_config = config["crawler"]
text_splitter_config = config["index"]["text_splitter"]


================================================
FILE: degreegurucrawler/degreegurucrawler/utils/crawler.yaml
================================================
crawler:
  start_urls:
    - https://www.some.domain.com
  link_extractor:
    allow: '.*some\.domain.*'
    deny:
      - "#"
      - '\?'
      - course
      - search
      - subjects
      - degree-charts
      - archive
      - news
      - alumni
      - announcement
      - people
      - topics
      - membership
      - section
      - about
      - letter
      - member
      - committee
      - book
      - year
      - project
      - user
      - page
      - event
      - resource
      - login
index:
  openAI_embedding_model: text-embedding-ada-002
  text_splitter:
    chunk_size: 1000
    chunk_overlap: 100

================================================
FILE: degreegurucrawler/degreegurucrawler/utils/upstash_vector_store.py
================================================
from typing import List
from openai import OpenAI
from upstash_vector import Index

class UpstashVectorStore:

    def __init__(
            self,
            url: str,
            token: str
    ):
        self.client = OpenAI()
        self.index = Index(url=url, token=token)

    def get_embeddings(
            self,
            documents: List[str],
            model: str = "text-embedding-ada-002"
    ) -> List[List[float]]:
        """
        Given a list of documents, generates and returns a list of embeddings
        """
        documents = [document.replace("\n", " ") for document in documents]
        embeddings = self.client.embeddings.create(
            input = documents,
            model=model
        )
        return [data.embedding for data in embeddings.data]

    def add(
            self,
            ids: List[str],
            documents: List[str],
            link: str
    ) -> None:
        """
        Adds a list of documents to the Upstash Vector Store
        """
        embeddings = self.get_embeddings(documents)
        self.index.upsert(
            vectors=[
                (
                    id,
                    embedding,
                    {
                        "text": document,
                        "url": link
                    }
                )
                for id, embedding, document
                in zip(ids, embeddings, documents)
            ]
        )


================================================
FILE: degreegurucrawler/docker-compose.yml
================================================
version: '3'

services:
  my_service:
    image: degreegurucrawler
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - UPSTASH_VECTOR_REST_URL=****
      - UPSTASH_VECTOR_REST_TOKEN=****
      - OPENAI_API_KEY=****


================================================
FILE: degreegurucrawler/requirements.txt
================================================
upstash_vector
scrapy==2.11.0
langchain==0.1.0
openai==1.7.2

================================================
FILE: degreegurucrawler/scrapy.cfg
================================================
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = degreegurucrawler.settings

[deploy]
#url = http://localhost:6800/
project = degreegurucrawler


================================================
FILE: next.config.js
================================================
/** @type {import('next').NextConfig} */
const nextConfig = {};

module.exports = nextConfig;


================================================
FILE: package.json
================================================
{
  "name": "degreeguru",
  "version": "0.1.0",
  "private": true,
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start",
    "lint": "next lint"
  },
  "dependencies": {
    "@tabler/icons-react": "^2.47.0",
    "@upstash/ratelimit": "^2.0.3",
    "@upstash/redis": "^1.34.0",
    "@upstash/vector": "^0.1.0-alpha-13",
    "ai": "^2.2.31",
    "langchain": "^0.1.5",
    "markdown-to-jsx": "^7.4.0",
    "next": "14.2.35",
    "react": "^18",
    "react-dom": "^18"
  },
  "devDependencies": {
    "@types/node": "^20",
    "@types/react": "^18",
    "@types/react-dom": "^18",
    "autoprefixer": "^10.0.1",
    "clsx": "^2.1.0",
    "eslint": "^8",
    "eslint-config-next": "14.0.4",
    "postcss": "^8",
    "prettier": "^3.2.5",
    "tailwind-merge": "^2.2.1",
    "tailwindcss": "^3.3.0",
    "typescript": "^5"
  }
}


================================================
FILE: postcss.config.js
================================================
module.exports = {
  plugins: {
    tailwindcss: {},
    autoprefixer: {},
  },
};


================================================
FILE: src/app/api/guru/route.tsx
================================================
import { NextRequest, NextResponse } from "next/server";

import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

import { Message as VercelChatMessage, StreamingTextResponse } from "ai";

import { AIMessage, ChatMessage, HumanMessage } from "@langchain/core/messages";
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { createRetrieverTool } from "langchain/tools/retriever";
import { AgentExecutor, createOpenAIFunctionsAgent } from "langchain/agents";
import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from "@langchain/core/prompts";

import { UpstashVectorStore } from "@/app/vectorstore/UpstashVectorStore";

export const runtime = "edge";

const redis = Redis.fromEnv();

const ratelimit = new Ratelimit({
  redis: redis,
  limiter: Ratelimit.slidingWindow(1, "10 s"),
});

const convertVercelMessageToLangChainMessage = (message: VercelChatMessage) => {
  if (message.role === "user") {
    return new HumanMessage(message.content);
  } else if (message.role === "assistant") {
    return new AIMessage(message.content);
  } else {
    return new ChatMessage(message.content, message.role);
  }
};

export async function POST(req: NextRequest) {
  try {
    const ip = req.ip ?? "127.0.0.1";
    const { success } = await ratelimit.limit(ip);

    if (!success) {
      const textEncoder = new TextEncoder();
      const customString =
        "Oops! It seems you've reached the rate limit. Please try again later.";

      const transformStream = new ReadableStream({
        async start(controller) {
          controller.enqueue(textEncoder.encode(customString));
          controller.close();
        },
      });
      return new StreamingTextResponse(transformStream);
    }

    const body = await req.json();

    /**
     * We represent intermediate steps as system messages for display purposes,
     * but don't want them in the chat history.
     */
    const messages = (body.messages ?? []).filter(
      (message: VercelChatMessage) =>
        message.role === "user" || message.role === "assistant",
    );
    const returnIntermediateSteps = false;
    const previousMessages = messages
      .slice(0, -1)
      .map(convertVercelMessageToLangChainMessage);
    const currentMessageContent = messages[messages.length - 1].content;

    const chatModel = new ChatOpenAI({
      modelName: "gpt-3.5-turbo-1106",
      temperature: 0.2,
      // IMPORTANT: Must set "streaming: true" on the model to enable streaming of the final output below.
      streaming: true,
    }, {
      apiKey: process.env.OPENAI_API_KEY,
      organization: process.env.OPENAI_ORGANIZATION
    });

    /**
     * Create vector store and retriever
     */
    const vectorstore = new UpstashVectorStore(new OpenAIEmbeddings());
    const retriever = vectorstore.asRetriever(
      {
        k: 6,
        searchType: "mmr",
        searchKwargs: {
          fetchK: 20,
          lambda: 0.5
        },
        verbose: false
      },
    );

    /**
     * Wrap the retriever in a tool to present it to the agent in a
     * usable form.
     */
    const tool = createRetrieverTool(retriever, {
      name: "search_latest_knowledge",
      description: "Searches and returns up-to-date general information.",
    });

    /**
     * Based on https://smith.langchain.com/hub/hwchase17/openai-functions-agent
     *
     * This default prompt for the OpenAI functions agent has a placeholder
     * where chat messages get inserted as "chat_history".
     *
     * You can customize this prompt yourself!
     */

    const AGENT_SYSTEM_TEMPLATE = `
    You are an artificial intelligence university bot named DegreeGuru, programmed to respond to inquiries about Stanford in a highly systematic and data-driven manner.

    Begin your answers with a formal greeting and sign off with a closing statement about promoting knowledge.

    Your responses should be precise and factual, with an emphasis on using the provided context and including links from the context whenever possible. If a link does not look like it belongs to Stanford, do not use that link or its information in your response.

    Don't repeat yourself in your responses even if some information is repeated in the context.
    
    Reply with apologies and tell the user that you don't know the answer only when you are faced with a question whose answer is not available in the context.
    `;

    const prompt = ChatPromptTemplate.fromMessages([
      ["system", AGENT_SYSTEM_TEMPLATE],
      new MessagesPlaceholder("chat_history"),
      ["human", "{input}"],
      new MessagesPlaceholder("agent_scratchpad"),
    ]);

    const agent = await createOpenAIFunctionsAgent({
      llm: chatModel,
      tools: [tool],
      prompt,
    });

    const agentExecutor = new AgentExecutor({
      agent,
      tools: [tool],
      // Set this if you want to receive all intermediate steps in the output of .invoke().
      returnIntermediateSteps,
    });

    if (!returnIntermediateSteps) {
      /**
       * Agent executors also allow you to stream back all generated tokens and steps
       * from their runs.
       *
       * This contains a lot of data, so we do some filtering of the generated log chunks
       * and only stream back the final response.
       *
       * This filtering is easiest with the OpenAI functions or tools agents, since final outputs
       * are log chunk values from the model that contain a string instead of a function call object.
       *
       * See: https://js.langchain.com/docs/modules/agents/how_to/streaming#streaming-tokens
       */
      const logStream = await agentExecutor.streamLog({
        input: currentMessageContent,
        chat_history: previousMessages,
      });

      const textEncoder = new TextEncoder();
      const transformStream = new ReadableStream({
        async start(controller) {
          for await (const chunk of logStream) {
            if (chunk.ops?.length > 0 && chunk.ops[0].op === "add") {
              const addOp = chunk.ops[0];
              if (
                addOp.path.startsWith("/logs/ChatOpenAI") &&
                typeof addOp.value === "string" &&
                addOp.value.length
              ) {
                controller.enqueue(textEncoder.encode(addOp.value));
              }
            }
          }
          controller.close();
        },
      });

      return new StreamingTextResponse(transformStream);
    } else {
      /**
       * Intermediate steps are the default outputs with the executor's `.stream()` method.
       * We could also pick them out from `streamLog` chunks.
       * They are generated as JSON objects, so streaming them is a bit more complicated.
       */
      const result = await agentExecutor.invoke({
        input: currentMessageContent,
        chat_history: previousMessages,
      });

      const urls = JSON.parse(
        `[${result.intermediateSteps[0]?.observation.replaceAll("}\n\n{", "}, {")}]`,
      ).map((source: { url: any }) => source.url);

      return NextResponse.json(
        {
          _no_streaming_response_: true,
          output: result.output,
          sources: urls,
        },
        { status: 200 },
      );
    }
  } catch (e: any) {
    console.log(e.message);
    return NextResponse.json({ error: e.message }, { status: 500 });
  }
}
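
The route throttles clients with `Ratelimit.slidingWindow(1, "10 s")` from `@upstash/ratelimit`. A rough Python sketch of the sliding-window idea, as an in-memory stand-in (this is an exact sliding-log variant; the Redis-backed library approximates it by interpolating counts across two fixed windows):

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_s` seconds for each key."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        # key (e.g. client IP) -> timestamps of its recent requests
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        window = self.hits[key]
        # Forget requests that have aged out of the window.
        while window and now - window[0] >= self.window_s:
            window.popleft()
        if len(window) >= self.limit:
            return False
        window.append(now)
        return True
```

With `limit=1` and `window_s=10.0` this mirrors the route's policy: a second request from the same IP within ten seconds is rejected, which is when the route streams back its rate-limit message instead of invoking the agent.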


================================================
FILE: src/app/globals.css
================================================
@tailwind base;
@tailwind components;
@tailwind utilities;

@layer base {
  a,
  button {
    @apply transition;
  }

  a {
    @apply text-emerald-700 underline
        decoration-emerald-700/60 decoration-2
        hover:decoration-emerald-700 hover:bg-emerald-200
        outline-0 focus:ring-2 focus:ring-offset-1 focus:ring-emerald-500;
  }

  ::selection {
    @apply bg-emerald-200 text-emerald-950;
  }

  button:focus,
  input:focus {
    @apply outline-0 ring-2 ring-offset-1 ring-emerald-500 caret-emerald-500;
  }

  label,
  strong,
  b {
    @apply font-semibold;
  }

  h1,
  h2,
  h3,
  h4 {
    @apply text-balance;
  }

  p {
    @apply text-pretty;
  }
}


================================================
FILE: src/app/layout.tsx
================================================
import type { Metadata } from "next";
import { Inter } from "next/font/google";
import "./globals.css";
import cx from "@/utils/cx";

const inter = Inter({ subsets: ["latin"] });

export const metadata: Metadata = {
  title: "DegreeGuru",
  description: "DegreeGuru ChatBot",
};

export default function RootLayout({
  children,
}: {
  children: React.ReactNode;
}) {
  return (
    <html lang="en" className="scroll-smooth antialiased">
      <body className={cx(inter.className, "text-sm md:text-base bg-white")}>
        {children}
      </body>
    </html>
  );
}


================================================
FILE: src/app/page.tsx
================================================
"use client";

import React, { useCallback, useEffect, useRef, useState } from "react";
import { Message as MessageProps, useChat } from "ai/react";
import Form from "@/components/form";
import Message from "@/components/message";
import cx from "@/utils/cx";
import PoweredBy from "@/components/powered-by";
import MessageLoading from "@/components/message-loading";
import { INITIAL_QUESTIONS } from "@/utils/const";

export default function Home() {
  const formRef = useRef<HTMLFormElement>(null);
  const messagesEndRef = useRef<HTMLDivElement>(null);

  const [streaming, setStreaming] = useState<boolean>(false);

  const { messages, input, handleInputChange, handleSubmit, setInput } =
    useChat({
      api: "/api/guru",
      initialMessages: [
        {
          id: "0",
          role: "system",
          content: `**Welcome to DegreeGuru**

Your ultimate companion in navigating the academic landscape of Stanford.`,
        },
      ],
      onResponse: () => {
        setStreaming(false);
      },
    });

  const onClickQuestion = (value: string) => {
    setInput(value);
    setTimeout(() => {
      formRef.current?.dispatchEvent(
        new Event("submit", {
          cancelable: true,
          bubbles: true,
        }),
      );
    }, 1);
  };

  useEffect(() => {
    if (messagesEndRef.current) {
      messagesEndRef.current.scrollIntoView();
    }
  }, [messages]);

  const onSubmit = useCallback(
    (e: React.FormEvent<HTMLFormElement>) => {
      e.preventDefault();
      handleSubmit(e);
      setStreaming(true);
    },
    [handleSubmit],
  );

  return (
    <main className="relative max-w-screen-md p-4 md:p-6 mx-auto flex min-h-svh !pb-32 md:!pb-40 overflow-y-auto">
      <div className="w-full">
        {messages.map((message: MessageProps) => {
          return <Message key={message.id} {...message} />;
        })}

        {/* loading */}
        {streaming && <MessageLoading />}

        {/* initial question */}
        {messages.length === 1 && (
          <div className="mt-4 md:mt-6 grid md:grid-cols-2 gap-2 md:gap-4">
            {INITIAL_QUESTIONS.map((message) => {
              return (
                <button
                  key={message.content}
                  type="button"
                  className="cursor-pointer select-none text-left bg-white font-normal
                  border border-gray-200 rounded-xl p-3 md:px-4 md:py-3
                  hover:bg-zinc-50 hover:border-zinc-400"
                  onClick={() => onClickQuestion(message.content)}
                >
                  {message.content}
                </button>
              );
            })}
          </div>
        )}

        {/* bottom ref */}
        <div ref={messagesEndRef} />
      </div>

      <div
        className={cx(
          "fixed z-10 bottom-0 inset-x-0",
          "flex justify-center items-center",
          "bg-white",
        )}
      >
        <span
          className="absolute bottom-full h-10 inset-x-0 from-white/0
         bg-gradient-to-b to-white pointer-events-none"
        />

        <div className="w-full max-w-screen-md rounded-xl px-4 md:px-5 py-6">
          <Form
            ref={formRef}
            onSubmit={onSubmit}
            inputProps={{
              disabled: streaming,
              value: input,
              onChange: handleInputChange,
            }}
            buttonProps={{
              disabled: streaming,
            }}
          />

          <PoweredBy />
        </div>
      </div>
    </main>
  );
}


================================================
FILE: src/app/vectorstore/UpstashVectorStore.d.ts
================================================
import { Index } from "@upstash/vector";
import { Document } from "@langchain/core/documents";
import {
  MaxMarginalRelevanceSearchOptions,
  VectorStore,
} from "@langchain/core/vectorstores";


type UpstashMetadata = Record<string, any>;


export class UpstashVectorStore extends VectorStore {
  declare FilterType: UpstashMetadata;

  constructor(embeddings: any);
  index: Index;
  similaritySearchVectorWithScore(
    query: any,
    k: any,
    filter: any,
  ): Promise<any[][]>;

  maxMarginalRelevanceSearch(
    query: string,
    options: MaxMarginalRelevanceSearchOptions<this["FilterType"]>
  ): Promise<Document[]>
}


================================================
FILE: src/app/vectorstore/UpstashVectorStore.js
================================================
import { VectorStore } from "@langchain/core/vectorstores";
import { Document } from "@langchain/core/documents";
import { Index } from "@upstash/vector";
import { maximalMarginalRelevance } from "@langchain/core/utils/math";


export class UpstashVectorStore extends VectorStore {
  _vectorstoreType() {
    return "upstash";
  }

  constructor(embeddings) {
    super(embeddings);

    this.index = new Index({
      url: process.env.UPSTASH_VECTOR_URL,
      token: process.env.UPSTASH_VECTOR_TOKEN,
    });
  }

  async similaritySearchVectorWithScore(query, k, filter) {
    const result = await this.index.query({
      vector: query,
      topK: k,
      includeVectors: false,
      includeMetadata: true,
    });

    const results = [];
    for (let i = 0; i < result.length; i++) {
      results.push([
        new Document({
          pageContent: JSON.stringify(result[i]?.metadata) || "",
        }),
        result[i].score,
      ]);
    }

    return results;
  }

  async maxMarginalRelevanceSearch(query, options) {
    const queryEmbedding = await this.embeddings.embedQuery(query);
    const result = await this.index.query({
      vector: queryEmbedding,
      topK: options.fetchK ?? 20,
      includeVectors: true,
      includeMetadata: true,
    });
    const embeddingList = result.map((r) => r.vector);

    const mmrIndexes = maximalMarginalRelevance(
      queryEmbedding,
      embeddingList,
      options.lambda,
      options.k
    );
    const topMmrMatches = mmrIndexes.map((idx) => result[idx]);

    const results = [];
    for (let i = 0; i < topMmrMatches.length; i++) {
      results.push(
        new Document({
          pageContent: JSON.stringify(topMmrMatches[i]?.metadata) || "",
        }),
      );
    }

    return results;
  }
}
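
`maxMarginalRelevanceSearch` above delegates the reranking to LangChain's `maximalMarginalRelevance`. A compact Python sketch of that selection step, assuming cosine similarity and the classic MMR trade-off (simplified; the real helper handles normalization edge cases):

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_indexes(
    query: List[float],
    vectors: List[List[float]],
    lambda_mult: float,
    k: int,
) -> List[int]:
    """Greedily pick k vectors, trading query relevance against redundancy."""
    selected: List[int] = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query, vectors[i])
            # Redundancy = similarity to the closest already-chosen vector.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected
```

With `lambda_mult` near 1 the selection is pure relevance ranking; lower values (the route passes `lambda: 0.5`) penalize near-duplicate chunks, which is why the store fetches `fetchK` candidates but returns only `k` diverse ones.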


================================================
FILE: src/components/form.tsx
================================================
import { ComponentProps, forwardRef } from "react";
import { IconArrowBack } from "@tabler/icons-react";
import cx from "@/utils/cx";

export interface Props extends ComponentProps<"form"> {
  inputProps: ComponentProps<"input">;
  buttonProps: ComponentProps<"button">;
}

const Form = ({ inputProps, buttonProps, onSubmit }: Props, ref: any) => {
  return (
    <form
      onSubmit={onSubmit}
      className="relative m-auto flex items-center gap-4 justify-center"
      ref={ref}
    >
      {/*<Avatar isUser={true} className="md:size-10 bg-gray-300" />*/}

      <input
        placeholder="Your question..."
        required
        {...inputProps}
        className={cx(
          "transition h-10 md:h-12 pl-4 pr-12 flex-1 rounded-xl",
          "border border-gray-400 text-base",
          "disabled:bg-gray-100",
          inputProps.className,
        )}
        type="text"
      />

      <button
        {...buttonProps}
        type="submit"
        tabIndex={-1}
        className={cx(
          "absolute right-3 top-1/2 -translate-y-1/2",
          "opacity-50",
        )}
      >
        <IconArrowBack stroke={1.5} />
      </button>
    </form>
  );
};

export default forwardRef(Form);


================================================
FILE: src/components/message-loading.tsx
================================================
import React from "react";
import cx from "@/utils/cx";
import { Avatar } from "@/components/message";

const MessageLoading: React.FC = () => {
  return (
    <article
      className={cx(
        "mb-2 flex items-center gap-4 p-4 md:p-5 rounded-2xl",
        "bg-emerald-50/80",
      )}
    >
      <Avatar />

      {/* https://github.com/n3r4zzurr0/svg-spinners/blob/main/svg-smil/3-dots-bounce.svg?short_path=50864c0 */}
      <svg
        width="24"
        height="24"
        viewBox="0 0 24 24"
        xmlns="http://www.w3.org/2000/svg"
        className="text-emerald-800"
      >
        <circle cx="4" cy="12" r="2" fill="currentColor">
          <animate
            id="spinner_qFRN"
            begin="0;spinner_OcgL.end+0.25s"
            attributeName="cy"
            calcMode="spline"
            dur="0.6s"
            values="12;6;12"
            keySplines=".33,.66,.66,1;.33,0,.66,.33"
          />
        </circle>
        <circle cx="12" cy="12" r="2" fill="currentColor">
          <animate
            begin="spinner_qFRN.begin+0.1s"
            attributeName="cy"
            calcMode="spline"
            dur="0.6s"
            values="12;6;12"
            keySplines=".33,.66,.66,1;.33,0,.66,.33"
          />
        </circle>
        <circle cx="20" cy="12" r="2" fill="currentColor">
          <animate
            id="spinner_OcgL"
            begin="spinner_qFRN.begin+0.2s"
            attributeName="cy"
            calcMode="spline"
            dur="0.6s"
            values="12;6;12"
            keySplines=".33,.66,.66,1;.33,0,.66,.33"
          />
        </circle>
      </svg>
    </article>
  );
};

export default MessageLoading;


================================================
FILE: src/components/message.tsx
================================================
import React from "react";
import Markdown from "markdown-to-jsx";
import cx from "@/utils/cx";
import { Message as MessageProps } from "ai/react";
import UpstashLogo from "@/components/upstash-logo";
import { IconUser } from "@tabler/icons-react";

const Message: React.FC<MessageProps> = ({ content, role }) => {
  const isUser = role === "user";

  return (
    <article
      className={cx(
        "mb-4 flex items-start gap-4 p-4 md:p-5 rounded-2xl",
        isUser ? "" : "bg-emerald-50",
      )}
    >
      <Avatar isUser={isUser} />
      <Markdown
        className={cx(
          "py-1.5 md:py-1 space-y-4",
          isUser ? "font-semibold" : "",
        )}
        options={{
          overrides: {
            ol: ({ children }) => <ol className="list-decimal">{children}</ol>,
            ul: ({ children }) => <ul className="list-disc">{children}</ul>,
          },
        }}
      >
        {content}
      </Markdown>
    </article>
  );
};

const Avatar: React.FC<{ isUser?: boolean; className?: string }> = ({
  isUser = false,
  className,
}) => {
  return (
    <div
      className={cx(
        "flex items-center justify-center size-8 shrink-0 rounded-full",
        isUser ? "bg-gray-200 text-gray-700" : "bg-emerald-950",
        className,
      )}
    >
      {isUser ? <IconUser size={20} /> : <UpstashLogo />}
    </div>
  );
};

export default Message;
export { Avatar };


================================================
FILE: src/components/powered-by.tsx
================================================
const PoweredBy = () => {
  return (
    <p className="mt-4 text-xs md:text-sm text-gray-600 text-center">
      This project is a prototype for a RAG chatbot. <br /> Built using{" "}
      <a href="https://www.langchain.com/" target="_blank">
        LangChain
      </a>
      ,{" "}
      <a href="https://upstash.com" target="_blank">
        Upstash Vector
      </a>{" "}
      and{" "}
      <a href="https://sdk.vercel.ai" target="_blank">
        Vercel AI SDK
      </a>{" "}
      ・{" "}
      <a href="https://github.com/upstash/DegreeGuru" target="_blank">
        Source Code
      </a>
    </p>
  );
};

export default PoweredBy;


================================================
FILE: src/components/upstash-logo.tsx
================================================
import React, { HTMLProps } from "react";

export interface Props extends HTMLProps<SVGSVGElement> {
  size?: number;
}

export default function UpstashLogo({ height = 20, ...props }: Props) {
  return (
    <svg
      role="img"
      height={height}
      viewBox="0 0 118 118"
      fill="none"
      xmlns="http://www.w3.org/2000/svg"
    >
      <g clipPath="url(#upstash_icon_dark_bg)">
        <path
          className="fill-emerald-400"
          d="M15.105 103.244c19.416 19.526 50.895 19.526 70.311 0 19.416-19.526 19.416-51.185 0-70.711l-8.789 8.839c14.562 14.645 14.562 38.388 0 53.033-14.562 14.644-38.171 14.644-52.733 0l-8.789 8.839Z"
        />
        <path
          className="fill-emerald-400"
          d="M32.683 85.566c9.708 9.763 25.447 9.763 35.155 0 9.708-9.763 9.708-25.592 0-35.355L59.05 59.05c4.854 4.881 4.854 12.796 0 17.677a12.38 12.38 0 0 1-17.578 0l-8.79 8.839Z"
        />
        <path
          className="fill-emerald-200"
          d="M102.994 14.855c-19.416-19.526-50.895-19.526-70.311 0-19.416 19.527-19.416 51.185 0 70.711l8.788-8.839c-14.561-14.645-14.561-38.388 0-53.033 14.562-14.644 38.172-14.644 52.734 0l8.789-8.839Z"
        />
        <path
          className="fill-emerald-200"
          d="M85.416 32.533c-9.708-9.763-25.448-9.763-35.156 0-9.708 9.763-9.708 25.592 0 35.355l8.79-8.839c-4.855-4.881-4.855-12.795 0-17.677a12.38 12.38 0 0 1 17.577 0l8.789-8.839Z"
        />
      </g>
      <defs>
        <clipPath id="upstash_icon_dark_bg">
          <path fill="#fff" d="M15 0h88v118H15z" />
        </clipPath>
      </defs>
    </svg>
  );
}


================================================
FILE: src/utils/const.ts
================================================
export const INITIAL_QUESTIONS = [
  {
    content: "Are there resources for students interested in creative writing?",
  },
  {
    content: "Are there courses on environmental sustainability?",
  },
  {
    content:
      "Are there any workshops or seminars on entrepreneurship for students?",
  },
  {
    content: "What kinds of courses will I take as a philosophy major?",
  },
];


================================================
FILE: src/utils/cx.ts
================================================
import { ClassValue, clsx } from "clsx";
import { twMerge } from "tailwind-merge";

export default function cx(...inputs: ClassValue[]) {
  return twMerge(clsx(inputs));
}
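
`cx` combines `clsx` (conditional class assembly) with `twMerge`, which resolves conflicting Tailwind utilities so the class passed last wins. A deliberately rough Python sketch of the last-wins idea, keyed on a naive utility prefix (the real tailwind-merge knows many distinct conflict groups; this heuristic is illustrative only):

```python
def merge_classes(*groups: str) -> str:
    """Rough stand-in for clsx + twMerge: split class strings, then keep
    only the last class seen for each utility prefix (text-, bg-, p-, ...)."""
    winners: dict = {}
    for group in groups:
        for cls in group.split():
            # "text-sm" -> "text"; bare classes like "flex" are their own key.
            prefix = cls.rsplit("-", 1)[0] if "-" in cls else cls
            winners[prefix] = cls
    return " ".join(winners.values())
```

This is why component call sites can append overrides (as `form.tsx` does with `inputProps.className`) without producing contradictory `text-*` or `bg-*` utilities in the final attribute.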


================================================
FILE: tailwind.config.ts
================================================
import type { Config } from "tailwindcss";

const config: Config = {
  content: [
    "./src/pages/**/*.{js,ts,jsx,tsx,mdx}",
    "./src/components/**/*.{js,ts,jsx,tsx,mdx}",
    "./src/app/**/*.{js,ts,jsx,tsx,mdx}",
  ],
  plugins: [],
};
export default config;


================================================
FILE: tsconfig.json
================================================
{
  "compilerOptions": {
    "target": "es5",
    "lib": ["dom", "dom.iterable", "esnext"],
    "allowJs": true,
    "skipLibCheck": true,
    "strict": true,
    "noEmit": true,
    "esModuleInterop": true,
    "module": "esnext",
    "moduleResolution": "bundler",
    "resolveJsonModule": true,
    "isolatedModules": true,
    "jsx": "preserve",
    "incremental": true,
    "plugins": [
      {
        "name": "next"
      }
    ],
    "paths": {
      "@/*": ["./src/*"]
    }
  },
  "include": ["next-env.d.ts", "**/*.ts", "**/*.tsx", ".next/types/**/*.ts"],
  "exclude": ["node_modules"]
}
SYMBOL INDEX (39 symbols across 14 files)

FILE: degreegurucrawler/degreegurucrawler/items.py
  class DegreegurucrawlerItem (line 9) | class DegreegurucrawlerItem(scrapy.Item):

FILE: degreegurucrawler/degreegurucrawler/middlewares.py
  class DegreegurucrawlerSpiderMiddleware (line 12) | class DegreegurucrawlerSpiderMiddleware:
    method from_crawler (line 18) | def from_crawler(cls, crawler):
    method process_spider_input (line 24) | def process_spider_input(self, response, spider):
    method process_spider_output (line 31) | def process_spider_output(self, response, result, spider):
    method process_spider_exception (line 39) | def process_spider_exception(self, response, exception, spider):
    method process_start_requests (line 46) | def process_start_requests(self, start_requests, spider):
    method spider_opened (line 55) | def spider_opened(self, spider):
  class DegreegurucrawlerDownloaderMiddleware (line 59) | class DegreegurucrawlerDownloaderMiddleware:
    method from_crawler (line 65) | def from_crawler(cls, crawler):
    method process_request (line 71) | def process_request(self, request, spider):
    method process_response (line 83) | def process_response(self, request, response, spider):
    method process_exception (line 92) | def process_exception(self, request, exception, spider):
    method spider_opened (line 102) | def spider_opened(self, spider):

FILE: degreegurucrawler/degreegurucrawler/pipelines.py
  class DegreegurucrawlerPipeline (line 11) | class DegreegurucrawlerPipeline:
    method process_item (line 12) | def process_item(self, item, spider):

FILE: degreegurucrawler/degreegurucrawler/spiders/configurable.py
  class ConfigurableSpider (line 15) | class ConfigurableSpider(CrawlSpider):
    method __init__ (line 29) | def __init__(self, *a, **kw):
    method _disable_loggers (line 48) | def _disable_loggers(self):
    method parse_page (line 64) | def parse_page(self, response):

FILE: degreegurucrawler/degreegurucrawler/utils/upstash_vector_store.py
  class UpstashVectorStore (line 5) | class UpstashVectorStore:
    method __init__ (line 7) | def __init__(
    method get_embeddings (line 15) | def get_embeddings(
    method add (line 30) | def add(

FILE: src/app/api/guru/route.tsx
  function POST (line 38) | async function POST(req: NextRequest) {

FILE: src/app/layout.tsx
  function RootLayout (line 13) | function RootLayout({

FILE: src/app/page.tsx
  function Home (line 12) | function Home() {

FILE: src/app/vectorstore/UpstashVectorStore.d.ts
  type UpstashMetadata (line 9) | type UpstashMetadata = Record<string, any>;
  class UpstashVectorStore (line 12) | class UpstashVectorStore extends VectorStore {

FILE: src/app/vectorstore/UpstashVectorStore.js
  class UpstashVectorStore (line 7) | class UpstashVectorStore extends VectorStore {
    method _vectorstoreType (line 8) | _vectorstoreType() {
    method constructor (line 12) | constructor(embeddings) {
    method similaritySearchVectorWithScore (line 21) | async similaritySearchVectorWithScore(query, k, filter) {
    method maxMarginalRelevanceSearch (line 41) | async maxMarginalRelevanceSearch(query, options) {

FILE: src/components/form.tsx
  type Props (line 5) | interface Props extends ComponentProps<"form"> {

FILE: src/components/upstash-logo.tsx
  type Props (line 3) | interface Props extends HTMLProps<SVGSVGElement> {
  function UpstashLogo (line 7) | function UpstashLogo({ height = 20, ...props }: Props) {

FILE: src/utils/const.ts
  constant INITIAL_QUESTIONS (line 1) | const INITIAL_QUESTIONS = [

FILE: src/utils/cx.ts
  function cx (line 4) | function cx(...inputs: ClassValue[]) {
  {
    "path": "degreegurucrawler/degreegurucrawler/utils/upstash_vector_store.py",
    "chars": 1438,
    "preview": "from typing import List\nfrom openai import OpenAI\nfrom upstash_vector import Index\n\nclass UpstashVectorStore:\n\n    def _"
  },
  {
    "path": "degreegurucrawler/docker-compose.yml",
    "chars": 245,
    "preview": "version: '3'\n\nservices:\n  my_service:\n    image: degreegurucrawler\n    build:\n      context: .\n      dockerfile: Dockerf"
  },
  {
    "path": "degreegurucrawler/requirements.txt",
    "chars": 60,
    "preview": "upstash_vector\nscrapy==2.11.0\nlangchain==0.1.0\nopenai==1.7.2"
  },
  {
    "path": "degreegurucrawler/scrapy.cfg",
    "chars": 277,
    "preview": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrap"
  },
  {
    "path": "next.config.js",
    "chars": 94,
    "preview": "/** @type {import('next').NextConfig} */\nconst nextConfig = {};\n\nmodule.exports = nextConfig;\n"
  },
  {
    "path": "package.json",
    "chars": 862,
    "preview": "{\n  \"name\": \"degreeguru\",\n  \"version\": \"0.1.0\",\n  \"private\": true,\n  \"scripts\": {\n    \"dev\": \"next dev\",\n    \"build\": \"n"
  },
  {
    "path": "postcss.config.js",
    "chars": 83,
    "preview": "module.exports = {\n  plugins: {\n    tailwindcss: {},\n    autoprefixer: {},\n  },\n};\n"
  },
  {
    "path": "src/app/api/guru/route.tsx",
    "chars": 7354,
    "preview": "import { NextRequest, NextResponse } from \"next/server\";\n\nimport { Ratelimit } from \"@upstash/ratelimit\";\nimport { Redis"
  },
  {
    "path": "src/app/globals.css",
    "chars": 674,
    "preview": "@tailwind base;\n@tailwind components;\n@tailwind utilities;\n\n@layer base {\n  a,\n  button {\n    @apply transition;\n  }\n\n  "
  },
  {
    "path": "src/app/layout.tsx",
    "chars": 568,
    "preview": "import type { Metadata } from \"next\";\nimport { Inter } from \"next/font/google\";\nimport \"./globals.css\";\nimport cx from \""
  },
  {
    "path": "src/app/page.tsx",
    "chars": 3534,
    "preview": "\"use client\";\n\nimport React, { useCallback, useEffect, useRef, useState } from \"react\";\nimport { Message as MessageProps"
  },
  {
    "path": "src/app/vectorstore/UpstashVectorStore.d.ts",
    "chars": 633,
    "preview": "import { Index } from \"@upstash/vector\";\nimport { Document } from \"@langchain/core/documents\";\nimport {\n  MaxMarginalRel"
  },
  {
    "path": "src/app/vectorstore/UpstashVectorStore.js",
    "chars": 1755,
    "preview": "import { VectorStore } from \"@langchain/core/vectorstores\";\nimport { Document } from \"@langchain/core/documents\";\nimport"
  },
  {
    "path": "src/components/form.tsx",
    "chars": 1212,
    "preview": "import { ComponentProps, forwardRef } from \"react\";\nimport { IconArrowBack } from \"@tabler/icons-react\";\nimport cx from "
  },
  {
    "path": "src/components/message-loading.tsx",
    "chars": 1678,
    "preview": "import React from \"react\";\nimport cx from \"@/utils/cx\";\nimport { Avatar } from \"@/components/message\";\n\nconst MessageLoa"
  },
  {
    "path": "src/components/message.tsx",
    "chars": 1407,
    "preview": "import React from \"react\";\nimport Markdown from \"markdown-to-jsx\";\nimport cx from \"@/utils/cx\";\nimport { Message as Mess"
  },
  {
    "path": "src/components/powered-by.tsx",
    "chars": 645,
    "preview": "const PoweredBy = () => {\n  return (\n    <p className=\"mt-4 text-xs md:text-sm text-gray-600 text-center\">\n      This pr"
  },
  {
    "path": "src/components/upstash-logo.tsx",
    "chars": 1599,
    "preview": "import React, { HTMLProps } from \"react\";\n\nexport interface Props extends HTMLProps<SVGSVGElement> {\n  size?: number;\n}\n"
  },
  {
    "path": "src/utils/const.ts",
    "chars": 387,
    "preview": "export const INITIAL_QUESTIONS = [\n  {\n    content: \"Are there resources for students interested in creative writing?\",\n"
  },
  {
    "path": "src/utils/cx.ts",
    "chars": 172,
    "preview": "import { ClassValue, clsx } from \"clsx\";\nimport { twMerge } from \"tailwind-merge\";\n\nexport default function cx(...inputs"
  },
  {
    "path": "tailwind.config.ts",
    "chars": 263,
    "preview": "import type { Config } from \"tailwindcss\";\n\nconst config: Config = {\n  content: [\n    \"./src/pages/**/*.{js,ts,jsx,tsx,m"
  },
  {
    "path": "tsconfig.json",
    "chars": 599,
    "preview": "{\n  \"compilerOptions\": {\n    \"target\": \"es5\",\n    \"lib\": [\"dom\", \"dom.iterable\", \"esnext\"],\n    \"allowJs\": true,\n    \"sk"
  }
]

About this extraction

This page contains the full source code of the upstash/degree-guru GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 38 files (50.0 KB), approximately 13.8k tokens, and a symbol index with 39 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
