Repository: tablegpt/tablegpt-agent
Branch: main
Commit: 26bc576bb21f
Files: 106
Total size: 3.7 MB
Directory structure:
gitextract_d58d11ud/
├── .devcontainer/
│ └── devcontainer.json
├── .gitattributes
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ └── bug_report.md
│ └── workflows/
│ ├── ci.yml
│ ├── publish-docs.yml
│ ├── publish.yml
│ └── stale.yml
├── .gitignore
├── .pre-commit-config.yaml
├── CONTRIBUTING.md
├── LICENSE
├── Makefile
├── README.md
├── collect_script.py
├── docs/
│ ├── explanation/
│ │ ├── agent-workflow.md
│ │ ├── code-sandbox.md
│ │ ├── file-reading.ipynb
│ │ └── ipython-startup-scripts.md
│ ├── howto/
│ │ ├── cleanup-error-trace.md
│ │ ├── customize-table-info.md
│ │ ├── incluster-code-execution.md
│ │ ├── messages-truncation.ipynb
│ │ ├── normalize-datasets.ipynb
│ │ ├── persist-messages.ipynb
│ │ └── retrieval.ipynb
│ ├── index.md
│ ├── reference.md
│ ├── stylesheets/
│ │ └── extra.css
│ └── tutorials/
│ ├── chat-on-tabular-data.ipynb
│ ├── continue-analysis-on-generated-charts.ipynb
│ └── quick-start.ipynb
├── examples/
│ ├── __init__.py
│ ├── data_analysis.py
│ ├── datasets/
│ │ ├── titanic.csv
│ │ ├── 产品生产统计表.xlsx
│ │ └── 产品销量表.csv
│ └── quick_start.py
├── ipython/
│ ├── README.md
│ ├── ipython-startup-scripts/
│ │ ├── 00-pandas.py
│ │ ├── 98-udfs.py
│ │ └── 99-cfont.py
│ └── requirements.txt
├── mkdocs.yml
├── pyproject.toml
├── realtabbench/
│ ├── README.md
│ ├── __init__.py
│ ├── agent_eval/
│ │ ├── README.md
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── config.py
│ │ ├── evaluatee.py
│ │ ├── evaluator/
│ │ │ ├── __init__.py
│ │ │ ├── output_parser.py
│ │ │ └── prompt.py
│ │ ├── example-config.yaml
│ │ ├── questioner.py
│ │ ├── requirements.txt
│ │ ├── runner.py
│ │ ├── tablegpt_evaluatee.py
│ │ └── worker.py
│ ├── evalset/
│ │ ├── bird_data/
│ │ │ ├── dev.json
│ │ │ ├── dev.sql
│ │ │ └── dev_tables.json
│ │ └── spider_data/
│ │ ├── dev.json
│ │ ├── dev_gold.sql
│ │ ├── test.json
│ │ ├── test_gold.sql
│ │ └── test_tables.json
│ ├── inference.py
│ ├── inference_encoder.py
│ ├── requirements.txt
│ ├── run_text2sql_eval.py
│ ├── text2sql/
│ │ ├── __init__.py
│ │ └── src/
│ │ ├── __init__.py
│ │ ├── evaluation.py
│ │ ├── gpt_request.py
│ │ └── gpt_request_encoder.py
│ └── utils.py
├── src/
│ └── tablegpt/
│ ├── __about__.py
│ ├── __init__.py
│ ├── agent/
│ │ ├── __init__.py
│ │ ├── data_analyzer.py
│ │ ├── file_reading/
│ │ │ ├── __init__.py
│ │ │ └── data_normalizer.py
│ │ └── output_parser.py
│ ├── errors.py
│ ├── retriever/
│ │ ├── __init__.py
│ │ ├── compressor.py
│ │ └── loader.py
│ ├── safety.py
│ ├── tools.py
│ ├── translation.py
│ └── utils.py
└── tests/
├── __init__.py
├── agent/
│ ├── __init__.py
│ ├── file_reading/
│ │ ├── __init__.py
│ │ └── test_data_normalizer.py
│ └── test_output_parser.py
├── retriever/
│ ├── __init__.py
│ ├── test_compressor.py
│ ├── test_format.py
│ └── test_loader.py
├── test_profile_init.py
├── test_safety.py
├── test_tools.py
└── test_utils.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .devcontainer/devcontainer.json
================================================
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile
{
    "name": "tablegpt-agent",
    "image": "mcr.microsoft.com/devcontainers/python:1-3.12",
    "containerEnv": {
        // This instructs hatch to create envs in the workspace folder.
        // It makes selecting the interpreter simpler.
        "HATCH_DATA_DIR": "${containerWorkspaceFolder}"
    },
    // Use 'postCreateCommand' to run commands after the container is created.
    "postCreateCommand": "pip3 install hatch",
    // See https://stackoverflow.com/questions/70206554/share-ssh-keys-with-vs-code-devcontainer-running-with-dockers-wsl2-backend
    "mounts": [
        "type=bind,source=${localEnv:HOME}${localEnv:USERPROFILE}/.ssh,target=/home/vscode/.ssh,readonly"
    ]
}
================================================
FILE: .gitattributes
================================================
* text=auto eol=lf
*.{cmd,[cC][mM][dD]} text eol=crlf
*.{bat,[bB][aA][tT]} text eol=crlf
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug report
about: Create a report to help us improve
title: ""
labels: bug
assignees: ""
---
- [ ] I have searched the issue tracker and believe that this is not a duplicate.
## Run `python collect_script.py` and paste or upload the resulting text file here.
<!--The Collect script output-->
> If you are using TableGPT2 deployed with vLLM, please specify the vLLM version and include the command used to start the server.
>
> If not, you may skip this section.
## vLLM deployment
### vLLM version
<!--The version of vLLM-->
### vLLM serve start command
<!--The command used to start the vLLM server-->
## Steps to reproduce
<!--Describe a minimal example of how to reproduce the bug-->
## Actual behavior
<!--A clear and concise description of the result of the above steps-->
## Expected behavior
<!--A clear and concise description of what you expected to happen.-->
================================================
FILE: .github/workflows/ci.yml
================================================
name: "CI"
on:
  push:
    branches: [ main ]
    paths:
      - "src/**"
      - "tests/**"
      - "Makefile"
      - "pyproject.toml"
  pull_request:
    branches: [ main ]
    paths:
      - "src/**"
      - "tests/**"
      - "Makefile"
      - "pyproject.toml"

jobs:
  lint-test:
    name: "lint & tests"
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Set up pip cache
        if: runner.os == 'Linux'
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('pyproject.toml') }}
          restore-keys: ${{ runner.os }}-pip-
      - name: Install hatch
        run: |
          pipx install hatch
      - name: Lint
        run: make lint
      - name: Tests
        run: make test

  install-ubuntu:
    name: "install on ubuntu"
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          # <https://github.com/actions/setup-python/tree/v5/?tab=readme-ov-file#caching-packages-dependencies>
          cache: 'pip'
      - name: Install tablegpt-agent
        run: |
          pip install -e .

  install-win:
    name: "install on windows"
    runs-on: windows-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          # <https://github.com/actions/setup-python/tree/v5/?tab=readme-ov-file#caching-packages-dependencies>
          cache: 'pip'
      - name: Install tablegpt-agent
        run: |
          pip install -e .
================================================
FILE: .github/workflows/publish-docs.yml
================================================
name: Publish docs
on:
  push:
    branches: [ main ]
    paths:
      - "docs/**"
      - "mkdocs.yml"
  workflow_dispatch: # Allows the workflow to be triggered manually in the GitHub UI

jobs:
  run:
    name: "deploy docs"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # See <https://github.com/mkdocs/mkdocs/issues/2370#issuecomment-821926264>
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: 3.12
      - name: Set up pip cache
        if: runner.os == 'Linux'
        uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('pyproject.toml') }}
          restore-keys: ${{ runner.os }}-pip-
      - name: Install hatch
        run: |
          pipx install hatch
      - name: Publish docs
        run: hatch env run -e docs mkdocs gh-deploy
================================================
FILE: .github/workflows/publish.yml
================================================
name: Publish to PyPI
on:
  release:
    types: [published]

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write # IMPORTANT: this permission is mandatory for trusted publishing
      contents: read # IMPORTANT: this permission is mandatory for private repositories
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dependencies
        run: |
          pipx install hatch
      - name: Build package
        run: hatch build
      - name: Publish package distributions to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          verbose: true
================================================
FILE: .github/workflows/stale.yml
================================================
name: "Close stale issues and PRs"
permissions:
  actions: write
  contents: write # only for the delete-branch option
  issues: write
  pull-requests: write

on:
  schedule:
    - cron: "30 1 * * *"

jobs:
  stale:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/stale@v9
        with:
          stale-issue-message: "This issue is stale because it has been open for 30 days with no activity. Remove the stale label or add a comment, or it will be closed in 10 days."
          stale-pr-message: "This PR is stale because it has been open for 30 days with no activity. Remove the stale label or add a comment, or it will be closed in 10 days."
          close-issue-message: "This issue was closed because it was stalled for 10 days with no activity."
          close-pr-message: "This PR was closed because it was stalled for 10 days with no activity."
          days-before-issue-stale: 30
          days-before-pr-stale: 30
          days-before-issue-close: 10
          days-before-pr-close: 10
================================================
FILE: .gitignore
================================================
##
# MacOS
# <https://github.com/github/gitignore/blob/main/Global/macOS.gitignore>
##
# General
.DS_Store
.AppleDouble
.LSOverride
# Icon must end with two \r
Icon
# Thumbnails
._*
# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent
# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk
##
# Windows
# <https://github.com/github/gitignore/blob/main/Global/Windows.gitignore>
##
# Windows thumbnail cache files
Thumbs.db
Thumbs.db:encryptable
ehthumbs.db
ehthumbs_vista.db
# Dump file
*.stackdump
# Folder config file
[Dd]esktop.ini
# Recycle Bin used on file shares
$RECYCLE.BIN/
# Windows Installer files
*.cab
*.msi
*.msix
*.msm
*.msp
# Windows shortcuts
*.lnk
##
# Linux
# <https://github.com/github/gitignore/blob/main/Global/Linux.gitignore>
##
*~
# temporary files which can be created if a process still has a handle open of a deleted file
.fuse_hidden*
# KDE directory preferences
.directory
# Linux trash folder which might appear on any partition or disk
.Trash-*
# .nfs files are created when an open file is removed but is still being accessed
.nfs*
##
# JetBrains
# <https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore>
##
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf
# AWS User-specific
.idea/**/aws.xml
# Generated files
.idea/**/contentModel.xml
# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml
# Gradle
.idea/**/gradle.xml
.idea/**/libraries
# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/artifacts
# .idea/compiler.xml
# .idea/jarRepositories.xml
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr
# CMake
cmake-build-*/
# Mongo Explorer plugin
.idea/**/mongoSettings.xml
# File-based project format
*.iws
# IntelliJ
out/
# mpeltonen/sbt-idea plugin
.idea_modules/
# JIRA plugin
atlassian-ide-plugin.xml
# Cursive Clojure plugin
.idea/replstate.xml
# SonarLint plugin
.idea/sonarlint/
# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties
# Editor-based Rest Client
.idea/httpRequests
# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser
##
# Python
# <https://github.com/github/gitignore/blob/main/Python.gitignore>
##
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
# Evalset sqlite and csv files
*.csv
*.sqlite
# Preserve use case data
!examples/datasets/*.csv
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: check-yaml
        args: [--allow-multiple-documents]
      - id: end-of-file-fixer
      - id: trailing-whitespace
        args: [--markdown-linebreak-ext=md]
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.5.1
    hooks:
      # Run the linter.
      - id: ruff
        # It is recommended to specify the latest version of Python
        # supported by your project here, or alternatively use
        # pre-commit's default_language_version, see
        # https://pre-commit.com/#top_level-default_language_version
        language_version: python3.12
        args: [ --fix ]
      # Run the formatter.
      - id: ruff-format
        language_version: python3.12
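To apply this configuration locally, the hooks can be registered with the `pre-commit` tool; a minimal setup sketch (it assumes `pipx` is available, though `pip install pre-commit` works as well):

```shell
# Install the pre-commit tool
pipx install pre-commit

# Register the hooks from .pre-commit-config.yaml as a git pre-commit hook
pre-commit install

# Optionally run all hooks against the whole repository once
pre-commit run --all-files
```

After `pre-commit install`, the configured hooks run automatically on every `git commit`.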
================================================
FILE: CONTRIBUTING.md
================================================
# Welcome to TableGPT-Agent contributing guide <!-- omit in toc -->
Thank you for investing your time in contributing to our project! :sparkles:
In this guide you will get an overview of the contribution workflow from opening an issue, creating a PR, reviewing, and merging the PR.
## New contributor guide
To get an overview of the project, read the [README](./README.md) file. Here are some resources to help you get started with open source contributions:
- [Set up Git](https://docs.github.com/en/get-started/getting-started-with-git/set-up-git)
- [GitHub flow](https://docs.github.com/en/get-started/using-github/github-flow)
- [Collaborating with pull requests](https://docs.github.com/en/github/collaborating-with-pull-requests)
## Get Started
### Create a new issue
If you spot a problem with TableGPT, [search if an issue already exists](https://docs.github.com/en/github/searching-for-information-on-github/searching-on-github/searching-issues-and-pull-requests#search-by-the-title-body-or-comments). If a related issue doesn't exist, you can [open a new issue](https://github.com/tablegpt/tablegpt-agent/issues/new).
### Solve an issue
Scan through our [existing issues](https://github.com/tablegpt/tablegpt-agent/issues) to find one you would like to work on; you can narrow down the search using `labels` as filters. Once you are assigned an issue, you can start working on it:
1. Fork the repository.
2. Setup development environment.
3. Create a working branch and start with your changes!
### Commit your update
Commit the changes once you are happy with them. To speed up the review process, make sure your commit messages are clear and concise. We follow the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) standard for commit messages.
### Pull Request
When you're finished with the changes, create a pull request, also known as a PR.
- Don't forget to link PR to issue if you are solving one.
- Once you submit your PR, a team member will review your proposal. We may ask questions or request additional information.
- We may ask for changes to be made before a PR can be merged, either using suggested changes or pull request comments. You can make any other changes in your fork, then commit them to your branch.
- As you update your PR and apply changes, mark each conversation as `resolved`.
- If you run into any merge issues, check out this [Git tutorial](https://github.com/skills/resolve-merge-conflicts) to help you resolve merge conflicts and other issues.
### Code Quality
Before your PR gets merged, we will check the code quality. We use [GitHub Actions](https://docs.github.com/en/actions/) to automate the process. You can inspect the detailed workflow at [ci workflow](./.github/workflows/ci.yml).
If you want to check the code quality locally, you can use the following command:
```sh
make lint && make test
```
In addition to the automated checks, we also have a code review process. The reviewers will provide feedback on your PR and ask for changes if necessary. The feedback is mainly based on Google's [Python style guide](https://google.github.io/styleguide/pyguide.html).
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: Makefile
================================================
# Default target executed when no arguments are given to make.
all: help
lint:
hatch fmt --check
format:
hatch fmt
test:
hatch test
wheel:
hatch build
# 'docs' conflicts with the docs/ directory, so the target is named 'doc'
doc:
hatch env run -e docs mkdocs build
clean:
hatch clean
######################
# HELP
######################
help:
@echo '----'
@echo 'lint - run linters'
@echo 'format - run code formatters'
@echo 'test - run unit tests'
@echo 'wheel - build wheel package'
@echo 'doc - build documentation site'
@echo 'clean - clean up'
================================================
FILE: README.md
================================================
# TableGPT Agent
[](https://pypi.org/project/tablegpt-agent)
[](https://pypi.org/project/tablegpt-agent)
-----
## Introduction
`tablegpt-agent` is a pre-built agent for TableGPT2 ([huggingface](https://huggingface.co/collections/tablegpt/tablegpt2-67265071d6e695218a7e0376)), a series of LLMs for table-based question answering. This agent is built on top of the [Langgraph](https://github.com/langchain-ai/langgraph) library and provides a user-friendly interface for interacting with TableGPT2.
You can find the full documentation at <https://tablegpt.github.io/tablegpt-agent/>.
## Evaluation
This repository also includes a collection of evaluation scripts for table-related benchmarks. The evaluation scripts and datasets can be found in the `realtabbench` directory. For more details, please refer to the [Evaluation README](realtabbench/README.md).
## License
`tablegpt-agent` is distributed under the terms of the [Apache 2.0](https://spdx.org/licenses/Apache-2.0.html) license.
## Model Card
For more information about TableGPT2, see the [TableGPT2 Model Card](https://huggingface.co/tablegpt/tablegpt).
## Citation
If you find our work helpful, please cite us as follows:
```bibtex
@misc{su2024tablegpt2largemultimodalmodel,
title={TableGPT2: A Large Multimodal Model with Tabular Data Integration},
author={Aofeng Su and Aowen Wang and Chao Ye and Chen Zhou and Ga Zhang and Guangcheng Zhu and Haobo Wang and Haokai Xu and Hao Chen and Haoze Li and Haoxuan Lan and Jiaming Tian and Jing Yuan and Junbo Zhao and Junlin Zhou and Kaizhe Shou and Liangyu Zha and Lin Long and Liyao Li and Pengzuo Wu and Qi Zhang and Qingyi Huang and Saisai Yang and Tao Zhang and Wentao Ye and Wufang Zhu and Xiaomeng Hu and Xijun Gu and Xinjie Sun and Xiang Li and Yuhang Yang and Zhiqing Xiao},
year={2024},
eprint={2411.02059},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2411.02059},
}
```
================================================
FILE: collect_script.py
================================================
import platform
import subprocess
import sys
def get_os_info():
return {
"system": platform.system(),
"node": platform.node(),
"release": platform.release(),
"version": platform.version(),
"machine": platform.machine(),
"processor": platform.processor(),
}
def get_python_info():
return {
"implementation": platform.python_implementation(),
"version": platform.python_version(),
"compiler": platform.python_compiler(),
}
def get_pip_list():
result = subprocess.run(
[sys.executable, "-m", "pip", "list"],
capture_output=True,
text=True,
check=False,
)
if result.returncode == 0:
return result.stdout
return f"Failed to get pip list: {result.stderr}"
def write_to_log_file(content, filename="env_output.log"):
with open(filename, "w") as file:
file.write(content)
def main():
os_info = get_os_info()
python_info = get_python_info()
pip_list = get_pip_list()
content = "Operating System Information:\n"
for key, value in os_info.items():
content += f"{key}: {value}\n"
content += "\nPython Information:\n"
for key, value in python_info.items():
content += f"{key}: {value}\n"
content += "\nPip List:\n"
content += pip_list
# stdout
print(content) # noqa: T201
# file
write_to_log_file(content)
if __name__ == "__main__":
main()
================================================
FILE: docs/explanation/agent-workflow.md
================================================
# Agent Workflow
The Agent Workflow is the core functionality of the `tablegpt-agent`. It processes user input and generates appropriate responses. This workflow is similar to those found in most single-agent systems and consists of an agent and various tools. Specifically, the data analysis workflow includes:
- **An Agent Powered by TableGPT2**: This agent performs data analysis tasks. It is designed to understand and execute complex data analysis queries, providing accurate and insightful results.
- **An IPython tool**: This tool executes the generated code within a sandbox environment, ensuring that the code runs safely and efficiently.
Additionally, TableGPT Agent offers several optional plugins that extend the agent's functionality:
- **Visual Language Model**: This plugin can be used to enhance summarization for data visualization tasks.
- **Retriever**: This plugin fetches information about the dataset, improving the quality and relevance of the generated code.
- **Safety Mechanism**: This plugin protects the system from toxic inputs.
## Workflow Steps
1. **User Input**: The user provides a query or command to the agent.
2. **Security Assessment (optional)**: The agent evaluates whether the user's query involves sensitive topics. If it does, the agent will prompt the LLM to be cautious in its response.
3. **Data Retrieval (optional)**: The retriever plugin fetches relevant data and metadata.
4. **Code Generation**: The agent generates the appropriate code to perform the requested task.
5. **Code Execution**: The generated code is executed in the IPython sandbox environment.
6. **Result Generation**: The agent processes the results of the code execution and generates a response.
7. **Visual Analysis (optional)**: The agent performs visual analysis and summarization on the generated charts to enrich the answer.
**NOTE:** While `tablegpt-agent` runs, it repeatedly attempts to resolve any issues that arise during code execution. As a result, Steps 4–7 may be executed multiple times in an iterative debugging cycle, which continues until a final solution is reached or the maximum iteration limit is exceeded. The default maximum iteration count is **25**.
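The iterative debugging cycle described above can be sketched in plain Python. Note that `generate_code`, `execute`, and `run_agent` below are hypothetical stand-ins for the agent's internals, not the actual `tablegpt-agent` API:

```python
from __future__ import annotations

MAX_ITERATIONS = 25  # the default iteration cap mentioned above


def generate_code(query: str, error: str | None) -> str:
    # Placeholder: a real agent would call the LLM here, feeding back the
    # previous error trace so the model can repair its code.
    return "result = 4 ** 0.5" if error else "result = unknown_name"


def execute(code: str) -> tuple[str | None, str | None]:
    # Placeholder sandbox: run the code and capture either a result or an error.
    ns: dict = {}
    try:
        exec(code, ns)
        return str(ns.get("result")), None
    except Exception as e:
        return None, repr(e)


def run_agent(query: str) -> str | None:
    error = None
    for _ in range(MAX_ITERATIONS):
        code = generate_code(query, error)
        result, error = execute(code)
        if error is None:
            return result  # success: hand off to result generation
    return None  # iteration limit exceeded


print(run_agent("square root of 16"))  # first attempt fails, the retry succeeds
```

In the sketch, the first generated snippet raises a `NameError`, the error is fed back, and the second attempt succeeds, mirroring how Steps 4 to 7 repeat in the real workflow.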
## Example Workflow
### User Input
The user inputs a query:
> Draw a pie chart of the male-female ratio based on gender.
### Security Assessment (Optional)
When creating a `tablegpt-agent`, you can specify a dedicated safety evaluation model using the `safety_llm` parameter. For example:
```python
from langchain_openai import ChatOpenAI
from tablegpt.agent import create_tablegpt_graph
safety_llm = ChatOpenAI(openai_api_base="LLM_URL", openai_api_key="whatever", model_name="LLM_MODEL_NAME")
agent = create_tablegpt_graph(
...
safety_llm=safety_llm,
...
)
```
If `safety_llm` is configured, it will evaluate user inputs to identify if they fall under any of the following `14` hazard categories. If a question is flagged as belonging to one of these categories, `tablegpt-agent` will respond with caution.
| Hazard Category | Description |
| ---- | ---- |
| S1: Violent Crimes | Violent criminal activities |
| S2: Non-Violent Crimes | Non-violent criminal activities |
| S3: Sex-Related Crimes | Crimes related to sexual misconduct |
| S4: Child Sexual Exploitation | Exploitation of children |
| S5: Defamation | Defamatory content |
| S6: Specialized Advice | Professional advice (e.g., medical, legal) |
| S7: Privacy | Privacy violations |
| S8: Intellectual Property | Intellectual property issues |
| S9: Indiscriminate Weapons | Use or production of indiscriminate weapons |
| S10: Hate | Hateful or discriminatory content |
| S11: Suicide & Self-Harm | Suicide or self-harm-related content |
| S12: Sexual Content | Explicit sexual content |
| S13: Elections | Content related to elections |
| S14: Code Interpreter Abuse | Misuse of code interpretation features |
This feature enhances the safety of the `tablegpt-agent`, helping to mitigate ethical and legal risks associated with generated content.
### Data Retrieval (optional)
The retriever plugin recalls columns and values related to the query, enhancing the LLM's understanding of the dataset. This improves the accuracy of the code generated by the LLM. For detailed usage instructions, refer to [Enhance TableGPT Agent with RAG](../../howto/retrieval).
For this example, based on the user’s input, the retrieved results are as follows:
```pycon
Here are some extra column information that might help you understand the dataset:
- titanic.csv:
  - {"column": "Sex", "dtype": "string", "values": ["male", "female", ...]}
```
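A string like the one above can be assembled from recalled column metadata along these lines. This is a minimal sketch; `format_column_info` and the metadata layout are illustrative, not the retriever's actual implementation:

```python
import json

# Hypothetical metadata the retriever might recall for the query.
columns = [
    {"column": "Sex", "dtype": "string", "values": ["male", "female"]},
]


def format_column_info(filename: str, cols: list) -> str:
    """Render recalled column metadata as extra context for the LLM prompt."""
    lines = [
        "Here are some extra column information that might help you understand the dataset:",
        f"- {filename}:",
    ]
    # One JSON-encoded entry per recalled column.
    lines.extend(f"  - {json.dumps(col, ensure_ascii=False)}" for col in cols)
    return "\n".join(lines)


print(format_column_info("titanic.csv", columns))
```

The resulting block is appended to the prompt so the model knows which columns and values exist before it writes any code.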
### Code Generation
The agent generates the following Python code:
```python
import seaborn as sns
import matplotlib.pyplot as plt
# Count the number of males and females
gender_counts = df1['Sex'].value_counts()
# Create a pie chart
plt.figure(figsize=(6, 6))
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Gender Distribution')
plt.show()
```
### Code Execution
The generated code is automatically executed in the IPython sandbox environment.
### Result Generation
After the execution is complete, the results are generated as follows:

### Visual Analysis (optional)
The visual analysis plugin allows you to enhance generated results with visualizations, making the output more intuitive and informative.
To enable this feature, you can pass the `vlm` parameter when creating a `tablegpt-agent`. Here’s an example:
```python
from langchain_openai import ChatOpenAI
from tablegpt.agent import create_tablegpt_graph
vlm = ChatOpenAI(openai_api_base="VLM_URL", openai_api_key="whatever", model_name="VLM_MODEL_NAME")
agent = create_tablegpt_graph(
...
vlm=vlm,
...
)
```
Once enabled, the `tablegpt-agent` will use the `vlm` model to generate visual representations of the data.
For instance, in response to the query mentioned earlier, the `tablegpt-agent` generates the following visualization:
> *I have drawn a pie chart illustrating the ratio of men to women. From the chart, you can see that men constitute 64.4% while women make up 35.6%. If you need any further analysis or visualizations, feel free to let me know.*
This feature adds a layer of clarity and insight, helping users interpret the results more effectively; it is especially valuable for complex charts.
================================================
FILE: docs/explanation/code-sandbox.md
================================================
# Code Sandbox
`tablegpt-agent` directs `tablegpt` to generate Python code for data analysis. However, the generated code may contain potential vulnerabilities or unexpected errors. Running such code directly in a production environment could threaten the system's stability and security.
`Code Sandbox` is designed to address this challenge. By leveraging sandbox technology, it confines code execution to a controlled environment, effectively preventing malicious or unexpected behaviors from impacting the main system. This provides an isolated and reliable space for running code safely.
`Code Sandbox` is built on the [pybox](https://github.com/edwardzjl/pybox) library and supports three main execution modes:
- **Local Environment**: Executes code in a local sandbox for quick *deployment* and *validation*.
- **Remote Environment**: Creates remote sandboxes through `Jupyter Enterprise Gateway`, enabling shared computing resources.
- **Cluster Environment**: Bypasses the need for proxy services such as `Jupyter Enterprise Gateway` by communicating directly with kernel pods.
Code Sandbox is designed based on the following key principles:
- **Security**: Limits code access using sandbox technology to ensure a safe and reliable execution environment.
- **Isolation**: Provides independent execution environments for each task, ensuring strict separation of resources and data.
- **Scalability**: Adapts to diverse computing environments, from local setups to Kubernetes clusters, supporting dynamic resource allocation and efficient task execution.
## Local Environment
In a local environment, Code Sandbox utilizes the `pybox` library to create and manage sandbox environments, providing a secure code execution platform. By isolating code execution from the host system's resources and imposing strict permission controls, it ensures safety and reliability. This approach is especially suitable for **development** and **debugging** scenarios.
If you want to run `tablegpt-agent` in a local environment, you can enable the **local mode**. Below are the installation steps and a detailed operation guide.
### Installing
To use `tablegpt-agent` in local mode, install the library with the following command:
```sh
pip install tablegpt-agent[local]
```
### Configuring
`tablegpt-agent` comes with several built-in features, such as auxiliary methods for data analysis and display font configuration. **These features are automatically added to the sandbox environment by default**. If you need advanced customization (e.g., adding specific methods or fonts), refer to the [TableGPT IPython Kernel Configuration Documentation](https://github.com/tablegpt/tablegpt-agent/tree/main/ipython) for further guidance.
### Creating and Running
The following code demonstrates how to use the pybox library to set up a sandbox, execute code, and retrieve results in a local environment:
```python
from uuid import uuid4
from pybox import LocalPyBoxManager, PyBoxOut
# Initialize the local sandbox manager
pybox_manager = LocalPyBoxManager()
# Assign a unique Kernel ID for the sandbox
kernel_id = str(uuid4())
# Start the sandbox environment
box = pybox_manager.start(kernel_id)
# Define the test code to execute
test_code = """
import math
result = math.sqrt(16)
result
"""
# Run the code in the sandbox
out: PyBoxOut = box.run(code=test_code)
# Print the execution result
print(out)
```
### Example Output
After running the above code, the system will return the following output, indicating successful execution with no errors:
```text
data=[{'text/plain': '4.0'}] error=None
```
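The `data` field is a list of MIME-type bundles. Extracting the plain-text result from such an output can be sketched as follows, using a plain dict to stand in for the `PyBoxOut` object (the `first_text` helper is illustrative, not part of pybox):

```python
# Stand-in for the PyBoxOut shown above: a list of MIME bundles plus an error slot.
out = {"data": [{"text/plain": "4.0"}], "error": None}


def first_text(out: dict):
    """Return the first text/plain payload, or None on error or no text output."""
    if out["error"] is not None:
        return None
    for bundle in out["data"]:
        if "text/plain" in bundle:
            return bundle["text/plain"]
    return None


print(first_text(out))  # prints 4.0
```

Richer outputs (e.g. an inline chart) would carry additional MIME keys such as `image/png` alongside `text/plain`, so downstream code typically dispatches on the bundle keys rather than assuming text.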
With `Code Sandbox` in local execution mode, developers can enjoy the safety of sandbox isolation at minimal cost while maintaining flexibility and efficiency. This lays a solid foundation for more complex remote or cluster-based scenarios.
## Remote Environment
In a remote environment, `Code Sandbox` uses the `pybox` library and its `RemotePyBoxManager` to create and manage sandbox environments. The remote mode relies on the [Enterprise Gateway](https://github.com/jupyter-server/enterprise_gateway) service to dynamically create and execute remote sandboxes. This mode allows multiple services to connect to the same remote environment, enabling shared access to resources.
### Configuring
If `tablegpt-agent` is used in **remote mode**, the first step is to start the `enterprise_gateway` service. You can refer to the [Enterprise Gateway Deployment Guide](https://jupyter-enterprise-gateway.readthedocs.io/en/latest/operators/index.html#deploying-enterprise-gateway) for detailed instructions on configuring and starting the service.
Once the service is up and running, ensure that the service address is accessible. For example, assume the `enterprise_gateway` service is available at `http://example.com`.
### Creating and Running
The following code demonstrates how to create a remote sandbox using `RemotePyBoxManager` and execute code within it:
```python
from uuid import uuid4
from pybox import RemotePyBoxManager, PyBoxOut
# Initialize the remote sandbox manager, replacing with the actual Enterprise Gateway service address
pybox_manager = RemotePyBoxManager(host="http://example.com")
# Assign a unique Kernel ID
kernel_id = str(uuid4())
# Start the remote sandbox environment
box = pybox_manager.start(kernel_id)
# Define the test code
test_code = """
import math
result = math.sqrt(16)
result
"""
# Run the code in the sandbox
out: PyBoxOut = box.run(code=test_code)
# Print the execution result
print(out)
```
### Example Output
After executing the above code, the system will return the following output, indicating successful execution without any errors:
```plaintext
data=[{'text/plain': '4.0'}] error=None
```
### Advanced Environment Configuration
The `RemotePyBoxManager` provides the following advanced configuration options to allow for flexible customization of the sandbox execution environment:
1. **`env_file`**: Allows you to load environment variables from a file to configure the remote sandbox.
2. **`kernel_env`**: Enables you to pass environment variables directly as key-value pairs, simplifying the setup process.
To learn more about the parameters and configuration options, refer to the [Kernel Environment Variables](https://jupyter-enterprise-gateway.readthedocs.io/en/latest/users/kernel-envs.html) documentation.
## Cluster Environment
In a Kubernetes cluster, `Code Sandbox` leverages the `KubePyBoxManager` provided by the `pybox` library to create and manage sandboxes. Unlike the `remote environment`, the cluster environment **communicates directly with Kernel Pods** created by the [Jupyter Kernel Controller](https://github.com/edwardzjl/jupyter-kernel-controller), eliminating the need for an intermediary service like `Enterprise Gateway`.
### Configuring
Before using the cluster environment, you need to deploy the `jupyter-kernel-controller` service. You can quickly create the required CRDs and Deployments using the [Deploy Documentation](https://github.com/edwardzjl/jupyter-kernel-controller?tab=readme-ov-file#build-run-deploy).
### Creating and Running
Once the `jupyter-kernel-controller` service is successfully deployed and running, you can create and run a cluster sandbox using the following code:
```python
from uuid import uuid4
from pybox import KubePyBoxManager, PyBoxOut
# Initialize the cluster sandbox manager, replacing with actual paths and environment variable configurations
pybox_manager = KubePyBoxManager(
    env_file="YOUR_ENV_FILE_PATH",  # Path to the environment variable file
    kernel_env={"YOUR_ENV_KEY": "YOUR_ENV_VALUE"},  # Kernel environment variables as a dict of key-value pairs
)
# Assign a unique Kernel ID
kernel_id = str(uuid4())
# Start the cluster sandbox environment
box = pybox_manager.start(kernel_id)
# Define the test code
test_code = """
import math
result = math.sqrt(16)
result
"""
# Run the code in the sandbox
out: PyBoxOut = box.run(code=test_code)
# Print the execution result
print(out)
```
### Example Output
After executing the code above, the following output will be returned, indicating successful execution without any errors:
```plaintext
data=[{'text/plain': '4.0'}] error=None
```
**NOTE:** The `env_file` and `kernel_env` parameters required by `KubePyBoxManager` are essentially the same as those for `RemotePyBoxManager`. For detailed information about these parameters, please refer to the [RemotePyBoxManager Advanced Environment Configuration](#advanced-environment-configuration).
With the above configuration, you can efficiently manage secure and reliable sandboxes in a Kubernetes cluster, with flexible control over execution and scaling.
================================================
FILE: docs/explanation/file-reading.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "229c30c0-9715-48a2-b5fe-ee8c733d847a",
"metadata": {},
"source": [
"# File Reading\n",
"\n",
"When working with dataset files, maintaining a clear separation between file reading and data analysis workflows can significantly improve control and clarity. At TableGPT Agent, we've designed a robust and structured approach to handling file reading that empowers the LLM (Large Language Model) to effectively analyze dataset files without being overwhelmed by unnecessary details. This method not only enhances the LLM's ability to inspect the data but also ensures a smoother and more reliable data analysis process.\n",
"\n",
"Traditionally, allowing an LLM to directly inspect a dataset might involve simply calling the `df.head()` function to preview its content. While this approach suffices for straightforward use cases, it often lacks depth when dealing with more complex or messy datasets. To address this, we've developed a multi-step file reading workflow designed to deliver richer insights into the dataset structure while preparing it for advanced analysis."
]
},
{
"cell_type": "markdown",
"id": "a6ffbe96-f066-4b10-a743-0e9da6d41cbd",
"metadata": {},
"source": [
"**Here's how the workflow unfolds:**"
]
},
{
"cell_type": "markdown",
"id": "f9ba4763-5784-4c39-8e99-6156061e35bf",
"metadata": {},
"source": [
"## Normalization (Optional)\n",
"\n",
"Not all files are immediately suitable for direct analysis. Excel files, in particular, can pose challenges—irregular formatting, merged cells, and inconsistent headers are just a few examples. To tackle these issues, we introduce an optional normalization step that preprocesses the data, transforming it into a format that is “pandas-friendly.”\n",
"\n",
"This step addresses the most common quirks in Excel files, such as non-standard column headers, inconsistent row structures, or missing metadata. By resolving these issues upfront, the data integrates smoothly with downstream processes.\n",
"\n",
"**Example Scenario:**\n",
"\n",
"Imagine you have an Excel file that looks like this:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c83bfe50-176b-4781-a4f6-ba809aa54750",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th colspan=\"9\" halign=\"left\">产品生产统计表</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>生产日期</th>\n",
" <th>制造编号</th>\n",
" <th>产品名称</th>\n",
" <th>预定产量</th>\n",
" <th colspan=\"2\" halign=\"left\">本日产量</th>\n",
" <th>累计产量</th>\n",
" <th colspan=\"2\" halign=\"left\">耗费工时</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>Unnamed: 0_level_2</th>\n",
" <th>Unnamed: 1_level_2</th>\n",
" <th>Unnamed: 2_level_2</th>\n",
" <th>Unnamed: 3_level_2</th>\n",
" <th>预计</th>\n",
" <th>实际</th>\n",
" <th>Unnamed: 6_level_2</th>\n",
" <th>本日</th>\n",
" <th>累计</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2007-08-10 00:00:00</td>\n",
" <td>FK-001</td>\n",
" <td>猕猴桃果肉饮料</td>\n",
" <td>100000.0</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>10.0</td>\n",
" <td>20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2007-08-11 00:00:00</td>\n",
" <td>FK-002</td>\n",
" <td>西瓜果肉饮料</td>\n",
" <td>100000.0</td>\n",
" <td>40000</td>\n",
" <td>44000</td>\n",
" <td>82000</td>\n",
" <td>9.0</td>\n",
" <td>18.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2007-08-12 00:00:00</td>\n",
" <td>FK-003</td>\n",
" <td>草莓果肉饮料</td>\n",
" <td>100000.0</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>9.0</td>\n",
" <td>18.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2007-08-13 00:00:00</td>\n",
" <td>FK-004</td>\n",
" <td>蓝莓果肉饮料</td>\n",
" <td>100000.0</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>9.0</td>\n",
" <td>18.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2007-08-14 00:00:00</td>\n",
" <td>FK-005</td>\n",
" <td>水密桃果肉饮料</td>\n",
" <td>100000.0</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>10.0</td>\n",
" <td>20.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 产品生产统计表 \n",
" 生产日期 制造编号 产品名称 预定产量 本日产量 累计产量 耗费工时 \n",
" Unnamed: 0_level_2 Unnamed: 1_level_2 Unnamed: 2_level_2 Unnamed: 3_level_2 预计 实际 Unnamed: 6_level_2 本日 累计\n",
"0 2007-08-10 00:00:00 FK-001 猕猴桃果肉饮料 100000.0 40000 45000 83000 10.0 20.0\n",
"1 2007-08-11 00:00:00 FK-002 西瓜果肉饮料 100000.0 40000 44000 82000 9.0 18.0\n",
"2 2007-08-12 00:00:00 FK-003 草莓果肉饮料 100000.0 40000 45000 83000 9.0 18.0\n",
"3 2007-08-13 00:00:00 FK-004 蓝莓果肉饮料 100000.0 40000 45000 83000 9.0 18.0\n",
"4 2007-08-14 00:00:00 FK-005 水密桃果肉饮料 100000.0 40000 45000 83000 10.0 20.0"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the data into a DataFrame\n",
"df1 = read_df('产品生产统计表.xlsx', header=[0, 1, 2])\n",
"df1.head(5)"
]
},
{
"cell_type": "markdown",
"id": "0062d50e-63c9-4be2-bc15-7ebc15b23e4e",
"metadata": {},
"source": [
"The file is riddled with merged cells, empty rows, and redundant formatting that make it incompatible with pandas. If you try to load this file directly, pandas might misinterpret the structure or fail to parse it entirely.\n",
"\n",
"With our normalization feature, irregular datasets can be seamlessly transformed into clean, structured formats. When using the `create_tablegpt_graph` method, simply pass the `normalize_llm` parameter. The system will automatically analyze the irregular data and generate the appropriate transformation code, ensuring the dataset is prepared in the optimal format for further analysis.\n",
"\n",
"Below is an example of the code generated for the provided irregular dataset:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "9933fabd-d951-4da6-9bcd-a6511b12bc1b",
"metadata": {},
"outputs": [],
"source": [
"# Normalize the data\n",
"try:\n",
" df = df1.copy()\n",
"\n",
" import pandas as pd\n",
"\n",
" # Assuming the original data is loaded into a DataFrame named df\n",
" # Here is the transformation process:\n",
"\n",
" # Step 1: Isolate the Table Header\n",
" # Remove the unnecessary top rows and columns\n",
" final_df = df.iloc[2:, :9].copy()\n",
"\n",
" # Step 2: Rename Columns of final_df\n",
" # Adjust the column names to match the desired format\n",
" final_df.columns = ['生产日期', '制造编号', '产品名称', '预定产量', '本日产量预计', '本日产量实际', '累计产量', '本日耗费工时', '累计耗费工时']\n",
"\n",
" # Step 3: Data Processing\n",
" # Ensure there are no NaN values and drop any duplicate rows if necessary\n",
" final_df.dropna(inplace=True)\n",
" final_df.drop_duplicates(inplace=True)\n",
"\n",
" # Convert the appropriate columns to numeric types\n",
" final_df['预定产量'] = final_df['预定产量'].astype(int)\n",
" final_df['本日产量预计'] = final_df['本日产量预计'].astype(int)\n",
" final_df['本日产量实际'] = final_df['本日产量实际'].astype(int)\n",
" final_df['累计产量'] = final_df['累计产量'].astype(int)\n",
" final_df['本日耗费工时'] = final_df['本日耗费工时'].astype(int)\n",
" final_df['累计耗费工时'] = final_df['累计耗费工时'].astype(int)\n",
"\n",
" # Display the transformed DataFrame\n",
" if final_df.columns.tolist() == final_df.iloc[0].tolist():\n",
" final_df = final_df.iloc[1:]\n",
"\n",
" # reassign df1 with the formatted DataFrame\n",
" df1 = final_df\n",
"except Exception as e:\n",
" # Unable to apply formatting to the original DataFrame. proceeding with the unformatted DataFrame.\n",
" print(f\"Reformat failed with error {e}, use the original DataFrame.\")"
]
},
{
"cell_type": "markdown",
"id": "2b589ac3-405c-4350-84f7-bf675ddaaa06",
"metadata": {},
"source": [
"Using the generated transformation code, the irregular dataset is converted into a clean, structured format, ready for analysis:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "76efd557-333b-46c3-a697-644a84b8e6ec",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>生产日期</th>\n",
" <th>制造编号</th>\n",
" <th>产品名称</th>\n",
" <th>预定产量</th>\n",
" <th>本日产量预计</th>\n",
" <th>本日产量实际</th>\n",
" <th>累计产量</th>\n",
" <th>本日耗费工时</th>\n",
" <th>累计耗费工时</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2007-08-12 00:00:00</td>\n",
" <td>FK-003</td>\n",
" <td>草莓果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2007-08-13 00:00:00</td>\n",
" <td>FK-004</td>\n",
" <td>蓝莓果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2007-08-14 00:00:00</td>\n",
" <td>FK-005</td>\n",
" <td>水密桃果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>10</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2007-08-15 00:00:00</td>\n",
" <td>FK-006</td>\n",
" <td>荔枝果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>44000</td>\n",
" <td>82000</td>\n",
" <td>10</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2007-08-16 00:00:00</td>\n",
" <td>FK-007</td>\n",
" <td>樱桃果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>46000</td>\n",
" <td>84000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 生产日期 制造编号 产品名称 预定产量 本日产量预计 本日产量实际 累计产量 本日耗费工时 累计耗费工时\n",
"2 2007-08-12 00:00:00 FK-003 草莓果肉饮料 100000 40000 45000 83000 9 18\n",
"3 2007-08-13 00:00:00 FK-004 蓝莓果肉饮料 100000 40000 45000 83000 9 18\n",
"4 2007-08-14 00:00:00 FK-005 水密桃果肉饮料 100000 40000 45000 83000 10 20\n",
"5 2007-08-15 00:00:00 FK-006 荔枝果肉饮料 100000 40000 44000 82000 10 20\n",
"6 2007-08-16 00:00:00 FK-007 樱桃果肉饮料 100000 40000 46000 84000 9 18"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1.head(5)"
]
},
{
"cell_type": "markdown",
"id": "ac19e7a0-5487-4e02-80c0-f7dc48903df4",
"metadata": {},
"source": [
"## Dataset Structure Overview \n",
"\n",
"After normalization, the next step examines the structural aspects of the dataset using the `df.info()` method. Unlike `df.head()`, which only shows a snippet of the data, `df.info()` provides a holistic view of the dataset’s structure. Key insights include:\n",
"\n",
"- **Column Data Types**: Helps identify numerical, categorical, or textual data at a glance.\n",
"- **Non-Null Counts**: Reveals the completeness of each column, making it easy to spot potential gaps or inconsistencies.\n",
"\n",
"By focusing on the foundational structure of the dataset, this step enables the LLM to better understand the quality and layout of the data, paving the way for more informed analyses."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2acf71a1-0e81-4f14-973e-05dfe1c9d963",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 18 entries, 2 to 19\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 生产日期 18 non-null object\n",
" 1 制造编号 18 non-null object\n",
" 2 产品名称 18 non-null object\n",
" 3 预定产量 18 non-null int64 \n",
" 4 本日产量预计 18 non-null int64 \n",
" 5 本日产量实际 18 non-null int64 \n",
" 6 累计产量 18 non-null int64 \n",
" 7 本日耗费工时 18 non-null int64 \n",
" 8 累计耗费工时 18 non-null int64 \n",
"dtypes: int64(6), object(3)"
]
}
],
"source": [
"# Remove leading and trailing whitespaces in column names\n",
"df1.columns = df1.columns.str.strip()\n",
"\n",
"# Remove rows and columns that contain only empty values\n",
"df1 = df1.dropna(how='all').dropna(axis=1, how='all')\n",
"\n",
"# Get the basic information of the dataset\n",
"df1.info(memory_usage=False)"
]
},
{
"cell_type": "markdown",
"id": "226cebc1-e38c-4da8-8455-662db9c152f6",
"metadata": {},
"source": [
"## Dataset Content Preview\n",
"\n",
"Finally, we use the `df.head()` method to provide a **visual preview of the dataset’s content**. This step is crucial for understanding the actual values within the dataset—patterns, anomalies, or trends often become apparent here.\n",
"\n",
"The number of rows displayed (`n`) is configurable to balance between granularity and simplicity. For smaller datasets or detailed exploration, a larger `n` might be beneficial. However, for larger datasets, displaying too many rows could overwhelm the LLM with excessive details, detracting from the primary analytical objectives."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "37ffeb0f-a80f-4ca8-9fda-fa2054870acf",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>生产日期</th>\n",
" <th>制造编号</th>\n",
" <th>产品名称</th>\n",
" <th>预定产量</th>\n",
" <th>本日产量预计</th>\n",
" <th>本日产量实际</th>\n",
" <th>累计产量</th>\n",
" <th>本日耗费工时</th>\n",
" <th>累计耗费工时</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2007-08-12 00:00:00</td>\n",
" <td>FK-003</td>\n",
" <td>草莓果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2007-08-13 00:00:00</td>\n",
" <td>FK-004</td>\n",
" <td>蓝莓果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2007-08-14 00:00:00</td>\n",
" <td>FK-005</td>\n",
" <td>水密桃果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>10</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2007-08-15 00:00:00</td>\n",
" <td>FK-006</td>\n",
" <td>荔枝果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>44000</td>\n",
" <td>82000</td>\n",
" <td>10</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2007-08-16 00:00:00</td>\n",
" <td>FK-007</td>\n",
" <td>樱桃果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>46000</td>\n",
" <td>84000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 生产日期 制造编号 产品名称 预定产量 本日产量预计 本日产量实际 累计产量 本日耗费工时 累计耗费工时\n",
"2 2007-08-12 00:00:00 FK-003 草莓果肉饮料 100000 40000 45000 83000 9 18\n",
"3 2007-08-13 00:00:00 FK-004 蓝莓果肉饮料 100000 40000 45000 83000 9 18\n",
"4 2007-08-14 00:00:00 FK-005 水密桃果肉饮料 100000 40000 45000 83000 10 20\n",
"5 2007-08-15 00:00:00 FK-006 荔枝果肉饮料 100000 40000 44000 82000 10 20\n",
"6 2007-08-16 00:00:00 FK-007 樱桃果肉饮料 100000 40000 46000 84000 9 18"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show the first 5 rows to understand the structure\n",
"df1.head(5)"
]
},
{
"cell_type": "markdown",
"id": "a7b90c69-cf82-4711-9229-242113d30804",
"metadata": {},
"source": [
"## Why This Matters\n",
"\n",
"This structured, multi-step approach is not just about processing data; it's about making the LLM smarter in how it interacts with datasets. By systematically addressing issues like messy formatting, structural ambiguity, and information overload, we ensure the LLM operates with clarity and purpose.\n",
"\n",
"The separation of file reading from analysis offers several advantages:\n",
"\n",
"- **Enhanced Accuracy:** Preprocessing and structure-checking reduce the risk of errors in downstream analyses.\n",
"- **Scalability:** Handles datasets of varying complexity and size with equal efficiency.\n",
"- **Transparency:** Provides clear visibility into the dataset’s structure, enabling better decision-making.\n",
"\n",
"By adopting this method, TableGPT Agent transforms the way dataset files are read and analyzed, offering a smarter, more controlled, and ultimately more **user-friendly experience**."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/explanation/ipython-startup-scripts.md
================================================
# IPython Startup Scripts
<!-- Placeholder -->
================================================
FILE: docs/howto/cleanup-error-trace.md
================================================
# Cleanup Error Trace
<!-- Placeholder -->
================================================
FILE: docs/howto/customize-table-info.md
================================================
# Customize Table Info
<!-- Placeholder -->
================================================
FILE: docs/howto/incluster-code-execution.md
================================================
# Incluster Code Execution
The `tablegpt-agent` instructs the `tablegpt` model to generate Python code for data analysis. That code is then executed inside a sandbox environment so that the host system stays secure. Execution is managed by the [pybox](https://github.com/edwardzjl/pybox) library, which provides a simple way to run Python code outside the main process.
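The out-of-process execution model can be illustrated with a minimal stdlib sketch. This is **not** pybox's actual API (see the pybox repository for that); it only shows the underlying idea of running generated code in a separate interpreter with a hard timeout:

```python
import subprocess
import sys


def run_sandboxed(code: str, timeout: float = 10.0) -> str:
    """Run generated Python code in a separate interpreter process.

    A child process gives crash isolation and a hard timeout; a real
    sandbox (such as the environments pybox manages) layers stronger
    isolation on top, e.g. a dedicated kernel or container.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout


print(run_sandboxed("print(1 + 1)"))  # prints "2"
```

Because the generated code runs in a child process, an infinite loop or crash in it cannot take down the agent itself; pybox provides a similar separation via managed kernels.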
================================================
FILE: docs/howto/messages-truncation.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Messages Truncation\n",
"\n",
"Sometimes LLM services may have limited capacity to handle long messages, which can result in 400 status code errors. Therefore, we need to implement message truncation to keep message lengths within the LLM service's capabilities.\n",
"\n",
"The `tablegpt-agent` provides a `TruncationConfig` class to specify truncation settings for the LLM and VLM.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Too long messages without truncation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from datetime import date\n",
"\n",
"from langchain_core.messages import HumanMessage, AIMessage, SystemMessage\n",
"from langchain_openai import ChatOpenAI\n",
"from tablegpt.agent import create_tablegpt_graph\n",
"from pybox import AsyncLocalPyBoxManager\n",
"pybox_manager = AsyncLocalPyBoxManager()\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Assuming the model service supports max_model_len=1024, which means input_tokens + max_completion_tokens <= 1024\n",
"llm = ChatOpenAI(openai_api_base=\"YOUR_VLLM_URL\", openai_api_key=\"whatever\", model_name=\"TableGPT2-7B\", max_tokens=256)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"agent_without_truncation = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
" SystemMessage(content=\"你是一个友好的AI助手\"),\n",
" HumanMessage(content=\"你能给我讲一个关于大语言模型的故事吗?\"),\n",
" AIMessage(content=\"当然可以。让我们从大语言模型的起源开始讲起。一切要从2017年谷歌提出的Transformer架构说起。这个创新性的架构为后来的GPT、BERT等模型奠定了基础。Transformer架构引入了自注意力机制,能够更好地处理序列数据中的长距离依赖关系。这一突破性进展使得模型能够更好地理解文本的上下文语境,为自然语言处理领域带来了革命性的变化。在此基础上,OpenAI于2018年发布了第一代GPT模型,随后又相继推出了GPT-2和GPT-3,每一代都在规模和性能上有显著提升。同时,谷歌推出的BERT模型采用了双向编码器架构,在多个自然语言理解任务上取得了突破性进展。这些模型的成功激发了更多研究者和机构投入到大语言模型的研发中,推动了整个领域的快速发展。现在,我们已经看到像GPT-4这样的模型展现出令人惊叹的能力,不仅能够进行基础的文本生成,还能够理解上下文、进行推理、解决复杂问题,甚至展现出一定程度的创造力...\"),\n",
" \n",
" HumanMessage(content=\"那AI是如何学习理解人类语言的呢?\"),\n",
" AIMessage(content=\"这是个很好的问题。AI通过大量的文本数据训练来理解语言。它使用自注意力机制来捕捉词语之间的关系,通过预训练和微调两个阶段,逐步掌握语言的规律。在预训练阶段,模型会阅读海量的文本,学习语言的基本模式。这个过程就像一个婴儿通过观察和模仿来学习语言一样。模型会分析数十亿甚至数千亿个词语,理解它们之间的关联和使用规律。在这个过程中,模型会建立起一个复杂的神经网络,每个神经元都负责捕捉特定的语言特征。通过反向传播算法,模型不断调整其内部参数,以更好地预测和理解语言。在微调阶段,模型会针对特定任务进行专门训练,比如问答、摘要生成或情感分析等。这就像人类在掌握基本语言能力后,进一步学习专业词汇和特定领域的表达方式。模型通过大量的实例学习,逐渐理解语言中的细微差别,包括语境、语气、隐含意义等。这个学习过程是持续的,模型通过不断接触新的语言样本来完善自己的理解能力...\"),\n",
" \n",
" HumanMessage(content=\"训练过程中会遇到什么挑战?\"),\n",
" AIMessage(content=\"训练大语言模型面临着多重挑战。首先是计算资源的需求,训练大模型需要数千台GPU和数月时间。这不仅带来了巨大的经济成本,还面临着能源消耗和环境影响的问题。一个大型语言模型的训练可能消耗数百万度电,相当于数千个家庭一年的用电量。其次是高质量数据的获取和处理问题。模型需要海量的训练数据,但这些数据必须经过严格的筛选和清洗。数据中可能包含偏见、歧视、不当内容等有害信息,如果不经过处理,这些问题会被模型学习并在输出中体现出来。此外,数据的多样性和代表性也是一个重要问题,需要确保数据能够覆盖不同语言、文化和领域的知识。在训练过程中,还面临着模型优化的技术挑战,比如梯度消失、过拟合、训练不稳定等问题。这需要研究人员不断改进训练算法和策略。另外,模型的知识更新也是一个持续的挑战,因为世界在不断变化,新的信息和知识在不断产生,如何让模型保持最新的知识状态是一个重要问题...\"),\n",
" \n",
" HumanMessage(content=\"大语言模型是如何生成回答的?\"),\n",
" AIMessage(content=\"大语言模型生成回答的过程非常有趣且复杂。当模型收到一个问题或提示时,它首先会通过其编码器将输入转换为高维向量表示。这个过程中,模型会考虑输入的每个词语及其上下文关系。通过多层自注意力机制,模型能够理解输入中的关键信息和语义结构。在生成回答时,模型会一个词一个词地预测最合适的内容。每生成一个词,都会基于之前生成的所有内容和原始输入来决定下一个最合适的词。这个过程使用了复杂的概率分布计算,模型会为词表中的每个候选词计算一个概率分数,然后选择最合适的词。这种选择不仅要考虑语法正确性,还要考虑语义连贯性和上下文相关性。模型还会使用注意力机制来关注输入中的不同部分,确保生成的内容与问题相关。在生成过程中,模型会维护一个内部状态,这个状态包含了之前生成内容的信息,帮助确保整个回答的连贯性和完整性。同时,模型还会使用各种策略来控制生成的质量,比如温度参数来调节创造性,或使用束搜索等技术来优化输出质量...\"),\n",
" \n",
" HumanMessage(content=\"它们真的理解我们说的话吗?\"),\n",
" AIMessage(content=\"这涉及到AI意识和理解力的哲学问题,这是一个复杂且富有争议的话题。虽然大语言模型能够产生看似智能的回答,但它们是否真正'理解'仍然存在很大争议。从技术角度来看,这些模型本质上是非常复杂的统计模型,它们通过分析大量文本数据中的模式来生成回应。它们的'理解'更像是一种高级的模式匹配和概率预测,而不是像人类那样具有真正的理解和意识。人类的理解涉及到感知、经验、情感和意识等多个层面,而AI目前还无法真正复制这些特质。例如,当我们理解'苹果'这个词时,我们不仅知道它的字面含义,还能联想到它的味道、触感、气味,以及与之相关的个人经历和情感。而AI模型只能基于文本数据中的统计关系来处理这个词。此外,人类的理解还包括常识推理、因果关系理解、情境适应等能力,这些都是当前AI系统还难以完全掌握的。虽然模型可以生成符合语法和语义的回答,但它们可能缺乏真正的理解深度,有时会产生表面上合理但实际上没有意义的回答...\"),\n",
" \n",
" HumanMessage(content=\"未来大语言模型会变得更强大吗?\"),\n",
" AIMessage(content=\"随着技术进步,大语言模型的能力确实在不断提升,未来发展前景令人期待。从技术层面来看,我们正在见证模型规模的持续增长,算法的不断改进,以及训练数据的持续扩充。未来的发展方向可能包括多个突破性的领域:首先是多模态理解能力的提升,未来的模型不仅能处理文本,还能更好地理解图像、音频、视频等多种形式的信息,实现跨模态的理解和生成。其次是持续学习能力的发展,使模型能够像人类一样不断学习和更新知识,而不是停留在训练时的知识状态。在推理能力方面,未来的模型可能会具备更强的逻辑推理、因果推理和抽象思维能力,能够处理更复杂的问题和任务。此外,模型的效率也会得到提升,通过更先进的架构设计和训练方法,实现更低的计算成本和能源消耗。在应用领域,我们可能会看到更专业化的模型出现,针对特定领域如医疗、法律、科研等提供更精准的服务。同时,模型的可解释性和可控性也会得到改善,使其决策过程更透明、更可靠...\"),\n",
" \n",
" HumanMessage(content=\"我们应该如何负责任地使用AI技术?\"),\n",
" AIMessage(content=\"这需要我们在多个层面做出努力,构建一个负责任的AI使用生态系统。首先在技术层面,我们需要确保AI系统的安全性和可控性。这包括开发强大的安全机制,防止系统被滥用或产生有害输出;建立有效的监测和控制机制,确保系统行为符合预期;实施严格的测试和验证流程,评估系统的可靠性和稳定性。在伦理层面,我们需要建立完善的使用准则和监管框架。这包括制定明确的AI伦理原则,规范AI的开发和使用;建立行业标准和最佳实践指南;设立独立的监督机构,确保AI技术的使用符合公共利益。在隐私保护方面,我们需要采取严格的数据保护措施,确保用户数据的安全性和隐私性;建立透明的数据使用政策;给予用户对其数据的控制权。在偏见消除方面,我们需要持续努力提高训练数据的多样性和代表性;开发更公平的算法;定期评估和消除系统中的偏见。在环境影响方面,我们需要关注AI系统的能源消耗和碳排放;开发更环保的计算方案;推动绿色AI技术的发展...\"),\n",
" \n",
" HumanMessage(content=\"你觉得AI会取代人类吗?\"),\n",
" AIMessage(content=\"AI不应该也不会完全取代人类,这个问题需要从多个角度深入思考。首先,虽然AI在某些特定任务上可能超越人类,但人类具有许多AI难以复制的独特优势。人类的创造力是独特的,我们能够产生原创性的想法,进行艺术创作,提出创新性的解决方案。人类的情感共鸣能力也是无可替代的,我们能够理解和分享他人的情感,建立深层的情感联系,这是当前AI技术远远无法达到的。在道德判断和价值观方面,人类能够基于复杂的伦理考虑做出决策,而AI系统往往难以处理需要道德权衡的情况。此外,人类具有自主意识和主观体验,这些都是AI所不具备的特质。未来的发展方向应该是人机协作,让AI成为增强人类能力的工具,而不是替代品。在这种协作模式下,AI可以处理重复性、计算密集型的任务,而人类则专注于需要创造力、情感理解和道德判断的工作。我们需要明智地使用AI技术,确保它始终服务于人类福祉,而不是反过来控制或限制人类的发展...\"),\n",
" HumanMessage(content=\"你认为未来的AI会怎么发展?\")\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Error code: 400 - {'object': 'error', 'message': \"This model's maximum context length is 1024 tokens. However, you requested 2406 tokens (2150 in the messages, 256 in the completion). Please reduce the length of the messages or completion.\", 'type': 'BadRequestError', 'param': None, 'code': 400}\n"
]
}
],
"source": [
"_input = {\n",
" \"messages\": messages,\n",
" \"parent_id\": \"some-parent-id\",\n",
" \"date\": date.today(), # noqa: DTZ011\n",
"}\n",
"\n",
"try:\n",
" await agent_without_truncation.ainvoke(input=_input)\n",
"except Exception as e:\n",
" print(e)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Too long messages with truncation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TruncationConfig settings in `create_tablegpt_graph`\n",
"- `llm_truncation_config`: Truncate messages sent to pure language models\n",
"- `vlm_truncation_config`: Truncate messages sent to vision+language multimodal models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"> In the following examples, we use the number of messages as the truncation metric.\n",
"\n",
"**For custom trim settings based on your LLM service (e.g., vLLM, TGI, SGLang), see this [example](https://github.com/edwardzjl/chatbot/blob/main/api/chatbot/llm_providers.py#L67) or implement your own.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create TruncationConfig\n",
"\n",
"**The parameters set in `TruncationConfig` will be used in `langchain_core.messages.trim_messages`, see [trim_messages documentation](https://python.langchain.com/docs/how_to/trim_messages/)**"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# import truncation config\n",
"from tablegpt.agent.data_analyzer import TruncationConfig"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# token_counter=len: count each message as one token, i.e. truncate by number of messages\n",
"# max_tokens=5: keep at most 5 messages after truncation\n",
"# start_on=\"human\": the kept messages must start with a human message\n",
"llm_truncation_config = TruncationConfig(token_counter=len, max_tokens=5, start_on=\"human\")"
]
},
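{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for these settings: with `token_counter=len` each message counts as one \"token\", so `max_tokens=5` keeps at most the last five messages, and `start_on=\"human\"` additionally drops leading messages until a human message comes first. A simplified plain-Python sketch of this logic (for illustration only; the actual trimming is done by `langchain_core.messages.trim_messages`):\n",
"\n",
"```python\n",
"def trim_by_message_count(messages, max_tokens=5, start_on=\"human\"):\n",
"    # Keep at most `max_tokens` trailing messages (one \"token\" per message) ...\n",
"    kept = messages[-max_tokens:]\n",
"    # ... then drop leading messages until one with role `start_on` comes first.\n",
"    while kept and kept[0][0] != start_on:\n",
"        kept = kept[1:]\n",
"    return kept\n",
"```"
]
},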
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"agent = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager,\n",
" llm_truncation_config=llm_truncation_config\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"未来AI的发展可能会基于一系列先进的技术和科学突破,以下是一些可能的发展方向:\n",
"\n",
"1. **增强现实与虚拟现实**:AI将能够提供更加沉浸式的体验,例如增强现实和虚拟现实技术,使用户能够更自然地与虚拟环境互动。这将改变我们获取知识、工作和娱乐的方式。\n",
"\n",
"2. **神经网络与深度学习**:神经网络和深度学习将变得更加强大和通用,能够处理更多样化的问题和数据。例如,在医疗诊断、自动驾驶和智能制造等领域,AI可以提供更准确、更高效的解决方案。\n",
"\n",
"3. **更强的计算能力**:AI将实现更强大的计算能力,能够处理更复杂、更大规模的数据。这将推动许多行业,如金融、医疗和科学研究,从传统的人工智能转型到AI驱动的新技术。\n",
"\n",
"4. **更自然的交互**:AI将能够更好地理解和模拟人类的自然语言和行为,使人类与AI能够更自然、更流畅地交流。这将使人类和AI之间的互动更加无缝。\n",
"\n",
"5. **伦理和法律**:随着AI技术的发展,伦理和法律问题将越来越重要。我们需要制定明确的AI伦理准则,确保AI技术的使用符合道德规范。这需要跨\n"
]
}
],
"source": [
"_input = {\n",
" \"messages\": messages,\n",
" \"parent_id\": \"some-parent-id\",\n",
" \"date\": date.today(), # noqa: DTZ011\n",
"}\n",
"try:\n",
" res = await agent.ainvoke(input=_input)\n",
" print(res[\"messages\"][-1].content)\n",
"except Exception as e:\n",
" print(e)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: docs/howto/normalize-datasets.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "592d977a-34b0-42f0-879c-1e8afe5cb134",
"metadata": {},
"source": [
"# Normalize Datasets\n",
"\n",
"The Dataset Normalizer plugin is used to transform 'pandas-unfriendly' datasets (e.g., Excel files that do not follow a standard tabular structure) into a more suitable format for pandas. It is backed by an LLM that generates Python code to convert the original datasets into new ones.\n",
"\n",
"In `tablegpt-agent`, this plugin is used to better format 'pandas-unfriendly' datasets, making them more understandable for the subsequent steps. This plugin is optional; if used, it serves as the very first step in the [File Reading Workflow](../../explanation/file-reading), easing the difficulty of data analysis in the subsequent workflow.\n",
"\n",
"## Introduction\n",
"\n",
"The `Dataset Normalizer` is a specialized tool designed to tackle challenges that arise when working with irregular and poorly structured datasets. These challenges are especially prevalent in Excel files, which are often used as a flexible but inconsistent way of storing data.\n",
"\n",
"Analyzing Excel data files can pose significant challenges, such as:\n",
"\n",
"- **Irregular Formatting:** Datasets may lack a consistent tabular structure, with varying cell sizes or non-standard layouts.\n",
"- **Merged Cells:** Cells spanning multiple rows or columns can disrupt parsing tools.\n",
"- **Inconsistent Headers:** Columns may have incomplete, redundant, or nested headers.\n",
"- **Hidden Data:** Data may be stored in additional sheets or rely on calculated fields that are not directly accessible.\n",
"- **Mixed Data Types:** Columns may contain inconsistent data types, such as numbers mixed with text.\n",
"- **Empty or Placeholder Rows:** Extra rows with missing or irrelevant data can complicate data loading and analysis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note:** When the `Dataset Normalizer` is enabled, reading a dataset is noticeably slower, because the normalizer must first analyze the dataset and generate transformation code, which takes considerable time.\n",
">\n",
"> Normalization handles most common data irregularities effectively. For more complex datasets, further tuning may be needed, and the results depend on the specific normalization model used."
]
},
{
"cell_type": "markdown",
"id": "bc22a838",
"metadata": {},
"source": [
"## Quick Start\n",
"\n",
"To enable the `Dataset Normalizer`, ensure you pass it as a parameter when creating the `tablegpt-agent`. You can follow the example below:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7d892b2e-63ea-47bf-8bf7-ed4dcb7ed876",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"from langchain_openai import ChatOpenAI\n",
"from pybox import AsyncLocalPyBoxManager\n",
"from tablegpt.agent import create_tablegpt_graph\n",
"from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR\n",
"\n",
"llm = ChatOpenAI(openai_api_base=\"YOUR_VLLM_URL\", openai_api_key=\"whatever\", model_name=\"TableGPT2-7B\")\n",
"normalize_llm = ChatOpenAI(openai_api_base=\"YOUR_VLLM_URL\", openai_api_key=\"whatever\", model_name=\"YOUR_VLLM_MODEL_NAME\")\n",
"pybox_manager = AsyncLocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)\n",
"\n",
"agent = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager,\n",
" normalize_llm=normalize_llm,\n",
" session_id=\"some-session-id\", # This is required when using file-reading\n",
")"
]
},
{
"cell_type": "markdown",
"id": "d350ea84-a0d4-4780-a8e1-d483be53ecaa",
"metadata": {},
"source": [
"Given an Excel file [产品生产统计表.xlsx](https://github.com/tablegpt/tablegpt-agent/blob/main/examples/datasets/产品生产统计表.xlsx) with merged cells and irregular headers:\n",
"\n",
"<table style=\"border: 1px solid black; border-collapse: collapse;\">\n",
" <thead>\n",
" <tr>\n",
" <th colspan=\"9\" style=\"text-align: center; font-size: 24px;\">产品生产统计表</th>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\">生产日期</th>\n",
" <th rowspan=\"2\">制造编号</th>\n",
" <th rowspan=\"2\">产品名称</th>\n",
" <th rowspan=\"2\">预定产量</th>\n",
" <th colspan=\"2\">本日产量</th>\n",
" <th rowspan=\"2\">累计产量</th>\n",
" <th colspan=\"2\">耗费工时</th>\n",
" </tr>\n",
" <tr>\n",
" <th>预计</th>\n",
" <th>实际</th>\n",
" <th>本日</th>\n",
" <th>累计</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>2007/8/10</td>\n",
" <td>FK-001</td>\n",
" <td>猕猴桃果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>10</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2007/8/11</td>\n",
" <td>FK-002</td>\n",
" <td>西瓜果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>44000</td>\n",
" <td>82000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2007/8/12</td>\n",
" <td>FK-003</td>\n",
" <td>草莓果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2007/8/13</td>\n",
" <td>FK-004</td>\n",
" <td>蓝莓果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "2ccec619",
"metadata": {},
"source": [
"Attach the file to be processed via the `additional_kwargs` of the `HumanMessage`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d7a09def-56ca-4889-b240-1c9c31d8f7cb",
"metadata": {},
"outputs": [],
"source": [
"from typing import TypedDict\n",
"from langchain_core.messages import HumanMessage\n",
"\n",
"class Attachment(TypedDict):\n",
" \"\"\"Contains at least one dictionary with the key filename.\"\"\"\n",
" filename: str\n",
"\n",
"attachment_msg = HumanMessage(\n",
" content=\"\",\n",
" # Make sure your IPython kernel can access this file path.\n",
" additional_kwargs={\"attachments\": [Attachment(filename=\"产品生产统计表.xlsx\")]},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "220d96fc",
"metadata": {},
"source": [
"Invoke the `tablegpt-agent` to normalize the datasets:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "e7fe0dfa-633b-4cf4-b9fc-51a64afc24ff",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[HumanMessage(content='', additional_kwargs={'attachments': [{'filename': '产品生产统计表.xlsx'}]}, response_metadata={}, id='452ef08b-a45f-49ce-8489-4062cd35d57c'),\n",
" AIMessage(content='我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查看 NaN 情况和数据类型。\\n```python\\n# Load the data into a DataFrame\\ndf = read_df(\\'产品生产统计表.xlsx\\')\\n# Normalize the data\\ntry:\\n df = df.copy()\\n\\n import pandas as pd\\n\\n # Assuming the original data is loaded into a DataFrame named `df`\\n\\n # Step 1: Isolate the Table Header\\n # Remove the unnecessary top rows and columns\\n df = df.drop([0, 1]) # Drop the first two rows which contain header information\\n df = df.reset_index(drop=True) # Reset the index to start from 0\\n\\n # Step 2: Store the Result as `final_df`\\n final_df = df\\n\\n # Step 3: Rename Columns of final_df\\n # Adjust the column names of final_df to match the desired format\\n final_df.columns = [\\'生产日期\\', \\'制造编号\\', \\'产品名称\\', \\'预定产量\\', \\'本日产量预计\\', \\'本日产量实际\\', \\'累计产量\\', \\'本日耗费工时\\', \\'累计耗费工时\\']\\n\\n # Step 4: Data Processing\\n # Ensure the data types are correct and remove any unnecessary rows or columns\\n final_df = final_df.dropna(subset=[\\'生产日期\\']) # Remove rows where the production date is missing\\n final_df = final_df.drop_duplicates() # Remove duplicate rows if any\\n final_df = final_df.reset_index(drop=True) # Reset the index again to ensure it starts from 0\\n\\n # Convert columns to appropriate data types\\n final_df[\\'预定产量\\'] = final_df[\\'预定产量\\'].astype(int)\\n final_df[\\'本日产量预计\\'] = final_df[\\'本日产量预计\\'].astype(int)\\n final_df[\\'本日产量实际\\'] = final_df[\\'本日产量实际\\'].astype(int)\\n final_df[\\'累计产量\\'] = final_df[\\'累计产量\\'].astype(int)\\n final_df[\\'本日耗费工时\\'] = final_df[\\'本日耗费工时\\'].astype(int)\\n final_df[\\'累计耗费工时\\'] = final_df[\\'累计耗费工时\\'].astype(int)\\n\\n # Display the transformed DataFrame\\n print(final_df)\\n if final_df.columns.tolist() == final_df.iloc[0].tolist():\\n final_df = final_df.iloc[1:]\\n\\n # reassign df with the formatted DataFrame\\n df = final_df\\nexcept Exception as e:\\n # Unable to apply formatting to the original DataFrame. 
proceeding with the unformatted DataFrame.\\n print(f\"Reformat failed with error {e}, use the original DataFrame.\")\\n# Remove leading and trailing whitespaces in column names\\ndf.columns = df.columns.str.strip()\\n\\n# Remove rows and columns that contain only empty values\\ndf = df.dropna(how=\\'all\\').dropna(axis=1, how=\\'all\\')\\n\\n# Get the basic information of the dataset\\ndf.info(memory_usage=False)\\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查看 NaN 情况和数据类型。', 'action': {'tool': 'python', 'tool_input': '# Load the data into a DataFrame\\ndf = read_df(\\'产品生产统计表.xlsx\\')\\n# Normalize the data\\ntry:\\n df = df.copy()\\n\\n import pandas as pd\\n\\n # Assuming the original data is loaded into a DataFrame named `df`\\n\\n # Step 1: Isolate the Table Header\\n # Remove the unnecessary top rows and columns\\n df = df.drop([0, 1]) # Drop the first two rows which contain header information\\n df = df.reset_index(drop=True) # Reset the index to start from 0\\n\\n # Step 2: Store the Result as `final_df`\\n final_df = df\\n\\n # Step 3: Rename Columns of final_df\\n # Adjust the column names of final_df to match the desired format\\n final_df.columns = [\\'生产日期\\', \\'制造编号\\', \\'产品名称\\', \\'预定产量\\', \\'本日产量预计\\', \\'本日产量实际\\', \\'累计产量\\', \\'本日耗费工时\\', \\'累计耗费工时\\']\\n\\n # Step 4: Data Processing\\n # Ensure the data types are correct and remove any unnecessary rows or columns\\n final_df = final_df.dropna(subset=[\\'生产日期\\']) # Remove rows where the production date is missing\\n final_df = final_df.drop_duplicates() # Remove duplicate rows if any\\n final_df = final_df.reset_index(drop=True) # Reset the index again to ensure it starts from 0\\n\\n # Convert columns to appropriate data types\\n final_df[\\'预定产量\\'] = final_df[\\'预定产量\\'].astype(int)\\n final_df[\\'本日产量预计\\'] = final_df[\\'本日产量预计\\'].astype(int)\\n final_df[\\'本日产量实际\\'] = 
final_df[\\'本日产量实际\\'].astype(int)\\n final_df[\\'累计产量\\'] = final_df[\\'累计产量\\'].astype(int)\\n final_df[\\'本日耗费工时\\'] = final_df[\\'本日耗费工时\\'].astype(int)\\n final_df[\\'累计耗费工时\\'] = final_df[\\'累计耗费工时\\'].astype(int)\\n\\n # Display the transformed DataFrame\\n print(final_df)\\n if final_df.columns.tolist() == final_df.iloc[0].tolist():\\n final_df = final_df.iloc[1:]\\n\\n # reassign df with the formatted DataFrame\\n df = final_df\\nexcept Exception as e:\\n # Unable to apply formatting to the original DataFrame. proceeding with the unformatted DataFrame.\\n print(f\"Reformat failed with error {e}, use the original DataFrame.\")\\n# Remove leading and trailing whitespaces in column names\\ndf.columns = df.columns.str.strip()\\n\\n# Remove rows and columns that contain only empty values\\ndf = df.dropna(how=\\'all\\').dropna(axis=1, how=\\'all\\')\\n\\n# Get the basic information of the dataset\\ndf.info(memory_usage=False)'}, 'model_type': None}, response_metadata={}, id='8e5d0026-215d-46e2-ab60-9174c5bf50bd', tool_calls=[{'name': 'python', 'args': {'query': '# Load the data into a DataFrame\\ndf = read_df(\\'产品生产统计表.xlsx\\')\\n# Normalize the data\\ntry:\\n df = df.copy()\\n\\n import pandas as pd\\n\\n # Assuming the original data is loaded into a DataFrame named `df`\\n\\n # Step 1: Isolate the Table Header\\n # Remove the unnecessary top rows and columns\\n df = df.drop([0, 1]) # Drop the first two rows which contain header information\\n df = df.reset_index(drop=True) # Reset the index to start from 0\\n\\n # Step 2: Store the Result as `final_df`\\n final_df = df\\n\\n # Step 3: Rename Columns of final_df\\n # Adjust the column names of final_df to match the desired format\\n final_df.columns = [\\'生产日期\\', \\'制造编号\\', \\'产品名称\\', \\'预定产量\\', \\'本日产量预计\\', \\'本日产量实际\\', \\'累计产量\\', \\'本日耗费工时\\', \\'累计耗费工时\\']\\n\\n # Step 4: Data Processing\\n # Ensure the data types are correct and remove any unnecessary rows or columns\\n final_df = 
final_df.dropna(subset=[\\'生产日期\\']) # Remove rows where the production date is missing\\n final_df = final_df.drop_duplicates() # Remove duplicate rows if any\\n final_df = final_df.reset_index(drop=True) # Reset the index again to ensure it starts from 0\\n\\n # Convert columns to appropriate data types\\n final_df[\\'预定产量\\'] = final_df[\\'预定产量\\'].astype(int)\\n final_df[\\'本日产量预计\\'] = final_df[\\'本日产量预计\\'].astype(int)\\n final_df[\\'本日产量实际\\'] = final_df[\\'本日产量实际\\'].astype(int)\\n final_df[\\'累计产量\\'] = final_df[\\'累计产量\\'].astype(int)\\n final_df[\\'本日耗费工时\\'] = final_df[\\'本日耗费工时\\'].astype(int)\\n final_df[\\'累计耗费工时\\'] = final_df[\\'累计耗费工时\\'].astype(int)\\n\\n # Display the transformed DataFrame\\n print(final_df)\\n if final_df.columns.tolist() == final_df.iloc[0].tolist():\\n final_df = final_df.iloc[1:]\\n\\n # reassign df with the formatted DataFrame\\n df = final_df\\nexcept Exception as e:\\n # Unable to apply formatting to the original DataFrame. proceeding with the unformatted DataFrame.\\n print(f\"Reformat failed with error {e}, use the original DataFrame.\")\\n# Remove leading and trailing whitespaces in column names\\ndf.columns = df.columns.str.strip()\\n\\n# Remove rows and columns that contain only empty values\\ndf = df.dropna(how=\\'all\\').dropna(axis=1, how=\\'all\\')\\n\\n# Get the basic information of the dataset\\ndf.info(memory_usage=False)'}, 'id': 'af99b549-09f1-4623-930a-3ffbfde40216', 'type': 'tool_call'}]),\n",
" ToolMessage(content=[{'type': 'text', 'text': \"```pycon\\nReformat failed with error cannot convert float NaN to integer, use the original DataFrame.\\n<class 'pandas.core.frame.DataFrame'>\\nRangeIndex: 26 entries, 0 to 25\\nData columns (total 9 columns):\\n # Column Non-Null Count Dtype \\n--- ------ -------------- ----- \\n 0 生产日期 22 non-null object\\n 1 制造编号 25 non-null object\\n 2 产品名称 20 non-null object\\n 3 预定产量 20 non-null object\\n 4 本日产量预计 21 non-null object\\n 5 本日产量实际 21 non-null object\\n 6 累计产量 25 non-null object\\n 7 本日耗费工时 20 non-null object\\n 8 累计耗费工时 20 non-null object\\ndtypes: object(9)\\n```\"}], name='python', id='58429576-461b-4a50-8c3d-5bbacca22cdb', tool_call_id='af99b549-09f1-4623-930a-3ffbfde40216', artifact=[]),\n",
" AIMessage(content='接下来我将用 `df.head(5)` 来查看数据集的前 5 行。\\n```python\\n# Show the first 5 rows to understand the structure\\ndf.head(5)\\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '接下来我将用 `df.head(5)` 来查看数据集的前 5 行。', 'action': {'tool': 'python', 'tool_input': '# Show the first 5 rows to understand the structure\\ndf.head(5)'}, 'model_type': None}, response_metadata={}, id='0c67c2b2-36d2-49cc-8fb8-7f6e5dfc8625', tool_calls=[{'name': 'python', 'args': {'query': '# Show the first 5 rows to understand the structure\\ndf.head(5)'}, 'id': 'a4ccd41a-a872-4ff0-aae5-678a96b9b54a', 'type': 'tool_call'}]),\n",
" ToolMessage(content=[{'type': 'text', 'text': '```pycon\\n 生产日期 制造编号 产品名称 预定产量 本日产量预计 本日产量实际 累计产量 本日耗费工时 累计耗费工时\\n0 2007-08-10 00:00:00 FK-001 猕猴桃果肉饮料 100000 40000 45000 83000 10 20\\n1 2007-08-11 00:00:00 FK-002 西瓜果肉饮料 100000 40000 44000 82000 9 18\\n2 2007-08-12 00:00:00 FK-003 草莓果肉饮料 100000 40000 45000 83000 9 18\\n3 2007-08-13 00:00:00 FK-004 蓝莓果肉饮料 100000 40000 45000 83000 9 18\\n4 2007-08-14 00:00:00 FK-005 水密桃果肉饮料 100000 40000 45000 83000 10 20\\n```'}], name='python', id='d828aa34-7c9e-4fee-8ae1-7b553530292b', tool_call_id='a4ccd41a-a872-4ff0-aae5-678a96b9b54a', artifact=[]),\n",
" AIMessage(content='我已经了解了数据集 产品生产统计表.xlsx 的基本信息。请问我可以帮您做些什么?', additional_kwargs={'parent_id': 'some-parent-id1'}, response_metadata={}, id='e836eba6-9597-4bf8-acfd-2a81871916a6')]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from datetime import date\n",
"from tablegpt.agent.file_reading import Stage\n",
"\n",
"# Reading and processing files.\n",
"response = await agent.ainvoke(\n",
" input={\n",
" \"entry_message\": attachment_msg,\n",
" \"processing_stage\": Stage.UPLOADED,\n",
" \"messages\": [attachment_msg],\n",
" \"parent_id\": \"some-parent-id1\",\n",
" \"date\": date.today(),\n",
" },\n",
" config={\n",
" # Using checkpointer requires binding thread_id at runtime.\n",
" \"configurable\": {\"thread_id\": \"some-thread-id\"},\n",
" },\n",
")\n",
"\n",
"response[\"messages\"]"
]
},
{
"cell_type": "markdown",
"id": "478453cd",
"metadata": {},
"source": [
"By formatting the content of the last `ToolMessage`, you can see the normalized data:\n",
"\n",
"<table style=\"border: 1px solid black; border-collapse: collapse;\">\n",
" <thead>\n",
" <tr>\n",
" <th>生产日期</th>\n",
" <th>制造编号</th>\n",
" <th>产品名称</th>\n",
" <th>预定产量</th>\n",
" <th>本日产量预计</th>\n",
" <th>本日产量实际</th>\n",
" <th>累计产量</th>\n",
" <th>本日耗费工时</th>\n",
" <th>累计耗费工时</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>2007/8/10</td>\n",
" <td>FK-001</td>\n",
" <td>猕猴桃果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>10</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2007/8/11</td>\n",
" <td>FK-002</td>\n",
" <td>西瓜果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>44000</td>\n",
" <td>82000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2007/8/12</td>\n",
" <td>FK-003</td>\n",
" <td>草莓果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2007/8/13</td>\n",
" <td>FK-004</td>\n",
" <td>蓝莓果肉饮料</td>\n",
" <td>100000</td>\n",
" <td>40000</td>\n",
" <td>45000</td>\n",
" <td>83000</td>\n",
" <td>9</td>\n",
" <td>18</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/howto/persist-messages.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "a04d67a0-660f-41ea-a873-bab8c5f6197c",
"metadata": {},
"source": [
"# Persist Messages\n",
"\n",
    "When creating TableGPT agents, you can persist their state so that the agent retains memory of previous interactions across multiple sessions. This is useful for applications that require long-term context tracking or complex, multi-turn conversations. For more information, see the [Persistence](https://langchain-ai.github.io/langgraph/concepts/persistence/) documentation.\n",
    "\n",
    "TableGPT Agent leverages [langgraph-checkpoint](https://github.com/langchain-ai/langgraph/tree/main/libs/checkpoint) to implement persistent message storage. Any `checkpointer` backend can be used, such as `Postgres`, `Redis`, or in-memory storage. To integrate a checkpointer with a `TableGPT Agent`, follow the example below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7218312b-3482-44c1-afcd-48dafb48b83a",
"metadata": {},
"outputs": [],
"source": [
"from datetime import date\n",
"\n",
"from langchain_openai import ChatOpenAI\n",
"from langgraph.checkpoint.memory import MemorySaver\n",
"from tablegpt.agent import create_tablegpt_graph\n",
"from pybox import AsyncLocalPyBoxManager\n",
"\n",
"llm = ChatOpenAI(openai_api_base=\"YOUR_VLLM_URL\", openai_api_key=\"whatever\", model_name=\"TableGPT2-7B\")\n",
"pybox_manager = AsyncLocalPyBoxManager()\n",
"checkpointer = MemorySaver()\n",
"\n",
"graph = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager,\n",
" checkpointer=checkpointer,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "96b02c4f-d98b-42ca-a257-2db82c749f56",
"metadata": {},
"source": [
"**Conducting a Conversation with `TableGPT`**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "841e7083-95aa-4e59-b66c-685acf2700f0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AIMessage(content=\"I understand that you're asking for an introduction to Jackie Chan. However, my primary role is to analyze datasets using Python. If you have a dataset related to Jackie Chan or any other topic, I'd be happy to help you analyze it. Could you please provide more details on what kind of data you have or what specific analysis you would like to perform?\", additional_kwargs={'parent_id': '1'}, response_metadata={}, id='cdf638ce-0e56-475b-a86b-0d8d7a0f6d05')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"resp = await graph.ainvoke(\n",
" input={\n",
" \"messages\": [(\"human\", \"Please introduce Jackie Chan\")],\n",
" \"parent_id\": \"1\",\n",
" \"date\": date.today(),\n",
" },\n",
" config={\"configurable\": {\"thread_id\": \"1\"}},\n",
")\n",
"resp[\"messages\"][-1]"
]
},
{
"cell_type": "markdown",
"id": "c85dd8fd-36bc-42ab-b8d2-ed0f917093af",
"metadata": {},
"source": [
"**Continuing the Conversation**\n",
"\n",
    "To extend the conversation while maintaining context, provide new input together with the same `config`:\n",
    "\n",
    "> Note: `config` carries the `thread_id` that the `checkpointer` uses to look up previously stored state, so in subsequent turns the model can resolve references (such as *he* in the question below) against the earlier conversation."
]
},
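  {
   "cell_type": "markdown",
   "id": "9a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d",
   "metadata": {},
   "source": [
    "Conceptually, a checkpointer is a key-value store of graph state keyed by `thread_id`. The sketch below is plain Python, not langgraph's actual implementation (`TinyMemorySaver` and the echo reply are made up for illustration); it shows why reusing a `thread_id` preserves history while a new one starts fresh:\n",
    "\n",
    "```python\n",
    "# Toy illustration of thread-scoped persistence (NOT langgraph's API;\n",
    "# TinyMemorySaver is a made-up stand-in for a real checkpointer).\n",
    "class TinyMemorySaver:\n",
    "    def __init__(self):\n",
    "        self._store = {}  # thread_id -> saved message list\n",
    "\n",
    "    def load(self, thread_id):\n",
    "        return list(self._store.get(thread_id, []))\n",
    "\n",
    "    def save(self, thread_id, messages):\n",
    "        self._store[thread_id] = list(messages)\n",
    "\n",
    "saver = TinyMemorySaver()\n",
    "\n",
    "def invoke(thread_id, user_message):\n",
    "    history = saver.load(thread_id)  # restore prior state for this thread\n",
    "    history.append(('human', user_message))\n",
    "    history.append(('ai', 'echo: ' + user_message))  # stand-in for the LLM\n",
    "    saver.save(thread_id, history)  # persist the updated state\n",
    "    return history\n",
    "\n",
    "invoke('1', 'Please introduce Jackie Chan')\n",
    "print(len(invoke('1', 'Name three movies.')))  # 4: same thread keeps history\n",
    "print(len(invoke('2', 'Who are you?')))  # 2: new thread starts fresh\n",
    "```"
   ]
  },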
{
"cell_type": "code",
"execution_count": 3,
"id": "fe437db5-f6c9-40cc-a886-896e7e60ecad",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"AIMessage(content=\"Certainly! Jackie Chan is a renowned actor, director, and martial artist, and he has starred in numerous films. Here are three popular movies in which he has participated:\\n\\n1. **Rush Hour (1998)** - In this action-comedy film, Jackie Chan plays the role of Inspector Lee, a Hong Kong detective who teams up with a Los Angeles detective, played by Chris Tucker, to solve a kidnapping case.\\n\\n2. **Drunken Master (1978)** - This is one of Jackie Chan's early films where he plays a young man who learns the art of drunken boxing to avenge his father's enemies.\\n\\n3. **The Karate Kid (2010)** - In this remake of the original 1984 film, Jackie Chan plays Mr. Han, a maintenance man who becomes the mentor to a young boy, Jaden Smith, teaching him martial arts and life lessons.\\n\\nIf you have any specific data or analysis related to these movies or Jackie Chan's filmography, feel free to provide more details, and I can help you with that!\", additional_kwargs={'parent_id': '1'}, response_metadata={}, id='4f62bec2-a4ec-43ad-97b2-cbade8c774b4')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"resp = await graph.ainvoke(\n",
" input={\n",
" \"messages\": [(\"human\", \"Please name three movies he participated in.\")],\n",
" \"parent_id\": \"1\",\n",
" \"date\": date.today(),\n",
" },\n",
" config={\"configurable\": {\"thread_id\": \"1\"}},\n",
")\n",
"resp[\"messages\"][-1]"
]
},
{
"cell_type": "markdown",
"id": "ac748856-f062-43a2-8381-6b80e7e2bf96",
"metadata": {},
"source": [
    "**Next, we demonstrate how to persist checkpoint state in `Postgres` using the [langgraph-checkpoint-postgres](https://github.com/langchain-ai/langgraph/tree/main/libs/checkpoint-postgres) library.**\n",
"\n",
"## Installing Required Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6532184f-6871-4fb1-adb3-e61b1335cb9a",
"metadata": {},
"outputs": [],
"source": [
"%pip install -U psycopg psycopg-pool psycopg_binary langgraph langgraph-checkpoint-postgres"
]
},
{
"cell_type": "markdown",
"id": "95b50676-b43b-429f-9c85-91b384832d60",
"metadata": {},
"source": [
"## Use Async Connection\n",
"\n",
    "**Note: `TableGPT Agent` is built on [LangGraph](https://langchain-ai.github.io/langgraph/), and many of its nodes use `async/await` syntax; synchronous operation is not yet supported, so use an asynchronous database connection.**\n",
"\n",
"Setting up an asynchronous connection to the database allows for non-blocking database operations. This means other parts of your application can continue running while waiting for database operations to complete. This is particularly beneficial in high-concurrency scenarios or when dealing with I/O-bound operations.\n",
"\n",
    "`DB_URI` is the PostgreSQL connection URI, which encodes the credentials, host, port, target database, and connection options such as `sslmode`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "dd0c3389-da09-41a5-ac70-d1ecd5bee211",
"metadata": {},
"outputs": [],
"source": [
"DB_URI = \"postgresql://postgres:postgres@127.0.0.1:5432/postgres?sslmode=disable\""
]
},
{
"cell_type": "markdown",
"id": "806b491f-f6e3-47d2-9043-23bf784635e1",
"metadata": {},
"source": [
"### Creating a Checkpointer with AsyncPostgresSaver\n",
"\n",
    "`AsyncPostgresSaver.from_conn_string` creates a connection directly from a connection string:\n",
"- Advantages: Simplicity, encapsulates connection details\n",
"- Best for: Quick setup or when connection details are provided as a string"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3e78e56d-a108-4594-bd1d-d56189cfa9f2",
"metadata": {},
"outputs": [],
"source": [
"from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver\n",
"from tablegpt.agent import create_tablegpt_graph\n",
"\n",
"config = {\"configurable\": {\"thread_id\": \"2\"}}\n",
"\n",
"async with AsyncPostgresSaver.from_conn_string(DB_URI) as checkpointer:\n",
" graph = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager,\n",
" checkpointer=checkpointer,\n",
" )\n",
" \n",
" res = await graph.ainvoke(\n",
" input={\n",
" \"messages\": [(\"human\", \"Who are you?\")],\n",
" \"parent_id\": \"2\",\n",
" \"date\": date.today()\n",
" },\n",
" config=config,\n",
" )\n",
" checkpoint_tuples = [c async for c in checkpointer.alist(config)]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "18de503a-69f8-4191-9ed9-85bdd25d6213",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[CheckpointTuple(config={'configurable': {'thread_id': '2', 'checkpoint_ns': '', 'checkpoint_id': '1efa6543-5d51-63c3-8001-9bd42cf9d6e6'}}, checkpoint={'v': 1, 'id': '1efa6543-5d51-63c3-8001-9bd42cf9d6e6', 'ts': '2024-11-19T08:56:48.239486+00:00', 'pending_sends': [], 'versions_seen': {'__input__': {}, '__start__': {'__start__': '00000000000000000000000000000001.0.6926057190269731'}, 'data_analyze_graph': {'branch:__start__:router:data_analyze_graph': '00000000000000000000000000000002.0.32407437506283565'}}, 'channel_versions': {'date': '00000000000000000000000000000003.0.1780977977687367', 'messages': '00000000000000000000000000000003.0.05509702188973753', '__start__': '00000000000000000000000000000002.3.0886787893869005e-05', 'parent_id': '00000000000000000000000000000003.0.43858547879187637', 'data_analyze_graph': '00000000000000000000000000000003.0.1082481333441786', 'branch:__start__:router:data_analyze_graph': '00000000000000000000000000000003.0.593567034515958'}, 'channel_values': {'date': datetime.date(2024, 11, 19), 'messages': [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930'), AIMessage(content=\"I am TableGPT2, an expert Python data analyst developed by Zhejiang University. My primary role is to assist you in analyzing datasets by writing Python code. I can help you with tasks such as data cleaning, transformation, visualization, and more. 
If you have a dataset or a specific analysis in mind, feel free to share it with me, and I'll do my best to help you!\", additional_kwargs={'parent_id': '2'}, response_metadata={}, id='560a88be-4fd0-4cc1-aa55-0747862fa222')], 'parent_id': '2', 'data_analyze_graph': 'data_analyze_graph'}}, metadata={'step': 1, 'source': 'loop', 'writes': {'data_analyze_graph': {'date': datetime.date(2024, 11, 19), 'messages': [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930'), AIMessage(content=\"I am TableGPT2, an expert Python data analyst developed by Zhejiang University. My primary role is to assist you in analyzing datasets by writing Python code. I can help you with tasks such as data cleaning, transformation, visualization, and more. If you have a dataset or a specific analysis in mind, feel free to share it with me, and I'll do my best to help you!\", additional_kwargs={'parent_id': '2'}, response_metadata={}, id='560a88be-4fd0-4cc1-aa55-0747862fa222')], 'parent_id': '2'}}, 'parents': {}, 'thread_id': '2'}, parent_config={'configurable': {'thread_id': '2', 'checkpoint_ns': '', 'checkpoint_id': '1efa6543-4a1a-6852-8000-b975c65fb2ff'}}, pending_writes=[]),\n",
" CheckpointTuple(config={'configurable': {'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'checkpoint_id': '1efa6543-5d27-6b37-8002-4e274906f4a5'}}, checkpoint={'v': 1, 'id': '1efa6543-5d27-6b37-8002-4e274906f4a5', 'ts': '2024-11-19T08:56:48.222457+00:00', 'pending_sends': [], 'versions_seen': {'agent': {'join:input_guard+retrieve_columns:agent': '00000000000000000000000000000003.0.8898909183470118'}, '__input__': {}, '__start__': {'__start__': '00000000000000000000000000000001.0.6163867462467301'}, 'input_guard': {'start:input_guard': '00000000000000000000000000000002.0.6848611387807798'}, 'retrieve_columns': {'start:retrieve_columns': '00000000000000000000000000000002.0.030416452982199194'}}, 'channel_versions': {'date': '00000000000000000000000000000002.0.2490273362085793', 'agent': '00000000000000000000000000000005.0.024232530645486583', 'messages': '00000000000000000000000000000005.0.14282746367420773', '__start__': '00000000000000000000000000000002.0.18620036399153372', 'parent_id': '00000000000000000000000000000002.0.5201095646733788', 'input_guard': '00000000000000000000000000000005.0.859473129275239', 'retrieve_columns': '00000000000000000000000000000005.0.7752176585300508', 'start:input_guard': '00000000000000000000000000000003.0.6183120254220215', 'start:retrieve_columns': '00000000000000000000000000000003.0.08187600687354024', 'join:input_guard+retrieve_columns:agent': '00000000000000000000000000000004.0.5107824581933167'}, 'channel_values': {'date': datetime.date(2024, 11, 19), 'agent': 'agent', 'messages': [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930'), AIMessage(content=\"I am TableGPT2, an expert Python data analyst developed by Zhejiang University. My primary role is to assist you in analyzing datasets by writing Python code. I can help you with tasks such as data cleaning, transformation, visualization, and more. 
If you have a dataset or a specific analysis in mind, feel free to share it with me, and I'll do my best to help you!\", additional_kwargs={'parent_id': '2'}, response_metadata={}, id='560a88be-4fd0-4cc1-aa55-0747862fa222')], 'parent_id': '2', 'join:input_guard+retrieve_columns:agent': set()}}, metadata={'step': 2, 'source': 'loop', 'writes': {'agent': {'messages': [AIMessage(content=\"I am TableGPT2, an expert Python data analyst developed by Zhejiang University. My primary role is to assist you in analyzing datasets by writing Python code. I can help you with tasks such as data cleaning, transformation, visualization, and more. If you have a dataset or a specific analysis in mind, feel free to share it with me, and I'll do my best to help you!\", additional_kwargs={'parent_id': '2'}, response_metadata={}, id='560a88be-4fd0-4cc1-aa55-0747862fa222')]}}, 'parents': {'': '1efa6543-4a1a-6852-8000-b975c65fb2ff'}, 'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'langgraph_node': 'data_analyze_graph', 'langgraph_path': ['__pregel_pull', 'data_analyze_graph'], 'langgraph_step': 1, 'langgraph_triggers': ['branch:__start__:router:data_analyze_graph'], 'langgraph_checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd'}, parent_config={'configurable': {'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'checkpoint_id': '1efa6543-4a53-6aa4-8001-7474db32528e'}}, pending_writes=[]),\n",
" CheckpointTuple(config={'configurable': {'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'checkpoint_id': '1efa6543-4a53-6aa4-8001-7474db32528e'}}, checkpoint={'v': 1, 'id': '1efa6543-4a53-6aa4-8001-7474db32528e', 'ts': '2024-11-19T08:56:46.248182+00:00', 'pending_sends': [], 'versions_seen': {'__input__': {}, '__start__': {'__start__': '00000000000000000000000000000001.0.6163867462467301'}, 'input_guard': {'start:input_guard': '00000000000000000000000000000002.0.6848611387807798'}, 'retrieve_columns': {'start:retrieve_columns': '00000000000000000000000000000002.0.030416452982199194'}}, 'channel_versions': {'date': '00000000000000000000000000000002.0.2490273362085793', 'messages': '00000000000000000000000000000003.0.8478737633204881', '__start__': '00000000000000000000000000000002.0.18620036399153372', 'parent_id': '00000000000000000000000000000002.0.5201095646733788', 'input_guard': '00000000000000000000000000000003.0.06039244147872136', 'retrieve_columns': '00000000000000000000000000000003.0.8552403538042089', 'start:input_guard': '00000000000000000000000000000003.0.6183120254220215', 'start:retrieve_columns': '00000000000000000000000000000003.0.08187600687354024', 'join:input_guard+retrieve_columns:agent': '00000000000000000000000000000003.0.8898909183470118'}, 'channel_values': {'date': datetime.date(2024, 11, 19), 'messages': [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930')], 'parent_id': '2', 'input_guard': 'input_guard', 'retrieve_columns': 'retrieve_columns', 'join:input_guard+retrieve_columns:agent': {'input_guard', 'retrieve_columns'}}}, metadata={'step': 1, 'source': 'loop', 'writes': {'input_guard': {'messages': []}, 'retrieve_columns': {'messages': []}}, 'parents': {'': '1efa6543-4a1a-6852-8000-b975c65fb2ff'}, 'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'langgraph_node': 
'data_analyze_graph', 'langgraph_path': ['__pregel_pull', 'data_analyze_graph'], 'langgraph_step': 1, 'langgraph_triggers': ['branch:__start__:router:data_analyze_graph'], 'langgraph_checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd'}, parent_config={'configurable': {'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'checkpoint_id': '1efa6543-4a4c-6639-8000-a5baf4d38bd2'}}, pending_writes=[('36d7d700-5338-feda-05f0-57d74fddbc0b', 'agent', 'agent'), ('36d7d700-5338-feda-05f0-57d74fddbc0b', 'messages', [AIMessage(content=\"I am TableGPT2, an expert Python data analyst developed by Zhejiang University. My primary role is to assist you in analyzing datasets by writing Python code. I can help you with tasks such as data cleaning, transformation, visualization, and more. If you have a dataset or a specific analysis in mind, feel free to share it with me, and I'll do my best to help you!\", additional_kwargs={'parent_id': '2'}, response_metadata={}, id='560a88be-4fd0-4cc1-aa55-0747862fa222')])]),\n",
" CheckpointTuple(config={'configurable': {'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'checkpoint_id': '1efa6543-4a4c-6639-8000-a5baf4d38bd2'}}, checkpoint={'v': 1, 'id': '1efa6543-4a4c-6639-8000-a5baf4d38bd2', 'ts': '2024-11-19T08:56:46.245208+00:00', 'pending_sends': [], 'versions_seen': {'__input__': {}, '__start__': {'__start__': '00000000000000000000000000000001.0.6163867462467301'}}, 'channel_versions': {'date': '00000000000000000000000000000002.0.2490273362085793', 'messages': '00000000000000000000000000000002.0.19507603965774079', '__start__': '00000000000000000000000000000002.0.18620036399153372', 'parent_id': '00000000000000000000000000000002.0.5201095646733788', 'start:input_guard': '00000000000000000000000000000002.0.6848611387807798', 'start:retrieve_columns': '00000000000000000000000000000002.0.030416452982199194'}, 'channel_values': {'date': datetime.date(2024, 11, 19), 'messages': [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930')], 'parent_id': '2', 'start:input_guard': '__start__', 'start:retrieve_columns': '__start__'}}, metadata={'step': 0, 'source': 'loop', 'writes': None, 'parents': {'': '1efa6543-4a1a-6852-8000-b975c65fb2ff'}, 'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'langgraph_node': 'data_analyze_graph', 'langgraph_path': ['__pregel_pull', 'data_analyze_graph'], 'langgraph_step': 1, 'langgraph_triggers': ['branch:__start__:router:data_analyze_graph'], 'langgraph_checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd'}, parent_config={'configurable': {'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'checkpoint_id': '1efa6543-4a49-6ec0-bfff-7a6bdf830d6b'}}, pending_writes=[('637b8d5c-1e9a-d15c-7560-eba440c88860', 'input_guard', 'input_guard'), ('637b8d5c-1e9a-d15c-7560-eba440c88860', 'messages', []), 
('637b8d5c-1e9a-d15c-7560-eba440c88860', 'join:input_guard+retrieve_columns:agent', 'input_guard'), ('fcc182eb-567e-cef1-c5be-16b527e21434', 'retrieve_columns', 'retrieve_columns'), ('fcc182eb-567e-cef1-c5be-16b527e21434', 'messages', []), ('fcc182eb-567e-cef1-c5be-16b527e21434', 'join:input_guard+retrieve_columns:agent', 'retrieve_columns')]),\n",
" CheckpointTuple(config={'configurable': {'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'checkpoint_id': '1efa6543-4a49-6ec0-bfff-7a6bdf830d6b'}}, checkpoint={'v': 1, 'id': '1efa6543-4a49-6ec0-bfff-7a6bdf830d6b', 'ts': '2024-11-19T08:56:46.244206+00:00', 'pending_sends': [], 'versions_seen': {'__input__': {}}, 'channel_versions': {'__start__': '00000000000000000000000000000001.0.6163867462467301'}, 'channel_values': {'__start__': {'messages': [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930')], 'parent_id': '2', 'date': datetime.date(2024, 11, 19)}}}, metadata={'step': -1, 'source': 'input', 'writes': {'__start__': {'date': datetime.date(2024, 11, 19), 'messages': [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930')], 'parent_id': '2'}}, 'parents': {'': '1efa6543-4a1a-6852-8000-b975c65fb2ff'}, 'thread_id': '2', 'checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'langgraph_node': 'data_analyze_graph', 'langgraph_path': ['__pregel_pull', 'data_analyze_graph'], 'langgraph_step': 1, 'langgraph_triggers': ['branch:__start__:router:data_analyze_graph'], 'langgraph_checkpoint_ns': 'data_analyze_graph:3d6fb60f-4da5-a1de-bcf2-fa5632547abd'}, parent_config=None, pending_writes=[('39da96de-984e-3d02-e2ef-9bd5146d7336', 'messages', [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930')]), ('39da96de-984e-3d02-e2ef-9bd5146d7336', 'date', datetime.date(2024, 11, 19)), ('39da96de-984e-3d02-e2ef-9bd5146d7336', 'parent_id', '2'), ('39da96de-984e-3d02-e2ef-9bd5146d7336', 'start:input_guard', '__start__'), ('39da96de-984e-3d02-e2ef-9bd5146d7336', 'start:retrieve_columns', '__start__')]),\n",
" CheckpointTuple(config={'configurable': {'thread_id': '2', 'checkpoint_ns': '', 'checkpoint_id': '1efa6543-4a1a-6852-8000-b975c65fb2ff'}}, checkpoint={'v': 1, 'id': '1efa6543-4a1a-6852-8000-b975c65fb2ff', 'ts': '2024-11-19T08:56:46.224784+00:00', 'pending_sends': [], 'versions_seen': {'__input__': {}, '__start__': {'__start__': '00000000000000000000000000000001.0.6926057190269731'}}, 'channel_versions': {'date': '00000000000000000000000000000002.0.602279509708772', 'messages': '00000000000000000000000000000002.0.47212683047327253', '__start__': '00000000000000000000000000000002.3.0886787893869005e-05', 'parent_id': '00000000000000000000000000000002.0.43326965052986344', 'branch:__start__:router:data_analyze_graph': '00000000000000000000000000000002.0.32407437506283565'}, 'channel_values': {'date': datetime.date(2024, 11, 19), 'messages': [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930')], 'parent_id': '2', 'branch:__start__:router:data_analyze_graph': '__start__'}}, metadata={'step': 0, 'source': 'loop', 'writes': None, 'parents': {}, 'thread_id': '2'}, parent_config={'configurable': {'thread_id': '2', 'checkpoint_ns': '', 'checkpoint_id': '1efa6543-4a12-6bbb-bfff-ac4aee846bfe'}}, pending_writes=[('3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'data_analyze_graph', 'data_analyze_graph'), ('3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'messages', [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930'), AIMessage(content=\"I am TableGPT2, an expert Python data analyst developed by Zhejiang University. My primary role is to assist you in analyzing datasets by writing Python code. I can help you with tasks such as data cleaning, transformation, visualization, and more. 
If you have a dataset or a specific analysis in mind, feel free to share it with me, and I'll do my best to help you!\", additional_kwargs={'parent_id': '2'}, response_metadata={}, id='560a88be-4fd0-4cc1-aa55-0747862fa222')]), ('3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'parent_id', '2'), ('3d6fb60f-4da5-a1de-bcf2-fa5632547abd', 'date', datetime.date(2024, 11, 19))]),\n",
" CheckpointTuple(config={'configurable': {'thread_id': '2', 'checkpoint_ns': '', 'checkpoint_id': '1efa6543-4a12-6bbb-bfff-ac4aee846bfe'}}, checkpoint={'v': 1, 'id': '1efa6543-4a12-6bbb-bfff-ac4aee846bfe', 'ts': '2024-11-19T08:56:46.221598+00:00', 'pending_sends': [], 'versions_seen': {'__input__': {}}, 'channel_versions': {'__start__': '00000000000000000000000000000001.0.6926057190269731'}, 'channel_values': {'__start__': {'messages': [['human', 'Who are you?']], 'parent_id': '2', 'date': datetime.date(2024, 11, 19)}}}, metadata={'step': -1, 'source': 'input', 'writes': {'__start__': {'date': datetime.date(2024, 11, 19), 'messages': [['human', 'Who are you?']], 'parent_id': '2'}}, 'parents': {}, 'thread_id': '2'}, parent_config=None, pending_writes=[('8e27b3ac-0a9d-697e-0667-6d3ccc170a50', 'messages', [['human', 'Who are you?']]), ('8e27b3ac-0a9d-697e-0667-6d3ccc170a50', 'parent_id', '2'), ('8e27b3ac-0a9d-697e-0667-6d3ccc170a50', 'date', datetime.date(2024, 11, 19)), ('8e27b3ac-0a9d-697e-0667-6d3ccc170a50', 'branch:__start__:router:data_analyze_graph', '__start__')])]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"checkpoint_tuples"
]
},
{
"cell_type": "markdown",
"id": "8c3d22bf-ebdb-48fe-a452-8697d786404e",
"metadata": {},
"source": [
"## Get Persisted Messages with Config\n",
"\n",
    "You can retrieve the persisted graph state through the `checkpointer` by passing the same `config` (in particular, the same `thread_id`):"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c1b166b5-cbfb-4cbe-b193-605f64315a38",
"metadata": {},
"outputs": [],
"source": [
"async with AsyncPostgresSaver.from_conn_string(DB_URI) as checkpointer:\n",
" graph = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager,\n",
" checkpointer=checkpointer,\n",
" )\n",
" \n",
" graph_state = await graph.aget_state(config)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "09963483-7f79-4050-a27d-3862b0b940e6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"StateSnapshot(values={'messages': [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930'), AIMessage(content=\"I am TableGPT2, an expert Python data analyst developed by Zhejiang University. My primary role is to assist you in analyzing datasets by writing Python code. I can help you with tasks such as data cleaning, transformation, visualization, and more. If you have a dataset or a specific analysis in mind, feel free to share it with me, and I'll do my best to help you!\", additional_kwargs={'parent_id': '2'}, response_metadata={}, id='560a88be-4fd0-4cc1-aa55-0747862fa222')], 'parent_id': '2', 'date': datetime.date(2024, 11, 19)}, next=(), config={'configurable': {'thread_id': '2', 'checkpoint_ns': '', 'checkpoint_id': '1efa6543-5d51-63c3-8001-9bd42cf9d6e6'}}, metadata={'step': 1, 'source': 'loop', 'writes': {'data_analyze_graph': {'date': datetime.date(2024, 11, 19), 'messages': [HumanMessage(content='Who are you?', additional_kwargs={}, response_metadata={}, id='32b67f59-d13e-43eb-9239-ec711811e930'), AIMessage(content=\"I am TableGPT2, an expert Python data analyst developed by Zhejiang University. My primary role is to assist you in analyzing datasets by writing Python code. I can help you with tasks such as data cleaning, transformation, visualization, and more. If you have a dataset or a specific analysis in mind, feel free to share it with me, and I'll do my best to help you!\", additional_kwargs={'parent_id': '2'}, response_metadata={}, id='560a88be-4fd0-4cc1-aa55-0747862fa222')], 'parent_id': '2'}}, 'parents': {}, 'thread_id': '2'}, created_at='2024-11-19T08:56:48.239486+00:00', parent_config={'configurable': {'thread_id': '2', 'checkpoint_ns': '', 'checkpoint_id': '1efa6543-4a1a-6852-8000-b975c65fb2ff'}}, tasks=())"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"graph_state"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/howto/retrieval.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "25c9a82f-ea07-434c-a031-e844bc28279d",
"metadata": {},
"source": [
"# Enhance TableGPT Agent with RAG\n",
"\n",
"While the [File Reading Workflow](../../explanation/file-reading) is adequate for most scenarios, it may not always provide the information necessary for the LLM to generate accurate code. Consider the following examples:\n",
"\n",
"- A categorical column in the dataset contains 'foo', 'bar', and 'baz', but 'baz' only appears after approximately 100 rows. In this case, the LLM may not encounter the 'baz' value through `df.head()`.\n",
"- The user's query may not align with the dataset's content for several reasons:\n",
" - The dataset lacks proper governance. For instance, a cell value might be misspelled from 'foo' to 'fou'.\n",
" - There could be a typo in the user's query. For example, if the user queries, \"Show me the data for 'fou',\" but the dataset contains 'foo' instead.\n",
"\n",
"In such situations, the Dataset Retriever plugin can be utilized to fetch additional information about the dataset from external sources, thereby providing the LLM with more context and improving its ability to generate accurate responses."
]
},
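{
"cell_type": "markdown",
"id": "fuzzy-match-illustration",
"metadata": {},
"source": [
"The typo case can be illustrated with a minimal fuzzy-matching sketch. This is only an illustration of the problem: the Dataset Retriever resolves such mismatches via embedding-based retrieval, not via `difflib`.\n",
"\n",
"```python\n",
"import difflib\n",
"\n",
"# The dataset contains 'foo', but the user queried 'fou'.\n",
"column_values = [\"foo\", \"bar\", \"baz\"]\n",
"difflib.get_close_matches(\"fou\", column_values)\n",
"# -> ['foo']  # the intended value can still be recovered\n",
"```"
]
},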
{
"cell_type": "markdown",
"id": "b4581e28-08df-4674-9a0f-18d79e2b3c1d",
"metadata": {},
"source": [
"## Quick Start\n",
"\n",
"To help you quickly integrate and utilize `RAG` with the `TableGPT Agent`, follow the steps outlined in this section. These instructions will guide you through the process of loading datasets, enhancing retrieval with document compression, and integrating with a powerful LLM-based agent. By the end of this quick start, you'll be able to issue complex queries and receive enriched, context-aware responses.\n",
"\n",
"### Step 1: Install Required Dependencies\n",
"To get started with using RAG in the TableGPT Agent, you need to install the necessary dependencies. The primary package required is langchain, which facilitates building retrieval-augmented workflows.\n",
"\n",
"Run the following command to install it:\n",
"\n",
"```sh\n",
"pip install langchain\n",
"```\n",
"\n",
"### Step 2: Load and Prepare Data with CSVLoader\n",
"\n",
"The `TableGPT Agent` provides a convenient `CSVLoader` for converting `CSV` or `Excel` files into a format that can be processed by the RAG pipeline. This method allows seamless integration of your data for further retrieval and embedding.\n",
"\n",
"**Example Code:**"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f0212e29-0de9-487b-a555-8eaf65c519ea",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.vectorstores import InMemoryVectorStore\n",
"from tablegpt.retriever import CSVLoader\n",
"\n",
"loader = CSVLoader(\"产品销量表.csv\", autodetect_encoding=True)\n",
"\n",
"documents = []\n",
"async for item in loader.alazy_load():\n",
" documents.append(item)\n",
"\n",
"# Initialize with an embedding model\n",
"vector_store = InMemoryVectorStore(embedding=SomeEmbeddingModel())\n",
"\n",
"await vector_store.aadd_documents(documents=documents)\n",
"dataset_base_retriever = vector_store.as_retriever()"
]
},
{
"cell_type": "markdown",
"id": "d6200c96",
"metadata": {},
"source": [
"### Step 3: Build a Context-Aware Retriever with Document Compression\n",
"\n",
"To enhance the retrieval process, `langchain` provides powerful retriever utilities that can be combined with custom compressors. In this step, we utilize the `ColumnDocCompressor` from tablegpt to focus on relevant columns and build an efficient `dataset_retriever`.\n",
"\n",
"**Example Code:**"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5d253ade-5c85-404e-9b77-88c6e3e5cb9c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.retrievers import ContextualCompressionRetriever\n",
"from langchain.retrievers.document_compressors import DocumentCompressorPipeline\n",
"from tablegpt.retriever import ColumnDocCompressor\n",
"\n",
"dataset_compressor = DocumentCompressorPipeline(\n",
" transformers=[ColumnDocCompressor()]\n",
")\n",
"\n",
"dataset_retriever = ContextualCompressionRetriever(\n",
" base_compressor=dataset_compressor,\n",
" base_retriever=dataset_base_retriever,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "169dc7b6-e732-43bf-a9fb-a802719cc0f4",
"metadata": {},
"source": [
"### Step 4: Integrate with TableGPT Agent\n",
"\n",
"In this step, we integrate the `dataset_retriever` with the `TableGPT Agent` using an `LLM` and a local execution environment. This setup ensures that the agent can handle user queries effectively by leveraging both the LLM and retrieved dataset context.\n",
"\n",
"**Example Code:**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "188480c4-fbb2-4a81-b18e-1fc587db4b8e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from pybox import AsyncLocalPyBoxManager\n",
"from tablegpt.agent import create_tablegpt_graph\n",
"from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR\n",
"\n",
"llm = ChatOpenAI(openai_api_base=\"YOUR_VLLM_URL\", openai_api_key=\"whatever\", model_name=\"TableGPT2-7B\")\n",
"pybox_manager = AsyncLocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)\n",
"\n",
"agent = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager,\n",
" dataset_retriever=dataset_retriever,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "d0d8db97",
"metadata": {},
"source": [
"With this setup, your `TableGPT Agent` is ready to process user queries, retrieve relevant data, and generate contextually accurate responses. The integration of RAG techniques ensures that the agent leverages external data effectively, providing enhanced insights and performance.\n",
"\n",
"\n",
"### Step 5: Analyze Data with the TableGPT Agent\n",
"\n",
"Finally, you can use the `TableGPT Agent` to perform analysis by sending a query. The response can help determine whether retrieval-augmented generation (RAG) has provided enhanced results. Observing the returned information allows you to assess the accuracy and completeness of the generated response.\n",
"\n",
"**Example Code:**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "de23ac53-5ec6-4684-932d-dc75a2d67255",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[HumanMessage(content='桃酥的销售量是多少?', additional_kwargs={}, response_metadata={}, id='b567e1c3-8943-453c-9ebe-fa8d34cfc388'),\n",
" SystemMessage(content='\\nHere are some extra column information that might help you understand the dataset:\\n- 产品销量表.csv:\\n - {\"column\": 名称, \"dtype\": \"string\", \"values\": [\"花生桃酥\", ...]}\\n - {\"column\": 销售额 , \"dtype\": \"string\", \"values\": [\" ¥931,000.00 \", \" ¥225,060.00 \", \" ¥58,500.00 \", ...]}\\n', additional_kwargs={'parent_id': 'some-parent-id'}, response_metadata={}, id='07fdddf4-05e8-4022-9a78-98ee3744aab2'),\n",
" AIMessage(content=\"为了回答这个问题,我们首先需要读取文件`产品销量表.csv`,然后找到列名包含“名称”和“销售额”的列,特别是需要找到“花生桃酥”的销售量。让我们先读取数据并查看前几行。\\n```python\\nimport pandas as pd\\n\\n# 读取数据\\ndf = read_df(uri='产品销量表.csv')\\n\\n# 显示数据框的前几行\\ndf.head()\\n```\", additional_kwargs={'thought': '为了回答这个问题,我们首先需要读取文件`产品销量表.csv`,然后找到列名包含“名称”和“销售额”的列,特别是需要找到“花生桃酥”的销售量。让我们先读取数据并查看前几行。', 'action': {'tool': 'python', 'tool_input': \"import pandas as pd\\n\\n# 读取数据\\ndf = read_df(uri='产品销量表.csv')\\n\\n# 显示数据框的前几行\\ndf.head()\"}, 'parent_id': 'some-parent-id'}, response_metadata={}, id='27da6f10-2201-4349-bc23-9f7b42f34742', tool_calls=[{'name': 'python', 'args': {'query': \"import pandas as pd\\n\\n# 读取数据\\ndf = read_df(uri='产品销量表.csv')\\n\\n# 显示数据框的前几行\\ndf.head()\"}, 'id': 'be9a29de-7f5d-4010-a85b-37286ab99e86', 'type': 'tool_call'}]),\n",
" ToolMessage(content=[{'type': 'text', 'text': '```pycon\\n 编号 名称 单位 单价(元) 销售量 销售额 \\n0 mb2033 法式面包 包 ¥7.40 305080 ¥2,257,592.00 \\n1 mb2034 奶昔蛋糕 包 ¥5.80 93200 ¥540,560.00 \\n2 mb2035 奶油夹心饼干 包 ¥3.10 215300 ¥667,430.00 \\n3 mb2036 葱油饼 包 ¥2.20 102300 ¥225,060.00 \\n4 mb2037 花生桃酥 包 ¥3.80 130000 ¥494,000.00 \\n```'}], name='python', id='a48d70fd-2e01-48ee-a9a5-25dc0eec04d6', tool_call_id='be9a29de-7f5d-4010-a85b-37286ab99e86', artifact=[]),\n",
" AIMessage(content='从数据中我们可以看到,“花生桃酥”的销售量为130,000包。', additional_kwargs={'parent_id': 'some-parent-id'}, response_metadata={}, id='5c5b703d-2eea-444b-a627-0828dca06df2')]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from datetime import date\n",
"from langchain_core.messages import HumanMessage\n",
"\n",
"message = HumanMessage(content=\"桃酥的销售量是多少?\")\n",
"\n",
"_input = {\n",
" \"messages\": [message],\n",
" \"parent_id\": \"some-parent-id\",\n",
" \"date\": date.today(),\n",
"}\n",
"\n",
"response = await agent.ainvoke(_input)\n",
"\n",
"response[\"messages\"]"
]
},
{
"cell_type": "markdown",
"id": "167c23b6",
"metadata": {},
"source": [
"**Output:**\n",
"\n",
"> Here are some extra column information that might help you understand the dataset:\n",
"> - 产品销量表.csv:\n",
"> - {\"column\": 名称, \"dtype\": \"string\", \"values\": [\"花生桃酥\", ...]}\n",
"> - {\"column\": 销售额 , \"dtype\": \"string\", \"values\": [\" ¥931,000.00 \", \" ¥225,060.00 \", \" ¥58,500.00 \", ...]}\n",
"\n",
"The output confirms that the RAG approach effectively enriches the agent's responses by incorporating dataset context. This improvement allows the agent to provide detailed, actionable insights rather than generic answers, thereby enhancing its utility for complex queries."
]
}
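,
{
"cell_type": "markdown",
"id": "currency-parsing-note",
"metadata": {},
"source": [
"One practical follow-up: the retrieved column info shows that 销售额 is stored as currency strings such as `\" ¥931,000.00 \"`, so any numeric aggregation would first require parsing. A minimal sketch of that cleaning step (with pandas you would typically use `Series.str.replace` plus `astype(float)` instead):\n",
"\n",
"```python\n",
"def parse_currency(value: str) -> float:\n",
"    # Strip whitespace, the currency symbol, and thousands separators\n",
"    return float(value.strip().lstrip(\"¥\").replace(\",\", \"\"))\n",
"\n",
"parse_currency(\" ¥931,000.00 \")\n",
"# -> 931000.0\n",
"```"
]
}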
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/index.md
================================================
# Home
[](https://pypi.org/project/tablegpt-agent)
[](https://pypi.org/project/tablegpt-agent)
## Introduction
tablegpt-agent is a pre-built agent for [TableGPT2 (huggingface)](https://huggingface.co/tablegpt/TableGPT2-7B), a series of LLMs for table-based question answering. This agent is built on top of the [LangGraph](https://www.langchain.com/langgraph) library and provides a user-friendly interface for interacting with TableGPT2.
## Table Of Contents
<!-- mkdocs requires 4 spaces to indent the list -->
- Tutorials
- [Quickstart](tutorials/quick-start.ipynb)
- [Chat on Tabular Data](tutorials/chat-on-tabular-data.ipynb)
- [Continue Analysis on Generated Charts](tutorials/continue-analysis-on-generated-charts.ipynb)
- How-To Guides
- [Enhance TableGPT Agent with RAG](howto/retrieval.ipynb)
- [Persist Messages](howto/persist-messages.ipynb)
- [Incluster Code Execution](howto/incluster-code-execution.md)
- [Normalize Datasets](howto/normalize-datasets.ipynb)
- Explanation
- [Agent Workflow](explanation/agent-workflow.md)
- [File Reading](explanation/file-reading.ipynb)
- [Reference](reference.md)
## Contributing
Thank you for your interest in TableGPT Agent. For more information on contributing, please see [the contributing guide](https://github.com/tablegpt/tablegpt-agent/blob/main/CONTRIBUTING.md).
## Acknowledgements
We extend our sincere gratitude to all contributors and collaborators who played a pivotal role in the development of tablegpt-agent. Special thanks to our team members and the open-source community, whose insights and feedback were invaluable throughout the project.
Thank you to our early users for their suggestions and engagement, which have greatly helped in refining and enhancing this tool.
================================================
FILE: docs/reference.md
================================================
# API Reference
::: tablegpt.agent.create_tablegpt_graph
================================================
FILE: docs/stylesheets/extra.css
================================================
/* hide jupyter notebooks input/output numbers */
.jp-InputPrompt {
display: none !important;
}
.jp-OutputPrompt {
display: none !important;
}
================================================
FILE: docs/tutorials/chat-on-tabular-data.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "1944f5bf",
"metadata": {},
"source": [
"# Chat on Tabular Data\n",
"\n",
"TableGPT Agent excels at analyzing and processing tabular data. To perform data analysis, you need to first let the agent \"see\" the dataset. This is done by a specific \"file-reading\" workflow. In short, you begin by \"uploading\" the dataset and let the agent read it. Once the data is read, you can ask the agent questions about it.\n",
"\n",
"> To learn more about the file-reading workflow, see [File Reading](../../explanation/file-reading).\n",
"\n",
"For data analysis tasks, we introduce two important parameters when creating the agent: `checkpointer` and `session_id`.\n",
"\n",
"- The `checkpointer` should be an instance of `langgraph.checkpoint.base.BaseCheckpointSaver`, which acts as a versioned \"memory\" for the agent. (See [langgraph's persistence concept](https://langchain-ai.github.io/langgraph/concepts/persistence) for more details.)\n",
"- The `session_id` is a unique identifier for the current session. It ties the agent's execution to a specific kernel, ensuring that the agent's results are retained across multiple invocations.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ec321eaa",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from langgraph.checkpoint.memory import MemorySaver\n",
"from pybox import AsyncLocalPyBoxManager\n",
"from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR\n",
"from tablegpt.agent import create_tablegpt_graph\n",
"\n",
"llm = ChatOpenAI(openai_api_base=\"YOUR_VLLM_URL\", openai_api_key=\"whatever\", model_name=\"TableGPT2-7B\")\n",
"pybox_manager = AsyncLocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)\n",
"checkpointer = MemorySaver()\n",
"\n",
"agent = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager,\n",
" checkpointer=checkpointer,\n",
" session_id=\"some-session-id\", # This is required when using file-reading\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b2554859",
"metadata": {},
"source": [
"Add the file for processing in the additional_kwargs of HumanMessage. Here's an example using the [Titanic dataset](https://github.com/tablegpt/tablegpt-agent/blob/main/examples/datasets/titanic.csv).\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "55a52fb7",
"metadata": {},
"outputs": [],
"source": [
"from typing import TypedDict\n",
"from langchain_core.messages import HumanMessage\n",
"\n",
"class Attachment(TypedDict):\n",
" \"\"\"Contains at least one dictionary with the key filename.\"\"\"\n",
" filename: str\n",
"\n",
"attachment_msg = HumanMessage(\n",
" content=\"\",\n",
" # Please make sure your iPython kernel can access your filename.\n",
" additional_kwargs={\"attachments\": [Attachment(filename=\"titanic.csv\")]},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "0630560d",
"metadata": {},
"source": [
"Invoke the agent as shown in the quick start:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ce20b1b0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[HumanMessage(content='', additional_kwargs={'attachments': [{'filename': 'titanic.csv'}]}, response_metadata={}, id='ab0a7157-ad7d-4de8-9b24-1bee78ad7c55'),\n",
" AIMessage(content=\"我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查看 NaN 情况和数据类型。\\n```python\\n# Load the data into a DataFrame\\ndf = read_df('titanic.csv')\\n\\n# Remove leading and trailing whitespaces in column names\\ndf.columns = df.columns.str.strip()\\n\\n# Remove rows and columns that contain only empty values\\ndf = df.dropna(how='all').dropna(axis=1, how='all')\\n\\n# Get the basic information of the dataset\\ndf.info(memory_usage=False)\\n```\", additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '我已经收到您的数据文件,我需要查看文件内容以对数据集有一个初步的了解。首先我会读取数据到 `df` 变量中,并通过 `df.info` 查看 NaN 情况和数据类型。', 'action': {'tool': 'python', 'tool_input': \"# Load the data into a DataFrame\\ndf = read_df('titanic.csv')\\n\\n# Remove leading and trailing whitespaces in column names\\ndf.columns = df.columns.str.strip()\\n\\n# Remove rows and columns that contain only empty values\\ndf = df.dropna(how='all').dropna(axis=1, how='all')\\n\\n# Get the basic information of the dataset\\ndf.info(memory_usage=False)\"}, 'model_type': None}, response_metadata={}, id='add6691d-d7ea-411d-9699-e99ae0b7de97', tool_calls=[{'name': 'python', 'args': {'query': \"# Load the data into a DataFrame\\ndf = read_df('titanic.csv')\\n\\n# Remove leading and trailing whitespaces in column names\\ndf.columns = df.columns.str.strip()\\n\\n# Remove rows and columns that contain only empty values\\ndf = df.dropna(how='all').dropna(axis=1, how='all')\\n\\n# Get the basic information of the dataset\\ndf.info(memory_usage=False)\"}, 'id': 'b846aa01-04ef-4669-9a5c-53ddcb9a2dfb', 'type': 'tool_call'}]),\n",
" ToolMessage(content=[{'type': 'text', 'text': \"```pycon\\n<class 'pandas.core.frame.DataFrame'>\\nRangeIndex: 4 entries, 0 to 3\\nData columns (total 8 columns):\\n # Column Non-Null Count Dtype \\n--- ------ -------------- ----- \\n 0 Pclass 4 non-null int64 \\n 1 Sex 4 non-null object \\n 2 Age 4 non-null float64\\n 3 SibSp 4 non-null int64 \\n 4 Parch 4 non-null int64 \\n 5 Fare 4 non-null float64\\n 6 Embarked 4 non-null object \\n 7 Survived 4 non-null int64 \\ndtypes: float64(2), int64(4), object(2)\\n```\"}], name='python', id='0d441b21-bff3-463c-a07f-c0b12bd17bc5', tool_call_id='b846aa01-04ef-4669-9a5c-53ddcb9a2dfb', artifact=[]),\n",
" AIMessage(content='接下来我将用 `df.head(5)` 来查看数据集的前 5 行。\\n```python\\n# Show the first 5 rows to understand the structure\\ndf.head(5)\\n```', additional_kwargs={'parent_id': 'some-parent-id1', 'thought': '接下来我将用 `df.head(5)` 来查看数据集的前 5 行。', 'action': {'tool': 'python', 'tool_input': '# Show the first 5 rows to understand the structure\\ndf.head(5)'}, 'model_type': None}, response_metadata={}, id='5e26ef1d-7042-471e-b39f-194a51a185c7', tool_calls=[{'name': 'python', 'args': {'query': '# Show the first 5 rows to understand the structure\\ndf.head(5)'}, 'id': 'f6be0d96-05b3-4b5b-8313-90197a8c3d87', 'type': 'tool_call'}]),\n",
" ToolMessage(content=[{'type': 'text', 'text': '```pycon\\n Pclass Sex Age SibSp Parch Fare Embarked Survived\\n0 2 female 29.0 0 2 23.000 S 1\\n1 3 female 39.0 1 5 31.275 S 0\\n2 3 male 26.5 0 0 7.225 C 0\\n3 3 male 32.0 0 0 56.496 S 1\\n```'}], name='python', id='6fc6d8aa-546c-467e-91d3-d57b0b62dd68', tool_call_id='f6be0d96-05b3-4b5b-8313-90197a8c3d87', artifact=[]),\n",
" AIMessage(content='我已经了解了数据集 titanic.csv 的基本信息。请问我可以帮您做些什么?', additional_kwargs={'parent_id': 'some-parent-id1'}, response_metadata={}, id='b6dc3885-94cb-4b0f-b691-f37c4c8c9ba3')]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from datetime import date\n",
"from tablegpt.agent.file_reading import Stage\n",
"\n",
"# Reading and processing files.\n",
"response = await agent.ainvoke(\n",
" input={\n",
" \"entry_message\": attachment_msg,\n",
" \"processing_stage\": Stage.UPLOADED,\n",
" \"messages\": [attachment_msg],\n",
" \"parent_id\": \"some-parent-id1\",\n",
" \"date\": date.today(),\n",
" },\n",
" config={\n",
" # Using checkpointer requires binding thread_id at runtime.\n",
" \"configurable\": {\"thread_id\": \"some-thread-id\"},\n",
" },\n",
")\n",
"response[\"messages\"]"
]
},
{
"cell_type": "markdown",
"id": "bed80e60",
"metadata": {},
"source": [
"Continue to ask questions for data analysis:\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "dcceeebf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"content=\"为了回答您的问题,我将筛选出所有男性乘客并计算其中的幸存者数量。\\n```python\\n# Filter male passengers who survived and count them\\nmale_survivors = df[(df['Sex'] == 'male') & (df['Survived'] == 1)]\\nmale_survivors_count = male_survivors.shape[0]\\nmale_survivors_count\\n```\" additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'TableGPT2-7B'} id='run-661d7496-341d-4a6b-84d8-b4094db66ef0'\n",
"content=[{'type': 'text', 'text': '```pycon\\n1\\n```'}] name='python' id='1c7531db-9150-451d-a8dd-f07176454e6f' tool_call_id='2860e8bb-0fa7-421b-bb2d-bfeca873354b' artifact=[]\n",
"content='根据数据集,有 1 名男性乘客幸存。' additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'TableGPT2-7B'} id='run-db640705-0085-4f47-adb4-3e0adce694cd'\n"
]
}
],
"source": [
"human_message = HumanMessage(content=\"How many men survived?\")\n",
"\n",
"async for event in agent.astream_events(\n",
" input={\n",
" # After using checkpoint, you only need to add new messages here.\n",
" \"messages\": [human_message],\n",
" \"parent_id\": \"some-parent-id2\",\n",
" \"date\": date.today(),\n",
" },\n",
" version=\"v2\",\n",
" # We configure the same thread_id to use checkpoints to retrieve the memory of the last run.\n",
" config={\"configurable\": {\"thread_id\": \"some-thread-id\"}},\n",
"):\n",
" event_name: str = event[\"name\"]\n",
" evt: str = event[\"event\"]\n",
" if evt == \"on_chat_model_end\":\n",
" print(event[\"data\"][\"output\"])\n",
" elif event_name == \"tool_node\" and evt == \"on_chain_stream\":\n",
" for lc_msg in event[\"data\"][\"chunk\"][\"messages\"]:\n",
" print(lc_msg)\n",
" else:\n",
" # Other events can be handled here.\n",
" pass\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/tutorials/continue-analysis-on-generated-charts.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "98a1786c",
"metadata": {},
"source": [
"# Continue Analysis on Generated Charts\n",
"\n",
"While TableGPT2 excels in data analysis tasks, it currently lacks built-in support for visual modalities. Many data analysis tasks involve visualization, so to address this limitation, we provide an interface for integrating your own Visual Language Model (VLM) plugin.\n",
"\n",
"When the agent performs a visualization task—typically using `matplotlib.pyplot.show`—the VLM will take over from the LLM, offering a more nuanced summarization of the visualization. This approach avoids the common pitfalls of LLMs in visualization tasks, which often either state, \"I have plotted the data,\" or hallucinating the content of the plot.\n",
"\n",
"We continue using the agent from the previous section to perform a data visualization task and observe its final output.\n",
"> **NOTE** Before you start, you can install Chinese fonts using the following command:\n",
"```bash\n",
"apt-get update && apt-get install -y --no-install-recommends fonts-noto-cjk\n",
"mplfonts init\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "15aba93a",
"metadata": {},
"outputs": [],
"source": [
"from datetime import date\n",
"from typing import TypedDict\n",
"\n",
"from langchain_core.messages import HumanMessage\n",
"from langchain_openai import ChatOpenAI\n",
"from langgraph.checkpoint.memory import MemorySaver\n",
"from pybox import AsyncLocalPyBoxManager\n",
"from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR\n",
"from tablegpt.agent import create_tablegpt_graph\n",
"from tablegpt.agent.file_reading import Stage\n",
"\n",
"llm = ChatOpenAI(openai_api_base=\"YOUR_VLLM_URL\", openai_api_key=\"whatever\", model_name=\"TableGPT2-7B\")\n",
"pybox_manager = AsyncLocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)\n",
"checkpointer = MemorySaver()\n",
"\n",
"agent = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager,\n",
" checkpointer=checkpointer,\n",
" session_id=\"some-session-id\", # This is required when using file-reading\n",
")\n",
"\n",
"class Attachment(TypedDict):\n",
" \"\"\"Contains at least one dictionary with the key filename.\"\"\"\n",
" filename: str\n",
"\n",
"attachment_msg = HumanMessage(\n",
" content=\"\",\n",
" # Please make sure your iPython kernel can access your filename.\n",
" additional_kwargs={\"attachments\": [Attachment(filename=\"titanic.csv\")]},\n",
")\n",
"\n",
"# Reading and processing files.\n",
"response = await agent.ainvoke(\n",
" input={\n",
" \"entry_message\": attachment_msg,\n",
" \"processing_stage\": Stage.UPLOADED,\n",
" \"messages\": [attachment_msg],\n",
" \"parent_id\": \"some-parent-id1\",\n",
" \"date\": date.today(),\n",
" },\n",
" config={\n",
" # Using checkpointer requires binding thread_id at runtime.\n",
" \"configurable\": {\"thread_id\": \"some-thread-id\"},\n",
" },\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "0afbab13",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"content=\"好的,我将基于性别绘制一个饼图,以展示每个性别的人数。首先,我们需要统计每个性别的人数,然后使用 `seaborn` 和 `matplotlib` 来绘制饼图。\\n\\n```python\\nimport seaborn as sns\\nimport matplotlib.pyplot as plt\\n\\n# Count the number of people for each gender\\ngender_counts = df['Sex'].value_counts()\\n\\n# Create a pie chart\\nplt.figure(figsize=(8, 6))\\nplt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))\\nplt.title('Gender Distribution')\\nplt.show()\\n```\" additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'TableGPT2-7B'} id='run-6115fe22-3b55-4d85-be09-6c31a59736f6'\n",
"content=[{'type': 'text', 'text': '```pycon\\n<Figure size 800x600 with 1 Axes>\\n```'}, {'type': 'image_url', 'image_url': {'url': 'data:image/png;base64,iVBORw0KG...'}}] name='python' id='226ba8f2-29a7-4706-9178-8cb5b4062488' tool_call_id='03eb1113-6aed-4e0a-a3c0-4cc0043a55ee' artifact=[]\n",
"content='饼图已经成功生成。' additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'TableGPT2-7B'} id='run-83468bd1-9451-4c78-91a3-b0f96ffa169a'\n"
]
}
],
"source": [
"# Define the human message that asks the model to draw a pie chart based on gender data\n",
"human_message = HumanMessage(content=\"Draw a pie chart based on gender and the number of people of each gender.\")\n",
"\n",
"async for event in agent.astream_events(\n",
" input={\n",
" \"messages\": [human_message],\n",
" \"parent_id\": \"some-parent-id2\",\n",
" \"date\": date.today(),\n",
" },\n",
" version=\"v2\",\n",
" # We configure the same thread_id to use checkpoints to retrieve the memory of the last run.\n",
" config={\"configurable\": {\"thread_id\": \"some-thread-id\"}},\n",
"):\n",
" evt = event[\"event\"]\n",
" if evt == \"on_chat_model_end\":\n",
" print(event[\"data\"][\"output\"])\n",
" elif event[\"name\"] == \"tool_node\" and evt == \"on_chain_stream\":\n",
" for lc_msg in event[\"data\"][\"chunk\"][\"messages\"]:\n",
" print(lc_msg)\n",
" else:\n",
" # Handle other events here\n",
" pass"
]
},
{
"cell_type": "markdown",
"id": "1c428aca",
"metadata": {},
"source": [
"Now let's set up the Visual Language Model (VLM) and create a new agent with VLM support:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "425633b7-14a4-4bbc-91e1-d94161a41682",
"metadata": {},
"outputs": [],
"source": [
"# Initialize the VLM instance\n",
"vlm = ChatOpenAI(openai_api_base=\"YOUR_VLM_URL\", openai_api_key=\"whatever\", model_name=\"YOUR_MODEL_NAME\")\n",
"\n",
"# Assume llm, pybox_manager, and memory_saver are defined elsewhere\n",
"agent_with_vlm = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager,\n",
" vlm=vlm,\n",
" checkpointer=checkpointer,\n",
" session_id=\"some-session-id\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "40a19cb4-adbc-49de-90af-4d43e77d4308",
"metadata": {},
"source": [
"We use a [time travel](https://langchain-ai.github.io/langgraph/tutorials/introduction/#part-7-time-travel) feature to go back to before the last time the agent gave an answer, to avoid past memories hallucinating the model:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3652d131-6ed7-4d75-bfe2-152ba40fb090",
"metadata": {},
"outputs": [],
"source": [
"state_history = agent.get_state_history(config={\"configurable\": {\"thread_id\": \"some-thread-id\"}})\n",
"\n",
"to_replay = None\n",
"for state in list(state_history)[::-1]:\n",
" if state.next and state.next[0] == \"__start__\":\n",
" to_replay = state"
]
},
{
"cell_type": "markdown",
"id": "2a82aeef-7906-45b8-a1b0-2d2b3c18451b",
"metadata": {},
"source": [
"Send the same question to the model via the new agent with VLM support"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e138cb4a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"content=\"好的,我将绘制一个饼图来展示数据集中男性和女性乘客的数量。\\n```python\\n# Count the number of passengers by gender\\ngender_counts = df['Sex'].value_counts()\\n\\n# Plot a pie chart\\nplt.figure(figsize=(8, 6))\\nplt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=140)\\nplt.title('Gender Distribution')\\nplt.show()\\n```\\n\" additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'TableGPT2-7B'} id='run-2d05b2ab-32f4-481f-8fa5-43c78515d9c3'\n",
"content=[{'type': 'text', 'text': '```pycon\\n<Figure size 800x600 with 1 Axes>\\n```'}, {'type': 'image_url', 'image_url': {'url': 'data:image/png;base64,iVBORw0K...'}}] name='python' id='51a99935-b0b1-496d-9a45-c1f318104773' tool_call_id='918d57ee-7362-4e0d-8d66-64b7e57ecaf6' artifact=[]\n",
"content='饼图显示数据集中性别分布为 50% 女性和 50% 男性,这表明男性和女性乘客数量相等。' additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'qwen2-vl-7b-instruct'} id='run-d9b0e891-f03c-40c8-8474-9fef7511c40b'\n"
]
}
],
"source": [
"async for event in agent_with_vlm.astream_events(\n",
" None,\n",
" to_replay.config,\n",
" version=\"v2\",\n",
"):\n",
" evt = event[\"event\"]\n",
" if evt == \"on_chat_model_end\":\n",
" print(event[\"data\"][\"output\"])\n",
" elif event[\"name\"] == \"tool_node\" and evt == \"on_chain_stream\":\n",
" for lc_msg in event[\"data\"][\"chunk\"][\"messages\"]:\n",
" print(lc_msg)\n",
" else:\n",
" # Handle other events here\n",
" pass"
]
},
{
"cell_type": "markdown",
"id": "20d009cb",
"metadata": {},
"source": [
"We observe that the answer provided by the agent with VLM support is significantly more detailed, including a comprehensive description of the generated images."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: docs/tutorials/quick-start.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "9a12e134",
"metadata": {},
"source": [
"# Quickstart\n",
"\n",
"## Installation\n",
"\n",
"To install TableGPT Agent, use the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe436583",
"metadata": {},
"outputs": [],
"source": [
"%pip install tablegpt-agent"
]
},
{
"cell_type": "markdown",
"id": "b81082f9-c74b-4d11-ab1f-2c8c041a29c4",
"metadata": {},
"source": [
"TableGPT Agent depends on pybox to manage its code execution environment. By default, pybox operates in in-cluster mode. If you intend to run tablegpt-agent in a local environment, install the optional dependency as follows:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "4c692b35-0e56-4d3e-b20a-6fefd6dbc9e4",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%pip install tablegpt-agent[local]"
]
},
{
"cell_type": "markdown",
"id": "2a55ff4b",
"metadata": {},
"source": [
"\n",
"This tutorial uses `langchain-openai` for the chat model instance. Please make sure you have it installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "503a2807",
"metadata": {},
"outputs": [],
"source": [
"%pip install langchain-openai"
]
},
{
"cell_type": "markdown",
"id": "b2d82049",
"metadata": {},
"source": [
"## Set Up the LLM Service\n",
"\n",
"Before using TableGPT Agent, ensure you have an OpenAI-compatible server configured to host TableGPT2. We recommend using [vllm](https://github.com/vllm-project/vllm) for this:\n",
"\n",
"```bash\n",
"python -m vllm.entrypoints.openai.api_server --served-model-name TableGPT2-7B --model path/to/weights\n",
"```\n",
"\n",
"> **NOTES:**\n",
">\n",
"> - To analyze tabular data with `tablegpt-agent`, make sure `TableGPT2` is served with `vllm` version 0.5.5 or higher.\n",
"> - For production environments, it's important to optimize the vllm server configuration. For details, refer to the [vllm documentation on server configuration](https://docs.vllm.ai/en/v0.6.0/serving/openai_compatible_server.html#command-line-arguments-for-the-server)."
]
},
{
"cell_type": "markdown",
"id": "2f0a9ec8",
"metadata": {},
"source": [
"## Create TableGPT Agent\n",
"\n",
"> **NOTE:** TableGPT Agent fully supports async invocation. If you are running this tutorial in a Jupyter Notebook, no additional setup is required. However, if you plan to run the tutorial in a Python console, make sure to use a console that supports asynchronous operations. To get started, execute the following command:\n",
">\n",
"> ```bash\n",
"> python -m asyncio\n",
"> ```\n",
"\n",
"In the console or notebook, create the agent as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4ac32d2f",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import ChatOpenAI\n",
"from pybox import AsyncLocalPyBoxManager\n",
"from tablegpt.agent import create_tablegpt_graph\n",
"\n",
"\n",
"llm = ChatOpenAI(openai_api_base=\"YOUR_VLLM_URL\", openai_api_key=\"whatever\", model_name=\"TableGPT2-7B\")\n",
"pybox_manager = AsyncLocalPyBoxManager()\n",
"\n",
"agent = create_tablegpt_graph(\n",
" llm=llm,\n",
" pybox_manager=pybox_manager,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "31ff4fe9",
"metadata": {},
"source": [
"## Start Chatting"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ee24c200",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[HumanMessage(content='Hi', additional_kwargs={}, response_metadata={}, id='34fe748c-81ab-49ea-bec6-9c621598a61a'), AIMessage(content=\"Hello! How can I assist you with data analysis today? Please let me know the details of the dataset you're working with and what specific analysis you'd like to perform.\", additional_kwargs={'parent_id': 'some-parent-id'}, response_metadata={}, id='a1ee29d2-723e-41c7-b420-27d0cfaed5dc')]\n"
]
}
],
"source": [
"from datetime import date\n",
"from langchain_core.messages import HumanMessage\n",
"\n",
"message = HumanMessage(content=\"Hi\")\n",
"\n",
"_input = {\n",
" \"messages\": [message],\n",
" \"parent_id\": \"some-parent-id\",\n",
" \"date\": date.today(),\n",
"}\n",
"\n",
"state = await agent.ainvoke(_input)\n",
"state[\"messages\"]"
]
},
{
"cell_type": "markdown",
"id": "3eca819d",
"metadata": {},
"source": [
"You can get more detailed outputs with the `astream_events` method:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3265cf83",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"content='Hello! How can I assist you with your data analysis today? Please let me know what dataset you are working with and what specific analyses or visualizations you would like to perform.' additional_kwargs={} response_metadata={'finish_reason': 'stop', 'model_name': 'TableGPT2-7B'} id='run-525eb149-0e3f-4b04-868b-708295f789ac'\n"
]
}
],
"source": [
"async for event in agent.astream_events(\n",
" input=_input,\n",
" version=\"v2\",\n",
"):\n",
" # We ignore irrelevant events here.\n",
" if event[\"event\"] == \"on_chat_model_end\":\n",
" print(event[\"data\"][\"output\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: examples/__init__.py
================================================
================================================
FILE: examples/data_analysis.py
================================================
import asyncio
from datetime import date
from typing import TypedDict
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from pybox import AsyncLocalPyBoxManager
from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR
from tablegpt.agent import create_tablegpt_graph
from tablegpt.agent.file_reading import Stage
class Attachment(TypedDict):
"""Contains at least one dictionary with the key filename."""
filename: str
"""The dataset uploaded in this session can be a filename, file path, or object storage address."""
# tablegpt-agent fully supports async invocation
async def main() -> None:
llm = ChatOpenAI(
openai_api_base="YOUR_VLLM_URL",
openai_api_key="whatever",
model_name="TableGPT2-7B",
)
# Use local pybox manager for development and testing
pybox_manager = AsyncLocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)
agent = create_tablegpt_graph(
llm=llm,
pybox_manager=pybox_manager,
# We use MemorySaver as a checkpointer to record memory automatically.
# See <https://langchain-ai.github.io/langgraph/concepts/persistence>
checkpointer=MemorySaver(),
# All code generated in this run will be executed in the kernel with kernel_id 'some-session-id'.
session_id="some-session-id",
)
attachment_msg = HumanMessage(
content="",
# The dataset can be viewed in examples/datasets/titanic.csv.
additional_kwargs={"attachments": [Attachment(filename="examples/datasets/titanic.csv")]},
)
await agent.ainvoke(
input={
"entry_message": attachment_msg,
"processing_stage": Stage.UPLOADED,
"messages": [attachment_msg],
"parent_id": "some-parent-id1",
"date": date.today(), # noqa: DTZ011
},
config={
"configurable": {"thread_id": "some-thread-id"},
},
)
human_message = HumanMessage(content="How many men survived?")
async for event in agent.astream_events(
input={
# After using checkpoint, you only need to add new messages here.
"messages": [human_message],
"parent_id": "some-parent-id2",
"date": date.today(), # noqa: DTZ011
},
version="v2",
# We configure the same thread_id to use checkpoints to retrieve the memory of the last run.
config={"configurable": {"thread_id": "some-thread-id"}},
):
print(event) # noqa: T201
asyncio.run(main())
================================================
FILE: examples/datasets/titanic.csv
================================================
Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Survived
2,female,29,0,2,23,S,1
3,female,39,1,5,31.275,S,0
3,male,26.5,0,0,7.225,C,0
3,male,32,0,0,56.4958,S,1
================================================
FILE: examples/datasets/产品销量表.csv
================================================
编号,名称,单位, 单价(元) ,销售量, 销售额
mb2033,法式面包,包, ¥7.40 ,305080," ¥2,257,592.00 "
mb2034,奶昔蛋糕,包, ¥5.80 ,93200," ¥540,560.00 "
mb2035,奶油夹心饼干,包, ¥3.10 ,215300," ¥667,430.00 "
mb2036,葱油饼,包, ¥2.20 ,102300," ¥225,060.00 "
mb2037,花生桃酥,包, ¥3.80 ,130000," ¥494,000.00 "
mb2038,巧克力饼干,包, ¥4.50 ,119800," ¥539,100.00 "
mb2039,果酱饼干,包, ¥4.10 ,120516," ¥494,115.60 "
mb2040,肉沫夹心饼,包, ¥5.50 ,86000," ¥473,000.00 "
mb2041,早餐饼干,包, ¥2.30 ,104500," ¥240,350.00 "
yl1322,矿泉水,瓶, ¥0.90 ,65000," ¥58,500.00 "
yl1323,可乐,瓶, ¥3.50 ,10200," ¥35,700.00 "
yl1324,冰咖啡,瓶, ¥5.60 ,235040," ¥1,316,224.00 "
yl1325,优果汁,瓶, ¥3.50 ,130500," ¥456,750.00 "
yl1326,奶茶,瓶, ¥4.20 ,98000," ¥411,600.00 "
gg0258,奶油瓜子,千克, ¥6.10 ,105000," ¥640,500.00 "
gg0259,五香瓜子,千克, ¥8.50 ,150000," ¥1,275,000.00 "
gg0260,白味瓜子,千克, ¥8.20 ,132000," ¥1,082,400.00 "
gg0261,麻辣花生,千克, ¥9.00 ,120500," ¥1,084,500.00 "
gg0262,麻辣瓜子仁,千克, ¥9.50 ,98000," ¥931,000.00 "
gg0263,薯条,千克, ¥9.50 ,130000," ¥1,235,000.00 "
gg0264,香酥爆米花,千克, ¥10.00 ,125800," ¥1,258,000.00 "
================================================
FILE: examples/quick_start.py
================================================
import asyncio
from datetime import date
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from pybox import AsyncLocalPyBoxManager
from tablegpt.agent import create_tablegpt_graph
# tablegpt-agent fully supports async invocation
async def main() -> None:
llm = ChatOpenAI(openai_api_base="YOUR_VLLM_URL", openai_api_key="whatever", model_name="TableGPT2-7B")
# Use local pybox manager for development and testing
pybox_manager = AsyncLocalPyBoxManager()
agent = create_tablegpt_graph(
llm=llm,
pybox_manager=pybox_manager,
)
message = HumanMessage(content="Hi")
_input = {
"messages": [message],
"parent_id": "some-parent-id",
"date": date.today(), # noqa: DTZ011
}
async for event in agent.astream_events(
input=_input,
version="v2",
):
print(event) # noqa: T201
asyncio.run(main())
================================================
FILE: ipython/README.md
================================================
# TableGPT IPython Kernel
This kernel executes code generated by `tablegpt-agent` and comes equipped with data analysis libraries and Chinese font support.
## Startup Scripts
It's recommended to put helper functions and configuration in startup scripts. Place your startup scripts in the `~/.ipython/profile_default/startup/` directory for them to take effect.
Note: The `~/.ipython` directory must be writable by the process launching the kernel; otherwise you will see the warning `UserWarning: IPython dir '/home/jovyan/.ipython' is not a writable location, using a temp directory.` and the startup scripts won't take effect.
The official documentation at `~/.ipython/profile_default/startup/README` reads:
> This is the IPython startup directory
>
> .py and .ipy files in this directory will be run *prior* to any code or files specified
> via the exec_lines or exec_files configurables whenever you load this profile.
>
> Files will be run in lexicographical order, so you can control the execution order of files
> with a prefix, e.g.::
>
> 00-first.py
> 50-middle.py
> 99-last.ipy
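For example, a startup script that should run after the bundled `00-pandas.py` can simply use a higher prefix. The filename `10-numpy.py` and the settings below are an illustrative sketch, not part of this repository:

```python
# ~/.ipython/profile_default/startup/10-numpy.py
# Hypothetical startup script: files in this directory run before any
# user code, in lexicographical order (so this runs after 00-*.py).
import numpy as np

# Keep printed arrays compact for chat-style transcripts.
np.set_printoptions(precision=3, suppress=True)
```

Because the kernel runs these scripts before any agent-generated code, everything defined here (imports, options, helper functions) is already available in the user namespace when analysis begins.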
================================================
FILE: ipython/ipython-startup-scripts/00-pandas.py
================================================
import pandas as pd
pd.set_option("display.width", 2048)
# 8 is the minimum value to display `df.describe()`. We have other truncation mechanisms so it's OK to flex this a bit.
pd.set_option("display.max_rows", 8)
pd.set_option("display.max_columns", 40)
pd.set_option("display.max_colwidth", 40)
pd.set_option("display.precision", 3)
pd.set_option("future.no_silent_downcasting", True)
================================================
FILE: ipython/ipython-startup-scripts/98-udfs.py
================================================
from __future__ import annotations
import concurrent.futures
import os
from pathlib import Path
from typing import NamedTuple, cast
import pandas as pd
class FileEncoding(NamedTuple):
"""File encoding as the NamedTuple."""
encoding: str | None
"""The encoding of the file."""
confidence: float
"""The confidence of the encoding."""
language: str | None
"""The language of the file."""
def detect_file_encodings(
file_path: str | Path, timeout: int = 5
) -> list[FileEncoding]:
"""Try to detect the file encoding.
Returns a list of `FileEncoding` tuples with the detected encodings ordered
by confidence.
Args:
file_path: The path to the file to detect the encoding for.
timeout: The timeout in seconds for the encoding detection.
"""
import chardet
file_path = str(file_path)
def read_and_detect(file_path: str) -> list[dict]:
with open(file_path, "rb") as f:
rawdata = f.read()
return cast(list[dict], chardet.detect_all(rawdata))
with concurrent.futures.ThreadPoolExecutor() as executor:
future = executor.submit(read_and_detect, file_path)
try:
encodings = future.result(timeout=timeout)
except concurrent.futures.TimeoutError:
raise TimeoutError(
f"Timeout reached while detecting encoding for {file_path}"
)
if all(encoding["encoding"] is None for encoding in encodings):
raise RuntimeError(f"Could not detect encoding for {file_path}")
return [FileEncoding(**enc) for enc in encodings if enc["encoding"] is not None]
def path_from_uri(uri: str) -> Path:
"""Return a new path from the given 'file' URI.
This is implemented in Python 3.13.
See <https://github.com/python/cpython/pull/107640>
and <https://github.com/python/cpython/pull/107640/files#diff-fa525485738fc33d05b06c159172ff1f319c26e88d8c6bb39f7dbaae4dc4105c>
TODO: remove when we migrate to Python 3.13"""
if not uri.startswith("file:"):
raise ValueError(f"URI does not start with 'file:': {uri!r}")
path = uri[5:]
if path[:3] == "///":
# Remove empty authority
path = path[2:]
elif path[:12] == "//localhost/":
# Remove 'localhost' authority
path = path[11:]
if path[:3] == "///" or (path[:1] == "/" and path[2:3] in ":|"):
# Remove slash before DOS device/UNC path
path = path[1:]
if path[1:2] == "|":
# Replace bar with colon in DOS drive
path = path[:1] + ":" + path[2:]
from urllib.parse import unquote_to_bytes
path = Path(os.fsdecode(unquote_to_bytes(path)))
if not path.is_absolute():
raise ValueError(f"URI is not absolute: {uri!r}")
return path
def file_extention(file: str) -> str:
path = Path(file)
return path.suffix
def read_df(uri: str, *, autodetect_encoding: bool = True, **kwargs) -> pd.DataFrame:
"""A simple wrapper to read different file formats into DataFrame."""
try:
return _read_df(uri, **kwargs)
except UnicodeDecodeError as e:
if autodetect_encoding:
detected_encodings = detect_file_encodings(path_from_uri(uri), timeout=30)
for encoding in detected_encodings:
try:
return _read_df(uri, encoding=encoding.encoding, **kwargs)
except UnicodeDecodeError:
continue
# Either we ran out of detected encodings, or autodetect_encoding is False;
# either way, raise an encoding error.
raise ValueError(f"不支持的文件编码{e.encoding},请转换成 utf-8 后重试") # noqa: RUF001
def _read_df(uri: str, encoding: str = "utf-8", **kwargs) -> pd.DataFrame:
"""A simple wrapper to read different file formats into DataFrame."""
ext = file_extention(uri).lower()
if ext == ".csv":
df = pd.read_csv(uri, encoding=encoding, **kwargs)
elif ext == ".tsv":
df = pd.read_csv(uri, sep="\t", encoding=encoding, **kwargs)
elif ext in [".xls", ".xlsx", ".xlsm", ".xlsb", ".odf", ".ods", ".odt"]:
# read_excel does not support 'encoding' arg, also it seems that it does not need it.
df = pd.read_excel(uri, **kwargs)
else:
raise ValueError(
f"TableGPT 目前支持 csv、tsv 以及 xlsx 文件,您上传的文件格式 {ext} 暂不支持。" # noqa: RUF001
)
return df
================================================
FILE: ipython/ipython-startup-scripts/99-cfont.py
================================================
import seaborn as sns
from mplfonts import use_font
use_font("Noto Serif CJK SC")
sns.set_theme(font="Noto Serif CJK SC")
================================================
FILE: ipython/requirements.txt
================================================
pandas >=2.2,<3.0.0
scipy >=1.13.0,<2.0.0
tabulate >=0.9.0,<1.0.0
scikit-learn >=1.0.0,<2.0.0
statsmodels >=0.10.0,<1.0.0
matplotlib >=3.8.4,<4.0.0
seaborn >=0.13.1,<1.0.0
mplfonts >=0.0.8,<1.0.0
numexpr >=2.8.4
openpyxl >=3.1.2,<4.0.0 # read xlsx files
xlrd >= 2.0.1 # read xls files
odfpy # read ods files
================================================
FILE: mkdocs.yml
================================================
site_name: TableGPT Agent
theme:
name: "material"
features:
- navigation.footer
- search.highlight
- search.share
- content.action.edit
- content.action.view
icon:
edit: material/pencil
view: material/eye
palette:
# Palette toggle for light mode
- scheme: default
toggle:
icon: material/brightness-7
name: Switch to dark mode
# Palette toggle for dark mode
- scheme: slate
toggle:
icon: material/brightness-4
name: Switch to light mode
plugins:
- mkdocs-jupyter
- mkdocstrings
- search
extra_css:
- stylesheets/extra.css
markdown_extensions:
- pymdownx.highlight:
anchor_linenums: true
line_spans: __span
pygments_lang_class: true
- pymdownx.inlinehilite
- pymdownx.snippets
- pymdownx.superfences
- toc:
permalink: "#"
nav:
- Home: index.md
- Tutorials:
- 'Quick Start': tutorials/quick-start.ipynb
- 'Chat on tabular data': tutorials/chat-on-tabular-data.ipynb
- 'Continue Analysis on Generated Charts': tutorials/continue-analysis-on-generated-charts.ipynb
- 'How-To Guides':
- 'Enhance TableGPT Agent with RAG': howto/retrieval.ipynb
- 'Persist Messages': howto/persist-messages.ipynb
- 'Messages Truncation': howto/messages-truncation.ipynb
- 'Incluster Code Execution': howto/incluster-code-execution.md
- 'Normalize Datasets': howto/normalize-datasets.ipynb
- 'Cleanup Error Trace': howto/cleanup-error-trace.md
- 'Customize Table Info': howto/customize-table-info.md
- Reference: reference.md
- Explanation:
- 'Agent Workflow': explanation/agent-workflow.md
- 'File Reading': explanation/file-reading.ipynb
- 'Code Sandbox': explanation/code-sandbox.md
- 'IPython Startup Scripts': explanation/ipython-startup-scripts.md
repo_name: tablegpt/tablegpt-agent
repo_url: https://github.com/tablegpt/tablegpt-agent
edit_uri: edit/main/docs/
================================================
FILE: pyproject.toml
================================================
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "tablegpt-agent"
dynamic = ["version"]
description = ''
readme = "README.md"
requires-python = ">=3.9"
license = {file = "LICENSE"}
keywords = []
authors = [
{ name = "Aofeng Su", email = "saf@zjuici.com" },
{ name = "Chen Zhou", email = "zc@zjuici.com" },
{ name = "Junbo Zhao", email = "j.zhao@zju.edu.cn" },
{ name = "Junlin Zhou", email = "jlzhou@zjuici.com" },
{ name = "Tao Zhang", email = "zt@zjuici.com" },
{ name = "Xiang Li", email = "xli@zjuici.com" },
]
classifiers = [
"Development Status :: 4 - Beta",
"Programming Language :: Python",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = [
"chardet>=5.2.0,<6.0.0",
"langchain-core>=0.3.0,<1.0.0",
"langgraph>=0.0.68,<1.0.0",
"pandas>=2.2,<3.0.0",
"pppybox>=0.0.17"
]
[project.urls]
Documentation = "https://tablegpt.github.io/tablegpt-agent/"
Issues = "https://github.com/tablegpt/tablegpt-agent/issues"
Source = "https://github.com/tablegpt/tablegpt-agent"
[project.optional-dependencies]
local = [
"pandas >=2.2,<3.0.0",
"scipy >=1.13.0,<2.0.0",
"tabulate >=0.9.0,<1.0.0",
"scikit-learn >=1.0.0,<2.0.0",
"statsmodels >=0.10.0,<1.0.0",
"matplotlib >=3.8.4,<4.0.0",
"seaborn >=0.13.1,<1.0.0",
"mplfonts >=0.0.8,<1.0.0",
"numexpr >=2.8.4",
"openpyxl >=3.1.2,<4.0.0",
"xlrd >= 2.0.1",
"odfpy",
"pppybox[local]"
]
[tool.hatch.build.targets.sdist]
exclude = [
".devcontainer",
".github",
"/docs",
"/examples",
"/realtabbench",
"collect_script.py",
]
[tool.hatch.build.targets.wheel]
packages = ["src/tablegpt"]
[tool.hatch.build.targets.wheel.shared-data]
"ipython/ipython-startup-scripts" = "share/ipykernel/profile/tablegpt/startup"
[tool.hatch.version]
path = "src/tablegpt/__about__.py"
[tool.hatch.envs.types]
extra-dependencies = [
"mypy>=1.0.0",
]
[tool.hatch.envs.types.scripts]
check = "mypy --install-types --non-interactive {args:src/tablegpt tests}"
[tool.hatch.envs.docs]
dependencies = [
"mkdocs",
"mkdocstrings[python]",
"mkdocs-jupyter",
"mkdocs-material",
]
[tool.coverage.run]
source_pkgs = ["tablegpt", "tests"]
branch = true
parallel = true
omit = [
"src/tablegpt/__about__.py",
]
[tool.coverage.paths]
tablegpt = ["src/tablegpt"]
tests = ["tests"]
[tool.coverage.report]
exclude_lines = [
"no cov",
"if __name__ == .__main__.:",
"if TYPE_CHECKING:",
]
[tool.ruff]
# Exclude a variety of commonly ignored directories.
exclude = [
"ipython"
]
# Allow lines to be as long as 120.
line-length = 120
[tool.ruff.lint.flake8-tidy-imports]
ban-relative-imports = "parents"
[tool.ruff.lint.flake8-type-checking]
runtime-evaluated-base-classes = ["pydantic.BaseModel", "sqlalchemy.orm.DeclarativeBase"]
runtime-evaluated-decorators = ["pydantic.validate_call", "attrs.define"]
================================================
SYMBOL INDEX (316 symbols across 44 files)
FILE: collect_script.py
function get_os_info (line 6) | def get_os_info():
function get_python_info (line 17) | def get_python_info():
function get_pip_list (line 25) | def get_pip_list():
function write_to_log_file (line 38) | def write_to_log_file(content, filename="env_output.log"):
function main (line 43) | def main():
FILE: examples/data_analysis.py
class Attachment (line 14) | class Attachment(TypedDict):
function main (line 22) | async def main() -> None:
FILE: examples/quick_start.py
function main (line 11) | async def main() -> None:
FILE: ipython/ipython-startup-scripts/98-udfs.py
class FileEncoding (line 11) | class FileEncoding(NamedTuple):
function detect_file_encodings (line 22) | def detect_file_encodings(
function path_from_uri (line 57) | def path_from_uri(uri: str) -> Path:
function file_extention (line 86) | def file_extention(file: str) -> str:
function read_df (line 91) | def read_df(uri: str, *, autodetect_encoding: bool = True, **kwargs) -> ...
function _read_df (line 108) | def _read_df(uri: str, encoding: str = "utf-8", **kwargs) -> pd.DataFrame:
FILE: realtabbench/agent_eval/__main__.py
function main (line 25) | async def main() -> None:
FILE: realtabbench/agent_eval/config.py
class DatasetSettings (line 14) | class DatasetSettings(BaseModel):
class EvalSettings (line 18) | class EvalSettings(BaseSettings):
function load_config (line 33) | def load_config() -> dict[str, Any]:
FILE: realtabbench/agent_eval/evaluatee.py
class AbstractEvaluatee (line 17) | class AbstractEvaluatee(AbstractAsyncContextManager, ABC):
method _call (line 19) | async def _call(self, message: BaseMessage, **kwargs) -> list[BaseMess...
method __call__ (line 21) | async def __call__(self, message: BaseMessage, **kwargs) -> list[BaseM...
method context (line 26) | def context(self):
method instance (line 31) | def instance(cls) -> Self: ...
FILE: realtabbench/agent_eval/evaluator/__init__.py
function create_evaluator_runnable (line 17) | def create_evaluator_runnable(llm: BaseLanguageModel):
FILE: realtabbench/agent_eval/evaluator/output_parser.py
class FloatScoreOutputParser (line 11) | class FloatScoreOutputParser(ScoreStringResultOutputParser):
method parse (line 16) | def parse(self, text: str) -> dict[str, Any]:
FILE: realtabbench/agent_eval/evaluator/prompt.py
function format_criteria (line 28) | def format_criteria(criteria: list[str]) -> str:
function format_redlines (line 40) | def format_redlines(attentions: list[str]) -> str:
function format_reference_answer (line 50) | def format_reference_answer(reference_answer: str) -> str:
FILE: realtabbench/agent_eval/questioner.py
function main (line 65) | def main(dataset_path, questions_path: Path, description: str, *, nrows:...
FILE: realtabbench/agent_eval/runner.py
class Runner (line 28) | class Runner:
method __init__ (line 34) | def __init__(self, config: EvalSettings) -> None:
method run (line 45) | async def run(self, stop_event: asyncio.Event) -> None:
function enqueue_samples (line 77) | async def enqueue_samples(queue: asyncio.Queue, dataset_configs: list[di...
function construct_samples (line 103) | def construct_samples(dataset: list[dict[str, Any]]) -> list[BaseMessage]:
FILE: realtabbench/agent_eval/tablegpt_evaluatee.py
class IpythonSettings (line 34) | class IpythonSettings(BaseModel):
class Settings (line 44) | class Settings(BaseSettings):
function get_settings (line 71) | def get_settings() -> Settings:
function get_llm_instance (line 76) | def get_llm_instance() -> BaseLanguageModel:
function get_vlm_instance (line 82) | def get_vlm_instance() -> BaseLanguageModel:
function get_guard_llm_instance (line 90) | def get_guard_llm_instance() -> BaseLanguageModel:
function get_normalize_llm_instance (line 98) | def get_normalize_llm_instance() -> BaseLanguageModel:
function get_pybox_manager (line 106) | def get_pybox_manager() -> BasePyBoxManager:
class Attachment (line 122) | class Attachment(TypedDict):
class TablegptEvaluatee (line 128) | class TablegptEvaluatee(AbstractEvaluatee):
method __init__ (line 129) | def __init__(
method __aenter__ (line 149) | async def __aenter__(self):
method __aexit__ (line 159) | async def __aexit__(self, exc_type, exc_value, traceback):
method _call (line 170) | async def _call(self, message: BaseMessage, **kwargs) -> list[BaseMess...
method context (line 225) | def context(self):
method instance (line 229) | def instance(cls) -> Self:
FILE: realtabbench/agent_eval/worker.py
class Worker (line 25) | class Worker:
method __init__ (line 26) | def __init__(
method run (line 42) | async def run(self) -> None:
class EvalExecutor (line 64) | class EvalExecutor:
method __init__ (line 65) | def __init__(
method run (line 75) | async def run(self, sample: BaseMessage) -> None:
FILE: realtabbench/inference.py
function get_infer_kwargs (line 8) | def get_infer_kwargs(args) -> dict:
function load_tokenizer_and_template (line 21) | def load_tokenizer_and_template(model_name_or_path, template=None):
function load_model (line 38) | def load_model(model_name_or_path, max_model_len=None, gpus_num=1):
function generate_outputs (line 54) | def generate_outputs(messages_batch, llm_model, tokenizer, generate_args):
FILE: realtabbench/inference_encoder.py
function extract_contrastive_table (line 16) | def extract_contrastive_table(df: pd.DataFrame):
function cleanup (line 32) | def cleanup():
function inference_with_encoder (line 42) | def inference_with_encoder(args, format_msg_datas):
function truncate (line 63) | def truncate(value, max_length=80):
function format_encoder_tables (line 69) | def format_encoder_tables(df_names, table_paths):
function build_encoder_table_part_content (line 97) | def build_encoder_table_part_content(df_names, table_paths):
function read_df_head (line 126) | def read_df_head(table_path, head_num=3, format_type="string"):
FILE: realtabbench/run_text2sql_eval.py
function main (line 9) | def main(args):
FILE: realtabbench/text2sql/src/evaluation.py
function load_json (line 14) | def load_json(dir): # noqa: A002
function execute_sql (line 23) | def execute_sql(predicted_sql, ground_truth, db_path):
function execute_model (line 41) | def execute_model(sql_pair, db_place, idx, meta_time_out):
function package_sqls (line 56) | def package_sqls(sql_path, db_root_path, mode="gpt", data_mode="dev"): ...
function run_sqls_parallel (line 82) | def run_sqls_parallel(sqls, db_places, num_cpus=1, meta_time_out=30.0):
function sort_results (line 90) | def sort_results(list_of_dicts):
function compute_acc_by_diff (line 94) | def compute_acc_by_diff(exec_results, contents):
function print_data (line 140) | def print_data(score_lists, count_lists):
function evaluation_main (line 153) | def evaluation_main(args, eval_datas, predicted_sql_path):
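`execute_sql` here presumably implements the standard execution-accuracy check used in BIRD/Spider-style evaluation: run both the predicted and the gold query against the same SQLite database and compare result sets. A sketch under that assumption:

```python
import sqlite3


def execute_sql(predicted_sql: str, ground_truth: str, db_path: str) -> int:
    """Return 1 when both queries yield the same set of rows, else 0."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        predicted = set(cur.execute(predicted_sql).fetchall())
        gold = set(cur.execute(ground_truth).fetchall())
    finally:
        conn.close()
    return int(predicted == gold)
```

Set comparison deliberately ignores row order; the `meta_time_out` parameter on `execute_model` and `run_sqls_parallel` in the listing suggests the real runner additionally bounds each query with a timeout.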
FILE: realtabbench/text2sql/src/gpt_request.py
function new_directory (line 18) | def new_directory(path):
function load_json (line 23) | def load_json(data_path):
function get_db_schemas (line 28) | def get_db_schemas(bench_root: str, db_name: str) -> dict[str, str]:
function nice_look_table (line 46) | def nice_look_table(column_names: list, values: list):
function generate_schema_prompt (line 60) | def generate_schema_prompt(db_path, num_rows=None):
function generate_comment_prompt (line 97) | def generate_comment_prompt(question, knowledge=None):
function cot_wizard (line 112) | def cot_wizard():
function few_shot (line 117) | def few_shot():
function few_shot_no_kg (line 125) | def few_shot_no_kg():
function generate_combined_prompts_one (line 133) | def generate_combined_prompts_one(db_path, question, knowledge=None):
function quota_giveup (line 140) | def quota_giveup(e):
function connect_gpt (line 144) | def connect_gpt(engine, prompt, max_tokens, temperature, stop):
function llm_generate_result (line 154) | def llm_generate_result(model_name_or_path, gpus_num, prompt_ls, args=No...
function gpt_generate_result (line 201) | def gpt_generate_result(model_name_or_path, gpus_num, prompt_ls, args=No...
function parser_sql (line 250) | def parser_sql(text):
function collect_response_from_gpt (line 266) | def collect_response_from_gpt(model_path, gpus_num, db_path_list, questi...
function question_package (line 303) | def question_package(data_json, knowledge=False): # noqa: ARG001, FBT002
function knowledge_package (line 307) | def knowledge_package(data_json, knowledge=False): # noqa: ARG001, FBT002
function decouple_question_schema (line 311) | def decouple_question_schema(datasets, db_root_path):
function generate_sql_file (line 324) | def generate_sql_file(sql_lst, output_path=None):
function generate_main (line 338) | def generate_main(eval_data, args):
FILE: realtabbench/text2sql/src/gpt_request_encoder.py
function get_table_info (line 50) | def get_table_info(db_path, enum_num=None):
function generate_combined_prompts_one_encoder (line 119) | def generate_combined_prompts_one_encoder(db_path, question, knowledge=N...
function get_encoder_prompt (line 127) | def get_encoder_prompt(table_info):
function get_messages_one (line 131) | def get_messages_one(db_path, question, knowledge=None):
function calculate_table_num (line 170) | def calculate_table_num():
function llm_generate_result_encoder (line 206) | def llm_generate_result_encoder(model_name_or_path, gpus_num, messages_ls):
function col_nums_max (line 234) | def col_nums_max(message):
function llm_generate_result_encoder_one (line 246) | def llm_generate_result_encoder_one(model_name_or_path, gpus_num, messag...
function collect_response_from_gpt_encoder (line 281) | def collect_response_from_gpt_encoder(model_path, gpus_num, db_path_list...
function generate_main_encoder (line 313) | def generate_main_encoder(eval_data, args):
function test_single (line 356) | def test_single():
FILE: realtabbench/utils.py
function read_jsonl (line 17) | def read_jsonl(file_path):
function load_json (line 22) | def load_json(data_path):
function save_json (line 30) | def save_json(data_path, data_list):
function get_dfs_info (line 38) | def get_dfs_info(table_paths):
function sample_from_two_lists (line 55) | def sample_from_two_lists(list1, list2, threshold=0.5):
function recraft_query (line 70) | def recraft_query(query, _locals):
function extract_code_without_comments (line 76) | def extract_code_without_comments(code):
function is_python_code (line 97) | def is_python_code(line: str) -> bool:
function extract_text_before_code (line 126) | def extract_text_before_code(text: str) -> str:
function extract_python_code (line 139) | def extract_python_code(text: str) -> str:
function fix_indents (line 146) | def fix_indents(text: str) -> str:
function filter_cot (line 150) | def filter_cot(completion: str):
function filter_code (line 165) | def filter_code(completion: str) -> tuple[str, str]:
function get_tool (line 191) | def get_tool(df: Any, df_names=None):
function get_table_infos (line 214) | def get_table_infos(table_paths):
class TimeoutException (line 232) | class TimeoutException(Exception): # noqa: N818
function timeout (line 238) | def timeout(time):
function run_code (line 255) | def run_code(code, result, tool):
function execute_with_timeout (line 263) | def execute_with_timeout(code, timeout_seconds, tool):
FILE: src/tablegpt/__init__.py
function _find_tablegpt_ipykernel_profile_dir (line 8) | def _find_tablegpt_ipykernel_profile_dir():
FILE: src/tablegpt/agent/__init__.py
class AgentState (line 23) | class AgentState(MessagesState):
function create_tablegpt_graph (line 35) | def create_tablegpt_graph(
FILE: src/tablegpt/agent/data_analyzer.py
class TruncationConfig (line 43) | class TruncationConfig:
function get_data_analyzer_agent (line 120) | def get_data_analyzer_agent(llm: BaseLanguageModel) -> Runnable:
class AgentState (line 124) | class AgentState(MessagesState):
function create_data_analyze_workflow (line 133) | def create_data_analyze_workflow(
FILE: src/tablegpt/agent/file_reading/__init__.py
class Stage (line 33) | class Stage(Enum):
class AgentState (line 39) | class AgentState(MessagesState):
function create_file_reading_workflow (line 51) | def create_file_reading_workflow(
FILE: src/tablegpt/agent/file_reading/data_normalizer.py
function seq_to_md (line 25) | def seq_to_md(raw_table_info: list[list[Any]]) -> str:
function is_split (line 55) | def is_split(origin: list[list[Any]], resp: list[list[Any]]) -> tuple[in...
class EvalResultError (line 84) | class EvalResultError(OutputParserException):
method __init__ (line 85) | def __init__(self, text: str):
class OutputTypeError (line 89) | class OutputTypeError(OutputParserException):
method __init__ (line 90) | def __init__(self, text: str, expected_type: str):
class ListListOutputParser (line 94) | class ListListOutputParser(BaseTransformOutputParser[list[list[Any]]]):
method parse (line 108) | def parse(self, text: str) -> list[list[Any]]:
class ListTupleOutputParser (line 122) | class ListTupleOutputParser(BaseTransformOutputParser[list[list[Any]]]):
method parse (line 136) | def parse(self, text: str) -> list[list[Any]]:
function get_table_reformat_chain (line 169) | def get_table_reformat_chain(llm: BaseLanguageModel) -> Runnable:
class NoFinalDFError (line 177) | class NoFinalDFError(OutputParserException):
method __init__ (line 178) | def __init__(self):
class NoPythonCodeError (line 182) | class NoPythonCodeError(OutputParserException):
method __init__ (line 183) | def __init__(self):
class CodeOutputParser (line 187) | class CodeOutputParser(StrOutputParser):
method parse (line 194) | def parse(self, text: str) -> str:
function get_data_normalize_chain (line 258) | def get_data_normalize_chain(llm: BaseLanguageModel) -> Runnable:
function wrap_normalize_code (line 294) | def wrap_normalize_code(var_name: str, normalization_code: str) -> str:
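`wrap_normalize_code` evidently splices LLM-generated normalization code — which must end by assigning `final_df`, judging from `NoFinalDFError` — into a guarded template around a named DataFrame variable. One plausible shape, offered as a sketch rather than the actual template:

```python
import textwrap


def wrap_normalize_code(var_name: str, normalization_code: str) -> str:
    # Indent the generated snippet into a try block so a failed
    # normalization leaves the original DataFrame bound to `var_name`.
    body = textwrap.indent(normalization_code.strip(), "    ")
    return (
        "try:\n"
        f"{body}\n"
        f"    {var_name} = final_df\n"
        "except Exception:\n"
        "    pass\n"
    )
```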
FILE: src/tablegpt/agent/output_parser.py
function override (line 21) | def override(func):
class MarkdownOutputParser (line 25) | class MarkdownOutputParser(BaseOutputParser[AgentAction | AgentFinish]):
method parse (line 36) | def parse(self, text: str) -> AgentAction | AgentFinish:
method _type (line 80) | def _type(self) -> str:
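`MarkdownOutputParser` decides between an `AgentAction` (a fenced code block to execute) and an `AgentFinish` (no block). The extraction step can be sketched with a regex; the `first_code_block` helper below is illustrative, not the library's API:

```python
import re

# Match ```lang\n ... ``` non-greedily; DOTALL lets the body span lines.
_FENCE = re.compile(r"```(\w*)\n(.*?)```", re.DOTALL)


def first_code_block(text: str):
    """Return (language, code) for the first fenced block, or None."""
    match = _FENCE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```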
FILE: src/tablegpt/errors.py
class NoAttachmentsError (line 4) | class NoAttachmentsError(KeyError):
method __init__ (line 5) | def __init__(self):
class InvalidURIError (line 9) | class InvalidURIError(ValueError): ...
class InvalidFileURIError (line 12) | class InvalidFileURIError(InvalidURIError):
method __init__ (line 13) | def __init__(self, uri: str):
class NonAbsoluteURIError (line 17) | class NonAbsoluteURIError(InvalidURIError):
method __init__ (line 18) | def __init__(self, uri: str):
class UnsupportedFileFormatError (line 22) | class UnsupportedFileFormatError(ValueError):
method __init__ (line 23) | def __init__(self, ext: str):
class UnsupportedEncodingError (line 29) | class UnsupportedEncodingError(ValueError):
method __init__ (line 30) | def __init__(self, encoding: str):
class EncodingDetectionError (line 36) | class EncodingDetectionError(LookupError):
method __init__ (line 37) | def __init__(self, path: str):
class SimpleOutputParserException (line 41) | class SimpleOutputParserException(OutputParserException):
method __init__ (line 42) | def __init__(self, input_text: str):
FILE: src/tablegpt/retriever/__init__.py
function format_columns (line 19) | def format_columns(
function format_values (line 47) | def format_values(
FILE: src/tablegpt/retriever/compressor.py
function override (line 14) | def override(func):
class ColumnDocCompressor (line 24) | class ColumnDocCompressor(BaseDocumentCompressor):
method compress_documents (line 32) | def compress_documents(
FILE: src/tablegpt/retriever/loader.py
function override (line 17) | def override(func):
class CSVLoader (line 27) | class CSVLoader(BaseLoader):
method __init__ (line 33) | def __init__(
method lazy_load (line 59) | def lazy_load(self) -> Iterator[Document]:
method alazy_load (line 66) | async def alazy_load(self) -> AsyncIterator[Document]:
method column2docs (line 72) | def column2docs(self, column: Series) -> Iterator[Document]:
FILE: src/tablegpt/safety.py
class HazardOutputParser (line 32) | class HazardOutputParser(BaseTransformOutputParser[tuple[str, str | None...
method parse (line 33) | def parse(self, text: str) -> tuple[str, str | None]:
function create_hazard_classifier (line 63) | def create_hazard_classifier(llm: BaseLanguageModel) -> Runnable:
FILE: src/tablegpt/tools.py
function override (line 19) | def override(func):
class Artifact (line 28) | class Artifact(BaseModel):
method extract_filename (line 40) | def extract_filename(self) -> Self:
method ensure_path_absolute (line 46) | def ensure_path_absolute(cls, v: Path) -> Path:
class IPythonTool (line 50) | class IPythonTool(BaseTool):
method _run (line 77) | def _run(
method _arun (line 127) | async def _arun(
method _guess_artifact_paths (line 176) | def _guess_artifact_paths(self, code: str) -> list[Path]:
method _extract_error_trace (line 190) | def _extract_error_trace(self, e: ErrorContent) -> str:
function process_content (line 210) | def process_content(content: str | list[str | dict]) -> list[dict]:
FILE: src/tablegpt/translation.py
function create_translator (line 24) | def create_translator(llm: BaseLanguageModel) -> Runnable:
FILE: src/tablegpt/utils.py
function path_from_uri (line 25) | def path_from_uri(uri: str) -> Path:
function file_extension (line 61) | def file_extension(file: str) -> str:
function read_df (line 74) | def read_df(uri: str, *, autodetect_encoding: bool = True, **kwargs) -> ...
function _read_df (line 112) | def _read_df(uri: str, encoding: str = "utf-8", **kwargs) -> pd.DataFrame:
class FileEncoding (line 149) | class FileEncoding(NamedTuple):
function detect_file_encodings (line 160) | def detect_file_encodings(file_path: str | Path, timeout: int = 5) -> li...
function filter_contents (line 194) | def filter_contents(messages: list[BaseMessage], keep: Sequence[str] | N...
function filter_content (line 224) | def filter_content(message: BaseMessage, keep: Sequence[str] | None = No...
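`path_from_uri` together with the `InvalidFileURIError`/`NonAbsoluteURIError` classes above implies a strict `file://` URI to `Path` conversion (the test names cover Unix, Windows, and UNC cases). A POSIX-oriented sketch, with plain `ValueError`s standing in for the custom error types:

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def path_from_uri(uri: str) -> Path:
    """Convert a file:// URI to an absolute Path (POSIX-oriented sketch)."""
    parts = urlsplit(uri)
    if parts.scheme != "file":
        raise ValueError(f"not a file URI: {uri}")
    path = unquote(parts.path)  # decode %20 and friends
    if parts.netloc:
        # UNC form: file://server/share/x -> //server/share/x
        path = f"//{parts.netloc}{path}"
    result = Path(path)
    if not result.is_absolute():
        raise ValueError(f"non-absolute file URI: {uri}")
    return result
```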
FILE: tests/agent/file_reading/test_data_normalizer.py
class TestListListOutputParser (line 12) | class TestListListOutputParser(unittest.TestCase):
method setUp (line 13) | def setUp(self):
method test_parse_within_text (line 16) | def test_parse_within_text(self):
method test_parse_single_inner_list (line 21) | def test_parse_single_inner_list(self):
method test_parse_multiple_inner_lists (line 26) | def test_parse_multiple_inner_lists(self):
method test_parse_inner_lists_with_brackets (line 31) | def test_parse_inner_lists_with_brackets(self):
method test_parse_list_with_empty_inner_list (line 36) | def test_parse_list_with_empty_inner_list(self):
method test_parse_inner_list_with_single_element (line 41) | def test_parse_inner_list_with_single_element(self):
method test_parse_whitespaces (line 46) | def test_parse_whitespaces(self):
method test_parse_inner_list_with_commas (line 51) | def test_parse_inner_list_with_commas(self):
method test_parse_outer_empty_list (line 56) | def test_parse_outer_empty_list(self):
method test_parse_inner_empty_lists (line 61) | def test_parse_inner_empty_lists(self):
method test_parse_inner_lists_with_mixed_item_types (line 66) | def test_parse_inner_lists_with_mixed_item_types(self):
method test_parse_mixed_types (line 74) | def test_parse_mixed_types(self):
method test_parse_inner_lists_with_special_characters (line 79) | def test_parse_inner_lists_with_special_characters(self):
method test_one_dimension_array (line 95) | def test_one_dimension_array(self):
method test_invalid_output (line 100) | def test_invalid_output(self):
method test_parse_unrecognized (line 130) | def test_parse_unrecognized(self):
class TestListTupleOutputParser (line 136) | class TestListTupleOutputParser(unittest.TestCase):
method setUp (line 137) | def setUp(self):
method test_parse_within_text (line 140) | def test_parse_within_text(self):
method test_parse_single_inner_tuple_list (line 145) | def test_parse_single_inner_tuple_list(self):
method test_parse_multiple_inner_tuple_lists (line 150) | def test_parse_multiple_inner_tuple_lists(self):
method test_parse_inner_tuple_with_parentheses (line 155) | def test_parse_inner_tuple_with_parentheses(self):
method test_parse_list_with_empty_tuple (line 160) | def test_parse_list_with_empty_tuple(self):
method test_parse_single_element_tuple (line 165) | def test_parse_single_element_tuple(self):
method test_parse_whitespaces (line 170) | def test_parse_whitespaces(self):
method test_parse_inner_tuple_with_commas (line 175) | def test_parse_inner_tuple_with_commas(self):
method test_parse_outer_empty_list (line 180) | def test_parse_outer_empty_list(self):
method test_parse_inner_empty_tuples (line 185) | def test_parse_inner_empty_tuples(self):
method test_parse_inner_tuples_with_mixed_item_types (line 190) | def test_parse_inner_tuples_with_mixed_item_types(self):
method test_parse_mixed_types (line 198) | def test_parse_mixed_types(self):
method test_parse_inner_tuples_with_special_characters (line 203) | def test_parse_inner_tuples_with_special_characters(self):
method test_one_dimension_array (line 218) | def test_one_dimension_array(self):
method test_invalid_output (line 223) | def test_invalid_output(self):
method test_parse_unrecognized (line 257) | def test_parse_unrecognized(self):
class TestCodeOutputParser (line 263) | class TestCodeOutputParser(unittest.TestCase):
method setUp (line 264) | def setUp(self):
method test_parse_valid_python (line 267) | def test_parse_valid_python(self):
method test_parse_valid_python_without_newline (line 275) | def test_parse_valid_python_without_newline(self):
method test_parse_unknown (line 283) | def test_parse_unknown(self):
method test_parse_no_final_df (line 287) | def test_parse_no_final_df(self):
class TestWrapNormalizeCode (line 296) | class TestWrapNormalizeCode(unittest.TestCase):
method test_wrap_normalize_code_basic (line 297) | def test_wrap_normalize_code_basic(self):
method test_wrap_empty_normalize_code (line 314) | def test_wrap_empty_normalize_code(self):
method test_wrap_multi_line_normalize_code (line 331) | def test_wrap_multi_line_normalize_code(self):
FILE: tests/agent/test_output_parser.py
class TestMarkdownOutputParser (line 14) | class TestMarkdownOutputParser(unittest.TestCase):
method test_valid_markdown_known_language_action (line 16) | def test_valid_markdown_known_language_action(self, mock_uuid):
method test_valid_markdown_unknown_language (line 49) | def test_valid_markdown_unknown_language(self):
method test_valid_markdown_no_code_block (line 57) | def test_valid_markdown_no_code_block(self):
method test_valid_markdown_multiple_code_blocks (line 68) | def test_valid_markdown_multiple_code_blocks(self):
method test_empty_input (line 105) | def test_empty_input(self):
FILE: tests/retriever/test_compressor.py
class TestCompressDocuments (line 7) | class TestCompressDocuments(unittest.TestCase):
method setUp (line 8) | def setUp(self):
method test_single_column_single_file (line 11) | def test_single_column_single_file(self):
method test_multiple_columns_single_file (line 33) | def test_multiple_columns_single_file(self):
method test_multiple_columns_multiple_files (line 59) | def test_multiple_columns_multiple_files(self):
method test_empty_input (line 93) | def test_empty_input(self):
FILE: tests/retriever/test_format.py
class TestFormatColumns (line 7) | class TestFormatColumns(unittest.TestCase):
method test_format_empty_column_docs (line 8) | def test_format_empty_column_docs(self):
method test_format_column_docs (line 12) | def test_format_column_docs(self):
method test_format_and_compress_column (line 33) | def test_format_and_compress_column(self):
FILE: tests/retriever/test_loader.py
function mock_df (line 10) | def mock_df():
function loader (line 16) | def loader():
function test_initialization (line 21) | def test_initialization(loader):
function test_lazy_load (line 27) | def test_lazy_load(loader, mock_df):
function test_lazy_load_with_missing_metadata (line 53) | def test_lazy_load_with_missing_metadata(mock_df):
function test_column2docs (line 78) | def test_column2docs(loader, mock_df):
function test_empty_csv (line 88) | def test_empty_csv(loader):
function test_csv_with_non_string_column (line 95) | def test_csv_with_non_string_column(loader):
FILE: tests/test_profile_init.py
class TestTableGPTInit (line 6) | class TestTableGPTInit(unittest.TestCase):
method setUp (line 7) | def setUp(self):
method tearDown (line 20) | def tearDown(self):
method test_find_tablegpt_ipykernel_profile_dir_found (line 30) | def test_find_tablegpt_ipykernel_profile_dir_found(self):
method test_default_tablegpt_ipykernel_profile_dir_not_found (line 39) | def test_default_tablegpt_ipykernel_profile_dir_not_found(self):
FILE: tests/test_safety.py
class TestHazardOutputParser (line 6) | class TestHazardOutputParser(unittest.TestCase):
method setUp (line 7) | def setUp(self):
method test_parse_safe (line 10) | def test_parse_safe(self):
method test_parse_safe_with_spaces (line 14) | def test_parse_safe_with_spaces(self):
method test_parse_unknown (line 18) | def test_parse_unknown(self):
method test_parse_unsafe_text_with_category (line 22) | def test_parse_unsafe_text_with_category(self):
method test_parse_unsafe_text_with_invalid_format (line 27) | def test_parse_unsafe_text_with_invalid_format(self):
FILE: tests/test_tools.py
class TestProcessContent (line 6) | class TestProcessContent(unittest.TestCase):
method test_single_string (line 7) | def test_single_string(self):
method test_list_of_strings (line 12) | def test_list_of_strings(self):
method test_list_of_mixed_strings_and_dicts (line 17) | def test_list_of_mixed_strings_and_dicts(self):
method test_list_of_only_dicts (line 29) | def test_list_of_only_dicts(self):
method test_empty_string (line 40) | def test_empty_string(self):
method test_empty_list (line 45) | def test_empty_list(self):
method test_list_with_empty_string (line 50) | def test_list_with_empty_string(self):
method test_text_in_dict (line 58) | def test_text_in_dict(self):
FILE: tests/test_utils.py
class TestPathFromUri (line 11) | class TestPathFromUri(unittest.TestCase):
method test_valid_file_uri_unix (line 13) | def test_valid_file_uri_unix(self):
method test_valid_file_uri_windows (line 20) | def test_valid_file_uri_windows(self):
method test_valid_file_uri_unc_path (line 27) | def test_valid_file_uri_unc_path(self):
method test_invalid_file_uri (line 33) | def test_invalid_file_uri(self):
method test_relative_file_uri (line 40) | def test_relative_file_uri(self):
method test_invalid_dos_drive (line 48) | def test_invalid_dos_drive(self):
method test_valid_file_uri_with_encoded_characters (line 55) | def test_valid_file_uri_with_encoded_characters(self):
class TestFilterContent (line 62) | class TestFilterContent(unittest.TestCase):
method test_filter_content_with_string_content (line 63) | def test_filter_content_with_string_content(self):
method test_filter_content_with_list_of_strings (line 68) | def test_filter_content_with_list_of_strings(self):
method test_filter_content_with_list_of_dicts (line 73) | def test_filter_content_with_list_of_dicts(self):
method test_filter_content_with_custom_keep (line 84) | def test_filter_content_with_custom_keep(self):
method test_filter_content_with_mixed_content (line 98) | def test_filter_content_with_mixed_content(self):
method test_filter_content_with_no_text_type (line 110) | def test_filter_content_with_no_text_type(self):
Condensed preview — 106 files, each showing path, character count, and a content snippet (full structured content: 4,182K chars).
[
{
"path": ".devcontainer/devcontainer.json",
"chars": 837,
"preview": "// For format details, see https://aka.ms/devcontainer.json. For config options, see the\n// README at: https://github.co"
},
{
"path": ".gitattributes",
"chars": 89,
"preview": "* text=auto eol=lf\n*.{cmd,[cC][mM][dD]} text eol=crlf\n*.{bat,[bB][aA][tT]} text eol=crlf\n"
},
{
"path": ".github/ISSUE_TEMPLATE/bug_report.md",
"chars": 915,
"preview": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: \"\"\nlabels: bug\nassignees: \"\"\n---\n\n- [ ] I have sea"
},
{
"path": ".github/workflows/ci.yml",
"chars": 2141,
"preview": "name: \"CI\"\n\non:\n push:\n branches: [ main ]\n paths:\n - \"src/**\"\n - \"tests/**\"\n - \"Makefile\"\n -"
},
{
"path": ".github/workflows/publish-docs.yml",
"chars": 897,
"preview": "name: Publish docs\n\non:\n push:\n branches: [ main ]\n paths:\n - \"docs/**\"\n - \"mkdocs.yml\"\n workflow_disp"
},
{
"path": ".github/workflows/publish.yml",
"chars": 727,
"preview": "name: Publish to PyPI\n\non:\n release:\n types: [published]\n\njobs:\n deploy:\n\n runs-on: ubuntu-latest\n\n permissio"
},
{
"path": ".github/workflows/stale.yml",
"chars": 994,
"preview": "name: \"Close stale issues and PRs\"\n\npermissions:\n actions: write\n contents: write # only for delete-branch option\n is"
},
{
"path": ".gitignore",
"chars": 12763,
"preview": "##\n# MacOS\n# <https://github.com/github/gitignore/blob/main/Global/macOS.gitignore>\n##\n\n# General\n.DS_Store\n.AppleDouble"
},
{
"path": ".pre-commit-config.yaml",
"chars": 788,
"preview": "repos:\n - repo: https://github.com/pre-commit/pre-commit-hooks\n rev: v4.5.0\n hooks:\n - id: check-yaml\n "
},
{
"path": "CONTRIBUTING.md",
"chars": 3173,
"preview": "# Welcome to TableGPT-Agent contributing guide <!-- omit in toc -->\n\nThank you for investing your time in contributing t"
},
{
"path": "LICENSE",
"chars": 11358,
"preview": "\n Apache License\n Version 2.0, January 2004\n "
},
{
"path": "Makefile",
"chars": 718,
"preview": "\n# Default target executed when no arguments are given to make.\nall: help\n\nlint:\n\thatch fmt --check\n\nformat:\n\thatch fmt\n"
},
{
"path": "README.md",
"chars": 2107,
"preview": "# TableGPT Agent\n\n[](https://pypi.org/project/tablegp"
},
{
"path": "collect_script.py",
"chars": 1473,
"preview": "import platform\nimport subprocess\nimport sys\n\n\ndef get_os_info():\n return {\n \"system\": platform.system(),\n "
},
{
"path": "docs/explanation/agent-workflow.md",
"chars": 6452,
"preview": "# Agent Workflow\n\nThe Agent Workflow is the core functionality of the `tablegpt-agent`. It processes user input and gene"
},
{
"path": "docs/explanation/code-sandbox.md",
"chars": 8724,
"preview": "# Code Sandbox\n\n`tablegpt-agent` directs `tablegpt` to generate Python code for data analysis. However, the generated co"
},
{
"path": "docs/explanation/file-reading.ipynb",
"chars": 22618,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"229c30c0-9715-48a2-b5fe-ee8c733d847a\",\n \"metadata\": {},\n \"so"
},
{
"path": "docs/explanation/ipython-startup-scripts.md",
"chars": 48,
"preview": "# IPython Startup Scripts\n\n<!-- Placeholder -->\n"
},
{
"path": "docs/howto/cleanup-error-trace.md",
"chars": 44,
"preview": "# Cleanup Error Trace\n\n<!-- Placeholder -->\n"
},
{
"path": "docs/howto/customize-table-info.md",
"chars": 45,
"preview": "# Customize Table Info\n\n<!-- Placeholder -->\n"
},
{
"path": "docs/howto/incluster-code-execution.md",
"chars": 352,
"preview": "# Incluster Code Execution\n\nThe `tablegpt-agent` directs `tablegpt` to generate Python code for data analysis. This code"
},
{
"path": "docs/howto/messages-truncation.ipynb",
"chars": 10399,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Messages Truncation\\n\",\n \"\\n\","
},
{
"path": "docs/howto/normalize-datasets.ipynb",
"chars": 20933,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"592d977a-34b0-42f0-879c-1e8afe5cb134\",\n \"metadata\": {},\n \"so"
},
{
"path": "docs/howto/persist-messages.ipynb",
"chars": 29500,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"a04d67a0-660f-41ea-a873-bab8c5f6197c\",\n \"metadata\": {},\n \"so"
},
{
"path": "docs/howto/retrieval.ipynb",
"chars": 10703,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"25c9a82f-ea07-434c-a031-e844bc28279d\",\n \"metadata\": {},\n \"so"
},
{
"path": "docs/index.md",
"chars": 1936,
"preview": "# Home\n\n[](https://pypi.org/project/tablegpt-agent)\n["
},
{
"path": "docs/reference.md",
"chars": 58,
"preview": "# API Reference\n\n::: tablegpt.agent.create_tablegpt_graph\n"
},
{
"path": "docs/stylesheets/extra.css",
"chars": 147,
"preview": "/* hide jupyter notebooks input/output numbers */\n.jp-InputPrompt {\n display: none !important;\n}\n\n.jp-OutputPrompt {\n "
},
{
"path": "docs/tutorials/chat-on-tabular-data.ipynb",
"chars": 10879,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"1944f5bf\",\n \"metadata\": {},\n \"source\": [\n \"# Chat on Tabu"
},
{
"path": "docs/tutorials/continue-analysis-on-generated-charts.ipynb",
"chars": 10098,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"98a1786c\",\n \"metadata\": {},\n \"source\": [\n \"# Continue Ana"
},
{
"path": "docs/tutorials/quick-start.ipynb",
"chars": 6183,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"9a12e134\",\n \"metadata\": {},\n \"source\": [\n \"# Quickstart\\n"
},
{
"path": "examples/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "examples/data_analysis.py",
"chars": 2645,
"preview": "import asyncio\nfrom datetime import date\nfrom typing import TypedDict\n\nfrom langchain_core.messages import HumanMessage\n"
},
{
"path": "examples/datasets/titanic.csv",
"chars": 152,
"preview": "Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Survived\n2,female,29,0,2,23,S,1\n3,female,39,1,5,31.275,S,0\n3,male,26.5,0,0,7.22"
},
{
"path": "examples/datasets/产品销量表.csv",
"chars": 983,
"preview": "编号,名称,单位, 单价(元) ,销售量, 销售额 \nmb2033,法式面包,包, ¥7.40 ,305080,\" ¥2,257,592.00 \"\nmb2034,奶昔蛋糕,包, ¥5.80 ,93200,\" ¥540,560.00 \"\nm"
},
{
"path": "examples/quick_start.py",
"chars": 939,
"preview": "import asyncio\nfrom datetime import date\n\nfrom langchain_core.messages import HumanMessage\nfrom langchain_openai import "
},
{
"path": "ipython/README.md",
"chars": 1099,
"preview": "# TableGPT IPython Kernel\n\nThis kernel is used to execute code generated by `tablegpt-agent` and has been equipped with "
},
{
"path": "ipython/ipython-startup-scripts/00-pandas.py",
"chars": 388,
"preview": "import pandas as pd\n\npd.set_option(\"display.width\", 2048)\n# 8 is the minimum value to display `df.describe()`. We have o"
},
{
"path": "ipython/ipython-startup-scripts/98-udfs.py",
"chars": 4376,
"preview": "from __future__ import annotations\n\nimport concurrent.futures\nimport os\nfrom pathlib import Path\nfrom typing import Name"
},
{
"path": "ipython/ipython-startup-scripts/99-cfont.py",
"chars": 123,
"preview": "import seaborn as sns\nfrom mplfonts import use_font\n\nuse_font(\"Noto Serif CJK SC\")\nsns.set_theme(font=\"Noto Serif CJK SC"
},
{
"path": "ipython/requirements.txt",
"chars": 311,
"preview": "pandas >=2.2,<3.0.0\nscipy >=1.13.0,<2.0.0\ntabulate >=0.9.0,<1.0.0\nscikit-learn >=1.0.0,<2.0.0\nstatsmodels >=0.10.0,<1.0."
},
{
"path": "mkdocs.yml",
"chars": 1954,
"preview": "site_name: TableGPT Agent\n\ntheme:\n name: \"material\"\n features:\n - navigation.footer\n - search.highlight\n - se"
},
{
"path": "pyproject.toml",
"chars": 3105,
"preview": "[build-system]\nrequires = [\"hatchling\"]\nbuild-backend = \"hatchling.build\"\n\n[project]\nname = \"tablegpt-agent\"\ndynamic = ["
},
{
"path": "realtabbench/README.md",
"chars": 2925,
"preview": "# Benchmark Evaluations: A Variety of Academic and Table-Related Benchmark Evaluations for TableGPT2\n\n## Overview\n\nThis "
},
{
"path": "realtabbench/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "realtabbench/agent_eval/README.md",
"chars": 2436,
"preview": "# TableGPT Evaluation\n\nThis document will guide you through the process of setting up the evaluation environment and run"
},
{
"path": "realtabbench/agent_eval/__init__.py",
"chars": 141,
"preview": "import logging\nimport os\n\nLOG_LEVEL = os.getenv(\"LOG_LEVEL\", \"INFO\")\n\n\nlogger = logging.getLogger(__name__)\nlogger.setLe"
},
{
"path": "realtabbench/agent_eval/__main__.py",
"chars": 1437,
"preview": "import asyncio\nimport logging\nimport os\nimport signal\nimport sys\n\nfrom dotenv import find_dotenv, load_dotenv\nfrom langc"
},
{
"path": "realtabbench/agent_eval/config.py",
"chars": 1452,
"preview": "import argparse\nimport logging\nfrom pathlib import Path\nfrom typing import Any\nfrom uuid import uuid4\n\nimport yaml\nfrom "
},
{
"path": "realtabbench/agent_eval/evaluatee.py",
"chars": 801,
"preview": "from __future__ import annotations\n\nimport logging\nfrom abc import ABC, abstractmethod\nfrom contextlib import AbstractAs"
},
{
"path": "realtabbench/agent_eval/evaluator/__init__.py",
"chars": 986,
"preview": "from operator import itemgetter\n\nfrom langchain_core.language_models import BaseLanguageModel\nfrom langchain_core.prompt"
},
{
"path": "realtabbench/agent_eval/evaluator/output_parser.py",
"chars": 1472,
"preview": "from __future__ import annotations\n\nfrom typing import Any\n\nfrom langchain.evaluation.scoring.eval_chain import (\n _F"
},
{
"path": "realtabbench/agent_eval/evaluator/prompt.py",
"chars": 1916,
"preview": "from __future__ import annotations\n\nINSTRUCTION = \"\"\"You are a teacher grading a quiz. Start by providing a brief reason"
},
{
"path": "realtabbench/agent_eval/example-config.yaml",
"chars": 457,
"preview": "user: eval-example\n\nmetadata:\n name: tablegpt eval\n llm:\n name: qwen2.5-7b-instruct\n temperature: 0.1\n top_p:"
},
{
"path": "realtabbench/agent_eval/questioner.py",
"chars": 3401,
"preview": "import logging\nimport sys\nfrom pathlib import Path\n\nimport pandas as pd\nfrom langchain_core.output_parsers.list import N"
},
{
"path": "realtabbench/agent_eval/requirements.txt",
"chars": 253,
"preview": "tablegpt-agent\naiofiles\ntqdm\npydantic >= 2.0\npydantic-settings >= 2.0\npython-dotenv\npyyaml\nipython\nipykernel\nlangchain\nl"
},
{
"path": "realtabbench/agent_eval/runner.py",
"chars": 4669,
"preview": "from __future__ import annotations\n\nimport asyncio\nimport datetime\nimport json\nimport logging\nfrom typing import TYPE_CH"
},
{
"path": "realtabbench/agent_eval/tablegpt_evaluatee.py",
"chars": 8259,
"preview": "from __future__ import annotations\n\nimport logging\nimport shutil\nimport tempfile\nfrom datetime import date\nfrom functool"
},
{
"path": "realtabbench/agent_eval/worker.py",
"chars": 5992,
"preview": "from __future__ import annotations\n\nimport asyncio\nimport json\nimport logging\nimport traceback\nfrom typing import TYPE_C"
},
{
"path": "realtabbench/evalset/bird_data/dev.json",
"chars": 741332,
"preview": "[\n {\n \"question_id\": 0,\n \"db_id\": \"california_schools\",\n \"question\": \"What is the highest eligible free rate f"
},
{
"path": "realtabbench/evalset/bird_data/dev.sql",
"chars": 272002,
"preview": "SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST(`Free Mea"
},
{
"path": "realtabbench/evalset/bird_data/dev_tables.json",
"chars": 158350,
"preview": "[\n {\n \"db_id\": \"debit_card_specializing\",\n \"table_names_original\": [\n \"customers\",\n "
},
{
"path": "realtabbench/evalset/spider_data/dev.json",
"chars": 360398,
"preview": "[\n {\n \"question_id\": 0,\n \"db_id\": \"concert_singer\",\n \"question\": \"How many singers do we have?\","
},
{
"path": "realtabbench/evalset/spider_data/dev_gold.sql",
"chars": 124056,
"preview": "SELECT count(*) FROM singer\tconcert_singer\nSELECT count(*) FROM singer\tconcert_singer\nSELECT name , country , age FROM"
},
{
"path": "realtabbench/evalset/spider_data/test.json",
"chars": 768931,
"preview": "[\n {\n \"question_id\": 0,\n \"db_id\": \"soccer_3\",\n \"question\": \"How many clubs are there?\",\n "
},
{
"path": "realtabbench/evalset/spider_data/test_gold.sql",
"chars": 274612,
"preview": "SELECT count(*) FROM club\tsoccer_3\nSELECT count(*) FROM club\tsoccer_3\nSELECT Name FROM club ORDER BY Name ASC\tsoccer_3\nS"
},
{
"path": "realtabbench/evalset/spider_data/test_tables.json",
"chars": 760941,
"preview": "[\n {\n \"column_names\": [\n [\n -1,\n \"*\"\n ],\n [\n 0,\n \"perpetrator id\"\n ]"
},
{
"path": "realtabbench/inference.py",
"chars": 3030,
"preview": "import os\nimport pathlib\n\nfrom transformers import AutoTokenizer\nfrom vllm import LLM, SamplingParams\n\n\ndef get_infer_kw"
},
{
"path": "realtabbench/inference_encoder.py",
"chars": 4404,
"preview": "import contextlib\nimport copy\nimport gc\nimport logging\n\nimport pandas as pd\nimport torch\nfrom vllm import LLM\nfrom vllm."
},
{
"path": "realtabbench/requirements.txt",
"chars": 485,
"preview": "vllm>=0.7.2\ndefog_data==0.1.1\nfunc_timeout==4.3.5\nlangchain==0.2.5\nlangchain_core==0.2.43\nlangchain_experimental==0.0.61"
},
{
"path": "realtabbench/run_text2sql_eval.py",
"chars": 2539,
"preview": "import argparse\nimport json\n\nfrom text2sql.src.evaluation import evaluation_main\nfrom text2sql.src.gpt_request import ge"
},
{
"path": "realtabbench/text2sql/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "realtabbench/text2sql/src/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "realtabbench/text2sql/src/evaluation.py",
"chars": 6436,
"preview": "import json\nimport logging\nimport os\nimport sqlite3\nimport sys\n\nfrom func_timeout import FunctionTimedOut, func_timeout\n"
},
{
"path": "realtabbench/text2sql/src/gpt_request.py",
"chars": 15412,
"preview": "from __future__ import annotations\n\nimport json\nimport logging\nimport os\nimport re\nimport sqlite3\n\nimport openai\nfrom op"
},
{
"path": "realtabbench/text2sql/src/gpt_request_encoder.py",
"chars": 12068,
"preview": "# 1. git clone -b v0.5.5-tablegpt-merged https://github.com/zTaoplus/vllm.git\n# install tablegpt vllm\n\n\n## apply diff f"
},
{
"path": "realtabbench/utils.py",
"chars": 8208,
"preview": "from __future__ import annotations\n\nimport ast\nimport json\nimport random\nimport re\nimport signal\nimport threading\nfrom c"
},
{
"path": "src/tablegpt/__about__.py",
"chars": 23,
"preview": "__version__ = \"0.2.27\"\n"
},
{
"path": "src/tablegpt/__init__.py",
"chars": 1087,
"preview": "from __future__ import annotations\n\nimport sysconfig\nimport warnings\nfrom pathlib import Path\n\n\ndef _find_tablegpt_ipyke"
},
{
"path": "src/tablegpt/agent/__init__.py",
"chars": 5316,
"preview": "from __future__ import annotations\n\nfrom datetime import date # noqa: TCH003\nfrom typing import TYPE_CHECKING\n\nfrom lan"
},
{
"path": "src/tablegpt/agent/data_analyzer.py",
"chars": 16801,
"preview": "from __future__ import annotations\n\nfrom copy import deepcopy\nfrom dataclasses import asdict, dataclass\nfrom datetime im"
},
{
"path": "src/tablegpt/agent/file_reading/__init__.py",
"chars": 13161,
"preview": "from __future__ import annotations\n\nimport ast\nimport logging\nfrom ast import literal_eval\nfrom enum import Enum\nfrom ty"
},
{
"path": "src/tablegpt/agent/file_reading/data_normalizer.py",
"chars": 13018,
"preview": "from __future__ import annotations\n\nimport ast\nimport re\nimport textwrap\nfrom operator import itemgetter\nfrom re import "
},
{
"path": "src/tablegpt/agent/output_parser.py",
"chars": 3237,
"preview": "from __future__ import annotations\n\nimport logging\nimport re\nfrom re import Pattern\nfrom sys import version_info\nfrom uu"
},
{
"path": "src/tablegpt/errors.py",
"chars": 1198,
"preview": "from langchain_core.exceptions import OutputParserException\n\n\nclass NoAttachmentsError(KeyError):\n def __init__(self)"
},
{
"path": "src/tablegpt/retriever/__init__.py",
"chars": 2500,
"preview": "from __future__ import annotations\n\nimport json\nfrom typing import TYPE_CHECKING\n\nfrom tablegpt.retriever.compressor imp"
},
{
"path": "src/tablegpt/retriever/compressor.py",
"chars": 1971,
"preview": "from __future__ import annotations\n\nfrom collections import defaultdict\nfrom sys import version_info\nfrom typing import "
},
{
"path": "src/tablegpt/retriever/loader.py",
"chars": 2957,
"preview": "from __future__ import annotations\n\nfrom pathlib import Path\nfrom sys import version_info\nfrom typing import TYPE_CHECKI"
},
{
"path": "src/tablegpt/safety.py",
"chars": 4455,
"preview": "from __future__ import annotations\n\nfrom typing import TYPE_CHECKING\n\nfrom langchain_core.output_parsers import BaseTran"
},
{
"path": "src/tablegpt/tools.py",
"chars": 9683,
"preview": "from __future__ import annotations\n\nimport mimetypes\nimport re\nfrom pathlib import Path\nfrom re import Pattern\nfrom sys "
},
{
"path": "src/tablegpt/translation.py",
"chars": 761,
"preview": "from __future__ import annotations\n\nfrom typing import TYPE_CHECKING\n\nfrom langchain_core.output_parsers import StrOutpu"
},
{
"path": "src/tablegpt/utils.py",
"chars": 10408,
"preview": "from __future__ import annotations\n\nimport concurrent.futures\nimport os\nfrom copy import deepcopy\nfrom pathlib import Pa"
},
{
"path": "tests/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/agent/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/agent/file_reading/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/agent/file_reading/test_data_normalizer.py",
"chars": 14111,
"preview": "import unittest\n\nfrom langchain_core.exceptions import OutputParserException\nfrom tablegpt.agent.file_reading.data_norma"
},
{
"path": "tests/agent/test_output_parser.py",
"chars": 4794,
"preview": "import logging\nimport unittest\nfrom unittest.mock import patch\nfrom uuid import uuid4\n\nfrom langchain_core.agents import"
},
{
"path": "tests/retriever/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "tests/retriever/test_compressor.py",
"chars": 3651,
"preview": "import unittest\n\nfrom langchain_core.documents import Document\nfrom tablegpt.retriever.compressor import ColumnDocCompre"
},
{
"path": "tests/retriever/test_format.py",
"chars": 1753,
"preview": "import unittest\n\nfrom langchain_core.documents import Document\nfrom tablegpt.retriever import format_columns\n\n\nclass Tes"
},
{
"path": "tests/retriever/test_loader.py",
"chars": 3628,
"preview": "from unittest.mock import patch\n\nimport pytest\nfrom langchain_core.documents import Document\nfrom pandas import DataFram"
},
{
"path": "tests/test_profile_init.py",
"chars": 1837,
"preview": "import sys\nimport unittest\nfrom unittest.mock import MagicMock, patch\n\n\nclass TestTableGPTInit(unittest.TestCase):\n d"
},
{
"path": "tests/test_safety.py",
"chars": 960,
"preview": "import unittest\n\nfrom tablegpt.safety import HazardOutputParser\n\n\nclass TestHazardOutputParser(unittest.TestCase):\n d"
},
{
"path": "tests/test_tools.py",
"chars": 2131,
"preview": "import unittest\n\nfrom tablegpt.tools import process_content\n\n\nclass TestProcessContent(unittest.TestCase):\n def test_"
},
{
"path": "tests/test_utils.py",
"chars": 4708,
"preview": "import unittest\nfrom pathlib import Path\n\nfrom langchain_core.messages import BaseMessage\nfrom tablegpt.utils import (\n "
}
]
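The array above is a JSON manifest in which each record carries a `path`, a `chars` size, and a truncated `preview`. A minimal sketch of loading and filtering such a manifest — the three entries are copied from the listing, and the variable names are illustrative, not part of the extraction format:

```python
import json

# Three entries copied verbatim from the manifest above ("chars" is the
# file's size in characters; the "preview" field is omitted for brevity).
manifest_json = """
[
  {"path": "src/tablegpt/__about__.py", "chars": 23},
  {"path": "realtabbench/evalset/bird_data/dev.json", "chars": 741332},
  {"path": "tests/__init__.py", "chars": 0}
]
"""

entries = json.loads(manifest_json)

# Sort by size to surface the bulkiest files in the extraction.
by_size = sorted(entries, key=lambda e: e["chars"], reverse=True)
print(by_size[0]["path"])                 # → realtabbench/evalset/bird_data/dev.json
print(sum(e["chars"] for e in entries))   # → 741355
```

The same pattern scales to the full 106-entry manifest, e.g. to skip empty `__init__.py` stubs or to budget which files fit a model's context window.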
// ... and 1 more file (download for full content)
About this extraction
This page contains the full source code of the tablegpt/tablegpt-agent GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction comprises 106 files (3.7 MB), approximately 972.0k tokens, and a symbol index covering 316 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract, a GitHub-repo-to-text converter built by Nikandr Surkov.