Repository: GoogleCloudPlatform/Open_Data_QnA
Branch: main
Commit: 19960bb38ba2
Files: 253
Total size: 1.0 MB
Directory structure:
gitextract_yfzkgoxq/
├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── MANIFEST.in
├── OWNERS
├── README.md
├── SECURITY.md
├── agents/
│ ├── BuildSQLAgent.py
│ ├── DebugSQLAgent.py
│ ├── DescriptionAgent.py
│ ├── EmbedderAgent.py
│ ├── ResponseAgent.py
│ ├── ValidateSQLAgent.py
│ ├── VisualizeAgent.py
│ ├── __init__.py
│ └── core.py
├── app.py
├── backend-apis/
│ ├── README.md
│ ├── __init__.py
│ ├── main.py
│ └── policy.yaml
├── config.ini
├── dbconnectors/
│ ├── BQConnector.py
│ ├── FirestoreConnector.py
│ ├── PgConnector.py
│ ├── __init__.py
│ └── core.py
├── docs/
│ ├── README.md
│ ├── architecture.md
│ ├── best_practices.md
│ ├── changelog.md
│ ├── config_guide.md
│ ├── faq.md
│ └── repo_structure.md
├── embeddings/
│ ├── __init__.py
│ ├── kgq_embeddings.py
│ ├── retrieve_embeddings.py
│ └── store_embeddings.py
├── env_setup.py
├── frontend/
│ ├── .gitignore
│ ├── README.md
│ ├── angular.json
│ ├── database.indexes.json
│ ├── database.rules.json
│ ├── firebase_setup.json
│ ├── frontend-flutter/
│ │ ├── .flutter-plugins
│ │ ├── .flutter-plugins-dependencies
│ │ ├── Open Data QnA - Working Sheet V2 - sample_questions_UI copy.csv
│ │ ├── Open_Data_QnA_sample_questions_v3 copy.csv
│ │ ├── README.md
│ │ ├── analysis_options.yaml
│ │ ├── android/
│ │ │ ├── .gitignore
│ │ │ ├── app/
│ │ │ │ ├── build.gradle
│ │ │ │ ├── google-services.json
│ │ │ │ └── src/
│ │ │ │ ├── debug/
│ │ │ │ │ └── AndroidManifest.xml
│ │ │ │ ├── main/
│ │ │ │ │ ├── AndroidManifest.xml
│ │ │ │ │ ├── kotlin/
│ │ │ │ │ │ └── com/
│ │ │ │ │ │ └── pilotcap/
│ │ │ │ │ │ └── ttmd/
│ │ │ │ │ │ └── MainActivity.kt
│ │ │ │ │ └── res/
│ │ │ │ │ ├── drawable/
│ │ │ │ │ │ └── launch_background.xml
│ │ │ │ │ ├── drawable-v21/
│ │ │ │ │ │ └── launch_background.xml
│ │ │ │ │ ├── values/
│ │ │ │ │ │ └── styles.xml
│ │ │ │ │ └── values-night/
│ │ │ │ │ └── styles.xml
│ │ │ │ └── profile/
│ │ │ │ └── AndroidManifest.xml
│ │ │ ├── build.gradle
│ │ │ ├── gradle/
│ │ │ │ └── wrapper/
│ │ │ │ └── gradle-wrapper.properties
│ │ │ ├── gradle.properties
│ │ │ ├── nl2sql_oss_android.iml
│ │ │ └── settings.gradle
│ │ ├── build/
│ │ │ └── web/
│ │ │ └── .last_build_id
│ │ ├── ios/
│ │ │ ├── .gitignore
│ │ │ ├── Flutter/
│ │ │ │ ├── AppFrameworkInfo.plist
│ │ │ │ ├── Debug.xcconfig
│ │ │ │ └── Release.xcconfig
│ │ │ ├── Podfile
│ │ │ ├── Runner/
│ │ │ │ ├── AppDelegate.swift
│ │ │ │ ├── Assets.xcassets/
│ │ │ │ │ ├── AppIcon.appiconset/
│ │ │ │ │ │ └── Contents.json
│ │ │ │ │ └── LaunchImage.imageset/
│ │ │ │ │ ├── Contents.json
│ │ │ │ │ └── README.md
│ │ │ │ ├── Base.lproj/
│ │ │ │ │ ├── LaunchScreen.storyboard
│ │ │ │ │ └── Main.storyboard
│ │ │ │ ├── GoogleService-Info.plist
│ │ │ │ ├── Info.plist
│ │ │ │ └── Runner-Bridging-Header.h
│ │ │ ├── Runner.xcodeproj/
│ │ │ │ ├── project.pbxproj
│ │ │ │ ├── project.xcworkspace/
│ │ │ │ │ ├── contents.xcworkspacedata
│ │ │ │ │ └── xcshareddata/
│ │ │ │ │ ├── IDEWorkspaceChecks.plist
│ │ │ │ │ └── WorkspaceSettings.xcsettings
│ │ │ │ └── xcshareddata/
│ │ │ │ └── xcschemes/
│ │ │ │ └── Runner.xcscheme
│ │ │ ├── Runner.xcworkspace/
│ │ │ │ ├── contents.xcworkspacedata
│ │ │ │ └── xcshareddata/
│ │ │ │ ├── IDEWorkspaceChecks.plist
│ │ │ │ └── WorkspaceSettings.xcsettings
│ │ │ └── RunnerTests/
│ │ │ └── RunnerTests.swift
│ │ ├── lib/
│ │ │ ├── firebase_options.dart
│ │ │ ├── main.dart
│ │ │ ├── screens/
│ │ │ │ ├── bot.dart
│ │ │ │ ├── bot_chat_view.dart
│ │ │ │ ├── disclaimer.dart
│ │ │ │ └── settings.dart
│ │ │ ├── services/
│ │ │ │ ├── display_stepper/
│ │ │ │ │ ├── display_stepper_cubit.dart
│ │ │ │ │ └── display_stepper_state.dart
│ │ │ │ ├── first_question/
│ │ │ │ │ ├── first_question_cubit.dart
│ │ │ │ │ └── first_question_state.dart
│ │ │ │ ├── load_question/
│ │ │ │ │ ├── load_question_cubit.dart
│ │ │ │ │ └── load_question_state.dart
│ │ │ │ ├── new_suggestions/
│ │ │ │ │ ├── new_suggestion_cubit.dart
│ │ │ │ │ └── new_suggestion_state.dart
│ │ │ │ ├── text_to_doc_question/
│ │ │ │ │ ├── text_to_doc_question_cubit.dart
│ │ │ │ │ └── text_to_doc_question_state.dart
│ │ │ │ ├── update_expert_mode/
│ │ │ │ │ ├── update_expert_mode_cubit.dart
│ │ │ │ │ └── update_expert_mode_state.dart
│ │ │ │ ├── update_popular_questions/
│ │ │ │ │ ├── update_popular_questions_cubit.dart
│ │ │ │ │ └── update_popular_questions_state.dart
│ │ │ │ └── update_stepper/
│ │ │ │ ├── update_stepper_cubit.dart
│ │ │ │ └── update_stepper_state.dart
│ │ │ └── utils/
│ │ │ ├── Input_custom.dart
│ │ │ ├── TextToDocParameter.dart
│ │ │ ├── custom_input_field.dart
│ │ │ ├── most_popular_questions.dart
│ │ │ ├── pdf_viewer.dart
│ │ │ ├── stepper_expert_info.dart
│ │ │ └── tabbed_container.dart
│ │ ├── nl2sql_oss.iml
│ │ ├── pubspec.yaml
│ │ ├── test/
│ │ │ └── widget_test.dart
│ │ └── web/
│ │ ├── index 01.49.28.html
│ │ ├── index.html
│ │ └── manifest.json
│ ├── frontend.yaml
│ ├── package.json
│ ├── server.ts
│ ├── src/
│ │ ├── app/
│ │ │ ├── agent-chat/
│ │ │ │ ├── agent-chat.component.html
│ │ │ │ ├── agent-chat.component.scss
│ │ │ │ ├── agent-chat.component.spec.ts
│ │ │ │ └── agent-chat.component.ts
│ │ │ ├── app-routing.module.ts
│ │ │ ├── app.component.html
│ │ │ ├── app.component.scss
│ │ │ ├── app.component.spec.ts
│ │ │ ├── app.component.ts
│ │ │ ├── app.module.server.ts
│ │ │ ├── app.module.ts
│ │ │ ├── business-user/
│ │ │ │ ├── business-user.component.html
│ │ │ │ ├── business-user.component.scss
│ │ │ │ ├── business-user.component.spec.ts
│ │ │ │ └── business-user.component.ts
│ │ │ ├── grouping-modal/
│ │ │ │ ├── grouping-modal.component.html
│ │ │ │ ├── grouping-modal.component.scss
│ │ │ │ ├── grouping-modal.component.spec.ts
│ │ │ │ └── grouping-modal.component.ts
│ │ │ ├── header/
│ │ │ │ ├── header.component.html
│ │ │ │ ├── header.component.scss
│ │ │ │ ├── header.component.spec.ts
│ │ │ │ └── header.component.ts
│ │ │ ├── home/
│ │ │ │ ├── home.component.html
│ │ │ │ ├── home.component.scss
│ │ │ │ ├── home.component.spec.ts
│ │ │ │ └── home.component.ts
│ │ │ ├── http.interceptor.ts
│ │ │ ├── login/
│ │ │ │ ├── login.component.html
│ │ │ │ ├── login.component.scss
│ │ │ │ ├── login.component.spec.ts
│ │ │ │ └── login.component.ts
│ │ │ ├── login-button/
│ │ │ │ ├── login-button.component.html
│ │ │ │ ├── login-button.component.scss
│ │ │ │ ├── login-button.component.spec.ts
│ │ │ │ └── login-button.component.ts
│ │ │ ├── menu/
│ │ │ │ ├── menu.component.html
│ │ │ │ ├── menu.component.scss
│ │ │ │ ├── menu.component.spec.ts
│ │ │ │ └── menu.component.ts
│ │ │ ├── prism/
│ │ │ │ ├── prism.component.html
│ │ │ │ ├── prism.component.scss
│ │ │ │ ├── prism.component.spec.ts
│ │ │ │ ├── prism.component.ts
│ │ │ │ └── prism.d.ts
│ │ │ ├── scenario-list/
│ │ │ │ ├── scenario-list.component.html
│ │ │ │ ├── scenario-list.component.scss
│ │ │ │ ├── scenario-list.component.spec.ts
│ │ │ │ └── scenario-list.component.ts
│ │ │ ├── shared/
│ │ │ │ └── services/
│ │ │ │ ├── chat.service.spec.ts
│ │ │ │ ├── chat.service.ts
│ │ │ │ ├── home.service.spec.ts
│ │ │ │ ├── home.service.ts
│ │ │ │ ├── login.service.spec.ts
│ │ │ │ ├── login.service.ts
│ │ │ │ ├── shared.service.spec.ts
│ │ │ │ └── shared.service.ts
│ │ │ ├── upload-template/
│ │ │ │ ├── upload-template.component.html
│ │ │ │ ├── upload-template.component.scss
│ │ │ │ ├── upload-template.component.spec.ts
│ │ │ │ └── upload-template.component.ts
│ │ │ ├── user-journey/
│ │ │ │ ├── user-journey.component.html
│ │ │ │ ├── user-journey.component.scss
│ │ │ │ ├── user-journey.component.spec.ts
│ │ │ │ └── user-journey.component.ts
│ │ │ └── user-photo/
│ │ │ ├── user-photo.component.html
│ │ │ ├── user-photo.component.scss
│ │ │ ├── user-photo.component.spec.ts
│ │ │ └── user-photo.component.ts
│ │ ├── assets/
│ │ │ ├── .gitkeep
│ │ │ └── constants.ts
│ │ ├── index.html
│ │ ├── main.server.ts
│ │ ├── main.ts
│ │ ├── styles/
│ │ │ └── variables.scss
│ │ └── styles.scss
│ ├── tsconfig.app.json
│ ├── tsconfig.json
│ └── tsconfig.spec.json
├── notebooks/
│ ├── 0_CopyDataToBigQuery.ipynb
│ ├── 0_CopyDataToCloudSqlPG.ipynb
│ ├── 1_Setup_OpenDataQnA.ipynb
│ ├── 2_Run_OpenDataQnA.ipynb
│ └── 3_LoadKnownGoodSQL.ipynb
├── opendataqna.py
├── prompts.yaml
├── pyproject.toml
├── scripts/
│ ├── .~lock.Scenarios Sample.csv#
│ ├── Scenarios Sample.csv
│ ├── __init__.py
│ ├── copy_select_table_column_bigquery.py
│ ├── data_source_list.csv
│ ├── data_source_list_sample.csv
│ ├── known_good_sql.csv
│ ├── save_config.py
│ └── tables_columns_descriptions.csv
├── terraform/
│ ├── .gitignore
│ ├── README.md
│ ├── backend.tf
│ ├── bq.tf
│ ├── embeddings-setup.tf
│ ├── frontend.tf
│ ├── iam.tf
│ ├── locals.tf
│ ├── main.tf
│ ├── outputs.tf
│ ├── pg-vector.tf
│ ├── scripts/
│ │ ├── backend-deployment.sh
│ │ ├── copy-firebase-json.sh
│ │ ├── create-and-store-embeddings.py
│ │ ├── deploy-all.sh
│ │ ├── execute-gcloud-cmd.sh
│ │ ├── execute-python-files.sh
│ │ ├── frontend-deployment.sh
│ │ └── install-dependencies.sh
│ ├── templates/
│ │ ├── config.ini.tftpl
│ │ └── constants.ts.tftpl
│ ├── terraform.tfvars.sample
│ ├── variables.tf
│ └── versions.tf
└── utilities/
├── __init__.py
└── imgs/
└── aa
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.venv/
__pycache__/
agents/__pycache__/
application_default_credentials.json
databases/__pycache__/
embeddings/__pycache__/
utils/__pycache__/
*/__pycache__/
.DS_Store
poetry.lock
dist/
test-pypi-token.txt
firebase.json
.firebaserc
config_copy.ini
eval/
================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, gender identity and expression, level of
experience, education, socio-economic status, nationality, personal appearance,
race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, or to ban temporarily or permanently any
contributor for other behaviors that they deem inappropriate, threatening,
offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
This Code of Conduct also applies outside the project spaces when the Project
Steward has a reasonable belief that an individual's behavior may have a
negative impact on the project or its community.
## Conflict Resolution
We do not believe that all conflict is bad; healthy debate and disagreement
often yield positive results. However, it is never okay to be disrespectful or
to engage in behavior that violates the project’s code of conduct.
If you see someone violating the code of conduct, you are encouraged to address
the behavior directly with those involved. Many issues can be resolved quickly
and easily, and this gives people more control over the outcome of their
dispute. If you are unable to resolve the matter for any reason, or if the
behavior is threatening or harassing, report it. We are dedicated to providing
an environment where participants feel welcome and safe.
Reports should be directed to *googleapis-stewards@google.com*, the
Project Steward(s) for *Google Cloud Client Libraries*. It is the Project Steward’s duty to
receive and address reported violations of the code of conduct. They will then
work with a committee consisting of representatives from the Open Source
Programs Office and the Google Open Source Strategy team. If for any reason you
are uncomfortable reaching out to the Project Steward, please email
opensource@google.com.
We will investigate every complaint, but you may not receive a direct response.
We will use our discretion in determining when and how to follow up on reported
incidents, which may range from not taking action to permanent expulsion from
the project and project-sponsored spaces. We will notify the accused of the
report and provide them an opportunity to discuss it before any action is taken.
The identity of the reporter will be omitted from the details of the report
supplied to the accused. In potentially harmful situations, such as ongoing
harassment or threats to anyone's safety, we may take action without notice.
## Attribution
This Code of Conduct is adapted from the Contributor Covenant, version 1.4,
available at
https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
================================================
FILE: CONTRIBUTING.md
================================================
# How to contribute
We'd love to accept your patches and contributions to this project.
## Before you begin
### Sign our Contributor License Agreement
Contributions to this project must be accompanied by a
[Contributor License Agreement](https://cla.developers.google.com/about) (CLA).
You (or your employer) retain the copyright to your contribution; this simply
gives us permission to use and redistribute your contributions as part of the
project.
If you or your current employer have already signed the Google CLA (even if it
was for a different project), you probably don't need to do it again.
Visit <https://cla.developers.google.com/> to see your current agreements or to sign a new one.
### Review our community guidelines
This project follows
[Google's Open Source Community Guidelines](https://opensource.google/conduct/).
## Contribution process
### Code reviews
All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.
================================================
FILE: Dockerfile
================================================
# Use the official lightweight Python image.
# https://hub.docker.com/_/python
FROM python:3.9-slim
# Allow statements and log messages to immediately appear in the Knative logs
ENV PYTHONUNBUFFERED True
# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . .
# Install production dependencies.
RUN pip install poetry
RUN poetry install
# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
# Timeout is set to 0 to disable the timeouts of the workers to allow Cloud Run to handle instance scaling.
# CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
CMD HOME=/root poetry run gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 backend-apis.main:app
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: MANIFEST.in
================================================
# -*- coding: utf-8 -*-
#
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Generated by synthtool. DO NOT EDIT!
include README.rst LICENSE
recursive-include third_party *
recursive-include bigframes *.json *.proto py.typed
recursive-include tests *
global-exclude *.py[co]
global-exclude __pycache__
# Exclude scripts for samples readmegen
prune scripts/readme-gen
================================================
FILE: OWNERS
================================================
msubasioglu@google.com
steveswalker@google.com
kpatlolla@google.com
srilakshmil@google.com
mokshazna@google.com
================================================
FILE: README.md
================================================
Open Data QnA - Chat with your SQL Database
_______________
🚨 Version 2.0.0 is now live! Refer to the [Release Notes](/docs/changelog.md) for detailed information on updates and fixes. 🚨
_______________
✨ Overview
-------------
The **Open Data QnA** Python library enables you to chat with your databases by leveraging LLM agents on Google Cloud.
Open Data QnA enables a conversational approach to interacting with your data. Ask questions about your PostgreSQL or BigQuery databases in natural language and receive informative responses, without needing to write SQL. Open Data QnA leverages Large Language Models (LLMs) to bridge the gap between human language and database queries, streamlining data analysis and decision-making.

**Key Features:**
* **Conversational Querying with Multiturn Support:** Ask questions naturally, without requiring SQL knowledge, and ask follow-up questions.
* **Table Grouping:** Group tables under one use case/user grouping name, which helps narrow a large set of tables down to the ones the LLM needs to reason about.
* **Multi Schema/Dataset Support:** You can group tables from different schemas/datasets for embedding and ask questions against them together.
* **Prompt Customization and Additional Context:** The prompts in use are loaded from a YAML file, and you can add extra context of your own (see the example after this list).
* **SQL Generation:** Automatically generates SQL queries based on your questions.
* **Query Refinement:** Validates and debugs queries to ensure accuracy.
* **Natural Language Responses:** Runs queries and presents results in clear, easy-to-understand language.
* **Visualizations (Optional):** Explore data visually with generated charts.
* **Extensible:** Customize and integrate with your existing workflows (API, UI, notebooks).
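On prompt customization: the agent code in this repository loads prompts from [prompts.yaml](/prompts.yaml) and looks up optional per-grouping context under keys of the form `usecase_<source_type>_<user_grouping>`. A hypothetical entry (the grouping name and context text are illustrative only, not shipped defaults) could look like:
```
# Illustrative extra context for one grouping; when no matching key exists,
# the agents fall back to "No extra context for the usecase is provided".
usecase_bigquery_MovieExplorer-bigquery: |
  Ratings are stored one row per review; scores range from 1 to 10.
```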
It is built on a modular design and currently supports the following components:
### Database Connectors
* **Google Cloud SQL for PostgreSQL**
* **Google BigQuery**
* **Google Firestore (for storing session logs)**
### Vector Stores
* **PGVector on Google Cloud SQL for PostgreSQL**
* **BigQuery Vector Store**
### Agents
* **BuildSQLAgent:** An agent specialized in generating SQL queries for BigQuery or PostgreSQL databases. It analyzes user questions, available table schemas, and column descriptions to construct syntactically and semantically correct SQL queries, adapting its process based on the target database type.
* **ValidateSQLAgent:** An agent that validates the syntax and semantic correctness of SQL queries. It uses a language model to analyze queries against a database schema and returns a JSON response indicating validity and potential errors.
* **DebugSQLAgent:** An agent designed to debug and refine SQL queries for BigQuery or PostgreSQL databases. It interacts with a chat-based language model to iteratively troubleshoot queries, using error messages to generate alternative, correct queries.
* **DescriptionAgent:** An agent specialized in generating descriptions for database tables and columns. It leverages a large language model to create concise and informative descriptions that aid in understanding data structures and facilitate SQL query generation.
* **EmbedderAgent:** An agent specialized in generating text embeddings using Large Language Models (LLMs). It supports direct interaction with Vertex AI's TextEmbeddingModel or uses LangChain's VertexAIEmbeddings for a simplified interface.
* **ResponseAgent:** An agent that generates natural language responses to user questions based on SQL query results. It acts as a data assistant, interpreting SQL results and transforming them into user-friendly answers using a language model.
* **VisualizeAgent:** An agent that generates JavaScript code for Google Charts based on user questions and SQL results. It suggests suitable chart types and constructs the JavaScript code to create visualizations of the data.
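To make the division of labor concrete, here is a minimal, illustrative sketch of driving the SQL-building step directly. It assumes the `agents` package exports `BuildSQLAgent`, that your project and credentials are configured as in the setup steps below, and it substitutes placeholder strings for the schema and few-shot context that the real pipeline retrieves from the vector store ([opendataqna.py](/opendataqna.py) orchestrates all of this for you):
```
from agents import BuildSQLAgent  # assumes agents/__init__.py exports BuildSQLAgent

# Placeholder schema/context text; the real pipeline retrieves these from the vector store.
tables_schema = "dataset movies | table movie_ratings(genre STRING, rating FLOAT64)"
columns_schema = "genre: the movie genre; rating: the viewer rating from 1 to 10"
similar_sql = "-No examples provided..-"

build_agent = BuildSQLAgent(model_id='gemini-1.5-pro')
generated_sql = build_agent.build_sql(
    source_type='bigquery',                  # selects the BigQuery prompt and data types
    user_grouping='MovieExplorer-bigquery',  # grouping name from data_source_list.csv
    user_question='What are the 5 most common genres?',
    session_history=[],                      # prior turns, for multiturn support
    tables_schema=tables_schema,
    columns_schema=columns_schema,
    similar_sql=similar_sql,
)
print(generated_sql)
```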
**Note:** the library was formerly named Talk2Data. You may still find artifacts with the old naming in this repository.
📏 Architecture
-------------
A detailed description of the Architecture can be found [`here`](/docs/architecture.md) in the docs.
🧬 Repository Structure
-------------
Details on the Repository Structure can be found [`here`](/docs/repo_structure.md) in the docs.
🏁 Getting Started: Main Repository
-------------
### Clone the repository and switch to the correct directory
git clone git@github.com:GoogleCloudPlatform/Open_Data_QnA.git
cd Open_Data_QnA
### 🚧 **Prerequisites**
Make sure that the Google Cloud CLI and Python >= 3.10 are installed before moving ahead! You can refer to the links below for guidance:
Installation Guide: https://cloud.google.com/sdk/docs/install
Download Python: https://www.python.org/downloads/
ℹ️ **You can set up this solution with three approaches. Choose one based on your requirements:**
- **A)** Using [Jupyter Notebooks](#a-jupyter-notebook-based-approach) (for a better view of what is happening at each stage of the solution)
- **B)** Using the [CLI](#b-command-line-interface-cli-based-approach) (for ease of use, running simple Python commands without needing to understand every step of the solution)
- **C)** Using [Terraform deployment](#c-using-terraform-to-deploy-the-solution), including your backend APIs with UI
### A) Jupyter Notebook Based Approach
#### 💻 **Install Code Dependencies (Create and setup venv)**
#### **All commands in this cell are to be run in the terminal (typically Ctrl+Shift+`) where your notebooks are running**
Install the dependencies by running the poetry commands below
```
# Install poetry
pip uninstall poetry -y
pip install poetry --quiet
#Run the poetry commands below to set up the environment
poetry lock #resolve dependencies (also auto-creates the poetry venv if it does not exist)
poetry install --quiet #installs dependencies
poetry env info #displays the env just created and the path to it
poetry shell #this command should activate your venv; you should see the shell enter the venv
##inside the activated venv shell
#If you are running on a Workbench instance where the service account used has the required permissions for this solution, you can skip the gcloud auth commands below and go to the kernel creation section
gcloud auth login # Use this or the command below to authenticate
gcloud auth application-default login
gcloud services enable \
serviceusage.googleapis.com \
cloudresourcemanager.googleapis.com --project <>
```
Choose the relevant instructions based on where you are running the notebook.
**For IDEs like Cloud Shell Editor, VS Code**
For IDEs, adding the Jupyter extension will automatically give you the option to change the kernel. If not, manually select the Python interpreter in your IDE (the exact path is shown by the commands above and would look like e.g. /home/admin_/opendata/.venv/bin/python or ~cache/user/opendataqna/.venv/bin/python).
Proceed to Step 1 below.
**For Jupyter Lab or Jupyter environments on Workbench etc.**
Create a kernel with the environment you created:
```
pip install jupyter
ipython kernel install --name "openqna-venv" --user
```
Restart your kernel or close the existing notebook and open it again; you should now see "openqna-venv" in the kernel drop-down.
**What did we do here?**
* Created Application Default Credentials to use for the code
* Added the venv as a kernel to select when running the notebooks (for standalone Jupyter setups like Workbench etc.)
#### 1. Run the [1_Setup_OpenDataQnA](/notebooks/1_Setup_OpenDataQnA.ipynb) (Run Once for Initial Setup)
This notebook guides you through the setup and execution of the Open Data QnA application. It provides comprehensive instructions for setting up the solution.
#### 2. Run the [2_Run_OpenDataQnA](/notebooks/2_Run_OpenDataQnA.ipynb)
This notebook guides you by reading the configuration you set up with [1_Setup_OpenDataQnA](/notebooks/1_Setup_OpenDataQnA.ipynb) and running the pipeline to answer questions about your data.
#### 3. [Loading Known Good SQL Examples](/notebooks/3_LoadKnownGoodSQL.ipynb)
If you want to load known good SQLs separately, run this notebook once the config variables are set up in the config.ini file. It can be run multiple times to load the known good SQL queries and create embeddings for them.
___________
### B) Command Line Interface (CLI) Based Approach
#### 1. Add Configuration values for the solution in [config.ini](/config.ini)
Setup requires details for the vector store, source database, etc. Edit the [config.ini](/config.ini) file and add values for the parameters based on the information below.
ℹ️ Follow the guidelines from the [config guide document](/docs/config_guide.md) to populate your [config.ini](/config.ini) file.
**Sources to connect**
- This solution lets you set up multiple data sources at the same time.
- You can group tables from different datasets or schemas into a single grouping and provide the details.
- If your dataset/schema has many tables and you only want to run the solution against a few, create a group containing just those tables.
**Format for data_source_list.csv**
**source | user_grouping | schema | table**
**source** - supported data sources. Options: bigquery, cloudsql-pg
**user_grouping** - logical grouping or use case name for tables from the same or different schemas/datasets. When left blank, it defaults to the schema value in the next column.
**schema** - schema name for PostgreSQL, or dataset name in BigQuery
**table** - names of the tables to run the solution against. Leave this column blank (after filling schema/dataset) if you want to run the solution for the whole dataset/schema.
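For illustration, a populated [data_source_list.csv](/scripts/data_source_list.csv) could look like this (dataset, grouping, and table names are hypothetical; see [data_source_list_sample.csv](/scripts/data_source_list_sample.csv) for the shipped sample):
```
source,user_grouping,schema,table
bigquery,MovieExplorer-bigquery,movies,movie_ratings
bigquery,MovieExplorer-bigquery,movies,titles
cloudsql-pg,,hr,
```
The first two rows group two tables from the `movies` dataset under one use case; the last row leaves `user_grouping` blank (defaulting to the schema name `hr`) and leaves `table` blank to run the solution against the whole schema.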
Update the [data_source_list.csv](/scripts/data_source_list.csv) according to your requirements.
Note that the sources filled in the CSV should already be present. If not, use the Copy notebooks below if you want the demo sources set up.
Enabled Data Sources:
* PostgreSQL on Google Cloud SQL (Copy Sample Data: [0_CopyDataToCloudSqlPG.ipynb](/notebooks/0_CopyDataToCloudSqlPG.ipynb))
* BigQuery (Copy Sample Data: [0_CopyDataToBigQuery.ipynb](/notebooks/0_CopyDataToBigQuery.ipynb))
#### 2. Creating Virtual Environment and Install Dependencies
```
pip install poetry --quiet
poetry lock
poetry install --quiet
poetry env info
poetry shell
```
Authenticate your credentials
```
gcloud auth login
or
gcloud auth application-default login
```
```
gcloud services enable \
serviceusage.googleapis.com \
cloudresourcemanager.googleapis.com --project <>
```
```
gcloud auth application-default set-quota-project <>
```
Enable APIs for the solution setup
```
gcloud services enable \
cloudapis.googleapis.com \
compute.googleapis.com \
iam.googleapis.com \
run.googleapis.com \
sqladmin.googleapis.com \
aiplatform.googleapis.com \
bigquery.googleapis.com \
firestore.googleapis.com --project <>
```
#### 3. Run [env_setup.py](/env_setup.py) to create the vector store based on the configuration you set in Step 1
```
python env_setup.py
```
#### 4. Run [opendataqna.py](/opendataqna.py) to run the pipeline you just set up
The Open Data QnA SQL Generation tool can be conveniently used from your terminal or command prompt using a simple CLI interface. Here's how:
```
python opendataqna.py --session_id "122133131f--ade-eweq" --user_question "What is most 5 common genres we have?" --user_grouping "MovieExplorer-bigquery"
```
Where
*session_id* : A unique ID for the conversation; keep it the same for follow-up questions.
*user_question* : Your question, as a string.
*user_grouping* : The BQ_DATASET_NAME for BigQuery sources or PG_SCHEMA for PostgreSQL sources (refer to your [data_source_list.csv](/scripts/data_source_list.csv) file).
**Optional Parameters**
You can customize the pipeline's behavior using optional parameters. Here are some common examples:
```
# Enable the SQL debugger:
python opendataqna.py --session_id="..." --user_question "..." --user_grouping "..." --run_debugger
# Execute the final generated SQL:
python opendataqna.py --session_id="..." --user_question "..." --user_grouping "..." --execute_final_sql
# Change the number of debugging rounds:
python opendataqna.py --session_id="..." --user_question "..." --user_grouping "..." --debugging_rounds 5
# Adjust similarity thresholds:
python opendataqna.py --session_id="..." --user_question "..." --user_grouping "..." --table_similarity_threshold 0.25 --column_similarity_threshold 0.4
```
You can find a full list of available options and their descriptions by running:
```
python opendataqna.py --help
```
### C) Using Terraform to deploy the solution
The provided terraform streamlines the setup of this solution and serves as a blueprint for deployment. The script provides a one-click, one-time deployment option. However, it doesn't include CI/CD capabilities and is intended solely for initial setup.
> [!NOTE]
> The current version of the Terraform Google Cloud provider does not support deployment of a few resources, so this solution uses null_resource to create those resources using the Google Cloud SDK.
Prior to executing terraform, ensure that the below mentioned steps have been completed.
#### Data Sources Set Up
1. Source data should already be available. If you do not have readily available source data, use the notebooks [0_CopyDataToBigQuery.ipynb](/notebooks/0_CopyDataToBigQuery.ipynb) or [0_CopyDataToCloudSqlPG.ipynb](/notebooks/0_CopyDataToCloudSqlPG.ipynb) based on the preferred source to populate sample data.
2. Ensure that the [data_source_list.csv](/scripts/data_source_list.csv) is populated with the list of datasources to be used in this solution. Terraform will take care of creating the embeddings in the destination. Use [data_source_list_sample.csv](/scripts/data_source_list_sample.csv) to fill the [data_source_list.csv](/scripts/data_source_list.csv)
3. If you want to use known good SQLs for few-shot prompting, ensure that the [known_good_sql.csv](/scripts/known_good_sql.csv) is populated with the required data. Terraform will take care of creating the embeddings in the destination.
#### Enable Firebase
Firebase will be used to host the frontend of the application.
1. Go to https://console.firebase.google.com/
1. Select add project and load your Google Cloud Platform project
1. Add Firebase to one of your existing Google Cloud projects
1. Confirm Firebase billing plan
1. Continue and complete
#### Terraform deployment
> [!NOTE]
> The terraform apply command for this application uses gcloud config to fetch and pass the configured project ID to the scripts. Please ensure that gcloud config has been set to your intended project ID before proceeding.
> [!IMPORTANT]
> The Terraform scripts require specific IAM permissions to function correctly. The user needs either the broad `roles/resourcemanager.projectIamAdmin` role or a custom role with tailored permissions to manage IAM policies and roles.
> Additionally, one script TEMPORARILY disables Domain Restricted Sharing Org Policies to enable the creation of a public endpoint. This requires the user to also have the `roles/orgpolicy.policyAdmin` role.
1. Install [terraform 1.7 or higher](https://developer.hashicorp.com/terraform/install).
1. [OPTIONAL] Update default values of variables in [variables.tf](/terraform/variables.tf) according to your preferences. You can find the description for each variable inside the file. This file will be used by terraform to get information about the resources it needs to deploy. If you do not update these, terraform will use the already specified default values in the file.
1. Move to the terraform directory in the terminal
```
cd Open_Data_QnA/terraform
#If you are running this outside Cloud Shell you need to set up your Google Cloud SDK Credentials
gcloud config set project <>
gcloud auth application-default set-quota-project <>
gcloud services enable \
serviceusage.googleapis.com \
cloudresourcemanager.googleapis.com --project <>
sh ./scripts/deploy-all.sh
```
This script will perform the following steps:
1. **Run terraform scripts** - These terraform scripts will generate all the GCP resources and configuration files required for the frontend & backend. They will also generate embeddings and store them in the destination vector DB.
1. **Deploy Cloud Run backend service with the latest backend code** - The terraform in the previous step uses a dummy container image to deploy the initial version of the Cloud Run service. This is the step where the actual backend code gets deployed.
1. **Deploy frontend app** - All the config files, web app etc required to create the frontend are deployed via terraform. However, the actual UI deployment takes place in this step.
### After deployment
***Auth Provider***
You need to enable at least one authentication provider in Firebase; you can enable it using the following steps:
1. Go to https://console.firebase.google.com/project/your_project_id/authentication/providers (change the `your_project_id` value)
2. Click on Get Started (if needed)
3. Select Google and enable it
4. Set the name for the project and a support email for the project
5. Save
This should deploy your end-to-end solution in the project with a Firebase web URL.
For detailed steps and known issues refer to README.md under [`/terraform`](/terraform/)
🖥️ Build an Angular-based frontend for this solution
---------------------------------------------------
To deploy the backend APIs for the solution, refer to the README.md under [`/backend-apis`](/backend-apis/). These APIs are designed to work with the frontend and provide access to run the solution.
Once the backend APIs are deployed successfully, deploy the frontend for the solution by referring to the README.md under [`/frontend`](/frontend/).
📗 FAQs and Best Practices
-------------
If you successfully set up the solution accelerator and want to start optimizing to your needs, you can follow the tips in the [`Best Practice doc`](/docs/best_practices.md).
Additionally, if you stumble across any problems, take a look into the [`FAQ`](/docs/faq.md).
If neither of these resources helps, feel free to reach out to us directly by raising an Issue.
🧹 CleanUp Resources
-------------
To clean up the resources provisioned in this solution, use the commands below to remove them with gcloud/bq:
For cloudsql-pgvector as the vector store: [Delete SQL Instance]()
```
gcloud sql instances delete <> -q
```
Delete the BigQuery dataset created for logs and the vector store: [Remove BQ Dataset]()
```
bq rm -r -f -d <>
```
(For backend APIs) Remove the Cloud Run service: [Delete Service]()
```
gcloud run services delete <>
```
For the frontend, based on Firebase: [Remove the Firebase app]()
📄 Documentation
-------------
* [Open Data QnA Source Code (GitHub)](https://github.com/GoogleCloudPlatform/Open_Data_QnA)
* [Open Data QnA usage notebooks](/notebooks)
* [`Architecture`](/docs/architecture.md)
* [`FAQ`](/docs/faq.md)
* [`Best Practice doc`](/docs/best_practices.md)
🚧 Quotas and limits
------------------
[BigQuery quotas](https://cloud.google.com/bigquery/quotas).
[Gemini quotas](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas).
🪪 License
-------
Open Data QnA is distributed with the [Apache-2.0 license](/LICENSE).
It also contains code derived from the following third-party packages:
* [pandas](https://pandas.pydata.org/)
* [Python](https://www.python.org/)
🧪 Disclaimer
----------
This repository provides an open-source solution accelerator designed to streamline your development process. Please be aware that all resources associated with this accelerator will be deployed within your own Google Cloud Platform (GCP) instances.
It is imperative that you thoroughly test all components and configurations in a non-production environment before integrating any part of this accelerator with your production data or systems.
While we strive to provide a secure and reliable solution, we cannot be held responsible for any data loss, service disruptions, or other issues that may arise from the use of this accelerator.
By utilizing this repository, you acknowledge that you are solely responsible for the deployment, management, and security of the resources deployed within your GCP environment.
If you encounter any issues or have concerns about potential risks, please refrain from using this accelerator in a production setting.
We encourage responsible and informed use of this open-source solution.
🙋 Getting Help
----------
If you have any questions or found any problems with this repository, please report them through GitHub issues.
================================================
FILE: SECURITY.md
================================================
# Security Policy
To report a security issue, please use [g.co/vulnz](https://g.co/vulnz).
The Google Security Team will respond within 5 working days of your report on g.co/vulnz.
We use g.co/vulnz for our intake, and do coordination and disclosure here using GitHub Security Advisory to privately discuss and fix the issue.
================================================
FILE: agents/BuildSQLAgent.py
================================================
from abc import ABC
from vertexai.language_models import CodeChatModel
from vertexai.generative_models import GenerativeModel, Content, Part, GenerationConfig
from .core import Agent
import pandas as pd
import json
from datetime import datetime
from dbconnectors import pgconnector,bqconnector,firestoreconnector
from utilities import PROMPTS, format_prompt
from google.cloud.aiplatform import telemetry
import vertexai
from utilities import PROJECT_ID, PG_REGION
vertexai.init(project=PROJECT_ID, location=PG_REGION)
class BuildSQLAgent(Agent, ABC):
agentType: str = "BuildSQLAgent"
def __init__(self, model_id = 'gemini-1.5-pro'):
super().__init__(model_id=model_id)
def build_sql(self,source_type,user_grouping, user_question,session_history,tables_schema,columns_schema, similar_sql, max_output_tokens=2048, temperature=0.4, top_p=1, top_k=32):
not_related_msg=f'''select 'Question is not related to the dataset' as unrelated_answer;'''
if source_type=='bigquery':
from dbconnectors import bq_specific_data_types
specific_data_types = bq_specific_data_types()
else:
from dbconnectors import pg_specific_data_types
specific_data_types = pg_specific_data_types()
if f'usecase_{source_type}_{user_grouping}' in PROMPTS:
usecase_context = PROMPTS[f'usecase_{source_type}_{user_grouping}']
else:
usecase_context = "No extra context for the usecase is provided"
context_prompt = PROMPTS[f'buildsql_{source_type}']
context_prompt = format_prompt(context_prompt,
specific_data_types = specific_data_types,
not_related_msg = not_related_msg,
usecase_context = usecase_context,
similar_sql=similar_sql,
tables_schema=tables_schema,
columns_schema = columns_schema)
# print(f"Prompt to Build SQL: \n{context_prompt}")
# Chat history Retrieval
chat_history=[]
for entry in session_history:
timestamp = entry["timestamp"]
timestamp_str = timestamp.isoformat(timespec='auto')
user_message = Content(
parts=[Part.from_text(entry["user_question"])],
role="user"
)
            bot_message = Content(
                parts=[Part.from_text(entry["bot_response"])],
                role="model"  # Vertex AI Gemini chat history accepts only "user" and "model" roles
            )
chat_history.extend([user_message, bot_message]) # Add both to the history
# print("Chat History Retrieved")
if self.model_id == 'codechat-bison-32k':
with telemetry.tool_context_manager('opendataqna-buildsql-v2'):
chat_session = self.model.start_chat(context=context_prompt)
elif 'gemini' in self.model_id:
with telemetry.tool_context_manager('opendataqna-buildsql-v2'):
# print("SQL Builder Agent : " + str(self.model_id))
                config = GenerationConfig(
                    max_output_tokens=max_output_tokens, temperature=temperature, top_p=top_p, top_k=top_k
                )
                chat_session = self.model.start_chat(history=chat_history, response_validation=False)
                # Pass the generation config explicitly so the sampling parameters above take effect
                chat_session.send_message(context_prompt, generation_config=config)
else:
raise ValueError('Invalid Model Specified')
if session_history is None or not session_history:
concated_questions = None
re_written_qe = None
previous_question = None
previous_sql = None
else:
concated_questions,re_written_qe=self.rewrite_question(user_question,session_history)
previous_question, previous_sql = self.get_last_sql(session_history)
build_context_prompt=f"""
Below is the previous user question from this conversation and its generated sql.
Previous Question: {previous_question}
Previous Generated SQL : {previous_sql}
Respond with
Generate SQL for User Question : {user_question}
"""
# print("BUILD CONTEXT ::: "+str(build_context_prompt))
with telemetry.tool_context_manager('opendataqna-buildsql-v2'):
response = chat_session.send_message(build_context_prompt, stream=False)
        generated_sql = (str(response.text)).replace("```sql", "").replace("```", "")
# print(generated_sql)
return generated_sql
def rewrite_question(self,question,session_history):
formatted_history=''
concat_questions=''
for i, _row in enumerate(session_history,start=1):
user_question = _row['user_question']
sql_query = _row['bot_response']
# print(user_question)
formatted_history += f"User Question - Turn :: {i} : {user_question}\n"
formatted_history += f"Generated SQL - Turn :: {i}: {sql_query}\n\n"
concat_questions += f"{user_question} "
# print(formatted_history)
context_prompt = f"""
Your main objective is to rewrite and refine the question passed based on the session history of question and sql generated.
Refine the given question using the provided session history to produce a queryable statement. The refined question should be self-contained, requiring no additional context for accurate SQL generation.
Make sure all the information is included in the re-written question
Below is the previous session history:
{formatted_history}
Question to rewrite:
{question}
"""
re_written_qe = str(self.generate_llm_response(context_prompt))
print("*"*25 +"Re-written question for the follow up:: "+"*"*25+"\n"+str(re_written_qe))
return str(concat_questions),str(re_written_qe)
def get_last_sql(self,session_history):
for entry in reversed(session_history):
if entry.get("bot_response"):
return entry["user_question"],entry["bot_response"]
        # Return a 2-tuple so callers can always unpack (previous_question, previous_sql)
        return None, None
================================================
FILE: agents/DebugSQLAgent.py
================================================
from abc import ABC
import vertexai
from vertexai.language_models import CodeChatModel
from vertexai.generative_models import GenerativeModel,GenerationConfig
from google.cloud.aiplatform import telemetry
from dbconnectors import pgconnector, bqconnector
from utilities import PROMPTS, format_prompt
from .core import Agent
import pandas as pd
import json
from utilities import PROJECT_ID, PG_REGION
vertexai.init(project=PROJECT_ID, location=PG_REGION)
class DebugSQLAgent(Agent, ABC):
"""
An agent designed to debug and refine SQL queries for BigQuery or PostgreSQL databases.
This agent interacts with a chat-based language model (CodeChat or Gemini) to iteratively troubleshoot SQL queries. It receives feedback in the form of error messages and uses the model's capabilities to generate alternative queries that address the identified issues. The agent strives to maintain the original intent of the query while ensuring its syntactic and semantic correctness.
Attributes:
agentType (str): Indicates the type of agent, fixed as "DebugSQLAgent".
        model_id (str): The ID of the chat model to use for debugging (defaults to
            "gemini-1.5-pro"). Valid options are:
            - "codechat-bison-32k"
            - Gemini chat models (e.g. "gemini-1.5-pro")
Methods:
        init_chat(source_type, user_grouping, tables_schema, columns_schema, similar_sql) -> ChatSession:
            Initializes a chat session with the chosen chat model.
            Args:
                source_type (str): The database type ("bigquery" or "postgresql").
                user_grouping (str): The user grouping/use case name, used to look up extra prompt context.
                tables_schema (str): A description of the available tables and their columns.
                columns_schema (str): Detailed descriptions of the columns in the tables.
                similar_sql (str, optional): Example SQL queries for reference. Defaults to "-No examples provided..-".
Returns:
ChatSession: The initiated chat session object.
        rewrite_sql_chat(chat_session, sql, question, error_df) -> str:
            Generates an alternative SQL query based on the chat session, original query, and error message.
            Args:
                chat_session (ChatSession): The active chat session.
                sql (str): The SQL query that produced the error.
                question (str): The user's original question.
                error_df (pandas.DataFrame): The error message as a DataFrame.
Returns:
str: The rewritten SQL query.
start_debugger(source_type, query, user_question, SQLChecker, tables_schema, columns_schema, AUDIT_TEXT, similar_sql, DEBUGGING_ROUNDS, LLM_VALIDATION) -> Tuple[str, bool, str]:
Args:
source_type (str): The database type ("bigquery" or "postgresql").
query (str): The initial SQL query to debug.
user_question (str): The user's original question for reference.
SQLChecker: An object to validate the SQL syntax.
tables_schema (str): Table schema information.
columns_schema (str): Detailed column descriptions.
AUDIT_TEXT (str): Textual audit trail of the debugging process.
similar_sql (str, optional): Example SQL queries. Defaults to "-No examples provided..-".
DEBUGGING_ROUNDS (int, optional): Maximum debugging attempts. Defaults to 2.
LLM_VALIDATION (bool, optional): Whether to use LLM for syntax validation. Defaults to True.
Returns:
Tuple[str, bool, str]:
- The final refined SQL query (or the original if unchanged).
- A boolean indicating if the final query is considered invalid.
- The updated AUDIT_TEXT with debugging steps.
"""
agentType: str = "DebugSQLAgent"
def __init__(self, model_id = 'gemini-1.5-pro'):
super().__init__(model_id=model_id)
def init_chat(self,source_type,user_grouping, tables_schema,columns_schema,similar_sql="-No examples provided..-"):
if f'usecase_{source_type}_{user_grouping}' in PROMPTS:
usecase_context = PROMPTS[f'usecase_{source_type}_{user_grouping}']
else:
usecase_context = "No extra context for the usecase is provided"
context_prompt = PROMPTS[f'debugsql_{source_type}']
context_prompt = format_prompt(context_prompt,
usecase_context = usecase_context,
similar_sql=similar_sql,
tables_schema=tables_schema,
columns_schema = columns_schema)
# print(f"Prompt to Debug SQL after formatting: \n{context_prompt}")
if self.model_id == 'codechat-bison-32k':
with telemetry.tool_context_manager('opendataqna-debugsql-v2'):
chat_session = self.model.start_chat(context=context_prompt)
elif 'gemini' in self.model_id:
with telemetry.tool_context_manager('opendataqna-debugsql-v2'):
chat_session = self.model.start_chat(response_validation=False)
chat_session.send_message(context_prompt)
else:
raise ValueError('Invalid Chat Model Specified')
return chat_session
def rewrite_sql_chat(self, chat_session, sql, question, error_df):
context_prompt = f"""
What is an alternative SQL statement to address the error mentioned below?
Present a different SQL from previous ones. It is important that the query still answers the original question.
All columns selected must be present on tables mentioned on the join section.
Avoid repeating suggestions.
{sql}
{question}
{error_df}
"""
if self.model_id =='codechat-bison-32k':
with telemetry.tool_context_manager('opendataqna-debugsql-v2'):
response = chat_session.send_message(context_prompt)
resp_return = (str(response.candidates[0])).replace("```sql", "").replace("```", "")
elif 'gemini' in self.model_id:
with telemetry.tool_context_manager('opendataqna-debugsql-v2'):
response = chat_session.send_message(context_prompt, stream=False)
resp_return = (str(response.text)).replace("```sql", "").replace("```", "")
else:
raise ValueError('Invalid Model Id')
return resp_return
def start_debugger (self,
source_type,
user_grouping,
query,
user_question,
SQLChecker,
tables_schema,
columns_schema,
AUDIT_TEXT,
similar_sql="-No examples provided..-",
DEBUGGING_ROUNDS = 2,
LLM_VALIDATION=False):
i = 0
STOP = False
invalid_response = False
chat_session = self.init_chat(source_type,user_grouping,tables_schema,columns_schema,similar_sql)
sql = query.replace("```sql","").replace("```","").replace("EXPLAIN ANALYZE ","")
AUDIT_TEXT=AUDIT_TEXT+"\n\nEntering the debugging steps!"
while (not STOP):
json_syntax_result={ "valid":True, "errors":"None"}
# Check if LLM Validation is enabled
if LLM_VALIDATION:
# sql = query.replace("```sql","").replace("```","").replace("EXPLAIN ANALYZE ","")
json_syntax_result = SQLChecker.check(source_type,user_question,tables_schema,columns_schema, sql)
else:
json_syntax_result['valid'] = True
AUDIT_TEXT=AUDIT_TEXT+"\nLLM Validation is deactivated. Jumping directly to dry run execution."
if json_syntax_result['valid'] is True:
AUDIT_TEXT=AUDIT_TEXT+"\nGenerated SQL is syntactically correct as per LLM Validation!"
# print(AUDIT_TEXT)
if source_type=='bigquery':
connector=bqconnector
else:
connector=pgconnector
correct_sql, exec_result_df = connector.test_sql_plan_execution(sql)
if not correct_sql:
AUDIT_TEXT=AUDIT_TEXT+"\nGenerated SQL failed on execution! Here is the feedback from bigquery dryrun/ explain plan: \n" + str(exec_result_df)
rewrite_result = self.rewrite_sql_chat(chat_session, sql, user_question, exec_result_df)
print('\n Rewritten and Cleaned SQL: ' + str(rewrite_result))
AUDIT_TEXT=AUDIT_TEXT+"\nRewritten and Cleaned SQL: \n' + str({rewrite_result})"
sql = str(rewrite_result).replace("```sql","").replace("```","").replace("EXPLAIN ANALYZE ","")
else: STOP = True
else:
print(f'\nGenerated query failed on syntax check as per LLM Validation!\nError Message from LLM: {json_syntax_result} \nRewriting the query...')
AUDIT_TEXT=AUDIT_TEXT+'\nGenerated query failed on syntax check as per LLM Validation! \nError Message from LLM: '+ str(json_syntax_result) + '\nRewriting the query...'
syntax_err_df = pd.read_json(json.dumps(json_syntax_result))
rewrite_result=self.rewrite_sql_chat(chat_session, sql, user_question, syntax_err_df)
print(rewrite_result)
AUDIT_TEXT=AUDIT_TEXT+'\n Rewritten SQL: ' + str(rewrite_result)
sql=str(rewrite_result).replace("```sql","").replace("```","").replace("EXPLAIN ANALYZE ","")
i+=1
if i > DEBUGGING_ROUNDS:
AUDIT_TEXT=AUDIT_TEXT+ "Exceeded the number of iterations for correction!"
AUDIT_TEXT=AUDIT_TEXT+ "The generated SQL can be invalid!"
STOP = True
invalid_response=True
# After the while is completed
if i > DEBUGGING_ROUNDS:
invalid_response=True
return sql, invalid_response, AUDIT_TEXT
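# --- Illustrative usage sketch (not part of the original file) ---
# A minimal sketch, assuming valid Vertex AI credentials and real schema
# strings; the grouping, question and schemas below are placeholders.
if __name__ == '__main__':
    from agents import ValidateSQLAgent
    debugger = DebugSQLAgent()  # defaults to 'gemini-1.5-pro'
    checker = ValidateSQLAgent('gemini-1.5-pro')
    sql, is_invalid, audit = debugger.start_debugger(
        source_type='bigquery',
        user_grouping='retail',
        query='select * from retail.sales limt 10',  # typo on purpose, to trigger a rewrite
        user_question='Show ten sales rows',
        SQLChecker=checker,
        tables_schema='...',
        columns_schema='...',
        AUDIT_TEXT='')
    print(is_invalid, '\n', sql)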
================================================
FILE: agents/DescriptionAgent.py
================================================
from abc import ABC
from .core import Agent
class DescriptionAgent(Agent, ABC):
"""
An agent specialized in generating descriptions for database tables and columns.
This agent leverages a large language model to create concise and informative descriptions that aid in understanding the structure and content of database elements. The generated descriptions can be valuable for documenting schemas, enhancing data exploration, and facilitating SQL query generation.
Attributes:
agentType (str): Indicates the type of agent, fixed as "DescriptionAgent".
Methods:
generate_llm_response(prompt) -> str:
Generates a response from the underlying language model based on the given prompt.
Args:
prompt (str): The prompt to feed into the language model.
Returns:
str: The generated text response, cleaned of any SQL-related formatting artifacts.
generate_missing_descriptions(source, table_desc_df, column_name_df) -> Tuple[pd.DataFrame, pd.DataFrame]:
Generates missing table and column descriptions using the language model.
Args:
source (str): The source of the database schema ("bigquery" or "postgresql").
table_desc_df (pd.DataFrame): A DataFrame containing table metadata with potential missing descriptions.
column_name_df (pd.DataFrame): A DataFrame containing column metadata with potential missing descriptions.
Returns:
Tuple[pd.DataFrame, pd.DataFrame]:
- The updated `table_desc_df` with generated table descriptions.
- The updated `column_name_df` with generated column descriptions.
"""
agentType: str = "DescriptionAgent"
def generate_llm_response(self,prompt):
context_query = self.model.generate_content(prompt,safety_settings=self.safety_settings,stream=False)
return str(context_query.candidates[0].text).replace("```sql", "").replace("```", "")
def generate_missing_descriptions(self,source,table_desc_df, column_name_df):
llm_generated=0
print("\n\n")
for index, row in table_desc_df.iterrows():
if row['table_description'] is None or row['table_description']=='NA':
q=f"table_name == '{row['table_name']}' and table_schema == '{row['table_schema']}'"
if source=='bigquery':
context_prompt = f"""
Generate short and crisp description for the table {row['project_id']}.{row['table_schema']}.{row['table_name']}
Remember that this description should help LLMs build better SQL for any queries related to this table.
Parameters:
- column metadata: {column_name_df.query(q).to_markdown(index = False)}
- table metadata: {table_desc_df.query(q).to_markdown(index = False)}
DO NOT generate description that is more than two lines
"""
else:
context_prompt = f"""
Generate short and crisp description for the table {row['table_schema']}.{row['table_name']}
Remember that this description should help LLMs build better SQL for any queries related to this table.
Parameters:
- column metadata: {column_name_df.query(q).to_markdown(index = False)}
- table metadata: {table_desc_df.query(q).to_markdown(index = False)}
DO NOT generate description that is more than two lines
"""
table_desc_df.at[index,'table_description']=self.generate_llm_response(context_prompt)
print(f"Generated table description for {row['table_schema']}.{row['table_name']}")
llm_generated=llm_generated+1
print("LLM generated "+ str(llm_generated) + " Table Descriptions")
llm_generated = 0
print("\n\n")
for index, row in column_name_df.iterrows():
# print(row['column_description'])
if row['column_description'] is None or row['column_description']=='':
q=f"table_name == '{row['table_name']}' and table_schema == '{row['table_schema']}'"
if source=='bigquery':
context_prompt = f"""
Generate short and crisp description for the column {row['project_id']}.{row['table_schema']}.{row['table_name']}.{row['column_name']}
Remember that this description should help LLMs generate better SQL for any queries related to these columns.
Consider the below information while generating the description
Name of the column : {row['column_name']}
Data type of the column is : {row['data_type']}
Details of the table of this column are below:
{table_desc_df.query(q).to_markdown(index=False)}
Column Constraints of this column are : {row['column_constraints']}
DO NOT generate description that is more than two lines
"""
else:
context_prompt = f"""
Generate short and crisp description for the column {row['table_schema']}.{row['table_name']}.{row['column_name']}
Remember that this description should help LLMs generate better SQL for any queries related to these columns.
Consider the below information while generating the description
Name of the column : {row['column_name']}
Data type of the column is : {row['data_type']}
Details of the table of this column are below:
{table_desc_df.query(q).to_markdown(index=False)}
Column Constraints of this column are : {row['column_constraints']}
DO NOT generate description that is more than two lines
"""
column_name_df.at[index,'column_description']=self.generate_llm_response(prompt=context_prompt)
print(f"Generated column description for {row['table_schema']}.{row['table_name']}.{row['column_name']}")
llm_generated=llm_generated+1
print("LLM generated "+ str(llm_generated) + " Column Descriptions")
return table_desc_df,column_name_df
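# --- Illustrative usage sketch (not part of the original file) ---
# A minimal sketch with hypothetical one-row DataFrames; real callers pass the
# metadata DataFrames produced by the db connectors.
if __name__ == '__main__':
    import pandas as pd
    agent = DescriptionAgent('gemini-1.5-pro')
    tables = pd.DataFrame([{'project_id': 'my-project', 'table_schema': 'retail',
                            'table_name': 'sales', 'table_description': None}])
    columns = pd.DataFrame([{'project_id': 'my-project', 'table_schema': 'retail',
                             'table_name': 'sales', 'column_name': 'city_id',
                             'data_type': 'STRING', 'column_constraints': 'NONE',
                             'column_description': None}])
    tables, columns = agent.generate_missing_descriptions('bigquery', tables, columns)
    print(tables['table_description'].iloc[0])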
================================================
FILE: agents/EmbedderAgent.py
================================================
from abc import ABC
from .core import Agent
from vertexai.language_models import TextEmbeddingModel
class EmbedderAgent(Agent, ABC):
"""
An agent specialized in generating text embeddings using Large Language Models (LLMs).
This agent supports two modes for generating embeddings:
1. "vertex": Directly interacts with the Vertex AI TextEmbeddingModel.
2. "lang-vertex": Uses LangChain's VertexAIEmbeddings for a streamlined interface.
Attributes:
agentType (str): Indicates the type of agent, fixed as "EmbedderAgent".
mode (str): The embedding generation mode ("vertex" or "lang-vertex").
model: The underlying embedding model (Vertex AI TextEmbeddingModel or LangChain's VertexAIEmbeddings).
Methods:
create(question) -> list:
Generates text embeddings for the given question(s).
Args:
question (str or list): The text input for which embeddings are to be generated. Can be a single string or a list of strings.
Returns:
list: A list of embedding vectors. Each embedding vector is represented as a list of floating-point numbers.
Raises:
ValueError: If the input `question` is not a string or list, or if the specified `mode` is invalid.
"""
agentType: str = "EmbedderAgent"
def __init__(self, mode, embeddings_model='text-embedding-004'):
if mode == 'vertex':
self.mode = mode
self.model = TextEmbeddingModel.from_pretrained(embeddings_model)
elif mode == 'lang-vertex':
self.mode = mode
from langchain.embeddings import VertexAIEmbeddings
self.model = VertexAIEmbeddings()
else: raise ValueError('EmbedderAgent mode must be either vertex or lang-vertex')
def create(self, question):
"""Text embedding with a Large Language Model."""
if self.mode == 'vertex':
if isinstance(question, str):
embeddings = self.model.get_embeddings([question])
for embedding in embeddings:
vector = embedding.values
return vector
elif isinstance(question, list):
vector = list()
for q in question:
embeddings = self.model.get_embeddings([q])
for embedding in embeddings:
vector.append(embedding.values)
return vector
else: raise ValueError('Input must be either str or list')
elif self.mode == 'lang-vertex':
vector = self.model.embed_documents(question)  # fix: the LangChain model is stored on self.model
return vector
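# --- Illustrative usage sketch (not part of the original file) ---
# A minimal sketch, assuming Vertex AI credentials are configured;
# 'text-embedding-004' is the default model above.
if __name__ == '__main__':
    embedder = EmbedderAgent('vertex')
    single = embedder.create('Which city had the most sales?')      # one vector (list of floats)
    batch = embedder.create(['first question', 'second question'])  # list of vectors
    print(len(single), len(batch))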
================================================
FILE: agents/ResponseAgent.py
================================================
import json
from abc import ABC
from .core import Agent
from utilities import PROMPTS, format_prompt
from vertexai.generative_models import HarmCategory, HarmBlockThreshold
from google.cloud.aiplatform import telemetry
import vertexai
from utilities import PROJECT_ID, PG_REGION
vertexai.init(project=PROJECT_ID, location=PG_REGION)
class ResponseAgent(Agent, ABC):
"""
An agent that generates natural language responses to user questions based on SQL query results.
This agent acts as a data assistant, interpreting SQL query results and transforming them into user-friendly, natural language answers. It utilizes a language model (currently Gemini) to craft responses that effectively convey the information derived from the data.
Attributes:
agentType (str): Indicates the type of agent, fixed as "ResponseAgent".
Methods:
run(user_question, sql_result) -> str:
Generates a natural language response to the user's question based on the SQL query result.
Args:
user_question (str): The question asked by the user in natural language.
sql_result (str): The result of the SQL query executed to answer the question.
Returns:
str: The generated natural language response.
"""
agentType: str = "ResponseAgent"
def run(self, user_question, sql_result):
context_prompt = PROMPTS['nl_reponse']
context_prompt = format_prompt(context_prompt,
user_question = user_question,
sql_result = sql_result)
# print(f"Prompt for Natural Language Response: \n{context_prompt}")
if 'gemini' in self.model_id:
with telemetry.tool_context_manager('opendataqna-response-v2'):
context_query = self.model.generate_content(context_prompt,safety_settings=self.safety_settings, stream=False)
natural_response = str(context_query.candidates[0].text)
else:
with telemetry.tool_context_manager('opendataqna-response-v2'):
context_query = self.model.predict(context_prompt, max_output_tokens = 8000, temperature=0)
natural_response = str(context_query.candidates[0])
return natural_response
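# --- Illustrative usage sketch (not part of the original file) ---
# A minimal sketch, assuming Vertex AI credentials; the question and the
# JSON-serialized SQL result below are placeholders.
if __name__ == '__main__':
    responder = ResponseAgent('gemini-1.5-pro')
    answer = responder.run('Which city had maximum number of sales?',
                           '[{"city_id": "C014"}]')
    print(answer)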
================================================
FILE: agents/ValidateSQLAgent.py
================================================
import json
from abc import ABC
from .core import Agent
from utilities import PROMPTS, format_prompt
class ValidateSQLAgent(Agent, ABC):
"""
An agent that validates the syntax and semantic correctness of SQL queries.
This agent leverages a language model (currently Gemini) to analyze a given SQL query against a provided database schema. It assesses whether the query is valid according to a set of predefined guidelines and generates a JSON response indicating the validity status and any potential errors.
Attributes:
agentType (str): Indicates the type of agent, fixed as "ValidateSQLAgent".
Methods:
check(source_type, user_question, tables_schema, columns_schema, generated_sql) -> dict:
Determines the validity of an SQL query and identifies potential errors.
Args:
source_type (str): The database type ("bigquery" or "postgresql").
user_question (str): The original question posed by the user (used for context).
tables_schema (str): A description of the database tables and their relationships.
columns_schema (str): Detailed descriptions of the columns within the tables.
generated_sql (str): The SQL query to be validated.
Returns:
dict: A JSON-formatted dictionary with the following keys:
- "valid": A boolean value indicating whether the query is valid or not.
- "errors": A string describing any errors found in the query (empty if valid).
"""
agentType: str = "ValidateSQLAgent"
def check(self,source_type, user_question, tables_schema, columns_schema, generated_sql):
context_prompt = PROMPTS['validatesql']
context_prompt = format_prompt(context_prompt,
source_type = source_type,
user_question = user_question,
tables_schema = tables_schema,
columns_schema = columns_schema,
generated_sql=generated_sql)
# print(f"Prompt to Validate SQL after formatting: \n{context_prompt}")
if "gemini" in self.model_id:
context_query = self.model.generate_content(context_prompt, stream=False)
generated_sql = str(context_query.candidates[0].text)
else:
context_query = self.model.predict(context_prompt, max_output_tokens = 8000, temperature=0)
generated_sql = str(context_query.candidates[0])
json_syntax_result = json.loads(str(generated_sql).replace("```json","").replace("```",""))
# print('\n SQL Syntax Validity:' + str(json_syntax_result['valid']))
# print('\n SQL Syntax Error Description:' +str(json_syntax_result['errors']) + '\n')
return json_syntax_result
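# --- Illustrative usage sketch (not part of the original file) ---
# A minimal sketch: the returned dict is shaped like {"valid": bool,
# "errors": str} and is what DebugSQLAgent.start_debugger consumes.
if __name__ == '__main__':
    checker = ValidateSQLAgent('gemini-1.5-pro')
    verdict = checker.check('bigquery',
                            'Show ten sales rows',
                            tables_schema='...', columns_schema='...',
                            generated_sql='select * from retail.sales limit 10')
    print(verdict['valid'], verdict['errors'])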
================================================
FILE: agents/VisualizeAgent.py
================================================
#This agent generates Google Charts code for displaying charts on the web application
#Generates two charts for the elements "chart_div" and "chart_div_1"
#The generated code is JavaScript
from abc import ABC
from typing import Optional
from vertexai.language_models import CodeChatModel
from vertexai.generative_models import GenerativeModel,HarmCategory,HarmBlockThreshold
from .core import Agent
from utilities import PROMPTS, format_prompt
from agents import ValidateSQLAgent
import pandas as pd
import json
from google.cloud.aiplatform import telemetry
import vertexai
from utilities import PROJECT_ID, PG_REGION
vertexai.init(project=PROJECT_ID, location=PG_REGION)
class VisualizeAgent(Agent, ABC):
"""
An agent that generates JavaScript code for Google Charts based on user questions and SQL results.
This agent analyzes the user's question and the corresponding SQL query results to determine suitable chart types. It then constructs JavaScript code that uses Google Charts to create visualizations based on the data.
Attributes:
agentType (str): Indicates the type of agent, fixed as "VisualizeAgent".
model_id (str): The ID of the language model used for chart type suggestion and code generation.
model: The language model instance.
Methods:
getChartType(user_question, generated_sql) -> str:
Suggests the two most suitable chart types based on the user's question and the generated SQL query.
Args:
user_question (str): The natural language question asked by the user.
generated_sql (str): The SQL query generated to answer the question.
Returns:
str: A JSON string containing two keys, "chart_1" and "chart_2", each representing a suggested chart type.
getChartPrompt(user_question, generated_sql, chart_type, chart_div, sql_results) -> str:
Creates a prompt for the language model to generate the JavaScript code for a specific chart.
Args:
user_question (str): The user's question.
generated_sql (str): The generated SQL query.
chart_type (str): The desired chart type (e.g., "Bar Chart", "Pie Chart").
chart_div (str): The HTML element ID where the chart will be rendered.
sql_results (str): The results of the SQL query in JSON format.
Returns:
str: The prompt for the language model to generate the chart code.
generate_charts(user_question, generated_sql, sql_results) -> dict:
Generates JavaScript code for two Google Charts based on the given inputs.
Args:
user_question (str): The user's question.
generated_sql (str): The generated SQL query.
sql_results (str): The results of the SQL query in JSON format.
Returns:
dict: A dictionary containing two keys, "chart_div" and "chart_div_1", each holding the generated JavaScript code for a chart.
"""
agentType: str ="VisualizeAgent"
def __init__(self):
self.model_id = 'gemini-1.5-pro'
self.model = GenerativeModel("gemini-1.5-pro-001")
def getChartType(self,user_question, generated_sql):
chart_type_prompt = PROMPTS['visualize_chart_type']
chart_type_prompt = format_prompt(chart_type_prompt,
user_question = user_question,
generated_sql = generated_sql)
chart_type=self.model.generate_content(chart_type_prompt, stream=False).candidates[0].text
# print(chart_type)
# chart_type = model.predict(map_prompt, max_output_tokens = 1024, temperature= 0.2).candidates[0].text
return chart_type.replace("\n", "").replace("```", "").replace("json", "").replace("```html", "").replace("```", "").replace("js\n","").replace("json\n","").replace("python\n","").replace("javascript","")
def getChartPrompt(self,user_question, generated_sql, chart_type, chart_div, sql_results):
chart_prompt = PROMPTS['visualize_generate_chart_code']
chart_prompt = format_prompt(chart_prompt,
user_question = user_question,
generated_sql = generated_sql,
chart_type = chart_type,
chart_div = chart_div,
sql_results = sql_results)
# print(f"Prompt to generate code for google charts visualization after formatting: \n{chart_prompt}")
return chart_prompt
def generate_charts(self,user_question,generated_sql,sql_results):
chart_type = self.getChartType(user_question,generated_sql)
# chart_type = chart_type.split(",")
# chart_list = [x.strip() for x in chart_type]
chart_json = json.loads(chart_type)
chart_list =[chart_json['chart_1'],chart_json['chart_2']]
print("Charts Suggested : " + str(chart_list))
context_prompt=self.getChartPrompt(user_question,generated_sql,chart_list[0],"chart_div",sql_results)
context_prompt_1=self.getChartPrompt(user_question,generated_sql,chart_list[1],"chart_div_1",sql_results)
safety_settings: Optional[dict] = {
HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
}
with telemetry.tool_context_manager('opendataqna-visualize-v2'):
context_query = self.model.generate_content(context_prompt,safety_settings=safety_settings, stream=False)
context_query_1 = self.model.generate_content(context_prompt_1,safety_settings=safety_settings, stream=False)
google_chart_js={"chart_div":context_query.candidates[0].text.replace("```json", "").replace("```", "").replace("json", "").replace("```html", "").replace("```", "").replace("js","").replace("json","").replace("python","").replace("javascript",""),
"chart_div_1":context_query_1.candidates[0].text.replace("```json", "").replace("```", "").replace("json", "").replace("```html", "").replace("```", "").replace("js","").replace("json","").replace("python","").replace("javascript","")}
return google_chart_js
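# --- Illustrative usage sketch (not part of the original file) ---
# A minimal sketch with placeholder inputs; the returned dict maps the two
# HTML element ids to the generated Google Charts JavaScript.
if __name__ == '__main__':
    viz = VisualizeAgent()
    charts = viz.generate_charts(
        'What are the top 2 product SKUs?',
        'SELECT sku, total FROM demo.sales ORDER BY total DESC LIMIT 2',
        '[{"sku": "A", "total": 10}, {"sku": "B", "total": 7}]')
    print(list(charts.keys()))  # ['chart_div', 'chart_div_1']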
================================================
FILE: agents/__init__.py
================================================
from .BuildSQLAgent import BuildSQLAgent
from .ValidateSQLAgent import ValidateSQLAgent
from .DebugSQLAgent import DebugSQLAgent
from .EmbedderAgent import EmbedderAgent
from .ResponseAgent import ResponseAgent
from .VisualizeAgent import VisualizeAgent
from .DescriptionAgent import DescriptionAgent
__all__ = ["BuildSQLAgent", "ValidateSQLAgent", "DebugSQLAgent", "EmbedderAgent", "ResponseAgent","VisualizeAgent", "DescriptionAgent"]
================================================
FILE: agents/core.py
================================================
"""
Provides the base class for all Agents
"""
from abc import ABC
from typing import Optional
import vertexai
from google.cloud.aiplatform import telemetry
from vertexai.language_models import TextGenerationModel
from vertexai.language_models import CodeGenerationModel
from vertexai.language_models import CodeChatModel
from vertexai.generative_models import GenerativeModel
from vertexai.generative_models import HarmCategory,HarmBlockThreshold
from utilities import PROJECT_ID, PG_REGION
vertexai.init(project=PROJECT_ID, location=PG_REGION)
class Agent(ABC):
"""
The core class for all Agents
"""
agentType: str = "Agent"
def __init__(self,
model_id:str):
"""
model_id is the Model ID for initialization
"""
self.model_id = model_id
if model_id == 'code-bison-32k':
with telemetry.tool_context_manager('opendataqna'):
self.model = CodeGenerationModel.from_pretrained('code-bison-32k')
elif model_id == 'text-bison-32k':
with telemetry.tool_context_manager('opendataqna'):
self.model = TextGenerationModel.from_pretrained('text-bison-32k')
elif model_id == 'codechat-bison-32k':
with telemetry.tool_context_manager('opendataqna'):
self.model = CodeChatModel.from_pretrained("codechat-bison-32k")
elif model_id == 'gemini-1.0-pro':
with telemetry.tool_context_manager('opendataqna'):
# print("Model is gemini 1.0 pro")
self.model = GenerativeModel("gemini-1.0-pro-001")
self.safety_settings: Optional[dict] = {
HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
}
elif model_id == 'gemini-1.5-flash':
with telemetry.tool_context_manager('opendataqna'):
# print("Model is gemini 1.5 flash")
self.model = GenerativeModel("gemini-1.5-flash-preview-0514")
self.safety_settings: Optional[dict] = {
HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
}
elif model_id == 'gemini-1.5-pro':
with telemetry.tool_context_manager('opendataqna'):
# print("Model is gemini 1.5 Pro")
self.model = GenerativeModel("gemini-1.5-pro-001")
self.safety_settings: Optional[dict] = {
HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
}
else:
raise ValueError("Please specify a compatible model.")
def generate_llm_response(self,prompt):
context_query = self.model.generate_content(prompt,safety_settings=self.safety_settings,stream=False)
return str(context_query.candidates[0].text).replace("```sql", "").replace("```", "").rstrip("\n")
def rewrite_question(self,question,session_history):
formatted_history=''
concat_questions=''
for i, _row in enumerate(session_history,start=1):
user_question = _row['user_question']
# print(user_question)
formatted_history += f"User Question - Turn :: {i} : {user_question}\n"
concat_questions += f"{user_question} "
# print(formatted_history)
context_prompt = f"""
Your main objective is to rewrite and refine the question based on the previous questions that has been asked.
Refine the given question using the provided questions history to produce a standalone question with full context. The refined question should be self-contained, requiring no additional context for answering it.
Make sure all the information is included in the re-written question. You just need to respond with the re-written question.
Below is the previous questions history:
{formatted_history}
Question to rewrite:
{question}
"""
re_written_qe = str(self.generate_llm_response(context_prompt))
print("*"*25 +"Re-written question:: "+"*"*25+"\n"+str(re_written_qe))
return str(concat_questions),str(re_written_qe)
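# --- Illustrative usage sketch (not part of the original file) ---
# A minimal sketch: a new agent only needs to subclass Agent with a supported
# model_id; generate_llm_response and rewrite_question then come for free.
if __name__ == '__main__':
    class EchoAgent(Agent):
        agentType = 'EchoAgent'
    agent = EchoAgent('gemini-1.5-pro')
    print(agent.generate_llm_response('Reply with the single word: ready'))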
================================================
FILE: app.py
================================================
import streamlit as st
import pandas as pd
import json
from streamlit.components.v1 import html
from streamlit.logger import get_logger
from opendataqna import generate_uuid, get_all_databases, run_pipeline, get_kgq
import asyncio
logger = get_logger(__name__)
# Initialize session state variables if they don't exist
if "session_id" not in st.session_state:
st.session_state.session_id = generate_uuid()
st.session_state.kgq = []
st.session_state.user_grouping = None
logger.info(f"New Session Created - {st.session_state.session_id}")
def get_known_databases():
"""Retrieves a list of available database schemas from the backend.
This function fetches a list of database schemas from the backend API.
These schemas represent the available datasets that users can query.
Returns:
list: A list of database schema names.
"""
logger.info("Getting list of all user databases")
json_groupings, _ = get_all_databases()
json_groupings = json.loads(json_groupings)
groupings = [item["table_schema"] for item in json_groupings if isinstance(item, dict)]
logger.info(f"user_groupings - {str(groupings)}")
return groupings
def get_known_sql(selected_schema):
"""Retrieves known good SQL queries (KGQs) for a specific database schema.
This function fetches a DataFrame containing KGQs for the given schema.
KGQs are pre-defined SQL queries that can be used as examples or suggestions.
Args:
selected_schema (str): The name of the database schema.
Returns:
pd.DataFrame: A DataFrame containing KGQs for the specified schema.
"""
data = get_kgq(selected_schema)
parsed_data = json.loads(data[0])  # the payload is JSON, so json.loads is safer than eval
df = pd.DataFrame(parsed_data)
return df
def generate_sql_results(selected_schema,user_question):
"""Generates SQL query, executes it, and returns results and response.
This function orchestrates the process of generating an SQL query based on
the user's question and selected schema, executing the query, and generating
a natural language response based on the results.
Args:
selected_schema (str): The name of the selected database schema.
user_question (str): The user's natural language question.
Returns:
tuple: A tuple containing the generated SQL query (str), the query results
as a Pandas DataFrame, and the generated natural language response (str).
"""
logger.info(f"generating response for user question - {user_question}")
logger.info(f"selected user groouping - {selected_schema}")
final_sql, results_df, response = asyncio.run(
run_pipeline(
st.session_state.session_id,
user_question,
selected_schema,
RUN_DEBUGGER=True,
EXECUTE_FINAL_SQL=True,
DEBUGGING_ROUNDS=2,
LLM_VALIDATION=False,
Embedder_model='vertex', # Options: 'vertex' or 'lang-vertex'
SQLBuilder_model='gemini-1.5-pro',
SQLChecker_model='gemini-1.5-pro',
SQLDebugger_model='gemini-1.5-pro',
Responder_model='gemini-1.5-pro',
num_table_matches=5,
num_column_matches=10,
table_similarity_threshold=0.1,
column_similarity_threshold=0.1,
example_similarity_threshold=0.1,
num_sql_matches=3
)
)
return(final_sql, results_df, response)
def generate_response(prompt):
"""Generates and displays a response to the user's prompt.
This function takes a user prompt as input, generates an SQL query and
response using the `generate_sql_results` function, and displays the
results in a conversational format using Streamlit's chat message feature.
Args:
prompt (str): The user's input prompt.
"""
for msg in st.session_state.messages:
st.chat_message(msg["role"]).write(msg["content"])
st.chat_message("user").write(prompt)
st.session_state.messages.append({"role": "assistant", "content": msg})
msg = "Generating Response"
st.session_state.messages.append({"role": "assistant", "content": msg})
st.chat_message("assistant").write(msg)
query, results, response = generate_sql_results(st.session_state.user_grouping, prompt)
msg = query
st.session_state.messages.append({"role": "assistant", "content": msg})
st.chat_message("assistant").write(msg)
msg = response
st.session_state.messages.append({"role": "assistant", "content": msg})
st.chat_message("assistant").write(msg)
with st.chat_message("assistant"):
st.dataframe(results)
st.session_state.messages.append({"role": "assistant", "content": results})
st.set_page_config(page_title='Open Data QnA', page_icon="📊", initial_sidebar_state="expanded", layout='wide')
st.markdown("""
""", unsafe_allow_html=True)
st.title("Open Data QnA")
with st.sidebar:
st.session_state.user_grouping = st.selectbox(
'Select Table Groupings',
get_known_databases())
if st.button("New Query"):
st.session_state.session_id = generate_uuid()
st.session_state.messages.clear()
st.rerun()
if "messages" not in st.session_state:
st.session_state["messages"] = [{"role": "assistant", "content": "Frequently Asked Questions"}]
if st.session_state.user_grouping is not None:
df = get_known_sql(st.session_state.user_grouping)
for index, row in df.iterrows():
text = row["example_user_question"]
st.session_state.kgq.append(text)
if prompt := st.chat_input():
generate_response(prompt)
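# To try the UI locally (illustrative; assumes config.ini points at your own
# GCP project and the vector store / embeddings have already been set up):
#   streamlit run app.py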
================================================
FILE: backend-apis/README.md
================================================
Create Endpoints
Here we are going to create publicly accessible endpoints (no authentication).
If you're working on a managed GCP project, it is common that Domain Restricted Sharing Org Policies will not allow the creation of a public-facing endpoint.
To work around this, we temporarily allow all domains and later restore the same policy, so the existing policy is not permanently changed.
Please run the below command before proceeding. You need Organization Policy Admin rights to run the below commands.
```
export PROJECT_ID=
```
```
cd Open_Data_QnA/backend-apis
gcloud resource-manager org-policies set-policy --project=$PROJECT_ID policy.yaml #This command will create a policy that overrides to allow all domains
```
Create the service account and add roles to run the solution backend for the APIs
```
gcloud iam service-accounts create opendataqna --project=$PROJECT_ID
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:opendataqna@$PROJECT_ID.iam.gserviceaccount.com --role='roles/cloudsql.client' --project=$PROJECT_ID --quiet
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:opendataqna@$PROJECT_ID.iam.gserviceaccount.com --role='roles/bigquery.admin' --project=$PROJECT_ID --quiet
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:opendataqna@$PROJECT_ID.iam.gserviceaccount.com --role='roles/aiplatform.user' --project=$PROJECT_ID --quiet
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:opendataqna@$PROJECT_ID.iam.gserviceaccount.com --role='roles/datastore.owner' --project=$PROJECT_ID --quiet
```
**Technologies**
* **Programming language:** Python
* **Framework:** Flask
**Before you start :** Ensure all variables in your config.ini file are correct, especially those for your Postgres instance and BigQuery dataset. If you need to change the Postgres instance or BigQuery dataset values, update the config.ini file before proceeding.
The endpoints deployed here are completely customized for the UI built in this demo solution. Feel free to customize the endpoints if needed for a different UI/frontend. The gcloud run deploy command creates a Cloud Build job that uses the Dockerfile in the Open_Data_QnA folder.
***Deploy endpoints to Cloud Run***
```
export PROJECT_ID=
```
```
export SERVICE_NAME=opendataqna #change the name if needed
export DEPLOY_REGION=us-central1 #change the cloud run deployment region if needed
```
Enable the cloud build API to deploy the endpoints
```
gcloud services enable cloudbuild.googleapis.com --project $PROJECT_ID
```
Get the default service accounts for Compute Engine and Cloud Build and add the IAM roles needed to deploy the Cloud Run service
```
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
export DEFAULT_CE_SA=$(gcloud iam service-accounts list --project=$PROJECT_ID --format="value(EMAIL)" --filter="EMAIL ~ $PROJECT_NUMBER-compute@developer.gserviceaccount.com")
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DEFAULT_CE_SA --role='roles/storage.admin' --project=$PROJECT_ID --quiet
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DEFAULT_CE_SA --role='roles/artifactregistry.admin' --project=$PROJECT_ID --quiet
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DEFAULT_CE_SA --role='roles/firebase.admin' --project=$PROJECT_ID --quiet
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DEFAULT_CE_SA --role='roles/cloudbuild.builds.builder' --project=$PROJECT_ID --quiet
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DEFAULT_CE_SA --role='roles/logging.logWriter' --project=$PROJECT_ID --quiet
export DEFAULT_CB_SA=$PROJECT_NUMBER'@cloudbuild.gserviceaccount.com'
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DEFAULT_CB_SA --role='roles/firebase.admin' --project=$PROJECT_ID --quiet
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DEFAULT_CB_SA --role='roles/serviceusage.apiKeysAdmin' --project=$PROJECT_ID --quiet
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DEFAULT_CB_SA --role='roles/cloudbuild.builds.builder' --project=$PROJECT_ID --quiet
gcloud projects add-iam-policy-binding $PROJECT_ID --member=serviceAccount:$DEFAULT_CB_SA --role='roles/artifactregistry.admin' --project=$PROJECT_ID --quiet
```
```
cd Open_Data_QnA
gcloud beta run deploy $SERVICE_NAME --region $DEPLOY_REGION --source . --service-account=opendataqna@$PROJECT_ID.iam.gserviceaccount.com --service-min-instances=1 --allow-unauthenticated --project=$PROJECT_ID
#if you are deploying a Cloud Run application for the first time in the project, you will be prompted for a couple of settings. Go ahead and type Yes.
```
Once the deployment is done successfully you should be able to see the Service URL (endpoint) link as shown below. Please keep this handy to add to the frontend, or you can get this URI from the Cloud Run page in the GCP Console. e.g. *https://OpenDataQnA-aeiouAEI-uc.a.run.app*
Test if the endpoints are working with the below command. This should return the dataset(s) you created in the source env setup notebook.
```
curl {Service URL}/available_databases
```
Delete the Org Policy created on the project above. Do not run this if you haven’t created the org policy above.
```
gcloud resource-manager org-policies delete iam.allowedPolicyMemberDomains --project=$PROJECT_ID
```
**API Details**
All the payloads are in JSON format
1. List Databases : Get the available databases in the vector store that the solution can run against
URI: {Service URL}/available_databases
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/available_databases
Method: GET
Request Payload : NONE
Request response:
```
{
"Error": "",
"KnownDB": "[{\"table_schema\":\"imdb-postgres\"},{\"table_schema\":\"retail-postgres\"}]",
"ResponseCode": 200
}
```
2. Known SQL : Get suggested questions (previously asked / examples added) for the selected database
URI: /get_known_sql
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/get_known_sql
Method: POST
Request Payload :
```
{
"user_grouping":"retail"
}
```
Request response:
```
{
"Error": "",
"KnownSQL": "[{\"example_user_question\":\"Which city had maximum number of sales and what was the count?\",\"example_generated_sql\":\"select st.city_id, count(st.city_id) as city_sales_count from retail.sales as s join retail.stores as st on s.id_store = st.id_store group by st.city_id order by city_sales_count desc limit 1;\"}]",
"ResponseCode": 200
}
```
3. SQL Generation : Generate the SQL for the input question asked against a database
URI: /generate_sql
Method: POST
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/generate_sql
Request payload:
```
{
"session_id":"",
"user_id":"harry@hogwarts.com",
"user_question":"Which city had maximum number of sales?",
"user_grouping":"retail"
}
```
Request response:
```
{
"Error": "",
"GeneratedSQL": " select st.city_id from retail.sales as s join retail.stores as st on s.id_store = st.id_store group by st.city_id order by count(*) desc limit 1;",
"ResponseCode": 200,
"SessionID":"1iuu2u-k1ij2-kkkhhj12131"
}
```
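As a quick sanity check, the endpoint can be exercised with curl (illustrative; substitute your own Service URL and adjust the payload values for your data):
```
curl -X POST {Service URL}/generate_sql \
  -H "Content-Type: application/json" \
  -d '{"session_id":"","user_id":"harry@hogwarts.com","user_question":"Which city had maximum number of sales?","user_grouping":"retail"}'
```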
4. Execute SQL : Run the SQL statement against the provided database source
URI:/run_query
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/run_query
Method: POST
Request payload:
```
{ "user_grouping": "retail",
"generated_sql":"select st.city_id from retail.sales as s join retail.stores as st on s.id_store = st.id_store group by st.city_id order by count(*) desc limit 1;",
"session_id":"1iuu2u-k1ij2-kkkhhj12131"
}
```
Request response:
```
{
"SessionID":"1iuu2u-k1ij2-kkkhhj12131",
"Error": "",
"KnownDB": "[{\"city_id\":\"C014\"}]",
"ResponseCode": 200
}
```
5. Embed SQL : To embed known good SQLs to your example embeddings
URI:/embed_sql
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/embed_sql
METHOD: POST
Request Payload:
```
{
"session_id":"1iuu2u-k1ij2-kkkhhj12131",
"user_question":"Which city had maximum number of sales?",
"generated_sql":"select st.city_id from retail.sales as s join retail.stores as st on s.id_store = st.id_store group by st.city_id order by count(*) desc limit 1;",
"user_grouping":"retail"
}
```
Request response:
```
{
"ResponseCode" : 201,
"Message" : "Example SQL has been accepted for embedding",
"Error":"",
"SessionID":"1iuu2u-k1ij2-kkkhhj12131"
}
```
6. Generate Visualization Code : To generate JavaScript Google Charts code based on the SQL results and display them on the UI
As per design, two suggested visualizations show up when the user clicks the visualize button. Hence two divs, "chart_div" and "chart_div_1", are sent as part of the response to bind them to those elements in the UI.
If you are only looking to set up the endpoints you can stop here. In case you require the demo app (frontend UI) built in the solution, proceed to the next step.
URI:/generate_viz
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/generate_viz
METHOD: POST
Request Payload:
```
{
"session_id":"1iuu2u-k1ij2-kkkhhj12131" ,
"user_question": "What are top 5 product skus that are ordered?",
"sql_generated": "SELECT productSKU as ProductSKUCode, sum(total_ordered) as TotalOrderedItems FROM `inbq1-joonix.demo.sales_sku` group by productSKU order by sum(total_ordered) desc limit 5",
"sql_results": [
{
"ProductSKUCode": "GGOEGOAQ012899",
"TotalOrderedItems": 456
},
{
"ProductSKUCode": "GGOEGDHC074099",
"TotalOrderedItems": 334
},
{
"ProductSKUCode": "GGOEGOCB017499",
"TotalOrderedItems": 319
},
{
"ProductSKUCode": "GGOEGOCC077999",
"TotalOrderedItems": 290
},
{
"ProductSKUCode": "GGOEGFYQ016599",
"TotalOrderedItems": 253
}
]
}
```
Request response:
```
{
"SessionID":"1iuu2u-k1ij2-kkkhhj12131",
"Error": "",
"GeneratedChartjs": {
"chart_div": "google.charts.load('current', {\n packages: ['corechart']\n});\ngoogle.charts.setOnLoadCallback(drawChart);\n\nfunction drawChart() {\n var data = google.visualization.arrayToDataTable([\n ['Product SKU', 'Total Ordered Items'],\n ['GGOEGOAQ012899', 456],\n ['GGOEGDHC074099', 334],\n ['GGOEGOCB017499', 319],\n ['GGOEGOCC077999', 290],\n ['GGOEGFYQ016599', 253],\n ]);\n\n var options = {\n title: 'Top 5 Product SKUs Ordered',\n width: 600,\n height: 300,\n hAxis: {\n textStyle: {\n fontSize: 12\n }\n },\n vAxis: {\n textStyle: {\n fontSize: 12\n }\n },\n legend: {\n textStyle: {\n fontSize: 12\n }\n },\n bar: {\n groupWidth: '50%'\n }\n };\n\n var chart = new google.visualization.BarChart(document.getElementById('chart_div'));\n\n chart.draw(data, options);\n}\n",
"chart_div_1": "google.charts.load('current', {'packages':['corechart']});\ngoogle.charts.setOnLoadCallback(drawChart);\nfunction drawChart() {\n var data = google.visualization.arrayToDataTable([\n ['ProductSKUCode', 'TotalOrderedItems'],\n ['GGOEGOAQ012899', 456],\n ['GGOEGDHC074099', 334],\n ['GGOEGOCB017499', 319],\n ['GGOEGOCC077999', 290],\n ['GGOEGFYQ016599', 253]\n ]);\n\n var options = {\n title: 'Top 5 Product SKUs that are Ordered',\n width: 600,\n height: 300,\n hAxis: {\n textStyle: {\n fontSize: 5\n }\n },\n vAxis: {\n textStyle: {\n fontSize: 5\n }\n },\n legend: {\n textStyle: {\n fontSize: 10\n }\n },\n bar: {\n groupWidth: \"60%\"\n }\n };\n\n var chart = new google.visualization.ColumnChart(document.getElementById('chart_div_1'));\n\n chart.draw(data, options);\n}\n"
},
"ResponseCode": 200
}
```
7. Get Results : To directly get the SQL results in JSON format
URI:/get_results
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/get_results
METHOD: POST
Request Payload:
```
{
"user_question":"Which city had maximum number of sales?",
"user_database":"retail"
}
```
Request response:
```
{
"Error": "",
"GeneratedResults": "[{\"city_id\":\"C014\"}]",
"ResponseCode": 200
}
```
### For setting up the demo UI with these endpoints please refer to README.md under [`/frontend`](/frontend/)
================================================
FILE: backend-apis/__init__.py
================================================
================================================
FILE: backend-apis/main.py
================================================
# -*- coding: utf-8 -*-
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from flask import Flask, request, jsonify, render_template, Response
import asyncio
from collections.abc import Callable
import logging as log
import json
import datetime
import urllib
import re
import time
import textwrap
import pandas as pd
from flask_cors import CORS
import os
import sys
import firebase_admin
from firebase_admin import credentials, auth
from functools import wraps
firebase_admin.initialize_app()
from opendataqna import get_all_databases,get_kgq,generate_sql,embed_sql,get_response,get_results,visualize
module_path = os.path.abspath(os.path.join('.'))
sys.path.append(module_path)
def jwt_authenticated(func: Callable[..., int]) -> Callable[..., int]:
@wraps(func)
async def decorated_function(*args, **kwargs):
header = request.headers.get("Authorization", None)
if header:
token = header.split(" ")[1]
try:
print("TOKEN::"+str(token))
decoded_token = firebase_admin.auth.verify_id_token(token)
except Exception as e:
log.exception(e)
return Response(status=403, response=f"Error with authentication: {e}")
else:
return Response(status=401)
request.uid = decoded_token["uid"]
print("USER:: "+str(request.uid))
return await func(*args, **kwargs) if asyncio.iscoroutinefunction(func) else func(*args, **kwargs)
return decorated_function
RUN_DEBUGGER = True
DEBUGGING_ROUNDS = 2
LLM_VALIDATION = False
EXECUTE_FINAL_SQL = True
Embedder_model = 'vertex'
SQLBuilder_model = 'gemini-1.5-pro'
SQLChecker_model = 'gemini-1.5-pro'
SQLDebugger_model = 'gemini-1.5-pro'
num_table_matches = 5
num_column_matches = 10
table_similarity_threshold = 0.3
column_similarity_threshold = 0.3
example_similarity_threshold = 0.3
num_sql_matches = 3
app = Flask(__name__)
cors = CORS(app, resources={r"/*": {"origins": "*"}})
@app.route("/available_databases", methods=["GET"])
# @jwt_authenticated
def getBDList():
result,invalid_response=get_all_databases()
if not invalid_response:
responseDict = {
"ResponseCode" : 200,
"KnownDB" : result,
"Error":""
}
else:
responseDict = {
"ResponseCode" : 500,
"KnownDB" : "",
"Error":result
}
return jsonify(responseDict)
@app.route("/embed_sql", methods=["POST"])
# @jwt_authenticated
async def embedSql():
envelope = str(request.data.decode('utf-8'))
envelope=json.loads(envelope)
user_grouping=envelope.get('user_grouping')
generated_sql = envelope.get('generated_sql')
user_question = envelope.get('user_question')
session_id = envelope.get('session_id')
embedded, invalid_response=await embed_sql(session_id,user_grouping,user_question,generated_sql)
if not invalid_response:
responseDict = {
"ResponseCode" : 201,
"Message" : "Example SQL has been accepted for embedding",
"SessionID" : session_id,
"Error":""
}
return jsonify(responseDict)
else:
responseDict = {
"ResponseCode" : 500,
"KnownDB" : "",
"SessionID" : session_id,
"Error":embedded
}
return jsonify(responseDict)
@app.route("/run_query", methods=["POST"])
# @jwt_authenticated
def getSQLResult():
envelope = str(request.data.decode('utf-8'))
envelope=json.loads(envelope)
user_question = envelope.get('user_question')
user_grouping = envelope.get('user_grouping')
generated_sql = envelope.get('generated_sql')
session_id = envelope.get('session_id')
result_df,invalid_response=get_results(user_grouping,generated_sql)
if not invalid_response:
_resp,invalid_response=get_response(session_id,user_question,result_df.to_json(orient='records'))
if not invalid_response:
responseDict = {
"ResponseCode" : 200,
"KnownDB" : result_df.to_json(orient='records'),
"NaturalResponse" : _resp,
"SessionID" : session_id,
"Error":""
}
else:
responseDict = {
"ResponseCode" : 500,
"KnownDB" : result_df.to_json(orient='records'),
"NaturalResponse" : _resp,
"SessionID" : session_id,
"Error":""
}
else:
_resp=result_df
responseDict = {
"ResponseCode" : 500,
"KnownDB" : "",
"NaturalResponse" : _resp,
"SessionID" : session_id,
"Error":result_df
}
return jsonify(responseDict)
@app.route("/get_known_sql", methods=["POST"])
# @jwt_authenticated
def getKnownSQL():
print("Extracting the known SQLs from the example embeddings.")
envelope = str(request.data.decode('utf-8'))
envelope=json.loads(envelope)
user_grouping = envelope.get('user_grouping')
result,invalid_response=get_kgq(user_grouping)
if not invalid_response:
responseDict = {
"ResponseCode" : 200,
"KnownSQL" : result,
"Error":""
}
else:
responseDict = {
"ResponseCode" : 500,
"KnownSQL" : "",
"Error":result
}
return jsonify(responseDict)
@app.route("/generate_sql", methods=["POST"])
# @jwt_authenticated
async def generateSQL():
print("Here is the request payload ")
envelope = str(request.data.decode('utf-8'))
print("Here is the request payload " + envelope)
envelope=json.loads(envelope)
user_question = envelope.get('user_question')
user_grouping = envelope.get('user_grouping')
session_id = envelope.get('session_id')
user_id = envelope.get('user_id')
generated_sql,session_id,invalid_response = await generate_sql(session_id,
user_question,
user_grouping,
RUN_DEBUGGER,
DEBUGGING_ROUNDS,
LLM_VALIDATION,
Embedder_model,
SQLBuilder_model,
SQLChecker_model,
SQLDebugger_model,
num_table_matches,
num_column_matches,
table_similarity_threshold,
column_similarity_threshold,
example_similarity_threshold,
num_sql_matches,
user_id=user_id)
if not invalid_response:
responseDict = {
"ResponseCode" : 200,
"GeneratedSQL" : generated_sql,
"SessionID" : session_id,
"Error":""
}
else:
responseDict = {
"ResponseCode" : 500,
"GeneratedSQL" : "",
"SessionID" : session_id,
"Error":generated_sql
}
return jsonify(responseDict)
@app.route("/generate_viz", methods=["POST"])
# @jwt_authenticated
async def generateViz():
envelope = str(request.data.decode('utf-8'))
# print("Here is the request payload " + envelope)
envelope=json.loads(envelope)
user_question = envelope.get('user_question')
generated_sql = envelope.get('generated_sql')
sql_results = envelope.get('sql_results')
session_id = envelope.get('session_id')
chart_js=''
try:
chart_js, invalid_response = visualize(session_id,user_question,generated_sql,sql_results)
if not invalid_response:
responseDict = {
"ResponseCode" : 200,
"GeneratedChartjs" : chart_js,
"Error":"",
"SessionID":session_id
}
else:
responseDict = {
"ResponseCode" : 500,
"GeneratedSQL" : "",
"SessionID":session_id,
"Error": chart_js
}
return jsonify(responseDict)
except Exception as e:
# util.write_log_entry("Cannot generate the Visualization!!!, please check the logs!" + str(e))
responseDict = {
"ResponseCode" : 500,
"GeneratedSQL" : "",
"SessionID":session_id,
"Error":"Issue was encountered while generating the Google Chart, please check the logs!" + str(e)
}
return jsonify(responseDict)
@app.route("/summarize_results", methods=["POST"])
# @jwt_authenticated
async def getSummary():
envelope = str(request.data.decode('utf-8'))
envelope=json.loads(envelope)
user_question = envelope.get('user_question')
sql_results = envelope.get('sql_results')
session_id = envelope.get('session_id')
result,invalid_response=get_response(session_id,user_question,sql_results)
if not invalid_response:
responseDict = {
"ResponseCode" : 200,
"summary_response" : result,
"Error":""
}
else:
responseDict = {
"ResponseCode" : 500,
"summary_response" : "",
"Error":result
}
return jsonify(responseDict)
@app.route("/natural_response", methods=["POST"])
# @jwt_authenticated
async def getNaturalResponse():
envelope = str(request.data.decode('utf-8'))
#print("Here is the request payload " + envelope)
envelope=json.loads(envelope)
user_question = envelope.get('user_question')
user_grouping = envelope.get('user_grouping')
session_id = envelope.get('session_id')
generated_sql,session_id,invalid_response = await generate_sql(session_id,
user_question,
user_grouping,
RUN_DEBUGGER,
DEBUGGING_ROUNDS,
LLM_VALIDATION,
Embedder_model,
SQLBuilder_model,
SQLChecker_model,
SQLDebugger_model,
num_table_matches,
num_column_matches,
table_similarity_threshold,
column_similarity_threshold,
example_similarity_threshold,
num_sql_matches)
if not invalid_response:
result_df,invalid_response=get_results(user_grouping,generated_sql)
if not invalid_response:
result,invalid_response=get_response(session_id,user_question,result_df.to_json(orient='records'))
if not invalid_response:
responseDict = {
"ResponseCode" : 200,
"summary_response" : result,
"Error":""
}
else:
responseDict = {
"ResponseCode" : 500,
"summary_response" : "",
"Error":result
}
else:
responseDict = {
"ResponseCode" : 500,
"KnownDB" : "",
"Error":result_df
}
else:
responseDict = {
"ResponseCode" : 500,
"GeneratedSQL" : "",
"Error":generated_sql
}
return jsonify(responseDict)
@app.route("/get_results", methods=["POST"])
async def getResultsResponse():
envelope = str(request.data.decode('utf-8'))
#print("Here is the request payload " + envelope)
envelope=json.loads(envelope)
user_question = envelope.get('user_question')
user_database = envelope.get('user_database')
session_id = envelope.get('session_id')
generated_sql,session_id,invalid_response = await generate_sql(session_id,
user_question,
user_database,
RUN_DEBUGGER,
DEBUGGING_ROUNDS,
LLM_VALIDATION,
Embedder_model,
SQLBuilder_model,
SQLChecker_model,
SQLDebugger_model,
num_table_matches,
num_column_matches,
table_similarity_threshold,
column_similarity_threshold,
example_similarity_threshold,
num_sql_matches)
if not invalid_response:
result_df,invalid_response=get_results(user_database,generated_sql)
if not invalid_response:
responseDict = {
"ResponseCode" : 200,
"GeneratedResults" : result_df.to_json(orient='records'),
"Error":""
}
else:
responseDict = {
"ResponseCode" : 500,
"GeneratedResults" : "",
"Error":result_df
}
else:
responseDict = {
"ResponseCode" : 500,
"GeneratedResults" : "",
"Error":generated_sql
}
return jsonify(responseDict)
if __name__ == "__main__":
app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
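# Illustrative local run (assumption): Flask's development server listens on
# port 8080 by default here, matching the Cloud Run deployment:
#   python backend-apis/main.py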
================================================
FILE: backend-apis/policy.yaml
================================================
constraint: constraints/iam.allowedPolicyMemberDomains
listPolicy:
allValues: ALLOW
================================================
FILE: config.ini
================================================
[CONFIG]
embedding_model = vertex
description_model = gemini-1.5-pro
vector_store = bigquery-vector
debugging = yes
logging = yes
kgq_examples = yes
firestore_region = us-central1
use_session_history = yes
use_column_samples = no
[GCP]
project_id = three-p-o
[PGCLOUDSQL]
pg_region = us-central1
pg_instance = pg15-opendataqna
pg_database = opendataqna-db
pg_user = pguser
pg_password = pg123
[BIGQUERY]
bq_dataset_region = us-central1
bq_opendataqna_dataset_name = opendataqna
bq_log_table_name = audit_log_table
================================================
FILE: dbconnectors/BQConnector.py
================================================
"""
BigQuery Connector Class
"""
from google.cloud import bigquery
from google.cloud import bigquery_connection_v1 as bq_connection
from dbconnectors import DBConnector
from abc import ABC
from datetime import datetime
import google.auth
import pandas as pd
from google.cloud.exceptions import NotFound
def get_auth_user():
credentials, project_id = google.auth.default()
if hasattr(credentials, 'service_account_email'):
return credentials.service_account_email
else:
return "Not Determined"
def bq_specific_data_types():
return '''
BigQuery offers a wide variety of datatypes to store different types of data effectively. Here's a breakdown of the available categories:
Numeric Types -
INTEGER (INT64): Stores whole numbers within the range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. Ideal for non-fractional values.
FLOAT (FLOAT64): Stores approximate floating-point numbers with a range of -1.7E+308 to 1.7E+308. Suitable for decimals with a degree of imprecision.
NUMERIC: Stores exact fixed-precision decimal numbers, with up to 38 digits of precision and 9 digits to the right of the decimal point. Useful for precise financial and accounting calculations.
BIGNUMERIC: Similar to NUMERIC but with even larger scale and precision. Designed for extreme precision in calculations.
Character Types -
STRING: Stores variable-length Unicode character sequences. Enclosed using single, double, or triple quotes.
Boolean Type -
BOOLEAN: Stores logical values of TRUE or FALSE (case-insensitive).
Date and Time Types -
DATE: Stores dates without associated time information.
TIME: Stores time information independent of a specific date.
DATETIME: Stores both date and time information (without timezone information).
TIMESTAMP: Stores an exact moment in time with microsecond precision, including a timezone component for global accuracy.
Other Types
BYTES: Stores variable-length binary data. Distinguished from strings by using 'B' or 'b' prefix in values.
GEOGRAPHY: Stores points, lines, and polygons representing locations on the Earth's surface.
ARRAY: Stores an ordered collection of zero or more elements of the same (non-ARRAY) data type.
STRUCT: Stores an ordered collection of fields, each with its own name and data type (can be nested).
This list covers the most common datatypes in BigQuery.
'''
class BQConnector(DBConnector, ABC):
"""
A connector class for interacting with BigQuery databases.
This class provides methods for connecting to BigQuery, executing queries, retrieving results as DataFrames, logging interactions, and managing embeddings.
Attributes:
project_id (str): The Google Cloud project ID where the BigQuery dataset resides.
region (str): The region where the BigQuery dataset is located.
dataset_name (str): The name of the BigQuery dataset to interact with.
opendataqna_dataset (str): Name of the dataset to use for OpenDataQnA functionalities.
audit_log_table_name (str): Name of the table to store audit logs.
client (bigquery.Client): The BigQuery client instance for executing queries.
Methods:
getconn() -> bigquery.Client:
Establishes a connection to BigQuery and returns a client object.
retrieve_df(query) -> pd.DataFrame:
Executes a SQL query and returns the results as a pandas DataFrame.
make_audit_entry(source_type, user_grouping, model, question, generated_sql, found_in_vector, need_rewrite, failure_step, error_msg, FULL_LOG_TEXT) -> str:
Logs an audit entry to BigQuery, recording details of the interaction and the generated SQL query.
create_vertex_connection(connection_id) -> None:
Creates a Vertex AI connection for remote model usage in BigQuery.
create_embedding_model(connection_id, embedding_model) -> None:
Creates or replaces an embedding model in BigQuery using a Vertex AI connection.
retrieve_matches(mode, user_grouping, qe, similarity_threshold, limit) -> list:
Retrieves the most similar table schemas, column schemas, or example queries based on the given mode and parameters.
getSimilarMatches(mode, user_grouping, qe, num_matches, similarity_threshold) -> str:
Returns a formatted string containing similar matches found for tables, columns, or examples.
getExactMatches(query) -> str or None:
Checks if the exact question is present in the example SQL set and returns the corresponding SQL query if found.
test_sql_plan_execution(generated_sql) -> Tuple[bool, str]:
Tests the execution plan of a generated SQL query in BigQuery. Returns a tuple indicating success and a message.
return_table_schema_sql(dataset, table_names=None) -> str:
Returns a SQL query to retrieve table schema information from a BigQuery dataset.
return_column_schema_sql(dataset, table_names=None) -> str:
Returns a SQL query to retrieve column schema information from a BigQuery dataset.
"""
def __init__(self,
project_id:str,
region:str,
opendataqna_dataset:str,
audit_log_table_name:str):
self.project_id = project_id
self.region = region
self.opendataqna_dataset = opendataqna_dataset
self.audit_log_table_name = audit_log_table_name
self.client=self.getconn()
def getconn(self):
client = bigquery.Client(project=self.project_id)
return client
def retrieve_df(self,query):
return self.client.query_and_wait(query).to_dataframe()
def make_audit_entry(self, source_type, user_grouping, model, question, generated_sql, found_in_vector, need_rewrite, failure_step, error_msg, FULL_LOG_TEXT):
# global FULL_LOG_TEXT
auth_user=get_auth_user()
PROJECT_ID = self.project_id
table_id= PROJECT_ID+ '.' + self.opendataqna_dataset + '.' + self.audit_log_table_name
now = datetime.now()
table_exists=False
client = self.getconn()
df1 = pd.DataFrame(columns=[
'source_type',
'project_id',
'user',
'user_grouping',
'model_used',
'question',
'generated_sql',
'found_in_vector',
'need_rewrite',
'failure_step',
'error_msg',
'execution_time',
'full_log'
])
new_row = {
"source_type":source_type,
"project_id":str(PROJECT_ID),
"user":str(auth_user),
"user_grouping": user_grouping,
"model_used": model,
"question": question,
"generated_sql": generated_sql,
"found_in_vector":found_in_vector,
"need_rewrite":need_rewrite,
"failure_step":failure_step,
"error_msg":error_msg,
"execution_time": now,
"full_log": FULL_LOG_TEXT
}
df1.loc[len(df1)] = new_row
db_schema=[
# Specify the type of columns whose type cannot be auto-detected. For
# example, string columns use the pandas dtype "object", so their
# data type is ambiguous.
bigquery.SchemaField("source_type", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("project_id", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("user", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("user_grouping", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("model_used", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("question", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("generated_sql", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("found_in_vector", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("need_rewrite", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("failure_step", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("error_msg", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("execution_time", bigquery.enums.SqlTypeNames.TIMESTAMP),
bigquery.SchemaField("full_log", bigquery.enums.SqlTypeNames.STRING),
]
try:
client.get_table(table_id) # Make an API request.
# print("Table {} already exists.".format(table_id))
table_exists=True
except NotFound:
print("Table {} is not found. Will create this log table".format(table_id))
table_exists=False
if table_exists is True:
# print('Performing streaming insert')
errors = client.insert_rows_from_dataframe(table=table_id, dataframe=df1, selected_fields=db_schema) # Make an API request.
if errors == [[]]:
print("Logged the run")
else:
print("Encountered errors while inserting rows: {}".format(errors))
else:
job_config = bigquery.LoadJobConfig(schema=db_schema,write_disposition="WRITE_TRUNCATE")
# pandas_gbq.to_gbq(df1, table_id, project_id=PROJECT_ID) # replace to replace table; append to append to a table
client.load_table_from_dataframe(df1,table_id,job_config=job_config) # replace to replace table; append to append to a table
# df1.loc[len(df1)] = new_row
# pandas_gbq.to_gbq(df1, table_id, project_id=PROJECT_ID, if_exists='append') # replace to replace table; append to append to a table
# print('\n Query added to BQ log table \n')
return 'Completed the logging step'
def create_vertex_connection(self, connection_id : str):
client=bq_connection.ConnectionServiceClient()
cloud_resource_properties = bq_connection.types.CloudResourceProperties()
new_connection=bq_connection.Connection(cloud_resource=cloud_resource_properties)
response= client.create_connection(parent=f'projects/{self.project_id}/locations/{self.region}',connection=new_connection,connection_id=connection_id)
def create_embedding_model(self,connection_id: str, embedding_model: str):
client = self.getconn()
client.query_and_wait(f'''CREATE OR REPLACE MODEL `{self.project_id}.{self.opendataqna_dataset}.EMBEDDING_MODEL`
REMOTE WITH CONNECTION `{self.project_id}.{self.region}.{connection_id}`
OPTIONS (ENDPOINT = '{embedding_model}');''')
def retrieve_matches(self, mode, user_grouping, qe, similarity_threshold, limit):
"""
This function retrieves the most similar table_schema and column_schema.
Modes can be either 'table', 'column', or 'example'
"""
matches = []
if mode == 'table':
sql = '''select base.content as tables_content from vector_search(
(SELECT * FROM `{}.table_details_embeddings` WHERE user_grouping = '{}'), "embedding",
(SELECT {} as qe), top_k=> {},distance_type=>"COSINE") where 1-distance > {} '''
elif mode == 'column':
sql='''select base.content as columns_content from vector_search(
(SELECT * FROM `{}.tablecolumn_details_embeddings` WHERE user_grouping = '{}'), "embedding",
(SELECT {} as qe), top_k=> {}, distance_type=>"COSINE") where 1-distance > {} '''
elif mode == 'example':
sql='''select base.example_user_question, base.example_generated_sql from vector_search (
(SELECT * FROM `{}.example_prompt_sql_embeddings` WHERE user_grouping = '{}'), "embedding",
(select {} as qe), top_k=> {}, distance_type=>"COSINE") where 1-distance > {} '''
else:
raise ValueError("No valid mode. Must be either table, column, or example")
name_txt = ''
results=self.client.query_and_wait(sql.format('{}.{}'.format(self.project_id,self.opendataqna_dataset),user_grouping,qe,limit,similarity_threshold)).to_dataframe()
# CHECK RESULTS
if len(results) == 0:
print(f"Did not find any results for {mode}. Adjust the query parameters.")
else:
print(f"Found {len(results)} similarity matches for {mode}.")
if mode == 'table':
name_txt = ''
for _ , r in results.iterrows():
name_txt=name_txt+r["tables_content"]+"\n"
elif mode == 'column':
name_txt = ''
for _ ,r in results.iterrows():
name_txt=name_txt+r["columns_content"]+"\n"
elif mode == 'example':
name_txt = ''
for _ , r in results.iterrows():
example_user_question=r["example_user_question"]
example_sql=r["example_generated_sql"]
name_txt = name_txt + "\n Example_question: "+example_user_question+ "; Example_SQL: "+example_sql
else:
raise ValueError("No valid mode. Must be either table, column, or example")
matches.append(name_txt)
return matches
def getSimilarMatches(self, mode, user_grouping, qe, num_matches, similarity_threshold):
if mode == 'table':
match_result= self.retrieve_matches(mode, user_grouping, qe, similarity_threshold, num_matches)
match_result = match_result[0]
# print(match_result)
elif mode == 'column':
match_result= self.retrieve_matches(mode, user_grouping, qe, similarity_threshold, num_matches)
match_result = match_result[0]
elif mode == 'example':
match_result= self.retrieve_matches(mode, user_grouping, qe, similarity_threshold, num_matches)
if len(match_result) == 0:
match_result = None
else:
match_result = match_result[0]
return match_result
def getExactMatches(self, query):
"""Checks if the exact question is already present in the example SQL set"""
check_history_sql=f"""SELECT example_user_question,example_generated_sql FROM `{self.project_id}.{self.opendataqna_dataset}.example_prompt_sql_embeddings`
WHERE lower(example_user_question) = lower("{query}") LIMIT 1; """
exact_sql_history = self.client.query_and_wait(check_history_sql).to_dataframe()
if exact_sql_history[exact_sql_history.columns[0]].count() != 0:
sql_example_txt = ''
exact_sql = ''
for index, row in exact_sql_history.iterrows():
example_user_question=row["example_user_question"]
example_sql=row["example_generated_sql"]
exact_sql=example_sql
sql_example_txt = sql_example_txt + "\n Example_question: "+example_user_question+ "; Example_SQL: "+example_sql
# print("Found a matching question from the history!" + str(sql_example_txt))
final_sql=exact_sql
else:
print("No exact match found for the user prompt")
final_sql = None
return final_sql
def test_sql_plan_execution(self, generated_sql):
try:
exec_result_df=""
job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query_job = self.client.query(generated_sql,job_config=job_config)
# print(query_job)
exec_result_df=("This query will process {} bytes.".format(query_job.total_bytes_processed))
correct_sql = True
print(exec_result_df)
return correct_sql, exec_result_df
except Exception as e:
return False,str(e)
def return_table_schema_sql(self, dataset, table_names=None):
"""
Returns the SQL query to be run on 'Source DB' to get the Table Schema
The SQL query below returns a df containing the cols table_schema, table_name, table_description, table_columns (with cols in the table)
for the schema specified above, e.g. 'retail'
- table_schema: e.g. retail
- table_name: name of the table inside the schema, e.g. products
- table_description: text descriptor, can be empty
- table_columns: aggregate of the col names inside the table
"""
user_dataset = self.project_id + '.' + dataset
table_filter_clause = ""
if table_names:
# Extract individual table names from the input string
#table_names = [name.strip() for name in table_names[1:-1].split(",")] # Handle the string as a list
formatted_table_names = [f"'{name}'" for name in table_names]
table_filter_clause = f"""AND TABLE_NAME IN ({', '.join(formatted_table_names)})"""
table_schema_sql = f"""
(SELECT
TABLE_CATALOG as project_id, TABLE_SCHEMA as table_schema , TABLE_NAME as table_name, OPTION_VALUE as table_description,
(SELECT STRING_AGG(column_name, ', ') from `{user_dataset}.INFORMATION_SCHEMA.COLUMNS` where TABLE_NAME= t.TABLE_NAME and TABLE_SCHEMA=t.TABLE_SCHEMA) as table_columns
FROM
`{user_dataset}.INFORMATION_SCHEMA.TABLE_OPTIONS` as t
WHERE
OPTION_NAME = "description"
{table_filter_clause}
ORDER BY
project_id, table_schema, table_name)
UNION ALL
(SELECT
TABLE_CATALOG as project_id, TABLE_SCHEMA as table_schema , TABLE_NAME as table_name, "NA" as table_description,
(SELECT STRING_AGG(column_name, ', ') from `{user_dataset}.INFORMATION_SCHEMA.COLUMNS` where TABLE_NAME= t.TABLE_NAME and TABLE_SCHEMA=t.TABLE_SCHEMA) as table_columns
FROM
`{user_dataset}.INFORMATION_SCHEMA.TABLES` as t
WHERE
NOT EXISTS (SELECT 1 FROM
`{user_dataset}.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE
OPTION_NAME = "description" AND TABLE_NAME= t.TABLE_NAME and TABLE_SCHEMA=t.TABLE_SCHEMA)
{table_filter_clause}
ORDER BY
project_id, table_schema, table_name)
"""
return table_schema_sql
def return_column_schema_sql(self, dataset, table_names=None):
"""
Returns the SQL query to be run on 'Source DB' to get the column schema
The SQL query below returns a df containing the cols table_schema, table_name, column_name, data_type, column_description, table_description, primary_key, column_constraints
for the schema specified above, e.g. 'retail'
- table_schema: e.g. retail
- table_name: name of the tables inside the schema, e.g. products
- column_name: name of each col in each table in the schema, e.g. id_product
- data_type: data type of each col
- column_description: col descriptor, can be empty
- table_description: text descriptor, can be empty
- primary_key: whether the col is PK; if yes, the field contains the col_name
- column_constraints: e.g. "Primary key for this table"
"""
user_dataset = self.project_id + '.' + dataset
table_filter_clause = ""
if table_names:
# table_names = [name.strip() for name in table_names[1:-1].split(",")] # Handle the string as a list
formatted_table_names = [f"'{name}'" for name in table_names]
table_filter_clause = f"""AND C.TABLE_NAME IN ({', '.join(formatted_table_names)})"""
column_schema_sql = f"""
SELECT
C.TABLE_CATALOG as project_id, C.TABLE_SCHEMA as table_schema, C.TABLE_NAME as table_name, C.COLUMN_NAME as column_name,
C.DATA_TYPE as data_type, C.DESCRIPTION as column_description, CASE WHEN T.CONSTRAINT_TYPE="PRIMARY KEY" THEN "This Column is a Primary Key for this table" WHEN
T.CONSTRAINT_TYPE = "FOREIGN_KEY" THEN "This column is Foreign Key" ELSE NULL END as column_constraints
FROM
`{user_dataset}.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS` C
LEFT JOIN
`{user_dataset}.INFORMATION_SCHEMA.TABLE_CONSTRAINTS` T
ON C.TABLE_CATALOG = T.TABLE_CATALOG AND
C.TABLE_SCHEMA = T.TABLE_SCHEMA AND
C.TABLE_NAME = T.TABLE_NAME AND
T.ENFORCED ='YES'
LEFT JOIN
`{user_dataset}.INFORMATION_SCHEMA.KEY_COLUMN_USAGE` K
ON K.CONSTRAINT_NAME=T.CONSTRAINT_NAME AND C.COLUMN_NAME = K.COLUMN_NAME
WHERE
1=1
{table_filter_clause}
ORDER BY
project_id, table_schema, table_name, column_name;
"""
return column_schema_sql
def get_column_samples(self,columns_df):
sample_column_list=[]
for index, row in columns_df.iterrows():
get_column_sample_sql=f'''SELECT STRING_AGG(CAST(value AS STRING)) as sample_values FROM UNNEST((SELECT APPROX_TOP_COUNT({row["column_name"]},5) as osn
FROM `{row["project_id"]}.{row["table_schema"]}.{row["table_name"]}`
))'''
column_samples_df=self.retrieve_df(get_column_sample_sql)
# display(column_samples_df)
sample_column_list.append(column_samples_df['sample_values'].to_string(index=False))
columns_df["sample_values"]=sample_column_list
return columns_df
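# Hypothetical usage sketch (values are placeholders; assumes Application
# Default Credentials with BigQuery access):
#
#   connector = BQConnector(project_id="my-project", region="us-central1",
#                           opendataqna_dataset="opendataqna",
#                           audit_log_table_name="audit_log_table")
#   ok, plan_msg = connector.test_sql_plan_execution("SELECT 1")
#   df = connector.retrieve_df("SELECT 1 AS x")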
================================================
FILE: dbconnectors/FirestoreConnector.py
================================================
from google.cloud import firestore
from google.cloud.exceptions import NotFound
import time
from dbconnectors import DBConnector
from abc import ABC
import uuid
def create_unique_id():
"""Creates a unique ID using the UUID4 algorithm.
Returns:
A string representing a unique ID.
"""
return str(uuid.uuid1())
class FirestoreConnector(DBConnector, ABC):
def __init__(self,
project_id:str,
firestore_database:str):
"""Initializes the Firestore connection and authentication."""
self.db = firestore.Client(project=project_id,database=firestore_database)
def log_chat(self,session_id, user_question, bot_response,user_id="TEST",):
"""Logs a chat message to Firestore.
Args:
session_id (str): The ID of the chat session.
user_id (str): The ID of the user who sent the message.
user_question (str): The question the user asked.
bot_response (str): The response from the bot.
"""
log_chat = {
"session_id": session_id,
"user_id": user_id,
"user_question": user_question,
"bot_response": bot_response,
"timestamp": firestore.SERVER_TIMESTAMP,
}
self.db.collection("session_logs").document().set(log_chat)
def get_chat_logs_for_session(self,session_id):
"""Gets all chat logs for a given session.
Args:
session_id (str): The ID of the chat session.
"""
sessions_log_ref = self.db.collection("session_logs")
# sessions_log_ref=sessions_log_ref.order_by("timestamp")
query= sessions_log_ref.where(filter=firestore.FieldFilter("session_id","==",session_id))
# query = sessions_log_ref.where("session_id", "==", session_id).order_by("timestamp")
# Note: Use of CollectionRef stream() is preferred to get()
docs = query.stream()
session_history=[]
for doc in docs:
session_history.append(doc.to_dict()) # Add values to the list
sorted_session_history=sorted(session_history,key=lambda x: x["timestamp"])
return [{'user_question': item['user_question'], 'bot_response': item['bot_response'],'timestamp':item['timestamp']} for item in sorted_session_history]
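# Hypothetical usage sketch (project and database names are placeholders):
#
#   fs = FirestoreConnector(project_id="my-project",
#                           firestore_database="opendataqna-session-logs")
#   session_id = create_unique_id()
#   fs.log_chat(session_id, "What is the total revenue?", "Total revenue is ...")
#   history = fs.get_chat_logs_for_session(session_id)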
================================================
FILE: dbconnectors/PgConnector.py
================================================
"""
PostgreSQL Connector Class
"""
import asyncpg
from google.cloud.sql.connector import Connector
from sqlalchemy import create_engine
import pandas as pd
from sqlalchemy.sql import text
from pgvector.asyncpg import register_vector
import asyncio
from pg8000.exceptions import DatabaseError
from utilities import root_dir
from dbconnectors import DBConnector
from abc import ABC
def pg_specific_data_types():
return '''
PostgreSQL offers a wide variety of datatypes to store different types of data effectively. Here's a breakdown of the available categories:
Numeric datatypes -
SMALLINT: Stores small-range integers between -32768 and 32767.
INTEGER: Stores typical integers between -2147483648 and 2147483647.
BIGINT: Stores large-range integers between -9223372036854775808 and 9223372036854775807.
DECIMAL(p,s): Stores arbitrary precision numbers with a maximum of p digits and s digits to the right of the decimal point.
NUMERIC: Similar to DECIMAL but with additional features like automatic scaling.
REAL: Stores single-precision floating-point numbers with an approximate range of -3.4E+38 to 3.4E+38.
DOUBLE PRECISION: Stores double-precision floating-point numbers with an approximate range of -1.7E+308 to 1.7E+308.
Character datatypes -
CHAR(n): Fixed-length character string with a specified length of n characters.
VARCHAR(n): Variable-length character string with a maximum length of n characters.
TEXT: Variable-length string with no maximum size limit.
CHARACTER VARYING(n): Alias for VARCHAR(n).
CHARACTER: Alias for CHAR.
Monetary datatypes -
MONEY: Stores monetary amounts with two decimal places.
Date/Time datatypes -
DATE: Stores dates without time information.
TIME: Stores time of day without date information (optionally with time zone).
TIMESTAMP: Stores both date and time information (optionally with time zone).
INTERVAL: Stores time intervals between two points in time.
Binary types -
BYTEA: Stores variable-length binary data.
BIT: Stores single bits.
BIT VARYING: Stores variable-length bit strings.
Other types -
BOOLEAN: Stores true or false values.
UUID: Stores universally unique identifiers.
XML: Stores XML data.
JSON: Stores JSON data.
ENUM: Stores user-defined enumerated values.
RANGE: Stores ranges of data values.
This list covers the most common datatypes in PostgreSQL.
'''
class PgConnector(DBConnector, ABC):
"""
A connector class for interacting with PostgreSQL databases.
This class provides methods for establishing connections to PostgreSQL instances, executing SQL queries, retrieving results as DataFrames, caching known SQL queries, and managing embeddings. It utilizes the `pg8000` library for connections and the `asyncpg` library for asynchronous operations.
Attributes:
project_id (str): The Google Cloud project ID where the PostgreSQL instance resides.
region (str): The region where the PostgreSQL instance is located.
instance_name (str): The name of the PostgreSQL instance.
database_name (str): The name of the database to connect to.
database_user (str): The username for authentication.
database_password (str): The password for authentication.
pool (Engine): A SQLAlchemy engine object for managing database connections.
Methods:
getconn() -> connection:
Establishes a connection to the PostgreSQL instance and returns a connection object.
retrieve_df(query) -> pd.DataFrame:
Executes a SQL query and returns the results as a pandas DataFrame. Handles potential database errors.
cache_known_sql() -> None:
Caches known good SQL queries into a PostgreSQL table for future reference.
retrieve_matches(mode, user_grouping, qe, similarity_threshold, limit) -> list:
Retrieves similar matches (table schemas, column schemas, or example queries) from the database based on the given mode, query embedding (`qe`), similarity threshold, and limit.
getSimilarMatches(mode, user_grouping, qe, num_matches, similarity_threshold) -> str:
Gets similar matches for tables, columns, or examples asynchronously, formatting the results into a string.
test_sql_plan_execution(generated_sql) -> Tuple[bool, pd.DataFrame]:
Tests the execution plan of a generated SQL query in PostgreSQL. Returns a tuple indicating success and the result DataFrame.
getExactMatches(query) -> str or None:
Checks if the exact question is present in the example SQL set and returns the corresponding SQL query if found.
return_column_schema_sql(schema) -> str:
Returns a SQL query to retrieve column schema information from a PostgreSQL schema.
return_table_schema_sql(schema) -> str:
Returns a SQL query to retrieve table schema information from a PostgreSQL schema.
"""
def __init__(self,
project_id:str,
region:str,
instance_name:str,
database_name:str,
database_user:str,
database_password:str):
self.project_id = project_id
self.region = region
self.instance_name = instance_name
self.database_name = database_name
self.database_user = database_user
self.database_password = database_password
self.pool = create_engine(
"postgresql+pg8000://",
creator=self.getconn,
)
def getconn(self):
"""
function to return the database connection object
"""
# initialize Connector object
connector = Connector()
conn = connector.connect(
f"{self.project_id}:{self.region}:{self.instance_name}",
"pg8000",
user=f"{self.database_user}",
password=f"{self.database_password}",
db=f"{self.database_name}"
)
return conn
def retrieve_df(self, query):
"""
Executes a SQL query and returns the results as a pandas DataFrame. Handles potential database errors.
"""
result_df=pd.DataFrame()
try:
with self.pool.connect() as db_conn:
df = pd.read_sql(text(query), con=db_conn)
result_df = df
# print('\n Return from code execution: ' + str(result_df) )
return result_df
except Exception as e:
print(f"Database Error: {e}")
df = pd.DataFrame({'Error. Message': e}, index=[0])
return df
async def cache_known_sql(self):
df = pd.read_csv(f"{root_dir}/scripts/known_good_sql.csv")
df = df.loc[:, ["prompt", "sql", "database_name"]]
df = df.dropna()
loop = asyncio.get_running_loop()
async with Connector(loop=loop) as connector:
# # Create connection to Cloud SQL database.
conn: asyncpg.Connection = await connector.connect_async(
f"{self.project_id}:{self.region}:{self.instance_name}",
"asyncpg",
user=f"{self.database_user}",
password=f"{self.database_password}",
db=f"{self.database_name}",
)
await register_vector(conn)
# Delete the table if it exists.
await conn.execute("DROP TABLE IF EXISTS query_example_embeddings CASCADE")
# Create the `query_example_embeddings` table.
await conn.execute(
"""CREATE TABLE query_example_embeddings(
prompt TEXT,
sql TEXT,
user_grouping TEXT)"""
)
# Copy the dataframe to the 'query_example_embeddings' table.
tuples = list(df.itertuples(index=False))
await conn.copy_records_to_table(
"query_example_embeddings", records=tuples, columns=list(df), timeout=10000
)
await conn.close()
async def retrieve_matches(self, mode, user_grouping, qe, similarity_threshold, limit):
"""
This function retrieves the most similar table_schema and column_schema.
Modes can be either 'table', 'column', or 'example'
"""
matches = []
loop = asyncio.get_running_loop()
async with Connector(loop=loop) as connector:
# # Create connection to Cloud SQL database.
conn: asyncpg.Connection = await connector.connect_async(
f"{self.project_id}:{self.region}:{self.instance_name}",
"asyncpg",
user=f"{self.database_user}",
password=f"{self.database_password}",
db=f"{self.database_name}",
)
await register_vector(conn)
# Prepare the SQL depending on 'mode'
if mode == 'table':
sql = """
SELECT content as tables_content,
1 - (embedding <=> $1) AS similarity
FROM table_details_embeddings
WHERE 1 - (embedding <=> $1) > $2
AND user_grouping = $4
ORDER BY similarity DESC LIMIT $3
"""
elif mode == 'column':
sql = """
SELECT content as columns_content,
1 - (embedding <=> $1) AS similarity
FROM tablecolumn_details_embeddings
WHERE 1 - (embedding <=> $1) > $2
AND user_grouping = $4
ORDER BY similarity DESC LIMIT $3
"""
elif mode == 'example':
sql = """
SELECT user_grouping, example_user_question, example_generated_sql,
1 - (embedding <=> $1) AS similarity
FROM example_prompt_sql_embeddings
WHERE 1 - (embedding <=> $1) > $2
AND user_grouping = $4
ORDER BY similarity DESC LIMIT $3
"""
else:
raise ValueError("No valid mode. Must be either table, column, or example")
name_txt = ''
# print(sql,qe,similarity_threshold,limit,user_grouping)
# FETCH RESULTS FROM POSTGRES DB
results = await conn.fetch(
sql,
qe,
similarity_threshold,
limit,
user_grouping
)
# CHECK RESULTS
if len(results) == 0:
print(f"Did not find any results for {mode}. Adjust the query parameters.")
else:
print(f"Found {len(results)} similarity matches for {mode}.")
if mode == 'table':
name_txt = ''
for r in results:
name_txt=name_txt+r["tables_content"]+"\n\n"
elif mode == 'column':
name_txt = ''
for r in results:
name_txt=name_txt+r["columns_content"]+"\n\n "
elif mode == 'example':
name_txt = ''
for r in results:
example_user_question=r["example_user_question"]
example_sql=r["example_generated_sql"]
# print(example_user_question+"\nThreshold::"+str(r["similarity"]))
name_txt = name_txt + "\n Example_question: "+example_user_question+ "; Example_SQL: "+example_sql
else:
raise ValueError("No valid mode. Must be either table, column, or example")
matches.append(name_txt)
# Close the connection to the database.
await conn.close()
return matches
async def getSimilarMatches(self, mode, user_grouping, qe, num_matches, similarity_threshold):
if mode == 'table':
match_result=await self.retrieve_matches(mode, user_grouping, qe, similarity_threshold, num_matches)
match_result = match_result[0]
elif mode == 'column':
match_result=await self.retrieve_matches(mode, user_grouping, qe, similarity_threshold, num_matches)
match_result = match_result[0]
elif mode == 'example':
match_result=await self.retrieve_matches(mode, user_grouping, qe, similarity_threshold, num_matches)
if len(match_result) == 0:
match_result = None
else:
match_result = match_result[0]
return match_result
def test_sql_plan_execution(self, generated_sql):
try:
exec_result_df = pd.DataFrame()
sql = f"""EXPLAIN ANALYZE {generated_sql}"""
exec_result_df = self.retrieve_df(sql)
if not exec_result_df.empty:
if str(exec_result_df.iloc[0]).startswith('Error. Message'):
correct_sql = False
else:
print('\n No need to rewrite the query. This seems to work fine and returned rows...')
correct_sql = True
else:
print('\n No need to rewrite the query. This seems to work fine but no rows returned...')
correct_sql = True
return correct_sql, exec_result_df
except Exception as e:
return False,str(e)
def getExactMatches(self, query):
"""
Checks if the exact question is already present in the example SQL set
"""
check_history_sql=f"""SELECT example_user_question,example_generated_sql
FROM example_prompt_sql_embeddings
WHERE lower(example_user_question) = lower('{query}') LIMIT 1; """
exact_sql_history = self.retrieve_df(check_history_sql)
if exact_sql_history[exact_sql_history.columns[0]].count() != 0:
sql_example_txt = ''
exact_sql = ''
for index, row in exact_sql_history.iterrows():
example_user_question=row["example_user_question"]
example_sql=row["example_generated_sql"]
exact_sql=example_sql
sql_example_txt = sql_example_txt + "\n Example_question: "+example_user_question+ "; Example_SQL: "+example_sql
# print("Found a matching question from the history!" + str(sql_example_txt))
final_sql=exact_sql
else:
print("No exact match found for the user prompt")
final_sql = None
return final_sql
def return_column_schema_sql(self, schema, table_names=None):
"""
This SQL returns a df containing the cols table_schema, table_name, column_name, data_type, column_description, table_description, primary_key, column_constraints
for the schema specified above, e.g. 'retail'
- table_schema: e.g. retail
- table_name: name of the table inside the schema, e.g. products
- column_name: name of each col in each table in the schema, e.g. id_product
- data_type: data type of each col
- column_description: col descriptor, can be empty
- table_description: text descriptor, can be empty
- primary_key: whether the col is PK; if yes, the field contains the col_name
- column_constraints: e.g. "Primary key for this table"
"""
table_filter_clause = ""
if table_names:
# table_names = [name.strip() for name in table_names[1:-1].split(",")] # Handle the string as a list
formatted_table_names = [f"'{name}'" for name in table_names]
table_filter_clause = f"""and table_name in ({', '.join(formatted_table_names)})"""
column_schema_sql = f'''
WITH
columns_schema
AS
(select c.table_schema,c.table_name,c.column_name,c.data_type,d.description as column_description, obj_description(c1.oid) as table_description
from information_schema.columns c
inner join pg_class c1
on c.table_name=c1.relname
inner join pg_catalog.pg_namespace n
on c.table_schema=n.nspname
and c1.relnamespace=n.oid
left join pg_catalog.pg_description d
on d.objsubid=c.ordinal_position
and d.objoid=c1.oid
where
c.table_schema='{schema}' {table_filter_clause}) ,
pk_schema as
(SELECT table_name, column_name AS primary_key
FROM information_schema.key_column_usage
WHERE TABLE_SCHEMA='{schema}' {table_filter_clause}
AND CONSTRAINT_NAME like '%_pkey%'
ORDER BY table_name, primary_key),
fk_schema as
(SELECT table_name, column_name AS foreign_key
FROM information_schema.key_column_usage
WHERE TABLE_SCHEMA='{schema}' {table_filter_clause}
AND CONSTRAINT_NAME like '%_fkey%'
ORDER BY table_name, foreign_key)
select lr.*,
case
when primary_key is not null then 'Primary key for this table'
when foreign_key is not null then CONCAT('Foreign key',column_description)
else null
END as column_constraints
from
(select l.*,r.primary_key
from
columns_schema l
left outer join
pk_schema r
on
l.table_name=r.table_name
and
l.column_name=r.primary_key) lr
left outer join
fk_schema rt
on
lr.table_name=rt.table_name
and
lr.column_name=rt.foreign_key
;
'''
return column_schema_sql
def return_table_schema_sql(self, schema, table_names=None):
"""
This SQL returns a df containing the cols table_schema, table_name, table_description, table_columns (with cols in the table)
for the schema specified above, e.g. 'retail'
- table_schema: e.g. retail
- table_name: name of the table inside the schema, e.g. products
- table_description: text descriptor, can be empty
- table_columns: aggregate of the col names inside the table
"""
table_filter_clause = ""
if table_names:
# Extract individual table names from the input string
#table_names = [name.strip() for name in table_names[1:-1].split(",")] # Handle the string as a list
formatted_table_names = [f"'{name}'" for name in table_names]
table_filter_clause = f"""and table_name in ({', '.join(formatted_table_names)})"""
table_schema_sql = f'''
SELECT table_schema, table_name,table_description, array_to_string(array_agg(column_name), ' , ') as table_columns
FROM
(select c.table_schema,c.table_name,c.column_name,c.ordinal_position,c.column_default,c.data_type,d.description, obj_description(c1.oid) as table_description
from information_schema.columns c
inner join pg_class c1
on c.table_name=c1.relname
inner join pg_catalog.pg_namespace n
on c.table_schema=n.nspname
and c1.relnamespace=n.oid
left join pg_catalog.pg_description d
on d.objsubid=c.ordinal_position
and d.objoid=c1.oid
where
c.table_schema='{schema}' {table_filter_clause} ) data
GROUP BY table_schema, table_name, table_description
ORDER BY table_name;
'''
return table_schema_sql
def get_column_samples(self,columns_df):
sample_column_list=[]
for index, row in columns_df.iterrows():
get_column_sample_sql=f'''SELECT most_common_vals AS sample_values FROM pg_stats WHERE tablename = '{row["table_name"]}' AND schemaname = '{row["table_schema"]}' AND attname = '{row["column_name"]}' '''
column_samples_df=self.retrieve_df(get_column_sample_sql)
# display(column_samples_df)
sample_column_list.append(column_samples_df['sample_values'].to_string(index=False).replace("{","").replace("}",""))
columns_df["sample_values"]=sample_column_list
return columns_df
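# Hypothetical usage sketch (connection values are placeholders; assumes a
# reachable Cloud SQL instance and valid credentials):
#
#   connector = PgConnector(project_id="my-project", region="us-central1",
#                           instance_name="pg15-opendataqna",
#                           database_name="opendataqna-db",
#                           database_user="pguser", database_password="...")
#   df = connector.retrieve_df("SELECT 1 AS x")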
================================================
FILE: dbconnectors/__init__.py
================================================
from .core import DBConnector
from .PgConnector import PgConnector, pg_specific_data_types
from .BQConnector import BQConnector, bq_specific_data_types
from .FirestoreConnector import FirestoreConnector
from utilities import (PROJECT_ID,
PG_INSTANCE, PG_DATABASE, PG_USER, PG_PASSWORD, PG_REGION,BQ_REGION,
BQ_OPENDATAQNA_DATASET_NAME,BQ_LOG_TABLE_NAME)
pgconnector = PgConnector(PROJECT_ID, PG_REGION, PG_INSTANCE, PG_DATABASE, PG_USER, PG_PASSWORD)
bqconnector = BQConnector(PROJECT_ID,BQ_REGION,BQ_OPENDATAQNA_DATASET_NAME,BQ_LOG_TABLE_NAME)
firestoreconnector = FirestoreConnector(PROJECT_ID,"opendataqna-session-logs")
__all__ = ["pgconnector", "pg_specific_data_types", "bqconnector","firestoreconnector"]
================================================
FILE: dbconnectors/core.py
================================================
"""
Provides the base class for all Connectors
"""
from abc import ABC
class DBConnector(ABC):
"""
The core class for all Connectors
"""
connectorType: str = "Base"
def __init__(self,
project_id:str,
region:str,
instance_name:str,
database_name:str,
database_user:str,
database_password:str,
dataset_name:str):
"""
Args:
project_id (str | None): GCP Project Id.
dataset_name (str): The name of the dataset to interact with.
"""
self.project_id = project_id
self.region = region
self.instance_name = instance_name
self.database_name = database_name
self.database_user = database_user
self.database_password = database_password
self.dataset_name = dataset_name
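# A new connector would subclass DBConnector and supply its own connection and
# query logic; a minimal hypothetical sketch:
#
#   class MyConnector(DBConnector):
#       connectorType: str = "MyDB"
#
#       def retrieve_df(self, query):
#           ...  # execute `query` and return a pandas DataFrame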
================================================
FILE: docs/README.md
================================================
This directory contains documentation and resources to help you understand and use the Open Data QnA library effectively.
## Contents
* **README.md:** This file. Provides an overview of the documentation in this directory.
* **architecture.md:** A summary of the solution architecture and how the agents are orchestrated to answer a question.
* **best_practices.md:** Best practices and guidelines for using the library, including recommended configurations, tips for improving performance, and common pitfalls to avoid.
* **changelog.md:** Release notes describing the features and fixes in each version.
* **config_guide.md:** A guide to populating the `config.ini` file.
* **faq.md:** Frequently asked questions about the library, covering common issues, troubleshooting tips, and general usage guidance.
* **repo_structure.md:** A detailed explanation of the library's repository structure, including the purpose of each file and directory, and how to navigate the codebase.
## How to Use This Documentation
**Start with the README.md in the root dir:** This file provides a high-level overview and guides you to the relevant resources.
**Consult the FAQ:** If you have any questions or encounter issues, check the FAQ section for possible solutions and answers.
**Explore Best Practices:** For optimizing your usage and getting the most out of the library, review the best practices document.
**Understand the Codebase:** If you want to dive deeper into the library's code, refer to the repository structure document for a detailed explanation of how the code is organized.
================================================
FILE: docs/architecture.md
================================================
Architecture
-------------
Architecture Summary
-------------
Open Data QnA operates in a sequence of well-defined steps, orchestrating various agents to process user queries and generate informative responses:
* **Vector Store Creation:** The vector store is initialized, storing embeddings of known good SQL queries, table schemas, and column details. This serves as a knowledge base for retrieval-augmented generation (RAG).
* **RAG (Retrieval-Augmented Generation):** User queries are embedded and compared to the vector store to retrieve relevant context (table/column details and similar past queries) for improved query generation.
* **SQL Generation (BuildSQLAgent):** The BuildSQLAgent leverages the retrieved context and the user's natural language question to generate an initial SQL query.
* **Optional Validation (ValidateSQLAgent):** If enabled, the ValidateSQLAgent assesses the generated SQL for syntactic and semantic correctness.
* **Optional Debugging (DebugSQLAgent):** If the initial SQL is invalid and debugging is enabled, the DebugSQLAgent iteratively refines the query based on error feedback.
* **SQL Execution (Dry Run/Explain):** The refined SQL query is tested with a dry run (BigQuery) or explain plan (PostgreSQL) to estimate resource usage and identify potential errors.
* **SQL Execution (Full Run):** If the query is deemed valid, it's executed against the database to fetch the results.
* **Response Generation (ResponseAgent):** The ResponseAgent analyzes the SQL results and the user's question to generate a natural language response, providing a clear and concise answer.
* **Optional Visualization (VisualizeAgent):** If enabled, the VisualizeAgent suggests suitable chart types and generates JavaScript code for Google Charts to display the SQL results in a visually appealing manner.
**Key Points:**
* **Modularity:** Each step is handled by a specialized agent, allowing for flexibility and customization.
* **RAG Enhancement:** The use of retrieval-augmented generation leverages existing knowledge for better query formulation.
* **Validation and Debugging:** Optional agents enhance the reliability and accuracy of generated queries.
* **Informative Responses:** The ResponseAgent aims to provide meaningful and contextually relevant answers.
* **Visual Appeal:** The optional visualization adds an interactive layer to the user experience.
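The sequence above can be read as a simple pipeline. Below is a minimal sketch of that control flow; the injected callables are hypothetical stand-ins for the agents and connectors (the actual orchestration lives in `opendataqna.py`):

```python
# Hypothetical orchestration sketch; the injected callables stand in for the agents.
def answer_question(user_question, embed, retrieve_similar, build_sql,
                    validate_sql, debug_sql, dry_run_ok, execute_sql, respond):
    qe = embed(user_question)                  # embed the question
    context = retrieve_similar(qe)             # RAG over the vector store
    sql = build_sql(user_question, context)    # BuildSQLAgent
    if not validate_sql(sql):                  # ValidateSQLAgent (optional)
        sql = debug_sql(sql, user_question)    # DebugSQLAgent (optional)
    if not dry_run_ok(sql):                    # dry run (BQ) / explain plan (PG)
        return "Could not produce a valid query."
    results = execute_sql(sql)                 # full run against the database
    return respond(user_question, results)     # ResponseAgent
```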
================================================
FILE: docs/best_practices.md
================================================
# Open Data QnA: Best Practices
## General Usage
### Select the Right Database Connector:
Choose between `PgConnector` (Google Cloud SQL for PostgreSQL) and `BQConnector` (BigQuery) to match your specific database.
### Prepare your data:
Ensure your database tables are structured logically with appropriate column names and data types. We further recommend adding concise descriptions to tables and columns to provide the LLM agents with the necessary context.
Additionally, please ensure that the overall data quality of your database is good - if you have pattern mismatches or missing values, these will impact the performance of the Open Data QnA solution.
### Start simple:
Begin with straightforward questions and fewer tables and progressively experiment with more complex queries and adding more tables.
### Leverage the ‘Known Good SQL’ Cache
The `Known Good SQL` cache can (and should) be populated with example user question <-> SQL query pairs relating to your use case. This benefits the solution in two ways:
Caching layer reduces latency: if a known user question is found in the cache that exactly matches the new input question (meaning every character matches, down to punctuation), the known good SQL query is fetched and SQL generation is skipped.
In Context Learning: if a known user question is found to be similar to one of the existing queries in the cache, the similar user question is retrieved along with the corresponding SQL query and used as a few-shot example in the prompt for the SQL Generation agent. The user can specify how many example values should be retrieved to use as few-shot examples. We recommend using 3-5 examples, but this further depends on the variations of user questions you expect in your use case.
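The two mechanisms amount to a lookup followed by a fallback. A minimal sketch, where the helper callables are hypothetical stand-ins for the vector store lookups and the SQL builder:

```python
# Hypothetical sketch of the cache behavior described above.
def get_sql(user_question, qe, get_exact_match, get_similar_examples,
            generate_sql, num_sql_matches=3):
    exact = get_exact_match(user_question)   # char-for-char cache lookup
    if exact is not None:
        return exact                         # cache hit: skip SQL generation
    examples = get_similar_examples(qe, limit=num_sql_matches)  # few-shot examples
    return generate_sql(user_question, few_shot_examples=examples)
```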
### Explore Visualizations
Utilize the `VisualizeAgent` to generate charts and graphs for a more intuitive understanding of your data. However, make sure to only run the agent on queries that the pipeline has flagged as ‘valid’.
## Customization & Optimization
### Agent Modification
The `core` Agent class (agents/core.py) specifies the models supported for the different agents in the Open Data QnA solution.
In version 1, these are:
- Code Bison ('code-bison-32k')
- Text Bison ('text-bison-32k')
- Codechat Bison ('codechat-bison-32k')
- Gemini 1.0 pro ('gemini-1.0-pro')
You can set the different models for each agent when calling the run_pipeline function (see below under `Pipeline Run Configurations`).
### Prompt Engineering
Each of the defined agents has their own prompt specified in its agent class file.
BuildSQLAgent.py: prompts for BigQuery and PostgreSQL SQL Generation.
DebugSQLAgent.py: prompts for debugging for either BQ or PG queries.
DescriptionAgent.py: prompts for generating missing table and column descriptions.
ResponseAgent.py: prompt to generate a natural language response, answering the user question by using the output of the generated SQL query.
ValidateSQLAgent.py: prompt to classify a given SQL as valid or invalid.
VisualizeAgent.py two prompts; one for proposing a fitting graph / plot for a given question <-> SQL pair; the other for generating the visualization.
### Pipeline Run Configurations
In addition to changing the base models and the prompts, it is advisable to experiment with different configuration settings of the run_pipeline function:
```
async def run_pipeline(user_question,
RUN_DEBUGGER=True,
EXECUTE_FINAL_SQL=True,
DEBUGGING_ROUNDS = 2,
LLM_VALIDATION=True,
SQLBuilder_model= 'gemini-1.0-pro',
SQLChecker_model= 'gemini-1.0-pro',
SQLDebugger_model= 'gemini-1.0-pro',
Responder_model= 'gemini-1.0-pro',
num_table_matches = 5,
num_column_matches = 10,
table_similarity_threshold = 0.3,
column_similarity_threshold = 0.3,
example_similarity_threshold = 0.3,
num_sql_matches=3)
```
Args:
* **user_question (str):** The natural language question to answer.
* **RUN_DEBUGGER (bool, optional):** Whether to run the SQL debugger. Defaults to True.
It is recommended to use the debugger for improved SQL Generation accuracy.
* **DEBUGGING_ROUNDS (int, optional):** The number of debugging rounds. Defaults to 2.
We suggest using a value between 2-5, depending on your accuracy and latency requirements.
* **EXECUTE_FINAL_SQL (bool, optional):** Whether to execute the final SQL query. Defaults to True.
You can disable the SQL execution. This will leave you with the generated SQL query as a response, skipping the retrieval of the execution result and the response generation.
* **LLM_VALIDATION (bool, optional):** Whether to use LLM for SQL validation during debugging. Defaults to True.
You can disable the SQL Validator if you have specific latency requirements. When disabled, the Debugger will execute a dry run to retrieve any errors from the database call and debug accordingly.
* **SQLBuilder_model (str, optional):** The name of the SQL building model. Defaults to 'gemini-1.0-pro'.
* **SQLChecker_model (str, optional):** The name of the SQL validation model. Defaults to 'gemini-1.0-pro'.
* **SQLDebugger_model (str, optional):** The name of the SQL debugging model. Defaults to 'gemini-1.0-pro'.
* **Responder_model (str, optional):** The name of the response generation model. Defaults to 'gemini-1.0-pro'.
* **num_table_matches (int, optional):** The number of similar tables to retrieve. Defaults to 5.
These will be used when calling the SQL Generation Agent.
We recommend setting this higher if you have high variations in your database and user queries.
* **num_column_matches (int, optional):** The number of similar columns to retrieve. Defaults to 10.
These will be used when calling the SQL Generation Agent.
We recommend setting this higher if you have high variations in your database and user queries.
* **table_similarity_threshold (float, optional):** The similarity threshold for tables. Defaults to 0.3.
Start with higher values and gradually decrease them if you’re not getting enough relevant results.
* **column_similarity_threshold (float, optional):** The similarity threshold for columns. Defaults to 0.3.
Start with higher values and gradually decrease them if you’re not getting enough relevant results.
* **example_similarity_threshold (float, optional):** The similarity threshold for example questions. Defaults to 0.3.
Start with higher values and gradually decrease them if you’re not getting enough relevant results.
* **num_sql_matches (int, optional):** The number of similar SQL queries to retrieve. Defaults to 3.
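As a starting point, a hedged usage sketch (the import path is assumed from the repository layout, and the keyword values are just the documented defaults; `run_pipeline` is async, so it is awaited or run via `asyncio.run`):

```python
import asyncio
from opendataqna import run_pipeline  # import path assumed from the repo layout

# Returns the pipeline output (generated SQL, results, and/or the natural
# language response, depending on the configuration flags).
output = asyncio.run(run_pipeline(
    "Which products sold the most last month?",
    RUN_DEBUGGER=True,
    DEBUGGING_ROUNDS=2,
    LLM_VALIDATION=True,
    table_similarity_threshold=0.3,
    num_sql_matches=3,
))
```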
================================================
FILE: docs/changelog.md
================================================
# Release Notes - Open Data QnA v2.0.0
This major release brings significant improvements and new features to Open Data QnA.
## Multi turn capabilities
Ability to interact back and forth with the database within a session context. The initial v1 release supported only single-turn queries. In this release, we have created a multi-turn architecture that saves the session info and previous query information and can answer accordingly. For more information on the architecture: link
## Table Grouping
Initial v1 was tied to processing a single dataset and all the tables under it. In reality, users most likely want to restrict the tables and add other datasets if needed. Table grouping gives users a way to define their own scope.
## Data Sampling
We provide a sampling of data values in a column to give contextual information to the SQL Generation agent. For this, the top 5 values are retrieved for every column in the specified tables.
This information is aggregated and stored back into the vector store, and is retrieved during the retrieval process.
## Data summarization
In the initial v1 release, the results were in tabular format. With this release, we provide summarized answers in natural language that can be integrated into a chatbot. Users still have the option to get tabular and visualized results based on their settings.
## Resolving ambiguities
The multi-turn approach helps to resolve ambiguities in the questions, by allowing the user to provide follow-up questions and clarifications.
Furthermore, it is possible to provide additional context in the instruction prompt to let the LLM resolve ambiguities before triggering the pipeline. This can be achieved with the help of an LLM router added as a first layer before the Open Data QnA pipeline.
These clarification questions can help provide more context to the SQL creation.
Ambiguities can be categorized into semantic, application, business and database context. With this release we look for semantic and business level context and resolve such ambiguities through the chat interface.
## UX through Flutter and Streamlit
In addition to the Angular frontend, this release adds support for Flutter, which can be found under the frontend code folder.
Furthermore, to enable more efficient development, we have added support for Streamlit, so users can quickly iterate and test in a dev frontend before deploying to Angular or Flutter.
# Release Notes - Open Data QnA v1.2.0
This release brings significant improvements and new features to enhance the stability, functionality, and user experience of the Open Data QnA.
## 🗝️ Key Enhancements:
* **Enhanced Functionality:** Added the ability to specify a list of table names to be processed in BQ, instead of parsing all tables in a dataset.
* **Improved Debugging:** The SQL debugger now incorporates the user's question into its prompts, leading to more accurate and relevant debugging suggestions.
* **Simplified Setup:** Streamlined notebook setup and environment variable management for a smoother user experience.
* **Quickstart**: Added a standalone notebook for quick experimentation with the overall approach, limited to BQ.
* **Flexible Configuration:** Introduced optional arguments for the CLI pipeline, allowing users to customize various parameters like table and column similarity thresholds.
* **Code Refinements:** Removed hardcoded embedding models and added a save_config function for cleaner configuration management.
* **Bug Fixes:** Resolved various bugs, including issues with root directory checking, utility initialization, source type determination, and safety settings.
* **Expanded Documentation:** Added comprehensive docstrings to functions for better clarity and understanding.
## 📈 Additional Improvements:
* **Code Cleanup:** Removed unnecessary files and redundant code, improving overall code maintainability.
* **Updated README:** Improved the README file with clearer instructions and updated information.
* **Enhanced User Interface:** Introduced a CLI approach (experimental) for more streamlined interaction.
## 🐜 Bug Fixes:
* Fixed bugs in standalone notebook functionality.
* Removed telemetry test code.
* Corrected embedding distances in BigQuery.
* Resolved various typos and inconsistencies in the codebase.
This release marks a significant step forward in the development of the Open Data QnA SQL Generation tool, making it more reliable, flexible, and user-friendly. We encourage you to upgrade and explore the new features!
================================================
FILE: docs/config_guide.md
================================================
## Follow the below guide to populate your config.ini file:
______________
**[CONFIG]**
**embedding_model = vertex** *;Options: 'vertex' or 'vertex-lang'*
**description_model = gemini-1.0-pro** *;Options 'gemini-1.0-pro', 'gemini-1.5-pro', 'text-bison-32k', 'gemini-1.5-flash'*
**vector_store = cloudsql-pgvector** *;Options: 'bigquery-vector', 'cloudsql-pgvector'*
**debugging = yes** *;if debugging is enabled. yes or no*
**logging = yes** *;if logging is enabled. yes or no*
**kgq_examples = yes** *;if known-good-queries are provided. yes or no.*
**use_session_history = yes** *;if you want to reuse the current session's questions without re-evaluating them. yes or no*
**use_column_samples = yes** *;if you want the solution to collect some sample values from the data source columns to improve understanding of values. yes or no*
**[GCP]**
**project_id = my_project** *;your GCP project*
*; fill out the values below if you want to use PG as your vector database:*
**[PGCLOUDSQL]**
**pg_region = us-central1**
**pg_instance = pg15-opendataqna**
**pg_database = opendataqna-db**
**pg_user = pguser**
**pg_password = pg123**
*; fill out the values below if you want to use BQ as your vector database:*
**[BIGQUERY]**
*; the remaining values are the settings for the BQ vector store / log dataset and table created by the solution:*
**bq_dataset_region = us-central1**
**bq_opendataqna_dataset_name = opendataqna**
**bq_log_table_name = audit_log_table**
**firestore_region = us-central1** *;region where the Firestore (NoSQL) database is deployed*
________________
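For reference, the file is standard INI syntax and can be read with Python's built-in `configparser`; a minimal sketch:

```python
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

vector_store = config["CONFIG"]["vector_store"]  # e.g. 'cloudsql-pgvector'
project_id = config["GCP"]["project_id"]
# yes/no flags parse cleanly with getboolean():
logging_enabled = config["CONFIG"].getboolean("logging")
```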
================================================
FILE: docs/faq.md
================================================
# Open Data QnA: FAQ
## Source and Vector Store Setup
**Q: If new to the vector store concept, which vector store would you recommend?**
A: Both vector stores (pgvector and BigQuery vector) are created with the embedding model you specify, and both use cosine similarity in vector search to find the nearest matches. You can choose BigQuery vector, as it avoids provisioning an extra resource like Cloud SQL.
## Vector Embeddings and Search
**Q: Why are my example SQLs not being pulled as few-shot examples for the question asked even though the question is almost similar?**
A: Verify that the embedding of the example question completed successfully.
Then check the retrieval SQL used to pull similar SQLs for few-shot examples. If the cosine similarity logic is wrong, that may be the cause of the issue; correct the SQL so it pulls SQLs based on similarity.
## Accuracy and Latency
**Q: How accurate are the results?**
A: Accuracy depends on the context provided to the agents; the richer and more relevant the context, the more accurate the results.
Building blocks such as the known good SQL cache and SQL validation all help improve accuracy.
________
**Q: How is the latency overall?**
A: Ambiguous questions increase latency. If latency is a factor, we suggest adding a caching layer and reducing the validation steps.
V2 also introduces ambiguity resolution.
## Overall Solution
**Q: How do I get started quickly?**
A: The quickest way is to follow the "Quickstart with Open Data QnA: Standalone BigQuery Notebook." It provides a simplified experience using BigQuery. If you need more customization, follow the instructions for setting up the main repository.
________
**Q: Which databases does Open Data QnA currently support?**
A: Currently, it supports Google Cloud SQL for PostgreSQL and Google BigQuery.
________
**Q: What are the requirements to use Open Data QnA?**
A: You'll need:
A Google Cloud Project
An active database (PostgreSQL or BigQuery)
Python 3.9 or higher
Required Python packages (listed in requirements.txt)
________
**Q: Can I customize the behavior of the agents?**
A: Yes, the agents are designed to be modular and extensible. You can modify their code or create your own custom agents.
________
**Q: How do I incorporate my own known good SQL queries into the system?**
A: Follow the setup instructions or use the "3. Loading Known Good SQL Examples" notebook to add your own SQL queries to the vector store. This will improve the accuracy of query generation through RAG.
________
**Q: How do I set the table, column, and example similarity thresholds?**
A: These thresholds are used during the Retrieval-Augmented Generation (RAG) process to determine how similar your query is to the stored embeddings.
- **Table Similarity Threshold:** Determines how closely a user's query needs to match a table name in the vector store to be considered relevant. Higher values make the matching stricter.
- **Column Similarity Threshold:** Similar to the table threshold, but for column names.
- **Example Similarity Threshold:** Controls how closely a user's query needs to match a known good SQL query example to be considered similar.
You can adjust these thresholds when running the pipeline_run function, as sketched below. Start with the default values and experiment to find what works best for your specific data and queries. Generally, start with higher values and gradually decrease them if you're not getting enough relevant results.
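A hedged sketch of what that looks like; the function name comes from this FAQ, and the keyword names and values below are assumptions for illustration, so check `opendataqna.py` for the actual signature:
```python
# Illustrative only: keyword names and threshold values are assumptions.
import asyncio
from opendataqna import pipeline_run  # name per this FAQ; verify in opendataqna.py

result = asyncio.run(pipeline_run(     # assuming the pipeline is async, like the rest of this repo
    user_question="Which city had maximum number of sales?",
    user_grouping="retail",
    table_similarity_threshold=0.3,    # raise for stricter table matching
    column_similarity_threshold=0.3,   # raise for stricter column matching
    example_similarity_threshold=0.75, # how close a cached example must be
))
```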
________
**Q: Can I visualize the results of my queries?**
A: Yes, the VisualizeAgent can generate JavaScript code for Google Charts to create visualizations of your data.
________
**Q: Are all building blocks mandatory?**
A: No. Each building block is modular and can be replaced or removed as needed.
________
**Q: Can this be tested against any database?**
A: Yes. It has been tested against Oracle and Snowflake.
________
**Q: How are the competitors doing?**
A: There are a few LangChain labs, and some teams are experimenting with agents.
________
**Q: I created a test colab with langchain and a simple implementation. Why complicate it?**
A: If your environment is not complex, we would suggest leveraging your simplified approach, or look into the [standalone notebook](/notebooks/(standalone)Run_OpenDataQnA.ipynb).
================================================
FILE: docs/repo_structure.md
================================================
Repository Structure
-------------
```
.
├── agents
│   ├── __init__.py
│   ├── core.py
│   ├── BuildSQLAgent.py
│   ├── DebugSQLAgent.py
│   ├── DescriptionAgent.py
│   ├── EmbedderAgent.py
│   ├── ResponseAgent.py
│   ├── ValidateSQLAgent.py
│   └── VisualizeAgent.py
├── Dockerfile
├── backend-apis
│   ├── __init__.py
│   ├── policy.yaml
│   └── main.py
├── dbconnectors
│   ├── __init__.py
│   ├── core.py
│   ├── PgConnector.py
│   └── BQConnector.py
├── docs
│   ├── best_practices.md
│   ├── faq.md
│   └── repo_structure.md
├── embeddings
│   ├── __init__.py
│   ├── retrieve_embeddings.py
│   ├── store_embeddings.py
│   └── kgq_embeddings.py
├── frontend
├── notebooks
│   ├── 0_CopyDataToBigQuery.ipynb
│   ├── 0_CopyDataToCloudSqlPG.ipynb
│   ├── 1_Setup_OpenDataQnA.ipynb
│   ├── 2_Run_OpenDataQnA.ipynb
│   └── 3_LoadKnownGoodSQL.ipynb
├── scripts
│   ├── tables_columns_descriptions.csv
│   ├── copy_select_table_column_bigquery.csv
│   ├── data_source_list.csv
│   ├── known_good_sql.csv
│   ├── save_config.py
│   └── Scenarios Sample.csv
├── utilities
│   ├── __init__.py
│   └── prompts.yaml
├── pyproject.toml
├── config.ini
├── env_setup.py
└── opendataqna.py
```
- [`/agents`](/agents): Source code for the LLM Agents.
- [`/backend-apis`](/backend-apis/): Cloud Run based API deployment for the frontend to demo the solution on a UI.
- [`/dbconnectors`](/dbconnectors): Source code for database connectors.
- [`/docs`](/docs): Documentation, including the FAQ & Best Practices for using this library.
- [`/embeddings`](/embeddings): Source code for creating and storing embeddings.
- [`/retrieve_embeddings.py`](/embeddings/retrieve_embeddings.py): Source code for retrieving table schema and embedding creation.
- [`/store_embeddings.py`](/embeddings/store_embeddings.py): Source code for storing table schema embeddings in Vector Store.
- [`/kgq_embeddings.py`](/embeddings/kgq_embeddings.py): Source code for loading known good SQLs and creating embeddings in the Vector Store.
- [`/frontend`](/frontend) : Angular based frontend code to deploy demo app using the API developed with [`/main.py`](backend-apis/main.py)
- [`/notebooks`](/notebooks): Sample notebooks demonstrating the usage of this library.
- [`/scripts`](/scripts): Additional scripts for initial setup.
- [`/Sample Scenarios.csv`](/scripts/Scenarios%20Sample.csv): Sample scenarios file that can be used to load scenarios onto the frontend UI for demos.
- [`/copy_select_table_column_bigquery.py`](/scripts/copy_select_table_column_bigquery.py): Code sample to copy selected tables and columns from one BQ table to another, adding table and column descriptions from a CSV file.
- [`/tables_columns_descriptions.csv`](/scripts/tables_columns_descriptions.csv): CSV file containing the table and column names and descriptions to be copied.
- [`/known_good_sql.csv`](/scripts/known_good_sql.csv): CSV file containing known good question-to-SQL pairs used to seed the example embeddings.
- [`/data_source_list.csv`](/scripts/data_source_list.csv): Data source CSV file listing the tables, schemas, and source types to embed.
- [`/Dockerfile`](/Dockerfile): Dockerfile for deployment of the backend APIs. It is placed at the root folder to give it the right context and access to the files.
- [`/env_setup.py`](/env_setup.py): Python file for initial setup.
- [`/opendataqna.py`](/opendataqna.py): Python file for running the main pipeline.
- [`/prompts.yaml`](/prompts.yaml): YAML file that contains the prompts used by the solution. It also lets users add extra context to the prompts for their use case, if any.
================================================
FILE: embeddings/__init__.py
================================================
from .retrieve_embeddings import retrieve_embeddings
from .store_embeddings import store_schema_embeddings
from .kgq_embeddings import store_kgq_embeddings, setup_kgq_table, load_kgq_df
__all__ = ["retrieve_embeddings", "store_schema_embeddings","store_kgq_embeddings", "setup_kgq_table", "load_kgq_df"]
================================================
FILE: embeddings/kgq_embeddings.py
================================================
import os
import asyncio
import asyncpg
import pandas as pd
import numpy as np
from pgvector.asyncpg import register_vector
from google.cloud.sql.connector import Connector
from langchain_community.embeddings import VertexAIEmbeddings
from google.cloud import bigquery
from dbconnectors import pgconnector
from agents import EmbedderAgent
from sqlalchemy.sql import text
from utilities import PROJECT_ID, PG_INSTANCE, PG_DATABASE, PG_USER, PG_PASSWORD, PG_REGION, BQ_OPENDATAQNA_DATASET_NAME, BQ_REGION
embedder = EmbedderAgent('vertex')
async def setup_kgq_table( project_id,
instance_name,
database_name,
schema,
database_user,
database_password,
region,
VECTOR_STORE = "cloudsql-pgvector"):
"""
This function sets up or refreshes the Vector Store for Known Good Queries (KGQ)
"""
if VECTOR_STORE=='bigquery-vector':
# Create BQ Client
client=bigquery.Client(project=project_id)
# Delete an old table
# client.query_and_wait(f'''DROP TABLE IF EXISTS `{project_id}.{schema}.example_prompt_sql_embeddings`''')
        # Create a new empty table
        client.query_and_wait(f'''CREATE TABLE IF NOT EXISTS `{project_id}.{schema}.example_prompt_sql_embeddings` (
                            user_grouping string NOT NULL, example_user_question string NOT NULL, example_generated_sql string NOT NULL,
                            embedding ARRAY<FLOAT64>)''')
elif VECTOR_STORE=='cloudsql-pgvector':
loop = asyncio.get_running_loop()
async with Connector(loop=loop) as connector:
# Create connection to Cloud SQL database
conn: asyncpg.Connection = await connector.connect_async(
f"{project_id}:{region}:{instance_name}", # Cloud SQL instance connection name
"asyncpg",
user=f"{database_user}",
password=f"{database_password}",
db=f"{database_name}",
)
            # Drop any old table
            # await conn.execute("DROP TABLE IF EXISTS example_prompt_sql_embeddings")
            # Create a new empty table
await conn.execute(
"""CREATE TABLE IF NOT EXISTS example_prompt_sql_embeddings(
user_grouping VARCHAR(1024) NOT NULL,
example_user_question text NOT NULL,
example_generated_sql text NOT NULL,
embedding vector(768))"""
)
else: raise ValueError("Not a valid parameter for a vector store.")
async def store_kgq_embeddings(df_kgq,
project_id,
instance_name,
database_name,
schema,
database_user,
database_password,
region,
VECTOR_STORE = "cloudsql-pgvector"
):
"""
Create and save the Known Good Query Embeddings to Vector Store
"""
if VECTOR_STORE=='bigquery-vector':
client=bigquery.Client(project=project_id)
example_sql_details_chunked = []
for _, row_aug in df_kgq.iterrows():
example_user_question = str(row_aug['prompt'])
example_generated_sql = str(row_aug['sql'])
example_grouping = str(row_aug['user_grouping'])
emb = embedder.create(example_user_question)
r = {"example_grouping":example_grouping,"example_user_question": example_user_question,"example_generated_sql": example_generated_sql,"embedding": emb}
example_sql_details_chunked.append(r)
example_prompt_sql_embeddings = pd.DataFrame(example_sql_details_chunked)
        client.query_and_wait(f'''CREATE TABLE IF NOT EXISTS `{project_id}.{schema}.example_prompt_sql_embeddings` (
                            user_grouping string NOT NULL, example_user_question string NOT NULL, example_generated_sql string NOT NULL,
                            embedding ARRAY<FLOAT64>)''')
for _, row in example_prompt_sql_embeddings.iterrows():
client.query_and_wait(f'''DELETE FROM `{project_id}.{schema}.example_prompt_sql_embeddings`
WHERE user_grouping= '{row["example_grouping"]}' and example_user_question= "{row["example_user_question"]}" '''
)
# embedding=np.array(row["embedding"])
cleaned_sql = row["example_generated_sql"].replace("\r", " ").replace("\n", " ")
client.query_and_wait(f'''INSERT INTO `{project_id}.{schema}.example_prompt_sql_embeddings`
VALUES ("{row["example_grouping"]}","{row["example_user_question"]}" ,
"{cleaned_sql}",{row["embedding"]} )''')
elif VECTOR_STORE=='cloudsql-pgvector':
loop = asyncio.get_running_loop()
async with Connector(loop=loop) as connector:
# Create connection to Cloud SQL database
conn: asyncpg.Connection = await connector.connect_async(
f"{project_id}:{region}:{instance_name}", # Cloud SQL instance connection name
"asyncpg",
user=f"{database_user}",
password=f"{database_password}",
db=f"{database_name}",
)
example_sql_details_chunked = []
for _, row_aug in df_kgq.iterrows():
example_user_question = str(row_aug['prompt'])
example_generated_sql = str(row_aug['sql'])
example_grouping = str(row_aug['user_grouping'])
emb = embedder.create(example_user_question)
r = {"example_grouping":example_grouping,"example_user_question": example_user_question,"example_generated_sql": example_generated_sql,"embedding": emb}
example_sql_details_chunked.append(r)
example_prompt_sql_embeddings = pd.DataFrame(example_sql_details_chunked)
for _, row in example_prompt_sql_embeddings.iterrows():
await conn.execute(
"DELETE FROM example_prompt_sql_embeddings WHERE user_grouping= $1 and example_user_question=$2",
row["example_grouping"],
row["example_user_question"])
cleaned_sql = row["example_generated_sql"].replace("\r", " ").replace("\n", " ")
await conn.execute(
"INSERT INTO example_prompt_sql_embeddings (user_grouping, example_user_question, example_generated_sql, embedding) VALUES ($1, $2, $3, $4)",
row["example_grouping"],
row["example_user_question"],
cleaned_sql,
str(row["embedding"]),
)
await conn.close()
else: raise ValueError("Not a valid parameter for a vector store.")
def load_kgq_df():
import pandas as pd
def is_root_dir():
current_dir = os.getcwd()
notebooks_path = os.path.join(current_dir, "notebooks")
agents_path = os.path.join(current_dir, "agents")
return os.path.exists(notebooks_path) or os.path.exists(agents_path)
if is_root_dir():
current_dir = os.getcwd()
root_dir = current_dir
else:
root_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
file_path = root_dir + "/scripts/known_good_sql.csv"
# Load the file
df_kgq = pd.read_csv(file_path)
df_kgq = df_kgq.loc[:, ["prompt", "sql", "user_grouping"]]
df_kgq = df_kgq.dropna()
return df_kgq
if __name__ == '__main__':
from utilities import PROJECT_ID, PG_INSTANCE, PG_DATABASE, PG_USER, PG_PASSWORD, PG_REGION
VECTOR_STORE = "cloudsql-pgvector"
current_dir = os.getcwd()
root_dir = os.path.expanduser('~') # Start at the user's home directory
while current_dir != root_dir:
for dirpath, dirnames, filenames in os.walk(current_dir):
config_path = os.path.join(dirpath, 'known_good_sql.csv')
if os.path.exists(config_path):
file_path = config_path # Update root_dir to the found directory
break # Stop outer loop once found
current_dir = os.path.dirname(current_dir)
print("Known Good SQL Found at Path :: "+file_path)
# Load the file
df_kgq = pd.read_csv(file_path)
    df_kgq = df_kgq.loc[:, ["prompt", "sql", "user_grouping"]]  # store_kgq_embeddings expects a user_grouping column
df_kgq = df_kgq.dropna()
print('Known Good SQLs Loaded into a Dataframe')
    asyncio.run(setup_kgq_table(PROJECT_ID,
                                PG_INSTANCE,
                                PG_DATABASE,
                                None,  # schema is only used by the BigQuery vector store
                                PG_USER,
                                PG_PASSWORD,
                                PG_REGION,
                                VECTOR_STORE))
    asyncio.run(store_kgq_embeddings(df_kgq,
                                     PROJECT_ID,
                                     PG_INSTANCE,
                                     PG_DATABASE,
                                     None,  # schema is only used by the BigQuery vector store
                                     PG_USER,
                                     PG_PASSWORD,
                                     PG_REGION,
                                     VECTOR_STORE))
================================================
FILE: embeddings/retrieve_embeddings.py
================================================
import re
import io
import sys
import pandas as pd
from dbconnectors import pgconnector,bqconnector
from agents import EmbedderAgent, ResponseAgent, DescriptionAgent
from utilities import EMBEDDING_MODEL, DESCRIPTION_MODEL, USE_COLUMN_SAMPLES
embedder = EmbedderAgent(EMBEDDING_MODEL)
# responder = ResponseAgent('gemini-1.0-pro')
descriptor = DescriptionAgent(DESCRIPTION_MODEL)
def get_embedding_chunked(textinput, batch_size):
for i in range(0, len(textinput), batch_size):
request = [x["content"] for x in textinput[i : i + batch_size]]
response = embedder.create(request) # Vertex Textmodel Embedder
# Store the retrieved vector embeddings for each chunk back.
for x, e in zip(textinput[i : i + batch_size], response):
x["embedding"] = e
# Store the generated embeddings in a pandas dataframe.
out_df = pd.DataFrame(textinput)
return out_df
def retrieve_embeddings(SOURCE, SCHEMA="public", table_names = None):
""" Augment all the DB schema blocks to create document for embedding """
if SOURCE == "cloudsql-pg":
table_schema_sql = pgconnector.return_table_schema_sql(SCHEMA,table_names=table_names)
table_desc_df = pgconnector.retrieve_df(table_schema_sql)
column_schema_sql = pgconnector.return_column_schema_sql(SCHEMA,table_names=table_names)
column_name_df = pgconnector.retrieve_df(column_schema_sql)
#GENERATE MISSING DESCRIPTIONS
table_desc_df,column_name_df= descriptor.generate_missing_descriptions(SOURCE,table_desc_df,column_name_df)
#ADD SAMPLES VALUES FOR COLUMNS
column_name_df["sample_values"]=None
if USE_COLUMN_SAMPLES:
column_name_df = pgconnector.get_column_samples(column_name_df)
### TABLE EMBEDDING ###
"""
This SQL returns a df containing the cols table_schema, table_name, table_description, table_columns (with cols in the table)
for the schema specified above, e.g. 'retail'
"""
table_details_chunked = []
for index_aug, row_aug in table_desc_df.iterrows():
cur_table_name = str(row_aug['table_name'])
cur_table_schema = str(row_aug['table_schema'])
curr_col_names = str(row_aug['table_columns'])
curr_tbl_desc = str(row_aug['table_description'])
table_detailed_description=f"""
Table Name: {cur_table_name} |
Schema Name: {cur_table_schema} |
            Table Description - {curr_tbl_desc} |
Columns List: [{curr_col_names}]"""
r = {"table_schema": cur_table_schema,"table_name": cur_table_name,"content": table_detailed_description}
table_details_chunked.append(r)
table_details_embeddings = get_embedding_chunked(table_details_chunked, 10)
### COLUMN EMBEDDING ###
"""
This SQL returns a df containing the cols table_schema, table_name, column_name, data_type, column_description, table_description, primary_key, column_constraints
for the schema specified above, e.g. 'retail'
"""
column_details_chunked = []
for index_aug, row_aug in column_name_df.iterrows():
cur_table_name = str(row_aug['table_name'])
cur_table_owner = str(row_aug['table_schema'])
curr_col_name = str(row_aug['table_schema'])+'.'+str(row_aug['table_name'])+'.'+str(row_aug['column_name'])
curr_col_datatype = str(row_aug['data_type'])
curr_col_description = str(row_aug['column_description'])
curr_col_constraints = str(row_aug['column_constraints'])
curr_column_name = str(row_aug['column_name'])
curr_column_samples = str(row_aug['sample_values'])
column_detailed_description=f"""Schema Name:{cur_table_owner} | Column Name: {curr_col_name} (Data type: {curr_col_datatype}) | Table Name: {cur_table_name} | (column description: {curr_col_description})(constraints: {curr_col_constraints}) | (Sample Values in the Column: {curr_column_samples})"""
r = {"table_schema": cur_table_owner,"table_name": cur_table_name,"column_name":curr_column_name, "content": column_detailed_description}
column_details_chunked.append(r)
column_details_embeddings = get_embedding_chunked(column_details_chunked, 10)
elif SOURCE=='bigquery':
table_schema_sql = bqconnector.return_table_schema_sql(SCHEMA, table_names=table_names)
table_desc_df = bqconnector.retrieve_df(table_schema_sql)
column_schema_sql = bqconnector.return_column_schema_sql(SCHEMA, table_names=table_names)
column_name_df = bqconnector.retrieve_df(column_schema_sql)
#GENERATE MISSING DESCRIPTIONS
table_desc_df,column_name_df= descriptor.generate_missing_descriptions(SOURCE,table_desc_df,column_name_df)
#ADD SAMPLES VALUES FOR COLUMNS
column_name_df["sample_values"]=None
if USE_COLUMN_SAMPLES:
column_name_df = bqconnector.get_column_samples(column_name_df)
#TABLE EMBEDDINGS
table_details_chunked = []
for index_aug, row_aug in table_desc_df.iterrows():
cur_project_name =str(row_aug['project_id'])
cur_table_name = str(row_aug['table_name'])
cur_table_schema = str(row_aug['table_schema'])
curr_col_names = str(row_aug['table_columns'])
curr_tbl_desc = str(row_aug['table_description'])
table_detailed_description=f"""
Full Table Name : {cur_project_name}.{cur_table_schema}.{cur_table_name} |
Table Columns List: [{curr_col_names}] |
Table Description: {curr_tbl_desc} """
r = {"table_schema": cur_table_schema,"table_name": cur_table_name,"content": table_detailed_description}
table_details_chunked.append(r)
table_details_embeddings = get_embedding_chunked(table_details_chunked, 10)
### COLUMN EMBEDDING ###
"""
This SQL returns a df containing the cols table_schema, table_name, column_name, data_type, column_description, table_description, primary_key, column_constraints
for the schema specified above, e.g. 'retail'
"""
column_details_chunked = []
for index_aug, row_aug in column_name_df.iterrows():
cur_project_name =str(row_aug['project_id'])
cur_table_name = str(row_aug['table_name'])
cur_table_owner = str(row_aug['table_schema'])
curr_col_name = str(row_aug['table_schema'])+'.'+str(row_aug['table_name'])+'.'+str(row_aug['column_name'])
curr_col_datatype = str(row_aug['data_type'])
curr_col_description = str(row_aug['column_description'])
curr_col_constraints = str(row_aug['column_constraints'])
curr_column_name = str(row_aug['column_name'])
curr_column_samples = str(row_aug['sample_values'])
column_detailed_description=f"""
Column Name: {curr_col_name}|
Full Table Name : {cur_project_name}.{cur_table_schema}.{cur_table_name} |
Data type: {curr_col_datatype}|
Column description: {curr_col_description}|
Column Constraints: {curr_col_constraints}|
Sample Values in the Column : {curr_column_samples}"""
r = {"table_schema": cur_table_owner,"table_name": cur_table_name,"column_name":curr_column_name, "content": column_detailed_description}
column_details_chunked.append(r)
column_details_embeddings = get_embedding_chunked(column_details_chunked, 10)
return table_details_embeddings, column_details_embeddings
if __name__ == '__main__':
SOURCE = 'cloudsql-pg'
t, c = retrieve_embeddings(SOURCE, SCHEMA="public")
================================================
FILE: embeddings/store_embeddings.py
================================================
import asyncio
import asyncpg
import pandas as pd
import numpy as np
from pgvector.asyncpg import register_vector
from google.cloud.sql.connector import Connector
from langchain_community.embeddings import VertexAIEmbeddings
from google.cloud import bigquery
from dbconnectors import pgconnector
from agents import EmbedderAgent
from sqlalchemy.sql import text
from utilities import VECTOR_STORE, PROJECT_ID, PG_INSTANCE, PG_DATABASE, PG_USER, PG_PASSWORD, PG_REGION, BQ_OPENDATAQNA_DATASET_NAME, BQ_REGION, EMBEDDING_MODEL
embedder = EmbedderAgent(EMBEDDING_MODEL)
async def store_schema_embeddings(table_details_embeddings,
tablecolumn_details_embeddings,
project_id,
instance_name,
database_name,
schema,
database_user,
database_password,
region,
VECTOR_STORE):
"""
Store the vectorised table and column details in the DB table.
This code may run for a few minutes.
"""
if VECTOR_STORE == "cloudsql-pgvector":
loop = asyncio.get_running_loop()
async with Connector(loop=loop) as connector:
# Create connection to Cloud SQL database.
conn: asyncpg.Connection = await connector.connect_async(
f"{project_id}:{region}:{instance_name}", # Cloud SQL instance connection name
"asyncpg",
user=f"{database_user}",
password=f"{database_password}",
db=f"{database_name}",
)
await conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
await register_vector(conn)
# await conn.execute(f"DROP SCHEMA IF EXISTS {pg_schema} CASCADE")
# await conn.execute(f"CREATE SCHEMA {pg_schema}")
# await conn.execute("DROP TABLE IF EXISTS table_details_embeddings")
# Create the `table_details_embeddings` table to store vector embeddings.
await conn.execute(
"""CREATE TABLE IF NOT EXISTS table_details_embeddings(
source_type VARCHAR(100) NOT NULL,
user_grouping VARCHAR(100) NOT NULL,
table_schema VARCHAR(1024) NOT NULL,
table_name VARCHAR(1024) NOT NULL,
content TEXT,
embedding vector(768))"""
)
# Store all the generated embeddings back into the database.
for index, row in table_details_embeddings.iterrows():
await conn.execute(
f"""
MERGE INTO table_details_embeddings AS target
USING (SELECT $1::text AS source_type, $2::text AS user_grouping, $3::text AS table_schema, $4::text AS table_name, $5::text AS content, $6::vector AS embedding) AS source
ON target.user_grouping = source.user_grouping AND target.table_name = source.table_name
WHEN MATCHED THEN
UPDATE SET source_type = source.source_type, table_schema = source.table_schema, content = source.content, embedding = source.embedding
WHEN NOT MATCHED THEN
INSERT (source_type, user_grouping, table_schema, table_name, content, embedding)
VALUES (source.source_type, source.user_grouping, source.table_schema, source.table_name, source.content, source.embedding);
""",
row["source_type"],
row["user_grouping"],
row["table_schema"],
row["table_name"],
row["content"],
np.array(row["embedding"]),
)
# await conn.execute("DROP TABLE IF EXISTS tablecolumn_details_embeddings")
            # Create the `tablecolumn_details_embeddings` table to store vector embeddings.
await conn.execute(
"""CREATE TABLE IF NOT EXISTS tablecolumn_details_embeddings(
source_type VARCHAR(100) NOT NULL,
user_grouping VARCHAR(100) NOT NULL,
table_schema VARCHAR(1024) NOT NULL,
table_name VARCHAR(1024) NOT NULL,
column_name VARCHAR(1024) NOT NULL,
content TEXT,
embedding vector(768))"""
)
# Store all the generated embeddings back into the database.
for index, row in tablecolumn_details_embeddings.iterrows():
await conn.execute(
f"""
MERGE INTO tablecolumn_details_embeddings AS target
USING (SELECT $1::text AS source_type, $2::text AS user_grouping, $3::text AS table_schema,
$4::text AS table_name, $5::text AS column_name, $6::text AS content, $7::vector AS embedding) AS source
ON target.user_grouping = source.user_grouping
AND target.table_name = source.table_name
AND target.column_name = source.column_name
WHEN MATCHED THEN
UPDATE SET source_type = source.source_type, table_schema = source.table_schema, content = source.content, embedding = source.embedding
WHEN NOT MATCHED THEN
INSERT (source_type, user_grouping, table_schema, table_name, column_name, content, embedding)
VALUES (source.source_type, source.user_grouping, source.table_schema, source.table_name, source.column_name, source.content, source.embedding);
""",
row["source_type"],
row["user_grouping"],
row["table_schema"],
row["table_name"],
row["column_name"],
row["content"],
np.array(row["embedding"]),
)
await conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
await register_vector(conn)
# await conn.execute("DROP TABLE IF EXISTS example_prompt_sql_embeddings")
await conn.execute(
"""CREATE TABLE IF NOT EXISTS example_prompt_sql_embeddings(
user_grouping VARCHAR(1024) NOT NULL,
example_user_question text NOT NULL,
example_generated_sql text NOT NULL,
embedding vector(768))"""
)
await conn.close()
elif VECTOR_STORE == "bigquery-vector":
client=bigquery.Client(project=project_id)
#Store table embeddings
        client.query_and_wait(f'''CREATE TABLE IF NOT EXISTS `{project_id}.{schema}.table_details_embeddings` (
                            source_type string NOT NULL, user_grouping string NOT NULL, table_schema string NOT NULL, table_name string NOT NULL, content string, embedding ARRAY<FLOAT64>)''')
#job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
delete_conditions = table_details_embeddings[['user_grouping', 'table_name']].apply(tuple, axis=1).tolist()
where_clause = " OR ".join([f"(user_grouping = '{cond[0]}' AND table_name = '{cond[1]}')" for cond in delete_conditions])
delete_query = f"""
DELETE FROM `{project_id}.{schema}.table_details_embeddings`
WHERE {where_clause}
"""
client.query_and_wait(delete_query)
client.load_table_from_dataframe(table_details_embeddings,f'{project_id}.{schema}.table_details_embeddings')
#Store column embeddings
        client.query_and_wait(f'''CREATE TABLE IF NOT EXISTS `{project_id}.{schema}.tablecolumn_details_embeddings` (
                            source_type string NOT NULL, user_grouping string NOT NULL, table_schema string NOT NULL, table_name string NOT NULL, column_name string NOT NULL,
                            content string, embedding ARRAY<FLOAT64>)''')
#job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
delete_conditions = tablecolumn_details_embeddings[['user_grouping', 'table_name', 'column_name']].apply(tuple, axis=1).tolist()
where_clause = " OR ".join([f"(user_grouping = '{cond[0]}' AND table_name = '{cond[1]}' AND column_name = '{cond[2]}')" for cond in delete_conditions])
delete_query = f"""
DELETE FROM `{project_id}.{schema}.tablecolumn_details_embeddings`
WHERE {where_clause}
"""
client.query_and_wait(delete_query)
client.load_table_from_dataframe(tablecolumn_details_embeddings,f'{project_id}.{schema}.tablecolumn_details_embeddings')
        client.query_and_wait(f'''CREATE TABLE IF NOT EXISTS `{project_id}.{schema}.example_prompt_sql_embeddings` (
                            user_grouping string NOT NULL, example_user_question string NOT NULL, example_generated_sql string NOT NULL,
                            embedding ARRAY<FLOAT64>)''')
else: raise ValueError("Please provide a valid Vector Store.")
return "Embeddings are stored successfully"
async def add_sql_embedding(user_question, generated_sql, database):
emb=embedder.create(user_question)
if VECTOR_STORE == "cloudsql-pgvector":
# sql= f'''MERGE INTO example_prompt_sql_embeddings as tgt
# using (SELECT '{user_question}' as example_user_question) as src
# on tgt.example_user_question=src.example_user_question
# when not matched then
# insert (table_schema, example_user_question,example_generated_sql,embedding)
# values('{database}','{user_question}','{generated_sql}','{(emb)}')
# when matched then update set
# table_schema = '{database}',
# example_generated_sql = '{generated_sql}',
# embedding = '{(emb)}' '''
# # print(sql)
# conn=pgconnector.pool.connect()
# await conn.execute(text(sql))
# pgconnector.retrieve_df(sql)
loop = asyncio.get_running_loop()
async with Connector(loop=loop) as connector:
# Create connection to Cloud SQL database.
conn: asyncpg.Connection = await connector.connect_async(
f"{PROJECT_ID}:{PG_REGION}:{PG_INSTANCE}", # Cloud SQL instance connection name
"asyncpg",
user=f"{PG_USER}",
password=f"{PG_PASSWORD}",
db=f"{PG_DATABASE}",
)
await conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
await register_vector(conn)
await conn.execute("DELETE FROM example_prompt_sql_embeddings WHERE user_grouping= $1 and example_user_question=$2",
database,
user_question)
cleaned_sql =generated_sql.replace("\r", " ").replace("\n", " ")
await conn.execute(
"INSERT INTO example_prompt_sql_embeddings (user_grouping, example_user_question, example_generated_sql, embedding) VALUES ($1, $2, $3, $4)",
database,
user_question,
cleaned_sql,
np.array(emb),
)
elif VECTOR_STORE == "bigquery-vector":
client=bigquery.Client(project=PROJECT_ID)
        client.query_and_wait(f'''CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{BQ_OPENDATAQNA_DATASET_NAME}.example_prompt_sql_embeddings` (
                            user_grouping string NOT NULL, example_user_question string NOT NULL, example_generated_sql string NOT NULL,
                            embedding ARRAY<FLOAT64>)''')
client.query_and_wait(f'''DELETE FROM `{PROJECT_ID}.{BQ_OPENDATAQNA_DATASET_NAME}.example_prompt_sql_embeddings`
WHERE user_grouping= '{database}' and example_user_question= "{user_question}" '''
)
# embedding=np.array(row["embedding"])
cleaned_sql = generated_sql.replace("\r", " ").replace("\n", " ")
client.query_and_wait(f'''INSERT INTO `{PROJECT_ID}.{BQ_OPENDATAQNA_DATASET_NAME}.example_prompt_sql_embeddings`
VALUES ("{database}","{user_question}" ,
"{cleaned_sql}",{emb})''')
return 1
if __name__ == '__main__':
from retrieve_embeddings import retrieve_embeddings
from utilities import PG_SCHEMA, PROJECT_ID, PG_INSTANCE, PG_DATABASE, PG_USER, PG_PASSWORD, PG_REGION
VECTOR_STORE = "cloudsql-pgvector"
    t, c = retrieve_embeddings('cloudsql-pg', SCHEMA=PG_SCHEMA)  # first argument is the data source, not the vector store
asyncio.run(store_schema_embeddings(t,
c,
PROJECT_ID,
PG_INSTANCE,
PG_DATABASE,
PG_SCHEMA,
PG_USER,
PG_PASSWORD,
PG_REGION,
VECTOR_STORE = VECTOR_STORE))
================================================
FILE: env_setup.py
================================================
import asyncio
from google.cloud import bigquery
import google.api_core
from embeddings import retrieve_embeddings, store_schema_embeddings, setup_kgq_table, load_kgq_df, store_kgq_embeddings
from utilities import ( PG_REGION, PG_INSTANCE, PG_DATABASE, PG_USER, PG_PASSWORD,
BQ_REGION,
EXAMPLES, LOGGING, VECTOR_STORE, PROJECT_ID,
BQ_OPENDATAQNA_DATASET_NAME,FIRESTORE_REGION)
import subprocess
import time
if VECTOR_STORE == 'bigquery-vector':
DATASET_REGION = BQ_REGION
elif VECTOR_STORE == 'cloudsql-pgvector':
DATASET_REGION = PG_REGION
def setup_postgresql(pg_instance, pg_region, pg_database, pg_user, pg_password):
"""Sets up a PostgreSQL Cloud SQL instance with a database and user.
Args:
pg_instance (str): Name of the Cloud SQL instance.
pg_region (str): Region where the instance should be located.
pg_database (str): Name of the database to create.
pg_user (str): Name of the user to create.
pg_password (str): Password for the user.
"""
# Check if Cloud SQL instance exists
describe_cmd = ["gcloud", "sql", "instances", "describe", pg_instance, "--format=value(databaseVersion)"]
describe_process = subprocess.run(describe_cmd, capture_output=True, text=True)
if describe_process.returncode == 0:
if describe_process.stdout.startswith("POSTGRES"):
print("Found existing Postgres Cloud SQL Instance!")
else:
raise RuntimeError("Existing Cloud SQL instance is not PostgreSQL")
else:
print("Creating new Cloud SQL instance...")
create_cmd = [
"gcloud", "sql", "instances", "create", pg_instance,
"--database-version=POSTGRES_15", "--region", pg_region,
"--cpu=1", "--memory=4GB", "--root-password", pg_password,
"--database-flags=cloudsql.iam_authentication=On"
]
subprocess.run(create_cmd, check=True) # Raises an exception if creation fails
# Wait for instance to be ready
print("Waiting for instance to be ready...")
time.sleep(9999) # You might need to adjust this depending on how long it takes
# Create the database
list_cmd = ["gcloud", "sql", "databases", "list", "--instance", pg_instance]
list_process = subprocess.run(list_cmd, capture_output=True, text=True)
if pg_database in list_process.stdout:
print("Found existing Postgres Cloud SQL database!")
else:
print("Creating new Cloud SQL database...")
create_db_cmd = ["gcloud", "sql", "databases", "create", pg_database, "--instance", pg_instance]
subprocess.run(create_db_cmd, check=True)
# Create the user
create_user_cmd = [
"gcloud", "sql", "users", "create", pg_user,
"--instance", pg_instance, "--password", pg_password
]
subprocess.run(create_user_cmd, check=True)
print(f"PG Database {pg_database} in instance {pg_instance} is ready.")
def create_vector_store():
"""
Initializes the environment and sets up the vector store for Open Data QnA.
This function performs the following steps:
1. Loads configurations from the "config.ini" file.
2. Determines the data source (BigQuery or CloudSQL PostgreSQL) and sets the dataset region accordingly.
3. If the vector store is "cloudsql-pgvector" and the data source is not CloudSQL PostgreSQL, it creates a new PostgreSQL dataset for the vector store.
4. If logging is enabled or the vector store is "bigquery-vector", it creates a BigQuery dataset for the vector store and logging table.
5. It creates a Vertex AI connection for the specified model and embeds the table schemas and columns into the vector database.
    6. If embeddings are stored in BigQuery, creates a table tablecolumn_details_embeddings in the BigQuery Dataset.
7. It generates the embeddings for the table schemas and column descriptions, and then inserts those embeddings into the BigQuery table.
Configuration:
- Requires the following environment variables to be set in "config.ini":
- `DATA_SOURCE`: The data source (e.g., "bigquery" or "cloudsql-pg").
- `VECTOR_STORE`: The type of vector store (e.g., "bigquery-vector" or "cloudsql-pgvector").
- `BQ_REGION`: The BigQuery region.
- `PROJECT_ID`: The Google Cloud project ID.
- `BQ_OPENDATAQNA_DATASET_NAME`: The name of the BigQuery dataset for Open Data QnA.
- `LOGGING`: Whether logging is enabled.
- If `VECTOR_STORE` is "cloudsql-pgvector" and `DATA_SOURCE` is not "cloudsql-pg":
- Requires additional environment variables for PostgreSQL instance setup.
Returns:
None
Raises:
RuntimeError: If there are errors during the setup process (e.g., dataset creation failure).
"""
print("Initializing environment setup.")
print("Loading configurations from config.ini file.")
print("Vector Store source set to: ", VECTOR_STORE)
    # Create a PostgreSQL instance if the data source is not already a PostgreSQL instance
if VECTOR_STORE == 'cloudsql-pgvector' :
print("Generating pg dataset for vector store.")
# Parameters for PostgreSQL Instance
pg_region = DATASET_REGION
pg_instance = "pg15-opendataqna"
pg_database = "opendataqna-db"
pg_user = "pguser"
pg_password = "pg123"
pg_schema = 'pg-vector-store'
setup_postgresql(pg_instance, pg_region, pg_database, pg_user, pg_password)
# Create a new data set on Bigquery to use for the logs table
if LOGGING or VECTOR_STORE == 'bigquery-vector':
if LOGGING:
print("Logging is enabled")
if VECTOR_STORE == 'bigquery-vector':
print("Vector store set to 'bigquery-vector'")
print(f"Generating Big Query dataset {BQ_OPENDATAQNA_DATASET_NAME}")
client=bigquery.Client(project=PROJECT_ID)
dataset_ref = f"{PROJECT_ID}.{BQ_OPENDATAQNA_DATASET_NAME}"
# Create the dataset if it does not exist already
try:
client.get_dataset(dataset_ref)
print("Destination Dataset exists")
except google.api_core.exceptions.NotFound:
print("Cannot find the dataset hence creating.......")
dataset=bigquery.Dataset(dataset_ref)
dataset.location=DATASET_REGION
client.create_dataset(dataset)
print(str(dataset_ref)+" is created")
def get_embeddings():
"""Generates and returns embeddings for table schemas and column descriptions.
This function performs the following steps:
1. Retrieves table schema and column description data based on the specified data source (BigQuery or PostgreSQL).
2. Generates embeddings for the retrieved data using the configured embedding model.
3. Returns the generated embeddings for both tables and columns.
Returns:
Tuple[pd.DataFrame, pd.DataFrame]: A tuple containing two pandas DataFrames:
- table_schema_embeddings: Embeddings for the table schemas.
- col_schema_embeddings: Embeddings for the column descriptions.
Configuration:
This function relies on the following configuration variables:
- DATA_SOURCE: The source database ("bigquery" or "cloudsql-pg").
- BQ_DATASET_NAME (if DATA_SOURCE is "bigquery"): The BigQuery dataset name.
- BQ_TABLE_LIST (if DATA_SOURCE is "bigquery"): The list of BigQuery tables to process.
- PG_SCHEMA (if DATA_SOURCE is "cloudsql-pg"): The PostgreSQL schema name.
"""
print("Generating embeddings from source db schemas")
import pandas as pd
import os
current_dir = os.getcwd()
root_dir = os.path.expanduser('~') # Start at the user's home directory
while current_dir != root_dir:
for dirpath, dirnames, filenames in os.walk(current_dir):
config_path = os.path.join(dirpath, 'data_source_list.csv')
if os.path.exists(config_path):
file_path = config_path # Update root_dir to the found directory
break # Stop outer loop once found
current_dir = os.path.dirname(current_dir)
print("Source Found at Path :: "+file_path)
# Load the file
df_src = pd.read_csv(file_path)
df_src = df_src.loc[:, ["source", "user_grouping", "schema","table"]]
df_src = df_src.sort_values(by=["source","user_grouping","schema","table"])
#If no schema Error Out
if df_src['schema'].astype(str).str.len().min()==0 or df_src['schema'].isna().any():
raise ValueError("Schema column cannot be empty")
#Group by for all the tables filtered
df=df_src.groupby(['source','schema'])['table'].agg(lambda x: list(x.dropna().unique())).reset_index()
df['table']=df['table'].apply(lambda x: None if pd.isna(x).any() else x)
print("The Embeddings are extracted for the below combinations")
print(df)
table_schema_embeddings=pd.DataFrame(columns=['source_type','join_by','table_schema', 'table_name', 'content','embedding'])
col_schema_embeddings=pd.DataFrame(columns=['source_type','join_by','table_schema', 'table_name', 'column_name', 'content','embedding'])
for _, row in df.iterrows():
DATA_SOURCE = row['source']
SCHEMA = row['schema']
TABLE_LIST = row['table']
_t, _c = retrieve_embeddings(DATA_SOURCE, SCHEMA=SCHEMA, table_names=TABLE_LIST)
_t["source_type"]=DATA_SOURCE
_c["source_type"]=DATA_SOURCE
if not TABLE_LIST:
_t["join_by"]=DATA_SOURCE+"_"+SCHEMA+"_"+SCHEMA
_c["join_by"]=DATA_SOURCE+"_"+SCHEMA+"_"+SCHEMA
table_schema_embeddings = pd.concat([table_schema_embeddings,_t],ignore_index=True)
col_schema_embeddings = pd.concat([col_schema_embeddings,_c],ignore_index=True)
df_src['join_by'] = df_src.apply(
lambda row: f"{row['source']}_{row['schema']}_{row['schema']}" if pd.isna(row['table']) else f"{row['source']}_{row['schema']}_{row['table']}",axis=1)
table_schema_embeddings['join_by'] = table_schema_embeddings['join_by'].fillna(table_schema_embeddings['source_type'] + "_" + table_schema_embeddings['table_schema'] + "_" + table_schema_embeddings['table_name'])
col_schema_embeddings['join_by'] = col_schema_embeddings['join_by'].fillna(col_schema_embeddings['source_type'] + "_" + col_schema_embeddings['table_schema'] + "_" + col_schema_embeddings['table_name'])
table_schema_embeddings = table_schema_embeddings.merge(df_src[['join_by', 'user_grouping']], on='join_by', how='left')
table_schema_embeddings.drop(columns=["join_by"],inplace=True)
#Replace NaN values in group to default to the schema
table_schema_embeddings['user_grouping'] = table_schema_embeddings['user_grouping'].fillna(table_schema_embeddings['table_schema']+"-"+table_schema_embeddings['source_type'])
col_schema_embeddings = col_schema_embeddings.merge(df_src[['join_by', 'user_grouping']], on='join_by', how='left')
col_schema_embeddings.drop(columns=["join_by"],inplace=True)
#Replace NaN values in group to default to the schema
col_schema_embeddings['user_grouping'] = col_schema_embeddings['user_grouping'].fillna(col_schema_embeddings['table_schema']+"-"+col_schema_embeddings['source_type'])
print("Table and Column embeddings are created")
return table_schema_embeddings, col_schema_embeddings
async def store_embeddings(table_schema_embeddings, col_schema_embeddings):
"""
Stores table and column embeddings into the specified vector store.
This asynchronous function saves precomputed embeddings for table schemas and column descriptions
into either BigQuery or PostgreSQL (with pgvector extension) based on the VECTOR_STORE configuration.
Args:
table_schema_embeddings (pd.DataFrame): Embeddings for the table schemas.
col_schema_embeddings (pd.DataFrame): Embeddings for the column descriptions.
Configuration:
This function relies on the following configuration variables:
- VECTOR_STORE: Determines the target vector store ("bigquery-vector" or "cloudsql-pgvector").
- PROJECT_ID, BQ_REGION, BQ_OPENDATAQNA_DATASET_NAME (if VECTOR_STORE is "bigquery-vector"):
Configuration for BigQuery storage.
- PG_INSTANCE, PG_DATABASE, PG_USER, PG_PASSWORD, PG_REGION (if VECTOR_STORE is "cloudsql-pgvector"):
Configuration for PostgreSQL storage.
Returns:
None
"""
print("Storing embeddings back to the vector store.")
if VECTOR_STORE=='bigquery-vector':
await(store_schema_embeddings(table_details_embeddings=table_schema_embeddings,
tablecolumn_details_embeddings=col_schema_embeddings,
project_id=PROJECT_ID,
instance_name=None,
database_name=None,
schema=BQ_OPENDATAQNA_DATASET_NAME,
database_user=None,
database_password=None,
region=BQ_REGION,
VECTOR_STORE = VECTOR_STORE
))
elif VECTOR_STORE=='cloudsql-pgvector':
await(store_schema_embeddings(table_details_embeddings=table_schema_embeddings,
tablecolumn_details_embeddings=col_schema_embeddings,
project_id=PROJECT_ID,
instance_name=PG_INSTANCE,
database_name=PG_DATABASE,
schema=None,
database_user=PG_USER,
database_password=PG_PASSWORD,
region=PG_REGION,
VECTOR_STORE = VECTOR_STORE
))
print("Table and Column embeddings are saved to vector store")
async def create_kgq_sql_table():
"""
Creates a table for storing Known Good Query (KGQ) embeddings in the vector store.
This asynchronous function conditionally sets up a table to store known good SQL queries and their embeddings,
which are used to provide examples to the LLM during query generation. The table is created only
if the `EXAMPLES` configuration variable is set to 'yes'. If not, it prints a warning message encouraging
the user to create a query cache for better results.
Configuration:
This function relies on the following configuration variables:
- EXAMPLES: Determines whether to create the KGQ table ('yes' to create).
- VECTOR_STORE: Specifies the target vector store ("bigquery-vector" or "cloudsql-pgvector").
- PROJECT_ID, BQ_REGION, BQ_OPENDATAQNA_DATASET_NAME (if VECTOR_STORE is "bigquery-vector"):
Configuration for BigQuery storage.
- PG_INSTANCE, PG_DATABASE, PG_USER, PG_PASSWORD, PG_REGION (if VECTOR_STORE is "cloudsql-pgvector"):
Configuration for PostgreSQL storage.
Returns:
None
"""
if EXAMPLES:
print("Creating kgq table in vector store.")
# Delete any old tables and create a new table to KGQ embeddings
if VECTOR_STORE=='bigquery-vector':
await(setup_kgq_table(project_id=PROJECT_ID,
instance_name=None,
database_name=None,
schema=BQ_OPENDATAQNA_DATASET_NAME,
database_user=None,
database_password=None,
region=BQ_REGION,
VECTOR_STORE = VECTOR_STORE
))
elif VECTOR_STORE=='cloudsql-pgvector':
await(setup_kgq_table(project_id=PROJECT_ID,
instance_name=PG_INSTANCE,
database_name=PG_DATABASE,
schema=None,
database_user=PG_USER,
database_password=PG_PASSWORD,
region=PG_REGION,
VECTOR_STORE = VECTOR_STORE
))
else:
print("⚠️ WARNING: No Known Good Queries are provided to create query cache for Few shot examples!")
print("Creating a query cache is highly recommended for best outcomes")
print("If no Known Good Queries for the dataset are availabe at this time, you can use 3_LoadKnownGoodSQL.ipynb to load them later!!")
async def store_kgq_sql_embeddings():
"""
Stores known good query (KGQ) embeddings into the specified vector store.
This asynchronous function reads known good SQL queries from the "known_good_sql.csv" file
and stores their embeddings in either BigQuery or PostgreSQL (with pgvector) depending on the
`VECTOR_STORE` configuration. This process is only performed if the `EXAMPLES` configuration
variable is set to 'yes'. Otherwise, a warning message is displayed, highlighting the
importance of creating a query cache.
Configuration:
- Requires the "known_good_sql.csv" file to be present in the project directory.
- Relies on the following configuration variables:
- `EXAMPLES`: Determines whether to store KGQ embeddings ('yes' to store).
- `VECTOR_STORE`: Specifies the target vector store ("bigquery-vector" or "cloudsql-pgvector").
- `PROJECT_ID`, `BQ_REGION`, `BQ_OPENDATAQNA_DATASET_NAME` (if VECTOR_STORE is "bigquery-vector"):
Configuration for BigQuery storage.
- `PG_INSTANCE`, `PG_DATABASE`, `PG_USER`, `PG_PASSWORD`, `PG_REGION` (if VECTOR_STORE is "cloudsql-pgvector"):
Configuration for PostgreSQL storage.
Returns:
None
"""
if EXAMPLES:
print("Reading contents of known_good_sql.csv")
# Load the contents of the known_good_sql.csv file into a dataframe
df_kgq = load_kgq_df()
print("Storing kgq embeddings in vector store table.")
# Add KGQ to the vector store
if VECTOR_STORE=='bigquery-vector':
await(store_kgq_embeddings(df_kgq,
project_id=PROJECT_ID,
instance_name=None,
database_name=None,
schema=BQ_OPENDATAQNA_DATASET_NAME,
database_user=None,
database_password=None,
region=BQ_REGION,
VECTOR_STORE = VECTOR_STORE
))
elif VECTOR_STORE=='cloudsql-pgvector':
await(store_kgq_embeddings(df_kgq,
project_id=PROJECT_ID,
instance_name=PG_INSTANCE,
database_name=PG_DATABASE,
schema=None,
database_user=PG_USER,
database_password=PG_PASSWORD,
region=PG_REGION,
VECTOR_STORE = VECTOR_STORE
))
print('kgq embeddings stored.')
else:
print("⚠️ WARNING: No Known Good Queries are provided to create query cache for Few shot examples!")
print("Creating a query cache is highly recommended for best outcomes")
print("If no Known Good Queries for the dataset are availabe at this time, you can use 3_LoadKnownGoodSQL.ipynb to load them later!!")
def create_firestore_db(firestore_region=FIRESTORE_REGION,firestore_database="opendataqna-session-logs"):
# Check if Firestore database exists
database_exists_cmd = [
"gcloud", "firestore", "databases", "list",
"--filter", f"name=projects/{PROJECT_ID}/databases/{firestore_database}",
"--format", "value(name)" # Extract just the name if found
]
database_exists_process = subprocess.run(
database_exists_cmd, capture_output=True, text=True
)
if database_exists_process.returncode == 0 and database_exists_process.stdout:
if database_exists_process.stdout.startswith(f"projects/{PROJECT_ID}/databases/{firestore_database}"):
print("Found existing Firestore database with this name already!")
else:
raise RuntimeError("Issue with checking if the firestore db exists or not")
else:
# Create Firestore database
print("Creating new Firestore database...")
create_db_cmd = [
"gcloud", "firestore", "databases", "create",
"--database", firestore_database,
"--location", firestore_region,
"--project", PROJECT_ID
]
subprocess.run(create_db_cmd, check=True) # Raise exception on failure
# Potential wait for database readiness (optional)
time.sleep(30) # May not be strictly necessary for basic use
if __name__ == '__main__':
# Setup vector store for embeddings
create_vector_store()
# Generate embeddings for tables and columns
table_schema_embeddings, col_schema_embeddings = get_embeddings()
# Store table/column embeddings (asynchronous)
asyncio.run(store_embeddings(table_schema_embeddings, col_schema_embeddings))
# Create table for known good queries (if enabled)
asyncio.run(create_kgq_sql_table())
# Store known good query embeddings (if enabled)
asyncio.run(store_kgq_sql_embeddings())
create_firestore_db()
================================================
FILE: frontend/.gitignore
================================================
# See http://help.github.com/ignore-files/ for more about ignoring files.
# Compiled output
/tmp
/out-tsc
/bazel-out
# Node
/node_modules
npm-debug.log
yarn-error.log
# IDEs and editors
.idea/
.project
.classpath
.c9/
*.launch
.settings/
*.sublime-workspace
# Visual Studio Code
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
.history/*
# Miscellaneous
/.angular/cache
.sass-cache/
/connect.lock
/coverage
/libpeerconnection.log
testem.log
/typings
# System files
.DS_Store
Thumbs.db
================================================
FILE: frontend/README.md
================================================
Deploy Frontend Demo UI
**Technologies and Components**
* **Framework:** Angular
* **Hosting Platform:** Firebase
**Note**: This UI demo doesn't configure any domain restrictions. If you choose to add them, refer to https://firebase.google.com/docs/functions/auth-blocking-events?gen=2nd#only_allowing_registration_from_a_specific_domain
1. Install the firebase tools to run CLI commands
```
export PROJECT_ID=
export REGION=
```
```
cd Open_Data_QnA
gcloud services enable firebase.googleapis.com --project=$PROJECT_ID # Enable firebase management API
npm install -g firebase-tools
```
2. Build the firebase community builder image
Cloud Build provides a Firebase community builder image that you can use to invoke firebase commands in Cloud Build. To use this builder in a Cloud Build config file, you must first build the image and push it to the Container Registry in your project.
**Note**: *Please complete the steps carefully and use the same project in which you are going to host the app.*
Follow detailed instructions:
1. Navigate to your project root directory.
2. Clone the cloud-builders-community repository:
```
git clone https://github.com/GoogleCloudPlatform/cloud-builders-community.git
```
3. Navigate to the firebase builder image:
```
cd cloud-builders-community/firebase
```
4. Submit the builder to your project, where REGION is one of the supported build regions:
```
gcloud builds submit --region=$REGION . --project=$PROJECT_ID
```
5. Navigate back to your project root directory:
```
cd ../..
```
6. Remove the repository from your root directory:
```
rm -rf cloud-builders-community/
```
3. Create and Initialize Firebase
```
cd Open_Data_QnA/frontend
rm firebase.json .firebaserc  # Remove any old firebase files
firebase login --no-localhost
## The below command can be used to re-authenticate in case of authentication errors
firebase login --reauth --no-localhost
```
```
firebase init hosting
## Select "Add Firebase to an existing Google Cloud Platfrom Project"
## For the public directory prompt provide >> /dist/frontend/browser
## Rewrite all URLs to index prompt enter >> Yes (Enter No for any other options)
## You should now see firebase.json created in the folder
```
```
## To modify the contents for this solution update it using the cp command as below
cp firebase_setup.json firebase.json
```
```
## Run below command to create a webapp to host your application
firebase apps:create --project $PROJECT_ID
## Select Web and Provide name : "opendataqna"
```
```
## Below command provides the initialization code to add to your constant file
firebase apps:sdkconfig --project $PROJECT_ID
```
4. Enable Google Authentication in Firebase Console
- Go to the Firebase console (https://console.firebase.google.com/).
- Select your project.
- Navigate to "Authentication" -> "Sign-in method".
- Click "Add new provider" and select "Google".
- Provide a support email and click "Enable". This will enable Google authentication for your project.
5. Update the Config Code and Endpoint URLs for the frontend
In the file [`/frontend/src/assets/constants.ts`](/frontend/src/assets/constants.ts)
* Replace the config object with the one you copied in the above step
* Replace the ENDPOINT_OPENDATAQNA value with the Service URL from the Endpoint Deployment section in the backend-apis README.md
***Note that these variables need to be exported using "export" keyword. So make sure export is mentioned for both the variables***
6. Deploy the app
Run the below commands on the terminal
```
cd Open_Data_QnA/frontend
```
```
gcloud builds submit . --config frontend.yaml --substitutions _FIREBASE_PROJECT_ID=$PROJECT_ID --project=$PROJECT_ID
```
---------
---------
You can see the app URL at the end of successful deployment
> Once deployed, log in with your Google Account > Select Business User > Select a database in the dropdown (top right) > Type in the query > Hit Query
A successfully generated SQL will be shown, along with its results.
Hit Visualize to see the results rendered as charts.
----
----
**API Details**
All the payloads are in JSON format
1. List Databases : Get the available databases in the vector store that solution can run against
URI: {Service URL}/available_databases
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/available_databases
Method: GET
Request Payload : NONE
Request response:
```
{
"Error": "",
"KnownDB": "[{\"table_schema\":\"imdb-postgres\"},{\"table_schema\":\"retail-postgres\"}]",
"ResponseCode": 200
}
```
2. Known SQL : Get suggested questions (previously asked questions or examples added) for the selected database
URI: /get_known_sql
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/get_known_sql
Method: POST
Request Payload :
```
{
"user_grouping":"retail"
}
```
Request response:
```
{
"Error": "",
"KnownSQL": "[{\"example_user_question\":\"Which city had maximum number of sales and what was the count?\",\"example_generated_sql\":\"select st.city_id, count(st.city_id) as city_sales_count from retail.sales as s join retail.stores as st on s.id_store = st.id_store group by st.city_id order by city_sales_count desc limit 1;\"}]",
"ResponseCode": 200
}
```
3. SQL Generation : Generate the SQL for the input question asked against a database
URI: /generate_sql
Method: POST
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/generate_sql
Request payload:
```
{
"session_id":"",
"user_id":"harry@hogwarts.com",
"user_question":"Which city had maximum number of sales?",
"user_grouping":"retail"
}
```
Request response:
```
{
"Error": "",
"GeneratedSQL": " select st.city_id from retail.sales as s join retail.stores as st on s.id_store = st.id_store group by st.city_id order by count(*) desc limit 1;",
"ResponseCode": 200,
"SessionID":"1iuu2u-k1ij2-kkkhhj12131"
}
```
4. Execute SQL : Run the SQL statement against the provided database source
URI:/run_query
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/run_query
Method: POST
Request payload:
```
{ "user_grouping": "retail",
"generated_sql":"select st.city_id from retail.sales as s join retail.stores as st on s.id_store = st.id_store group by st.city_id order by count(*) desc limit 1;",
"session_id":"1iuu2u-k1ij2-kkkhhj12131"
}
```
Request response:
```
{
"SessionID":"1iuu2u-k1ij2-kkkhhj12131",
"Error": "",
"KnownDB": "[{\"city_id\":\"C014\"}]",
"ResponseCode": 200
}
```
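For reference, here is a hedged Python sketch chaining the two calls above with the `requests` library; the host is the sample Service URL used throughout this README, so substitute your own:
```python
# Sketch: generate SQL for a question, then execute it, reusing the
# session_id returned by /generate_sql. Payload and response keys are
# exactly as documented above.
import requests

SERVICE_URL = "https://OpenDataQnA-aeiouAEI-uc.a.run.app"  # sample host; use your Service URL

gen = requests.post(f"{SERVICE_URL}/generate_sql", json={
    "session_id": "",
    "user_id": "harry@hogwarts.com",
    "user_question": "Which city had maximum number of sales?",
    "user_grouping": "retail",
}).json()

run = requests.post(f"{SERVICE_URL}/run_query", json={
    "user_grouping": "retail",
    "generated_sql": gen["GeneratedSQL"],
    "session_id": gen["SessionID"],
}).json()
print(run["KnownDB"])  # JSON-encoded result rows
```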
5. Embed SQL : To embed known good SQLs into your example embeddings
URI:/embed_sql
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/embed_sql
METHOD: POST
Request Payload:
```
{
"session_id":"1iuu2u-k1ij2-kkkhhj12131",
"user_question":"Which city had maximum number of sales?",
"generated_sql":"select st.city_id from retail.sales as s join retail.stores as st on s.id_store = st.id_store group by st.city_id order by count(*) desc limit 1;",
"user_grouping":"retail"
}
```
Request response:
```
{
"ResponseCode" : 201,
"Message" : "Example SQL has been accepted for embedding",
"Error":"",
"SessionID":"1iuu2u-k1ij2-kkkhhj12131"
}
```
6. Generate Visualization Code : To generate JavaScript Google Charts code based on the SQL results and display the charts on the UI
As per the design, two suggested visualizations show up when the user clicks the visualize button. Hence two divs, "chart_div" and "chart_div_1", are sent as part of the response to bind them to those elements in the UI.
If you are only looking to set up the endpoint, you can stop here. In case you require the demo app (frontend UI) built in the solution, proceed to the next step.
URI:/generate_viz
Complete URL Sample : https://OpenDataQnA-aeiouAEI-uc.a.run.app/generate_viz
METHOD: POST
Request Payload:
```
{
"session_id":"1iuu2u-k1ij2-kkkhhj12131" ,
"user_question": "What are top 5 product skus that are ordered?",
"sql_generated": "SELECT productSKU as ProductSKUCode, sum(total_ordered) as TotalOrderedItems FROM `inbq1-joonix.demo.sales_sku` group by productSKU order by sum(total_ordered) desc limit 5",
"sql_results": [
{
"ProductSKUCode": "GGOEGOAQ012899",
"TotalOrderedItems": 456
},
{
"ProductSKUCode": "GGOEGDHC074099",
"TotalOrderedItems": 334
},
{
"ProductSKUCode": "GGOEGOCB017499",
"TotalOrderedItems": 319
},
{
"ProductSKUCode": "GGOEGOCC077999",
"TotalOrderedItems": 290
},
{
"ProductSKUCode": "GGOEGFYQ016599",
"TotalOrderedItems": 253
}
]
}
```
Request response:
```
{
"SessionID":"1iuu2u-k1ij2-kkkhhj12131",
"Error": "",
"GeneratedChartjs": {
"chart_div": "google.charts.load('current', {\n packages: ['corechart']\n});\ngoogle.charts.setOnLoadCallback(drawChart);\n\nfunction drawChart() {\n var data = google.visualization.arrayToDataTable([\n ['Product SKU', 'Total Ordered Items'],\n ['GGOEGOAQ012899', 456],\n ['GGOEGDHC074099', 334],\n ['GGOEGOCB017499', 319],\n ['GGOEGOCC077999', 290],\n ['GGOEGFYQ016599', 253],\n ]);\n\n var options = {\n title: 'Top 5 Product SKUs Ordered',\n width: 600,\n height: 300,\n hAxis: {\n textStyle: {\n fontSize: 12\n }\n },\n vAxis: {\n textStyle: {\n fontSize: 12\n }\n },\n legend: {\n textStyle: {\n fontSize: 12\n }\n },\n bar: {\n groupWidth: '50%'\n }\n };\n\n var chart = new google.visualization.BarChart(document.getElementById('chart_div'));\n\n chart.draw(data, options);\n}\n",
"chart_div_1": "google.charts.load('current', {'packages':['corechart']});\ngoogle.charts.setOnLoadCallback(drawChart);\nfunction drawChart() {\n var data = google.visualization.arrayToDataTable([\n ['ProductSKUCode', 'TotalOrderedItems'],\n ['GGOEGOAQ012899', 456],\n ['GGOEGDHC074099', 334],\n ['GGOEGOCB017499', 319],\n ['GGOEGOCC077999', 290],\n ['GGOEGFYQ016599', 253]\n ]);\n\n var options = {\n title: 'Top 5 Product SKUs that are Ordered',\n width: 600,\n height: 300,\n hAxis: {\n textStyle: {\n fontSize: 5\n }\n },\n vAxis: {\n textStyle: {\n fontSize: 5\n }\n },\n legend: {\n textStyle: {\n fontSize: 10\n }\n },\n bar: {\n groupWidth: \"60%\"\n }\n };\n\n var chart = new google.visualization.ColumnChart(document.getElementById('chart_div_1'));\n\n chart.draw(data, options);\n}\n"
},
"ResponseCode": 200
}
```
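To illustrate how a client consumes this response, a hedged Python sketch (the payload is trimmed to two of the sample rows; each returned value is Google Charts JavaScript targeting the element whose id matches the key):
```
import requests

BASE_URL = "https://OpenDataQnA-aeiouAEI-uc.a.run.app"  # placeholder; use your URL

payload = {
    "session_id": "1iuu2u-k1ij2-kkkhhj12131",
    "user_question": "What are top 5 product skus that are ordered?",
    "sql_generated": (
        "SELECT productSKU as ProductSKUCode, sum(total_ordered) as TotalOrderedItems "
        "FROM `inbq1-joonix.demo.sales_sku` group by productSKU "
        "order by sum(total_ordered) desc limit 5"
    ),
    "sql_results": [
        {"ProductSKUCode": "GGOEGOAQ012899", "TotalOrderedItems": 456},
        {"ProductSKUCode": "GGOEGDHC074099", "TotalOrderedItems": 334},
    ],
}

charts = requests.post(f"{BASE_URL}/generate_viz", json=payload, timeout=120).json()
# The UI injects each script into the page element with the matching id
# ("chart_div" and "chart_div_1").
for div_id, js_code in charts["GeneratedChartjs"].items():
    print(f"{div_id}: {len(js_code)} characters of Google Charts JS")
```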
================================================
FILE: frontend/angular.json
================================================
{
"$schema": "./node_modules/@angular/cli/lib/config/schema.json",
"version": 1,
"newProjectRoot": "projects",
"projects": {
"frontend": {
"projectType": "application",
"schematics": {
"@schematics/angular:component": {
"style": "scss"
}
},
"root": "",
"sourceRoot": "src",
"prefix": "app",
"architect": {
"build": {
"builder": "@angular-devkit/build-angular:application",
"options": {
"outputPath": "dist/frontend",
"index": "src/index.html",
"browser": "src/main.ts",
"polyfills": [
"zone.js"
],
"tsConfig": "tsconfig.app.json",
"inlineStyleLanguage": "scss",
"assets": [
"src/favicon.ico",
"src/assets"
],
"styles": [
"node_modules/bootstrap/dist/css/bootstrap.min.css",
"src/styles.scss"
],
"scripts": [],
"server": "src/main.server.ts",
"prerender": true,
"ssr": {
"entry": "server.ts"
}
},
"configurations": {
"production": {
"budgets": [
{
"type": "initial",
"maximumWarning": "1mb",
"maximumError": "2mb"
},
{
"type": "anyComponentStyle",
"maximumWarning": "2kb",
"maximumError": "4kb"
}
],
"outputHashing": "all"
},
"development": {
"optimization": false,
"extractLicenses": false,
"sourceMap": true
}
},
"defaultConfiguration": "production"
},
"serve": {
"builder": "@angular-devkit/build-angular:dev-server",
"options": {
"buildTarget": "frontend:build"
},
"configurations": {
"production": {
"buildTarget": "frontend:build:production"
},
"development": {
"buildTarget": "frontend:build:development"
}
},
"defaultConfiguration": "development"
},
"extract-i18n": {
"builder": "@angular-devkit/build-angular:extract-i18n",
"options": {
"buildTarget": "frontend:build"
}
},
"test": {
"builder": "@angular-devkit/build-angular:karma",
"options": {
"polyfills": [
"zone.js",
"zone.js/testing"
],
"tsConfig": "tsconfig.spec.json",
"inlineStyleLanguage": "scss",
"assets": [
"src/favicon.ico",
"src/assets"
],
"styles": [
"src/styles.scss"
],
"scripts": []
}
}
}
}
},
"cli": {
"analytics": false
}
}
================================================
FILE: frontend/database.indexes.json
================================================
{
"indexes": [
{
"collectionGroup": "session_logs",
"queryScope": "COLLECTION",
"fields": [
{
"fieldPath": "user_id",
"order": "ASCENDING"
},
{
"fieldPath": "timestamp",
"order": "DESCENDING"
}
]
}
],
"fieldOverrides": []
}
================================================
FILE: frontend/database.rules.json
================================================
rules_version = '2';
service cloud.firestore {
match /databases/{database}/documents {
match /{document=**} {
allow read: if true;
allow write: if false;
}
}
}
================================================
FILE: frontend/firebase_setup.json
================================================
{
"hosting": {
"public": "/dist/frontend/browser",
"ignore": [
"firebase.json",
"**/.*",
"**/node_modules/**"
],
"rewrites": [
{
"source": "**",
"destination": "/index.html"
}
]
},
"firestore": [
{
"database": "opendataqna-session-logs",
"rules": "database.rules.json",
"indexes": "database.indexes.json"
}
]
}
================================================
FILE: frontend/frontend-flutter/.flutter-plugins
================================================
# This is a generated file; do not edit or check into version control.
audio_waveforms=/Users/raimeur/.pub-cache/hosted/pub.dev/audio_waveforms-1.0.5/
cloud_firestore=/Users/raimeur/.pub-cache/hosted/pub.dev/cloud_firestore-5.4.0/
cloud_firestore_web=/Users/raimeur/.pub-cache/hosted/pub.dev/cloud_firestore_web-4.2.0/
emoji_picker_flutter=/Users/raimeur/.pub-cache/hosted/pub.dev/emoji_picker_flutter-1.6.4/
file_picker=/Users/raimeur/.pub-cache/hosted/pub.dev/file_picker-8.0.6/
file_selector_linux=/Users/raimeur/.pub-cache/hosted/pub.dev/file_selector_linux-0.9.2+1/
file_selector_macos=/Users/raimeur/.pub-cache/hosted/pub.dev/file_selector_macos-0.9.4/
file_selector_windows=/Users/raimeur/.pub-cache/hosted/pub.dev/file_selector_windows-0.9.3+2/
firebase_auth=/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_auth-5.1.3/
firebase_auth_web=/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_auth_web-5.12.5/
firebase_core=/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_core-3.4.0/
firebase_core_web=/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_core_web-2.17.5/
flutter_inappwebview=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_inappwebview-6.0.0/
flutter_inappwebview_android=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_inappwebview_android-1.0.13/
flutter_inappwebview_ios=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_inappwebview_ios-1.0.13/
flutter_inappwebview_macos=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_inappwebview_macos-1.0.11/
flutter_inappwebview_web=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_inappwebview_web-1.0.8/
flutter_keyboard_visibility=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility-6.0.0/
flutter_keyboard_visibility_linux=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility_linux-1.0.0/
flutter_keyboard_visibility_macos=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility_macos-1.0.0/
flutter_keyboard_visibility_web=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility_web-2.0.0/
flutter_keyboard_visibility_windows=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility_windows-1.0.0/
flutter_plugin_android_lifecycle=/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_plugin_android_lifecycle-2.0.21/
image_picker=/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker-1.1.2/
image_picker_android=/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_android-0.8.12+10/
image_picker_for_web=/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_for_web-3.0.4/
image_picker_ios=/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_ios-0.8.12/
image_picker_linux=/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_linux-0.2.1+1/
image_picker_macos=/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_macos-0.2.1+1/
image_picker_web=/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_web-4.0.0/
image_picker_windows=/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_windows-0.2.1+1/
libphonenumber_plugin=/Users/raimeur/.pub-cache/hosted/pub.dev/libphonenumber_plugin-0.3.3/
libphonenumber_web=/Users/raimeur/.pub-cache/hosted/pub.dev/libphonenumber_web-0.3.2/
path_provider=/Users/raimeur/.pub-cache/hosted/pub.dev/path_provider-2.1.4/
path_provider_android=/Users/raimeur/.pub-cache/hosted/pub.dev/path_provider_android-2.2.9/
path_provider_foundation=/Users/raimeur/.pub-cache/hosted/pub.dev/path_provider_foundation-2.4.0/
path_provider_linux=/Users/raimeur/.pub-cache/hosted/pub.dev/path_provider_linux-2.2.1/
path_provider_windows=/Users/raimeur/.pub-cache/hosted/pub.dev/path_provider_windows-2.3.0/
pointer_interceptor=/Users/raimeur/.pub-cache/hosted/pub.dev/pointer_interceptor-0.10.1+1/
pointer_interceptor_ios=/Users/raimeur/.pub-cache/hosted/pub.dev/pointer_interceptor_ios-0.10.1/
pointer_interceptor_web=/Users/raimeur/.pub-cache/hosted/pub.dev/pointer_interceptor_web-0.10.2/
shared_preferences=/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences-2.3.0/
shared_preferences_android=/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_android-2.3.0/
shared_preferences_foundation=/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_foundation-2.5.0/
shared_preferences_linux=/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_linux-2.4.0/
shared_preferences_web=/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_web-2.4.0/
shared_preferences_windows=/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_windows-2.4.0/
url_launcher=/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher-6.3.0/
url_launcher_android=/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_android-6.3.8/
url_launcher_ios=/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_ios-6.3.1/
url_launcher_linux=/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_linux-3.1.1/
url_launcher_macos=/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_macos-3.2.0/
url_launcher_web=/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_web-2.3.1/
url_launcher_windows=/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_windows-3.1.2/
================================================
FILE: frontend/frontend-flutter/.flutter-plugins-dependencies
================================================
{"info":"This is a generated file; do not edit or check into version control.","plugins":{"ios":[{"name":"audio_waveforms","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/audio_waveforms-1.0.5/","native_build":true,"dependencies":[]},{"name":"cloud_firestore","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/cloud_firestore-5.4.0/","native_build":true,"dependencies":["firebase_core"]},{"name":"emoji_picker_flutter","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/emoji_picker_flutter-1.6.4/","native_build":true,"dependencies":[]},{"name":"file_picker","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/file_picker-8.0.6/","native_build":true,"dependencies":[]},{"name":"firebase_auth","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_auth-5.1.3/","native_build":true,"dependencies":["firebase_core"]},{"name":"firebase_core","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_core-3.4.0/","native_build":true,"dependencies":[]},{"name":"flutter_inappwebview_ios","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_inappwebview_ios-1.0.13/","native_build":true,"dependencies":[]},{"name":"flutter_keyboard_visibility","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility-6.0.0/","native_build":true,"dependencies":[]},{"name":"image_picker_ios","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_ios-0.8.12/","native_build":true,"dependencies":[]},{"name":"libphonenumber_plugin","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/libphonenumber_plugin-0.3.3/","native_build":true,"dependencies":[]},{"name":"path_provider_foundation","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/path_provider_foundation-2.4.0/","shared_darwin_source":true,"native_build":true,"dependencies":[]},{"name":"pointer_interceptor_ios","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/pointer_interceptor_ios-0.10.1/","native_build":true,"dependencies":[]},{"name":"shared_preferences_foundation","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_foundation-2.5.0/","shared_darwin_source":true,"native_build":true,"dependencies":[]},{"name":"url_launcher_ios","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_ios-6.3.1/","native_build":true,"dependencies":[]}],"android":[{"name":"audio_waveforms","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/audio_waveforms-1.0.5/","native_build":true,"dependencies":[]},{"name":"cloud_firestore","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/cloud_firestore-5.4.0/","native_build":true,"dependencies":["firebase_core"]},{"name":"emoji_picker_flutter","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/emoji_picker_flutter-1.6.4/","native_build":true,"dependencies":[]},{"name":"file_picker","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/file_picker-8.0.6/","native_build":true,"dependencies":["flutter_plugin_android_lifecycle"]},{"name":"firebase_auth","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_auth-5.1.3/","native_build":true,"dependencies":["firebase_core"]},{"name":"firebase_core","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_core-3.4.0/","native_build":true,"dependencies":[]},{"name":"flutter_inappwebview_android","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_inappwebview_android-1.0.13/","native_build":true,"dependencies":[]},{"name":"flutter_keyboard_visibility","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility-6.0.0/","native_build":true,"dependencies":[]},{"name":"flutter_plugin_android_lifecycle","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutte
r_plugin_android_lifecycle-2.0.21/","native_build":true,"dependencies":[]},{"name":"image_picker_android","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_android-0.8.12+10/","native_build":true,"dependencies":["flutter_plugin_android_lifecycle"]},{"name":"libphonenumber_plugin","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/libphonenumber_plugin-0.3.3/","native_build":true,"dependencies":[]},{"name":"path_provider_android","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/path_provider_android-2.2.9/","native_build":true,"dependencies":[]},{"name":"shared_preferences_android","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_android-2.3.0/","native_build":true,"dependencies":[]},{"name":"url_launcher_android","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_android-6.3.8/","native_build":true,"dependencies":[]}],"macos":[{"name":"cloud_firestore","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/cloud_firestore-5.4.0/","native_build":true,"dependencies":["firebase_core"]},{"name":"emoji_picker_flutter","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/emoji_picker_flutter-1.6.4/","native_build":true,"dependencies":[]},{"name":"file_selector_macos","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/file_selector_macos-0.9.4/","native_build":true,"dependencies":[]},{"name":"firebase_auth","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_auth-5.1.3/","native_build":true,"dependencies":["firebase_core"]},{"name":"firebase_core","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_core-3.4.0/","native_build":true,"dependencies":[]},{"name":"flutter_inappwebview_macos","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_inappwebview_macos-1.0.11/","native_build":true,"dependencies":[]},{"name":"flutter_keyboard_visibility_macos","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility_macos-1.0.0/","native_build":false,"dependencies":[]},{"name":"image_picker_macos","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_macos-0.2.1+1/","native_build":false,"dependencies":["file_selector_macos"]},{"name":"path_provider_foundation","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/path_provider_foundation-2.4.0/","shared_darwin_source":true,"native_build":true,"dependencies":[]},{"name":"shared_preferences_foundation","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_foundation-2.5.0/","shared_darwin_source":true,"native_build":true,"dependencies":[]},{"name":"url_launcher_macos","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_macos-3.2.0/","native_build":true,"dependencies":[]}],"linux":[{"name":"emoji_picker_flutter","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/emoji_picker_flutter-1.6.4/","native_build":true,"dependencies":[]},{"name":"file_selector_linux","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/file_selector_linux-0.9.2+1/","native_build":true,"dependencies":[]},{"name":"flutter_keyboard_visibility_linux","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility_linux-1.0.0/","native_build":false,"dependencies":[]},{"name":"image_picker_linux","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_linux-0.2.1+1/","native_build":false,"dependencies":["file_selector_linux"]},{"name":"path_provider_linux","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/path_provider_linux-2.2.1/","native_build":false,"dependencies":[]},{"name":"shared_preferences_linux","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_linux-2.4.0/","native
_build":false,"dependencies":["path_provider_linux"]},{"name":"url_launcher_linux","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_linux-3.1.1/","native_build":true,"dependencies":[]}],"windows":[{"name":"cloud_firestore","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/cloud_firestore-5.4.0/","native_build":true,"dependencies":["firebase_core"]},{"name":"emoji_picker_flutter","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/emoji_picker_flutter-1.6.4/","native_build":true,"dependencies":[]},{"name":"file_selector_windows","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/file_selector_windows-0.9.3+2/","native_build":true,"dependencies":[]},{"name":"firebase_auth","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_auth-5.1.3/","native_build":true,"dependencies":["firebase_core"]},{"name":"firebase_core","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_core-3.4.0/","native_build":true,"dependencies":[]},{"name":"flutter_keyboard_visibility_windows","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility_windows-1.0.0/","native_build":false,"dependencies":[]},{"name":"image_picker_windows","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_windows-0.2.1+1/","native_build":false,"dependencies":["file_selector_windows"]},{"name":"path_provider_windows","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/path_provider_windows-2.3.0/","native_build":false,"dependencies":[]},{"name":"shared_preferences_windows","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_windows-2.4.0/","native_build":false,"dependencies":["path_provider_windows"]},{"name":"url_launcher_windows","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_windows-3.1.2/","native_build":true,"dependencies":[]}],"web":[{"name":"cloud_firestore_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/cloud_firestore_web-4.2.0/","dependencies":["firebase_core_web"]},{"name":"emoji_picker_flutter","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/emoji_picker_flutter-1.6.4/","dependencies":[]},{"name":"file_picker","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/file_picker-8.0.6/","dependencies":[]},{"name":"firebase_auth_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_auth_web-5.12.5/","dependencies":["firebase_core_web"]},{"name":"firebase_core_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/firebase_core_web-2.17.5/","dependencies":[]},{"name":"flutter_inappwebview_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_inappwebview_web-1.0.8/","dependencies":[]},{"name":"flutter_keyboard_visibility_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/flutter_keyboard_visibility_web-2.0.0/","dependencies":[]},{"name":"image_picker_for_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_for_web-3.0.4/","dependencies":[]},{"name":"image_picker_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/image_picker_web-4.0.0/","dependencies":[]},{"name":"libphonenumber_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/libphonenumber_web-0.3.2/","dependencies":[]},{"name":"pointer_interceptor_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/pointer_interceptor_web-0.10.2/","dependencies":[]},{"name":"shared_preferences_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/shared_preferences_web-2.4.0/","dependencies":[]},{"name":"url_launcher_web","path":"/Users/raimeur/.pub-cache/hosted/pub.dev/url_launcher_web-2.3.1/","dependencies":[]}]},"dependencyGraph":[{"name":"audio_waveforms","dependencies":[]},{"name
":"cloud_firestore","dependencies":["cloud_firestore_web","firebase_core"]},{"name":"cloud_firestore_web","dependencies":["firebase_core","firebase_core_web"]},{"name":"emoji_picker_flutter","dependencies":["shared_preferences"]},{"name":"file_picker","dependencies":["flutter_plugin_android_lifecycle"]},{"name":"file_selector_linux","dependencies":[]},{"name":"file_selector_macos","dependencies":[]},{"name":"file_selector_windows","dependencies":[]},{"name":"firebase_auth","dependencies":["firebase_auth_web","firebase_core"]},{"name":"firebase_auth_web","dependencies":["firebase_core","firebase_core_web"]},{"name":"firebase_core","dependencies":["firebase_core_web"]},{"name":"firebase_core_web","dependencies":[]},{"name":"flutter_inappwebview","dependencies":["flutter_inappwebview_android","flutter_inappwebview_ios","flutter_inappwebview_macos","flutter_inappwebview_web"]},{"name":"flutter_inappwebview_android","dependencies":[]},{"name":"flutter_inappwebview_ios","dependencies":[]},{"name":"flutter_inappwebview_macos","dependencies":[]},{"name":"flutter_inappwebview_web","dependencies":[]},{"name":"flutter_keyboard_visibility","dependencies":["flutter_keyboard_visibility_linux","flutter_keyboard_visibility_macos","flutter_keyboard_visibility_web","flutter_keyboard_visibility_windows"]},{"name":"flutter_keyboard_visibility_linux","dependencies":[]},{"name":"flutter_keyboard_visibility_macos","dependencies":[]},{"name":"flutter_keyboard_visibility_web","dependencies":[]},{"name":"flutter_keyboard_visibility_windows","dependencies":[]},{"name":"flutter_plugin_android_lifecycle","dependencies":[]},{"name":"image_picker","dependencies":["image_picker_android","image_picker_for_web","image_picker_ios","image_picker_linux","image_picker_macos","image_picker_windows"]},{"name":"image_picker_android","dependencies":["flutter_plugin_android_lifecycle"]},{"name":"image_picker_for_web","dependencies":[]},{"name":"image_picker_ios","dependencies":[]},{"name":"image_picker_linux","dependencies":["file_selector_linux"]},{"name":"image_picker_macos","dependencies":["file_selector_macos"]},{"name":"image_picker_web","dependencies":[]},{"name":"image_picker_windows","dependencies":["file_selector_windows"]},{"name":"libphonenumber_plugin","dependencies":["libphonenumber_web"]},{"name":"libphonenumber_web","dependencies":[]},{"name":"path_provider","dependencies":["path_provider_android","path_provider_foundation","path_provider_linux","path_provider_windows"]},{"name":"path_provider_android","dependencies":[]},{"name":"path_provider_foundation","dependencies":[]},{"name":"path_provider_linux","dependencies":[]},{"name":"path_provider_windows","dependencies":[]},{"name":"pointer_interceptor","dependencies":["pointer_interceptor_ios","pointer_interceptor_web"]},{"name":"pointer_interceptor_ios","dependencies":[]},{"name":"pointer_interceptor_web","dependencies":[]},{"name":"shared_preferences","dependencies":["shared_preferences_android","shared_preferences_foundation","shared_preferences_linux","shared_preferences_web","shared_preferences_windows"]},{"name":"shared_preferences_android","dependencies":[]},{"name":"shared_preferences_foundation","dependencies":[]},{"name":"shared_preferences_linux","dependencies":["path_provider_linux"]},{"name":"shared_preferences_web","dependencies":[]},{"name":"shared_preferences_windows","dependencies":["path_provider_windows"]},{"name":"url_launcher","dependencies":["url_launcher_android","url_launcher_ios","url_launcher_linux","url_launcher_macos","url_launcher_web","url
_launcher_windows"]},{"name":"url_launcher_android","dependencies":[]},{"name":"url_launcher_ios","dependencies":[]},{"name":"url_launcher_linux","dependencies":[]},{"name":"url_launcher_macos","dependencies":[]},{"name":"url_launcher_web","dependencies":[]},{"name":"url_launcher_windows","dependencies":[]}],"date_created":"2024-09-10 18:23:23.537388","version":"3.22.2"}
================================================
FILE: frontend/frontend-flutter/Open Data QnA - Working Sheet V2 - sample_questions_UI copy.csv
================================================
user_grouping,scenario,question
MovieExplorer-bigquery,Genres,What are the top 5 most common movie genres in the dataset?
MovieExplorer-bigquery,Genres,How many are musicals?
MovieExplorer-bigquery,Genres,Romance?
MovieExplorer-bigquery,Movie,What is the average user rating of the God Father movie?
MovieExplorer-bigquery,Movie,Which year was it released?
MovieExplorer-bigquery,Movie,How long is it?
MovieExplorer-bigquery,Movie,Who is the lead actor?
MovieExplorer-bigquery,Movie,Director
MovieExplorer-bigquery,Movie,Cast
WorldCensus-cloudsql-pg,Life Expectancy,What is the life expectancy for men and women in a United States in 2022?
WorldCensus-cloudsql-pg,Life Expectancy,In India?
WorldCensus-cloudsql-pg,Life Expectancy,Which country has highest male life expectancy?
WorldCensus-cloudsql-pg,Life Expectancy,Female life expectancy?
WorldCensus-cloudsql-pg,Population Density,What are the top 5 coutries with highest population density in 2024?
WorldCensus-cloudsql-pg,Population Density,What are the birth and death rates in these counties?
WorldCensus-cloudsql-pg,Sex Ratio at Birth,What is the sex ratio at birth in China in 2023?
WorldCensus-cloudsql-pg,Sex Ratio at Birth,Which country has the highest?
WorldCensus-cloudsql-pg,Sex Ratio at Birth,Whats the world average?
================================================
FILE: frontend/frontend-flutter/Open_Data_QnA_sample_questions_v3 copy.csv
================================================
user_grouping,scenario,question,main_question
MovieExplorer-bigquery,Genres,What are the top 5 most common movie genres in the dataset?,Y
MovieExplorer-bigquery,Genres,How many are musicals?,N
MovieExplorer-bigquery,Genres,Romance?,N
MovieExplorer-bigquery,Movie,What is the average user rating of the God Father movie?,Y
MovieExplorer-bigquery,Movie,Which year was it released?,N
MovieExplorer-bigquery,Movie,How long is it?,N
MovieExplorer-bigquery,Movie,Who is the lead actor?,N
MovieExplorer-bigquery,Movie,Director,N
MovieExplorer-bigquery,Movie,Cast,N
MovieExplorer-bigquery,Movie,Who is the actor playing the role of the godfather in the Godfather movie?,Y
MovieExplorer-bigquery,Movie,and the one playing the role of Sony?,N
MovieExplorer-bigquery,Movie,How many people saw the Godfather?,y
WorldCensus-cloudsql-pg,Life Expectancy,What is the life expectancy for men and women in a United States in 2022?,Y
WorldCensus-cloudsql-pg,Life Expectancy,In India?,N
WorldCensus-cloudsql-pg,Life Expectancy,Which country has highest male life expectancy?,N
WorldCensus-cloudsql-pg,Life Expectancy,Female life expectancy?,N
WorldCensus-cloudsql-pg,Population Density,What are the top 5 coutries with highest population density in 2024?,Y
WorldCensus-cloudsql-pg,Population Density,What are the birth and death rates in these counties?,N
WorldCensus-cloudsql-pg,Sex Ratio at Birth,What is the sex ratio at birth in China in 2023?,Y
WorldCensus-cloudsql-pg,Sex Ratio at Birth,Which country has the highest?,N
WorldCensus-cloudsql-pg,Sex Ratio at Birth,What is the world average?,N
WorldCensus-cloudsql-pg,Sex Ratio at Birth,What country has the lowest male ration?,Y
WorldCensus-cloudsql-pg,Sex Ratio at Birth,Since when?,N
================================================
FILE: frontend/frontend-flutter/README.md
================================================
# Deploy the Flutter-based Frontend demo UI
## Technologies and Components
To easily use the Open Data QnA SDK and the backend you have just installed (if not installed yet, please refer to the backend README.md in this repo for details), you need a frontend.
This page explains how to install the frontend provided by this solution so that you can jump-start your use of Open Data QnA.
You can, of course, develop your own frontend and call the APIs exposed by the backend deployment.
There are two versions of the frontend: one written in Angular and one written in Flutter, an open-source UI framework based on the Dart language and backed by Google. Functionality-wise they are equivalent, even though minor differences may exist.
This readme is about installing the Flutter-based frontend.
For more information on Flutter :
- [Flutter documentation](https://docs.flutter.dev/?_gl=1*17csnxq*_ga*MTI3NTU2MjQxMC4xNzI1ODc2Njg5*_ga_04YGWK0175*MTcyNTg3NjY4OS4xLjAuMTcyNTg3NjY4OS4wLjAuMA..)
The frontend needs to be deployed to Firebase, a Mobile Backend as a Service. It has a free tier called the "Spark Plan" that allows you to use the services required to run the frontend.
For more details on Firebase, please have a look at the documentation below:
- [Firebase pricing tiers](https://firebase.google.com/pricing)
- [Firebase Documentation](https://firebase.google.com/docs)
## Getting Started
### Installing Flutter and dart SDK
#### Installing Flutter
The first step is to install the Flutter framework.
To build the Flutter app, you can either use an IDE, like Visual Studio Code with the Flutter SDK (and plugin) and the relevant extension, or just Flutter's command-line tools to build the app manually.
This guide only explains how to use Flutter's command-line tools via the installation of the Flutter SDK bundle (which also contains the Dart SDK).
Please click on the link below.
- [Flutter SDK installation](https://docs.flutter.dev/get-started/install)
You'll end up on the installation landing page.
1- Click on the platform corresponding to the OS of the desktop you'll install the frontend on.
As an example, let's use Windows :
2- Click on the Web type of app :
3- Click on the flutter_windows_3.24.2-stable.zip button (version will vary over time)
4- Move the zip file into the target folder you want, let's say "dev"
5- Extract the archive
6- Update the PATH environment variable :
%USERPROFILE%\dev\flutter\bin
7- Test the installation by typing the "flutter doctor" command :
### Installing Firebase tools
1- Install the Firebase CLI
In order to interact easily with Firebase, you have to install the Firebase CLI.
You need npm installed as a prerequisite; then install the CLI:
```
npm install -g firebase-tools
```
With these Firebase CLI commands, you're able to authenticate to Firebase and do the deployment of the frontend on the Firebase Hosting service.
2- Test the Firebase CLI
In order to verify that the Firebase CLI is working fine, log in to Firebase using the command below.
```
firebase login
```
For more details on the installation, please look at the link below:
- [Firebase CLI installation documentation](https://firebase.google.com/docs/cli#windows-npm)
### Installing Flutterfire CLI
Flutter is tightly integrated with Firebase.
The flutterfire_cli command, which is part of the FlutterFire CLI, is specifically designed for Flutter projects. It streamlines the process of connecting your Flutter app to Firebase and generating the necessary configuration files (firebase_options.dart) for the different platforms (Android, iOS, web, etc.).
Install it from any directory; we'll need it later on.
```
dart pub global activate flutterfire_cli
```
You may need to modify the PATH environment variable to access the flutterfire CLI.
On Windows, the binary is stored in :
%USERPROFILE%\AppData\Local\Pub\Cache\bin
### Creating and configuring the Firebase project
1- Go to the Firebase console
Click on "Get started with a Firebase project"
```
https://console.firebase.google.com/
```
2- Give a name to the project
Use the same name you used during the backend installation. As an example, let's use "opendataqna".
3- Google Analytics
You can choose to activate Google Analytics or not. This is not required for the app to work.
For the sake of completeness, let's use it as proposed by default.
Just click on the "Continue" button.
Then click on the "Create project" button.
Once done, you have access to your newly created Firebase project.
Alternatively, you can also use the "flutterfire configure" CLI to create the Firebase project.
4- List the project using the Firebase CLI
Now go back to your terminal and list this new Firebase project :
```
firebase projects:list
```
### Creating the Firestore Database
The frontend requires a Firestore database (which is a Firebase service) to work.
The free tier only allows the creation of one database, which has the default name (derived from the project name).
1- On the Firebase console, click on the Firestore menu
2- Select the location
3- Select the production mode
4- Modify the security rules
The frontend needs to read and write different collections in order to store the configuration, the known-good SQL, and the imported questions.
For that, you need to change the security rules so that the app can read and write those collections (please take appropriate measures to protect access to the database).
### Enabling Sign-In method
The app requires that you authenticate using your Google account, whether it is personal or professional.
1- Go to the Firebase Authentication menu
For that, go to the Firebase console, click on the Authentication menu, and then on the Sign-in method tab:
2- Choose Google as the Provider
Click on the Google icon to enable federated identity via Google:
3- Update the project-level settings with the relevant info
### Installing the frontend
Now that the Firebase project is created, let's get the source code of the Flutter frontend.
1- Create a folder for your project
Create a folder that will contain the source code and the Firebase configuration.
Let's call it OpenDataQnA as well:
```
mkdir OpenDataQnA
```
2- Clone the source code from the repository
Go to the OpenDataQnA folder and clone the source code of the frontend app using the commands below:
```
git clone https://github.com/GoogleCloudPlatform/Open_Data_QnA.git
cd Open_Data_QnA/frontend/frontend-flutter
```
3- Register the frontend app to Firebase
In order to register the app to Firebase and select the Firebase project it will live in, use the flutterfire configure command:
```
flutterfire configure
```
You'll be guided by the command via a couple of questions :
You can check on the Firebase console that the app has been registered to the opendataqna project by clicking on "Project Overview".
4- Deploy the app to Firebase Hosting
Firebase Hosting is a Firebase service that allows you to host a web app.
If you are not logged in to Firebase yet, do so with the firebase login command:
```
firebase login
```
You also need to initialize the Firebase project. That will update the firebase.json file created by flutterfire configure by adding the service you want to use (hosting).
For that make sure you're at the root of the project :
```
cd Open_Data_QnA/frontend/frontend-flutter
```
Then run the command below to help Firebase Hosting better understand and integrate with Flutter, in order to optimize the build and deployment:
```
firebase experiments:enable webframeworks
```
Then type the command below:
```
firebase init hosting
```
Answer the questions asked, and a firebase.json file will be generated at the root of the project.
We're now ready to deploy the app on Firebase Hosting.
For that, still at the root level of the project (frontend-flutter folder) type the command below to deploy the app to Firebase Hosting service:
```
firebase deploy
```
Once the app has been built and deployed, it is available via the Hosting URL that shows up at the bottom of the output of the firebase deploy command and has the form:
https://<project-name>.web.app
### Using the frontend
1- Setup the config_frontend config file
The app requires some configuration to work, like the URI of the endpoint created during the backend installation and other information.
Create a JSON file, name it config_frontend.json (the name does not matter), and copy-paste the JSON object below:
```
{
"endpoint_opendataqnq": "https://opendataqna-ol22ywferse-uc.v.run.app",
"firestore_database_id": "opendataqna-session-logs",
"firestore_history_collection": "session_logs",
"firestore_cfg_collection": "front_end_flutter_cfg",
"expert_mode": true,
"anonymized_data": false,
"firebase_app_name": "opendataqna",
"imported_questions": "imported_questions"
}
```
Change the values based on your setup.
- endpoint_opendataqnq : contains the URI of the backend created (String)
- firestore_database_id : name of the database you created earlier. If it is the default database, use "default" (String)
- firestore_history_collection : name of the collection used to store all the known-good sql, questions, answers, user_id and timestamp (String)
- firestore_cfg_collection : name of this file (String)
- expert_mode : if true, activates the expert mode (boolean)
- anonymized_data : if true, activates data anonymization
- firebase_app_name : name of the app as registered to the Firebase project as it shows up on the Project overview on the Firebase console (String)
- imported_questions : name of the collection used to store imported questions
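As a quick sanity check before using the file, here is a minimal Python sketch that validates the keys and types described above (illustrative only; this script is not part of the app):
```
import json

# Expected keys and value types, mirroring the documentation above.
EXPECTED = {
    "endpoint_opendataqnq": str,
    "firestore_database_id": str,
    "firestore_history_collection": str,
    "firestore_cfg_collection": str,
    "expert_mode": bool,
    "anonymized_data": bool,
    "firebase_app_name": str,
    "imported_questions": str,
}

with open("config_frontend.json") as f:
    cfg = json.load(f)

for key, expected_type in EXPECTED.items():
    assert key in cfg, f"missing key: {key}"
    assert isinstance(cfg[key], expected_type), f"{key} should be a {expected_type.__name__}"
print("config_frontend.json looks well-formed")
```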
2- Access the app
Access the app using the https://<project-name>.web.app link generated during the deployment of the app.
Accept the terms and conditions.
3- Authenticate yourself using your Google account
4- Landing page
You end up on the landing page.
- The stepper shows the status and the processing time of the request. It only shows up when the user is in Expert mode (more on that below).
- The app keeps track of the questions, answers, and SQL requests that were successful in the Firestore database. When the app is launched, at most the last 4 questions asked are displayed. Clicking on any of them automatically fills out the input text field.
- The hamburger menu is collapsed by default so that the app can provide more real estate for the canvas.
5- Menu
- New chat : Allows you to reset the context so that the next query is not answered based on the previous answers
- Import : This allows the user to import questions
- Imported questions: import a CSV file containing questions that are asked often. It only shows up in Expert mode. There are 2 flavors of this CSV file that are supported:
- 3 columns that have to be : user_grouping,scenario,question
An example of such a file is provided in frontend-flutter/Open Data QnA - Working Sheet V2 - sample_questions_UI copy.csv
```
user_grouping,scenario,question
MovieExplorer-bigquery,Genres,What are the top 5 most common movie genres in the dataset?
MovieExplorer-bigquery,Genres,How many are musicals?
MovieExplorer-bigquery,Genres,Romance?
MovieExplorer-bigquery,Movie,What is the average user rating of the God Father movie?
MovieExplorer-bigquery,Movie,Which year was it released?
MovieExplorer-bigquery,Movie,How long is it?
MovieExplorer-bigquery,Movie,Who is the lead actor?
MovieExplorer-bigquery,Movie,Director
MovieExplorer-bigquery,Movie,Cast
WorldCensus-cloudsql-pg,Life Expectancy,What is the life expectancy for men and women in a United States in 2022?
WorldCensus-cloudsql-pg,Life Expectancy,In India?
WorldCensus-cloudsql-pg,Life Expectancy,Which country has highest male life expectancy?
WorldCensus-cloudsql-pg,Life Expectancy,Female life expectancy?
WorldCensus-cloudsql-pg,Population Density,What are the top 5 coutries with highest population density in 2024?
WorldCensus-cloudsql-pg,Population Density,What are the birth and death rates in these counties?
WorldCensus-cloudsql-pg,Sex Ratio at Birth,What is the sex ratio at birth in China in 2023?
WorldCensus-cloudsql-pg,Sex Ratio at Birth,Which country has the highest?
```
- 4 columns that have to be : user_grouping,scenario,question,main_question
This format allows you to indent the follow-up questions for the sake of clarity (a minimal import sketch in Python follows this menu list). An example of such a file is provided in frontend-flutter/Open_Data_QnA_sample_questions_v3 copy.csv
```
user_grouping,scenario,question,main_question
MovieExplorer-bigquery,Genres,What are the top 5 most common movie genres in the dataset?,Y
MovieExplorer-bigquery,Genres,How many are musicals?,N
MovieExplorer-bigquery,Genres,Romance?,N
MovieExplorer-bigquery,Movie,What is the average user rating of the God Father movie?,Y
MovieExplorer-bigquery,Movie,Which year was it released?,N
MovieExplorer-bigquery,Movie,How long is it?,N
MovieExplorer-bigquery,Movie,Who is the lead actor?,N
MovieExplorer-bigquery,Movie,Director,N
MovieExplorer-bigquery,Movie,Cast,N
MovieExplorer-bigquery,Movie,Who is the actor playing the role of the godfather in the Godfather movie?,Y
MovieExplorer-bigquery,Movie,and the one playing the role of Sony?,N
MovieExplorer-bigquery,Movie,How many people saw the Godfather?,y
WorldCensus-cloudsql-pg,Life Expectancy,What is the life expectancy for men and women in a United States in 2022?,Y
WorldCensus-cloudsql-pg,Life Expectancy,In India?,N
WorldCensus-cloudsql-pg,Life Expectancy,Which country has highest male life expectancy?,N
WorldCensus-cloudsql-pg,Life Expectancy,Female life expectancy?,N
WorldCensus-cloudsql-pg,Population Density,What are the top 5 coutries with highest population density in 2024?,Y
WorldCensus-cloudsql-pg,Population Density,What are the birth and death rates in these counties?,N
WorldCensus-cloudsql-pg,Sex Ratio at Birth,What is the sex ratio at birth in China in 2023?,Y
WorldCensus-cloudsql-pg,Sex Ratio at Birth,Which country has the highest?,N
WorldCensus-cloudsql-pg,Sex Ratio at Birth,What is the world average?,N
WorldCensus-cloudsql-pg,Sex Ratio at Birth,What country has the lowest male ration?,Y
WorldCensus-cloudsql-pg,Sex Ratio at Birth,Since when?,N
```
- user_grouping : name of the dataset or database
- scenario : used to tag questions that relate to the same topic
- question : the question in natural language
- main_question : when the value is Y, this is a fully understandable, self-contained question (let's call it the main question). When the value is N, it is a follow-up to the main question and does not carry the full context
- Once loaded, the questions show up with follow-ups indented (4-column format)
- History : Gives the list of questions typed during the session
- Most popular questions : Lists all the historical questions sorted by the number of times they have been typed
- Settings : Allows you to
- upload the config_frontend.json file, which is stored on Firestore so that after the app is relaunched the configuration is kept
- anonymize data (in case the user wants to do a demo but not show real data)
- set the expert mode
- All the other options are not implemented
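As mentioned in the import item above, here is a minimal Python sketch of how the 4-column format can be interpreted, indenting follow-up questions under their main question (illustrative only; the app's actual import logic lives in the Flutter code):
```
import csv

# Read the 4-column sample file and indent follow-up questions (main_question != "Y")
# under the main question they belong to.
with open("Open_Data_QnA_sample_questions_v3 copy.csv", newline="") as f:
    for row in csv.DictReader(f):
        # The sample data uses both "Y" and "y", so normalize the flag.
        is_main = row["main_question"].strip().upper() == "Y"
        indent = "" if is_main else "    "
        print(f"{indent}[{row['scenario']}] {row['question']}")
```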
6- Suggestions
Whenever a question is clicked, whether from the 4 last question cards, the imported questions, the history, or the most popular questions, 3 suggestions show up at the bottom of the app. These suggestions come from the known-good SQL stored on the backend.
Clicking on one of these suggestions also triggers 3 other suggestions.
7- Auto-completion
The app proposes autocompletion each time a character is typed.
The questions suggested for the autocompletion come from the known-good SQL stored on the backend.
8- Expert mode
As said before, when enabled, Expert mode displays the imported questions and the stepper.
Clicking on a completed step displays the details of that step.