Repository: GoogleCloudPlatform/analytics-componentized-patterns
Branch: master
Commit: 3c57f520e13f
Files: 83
Total size: 4.3 MB
Directory structure:
gitextract_tju_xjf6/
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── gaming/
│ └── propensity-model/
│ └── bqml/
│ ├── README.md
│ └── bqml_ga4_gaming_propensity_to_churn.ipynb
└── retail/
├── clustering/
│ └── bqml/
│ ├── README.md
│ └── bqml_scaled_clustering.ipynb
├── ltv/
│ └── bqml/
│ ├── README.md
│ ├── notebooks/
│ │ └── bqml_automl_ltv_activate_lookalike.ipynb
│ └── scripts/
│ ├── 00_procedure_persist.sql
│ ├── 10_procedure_match.sql
│ ├── 20_procedure_prepare.sql
│ ├── 30_procedure_train.sql
│ ├── 40_procedure_predict.sql
│ ├── 50_procedure_top.sql
│ └── run.sh
├── propensity-model/
│ └── bqml/
│ ├── README.md
│ └── bqml_kfp_retail_propensity_to_purchase.ipynb
├── recommendation-system/
│ ├── bqml/
│ │ ├── README.md
│ │ └── bqml_retail_recommendation_system.ipynb
│ ├── bqml-mlops/
│ │ ├── README.md
│ │ ├── dockerfile
│ │ ├── kfp_tutorial.ipynb
│ │ ├── part_2/
│ │ │ ├── Dockerfile
│ │ │ ├── README.md
│ │ │ ├── cloudbuild.yaml
│ │ │ ├── dockerbuild.sh
│ │ │ └── pipeline.py
│ │ └── part_3/
│ │ ├── Dockerfile
│ │ ├── README.md
│ │ ├── dockerbuild.sh
│ │ └── vertex_ai_pipeline.ipynb
│ └── bqml-scann/
│ ├── .gitignore
│ ├── 00_prep_bq_and_datastore.ipynb
│ ├── 00_prep_bq_procedures.ipynb
│ ├── 01_train_bqml_mf_pmi.ipynb
│ ├── 02_export_bqml_mf_embeddings.ipynb
│ ├── 03_create_embedding_lookup_model.ipynb
│ ├── 04_build_embeddings_scann.ipynb
│ ├── 05_deploy_lookup_and_scann_caip.ipynb
│ ├── README.md
│ ├── ann01_create_index.ipynb
│ ├── ann02_run_pipeline.ipynb
│ ├── ann_grpc/
│ │ ├── match_pb2.py
│ │ └── match_pb2_grpc.py
│ ├── ann_setup.md
│ ├── embeddings_exporter/
│ │ ├── __init__.py
│ │ ├── pipeline.py
│ │ ├── runner.py
│ │ └── setup.py
│ ├── embeddings_lookup/
│ │ └── lookup_creator.py
│ ├── index_builder/
│ │ ├── builder/
│ │ │ ├── __init__.py
│ │ │ ├── indexer.py
│ │ │ └── task.py
│ │ ├── config.yaml
│ │ └── setup.py
│ ├── index_server/
│ │ ├── Dockerfile
│ │ ├── cloudbuild.yaml
│ │ ├── lookup.py
│ │ ├── main.py
│ │ ├── matching.py
│ │ └── requirements.txt
│ ├── perf_test.ipynb
│ ├── requirements.txt
│ ├── sql_scripts/
│ │ ├── sp_ComputePMI.sql
│ │ ├── sp_ExractEmbeddings.sql
│ │ └── sp_TrainItemMatchingModel.sql
│ ├── tfx01_interactive.ipynb
│ ├── tfx02_deploy_run.ipynb
│ └── tfx_pipeline/
│ ├── Dockerfile
│ ├── __init__.py
│ ├── bq_components.py
│ ├── config.py
│ ├── item_matcher.py
│ ├── lookup_creator.py
│ ├── pipeline.py
│ ├── runner.py
│ ├── scann_evaluator.py
│ ├── scann_indexer.py
│ └── schema/
│ └── schema.pbtxt
└── time-series/
└── bqml-demand-forecasting/
├── README.md
└── bqml_retail_demand_forecasting.ipynb
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.ipynb_checkpoints/
.DS_Store
.vscode/
**/*.cpython-37..pyc
**/*.sqllite
**/*.tar.gz
retail/recommendation-system/bqml-scann/vocabulary.txt
================================================
FILE: CONTRIBUTING.md
================================================
# How to Contribute
We'd love to accept your patches and contributions to this project. There are
just a few small guidelines you need to follow.
## Contributor License Agreement
Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution;
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to <https://cla.developers.google.com/> to see
your current agreements on file or to sign a new one.
You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.
## Code reviews
All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.
## Community Guidelines
This project follows [Google's Open Source Community
Guidelines](https://opensource.google/conduct/).
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: README.md
================================================
> [!NOTE]
> This repository has been archived and is no longer actively maintained.
## Analytics Componentized Patterns
From sample dataset to activation, these componentized patterns are designed to help you get the most out of [BigQuery ML](https://cloud.google.com/bigquery-ml/docs) and other Google Cloud products in production.
### Retail use cases
* Recommendation systems:
* How to build an end-to-end recommendation system with a CI/CD MLOps pipeline on hotel data using BigQuery ML. ([Code][bqml_mlops_code] | [Blogpost][bqml_scann_guide])
* How to build a recommendation system on e-commerce data using BigQuery ML. ([Code][recomm_code] | [Blogpost][recomm_blog] | [Video][recomm_video])
* How to build an item-item real-time recommendation system on song playlists data using BigQuery ML. ([Code][bqml_scann_code] | [Reference Guide][bqml_scann_guide])
* Propensity to purchase model:
* How to build an end-to-end propensity to purchase solution using BigQuery ML and Kubeflow Pipelines. ([Code][propen_code] | [Blogpost][propen_blog])
* Activate on Lifetime Value predictions:
* How to predict the monetary value of your customers and extract the emails of the top customers, for example to create similar audiences in AdWords. Automation is done with a combination of BigQuery scripting, stored procedures, and a bash script. ([Code][ltv_code])
* Clustering:
* How to build customer segmentation through k-means clustering using BigQuery ML. ([Code][clustering_code] | [Blogpost][clustering_blog])
* Demand Forecasting:
* How to build a time series demand forecasting model using BigQuery ML. ([Code][demandforecasting_code] | [Blogpost][demandforecasting_blog] | [Video][demandforecasting_video])
### Gaming use cases
* Propensity to churn model:
* Churn prediction for game developers using Google Analytics 4 (GA4) and BigQuery ML. ([Code][gaming_propen_code] | [Blogpost][gaming_propen_blog] | [Video][gaming_propen_video])
### Financial use cases
* Fraud detection
* How to build a real-time credit card fraud detection solution. ([Code][ccfraud_code] | [Blogpost][ccfraud_techblog] | [Video][ccfraud_video])
[gaming_propen_code]: gaming/propensity-model/bqml
[gaming_propen_blog]: https://cloud.google.com/blog/topics/developers-practitioners/churn-prediction-game-developers-using-google-analytics-4-ga4-and-bigquery-ml
[gaming_propen_video]: https://www.youtube.com/watch?v=t5a0gwPM4I8
[recomm_code]: retail/recommendation-system/bqml
[recomm_blog]: https://medium.com/google-cloud/how-to-build-a-recommendation-system-on-e-commerce-data-using-bigquery-ml-df9af2b8c110
[recomm_video]: https://youtube.com/watch?v=sEx8RwvT_-8
[bqml_scann_code]: retail/recommendation-system/bqml-scann
[bqml_mlops_code]: retail/recommendation-system/bqml-mlops
[bqml_scann_guide]: https://cloud.google.com/solutions/real-time-item-matching
[propen_code]: retail/propensity-model/bqml
[propen_blog]: https://medium.com/google-cloud/how-to-build-an-end-to-end-propensity-to-purchase-solution-using-bigquery-ml-and-kubeflow-pipelines-cd4161f734d9
[ltv_code]: retail/ltv/bqml
[clustering_code]: retail/clustering/bqml
[clustering_blog]: https://towardsdatascience.com/how-to-build-audience-clusters-with-website-data-using-bigquery-ml-6b604c6a084c
[demandforecasting_code]: retail/time-series/bqml-demand-forecasting
[demandforecasting_blog]: https://cloud.google.com/blog/topics/developers-practitioners/how-build-demand-forecasting-models-bigquery-ml
[demandforecasting_video]: https://www.youtube.com/watch?v=dwOt68CevYA
[ccfraud_code]: https://gitlab.qdatalabs.com/uk-gtm/patterns/cc_fraud_detection/tree/master
[ccfraud_techblog]: https://cloud.google.com/blog/products/data-analytics/how-to-build-a-fraud-detection-solution
[ccfraud_video]: https://youtu.be/qQnxq3COr9Q
## Questions? Feedback?
If you have any questions or feedback, please open up a [new issue](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/issues).
## Disclaimer
This is not an officially supported Google product.
All files in this repository are under the [Apache License, Version 2.0](LICENSE) unless noted otherwise.
================================================
FILE: gaming/propensity-model/bqml/README.md
================================================
## License
```
Copyright 2021 Google LLC
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```
# Churn prediction for game developers using Google Analytics 4 (GA4) and BigQuery ML
This notebook showcases how you can use BigQuery ML to run propensity models on Google Analytics 4 data from your gaming app to determine the likelihood of specific users returning to your app.
Using this notebook, you'll learn how to:
- Explore the BigQuery export dataset for Google Analytics 4
- Prepare the training data using demographic and behavioural attributes
- Train propensity models using BigQuery ML
- Evaluate BigQuery ML models
- Make predictions using the BigQuery ML models
- Apply model insights in practice
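
The train, evaluate, and predict steps listed above can be sketched as a few Python helpers that assemble BigQuery ML statements. This is a minimal illustration rather than the notebook's actual code: the helper functions, the model name `bqmlga4.churn_logreg`, the table name `bqmlga4.train`, and the label column `churned` are all assumptions made for the example.

```python
"""Sketch: assemble BigQuery ML statements for a churn propensity model."""


def create_model_sql(model: str, training_table: str,
                     label_col: str = "churned",
                     model_type: str = "LOGISTIC_REG") -> str:
    """Build a CREATE MODEL statement that trains on every column of the table."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label_col}']) AS\n"
        f"SELECT * FROM `{training_table}`"
    )


def evaluate_sql(model: str) -> str:
    """Build an ML.EVALUATE query that returns the model's evaluation metrics."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model}`)"


def predict_sql(model: str, input_table: str) -> str:
    """Build an ML.PREDICT query that scores every row of input_table."""
    return f"SELECT * FROM ML.PREDICT(MODEL `{model}`, TABLE `{input_table}`)"


if __name__ == "__main__":
    # Hypothetical names; replace them with your own dataset, model, and table.
    model = "bqmlga4.churn_logreg"
    table = "bqmlga4.train"
    for statement in (create_model_sql(model, table),
                      evaluate_sql(model),
                      predict_sql(model, table)):
        print(statement, end="\n\n")
```

Each generated statement could then be run in a `%%bigquery` cell or passed to `google.cloud.bigquery.Client(project=PROJECT_ID).query(...)`. Building the statements as plain strings keeps the sketch runnable without a Google Cloud project.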
## Architecture Diagram

## More resources
If you’d like to learn more about any of the topics covered in this notebook, check out these resources:
- [BigQuery export of Google Analytics data]
- [BigQuery ML quickstart]
- [Events automatically collected by Google Analytics 4]
## Questions? Feedback?
If you have any questions or feedback, please open up a [new issue](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/issues).
[BigQuery export of Google Analytics data]: https://support.google.com/analytics/answer/9358801
[BigQuery ML quickstart]: https://cloud.google.com/bigquery-ml/docs/bigqueryml-web-ui-start
[Events automatically collected by Google Analytics 4]: https://support.google.com/analytics/answer/9234069
================================================
FILE: gaming/propensity-model/bqml/bqml_ga4_gaming_propensity_to_churn.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XM4xjzQNzHwz"
},
"outputs": [],
"source": [
"# Copyright 2020 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tTt2Oe0szW0M"
},
"source": [
"<table align=\"left\">\n",
" <td>\n",
" <a href=\"https://console.cloud.google.com/ai-platform/notebooks/deploy-notebook?name=Churn%20prediction%20for%20game%20developers%20using%20Google%20Analytics%204%20%28GA4%29%20and%20BigQuery%20ML%20Notebook&download_url=https%3A%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fanalytics-componentized-patterns%2Fmaster%2Fgaming%2Fpropensity-model%2Fbqml%2Fbqml_ga4_gaming_propensity_to_churn.ipynb\">\n",
" <img src=\"https://cloud.google.com/images/products/ai/ai-solutions-icon.svg\" alt=\"AI Platform Notebooks\">Run on AI Platform Notebooks</a>\n",
" </td>\n",
" <td>\n",
" <a href=\"https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/blob/master/gaming/propensity-model/bqml/bqml_ga4_gaming_propensity_to_churn.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
" View on GitHub\n",
" </a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DWW_k7u_zaER"
},
"source": [
"# Overview\n",
"This notebook shows you how you can train, evaluate, and deploy a propensity model in BigQuery ML to predict user retention on a mobile game, based on app measurement data from Google Analytics 4.\n",
"\n",
"#### Propensity modeling in the mobile gaming industry\n",
"According to a [2019 study](https://gameanalytics.com/reports/mobile-gaming-industry-analysis-h1-2019) on 100K mobile games by the Mobile Gaming Industry Analysis, most mobile games only see a 25% retention rate for users after the first 24 hours, and any game \"below 30% retention generally needs improvement\". In light of this, using machine learning to identify the propensity that a user will churn after day 1 can allow app developers to incentivize users at higher risk of churning to return.\n",
"\n",
"To predict the propensity (a.k.a. likelihood) that a user will return vs churn, you can use classification algorithms, like logistic regression, XGBoost, neural networks, or AutoML Tables, all of which are available with BigQuery ML.\n",
"\n",
"#### Propensity modeling in BigQuery ML\n",
"With BigQuery ML, you can train, evaluate, and deploy your models directly within BigQuery using SQL, which saves you from having to manually configure ML infrastructure. You can train and deploy ML models directly where the data is already stored, which also helps to avoid potential issues around data governance.\n",
"\n",
"Using classification models that you train and deploy in BigQuery ML, you can predict propensity using the output of the models. Each model outputs a probability score between 0 and 1.0 that indicates how likely the user is to churn (1) or not churn (0).\n",
"\n",
"Using the probability (propensity) scores, you can then, for example, target users who may not return on their own, but could potentially return if they are provided with an incentive or notification.\n",
"\n",
"#### Not just churn -- propensity modeling for any behavior\n",
"Propensity modeling is not limited to predicting churn. In fact, you can calculate a propensity score for any behavior you may want to predict. For example, you may want to predict the likelihood a user will spend money on in-app purchases. Or, perhaps you can predict the likelihood of a user performing \"stickier\" behaviors such as adding and playing with friends, which could lead to longer-term retention and organic user growth. Whichever the case, you can easily modify this notebook to suit your needs, as the overall workflow will still be the same."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "t44p6IQUzrY4"
},
"source": [
"## Scope of this notebook"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nznL3qK4z8Jm"
},
"source": [
"### Dataset\n",
"\n",
"This notebook uses [this public BigQuery dataset](https://console.cloud.google.com/bigquery?p=firebase-public-project&d=analytics_153293282&t=events_20181003&page=table), which contains raw event data from a real mobile gaming app called Flood It! ([Android app](https://play.google.com/store/apps/details?id=com.labpixies.flood), [iOS app](https://itunes.apple.com/us/app/flood-it!/id476943146?mt=8)). The [data schema](https://support.google.com/analytics/answer/7029846) originates from Google Analytics for Firebase, but is the same schema as [Google Analytics 4](https://support.google.com/analytics/answer/9358801); this notebook applies to use cases that use either Google Analytics for Firebase or Google Analytics 4 data.\n",
"\n",
"Google Analytics 4 (GA4) uses an [event-based](https://support.google.com/analytics/answer/9322688) measurement model. Events provide insight on what is happening in an app or on a website, such as user actions, system events, or errors. Every row in the dataset is an event, with various characteristics relevant to that event stored in a nested format within the row. While Google Analytics already logs many types of events by default, developers can also customize the types of events they wish to log.\n",
"\n",
"Note that as you cannot simply use the raw event data to train a machine learning model, in this notebook, you will also learn the important steps of how to pre-process the raw data into an appropriate format to use as training data for classification models."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hxrwZlF2PYQ8"
},
"source": [
"#### Using your own GA4 data?\n",
"If you are already using a Google Analytics 4 property, follow [this guide](https://support.google.com/analytics/answer/9823238) to learn how to export your GA4 data to BigQuery. Once the GA4 data is in BigQuery, there will be two tables:\n",
"\n",
"* `events_`\n",
"* `events_intraday_`\n",
"\n",
"For this notebook, you can replace the table in the `FROM` clause in SQL queries with your `events_` table that is updated daily. The `events_intraday_` table contains streaming data for the current day.\n",
"\n",
"Note that if you use your own GA4 data, you may need to slightly modify some of the scripts in this notebook to predict a different output behavior or to handle the types of events in the training data that are specific to your use case. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KJB3rOtfrBIl"
},
"source": [
"#### Using data from other non-Google Analytics data collection tools?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9corrB3IrPeH"
},
"source": [
"While this notebook provides code based on a Google Analytics dataset, you can also use your own dataset from other non-Google Analytics data collection tools. The overall concepts and process of propensity modeling will be the same, but you may need to customize the code in order to prepare your dataset into the training data format described in this notebook.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sH0CZGku0BPp"
},
"source": [
"### Objective and Problem Statement\n",
"\n",
"The goal of this notebook is to provide an end-to-end solution for propensity modeling to predict user churn on GA4 data using BigQuery ML. Using the \"Flood It!\" dataset, based on a user's activity within the first 24 hrs of app installation, you will try various classification models to predict the propensity to churn (1) or not churn (0).\n",
"\n",
"By the end of this notebook, you will know how to:\n",
"* Explore the export of Google Analytics 4 data on BigQuery\n",
"* Prepare the training data using demographic, behavioral data, and the label (churn/not-churn)\n",
"* Train classification models using BigQuery ML\n",
"* Evaluate classification models using BigQuery ML\n",
"* Make predictions on which users will churn using BigQuery ML\n",
"* Activate on model predictions"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IStlwydL0NWC"
},
"source": [
"### Costs \n",
"\n",
"There is no cost associated with using the free version of Google Analytics and using the BigQuery Export feature. This tutorial uses billable components of Google Cloud Platform (GCP):\n",
"\n",
"* BigQuery\n",
"* BigQuery ML\n",
"\n",
"Learn about [BigQuery pricing](https://cloud.google.com/bigquery/pricing), [BigQuery ML\n",
"pricing](https://cloud.google.com/bigquery-ml/pricing) and use the [Pricing\n",
"Calculator](https://cloud.google.com/products/calculator/)\n",
"to generate a cost estimate based on your projected usage."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "coMAAaOH0Tcl"
},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NUlqCGWz0VGL"
},
"source": [
"### PIP Install Packages and dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MujO3LNG0W26"
},
"outputs": [],
"source": [
"!pip install google-cloud-bigquery"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oRab23dd0aeC"
},
"outputs": [],
"source": [
"# Automatically restart kernel after installs\n",
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True) "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8PpWu5MR0bPf"
},
"source": [
"### Set up your GCP project\n",
"\n",
"_The following steps are required, regardless of your notebook environment._\n",
"\n",
"1. [Select or create a GCP project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
"\n",
"1. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
"\n",
"1. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n",
"\n",
"1. Enter your project ID and region in the cell below. Then run the cell to make sure the\n",
"Cloud SDK uses the right project for all the commands in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PROJECT_ID = \"YOUR-PROJECT-ID\" #replace with your project id\n",
"REGION = 'US'"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mm5cGuC00mHI"
},
"source": [
"### Import libraries and define constants"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DRYImlXa0mld"
},
"outputs": [],
"source": [
"from google.cloud import bigquery\n",
"import pandas as pd\n",
"\n",
"pd.set_option('display.float_format', lambda x: '%.3f' % x)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BlfwBj4P0rHD"
},
"source": [
"### Create a BigQuery dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "X9TLkb8f0tCE"
},
"source": [
"In this notebook, you will need to create a dataset in your project called `bqmlga4`. To create it, run the following cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d8LyhoGy0ttK"
},
"outputs": [],
"source": [
"DATASET_NAME = \"bqmlga4\"\n",
"!bq mk --location=$REGION --dataset $PROJECT_ID:$DATASET_NAME"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_sOVqTXM0-Wc"
},
"source": [
"## The dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4LMr6Q3MQt0I"
},
"source": [
"### Using the sample gaming event data from Flood it!\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wV7wi8X51eNw"
},
"source": [
"The sample dataset contains raw event data, as shown in the next cell:\n",
"\n",
"_Note_: Jupyter runs cells starting with %%bigquery as SQL queries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 551
},
"id": "wwz_N3Kh1f9l",
"outputId": "8b162c57-12a6-4262-e482-6e4d0ae3b47c"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"SELECT \n",
" *\n",
"FROM\n",
" `firebase-public-project.analytics_153293282.events_*`\n",
" \n",
"TABLESAMPLE SYSTEM (1 PERCENT)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PYBMN963Qydl"
},
"source": [
"It may be helpful to take a look at the overall schema used in Google Analytics 4. As mentioned earlier, Google Analytics 4 uses an event based measurement model and each row in this dataset is an event. [Click here](https://support.google.com/analytics/answer/7029846) to view the complete schema and details about each column. As you can see above, certain columns are nested records and contain detailed information:\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IaM8co6eRsOp"
},
"source": [
"\n",
"* `app_info`\n",
"* `device`\n",
"* `ecommerce`\n",
"* `event_params`\n",
"* `geo`\n",
"* `traffic_source`\n",
"* `user_properties`\n",
"* `items`*\n",
"* `web_info`*\n",
"\n",
"_* present by default in GA4 datasets_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dLUv-7xNRhAj"
},
"source": [
"As we can see below, there are 15K users and 5.7M events in this dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 77
},
"id": "MjqKMGVDRPyZ",
"outputId": "6c0d76c7-ad92-40de-a689-365867b23281"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"SELECT \n",
" COUNT(DISTINCT user_pseudo_id) as count_distinct_users,\n",
" COUNT(event_timestamp) as count_events\n",
"FROM\n",
" `firebase-public-project.analytics_153293282.events_*`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3iHaV9-q1k1i"
},
"source": [
"### Preparing the training data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P358Jo_s8WC0"
},
"source": [
"You cannot simply use raw event data to train a machine learning model as it would not be in the right shape and format to use as training data. So in this section, you will learn how to pre-process the raw data into an appropriate format to use as training data for classification models.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eXl1kSh1yPXk"
},
"source": [
"To predict which user is going to _churn_ or _return_, the ideal training data format for classification should look like the following: \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Xv8ibjMNy_bV"
},
"source": [
"|User ID|User demographic data|User behavioral data|Churned|\n",
"|-|-|-|-|\n",
"|User1|(e.g., country, device_type)|(e.g., # of times they did something within a time period)|1\n",
"|User2|(e.g., country, device_type)|(e.g., # of times they did something within a time period)|0\n",
"|User3|(e.g., country, device_type)|(e.g., # of times they did something within a time period)|1\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HydYCB2jzrzn"
},
"source": [
"Characteristics of the training data:\n",
"- each row is a separate unique user ID\n",
"- feature(s) for **demographic data**\n",
"- feature(s) for **behavioral data**\n",
"- the actual **label** that you want to train the model to predict (e.g., 1 = churned, 0 = returned)\n",
"\n",
"You can train a model with only demographic data or behavioral data, but having a combination of both will likely help you create a more predictive model. For this reason, in this section, you will learn how to pre-process the raw data to follow this training data format."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ICpTsrfg2-Cw"
},
"source": [
"The following sections will walk you through preparing the demographic data, behavioral data, and the label before joining them all together as the training data.\n",
"\n",
"1. Identifying the label for each user (churned or returned)\n",
"1. Extracting demographic data for each user\n",
"1. Extracting behavioral data for each user\n",
"1. Combining the label, demographic and behavioral data together as training data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZYHefnNx21lO"
},
"source": [
"#### Step 1: Identifying the label for each user"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Qt6a5Kv-25iq"
},
"source": [
"The raw dataset doesn't have a feature that simply identifies users as \"churned\" or \"returned\", so in this section, you will need to create this label based on some of the existing columns."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lgqxD30Hl6FM"
},
"source": [
"There are many ways to define user churn, but for the purposes of this notebook, you will predict 1-day churn as users who do not come back and use the app again after 24 hr of the user's first engagement. \n",
"\n",
"In other words, after 24 hr of a user's first engagement with the app:\n",
"- if the user _shows no event data thereafter_, the user is considered **churned**. \n",
"- if the user _does have at least one event datapoint thereafter_, then the user is considered **returned**\n",
"\n",
"You may also want to remove users who were unlikely to have ever returned anyway after spending just a few minutes with the app, which is sometimes referred to as \"bouncing\". For example, we can say want to build our model only on users who spent at least 10 minutes with the app (users who didn't bounce).\n",
"\n",
"So your updated definition of a **churned user** for this notebook is:\n",
"> \"any user who spent at least 10 minutes on the app, but after 24 hour from when they first engaged with the app, never used the app again\"\n"
]
},
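{
"cell_type": "markdown",
"metadata": {},
"source": [
"The definition above can be sketched in plain Python before turning to SQL. This is only an illustration: `label_user` is a hypothetical helper, the timestamps are made up, and it assumes timestamps are expressed in microseconds, as `event_timestamp` is in the export.\n",
"\n",
"```python\n",
"# Illustrative sketch only; label_user is a hypothetical helper, not part of the notebook.\n",
"DAY_US = 24 * 60 * 60 * 1_000_000   # 24 hours in microseconds\n",
"TEN_MIN_US = 10 * 60 * 1_000_000    # 10 minutes in microseconds\n",
"\n",
"def label_user(first_ts, last_ts):\n",
"    # churned: the last engagement falls inside the first 24 hours\n",
"    churned = 1 if last_ts < first_ts + DAY_US else 0\n",
"    # bounced: the last engagement falls inside the first 10 minutes\n",
"    bounced = 1 if last_ts <= first_ts + TEN_MIN_US else 0\n",
"    return churned, bounced\n",
"\n",
"first = 1_528_000_000_000_000  # made-up first-engagement timestamp\n",
"print(label_user(first, first + 5 * 60 * 1_000_000))    # last seen after 5 minutes\n",
"print(label_user(first, first + 3 * 3600 * 1_000_000))  # last seen after 3 hours\n",
"print(label_user(first, first + 2 * DAY_US))            # last seen after 2 days\n",
"```\n",
"\n",
"Note that every bounced user is also a churned user under this definition, which is why bounced users are filtered out rather than treated as a separate class."
]
},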
{
"cell_type": "markdown",
"metadata": {
"id": "_YyxwMQQ1uQW"
},
"source": [
"In SQL, since the raw data contains all of the events for every user, from their first touch (app installation) to their last touch, you can use this information to create two columns: `churned` and `bounced`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d_JCCtuZVzne"
},
"source": [
"Take a look at the following SQL query and the results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 531
},
"id": "_QQO3POV2EQ4",
"outputId": "8369d0bd-8527-42bd-d1c6-aa8e4c380cb0"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID \n",
"\n",
"CREATE OR REPLACE VIEW bqmlga4.returningusers AS (\n",
" WITH firstlasttouch AS (\n",
" SELECT\n",
" user_pseudo_id,\n",
" MIN(event_timestamp) AS user_first_engagement,\n",
" MAX(event_timestamp) AS user_last_engagement\n",
" FROM\n",
" `firebase-public-project.analytics_153293282.events_*`\n",
" WHERE event_name=\"user_engagement\"\n",
" GROUP BY\n",
" user_pseudo_id\n",
"\n",
" )\n",
" SELECT\n",
" user_pseudo_id,\n",
" user_first_engagement,\n",
" user_last_engagement,\n",
" EXTRACT(MONTH from TIMESTAMP_MICROS(user_first_engagement)) as month,\n",
" EXTRACT(DAYOFYEAR from TIMESTAMP_MICROS(user_first_engagement)) as julianday,\n",
" EXTRACT(DAYOFWEEK from TIMESTAMP_MICROS(user_first_engagement)) as dayofweek,\n",
"\n",
" #add 24 hr to user's first touch\n",
" (user_first_engagement + 86400000000) AS ts_24hr_after_first_engagement,\n",
"\n",
"#churned = 1 if last_touch within 24 hr of app installation, else 0\n",
"IF (user_last_engagement < (user_first_engagement + 86400000000),\n",
" 1,\n",
" 0 ) AS churned,\n",
"\n",
"#bounced = 1 if last_touch within 10 min, else 0\n",
"IF (user_last_engagement <= (user_first_engagement + 600000000),\n",
" 1,\n",
" 0 ) AS bounced,\n",
" FROM\n",
" firstlasttouch\n",
" GROUP BY\n",
" 1,2,3\n",
" );\n",
"\n",
"SELECT \n",
" * \n",
"FROM \n",
" bqmlga4.returningusers \n",
"LIMIT 100;"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FOoqPb2J2Q5f"
},
"source": [
"For the `churned` column, `churned=0` if the user performs an action after 24 hours since their first touch, otherwise if their last action was only within the first 24 hours, then `churned=1`.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sC3sIc0C2a4Z"
},
"source": [
"For the `bounced` column, `bounced=1` if the user's last action was within the first ten minutes since their first touch with the app, otherwise `bounced=0`. We can use this column to filter our training data later on, by conditionally querying for users where `bounced = 0`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ulbfb8SY2fSM"
},
"source": [
"You might wonder how many of these 15k users bounced and returned? You can run the following query to check:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 136
},
"id": "gC32zyIE2olw",
"outputId": "6c51e523-0a4f-4a8d-a4c4-e0f7f0131925"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"SELECT\n",
" bounced,\n",
" churned, \n",
" COUNT(churned) as count_users\n",
"FROM\n",
" bqmlga4.returningusers\n",
"GROUP BY 1,2\n",
"ORDER BY bounced"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z29RsKJi2uwO"
},
"source": [
"For the training data, you will only end up using data where `bounced = 0`. Based on the 15k users, you can see that 5,557 (\\~41%) users bounced within the first ten minutes of their first engagement with the app, but of the remaining 8,031 users, 1,883 users (\\~23%) churned after 24 hours."
]
},
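{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the percentages can be recomputed from the counts quoted above. The sketch below only uses those three numbers; the variable names are illustrative.\n",
"\n",
"```python\n",
"# Counts quoted in the text above (taken from the query output).\n",
"bounced_users = 5557\n",
"non_bounced_users = 8031\n",
"churned_non_bounced = 1883\n",
"\n",
"total = bounced_users + non_bounced_users\n",
"bounce_rate = bounced_users / total                     # share of users who bounced\n",
"churn_rate = churned_non_bounced / non_bounced_users    # churn among non-bounced users\n",
"print(f'total={total}, bounce_rate={bounce_rate:.1%}, churn_rate={churn_rate:.1%}')\n",
"```\n",
"\n",
"The second ratio is the same quantity that `COUNTIF(churned=1)/COUNT(churned)` computes when restricted to `bounced = 0`."
]
},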
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 77
},
"id": "ZStzVtgEIkzh",
"outputId": "15b2abb5-f966-4363-f7f6-b676f6b521b8"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"SELECT\n",
" COUNTIF(churned=1)/COUNT(churned) as churn_rate\n",
"FROM\n",
" bqmlga4.returningusers\n",
"WHERE bounced = 0"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "daSQViux_XWR"
},
"source": [
"#### Step 2. Extracting demographic data for each user"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qcC2JJ6W_-sb"
},
"source": [
"This section is focused on extracting the demographic information for each user. Different demographic information about the user is available in the dataset already, including `app_info`, `device`, `ecommerce`, `event_params`, `geo`. Demographic features can help the model predict whether users on certain devices or countries are more likely to churn.\n",
"\n",
"For this notebook, you can start just with `geo.country`, `device.operating_system`, and `device.language`. If you are using your own dataset and have joinable first-party data, this section is a good opportunity to add any additional attributes for each user that may not be readily available in Google Analytics 4.\n",
"\n",
"Note that a user's demographics may occasionally change (e.g. moving from one country to another). For simplicity, you will just use the demographic information that Google Analytics 4 provides when the user first engaged with the app as indicated by `MIN(event_timestamp)`. This enables every unique user to be represented by a single row."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 343
},
"id": "gc47WFyM_5nQ",
"outputId": "5c545aef-eaa8-451e-8443-38461d2c9923"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"CREATE OR REPLACE VIEW bqmlga4.user_demographics AS (\n",
"\n",
" WITH first_values AS (\n",
" SELECT\n",
" user_pseudo_id,\n",
" geo.country as country,\n",
" device.operating_system as operating_system,\n",
" device.language as language,\n",
" ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp DESC) AS row_num\n",
" FROM `firebase-public-project.analytics_153293282.events_*`\n",
" WHERE event_name=\"user_engagement\"\n",
" )\n",
" SELECT * EXCEPT (row_num)\n",
" FROM first_values\n",
" WHERE row_num = 1\n",
" );\n",
"\n",
"SELECT\n",
" *\n",
"FROM\n",
" bqmlga4.user_demographics\n",
"LIMIT 10"
]
},
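{
"cell_type": "markdown",
"metadata": {},
"source": [
"Picking one row per user can be sketched in plain Python as follows. This is only an illustration with made-up events; the field names `user`, `ts` and `country` are assumptions. For each user it keeps the row with the earliest timestamp, which is the intent described above.\n",
"\n",
"```python\n",
"# Made-up events; field names are illustrative, not the export schema.\n",
"events = [\n",
"    {'user': 'u1', 'ts': 30, 'country': 'CA'},\n",
"    {'user': 'u1', 'ts': 10, 'country': 'US'},\n",
"    {'user': 'u2', 'ts': 20, 'country': 'JP'},\n",
"]\n",
"\n",
"first_rows = {}\n",
"for event in sorted(events, key=lambda e: e['ts']):\n",
"    # setdefault keeps the first row seen per user, i.e. the earliest one\n",
"    first_rows.setdefault(event['user'], event)\n",
"\n",
"print(first_rows)\n",
"```\n",
"\n",
"In SQL, the same effect comes from numbering each user's rows by timestamp and keeping the row numbered 1."
]
},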
{
"cell_type": "markdown",
"metadata": {
"id": "Nxv9yaTt2zD1"
},
"source": [
"#### Step 3. Extracting behavioral data for each user"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c_9qPkVAfmpY"
},
"source": [
"Behavioral data in the raw event data spans across multiple events -- and thus rows -- per user. The goal of this section is to aggregate and extract behavioral data for each user, resulting in one row of behavioral data per unique user.\n",
"\n",
"But what kind of behavioral data will you need to prepare? Since the end goal of this notebook is to predict, based on a user's activity within the first 24 hrs since app installation, whether that user will churn or return thereafter, then you will want to use behavioral data from the first 24 hrs in your training data. Later on, we can also extract some extra time-related features from `user_first_engagement`, such as the month or day of the first engagement.\n",
"\n",
"Google Analytics automatically collects [certain events](https://support.google.com/analytics/answer/6317485) that you can use to analyze behavior. In addition, there are certain recommended [events for games](https://support.google.com/analytics/answer/6317494). \n",
"\n",
"\n",
"As a first step, you can explore all the unique events that exist in this dataset, based on `event_name`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "XsXBNmeAf3fI",
"outputId": "da0f4a32-83ba-42e5-c381-49178c44f5f1"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"SELECT\n",
" event_name,\n",
" COUNT(event_name) as event_count\n",
"FROM\n",
" `firebase-public-project.analytics_153293282.events_*`\n",
"GROUP BY 1\n",
"ORDER BY\n",
" event_count DESC"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "smsCdPFpmJpT"
},
"source": [
"For this notebook, to predict whether a user will churn or return, you can start by counting the number of times a user engages in the following event types:\n",
"\n",
"* `user_engagement`\n",
"* `level_start_quickplay`\n",
"* `level_end_quickplay`\n",
"* `level_complete_quickplay`\n",
"* `level_reset_quickplay`\n",
"* `post_score`\n",
"* `spend_virtual_currency`\n",
"* `ad_reward`\n",
"* `challenge_a_friend`\n",
"* `completed_5_levels`\n",
"* `use_extra_steps`\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "D9vIqhuj_qFW"
},
"source": [
"In SQL, you can aggregate the behavioral data by calculating the total number of times when each of the above `event_names` occurred in the data set per user.\n",
"\n",
"If you are using your own dataset, you may have different event types that you can aggregate and extract. Your app may be sending very different `event_names` to Google Analytics so be sure to use events most suitable to your scenario."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 363
},
"id": "nzbVtI6G_Y9p",
"outputId": "8ae82387-01ff-4bbe-b629-659e4754ac81"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"CREATE OR REPLACE VIEW bqmlga4.user_aggregate_behavior AS (\n",
"WITH\n",
" events_first24hr AS (\n",
" #select user data only from first 24 hr of using the app\n",
" SELECT\n",
" e.*\n",
" FROM\n",
" `firebase-public-project.analytics_153293282.events_*` e\n",
" JOIN\n",
" bqmlga4.returningusers r\n",
" ON\n",
" e.user_pseudo_id = r.user_pseudo_id\n",
" WHERE\n",
" e.event_timestamp <= r.ts_24hr_after_first_engagement\n",
" )\n",
"SELECT\n",
" user_pseudo_id,\n",
" SUM(IF(event_name = 'user_engagement', 1, 0)) AS cnt_user_engagement,\n",
" SUM(IF(event_name = 'level_start_quickplay', 1, 0)) AS cnt_level_start_quickplay,\n",
" SUM(IF(event_name = 'level_end_quickplay', 1, 0)) AS cnt_level_end_quickplay,\n",
" SUM(IF(event_name = 'level_complete_quickplay', 1, 0)) AS cnt_level_complete_quickplay,\n",
" SUM(IF(event_name = 'level_reset_quickplay', 1, 0)) AS cnt_level_reset_quickplay,\n",
" SUM(IF(event_name = 'post_score', 1, 0)) AS cnt_post_score,\n",
" SUM(IF(event_name = 'spend_virtual_currency', 1, 0)) AS cnt_spend_virtual_currency,\n",
" SUM(IF(event_name = 'ad_reward', 1, 0)) AS cnt_ad_reward,\n",
" SUM(IF(event_name = 'challenge_a_friend', 1, 0)) AS cnt_challenge_a_friend,\n",
" SUM(IF(event_name = 'completed_5_levels', 1, 0)) AS cnt_completed_5_levels,\n",
" SUM(IF(event_name = 'use_extra_steps', 1, 0)) AS cnt_use_extra_steps,\n",
"FROM\n",
" events_first24hr\n",
"GROUP BY\n",
" 1\n",
" );\n",
"\n",
"SELECT\n",
" *\n",
"FROM\n",
" bqmlga4.user_aggregate_behavior\n",
"LIMIT 10"
]
},
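{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `SUM(IF(event_name = ..., 1, 0))` pattern is a per-user conditional count. A minimal Python sketch of the same idea, using made-up rows and only three of the tracked event names, might look like this:\n",
"\n",
"```python\n",
"from collections import Counter, defaultdict\n",
"\n",
"# Made-up (user_id, event_name) rows standing in for first-24-hour events.\n",
"events = [\n",
"    ('u1', 'user_engagement'),\n",
"    ('u1', 'post_score'),\n",
"    ('u1', 'user_engagement'),\n",
"    ('u2', 'level_start_quickplay'),\n",
"    ('u2', 'session_start'),  # not in the tracked list, so it is ignored\n",
"]\n",
"TRACKED = ['user_engagement', 'level_start_quickplay', 'post_score']\n",
"\n",
"counts = defaultdict(Counter)\n",
"for user_id, event_name in events:\n",
"    if event_name in TRACKED:\n",
"        counts[user_id][event_name] += 1\n",
"\n",
"# One row per user, one cnt_* column per tracked event (0 when absent).\n",
"rows = {\n",
"    user_id: {f'cnt_{name}': c[name] for name in TRACKED}\n",
"    for user_id, c in counts.items()\n",
"}\n",
"print(rows)\n",
"```\n",
"\n",
"Events outside the tracked list contribute nothing, and a tracked event that a user never triggered shows up as 0 rather than being missing."
]
},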
{
"cell_type": "markdown",
"metadata": {
"id": "EO1zcaiK_I4S"
},
"source": [
"Note that in addition to frequency of performing an action, you can also include other behavioral features in this step such as the total amount of in-game currency they spent, or if they reached certain app-specifc milestones that may be more relevant to your app (e.g., gained a certain threshold amount of XP or leveled up at least 5 times). This is an opportunity for you to extend this notebook to suit your needs."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PrWx_WQQBitA"
},
"source": [
"#### Step 4: Combining the label, demographic and behavioral data together as training data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o54qf5U6ik2l"
},
"source": [
"In this section, you can now combine these three intermediary views (label, demographic, and behavioral data) into the final training data. Here you can also specify `bounced = 0`, in order to limit the training data only to users who did not \"bounce\" within the first 10 minutes of using the app."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 531
},
"id": "2i4WeTqLB1mC",
"outputId": "c6669996-83c4-4fb5-fcde-c3135ffd7705"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"CREATE OR REPLACE VIEW bqmlga4.train AS (\n",
" \n",
" SELECT\n",
" dem.*,\n",
" IFNULL(beh.cnt_user_engagement, 0) AS cnt_user_engagement,\n",
" IFNULL(beh.cnt_level_start_quickplay, 0) AS cnt_level_start_quickplay,\n",
" IFNULL(beh.cnt_level_end_quickplay, 0) AS cnt_level_end_quickplay,\n",
" IFNULL(beh.cnt_level_complete_quickplay, 0) AS cnt_level_complete_quickplay,\n",
" IFNULL(beh.cnt_level_reset_quickplay, 0) AS cnt_level_reset_quickplay,\n",
" IFNULL(beh.cnt_post_score, 0) AS cnt_post_score,\n",
" IFNULL(beh.cnt_spend_virtual_currency, 0) AS cnt_spend_virtual_currency,\n",
" IFNULL(beh.cnt_ad_reward, 0) AS cnt_ad_reward,\n",
" IFNULL(beh.cnt_challenge_a_friend, 0) AS cnt_challenge_a_friend,\n",
" IFNULL(beh.cnt_completed_5_levels, 0) AS cnt_completed_5_levels,\n",
" IFNULL(beh.cnt_use_extra_steps, 0) AS cnt_use_extra_steps,\n",
" ret.user_first_engagement,\n",
" ret.month,\n",
" ret.julianday,\n",
" ret.dayofweek,\n",
" ret.churned\n",
" FROM\n",
" bqmlga4.returningusers ret\n",
" LEFT OUTER JOIN\n",
" bqmlga4.user_demographics dem\n",
" ON \n",
" ret.user_pseudo_id = dem.user_pseudo_id\n",
" LEFT OUTER JOIN \n",
" bqmlga4.user_aggregate_behavior beh\n",
" ON\n",
" ret.user_pseudo_id = beh.user_pseudo_id\n",
" WHERE ret.bounced = 0\n",
" );\n",
"\n",
"SELECT\n",
" *\n",
"FROM\n",
" bqmlga4.train\n",
"LIMIT 10"
]
},
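{
"cell_type": "markdown",
"metadata": {},
"source": [
"The join logic can be illustrated with a small Python sketch. The dictionaries below are made-up stand-ins for the three views, with a single behavioral column for brevity. It shows the two details that matter: users without a matching row are kept (as with `LEFT OUTER JOIN`), and missing counts default to 0 (as with `IFNULL`).\n",
"\n",
"```python\n",
"# Made-up rows keyed by user id; the names mirror the three views.\n",
"labels = {\n",
"    'u1': {'churned': 1, 'bounced': 0},\n",
"    'u2': {'churned': 0, 'bounced': 1},\n",
"    'u3': {'churned': 0, 'bounced': 0},\n",
"}\n",
"demographics = {'u1': {'country': 'US'}, 'u3': {'country': 'JP'}}\n",
"behavior = {'u1': {'cnt_post_score': 4}}  # u3 has no behavior row\n",
"\n",
"train = []\n",
"for user_id, label in labels.items():\n",
"    if label['bounced'] != 0:  # WHERE ret.bounced = 0\n",
"        continue\n",
"    row = {'user_pseudo_id': user_id}\n",
"    row.update(demographics.get(user_id, {}))  # keep the user even without a match\n",
"    # a missing count becomes 0, like IFNULL(..., 0)\n",
"    row['cnt_post_score'] = behavior.get(user_id, {}).get('cnt_post_score', 0)\n",
"    row['churned'] = label['churned']\n",
"    train.append(row)\n",
"\n",
"print(train)\n",
"```"
]
},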
{
"cell_type": "markdown",
"metadata": {
"id": "Co90TkTsCk9p"
},
"source": [
"## Training the propensity model with BigQuery ML"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "stwX9Np9CyWG"
},
"source": [
"In this section, using the training data you prepared, you will now train machine learning models in SQL using BigQuery ML. The remainder of the notebook will only use logistic regression, but you can also follow the optional code below to train other model types."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FQTRryQ9Fr_l"
},
"source": [
"**Choosing the model:**\n",
"As this is a binary classification task, for simplicity, you can start with [logistic regression](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create), but you can also train other classification models like [XGBoost](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree), [deep neural networks](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-dnn-models) and [AutoML Tables](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl) in BigQuery ML to calculate propensity scores. Each of these models will output a probability score (propensity) between 0 and 1.0 of how likely the model prediction is based on the training data. In this notebook, the model predicts whether the user will churn (1) or return (0) after 24 hours of the user's first engagement with the app.\n",
"\n",
"\n",
"|Model| model_type| Advantages | Disadvantages|\n",
"|-|-|-|-|\n",
"|**Logistic Regression**| `LOGISTIC_REG` ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create))| Fast to train vs. other model types | May not have the highest model performance |\n",
"|**XGBoost**| `BOOSTED_TREE_CLASSIFIER` ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree))| Higher model performance. Can inspect feature importance. | Slower to train vs. `LOGISTIC_REG`.|\n",
"|**Deep Neural Networks**| `DNN_CLASSIFIER` ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-dnn-models))| Higher model performance | Slower to train vs. `LOGISTIC_REG`.|\n",
"|**AutoML Tables**| `AUTOML_CLASSIFIER` ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl))| Very high model performance | May take at least a few hours to train, not easy to explain how the model works. |\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "W326AA24FyPF"
},
"source": [
"**There's no need to split your data into train/test:**\n",
"- When you run the `CREATE MODEL` statement, BigQuery ML will automatically split your data into training and test, so you can evaluate your model immediately after training (see the [documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create#data_split_method) for more information or how to specify the split manually).\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zfNMGKb_GAPM"
},
"source": [
"**Hyperparameter tuning:**\n",
"Note that you can also tune hyperparameters for each model, although it is beyond the scope of this notebook. See the [BigQuery ML documentation for CREATE MODEL](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create) for further details on the available hyperparameters."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "X5Ei0HB_HTAT"
},
"source": [
"**`TRANSFORM()`:** \n",
"It may also be useful to extract features from datetimes/timestamps as one simple example of additional feature preprocessing before training. For example, we can extract the month, day of year, and day of week from `user_first_engagement`. [`TRANSFORM()`](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create#transform) allows the model to remember the extracted values so you won't need to extract them again when making predictions using the model later on."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cu-xHARS90xN"
},
"source": [
"#### Train a logistic regression model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5A_PGKG5Ob8u"
},
"source": [
"The following code trains a logistic regression model. This should only take a minute or two to train.\n",
"\n",
"For more information on the default hyperparameters used, you can read the documentation: \n",
"[CREATE MODEL statement](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 31
},
"id": "1WRGHLIIC-RL",
"outputId": "32339cf6-5548-4239-e211-002d1c5743a7"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"CREATE OR REPLACE MODEL bqmlga4.churn_logreg\n",
"\n",
"OPTIONS(\n",
" MODEL_TYPE=\"LOGISTIC_REG\",\n",
" INPUT_LABEL_COLS=[\"churned\"]\n",
") AS\n",
"\n",
"SELECT\n",
" *\n",
"FROM\n",
" bqmlga4.train"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZBX-46-Q94tI"
},
"source": [
"#### Train an XGBoost model (optional)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uqewIJBoOnAZ"
},
"source": [
"The following code trains an XGBoost model. This may take several minutes to train.\n",
"\n",
"For more information on the default hyperparameters used, you can read the documentation: \n",
"[CREATE MODEL statement for Boosted Tree models using XGBoost](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-boosted-tree)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3XNDx5YoEFmY"
},
"outputs": [],
"source": [
"# %%bigquery --project $PROJECT_ID\n",
"\n",
"# CREATE OR REPLACE MODEL bqmlga4.churn_xgb\n",
"\n",
"# OPTIONS(\n",
"# MODEL_TYPE=\"BOOSTED_TREE_CLASSIFIER\",\n",
"# INPUT_LABEL_COLS=[\"churned\"]\n",
"# ) AS\n",
"\n",
"# SELECT\n",
"# * EXCEPT(user_pseudo_id)\n",
"# FROM\n",
"# bqmlga4.train"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0tSEDUcl98eE"
},
"source": [
"#### Train a deep neural network (DNN) model (optional)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FPpfxT6ZOslN"
},
"source": [
"The following code trains a deep neural network. This may take several minutes to train.\n",
"\n",
"For more information on the default hyperparameters used, you can read the documentation: \n",
"[CREATE MODEL statement for Deep Neural Network (DNN) models](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-dnn-models)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1o99jDldEGbT"
},
"outputs": [],
"source": [
"# %%bigquery --project $PROJECT_ID\n",
"\n",
"# CREATE OR REPLACE MODEL bqmlga4.churn_dnn\n",
"\n",
"# OPTIONS(\n",
"# MODEL_TYPE=\"DNN_CLASSIFIER\",\n",
"# INPUT_LABEL_COLS=[\"churned\"]\n",
"# ) AS\n",
"\n",
"# SELECT\n",
"# * EXCEPT(user_pseudo_id)\n",
"# FROM\n",
"# bqmlga4.train"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "X-DJU4-IkTsP"
},
"source": [
"### Train an AutoML Tables model (optional)\n",
"\n",
"[AutoML Tables](https://cloud.google.com/automl-tables) enables you to automatically build state-of-the-art machine learning models on structured data at massively increased speed and scale. AutoML Tables automatically searches through Google’s model zoo for structured data to find the best model for your needs, ranging from linear/logistic regression models for simpler datasets to advanced deep, ensemble, and architecture-search methods for larger, more complex ones.\n",
"\n",
"You can train an [AutoML model directly with BigQuery ML](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl), as in the code below.\n",
"\n",
"Note that the `BUDGET_HOURS` parameter is for AutoML Tables training, specified in hours. The default value is 1.0 hour and must be between 1.0 and 72.0. The total query processing time can be greater than the budgeted hours specified in the query.\n",
"\n",
"**Note:** This may take a few hours to train.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-tcfIVLZEHXU"
},
"outputs": [],
"source": [
"# %%bigquery --project $PROJECT_ID\n",
"\n",
"# CREATE OR REPLACE MODEL bqmlga4.churn_automl\n",
"\n",
"# OPTIONS(\n",
"# MODEL_TYPE=\"AUTOML_CLASSIFIER\",\n",
"# INPUT_LABEL_COLS=[\"churned\"],\n",
"# BUDGET_HOURS=1.0\n",
"# ) AS\n",
"\n",
"# SELECT\n",
"# * EXCEPT(user_pseudo_id)\n",
"# FROM\n",
"# bqmlga4.train"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "s_t0zxoeE1d7"
},
"source": [
"## Model Evaluation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b69YK1iX_ksq"
},
"source": [
"To evaluate the model, you can run [`ML.EVALUATE`](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-evaluate) on a model that has finished training to inspect some of the metrics.\n",
"\n",
"The metrics are based on the test sample data that was automatically split during model creation ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create#data_split_method))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 77
},
"id": "p16-00xjE3JQ",
"outputId": "cd9db9bf-2211-4152-e84e-5c299354663d"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"SELECT\n",
" *\n",
"FROM\n",
" ML.EVALUATE(MODEL bqmlga4.churn_logreg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ejdRfRYeATL-"
},
"source": [
"`ML.EVALUATE` generates the `precision`, `recall`, `accuracy` and `f1_score` using the default classification threshold of 0.5, which can be modified by using the optional [`THRESHOLD`](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-evaluate#eval_threshold) parameter.\n",
"\n",
"Generally speaking, you can use the `log_loss` and `roc_auc` metrics to compare model performance.\n",
"\n",
"The `log_loss` ranges between 0 and 1.0, and the closer the `log_loss` is the zero, the closer the predicted labels were to the actual labels.\n",
"The `roc_auc` ranges between 0 and 1.0, and the closer the `roc_auc` is to 1.0, the better the model is at distinguishing between the classes.\n",
"\n",
"For more information on these metrics, you can read through the definitions on [precision and recall](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall), [accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy), [f1-score](https://en.wikipedia.org/wiki/F-score), [log_loss](https://en.wikipedia.org/wiki/Loss_functions_for_classification#Logistic_loss) and [roc_auc](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)."
]
},
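{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition about `log_loss`, here is a minimal pure-Python sketch. It uses the standard binary cross-entropy formula with made-up labels and probabilities; it is not BigQuery ML's internal code. The value approaches 0 for confident correct predictions and grows without bound for confident wrong ones.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def log_loss(y_true, p_pred):\n",
"    # Mean negative log-likelihood of the true labels under the predicted probabilities.\n",
"    total = 0.0\n",
"    for y, p in zip(y_true, p_pred):\n",
"        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))\n",
"    return total / len(y_true)\n",
"\n",
"y = [1, 0, 1, 0]\n",
"print(log_loss(y, [0.9, 0.1, 0.8, 0.2]))      # confident and correct: small loss\n",
"print(log_loss(y, [0.5, 0.5, 0.5, 0.5]))      # uninformative: equals ln(2)\n",
"print(log_loss(y, [0.01, 0.99, 0.01, 0.99]))  # confident and wrong: greater than 1\n",
"```"
]
},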
{
"cell_type": "markdown",
"metadata": {
"id": "glkKlvoqP7Uf"
},
"source": [
"#### Confusion matrix: predicted vs actual values"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AYB1xZ9oQDOv"
},
"source": [
"In addition to model evaluation metrics, you may also want to use a confusion matrix to inspect how well the model predicted the labels, compared to the actual labels.\n",
"\n",
"With the rows indicating the actual labels, and the columns as the predicted labels, the resulting format for ML.CONFUSION_MATRIX for binary classification looks like:\n",
"\n",
"| | Predicted_0 | Predicted_1|\n",
"|-|-|-|\n",
"|Actual_0| True Negatives | False Positives|\n",
"|Actual_1| False Negatives | True Positives|\n",
"\n",
"For more information on confusion matrices, you can read through a detailed explanation [here](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 106
},
"id": "Db5M8U8QQgyi",
"outputId": "2ea7e3c1-e411-43aa-e32c-6a29a2175c14"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"SELECT\n",
" expected_label,\n",
" _0 AS predicted_0,\n",
" _1 AS predicted_1\n",
"FROM\n",
" ML.CONFUSION_MATRIX(MODEL bqmlga4.churn_logreg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ix5q8Onw1aYs"
},
"source": [
"#### ROC Curve"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XleHv73dIZx-"
},
"source": [
"You can plot the AUC-ROC curve by using `ML.ROC_CURVE` to return the metrics for different threshold values for the model ([documentation](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-roc))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "p_iG9vOfbgX6"
},
"outputs": [],
"source": [
"%%bigquery df_roc --project $PROJECT_ID\n",
"SELECT * FROM ML.ROC_CURVE(MODEL bqmlga4.churn_logreg)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 422
},
"id": "GoHGUcC7bzvX",
"outputId": "c809aec7-5153-4a8b-bae8-14103c996c61"
},
"outputs": [],
"source": [
"df_roc"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "s_zu9zZXJVNG"
},
"source": [
"Plot the AUC-ROC curve"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 313
},
"id": "jSRTXr6ub2ty",
"outputId": "46f6cf64-9108-4c8d-d148-ecbd2036ed2d"
},
"outputs": [],
"source": [
"df_roc.plot(x=\"false_positive_rate\", y=\"recall\", title=\"AUC-ROC curve\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oyeAzheVFUxU"
},
"source": [
"## Model prediction"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "02ZQPKIebo7y"
},
"source": [
"You can run [`ML.PREDICT`](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-predict) to make predictions on the propensity to churn. The following code returns all the information from `ML.PREDICT`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 590
},
"id": "229TPkhUFe23",
"outputId": "81495cbc-754a-4845-8e74-cd9ef0a49ede"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"SELECT\n",
" *\n",
"FROM\n",
" ML.PREDICT(MODEL bqmlga4.churn_logreg,\n",
" (SELECT * FROM bqmlga4.train)) #can be replaced with a test dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4WinvkfYdE1q"
},
"source": [
"For propensity modeling, the most important output is the probability of a behavior occuring. The following query returns the probability that the user will return after 24 hrs. The higher the probability and closer it is to 1, the more likely the user is predicted to churn, and the closer it is to 0, the more likely the user is predicted to return."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 422
},
"id": "92eOP8Sw7zO_",
"outputId": "a98b4dd5-39d4-43ff-ed39-7d1cdbb9a33f"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"SELECT\n",
" user_pseudo_id,\n",
" churned,\n",
" predicted_churned,\n",
" predicted_churned_probs[OFFSET(0)].prob as probability_churned\n",
" \n",
"FROM\n",
" ML.PREDICT(MODEL bqmlga4.churn_logreg,\n",
" (SELECT * FROM bqmlga4.train)) #can be replaced with a proper test dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TCGAfP9DKLDH"
},
"source": [
"### Exporting the predictions out of Bigquery"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EMETxq7CKtTQ"
},
"source": [
"##### Reading the predictions directly from BigQuery"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6_BUvnJvKQQP"
},
"source": [
"With the predictions from `ML.PREDICT`, you can export the data into a Pandas dataframe using the BigQuery Storage API (see [documentation and code samples](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas#download_table_data_using_the_client_library)). You can also use other [BigQuery client libraries](https://cloud.google.com/bigquery/docs/reference/libraries).\n",
"\n",
"Alternatively you can also export directly into pandas in a notebook using the %%bigquery <variable name> as in:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0qB2DsQkKgEH"
},
"outputs": [],
"source": [
"%%bigquery df --project $PROJECT_ID\n",
"\n",
"SELECT\n",
" user_pseudo_id,\n",
" churned,\n",
" predicted_churned,\n",
" predicted_churned_probs[OFFSET(0)].prob as probability_churned\n",
" \n",
"FROM\n",
" ML.PREDICT(MODEL bqmlga4.churn_logreg,\n",
" (SELECT * FROM bqmlga4.train)) #can be replaced with a proper test dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-yJAFuUOKl1i"
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wzEsO_oPK2vv"
},
"source": [
"##### Export predictions table to Google Cloud Storage"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xxgKXizcK5Cy"
},
"source": [
"There are several ways to export the predictions table to Google Cloud Storage (GCS), so that you can use them in a separate service. Perhaps the easiest way is to export directly to GCS using SQL ([documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements#export_data_statement))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5iah7a0KK9lv"
},
"outputs": [],
"source": [
"%%bigquery --project $PROJECT_ID\n",
"\n",
"EXPORT DATA OPTIONS (\n",
"uri=\"gs://mybucket/myfile/churnpredictions.csv\", \n",
" format=CSV\n",
") AS \n",
"SELECT\n",
" user_pseudo_id,\n",
" churned,\n",
" predicted_churned,\n",
" predicted_churned_probs[OFFSET(0)].prob as probability_churned\n",
"FROM\n",
" ML.PREDICT(MODEL bqmlga4.churn_logreg,\n",
" (SELECT * FROM bqmlga4.train)) #can be replaced with a proper test dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kDdj9jDZFl5z"
},
"source": [
"## Activate on model predictions"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WfWm7WFdmMN7"
},
"source": [
"Once you have the model predictions, there are different steps you can take based on your business objective.\n",
"\n",
"In our analysis, we used `user_pseudo_id` as the user identifier. However, ideally, your app should send back the `user_id` from your app to Google Analytics. This will help you to:\n",
"\n",
"* join any first-party data you have for model predictions\n",
"* joins the model predictions with your first-party data\n",
"\n",
"Once you have this join capability, you can:\n",
"\n",
"* Export the model predictions back into Google Analytics as user attribute. This can be done using [Data Import feature](https://support.google.com/analytics/answer/10071301) in Google Analytics 4.\n",
" * Based on the prediction values you can [Create and edit audiences](https://support.google.com/analytics/answer/2611404) and also do [Audience targeting](https://support.google.com/optimize/answer/6283435). For example, an audience can be users with prediction probability between 0.4 and 0.7, to represent users who are predicted to be \"on the fence\" between churning and returning.\n",
"* Adjust the user experience for targeted users within your app. For Firebase Apps, you can use the [Import segmentments](https://firebase.google.com/docs/projects/import-segments) feature. You can tailor user experience by targeting your identified users through Firebase services such as Remote Config, Cloud Messaging, and In-App Messaging. This will involve importing the segment data from BigQuery into Firebase. After that you can send notifications to the users, configure the app for them, or follow the user journeys across devices.\n",
"* Run targeted marketing campaigns via CRMs like Salesforce, e.g. send out reminder emails.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ODjAEK2cmf9S"
},
"source": [
"## Further resources: \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "u-K-dCfJKSpi"
},
"source": [
"As you collect more data from your users, you may want to regularly evaluate your model on fresh data and re-train the model if you notice that the model quality is decaying.\n",
"\n",
"Continuous evaluation—the process of ensuring a production machine learning model is still performing well on new data—is an essential part in any ML workflow. Performing continuous evaluation can help you catch model drift, a phenomenon that occurs when the data used to train your model no longer reflects the current environment. \n",
"\n",
"To learn more about how to do continous model evaluation and re-train models, you can read the blogpost: [Continuous model evaluation with BigQuery ML, Stored Procedures, and Cloud Scheduler](https://cloud.google.com/blog/topics/developers-practitioners/continuous-model-evaluation-bigquery-ml-stored-procedures-and-cloud-scheduler)"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "Pattern - Propensity modeling (churn) BigQuery ML using GA4 data.ipynb",
"provenance": []
},
"environment": {
"name": "tf2-gpu.2-3.m65",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-3:m65"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
================================================
FILE: retail/clustering/bqml/README.md
================================================
A common marketing analytics challenge is to understand consumer behavior and develop customer attributes or archetypes. As organizations get better at tackling this problem, they can activate marketing strategies to incorporate additional customer knowledge into their campaigns. Building customer profiles is now easier than ever with BigQuery ML. In this notebook, you’ll learn how to create customer segments and how to use these segments as audiences for marketing activation.
The notebook can be found [here](bqml_scaled_clustering.ipynb).
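
As a rough preview of the kind of statement involved, the sketch below builds a BigQuery ML `CREATE MODEL` statement for a k-means model. It is a minimal illustration, not the notebook's exact code: the helper `build_kmeans_model_sql`, the project ID `my-project`, and the model name `kmeans_demo` are hypothetical, while the dataset `bqml_kmeans`, the view `Final_View`, and the identifier columns are assumed to match the ones the notebook creates.

```python
def build_kmeans_model_sql(project_id, dataset_id, model_name, source_view, num_clusters=3):
    """Return a BigQuery ML CREATE MODEL statement for a k-means model.

    Hypothetical helper for illustration only; it just formats a SQL string.
    """
    return f"""
CREATE OR REPLACE MODEL `{project_id}.{dataset_id}.{model_name}`
OPTIONS (
  model_type = 'kmeans',
  num_clusters = {int(num_clusters)},
  standardize_features = TRUE
) AS
SELECT * EXCEPT (fullVisitorID, Hashed_fullVisitorID)  -- identifiers are not features
FROM `{project_id}.{dataset_id}.{source_view}`
""".strip()


# Assumed names: replace with your own project, dataset, model, and view.
sql = build_kmeans_model_sql(
    "my-project", "bqml_kmeans", "kmeans_demo", "Final_View", num_clusters=4
)
print(sql)

# To run the statement (requires credentials and google-cloud-bigquery):
# from google.cloud import bigquery
# bigquery.Client(project="my-project").query(sql).result()
```

The number of clusters is a modeling choice that you would normally revisit with business owners after inspecting the resulting segments.
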
## Questions? Feedback?
If you have any questions or feedback, please open up a [new issue](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/issues).
================================================
FILE: retail/clustering/bqml/bqml_scaled_clustering.ipynb
================================================
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "ur8xi4C7S06n"
},
"outputs": [],
"source": [
"# Copyright 2020 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "eHLV0D7Y5jtU"
},
"source": [
"<table align=\"left\">\n",
" <td>\n",
" <a href=\"https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/blob/master/retail/clustering/bqml/bqml_scaled_clustering.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
" View on GitHub\n",
" </a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "tvgnzT1CKxrO"
},
"source": [
"# How to build k-means clustering models for market segmentation using BigQuery ML\n",
"\n",
"A common marketing analytics challenge is to understand consumer behavior and develop customer attributes or archetypes. As organizations get better at tackling this problem, they can activate marketing strategies to incorporate additional customer knowledge into their campaigns. \n",
"\n",
"Clustering algorithms are a common vehicle to address this challenge. They allow businesses to better segment and understand their customers and users. In the field of Machine Learning, which is a combination of both art and science, unsupervised learning may require more art compared to supervised learning algorithms. By definition, unsupervised learning has no single metric to guide the algorithm's learning process. Instead, the data science team will need to work hand in hand with business owners to determine feature selection, optimal number of clusters (the number of clusters is often abbreviated as k), and most importantly, to gain a deeper understanding of what each cluster represents. \n",
"\n",
"### How can clustering algorithms help businesses succeed?\n",
"\n",
"Clustering algorithms can help companies identify groups of similar customers that can be used for targeting in advertising campaigns. This is paramount as we are breathing a prediction era where customers expect personalization from brands. \n",
" \n",
"Using a public sample Google Analytics 360 e-commerce dataset on BigQuery, you will learn how to create and deploy clustering algorithms in production. You will also get an example of how to navigate unsupervised learning. Keep in mind, your clusters will be even more meaningful when you bring additional data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "6XbaFW4SQxGD"
},
"source": [
"# Objective\n",
"\n",
"By the end of this notebook, you will know how to:\n",
"* Explore features to understand what might be interesting for a clustering model\n",
"* Pre-process data into the correct format needed to create a clustering model using BigQuery ML\n",
"* Train (and deploy) the k-means model in BigQuery ML\n",
"* Evaluate the model\n",
"* Make predictions using the model\n",
"* Write the results to be used for batch prediction, for example, to send ads based on segmentation\n",
"\n",
"## Dataset\n",
"\n",
"The [Google Analytics Sample](https://console.cloud.google.com/marketplace/details/obfuscated-ga360-data/obfuscated-ga360-data?filter=solution-type:dataset) dataset, which is hosted publicly on BigQuery, is a dataset that provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the [Google Merchandise Store](https://www.googlemerchandisestore.com/), a real e-commerce store that sells Google-branded merchandise.\n",
"\n",
"\n",
"## Costs \n",
"\n",
"This tutorial uses billable components of Google Cloud Platform:\n",
"\n",
"* BigQuery\n",
"* BigQuery ML\n",
"\n",
"Learn about [BigQuery pricing](https://cloud.google.com/bigquery/pricing), [BigQuery ML\n",
"pricing](https://cloud.google.com/bigquery-ml/pricing) and use the [Pricing\n",
"Calculator](https://cloud.google.com/products/calculator/)\n",
"to generate a cost estimate based on your projected usage."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "i7EUnXsZhAGF"
},
"source": [
"## PIP install packages and dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install google-cloud-bigquery\n",
"!pip install google-cloud-bigquery-storage\n",
"!pip install pandas-gbq\n",
"\n",
"# Reservation package needed to setup flex slots for flat-rate pricing\n",
"!pip install google-cloud-bigquery-reservation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Automatically restart kernel after installs\n",
"import IPython\n",
"app = IPython.Application.instance()\n",
"app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "BF1j6f9HApxa"
},
"source": [
"### Set up your Google Cloud Platform project\n",
"\n",
"_The following steps are required, regardless of your notebook environment._\n",
"\n",
"1. [Select or create a project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
"\n",
"1. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
"\n",
"1. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n",
"\n",
"1. Enter your project ID and region in the cell below. Then run the cell to make sure the\n",
"Cloud SDK uses the right project for all the commands in this notebook.\n",
"\n",
"_Note_: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "W8SLpZgiyMSP"
},
"source": [
"### Set project ID and authenticate\n",
"\n",
"Update your Project ID below. The rest of the notebook will run using these credentials. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PROJECT_ID = \"UPDATE TO YOUR PROJECT ID\" \n",
"REGION = 'US'\n",
"DATA_SET_ID = 'bqml_kmeans' # Ensure you first create a data set in BigQuery\n",
"!gcloud config set project $PROJECT_ID\n",
"# If you have not built the Data Set, the following command will build it for you\n",
"# !bq mk --location=$REGION --dataset $PROJECT_ID:$DATA_SET_ID "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "XoEqT2Y4DJmf"
},
"source": [
"### Import libraries and define constants"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 392
},
"colab_type": "code",
"id": "bbnrwv-nyi82",
"outputId": "d9b05979-e17b-411a-d910-8c91e8755501"
},
"outputs": [],
"source": [
"from google.cloud import bigquery\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pandas_gbq\n",
"import matplotlib.pyplot as plt\n",
"\n",
"pd.set_option('display.float_format', lambda x: '%.3f' % x) # used to display float format\n",
"client = bigquery.Client(project=PROJECT_ID)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Ks4xTsQJk1Ia"
},
"source": [
"# Data exploration and preparation\n",
"\n",
"Prior to building your models, you are typically expected to invest a significant amount of time cleaning, exploring, and aggregating your dataset in a meaningful way for modeling. For the purpose of this demo, we aren't showing this step only to prioritize showcasing clustering with k-means in BigQuery ML. "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "z3NG3XuNYKnO"
},
"source": [
"## Building synthetic data\n",
"\n",
"Our goal is to use both online (GA360) and offline (CRM) data. You can use your own CRM data, however, in this case since we don't have CRM data to showcase, we will instead generate synthetic data. We will generate estimated House Hold Income, and Gender. To do so, we will hash fullVisitorID and build simple rules based on the last digit of the hash. When you run this process with your own data, you can join CRM data with several dimensions, but this is just an example of what is possible. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We start with GA360 data, and will eventually build synthetic CRM as an example. \n",
"# This block is the first step, just working with GA360\n",
"\n",
"ga360_only_view = 'GA360_View'\n",
"shared_dataset_ref = client.dataset(DATA_SET_ID)\n",
"ga360_view_ref = shared_dataset_ref.table(ga360_only_view)\n",
"ga360_view = bigquery.Table(ga360_view_ref)\n",
"\n",
"ga360_query = '''\n",
"SELECT\n",
" fullVisitorID,\n",
" ABS(farm_fingerprint(fullVisitorID)) AS Hashed_fullVisitorID, # This will be used to generate random data.\n",
" MAX(device.operatingSystem) AS OS, # We can aggregate this because an OS is tied to a fullVisitorID.\n",
" SUM (CASE\n",
" WHEN REGEXP_EXTRACT (v2ProductCategory, \n",
" r'^(?:(?:.*?)Home/)(.*?)/') \n",
" = 'Apparel' THEN 1 ELSE 0 END) AS Apparel,\n",
" SUM (CASE \n",
" WHEN REGEXP_EXTRACT (v2ProductCategory, \n",
" r'^(?:(?:.*?)Home/)(.*?)/') \n",
" = 'Office' THEN 1 ELSE 0 END) AS Office,\n",
" SUM (CASE\n",
" WHEN REGEXP_EXTRACT (v2ProductCategory, \n",
" r'^(?:(?:.*?)Home/)(.*?)/') \n",
" = 'Electronics' THEN 1 ELSE 0 END) AS Electronics,\n",
" SUM (CASE\n",
" WHEN REGEXP_EXTRACT (v2ProductCategory, \n",
" r'^(?:(?:.*?)Home/)(.*?)/') \n",
" = 'Limited Supply' THEN 1 ELSE 0 END) AS LimitedSupply,\n",
" SUM (CASE\n",
" WHEN REGEXP_EXTRACT (v2ProductCategory, \n",
" r'^(?:(?:.*?)Home/)(.*?)/') \n",
" = 'Accessories' THEN 1 ELSE 0 END) AS Accessories,\n",
" SUM (CASE\n",
" WHEN REGEXP_EXTRACT (v2ProductCategory, \n",
" r'^(?:(?:.*?)Home/)(.*?)/') \n",
" = 'Shop by Brand' THEN 1 ELSE 0 END) AS ShopByBrand,\n",
" SUM (CASE\n",
" WHEN REGEXP_EXTRACT (v2ProductCategory, \n",
" r'^(?:(?:.*?)Home/)(.*?)/') \n",
" = 'Bags' THEN 1 ELSE 0 END) AS Bags,\n",
" ROUND (SUM (productPrice/1000000),2) AS productPrice_USD\n",
"FROM\n",
" `bigquery-public-data.google_analytics_sample.ga_sessions_*`,\n",
" UNNEST(hits) AS hits,\n",
" UNNEST(hits.product) AS hits_product\n",
"WHERE\n",
" _TABLE_SUFFIX BETWEEN '20160801'\n",
" AND '20160831'\n",
" AND geoNetwork.country = 'United States'\n",
" AND type = 'EVENT'\n",
"GROUP BY\n",
" 1,\n",
" 2\n",
"'''\n",
"\n",
"\n",
"ga360_view.view_query = ga360_query.format(PROJECT_ID)\n",
"ga360_view = client.create_table(ga360_view) # API request\n",
"\n",
"print(f\"Successfully created view at {ga360_view.full_table_id}\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>fullVisitorID</th>\n",
" <th>Hashed_fullVisitorID</th>\n",
" <th>OS</th>\n",
" <th>Apparel</th>\n",
" <th>Office</th>\n",
" <th>Electronics</th>\n",
" <th>LimitedSupply</th>\n",
" <th>Accessories</th>\n",
" <th>ShopByBrand</th>\n",
" <th>Bags</th>\n",
" <th>productPrice_USD</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0793220752145578759</td>\n",
" <td>4074807331962730552</td>\n",
" <td>Linux</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>148.960</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>6626084732166116798</td>\n",
" <td>9209336555480734198</td>\n",
" <td>Windows</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>28.990</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6402635554390648387</td>\n",
" <td>6330846949202373940</td>\n",
" <td>Windows</td>\n",
" <td>102</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2247.980</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1774577907793414721</td>\n",
" <td>6826645565243937471</td>\n",
" <td>Chrome OS</td>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>289.910</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>9875610913644487984</td>\n",
" <td>8099941684224314656</td>\n",
" <td>iOS</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>24.990</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" fullVisitorID Hashed_fullVisitorID OS Apparel Office \\\n",
"0 0793220752145578759 4074807331962730552 Linux 4 0 \n",
"1 6626084732166116798 9209336555480734198 Windows 1 0 \n",
"2 6402635554390648387 6330846949202373940 Windows 102 0 \n",
"3 1774577907793414721 6826645565243937471 Chrome OS 7 0 \n",
"4 9875610913644487984 8099941684224314656 iOS 0 0 \n",
"\n",
" Electronics LimitedSupply Accessories ShopByBrand Bags \\\n",
"0 0 0 0 0 0 \n",
"1 0 0 0 0 0 \n",
"2 0 0 0 0 0 \n",
"3 0 1 0 0 0 \n",
"4 0 0 1 0 0 \n",
"\n",
" productPrice_USD \n",
"0 148.960 \n",
"1 28.990 \n",
"2 2247.980 \n",
"3 289.910 \n",
"4 24.990 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show a sample of GA360 data\n",
"\n",
"ga360_query_df = f'''\n",
"SELECT * FROM {ga360_view.full_table_id.replace(\":\", \".\")} LIMIT 5\n",
"'''\n",
"\n",
"job_config = bigquery.QueryJobConfig()\n",
"\n",
"# Start the query\n",
"query_job = client.query(ga360_query_df, job_config=job_config) #API Request\n",
"df_ga360 = query_job.result()\n",
"df_ga360 = df_ga360.to_dataframe()\n",
"\n",
"df_ga360"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create synthetic CRM data in SQL\n",
"\n",
"CRM_only_view = 'CRM_View'\n",
"shared_dataset_ref = client.dataset(DATA_SET_ID)\n",
"CRM_view_ref = shared_dataset_ref.table(CRM_only_view)\n",
"CRM_view = bigquery.Table(CRM_view_ref)\n",
"\n",
"# Query below works by hashing the fullVisitorID, which creates a random distribution. \n",
"# We use modulo to artificially split gender and hhi distribution.\n",
"CRM_query = '''\n",
"SELECT\n",
" fullVisitorID,\n",
"IF\n",
" (MOD(Hashed_fullVisitorID,2) = 0,\n",
" \"M\",\n",
" \"F\") AS gender,\n",
" CASE\n",
" WHEN MOD(Hashed_fullVisitorID,10) = 0 THEN 55000\n",
" WHEN MOD(Hashed_fullVisitorID,10) < 3 THEN 65000\n",
" WHEN MOD(Hashed_fullVisitorID,10) < 7 THEN 75000\n",
" WHEN MOD(Hashed_fullVisitorID,10) < 9 THEN 85000\n",
" WHEN MOD(Hashed_fullVisitorID,10) = 9 THEN 95000\n",
" ELSE\n",
" Hashed_fullVisitorID\n",
"END\n",
" AS hhi\n",
"FROM (\n",
" SELECT\n",
" fullVisitorID,\n",
" ABS(farm_fingerprint(fullVisitorID)) AS Hashed_fullVisitorID,\n",
" FROM\n",
" `bigquery-public-data.google_analytics_sample.ga_sessions_*`,\n",
" UNNEST(hits) AS hits,\n",
" UNNEST(hits.product) AS hits_product\n",
" WHERE\n",
" _TABLE_SUFFIX BETWEEN '20160801'\n",
" AND '20160831'\n",
" AND geoNetwork.country = 'United States'\n",
" AND type = 'EVENT'\n",
" GROUP BY\n",
" 1,\n",
" 2)\n",
"'''\n",
"\n",
"CRM_view.view_query = CRM_query.format(PROJECT_ID)\n",
"CRM_view = client.create_table(CRM_view) # API request\n",
"\n",
"print(f\"Successfully created view at {CRM_view.full_table_id}\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>fullVisitorID</th>\n",
" <th>gender</th>\n",
" <th>hhi</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>297008845417084558</td>\n",
" <td>F</td>\n",
" <td>85000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>8780554431432234301</td>\n",
" <td>F</td>\n",
" <td>65000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3912300160509220549</td>\n",
" <td>M</td>\n",
" <td>85000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0183860411504195373</td>\n",
" <td>M</td>\n",
" <td>55000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5824687589795910572</td>\n",
" <td>M</td>\n",
" <td>75000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" fullVisitorID gender hhi\n",
"0 297008845417084558 F 85000\n",
"1 8780554431432234301 F 65000\n",
"2 3912300160509220549 M 85000\n",
"3 0183860411504195373 M 55000\n",
"4 5824687589795910572 M 75000"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# See an output of the synthetic CRM data\n",
"\n",
"CRM_query_df = f'''\n",
"SELECT * FROM {CRM_view.full_table_id.replace(\":\", \".\")} LIMIT 5\n",
"'''\n",
"\n",
"job_config = bigquery.QueryJobConfig()\n",
"\n",
"# Start the query\n",
"query_job = client.query(CRM_query_df, job_config=job_config) #API Request\n",
"df_CRM = query_job.result()\n",
"df_CRM = df_CRM.to_dataframe()\n",
"\n",
"df_CRM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build a final view for to use as trainding data for clustering\n",
"\n",
"You may decide to change the view below based on your specific dataset. This is fine, and is exactly why we're creating a view. All steps subsequent to this will reference this view. If you change the SQL below, you won't need to modify other parts of the notebook. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build a final view, which joins GA360 data with CRM data\n",
"\n",
"final_data_view = 'Final_View'\n",
"shared_dataset_ref = client.dataset(DATA_SET_ID)\n",
"final_view_ref = shared_dataset_ref.table(final_data_view)\n",
"final_view = bigquery.Table(final_view_ref)\n",
"\n",
"final_data_query = f'''\n",
"SELECT\n",
" g.*,\n",
" c.* EXCEPT(fullVisitorId)\n",
"FROM {ga360_view.full_table_id.replace(\":\", \".\")} g\n",
"JOIN {CRM_view.full_table_id.replace(\":\", \".\")} c\n",
"ON g.fullVisitorId = c.fullVisitorId\n",
"'''\n",
"\n",
"final_view.view_query = final_data_query.format(PROJECT_ID)\n",
"final_view = client.create_table(final_view) # API request\n",
"\n",
"print(f\"Successfully created view at {final_view.full_table_id}\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>fullVisitorID</th>\n",
" <th>Hashed_fullVisitorID</th>\n",
" <th>OS</th>\n",
" <th>Apparel</th>\n",
" <th>Office</th>\n",
" <th>Electronics</th>\n",
" <th>LimitedSupply</th>\n",
" <th>Accessories</th>\n",
" <th>ShopByBrand</th>\n",
" <th>Bags</th>\n",
" <th>productPrice_USD</th>\n",
" <th>gender</th>\n",
" <th>hhi</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0314742824007569667</td>\n",
" <td>6946654416482415939</td>\n",
" <td>Android</td>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>106.940</td>\n",
" <td>F</td>\n",
" <td>95000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2242360300500476735</td>\n",
" <td>7127521692467899408</td>\n",
" <td>Macintosh</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1.990</td>\n",
" <td>M</td>\n",
" <td>85000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>6667898188086359119</td>\n",
" <td>928068574965520919</td>\n",
" <td>iOS</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4.990</td>\n",
" <td>F</td>\n",
" <td>95000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4757349846427056292</td>\n",
" <td>4102516958220717880</td>\n",
" <td>Chrome OS</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1.500</td>\n",
" <td>M</td>\n",
" <td>55000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>7253626747991065846</td>\n",
" <td>2270834522148945194</td>\n",
" <td>Android</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>92.990</td>\n",
" <td>M</td>\n",
" <td>75000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" fullVisitorID Hashed_fullVisitorID OS Apparel Office \\\n",
"0 0314742824007569667 6946654416482415939 Android 6 0 \n",
"1 2242360300500476735 7127521692467899408 Macintosh 0 0 \n",
"2 6667898188086359119 928068574965520919 iOS 0 0 \n",
"3 4757349846427056292 4102516958220717880 Chrome OS 0 0 \n",
"4 7253626747991065846 2270834522148945194 Android 0 0 \n",
"\n",
" Electronics LimitedSupply Accessories ShopByBrand Bags \\\n",
"0 0 0 0 0 0 \n",
"1 0 0 1 0 0 \n",
"2 0 0 0 0 0 \n",
"3 1 0 0 0 0 \n",
"4 0 0 0 0 0 \n",
"\n",
" productPrice_USD gender hhi \n",
"0 106.940 F 95000 \n",
"1 1.990 M 85000 \n",
"2 4.990 F 95000 \n",
"3 1.500 M 55000 \n",
"4 92.990 M 75000 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Show final data used prior to modeling\n",
"\n",
"sql_demo = f'''\n",
"SELECT * FROM {final_view.full_table_id.replace(\":\", \".\")} LIMIT 5\n",
"'''\n",
"\n",
"job_config = bigquery.QueryJobConfig()\n",
"\n",
"# Start the query\n",
"query_job = client.query(sql_demo, job_config=job_config) #API Request\n",
"df_demo = query_job.result()\n",
"df_demo = df_demo.to_dataframe()\n",
"\n",
"df_demo"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "wpGc5AggVOBN"
},
"source": [
"# Create our initial model\n",
"\n",
"In this section, we will build our initial k-means model. We won't focus on optimal k or other hyperparemeters just yet.\n",
"\n",
"Some additional points: \n",
"\n",
"1. We remove fullVisitorId as an input, even though it is grouped at that level because we don't need fullVisitorID as a feature for clustering. fullVisitorID should never be used as feature.\n",
"2. We have both categorical as well as numerical features\n",
"3. We do not have to normalize any numerical features, as BigQuery ML will automatically do this for us. "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "DyOB3FePfiIe"
},
"source": [
"## Build a function to build our model\n",
"\n",
"We will build a simple python function to build our model, rather than doing everything in SQL. This approach means we can asynchronously start several models and let BQ run in parallel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def makeModel (n_Clusters, Model_Name):\n",
" sql =f'''\n",
" CREATE OR REPLACE MODEL `{PROJECT_ID}.{DATA_SET_ID}.{Model_Name}` \n",
" OPTIONS(model_type='kmeans',\n",
" kmeans_init_method = 'KMEANS++',\n",
" num_clusters={n_Clusters}) AS\n",
"\n",
" SELECT * except(fullVisitorID, Hashed_fullVisitorID) FROM `{final_view.full_table_id.replace(\":\", \".\")}`\n",
" '''\n",
" job_config = bigquery.QueryJobConfig()\n",
" client.query(sql, job_config=job_config) # Make an API request."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "vDUIzlcYyMOt"
},
"outputs": [],
"source": [
"# Let's start with a simple test to ensure everything works. \n",
"# After running makeModel(), allow a few minutes for training to complete.\n",
"\n",
"model_test_name = \"test\"\n",
"makeModel(3, model_test_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# After training is completed, you can either check in the UI, or you can interact with it using list_models(). \n",
"\n",
"for model in client.list_models(DATA_SET_ID):\n",
" print(model)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Y2mDX1-WZg8k"
},
"source": [
"# Work towards creating a better model\n",
"\n",
"In this section, we want to determine the proper k value. Determining the right value of k depends completely on the use case. There are straight forward examples that will simply tell you how many clusters are needed. Suppose you are pre-processing hand written digits - this tells us k should be 10. Or perhaps your business stakeholder only wants to deliver three different marketing campaigns and needs you to identify three clusters of customers, then setting k=3 might be meaningful. However, the use case is sometimes more open ended and you may want to explore different numbers of clusters to see how your datapoints group together with the minimal error within each cluster. To accomplish this process, we start by performing the 'Elbow Method', which simply charts loss vs k. Then, we'll also use the Davies-Bouldin score.\n",
"(https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ajkLrlFjlvPQ"
},
"source": [
"Below we are going to create several models to perform both the Elbow Method and get the Davies-Bouldin score. You may change parameters like low_k and high_k. Our process will create models between these two values. There is an additional parameter called model_prefix_name. We recommend you leave this as its current value. It is used to generate a naming convention for our models. "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 238
},
"colab_type": "code",
"id": "WAuyizlkzzQU",
"outputId": "68fb1c75-045b-4e8a-a7c3-2979ee4b3ed2"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model started: kmeans_clusters_3\n",
"Model started: kmeans_clusters_4\n",
"Model started: kmeans_clusters_5\n",
"Model started: kmeans_clusters_6\n",
"Model started: kmeans_clusters_7\n",
"Model started: kmeans_clusters_8\n",
"Model started: kmeans_clusters_9\n",
"Model started: kmeans_clusters_10\n",
"Model started: kmeans_clusters_11\n",
"Model started: kmeans_clusters_12\n",
"Model started: kmeans_clusters_13\n",
"Model started: kmeans_clusters_14\n",
"Model started: kmeans_clusters_15\n"
]
}
],
"source": [
"# Define upper and lower bound for k, then build individual models for each. \n",
"# After running this loop, look at the UI to see several model objects that exist. \n",
"\n",
"low_k = 3\n",
"high_k = 15\n",
"model_prefix_name = 'kmeans_clusters_'\n",
"\n",
"lst = list(range (low_k, high_k+1)) #build list to iterate through k values\n",
"\n",
"for k in lst:\n",
" model_name = model_prefix_name + str(k)\n",
" makeModel(k, model_name)\n",
" print(f\"Model started: {model_name}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "OVxjQFWmVVmH"
},
"source": [
"## Select optimal k"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 255
},
"colab_type": "code",
"id": "tp7y6mksNY4D",
"outputId": "07fbbd01-2387-46ad-b365-62f34d471a62"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bqml_kmeans.kmeans_clusters_10\n",
"bqml_kmeans.kmeans_clusters_11\n",
"bqml_kmeans.kmeans_clusters_12\n",
"bqml_kmeans.kmeans_clusters_13\n",
"bqml_kmeans.kmeans_clusters_14\n",
"bqml_kmeans.kmeans_clusters_15\n",
"bqml_kmeans.kmeans_clusters_3\n",
"bqml_kmeans.kmeans_clusters_4\n",
"bqml_kmeans.kmeans_clusters_5\n",
"bqml_kmeans.kmeans_clusters_6\n",
"bqml_kmeans.kmeans_clusters_7\n",
"bqml_kmeans.kmeans_clusters_8\n",
"bqml_kmeans.kmeans_clusters_9\n",
"bqml_kmeans.test\n"
]
}
],
"source": [
"# list all current models\n",
"models = client.list_models(DATA_SET_ID) # Make an API request.\n",
"print(\"Listing current models:\")\n",
"for model in models:\n",
" full_model_id = f\"{model.dataset_id}.{model.model_id}\"\n",
" print(full_model_id)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "AC8GAkKxhN9B",
"outputId": "f2d30fb1-8fd8-40e2-9c71-0597231d6e1a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Deleted model 'bqml_kmeans.test'.\n"
]
}
],
"source": [
"# Remove our sample model from BigQuery, so we only have remaining models from our previous loop\n",
"\n",
"model_id = DATA_SET_ID+\".\"+model_test_name\n",
"client.delete_model(model_id) # Make an API request.\n",
"print(f\"Deleted model '{model_id}'\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "SnGjnnrvH2Ej"
},
"outputs": [],
"source": [
"# This will create a dataframe with each model name, the Davies Bouldin Index, and Loss. \n",
"# It will be used for the elbow method and to help determine optimal K\n",
"\n",
"df = pd.DataFrame(columns=['davies_bouldin_index', 'mean_squared_distance'])\n",
"models = client.list_models(DATA_SET_ID) # Make an API request.\n",
"for model in models:\n",
" full_model_id = f\"{model.dataset_id}.{model.model_id}\"\n",
" sql =f'''\n",
" SELECT \n",
" davies_bouldin_index,\n",
" mean_squared_distance \n",
" FROM ML.EVALUATE(MODEL `{full_model_id}`)\n",
" '''\n",
"\n",
" job_config = bigquery.QueryJobConfig()\n",
"\n",
" # Start the query, passing in the extra configuration.\n",
" query_job = client.query(sql, job_config=job_config) # Make an API request.\n",
" df_temp = query_job.to_dataframe() # Wait for the job to complete.\n",
" df_temp['model_name'] = model.model_id\n",
" df = pd.concat([df, df_temp], axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "zLBMjSR-qqCq"
},
"source": [
"The code below assumes we've used the naming convention originally created in this notebook, and the k value occurs after the 2nd underscore. If you've changed the model_prefix_name variable, then this code might break. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 452
},
"colab_type": "code",
"id": "4DSIBlqVahZ7",
"outputId": "a47da0cb-0430-479d-b979-f98a55b8f388"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>davies_bouldin_index</th>\n",
" <th>mean_squared_distance</th>\n",
" <th>model_name</th>\n",
" <th>n_clusters</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2.146</td>\n",
" <td>8.920</td>\n",
" <td>kmeans_clusters_3</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.576</td>\n",
" <td>8.492</td>\n",
" <td>kmeans_clusters_4</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.446</td>\n",
" <td>7.611</td>\n",
" <td>kmeans_clusters_5</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2.272</td>\n",
" <td>7.459</td>\n",
" <td>kmeans_clusters_6</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.532</td>\n",
" <td>6.994</td>\n",
" <td>kmeans_clusters_7</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.678</td>\n",
" <td>6.582</td>\n",
" <td>kmeans_clusters_8</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.557</td>\n",
" <td>6.012</td>\n",
" <td>kmeans_clusters_9</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.285</td>\n",
" <td>5.870</td>\n",
" <td>kmeans_clusters_10</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.388</td>\n",
" <td>5.665</td>\n",
" <td>kmeans_clusters_11</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.607</td>\n",
" <td>5.075</td>\n",
" <td>kmeans_clusters_12</td>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.380</td>\n",
" <td>4.989</td>\n",
" <td>kmeans_clusters_13</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.280</td>\n",
" <td>4.840</td>\n",
" <td>kmeans_clusters_14</td>\n",
" <td>14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.449</td>\n",
" <td>4.709</td>\n",
" <td>kmeans_clusters_15</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" davies_bouldin_index mean_squared_distance model_name n_clusters\n",
"0 2.146 8.920 kmeans_clusters_3 3\n",
"0 1.576 8.492 kmeans_clusters_4 4\n",
"0 1.446 7.611 kmeans_clusters_5 5\n",
"0 2.272 7.459 kmeans_clusters_6 6\n",
"0 1.532 6.994 kmeans_clusters_7 7\n",
"0 1.678 6.582 kmeans_clusters_8 8\n",
"0 1.557 6.012 kmeans_clusters_9 9\n",
"0 1.285 5.870 kmeans_clusters_10 10\n",
"0 1.388 5.665 kmeans_clusters_11 11\n",
"0 1.607 5.075 kmeans_clusters_12 12\n",
"0 1.380 4.989 kmeans_clusters_13 13\n",
"0 1.280 4.840 kmeans_clusters_14 14\n",
"0 1.449 4.709 kmeans_clusters_15 15"
]
},
"execution_count": 17,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"# This will modify the dataframe above, produce a new field with 'n_clusters', and will sort for graphing\n",
"\n",
"df['n_clusters'] = df['model_name'].str.split('_').map(lambda x: x[2])\n",
"df['n_clusters'] = df['n_clusters'].apply(pd.to_numeric)\n",
"df = df.sort_values(by='n_clusters', ascending=True)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 297
},
"colab_type": "code",
"id": "jLVVMKm8QIFv",
"outputId": "df4ef800-3eb6-4018-ba2d-be751c313b5a"
},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:xlabel='n_clusters'>"
]
},
"execution_count": 18,
"metadata": {
"tags": []
},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAEHCAYAAACHsgxnAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAA0TklEQVR4nO3dd3xUVf7/8dcnhfSQhAQIHWkiJUACiAUBCyqoqCg2VnGF79J1f9a1Ydvisq4FV0URXEVREVdQxAaIjZIECFVqgFBCOumknN8fdxISSMiQzGQmyef5eMxjJpk7dz53CO85c+bcc8QYg1JKKffl4eoClFJKnZ0GtVJKuTkNaqWUcnMa1Eop5eY0qJVSys15OWOn4eHhplOnTs7YtVJKNUpxcXGpxpiIqu5zSlB36tSJ2NhYZ+xaKaUaJRE5UN192vWhlFJuToNaKaXcnAa1Ukq5Obv6qEVkJjAREOBtY8zLzixKqfpUVFREUlISBQUFri5FNQG+vr60a9cOb29vux9TY1CLSG+skB4EnARWiMhXxpjdta5UKTeSlJREUFAQnTp1QkRcXY5qxIwxpKWlkZSUROfOne1+nD1dHz2BtcaYPGNMMfAjcGMt61TK7RQUFNCiRQsNaeV0IkKLFi3O+dObPUG9FRgqIi1ExB+4FmhfRQGTRCRWRGJTUlLOqQilXE1DWtWX2vyt1RjUxpgdwD+A74AVwGaguIrt5hpjYowxMRERVY7Zrlnsu5C8rXaPVUqpRsquUR/GmHnGmAHGmKFAOuD4/umCE/Dd0/DGRfDe9fD7CigtdfjTKKVUQ2NXUItIS9t1B+Am4COHV+IbDDM3wxXPQNoe+GgczImGdXOhMNvhT6eUO5s1axazZ88+58c99dRTfP/99y6toSrDhg2r8mzlBQsWMG3aNADefPNN/vvf/57zvmNjY5kxY8Y5PcaRx1Yf7D2F/DMRaQEUAVONMRlOqcY/DC65H4ZMhR3LYO0b8PVDsPI5GPAHGDQJQjs65amVagyeffZZV5dQa3/6059q9biYmBhiYmIcXI17sSuojTGXOruQSjy9ofdN1iUp1grsdW/C2v/A+aPgwinQYQjoF0DKwZ5Zto3tR044dJ8XtAnm6et61bjdCy+8wH//+1/at29PREQE0dHRvP3228ydO5eTJ0/StWtX3n//fYqKioiKimLfvn14eHiQl5dHjx492LdvHxMnTmT06NGMHTuWuLg4/vznP5OTk0N4eDgLFiwgMjKSV199lTfffBMvLy8uuOACFi1aVG1NmzdvZsSIERw6dIiHH36YiRMnYozh4Ycf5uuvv0ZEeOKJJxg3bhyrV69m9uzZfPnllwBMmzaNmJgY7rnnnkr7nD9/Pn/729+IjIyke/fu+Pj4AFYrNzAwkAcffJBhw4YxePBgVq1aRWZmJvPmzePSS6uOoYrPO2vWLA4ePMi+ffs4ePAg999/f3lru6rXF2Dv3r1MnTqVlJQU/P39efvtt+natStDhgzhn//8J8OGDeOxxx7Dw8ODF154ocZ/R2dwyqRMDtUuBsbOg6xnYcM7EDffam1HRlmB3etG8PJxdZVK1UlcXByLFi1i48aNFBcXM2DAAKKjo7npppuYOHEiAE888QTz5s1j+vTpREVF8eOPPzJ8+HCWLVvGyJEjK51AUVRUxPTp0/niiy+IiIjg448/5vHHH+fdd9/l73//O/v378fHx4fMzMyz1pWQkMDatWvJzc2lf//+jBo1it9++41NmzaxefNmUlNTGThwIEOHDrXrOI8ePcrTTz9NXFwczZs3Z/jw4fTv37/KbYuLi1m/fj3Lly/nmWeesbtLZ+fOnaxatYrs7Gx69OjB5MmTSUhIqPL1BZg0aRJvvvkm3bp1Y926dUyZMoWVK1eyYMECxo4dy6uvvsqKFStYt26dXc/vDO4f1GWat4UrnoahD0HCx1Yr+/P/g++egoH3QfQECKzlaBOlbOxp+TrDTz
/9xI033oi/vz8A119/PQBbt27liSeeIDMzk5ycHEaOHAnAuHHj+Pjjjxk+fDiLFi1iypQplfb3+++/s3XrVq688koASkpKiIyMBKBv377ceeedjBkzhjFjxpy1rhtuuAE/Pz/8/PwYPnw469ev5+eff+b222/H09OTVq1acdlll7FhwwaCg4NrPM5169YxbNgwykaGjRs3jl27dlW57U033QRAdHQ0iYmJNe67zKhRo/Dx8cHHx4eWLVuSnJxc7eubk5PDr7/+yi233FL++MLCQgB69erF+PHjue666/jtt99o1qyZ3TU4WsMJ6jLN/CFmAkTfA/tWWYG96gVYMxv63gKDJ0Pr3q6uUqlzVtX42nvuuYf//e9/REVFsWDBAlavXg1YQfPYY4+Rnp5OXFwcI0aMqPQ4Ywy9evXit99+O2OfX331FWvWrGHp0qU899xzbNu2DS+vqqPg9JpEBGNMldt6eXlRWmGkVnUnddg7jrisS8TT05Pi4jNGBNf4uNMfW9XzlpaWEhISwqZNm6rc15YtWwgJCSE5Odnu53eGhjspkwh0GQF3fgrTYmHAeNi6BN68GBaMhp3LobTE1VUqZZehQ4fy+eefk5+fT3Z2NsuWLQMgOzubyMhIioqKWLhwYfn2gYGBDBo0iJkzZzJ69Gg8PT0r7a9Hjx6kpKSUB3VRURHbtm2jtLSUQ4cOMXz4cF588cXylnp1vvjiCwoKCkhLS2P16tXl3Rwff/wxJSUlpKSksGbNGgYNGkTHjh3Zvn07hYWFZGVl8cMPP5yxv8GDB7N69WrS0tIoKiri008/dcTLV6PqXt/g4GA6d+5cXocxhs2bNwOwZMkS0tLSWLNmDTNmzKixm8iZGl6Luirh3WDUv2DEExD/X2tI36LbIbQzDP4T9L8TfIJcXaVS1RowYADjxo2jX79+dOzYsfyLs+eee47BgwfTsWNH+vTpQ3b2qaGq48aN45ZbbilvZVfUrFkzFi9ezIwZM8jKyqK4uJj777+f7t27c9ddd5GVlYUxhgceeICQkJBq6xo0aBCjRo3i4MGDPPnkk7Rp04Ybb7yR3377jaioKESEF198kdatWwNw66230rdvX7p161Zl33NkZCSzZs1iyJAhREZGMmDAAEpKnN+gqu71BVi4cCGTJ0/m+eefp6ioiNtuu422bdvy6KOP8sMPP9C+fXumTZvGzJkzee+995xea1Wkuo8xdRETE2NcusJLSTHstA3vO7QOfIKh/3gYPAlCO7muLuWWduzYQc+ePV1dhmpCqvqbE5E4Y0yV4wwbbtfH2Xh6WaNB/vgtTFwJ3a+G9W/Bq/1h0Z1waIOrK1RKKbs1zqCuqG003Pw23L8FLvkzHPgV3h0J8e+7ujKl3ML8+fPp169fpcvUqVNdXdYZvvnmmzPqvPHGpjGRZ+Ps+jibwmz45A+wd6XVp33pg3riTBOnXR+qvmnXR018guD2j6HPrbDyeVj+oI4OUUq5tcYx6uNceTWDG9+CoNbw66uQcxxuehu8fV1dmVJKnaHptajLeHjAVc/ByL/CjqXwwU2Qn+nqqpRS6gxNN6jLDJkKN8+DQ+th/jVw4oirK1JKqUo0qAH6jIW7FkPmIXjnSkj53dUVKaWAxMREeve2f0qIivNeX3vttWc9m/Dll18mLy+vriXWCw3qMucNgwlfQclJa/jeofWurkipRutc5u6oreXLl5/1rMuGFNRN88vE6kRGWSfJfHCztRzYLfOhxzWurkrVp68fhWNbHLvP1n3gmr+fdZPExESuvvpqLrnkEtauXUtUVBQTJkzg6aef5vjx4yxcuJBevXoxffp0tmzZQnFxMbNmzeKGG24gMTGR8ePHk5ubC8CcOXO46KKLWL16NbNmzSI8PJytW7cSHR3NBx98UO2kSI8++ihLly7Fy8uLq666itmzZ7N//37uuOMOiouLufrqq/n3v/9NTk7OWeeefvbZZ1m2bBn5+flcdNFFvPXWW4gIw4YN46
KLLuKXX37h+uuvZ9iwYVXOlx0XF8e9996Lv78/l1xyyVlft/z8fCZMmMD27dvp2bMn+fn55fd16tSJ2NhY/Pz8uPXWW0lKSqKkpIQnn3yS5ORkjhw5wvDhwwkPD2fVqlVMnjyZDRs2kJ+fz9ixY3nmmWfK93P33XezbNmy8vlJzj//fHJycpg+fTqxsbGICE8//TQ333wz3377LU8//TSFhYV06dKF+fPnExgYaPefS1W0RX26sM5WWLfsCYvugDjXnNuvmp49e/Ywc+ZMEhIS2LlzJx9++CE///wzs2fP5q9//SsvvPACI0aMYMOGDaxatYqHHnqI3NxcWrZsyXfffUd8fDwff/xxpWWpNm7cyMsvv8z27dvZt28fv/zyS5XPnZ6ezueff862bdtISEjgiSeeAGDmzJnlAVY2n0dNpk2bxoYNG9i6dSv5+fnlYQ6QmZnJjz/+yIwZM5g+fTqLFy8uD+bHH38cgAkTJvDqq69WOfPf6d544w38/f1JSEjg8ccfJy4u7oxtVqxYQZs2bdi8eTNbt27l6quvZsaMGbRp04ZVq1axatUqwFpYIDY2loSEBH788UcSEhLK9xEeHk58fDyTJ08uX8Lrueeeo3nz5mzZsoWEhARGjBhBamoqzz//PN9//z3x8fHExMTw0ksv2fW6nY1dLWoReQC4DzDAFmCCMabqOQwbg4BwuHsZfHoPLJsBOcnWPNh6YkzjV0PL15k6d+5Mnz59AGsu5MsvvxwRoU+fPiQmJpKUlMTSpUvLg6KgoICDBw/Spk0bpk2bxqZNm/D09Kw0v/OgQYNo164dAP369SMxMbHKVmpwcDC+vr7cd999jBo1itGjRwPwyy+/8NlnnwEwfvx4HnnkkRqPY9WqVbz44ovk5eWRnp5Or169uO666wBrIimofr7srKwsMjMzueyyy8qf8+uvv672ucpmtgNrnu2+ffuesU2fPn148MEHeeSRRxg9enS1K8V88sknzJ07l+LiYo4ePcr27dvL91dxbuwlS5YA8P3331daHSc0NJQvv/yS7du3c/HFFwNw8uRJhgwZUuNrVpMag1pE2gIzgAuMMfki8glwG7Cgzs/uznwC4faPYOkMa77r7KNw7Wzw8Kz5sUrVQsV5lD08PMp/9vDwoLi4GE9PTz777DN69OhR6XGzZs2iVatWbN68mdLSUnx9favc59nmdfby8mL9+vX88MMPLFq0iDlz5rBy5Uqg6nmcq5t7uqCggClTphAbG0v79u2ZNWtWpXmpAwICgOrny87MzLR7vuoyNW3fvXt34uLiWL58OY899hhXXXUVTz31VKVt9u/fz+zZs9mwYQOhoaHcc889lequam5sY8wZz22M4corr+Sjjxy7/re9XR9egJ+IeAH+QNMYw+bpDWP+A5c8ALHvWqeeF+XX/DilnGDkyJG89tpr5RP3b9y4EYCsrCwiIyPx8PDg/fffr9W0oTk5OWRlZXHttdfy8ssvl0+kf/HFF5e3GivOh13d3NNl4RYeHk5OTg6LFy+u8vmqmy87JCSE5s2b8/PPP5/xnFUZOnRo+TZbt26t1F1R5siRI/j7+3PXXXfx4IMPEh8fD0BQUFD5tLEnTpwgICCA5s2bk5ycfNZWfJmrrrqKOXPmlP+ckZHBhRdeyC+//MKePXsAyMvLq3YFm3NRY1AbYw4Ds4GDwFEgyxjz7enbicgkEYkVkdiUlJQ6F+Y2ROCKWXD1P2DnV/D+jZDvnEXYlTqbJ598kqKiIvr27Uvv3r158sknAZgyZQrvvfceF154Ibt27SpvtZ6L7OxsRo8eTd++fbnsssv497//DcArr7zC66+/zsCBA8nKyirfvn379uVzT995553lc0+HhIQwceJE+vTpw5gxYxg4cGCVz1c2X/YjjzxCVFQU/fr149dffwWsSaKmTp3KkCFD8PPzO2vdkydPJicnh759+/Liiy8yaNCgM7bZsmULgwYNol+/frzwwgvl/e+TJk3imm
uuYfjw4URFRdG/f3969erFvffeW951cTZPPPEEGRkZ9O7dm6ioKFatWkVERAQLFizg9ttvp2/fvlx44YXs3Lmzxn3VpMZJmUQkFPgMGAdkAp8Ci40xH1T3GLeelKkuti6x1mkM6wJ3fWat46gaPJ2UyX6BgYFnXRFG2ccZkzJdAew3xqQYY4qAJcBFda60Iep9kxXQJw7DvCvheN3fKZVSqib2BPVB4EIR8Rer5/xyYIdzy3JjnYfChOVQWmydGHNwrasrUuqc3HjjjWfM6/zNN9/Y9VhXtaab8lzUYOd81CLyDFbXRzGwEbjPGFNY3faNtuujoowD1kROWUnWXCE9R7u6IlVLO3bs4Pzzzz/n0QZK1YYxhp07dzp+PmpjzNPGmPONMb2NMePPFtJNRmhHuPdbaNUbPhkPsfNdXZGqJV9fX9LS0nDGIhpKVWSMIS0trdIQSnvoKeR1EdAC7l5qnRjz5f3WiTGXPaInxjQw7dq1IykpiUY1Wkm5LV9f3/KTkOylQV1XzQLgtg9h2f2w+m+2E2P+ZS2wqxoEb29vOnfu7OoylKqWpokjeHrDDXOsFWN+mg05KTB2HniffQyoUkrZQ4PaUUTg8ietsF7+kDX73oA/QHh3CO8G/mGurlAp1UBpUDvaoIkQEAFfTIOl00793r+FFdotutrC2xbgIR21m0QpdVaaEM7Qawz0vA4yD0LqbkjdBWm7rdu7VsDG909t6+ENLbpUCPBupwLdL8RVR6CUciMa1M7i4WnNbR3WGbpfVfm+/AxI3WMFeOouSLPd3rXCOpGmTEDLCuHd7dTt5u11Fj+lmhANalfwC4X2A61LRSVF1ok0ZQGeuttqiW/7HAoyT23n6WNrgXeDbldCn1vBq1m9HoJSqv7YdWbiuWoSZybWJ2MgL+1UN0pZiB/fAVkHIbgtXDTd+vKy2bnPnKaUcr2znZmoLeqGQMRadSYgHDpWWC3CGNjzA/z8Eqx4FH58ES6cbH2h6RfqunqVUg6layY2ZCLQ7Qprkqh7v4F2A63VaP7dG759ErKPubpCpZQDaFA3Fh0uhDs/gT/9DN1Hwm9z4OW+8OUDkL7f1dUppepAg7qxad0Hxr4L02Kh3+2w8QN4LRo+mwjJ211dnVKqFjSoG6sWXeC6V2BmgtVvvfMreGMIfHgbHNrg6uqUUudAg7qxC46EkS/AA1th2F/g0FqYdwUsGG19EalTeyrl9jSomwr/MBj2CNy/FUb+FdL2WgsfzB0G27+A0lJXV6iUqoYGdVPjEwhDpsLMTXDdq1B4Aj75A/xnMGxcaJ10o5RyKzUGtYj0EJFNFS4nROT+eqhNOZOXD0TfbX3pOHa+9fMXU+CVfrD2TTiZ5+oKlVI253Rmooh4AoeBwcaYA9Vtp2cmNkDGwJ7v4aeX4OCv1mx/F06GgRN1ciil6kGd10ys4HJg79lCWjVQIta8Ifd+DRNWQNtoWPn8qZNnMg+6ukKlmqxzbVG/C8QbY+ZUcd8kYBJAhw4dog8c0Cxv8I5tsVrY2/9n/dz9Ghh0H5w3XNeFVMrBztaitjuoRaQZcAToZYxJPtu22vXRyGQegrj5EPce5KVCi24w8D7rhBrf5q6uTqlGwVFdH9dgtabPGtKqEQppD5c/BX/eDjfOtfqsVzwC/+ppnaKuZzwq5VTnMnve7cBHzipENQBePhA1zrocjocN71hD+mLfhY6XWLP2nT/KWuxXKeUwdnV9iIg/cAg4zxiTVdP22vXRhOSmWUuLxc6zvnAMagMxE2DA3RDUytXVKdVgOKSP+lxoUDdBpSWw+1tYPxf2rrTWgrzgBhg0CdoP0i8flaqBLhygnM/DE3pcY11S91jdIpsWwtbF1ox+gyZB77HQzN/VlSrV4GiLWjlPYQ5s+QTWvwPHt4FvCPS/Cwb+EcLOc3V1SrkV7fpQrmUMHPjV6hbZsQxMqXVyzaBJ0OVy8NApZ5TSrg/lWiLQ6WLrcuIIxC2wLgvHQmhna0x2/z
t1nUelqqFNGVW/gtvA8L9Y063ePA8CW8G3j1tjsn941uouUUpVokGtXMOrGfQZC3/8Bv7vJ2v89U//gjkDIeETXdBAqQo0qJXrRfaFsfPg3m8hsCUsmQjvjoQjG11dmVJuQYNauY8Og2HiKrh+DqTvg7nD4YtpkJPi6sqUcikNauVePDxgwHiYHmetRLP5I3htAPz6GhSfdHV1SrmEBrVyT77NrUV5p6yF9oPh2yfgjYtg93eurkypeqdBrdxbeDe4azHc8ak1/nrhWFh4q7U4r1JNhAa1ahi6X2W1rq98zjp55vXB1sozBSdcXZlSTqdBrRoOr2Zw8Qyr/7rvOPj1VXgtGjZ+AKWlrq5OKafRoFYNT1ArGPM6TFwJoZ3gi6nwzuVwaIOrK1PKKTSoVcPVNhr++K216kz2UZh3BSyZBCeOuroypRxKg1o1bCLWijPTYuHS/wfbPre6Q376FxQVuLo6pRxCg1o1Dj6B1rqOU9dDl+HWvCH/GQw7v9LT0VWDZ1dQi0iIiCwWkZ0iskNEhji7MKVqJawz3LYQxv8PvHxh0R3w/o1wfKerK1Oq1uxtUb8CrDDGnA9EATucV5JSDtBlOPzpF7jmRTgSb50s8/UjkHlIW9iqwalx4QARCQY2Yy1sa9dfuC4coNxKbhqset6aA9uUgre/tcJMiy4Q1qXydUCEru+oXKJOK7yISD9gLrAdqzUdB8w0xuSett0kYBJAhw4dog8cOFD3ypVypOM74cDP1lmNaXshfS9kJEJp8altmgVBi/PODPCwLuAfpiGunKauQR0DrAUuNsasE5FXgBPGmCere4y2qFWDUVIMWQchbZ8V3GUBnrYXMg9YLfAyvs2hRdfTQtwW6n4hLjsE1TjUdSmuJCDJGLPO9vNi4FFHFaeUS3l6Wd0gYecBV1S+r/ikFdYVwzt9LxxcC1s+BSo0cvxbnArw8G7Qpr910eXFlAPUGNTGmGMickhEehhjfgcux+oGUapx82pmhW54tzPvKyqwuk3KAjxtjzWH9r4fralZy4R1gbYDrJNz2kZD6z7g7Vdvh6AaB3sXt50OLBSRZsA+YILzSlKqAfD2hZbnW5fT5WfC0U1wOA4Ox0PiL7YWOODhBS0vsAW3LcAjzgcPz/qsXjUwNfZR14b2USt1mhNHrWGCZeF9JB4Ksqz7vP0hsp8tuG3hHdJRv7hsYuraR62UqqvgSAgeZS3iC9Zsf+n7Kof3+rehpNC6378FtBlwquXdZgAERriufuVSGtRKuYKHB4R3tS59b7V+V1IEydsqhPdG2PvDqZEnIR0qhHe0tfKNp/4Xbgr0X1kpd+HpDW36WZeYe63fFebA0c0VwjsOtv/Puq9lLxj1L+ioMzo0dhrUSrkzn0DodLF1KZObCntXWhNPzb8aom6HK5+FwJauq1M5lc6ep1RDExBudZdMXWdN7bplsTW167q3rBN4VKOjQa1UQ9UswJradcpvVp/11w/D3GHWCTmqUdGgVqqhC+8G4z+HW/8L+enw7kj43xTISXF1ZcpBNKiVagxE4IIbYNoGuOQBSPjE1h0yV7tDGgENaqUak2YBcMUsmPwrtO0PXz8Ebw+DQ+tdXZmqAw1qpRqjiO7WKje3LLDm4553pbVae26qqytTtaBBrVRjJQK9brS6Qy6eCZsXwWsDrDMgS0tcXZ06BxrUSjV2PoHWOOvJv0JkFCx/EN4eDoc2uLoyZScNaqWaioge8IelMPZdyDkO866AL6Zpd0gDoEGtVFMiAr1vtrpDLpphzZ39WjRsmKfdIW5Mg1qppsgnCK56zlqpvXUf+OrP8PYISIpzdWWqChrUSjVlLc+Hu5fBzfMg+xi8czksnWGNFFFuQ4NaqaZOBPqMtbpDhkyFjR/AnGiIna/dIW7CrhVeRCQRyAZKgOLqViEooyu8KNWAJW+H5Q/BgZ/BPxyCIq1FCwIqXAJbVv45IMJaY1LVmqNWeBlujNGvh5Vq7FpdAPd8CduWwL7V1p
whuSnWAr45KVCcX/XjfJtDgC3ATw/208PdJ0iXGjsHOh+1UupMZaNDet985n2FOVZw56ZC7nHrdlmY5x63fn98J+T+ZE0SVRUvX1toh0Ngawg7z1rtpoVt1ffAVhrkFdgb1Ab4VkQM8JYxZu7pG4jIJGASQIcOHRxXoVLKvfgEWpewzjVvW1IEeWnWuO3clFOXnOOngj7rkNVyr9hS9wmGFl1OBXeLrtZ1WBdo5u+0Q3NX9vZRtzHGHBGRlsB3wHRjzJrqttc+aqXUOSkthROHIW03pO6xXe+2uluyDlXetnn7U8FdFuTh3SCojbUWZQNV5z5qY8wR2/VxEfkcGARUG9RKKXVOPDwgpL116TKi8n0ncyFt75khvulDOJlzajtv/9Na4d1s3SldrT7xBqzGoBaRAMDDGJNtu30V8KzTK1NKKbCmbo3sa10qMsYa+12x9Z2621oIePv/Tq3eDtbIlbDzILRThUtn6zog3O37w+1pUbcCPhfrQLyAD40xK5xalVJK1UQEgiOtS+ehle8rLoT0fbYAt7XEMxJh7yrIPlJ5W++A0wK8k9X/HtrJ6mbx9q2PozmrGoPaGLMPiKqHWpRSyjG8fKBlT+tyuqJ8yDxoBXfZJX2/Fex7V542/FAguM2ZQV7PrXEdnqeUalq8/ayZBCN6nHmfMdaIlIohnrHf1hpfCdlHT9tXwJkt8UETHV6yBrVSSpURgaBW1qXD4DPvr9gaT99foUVua437t9CgVkopl6qpNZ6f4ZSnbbiDDpVSyp2IgH+YU3atQa2UUm5Og1oppdycBrVSSrk5DWqllHJzGtRKKeXmNKiVUsrNaVArpZSb06BWSik3p0GtlFJuToNaKaXcnAa1Ukq5OQ1qpZRyc3YHtYh4ishGEfnSmQUppZSq7Fxa1DOBHc4qRCmlVNXsCmoRaQeMAt5xbjlKKaVOZ2+L+mXgYaC0hu2UUko5WI1BLSKjgePGmLgatpskIrEiEpuSkuKwApVSqqmzp0V9MXC9iCQCi4ARIvLB6RsZY+YaY2KMMTEREREOLlMppZquGoPaGPOYMaadMaYTcBuw0hhzl9MrU0opBeg4aqWUcnvntAq5MWY1sNoplSillKqStqiVUsrNaVArpZSb06BWSik3p0GtlFJuToNaKaXcnAa1Ukq5OQ1qpZRycxrUSinl5jSolVLKzWlQK6WUm9OgVkopN6dBrZRSbk6DWiml3JwGtVJKuTkNaqWUcnMa1Eop5eY0qJVSys3Zswq5r4isF5HNIrJNRJ6pj8KUUkpZ7FmKqxAYYYzJERFv4GcR+doYs9bJtSmllMKOoDbGGCDH9qO37WKcWZRSSqlT7OqjFhFPEdkEHAe+M8asc2pVSimlytkV1MaYEmNMP6AdMEhEep++jYhMEpFYEYlNSUlxcJlKKdV0ndOoD2NMJrAauLqK++YaY2KMMTERERGOqU4ppZRdoz4iRCTEdtsPuALY6eS6lFJK2dgz6iMSeE9EPLGC/RNjzJfOLUsppVQZe0Z9JAD966EWpZRSVdAzE5VSys1pUCullJvToFZKKTenQa2UUm5Og1oppdycBrVSSrk5DWqllHJzGtRKKeXmNKiVUsrNaVArpZSb06BWSik3p0GtlFJuToO6AYtNTGdfSk7NGyqlGjQN6gbIGMN/Vu9h7Ju/ce2rP/FJ7CFXl6SUciJ75qNWbqSwuITHlmxhSfxhRveNJC3nJA8vTmD9/nSeu6E3fs08XV2iUsrB3Cqotx3JonurILw9taFflfTck/zf+7FsSMzgz1d2Z/qIrpQaeOX7Xby2ag8JSZn8584BdG0Z5OpSlVIO5DaJmJVfxG1z13LNKz+xZpcujnu63cnZ3PD6zyQkZfHa7f2ZcXk3RARPD+HPV/XgvQmDSMs5yfVzfuHzjUmuLlcp5UD2rJnYXkRWicgOEdkmIjOdUUiwrxcv3dqPopJS/vDueu57L5YDabnOeKoGZ82uFG76z6/knyxl0aQLuS
6qzRnbDO0ewVczLqV3m+Y88PFmHluSQEFRiQuqVUo5mhhjzr6BSCQQaYyJF5EgIA4YY4zZXt1jYmJiTGxsbK0KKiwu4d2fE5mzcjdFJYY/XtqZacO7EuDjVr009eb93xKZtWw73VoGMu+egbQN8Tvr9sUlpbz03S7+s3ovPSODef2O/pwXEVhP1SqlaktE4owxMVXeV1NQV7GzL4A5xpjvqtumLkFdJvlEAf9YsZMl8YdpGeTDo9ecz5h+bfHwkDrtt6EoLinl+a92sODXRK7o2ZKXb+tP4Dm8Wa3aeZwHPtlEcYnh7zf3YXTfM1vhSin34bCgFpFOwBqgtzHmxGn3TQImAXTo0CH6wIEDtS64oviDGTyzdBubk7Lo3yGEWdf1Iqp9iEP27a5OFBQx7cONrNmVwsRLO/PoNT3xrMUb1JHMfKZ9GE/8wUzGX9iRJ0b3xMdLR4Uo5Y4cEtQiEgj8CLxgjFlytm0d0aKuqLTU8Fl8Ev9Y8TupOYWMjW7Hw1f3oGWQr8Oew10cSs/j3gUb2J+ay3NjenP7oA512l9RSSkvrtjJ2z/tp3fbYP5zRzQdWvg7qFqllKPUOahFxBv4EvjGGPNSTds7OqjLZBcUMWfVHt79eT8+Xp5MH9GVCRd3ppmX2wxeqZMNien83/txlJQa3rhrABd1CXfYvr/bnsz/+2QTBvjn2Ciu7t3aYftWStVdnYJaRAR4D0g3xtxvzxM6K6jL7E/N5fkvt/PDzuN0Dg/gydE9GXF+K6c9X31YEp/Eo59toW2oH/PujnHKF4CH0vOY9mE8m5OymHBxJx67pmejeZNTqqGra1BfAvwEbAFKbb/+izFmeXWPcXZQl1n9+3Ge/XI7+1JyGdYjgidHX0CXBjbCobTU8NJ3u5izag9DzmvBG3cNIMS/mdOe72RxKX9dbn1JGdU+hNfv6E+7UO0KUcrVHDrqwx71FdRg9cG+92sir3y/m/yiEu65qBMzruhGsK93vTx/XeSfLOH/fbqJ5VuOcdvA9jw3pne9nZX59ZajPLw4AQ8P4V+3RHHFBQ37E4lSDV2jDuoyqTmFzP7mdz6OPUSLgGY8NLIHt0S3d9vhfMknCpj431i2HM7i8Wt78sdLOmP1MtWfA2m5TFkYz7YjJ5g09DweGtlDT99XykWaRFCX2ZKUxTPLthF7IIM+bZsz6/oLiO4Y5pJaqrP1cBb3vRfLiYIiXr2tv0tbswVFJTz/1XY+WHuQ6I6hvHZ7f9rUcFKNM5SWGrd9U1WqPjSpoAZrGtClm4/wt+U7OXaigDH92vDoNT1p3dz1w/m+2XaM+xdtItTfm3fuHsgFbYJdXRIASzcf4bHPEmjm5cFL4/oxvEdLhz9HSanhSGY++1Nzyy/7UnPZn5rD4Yx8mvt50zbUj3Yh/tZ1qB/tQv1pG+JHuzC/BtGdpVRtNbmgLpN3spg3Vu/lrTX78BRh6vAu3Hfpefh61/9JH8YY3lqzj3+s2EnfdiG8/YdotxsHvjclh6kL49l5LJspw7rw5yu743WOXSHGGNJyT1pBnHIqiPen5pKYlsfJ4tLybQN9vOgcHkDn8ADahfqRlV9EUkY+SRl5HM7Mp6CotNK+g3y9aBfqT7tQPyu8bUHezhbqzf286737qDE7nl1AcYkhsrmvvq71oMkGdZlD6Xm88NUOVmw7RvswPyZf1pUerQNpH+ZPRKCP0/8ITxaX8vjnW/g0LonRfSOZfUuUS94s7FFQVMKspdtYtOEQgzqH8drt/WkVfOYbSk5hMYllLeKUU2G8LzWX7ILi8u28PYWOLawwPs8Wyp3DA+gcEXDW174s8A9n5FcK76SMfNvv8sg9WXnSqUAfr/IAP6NFHupHWEAzDZxqFJWUsuPoCeIPZBB/MJO4AxkczswHILK5L9EdQxnYKYzojqH0jAyu1Zmy6uyafFCX+XVPKs8s287vydnlv/P19qBDmD8dwvxpb7suu7QL9a/zRPzpuSf50wdxrN+fzozLu3
H/5d0aRF/s5xuT+MuSrfg38+TBkT3ILiiygjjF6rI4nl1Yafu2IX6nQtgWxF3CA2kT4nvOrXJ7GGMqtcCTbIFeFuZJGXmV3jAA/Lw96RkZxKXdIhjaPYJ+7UOabOCk5hSWh3L8wQwSkjLLP8G0CvYhumMoAzqE4u3pQeyBDGIT0zmaVQBYb4j9O4QQ0zGMmE6h9Gsf0mQnTXMkDeoKSkoN+1NzOZSRx6H0PA6m5XEw/dQl77RWWssgnzODvIV1HRHoc9bQ3XM8hz++t4GjWQX8c2xfbujX1tmH51C7k7OZsjCe3cetdRnDAppVCuPzbIHcqUWAW35CyMov4nB5eFthHnfACqVSY02te0m3cIbagtsVX6LWh+KSUnYey2bjwVPBfCAtDwAvD6FX2+YM6BDCgA6hDOgYSptqujoOZ+YTm5hObGIGGxLT+T05G2PA00O4IDK4vNUd0ym0yk9h6uw0qO1kjCE992R5aB+qEOCH0vM5kpVPxZfLx8ujUiu84u0jmfnMWLQRHy8P3hofQ3THUNcdWB0UFJWw53gO7UL9nHoiTn3KzDvJz3tSWbMrhTW7Ujl2wmopdm0ZyNBuEVzWI4LBncPc8s3HHhm5J9l4KIO4AxnEH8hkc1JmeQMkPNCH6I6nQrlP2+a1Ps4TBUXEH7CeJzYxg42HMspb5e3D/Mpb3DEdw+jWMrBBfJKsi8LiEo6fKKR9WO1OINOgdpDC4hKOZBZUDvIKLfKcwsoftXu0CmLePTF65p8bM8aw+3gOa3al8OOuFNbtT+dkcSk+Xh4M6hzGZd2t1na3loFu2b9dUmrYfTyb+ANWv/LGgxnsS7UW3Chr6Q7oEMIAW1dGu1A/px1HUUkp24+cYENiOnEHMtiQmEFqjtVFFuzrRXTHUGI6hRHTMZSo9iEN9o2wpNRwIC2XXcnZ/H4sh13J2ew8doLEtDwiAn1Y+5fLa7VfDep6YIwhM6+oPLSz8ou4oV8bgnRIWYNSUFTCuv3pttZ2Snm3T+tgX4Z2D2do9wgu6Rper58uSksNqbmFJGcVcuxEAcdOFHA0M58th7PYeDCzvIEQFtDM1lK2Wsx92zXHv5nr+o6NMRxMzyM2MYPYA1aXSdnr6e0p9G7bnIGdwujfPoS2oX60CvalRUAzp3ynURvGGI5mFfD7sWx+T85ml+16z/EcCm2jl0SgQ5g/3VsF0aNVEN1bB3Fd38havRlqUCtVS0cy8/lpt9VF8tPuFE4UFOMh0LddCEO7R3BZ93Ci2oXUOlzyT5ZY4ZtVQLIthCveTs4q4Hh2IcWllf+fenoI3VsFnerG6BBKxxb+btnqrygj9yTxB63WdtyBdDYnZVUasukhVvdMq2BfWgX7EBFkXZf93DLItzzQHdmVkpZTWCGMrVbyrmPZZFf4lNw62JfurYPo0SrQCubWQXRtGeiwN0MNaqUcoLiklM1JWVZre3cKmw+d+lLy4q5Wa3to9wjahvhRWmoNL0y2Be+xEwVV3j5x2sgUsEZVtAr2oXVzK5RaB/tWut0q2JfwQPdpedZFYXEJu5NzrDen7AKSTxRy3Pb6JJ8o5Hh2Aak5J894nJeHEBHkQ8tgX1oF+dAy2IdWthBvWR7svoT6Vx5bn1NYXB7Cvydn8/uxbHYlZ1d6juZ+3vRofaqF3MPWWm7u79xPxxrUSjlBVl7RqS8ld6eUD18LD/QhK/8kRSWV/295CEQE+ZSHbZVB3Nz3nJZcawpOFpeSmlNYKbxP3T4V7Bl5RWc8tpmnBxFBPkQE+ZCSXVg+Nhys4Zqnt5B7tAoiIsj551ZURYNaKSczxrDneA4/7kphV3I2LQIrB3LrRtQKdlcFRSWkZJcFeYVgP2F1H4UFNKNH66Dy/uR2oX5uNRLlbEGtb91KOYCI0K1VEN1aBbm6lCbL19uT9rZhso2Nvr0rpZSb06BWSik3V2NQi8i7InJcRLbWR0FKKaUqs6dFvQC42sl1KKWUqkaNQW2MWQOk10MtSimlqu
CwPmoRmSQisSISm5KS4qjdKqVUk+ewoDbGzDXGxBhjYiIiIhy1W6WUavJ01IdSSrk5p5zwEhcXlyoiB5yxbwcJB1JdXYSD6LG4n8ZyHKDHUp86VndHjaeQi8hHwDCsg0wGnjbGzHNkdfVNRGKrO1WzodFjcT+N5ThAj8Vd1NiiNsbcXh+FKKWUqpr2USullJtrqkE919UFOJAei/tpLMcBeixuwSnTnCqllHKcptqiVkqpBkODWiml3FyTC2oR8RSRjSLypatrqQsRCRGRxSKyU0R2iMgQV9dUWyLygIhsE5GtIvKRiPi6uiZ7VTW7pIiEich3IrLbdh3qyhrtVc2x/NP2N5YgIp+LSIgLS7Tb2Wb9FJEHRcSISLgraquNJhfUwExgh6uLcIBXgBXGmPOBKBroMYlIW2AGEGOM6Q14Are5tqpzsoAzZ5d8FPjBGNMN+MH2c0OwgDOP5TugtzGmL7ALeKy+i6qlBVQx66eItAeuBA7Wd0F10aSCWkTaAaOAd1xdS12ISDAwFJgHYIw5aYzJdGlRdeMF+ImIF+APHHFxPXarZnbJG4D3bLffA8bUZ021VdWxGGO+NcaULZW+FmhX74XVwllm/fw38DDQoEZRNKmgBl7G+kcqdXEddXUekALMt3XjvCMiAa4uqjaMMYeB2VgtnKNAljHmW9dWVWetjDFHAWzXLV1cj6PcC3zt6iJqS0SuBw4bYza7upZz1WSCWkRGA8eNMXGursUBvIABwBvGmP5ALg3n43Ultv7bG4DOQBsgQETucm1V6nQi8jhQDCx0dS21ISL+wOPAU66upTaaTFADFwPXi0gisAgYISIfuLakWksCkowx62w/L8YK7oboCmC/MSbFGFMELAEucnFNdZUsIpEAtuvjLq6nTkTkbmA0cKdpuCdedMFqDGy2ZUA7IF5EWru0Kjs1maA2xjxmjGlnjOmE9WXVSmNMg2y5GWOOAYdEpIftV5cD211YUl0cBC4UEX8REaxjaZBfjFawFLjbdvtu4AsX1lInInI18AhwvTEmz9X11JYxZosxpqUxppMtA5KAAbb/S26vyQR1IzQdWCgiCUA/4K+uLad2bJ8KFgPxwBasv8kGc6qvbXbJ34AeIpIkIn8E/g5cKSK7sUYY/N2VNdqrmmOZAwQB34nIJhF506VF2qmaY2mw9BRypZRyc9qiVkopN6dBrZRSbk6DWiml3JwGtVJKuTkNaqWUcnMa1Eop5eY0qFWjISKdqprW0s7H3iMibRxdk1KOoEGtlOUerLlG7Gab7U8pp9OgVm7D1iLeISJv2xYS+FZE/KrZtquIfC8im0UkXkS6nHb/PSIyp8LPX4rIMNvCEQtsixRssS1aMBaIwTrTc5OI+IlItIj8KCJxIvJNhbk7VovIX0XkR2CmiNxi29dmEVnjxJdHNWHaIlDuphtwuzFmooh8AtwMVDV51kLg78aYz20rwnhg33Si/YC2tkUKEJEQY0ymiEwDHjTGxIqIN/AacIMxJkVExgEvYE3zCRBijLnM9vgtwEhjzOGGsvqJang0qJW72W+M2WS7HQd0On0DEQnCCtvPAYwxBbbf27P/fcB5IvIa8BVQ1dzXPYDeWPNbgLXqzNEK939c4fYvwALbm8oSewpQ6lxpUCt3U1jhdglQVdeHPYlcTOWuPV8AY0yGiEQBI4GpwK2cailX3P82Y0x161Dmlt0wxvxJRAZjrRy0SUT6GWPS7KhPKbtpH7VqcIwxJ4AkERkDICI+tonhK0oE+omIh22dvEG2bcMBD2PMZ8CTnJrHOxtrljiA34EIsS0YLCLeItKrqlpEpIsxZp0x5ikgFWjvoMNUqpy2qFVDNR54S0SeBYqAW6i8xNovwH6sqVO3Yk2jCtAWawmzskZK2WKtC4A3RSQfGAKMBV4VkeZY/09eBrZVUcc/RaQbViv8B6DBLfOk3J9Oc6qUUm5Ouz6UUsrNadeHcmsi8jrWepcVvWKMme+KepRyBe36UEopN6
ddH0op5eY0qJVSys1pUCullJvToFZKKTf3/wFxL0pkcbG1qwAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light",
"tags": []
},
"output_type": "display_data"
}
],
"source": [
"df.plot.line(x='n_clusters', y=['davies_bouldin_index', 'mean_squared_distance'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "OKWIJMXBcatm"
},
"source": [
"Note - when you run this notebook, you will get different results, due to random cluster initialization. If you'd like to consistently return the same cluster for reach run, you may explicitly select your initialization through hyperparameter selection (https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create#kmeans_init_method). \n",
"\n",
"Making our k selection: There is no perfect approach or process when determining the optimal k value. It can often be determined by business rules or requirements. In this example, there isn't a simple requirement, so these considerations can also be followed:\n",
"\n",
"\n",
"1. We start with the 'elbow method', which is effectively charting loss vs k. Sometimes, though not always, there's a natural 'elbow' where incremental clusters do not drastically reduce loss. In this specific example, and as you often might find, unfortunately there isn't a natural 'elbow', so we must continue our process. \n",
"2. Next, we chart Davies-Bouldin vs k. This score tells us how 'different' each cluster is, with the optimal score at zero. With 5 clusters, we see a score of ~1.4, and only with k>9, do we see better values. \n",
"3. Finally, we begin to try to interpret the difference of each model. You can review the evaluation module for various models to understand distributions of our features. With our data, we can look for patterns by gender, house hold income, and shopping habits.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "6aUBVjqsy3Lo"
},
"source": [
"# Analyze our final cluster\n",
"\n",
"There are 2 options to understand the characteristics of your model. You can either 1) look in the BigQuery UI, or you can 2) programmatically interact with your model object. Below you’ll find a simple example for the latter option. \n"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Uo6wjkebuOVB"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>centroid_id</th>\n",
" <th>feature</th>\n",
" <th>categorical_value</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>OS</td>\n",
" <td>[{'category': 'Linux', 'value': 0.035714285714...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>gender</td>\n",
" <td>[{'category': 'M', 'value': 0.4285714285714285...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>OS</td>\n",
" <td>[{'category': 'iOS', 'value': 0.04276315789473...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>gender</td>\n",
" <td>[{'category': 'F', 'value': 0.4967105263157895...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" <td>OS</td>\n",
" <td>[{'category': 'iOS', 'value': 0.09637391424238...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" centroid_id feature categorical_value\n",
"0 1 OS [{'category': 'Linux', 'value': 0.035714285714...\n",
"1 1 gender [{'category': 'M', 'value': 0.4285714285714285...\n",
"2 2 OS [{'category': 'iOS', 'value': 0.04276315789473...\n",
"3 2 gender [{'category': 'F', 'value': 0.4967105263157895...\n",
"4 3 OS [{'category': 'iOS', 'value': 0.09637391424238..."
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_to_use = 'kmeans_clusters_5' # User can edit this\n",
"final_model = DATA_SET_ID+'.'+model_to_use\n",
"\n",
"sql_get_attributes = f'''\n",
"SELECT\n",
" centroid_id,\n",
" feature,\n",
" categorical_value\n",
"FROM\n",
" ML.CENTROIDS(MODEL {final_model})\n",
"WHERE\n",
" feature IN ('OS','gender')\n",
"'''\n",
"\n",
"job_config = bigquery.QueryJobConfig()\n",
"\n",
"# Start the query\n",
"query_job = client.query(sql_get_attributes, job_config=job_config) #API Request\n",
"df_attributes = query_job.result()\n",
"df_attributes = df_attributes.to_dataframe()\n",
"df_attributes.head()"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>centroid_id</th>\n",
" <th>Total_Users</th>\n",
" <th>Apparel</th>\n",
" <th>Office</th>\n",
" <th>Electronics</th>\n",
" <th>LimitedSupply</th>\n",
" <th>Accessories</th>\n",
" <th>ShopByBrand</th>\n",
" <th>Bags</th>\n",
" <th>Total_Purchases</th>\n",
" <th>productPrice_USD</th>\n",
" <th>hhi</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>31</td>\n",
" <td>90.800</td>\n",
" <td>11.200</td>\n",
" <td>9.300</td>\n",
" <td>4.000</td>\n",
" <td>5.900</td>\n",
" <td>1.600</td>\n",
" <td>1.300</td>\n",
" <td>158.900</td>\n",
" <td>3486.300</td>\n",
" <td>77857.100</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>344</td>\n",
" <td>14.600</td>\n",
" <td>5.100</td>\n",
" <td>5.700</td>\n",
" <td>2.400</td>\n",
" <td>0.800</td>\n",
" <td>3.000</td>\n",
" <td>3.100</td>\n",
" <td>44.700</td>\n",
" <td>1268.700</td>\n",
" <td>75296.100</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>7198</td>\n",
" <td>1.900</td>\n",
" <td>0.600</td>\n",
" <td>0.500</td>\n",
" <td>0.400</td>\n",
" <td>0.200</td>\n",
" <td>0.200</td>\n",
" <td>0.200</td>\n",
" <td>5.500</td>\n",
" <td>153.600</td>\n",
" <td>74898.000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>301</td>\n",
" <td>4.100</td>\n",
" <td>2.500</td>\n",
" <td>1.600</td>\n",
" <td>1.600</td>\n",
" <td>7.300</td>\n",
" <td>0.400</td>\n",
" <td>0.000</td>\n",
" <td>22.300</td>\n",
" <td>383.700</td>\n",
" <td>75415.200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>0.000</td>\n",
" <td>340.000</td>\n",
" <td>0.000</td>\n",
" <td>0.000</td>\n",
" <td>0.000</td>\n",
" <td>0.000</td>\n",
" <td>0.000</td>\n",
" <td>354.000</td>\n",
" <td>2022.900</td>\n",
" <td>75000.000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" centroid_id Total_Users Apparel Office Electronics LimitedSupply \\\n",
"0 1 31 90.800 11.200 9.300 4.000 \n",
"1 2 344 14.600 5.100 5.700 2.400 \n",
"2 3 7198 1.900 0.600 0.500 0.400 \n",
"3 4 301 4.100 2.500 1.600 1.600 \n",
"4 5 1 0.000 340.000 0.000 0.000 \n",
"\n",
" Accessories ShopByBrand Bags Total_Purchases productPrice_USD hhi \n",
"0 5.900 1.600 1.300 158.900 3486.300 77857.100 \n",
"1 0.800 3.000 3.100 44.700 1268.700 75296.100 \n",
"2 0.200 0.200 0.200 5.500 153.600 74898.000 \n",
"3 7.300 0.400 0.000 22.300 383.700 75415.200 \n",
"4 0.000 0.000 0.000 354.000 2022.900 75000.000 "
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# get numerical information about clusters\n",
"\n",
"sql_get_numerical_attributes = f'''\n",
"WITH T AS (\n",
"SELECT \n",
" centroid_id,\n",
" ARRAY_AGG(STRUCT(feature AS name, \n",
" ROUND(numerical_value,1) AS value) \n",
" ORDER BY centroid_id) \n",
" AS cluster\n",
"FROM ML.CENTROIDS(MODEL {final_model})\n",
"GROUP BY centroid_id\n",
"),\n",
"\n",
"Users AS(\n",
"SELECT\n",
" centroid_id,\n",
" COUNT(*) AS Total_Users\n",
"FROM(\n",
"SELECT\n",
" * EXCEPT(nearest_centroids_distance)\n",
"FROM\n",
" ML.PREDICT(MODEL {final_model},\n",
" (\n",
" SELECT\n",
" *\n",
" FROM\n",
" {final_view.full_table_id.replace(\":\", \".\")}\n",
" )))\n",
"GROUP BY centroid_id\n",
")\n",
"\n",
"SELECT\n",
" centroid_id,\n",
" Total_Users,\n",
" (SELECT value from unnest(cluster) WHERE name = 'Apparel') AS Apparel,\n",
" (SELECT value from unnest(cluster) WHERE name = 'Office') AS Office,\n",
" (SELECT value from unnest(cluster) WHERE name = 'Electronics') AS Electronics,\n",
" (SELECT value from unnest(cluster) WHERE name = 'LimitedSupply') AS LimitedSupply,\n",
" (SELECT value from unnest(cluster) WHERE name = 'Accessories') AS Accessories,\n",
" (SELECT value from unnest(cluster) WHERE name = 'ShopByBrand') AS ShopByBrand,\n",
" (SELECT value from unnest(cluster) WHERE name = 'Bags') AS Bags,\n",
" (SELECT value from unnest(cluster) WHERE name = 'productPrice_USD') AS productPrice_USD,\n",
" (SELECT value from unnest(cluster) WHERE name = 'hhi') AS hhi\n",
"\n",
"FROM T LEFT JOIN Users USING(centroid_id)\n",
"ORDER BY centroid_id ASC\n",
"'''\n",
"\n",
"job_config = bigquery.QueryJobConfig()\n",
"\n",
"# Start the query\n",
"query_job = client.query(sql_get_numerical_attributes, job_config=job_config) #API Request\n",
"df_numerical_attributes = query_job.result()\n",
"df_numerical_attributes = df_numerical_attributes.to_dataframe()\n",
"df_numerical_attributes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "t95SlHjlx3Qn"
},
"source": [
"In addition to the output above, I'll note a few insights we get from our clusters. \n",
"\n",
"Cluster 1 - The apparel shopper, which also purchases more often than normal. This (although synthetic data) segment skews female.\n",
"\n",
"Cluster 2 - Most likely to shop by brand, and interested in bags. This segment has fewer purchases on average than the first cluster, however, this is the highest value customer.\n",
"\n",
"Cluster 3 - The most populated cluster, this one has a small amount of purchases and spends less on average. This segment is the one time buyer, rather than the brand loyalist. \n",
"\n",
"Cluster 4 - Most interested in accessories, does not buy as often as cluster 1 and 2, however buys more than cluster 3. \n",
"\n",
"Cluster 5 - This is an outlier as only 1 person belongs to this group. "
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Cv0hMz3rcb1s"
},
"source": [
"# Use model to group new website behavior, and then push results to GA360 for marketing activation\n",
"\n",
"After we have a finalized model, we want to use it for inference. The code below outlines how to score or assign users into clusters. These are labeled as the CENTROID_ID. Although this by itself is helpful, we also recommend a process to ingest these scores back into GA360. The easiest way to export your BigQuery ML predictions from a BigQuery table to Google Analytics 360 is to use the MoDeM (Model Deployment for Marketing, https://github.com/google/modem) reference implementation. MoDeM helps you load data into Google Analytics for eventual activation in Google Ads, Display & Video 360 and Search Ads 360."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "-fAsssnWnv5C",
"outputId": "a4fd7009-c808-4f42-972c-8fb9d1a34bde"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CENTROID_ID</th>\n",
" <th>fullVisitorID</th>\n",
" <th>Hashed_fullVisitorID</th>\n",
" <th>OS</th>\n",
" <th>Apparel</th>\n",
" <th>Office</th>\n",
" <th>Electronics</th>\n",
" <th>LimitedSupply</th>\n",
" <th>Accessories</th>\n",
" <th>ShopByBrand</th>\n",
" <th>Bags</th>\n",
" <th>productPrice_USD</th>\n",
" <th>gender</th>\n",
" <th>hhi</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5</td>\n",
" <td>3355435945434430291</td>\n",
" <td>2759439800079197041</td>\n",
" <td>Macintosh</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>1079.880</td>\n",
" <td>F</td>\n",
" <td>65000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" CENTROID_ID fullVisitorID Hashed_fullVisitorID OS Apparel \\\n",
"0 5 3355435945434430291 2759439800079197041 Macintosh 0 \n",
"\n",
" Office Electronics LimitedSupply Accessories ShopByBrand Bags \\\n",
"0 0 0 6 0 0 5 \n",
"\n",
" productPrice_USD gender hhi \n",
"0 1079.880 F 65000 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sql_score = f'''\n",
"SELECT * EXCEPT(nearest_centroids_distance)\n",
"FROM\n",
" ML.PREDICT(MODEL {final_model},\n",
" (\n",
" SELECT\n",
" *\n",
" FROM\n",
" {final_view.full_table_id.replace(\":\", \".\")}\n",
" LIMIT 1))\n",
"'''\n",
"\n",
"job_config = bigquery.QueryJobConfig()\n",
"\n",
"# Start the query\n",
"query_job = client.query(sql_score, job_config=job_config) #API Request\n",
"df_score = query_job.result()\n",
"df_score = df_score.to_dataframe()\n",
"\n",
"df_score"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "lnHw5kQUYfkK"
},
"source": [
"# Clean up: Delete all models and tables "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 238
},
"colab_type": "code",
"id": "fJ8VwVcMYlW7",
"outputId": "babbcc21-9ea3-4a4a-d107-6871aaa82bd9"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Deleted: bqml_kmeans.kmeans_clusters_10\n",
"Deleted: bqml_kmeans.kmeans_clusters_11\n",
"Deleted: bqml_kmeans.kmeans_clusters_12\n",
"Deleted: bqml_kmeans.kmeans_clusters_13\n",
"Deleted: bqml_kmeans.kmeans_clusters_14\n",
"Deleted: bqml_kmeans.kmeans_clusters_15\n",
"Deleted: bqml_kmeans.kmeans_clusters_3\n",
"Deleted: bqml_kmeans.kmeans_clusters_4\n",
"Deleted: bqml_kmeans.kmeans_clusters_5\n",
"Deleted: bqml_kmeans.kmeans_clusters_6\n",
"Deleted: bqml_kmeans.kmeans_clusters_7\n",
"Deleted: bqml_kmeans.kmeans_clusters_8\n",
"Deleted: bqml_kmeans.kmeans_clusters_9\n",
"Deleted: bqml_kmeans.test\n"
]
}
],
"source": [
"# Are you sure you want to do this? This is to delete all models\n",
"\n",
"models = client.list_models(DATA_SET_ID) # Make an API request.\n",
"for model in models:\n",
" full_model_id = f\"{model.dataset_id}.{model.model_id}\"\n",
" client.delete_model(full_model_id) # Make an API request.\n",
" print(f\"Deleted: {full_model_id}\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 51
},
"colab_type": "code",
"id": "CAXrHpAJYwCI",
"outputId": "969f2d34-a56c-4e7b-c4c3-230f7d8bbecc"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Deleted: bqml_kmeans.CRM_View\n",
"Deleted: bqml_kmeans.Final_View\n",
"Deleted: bqml_kmeans.GA360_View\n"
]
}
],
"source": [
"# Are you sure you want to do this? This is to delete all tables and views\n",
"\n",
"tables = client.list_tables(DATA_SET_ID) # Make an API request.\n",
"for table in tables:\n",
" full_table_id = f\"{table.dataset_id}.{table.table_id}\"\n",
" client.delete_table(full_table_id) # Make an API request.\n",
" print(f\"Deleted: {full_table_id}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LYucZEztyk2K"
},
"source": [
"# Wrapping it all up"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "OfzOUkdiytXm"
},
"source": [
"In this exercise, we’ve accomplished some cool things with k-means in BigQuery ML. Most notably, we’re able to join online and offline user level information to gain more insight into a holistic view of our customers. We’ve modeled user behavior, and detailed an approach to determine the optimal number of clusters. We’re able to take this insight and apply to future behavior through inference. Finally, we can import this inference score back into GA360 for future marketing campaigns. "
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "BQML_Scaled_Clustering_TC_Draft.ipynb",
"provenance": [],
"toc_visible": true
},
"environment": {
"name": "common-cpu.m54",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m54"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3-final"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
================================================
FILE: retail/ltv/bqml/README.md
================================================
# Activate on LTV predictions.
This guide refactors the [final part][series_final] of an existing series about predicting Lifetime Value (LTV). The series uses TensorFlow and shows multiple approaches, such as statistical models and deep neural networks, to predict the monetary value of customers. The final part leverages [AutoML Tables][automl_tables].
This document shows an opinionated way to predict the monetary value of your customers for a specific time in the future using historical data.
This updated version differs in the following ways:
- Predicts future monetary value for a specific period of time.
- Minimizes development time by using AutoML directly from [BigQuery ML][bq_ml].
- Uses two new datasets for sales and customer data.
- Creates additional training examples by moving the date that separates input and target orders (more details in the notebook)
- Shows how to activate the LTV predictions to create similar audiences in marketing tools.
The end-to-end flow assumes that you start with a data dump stored in BigQuery and runs through the following steps:
1. Match your dataset to the sales dataset template.
1. Create features from a list of orders.
1. Train a model using monetary value as a label.
1. Predict future monetary value of customers.
1. Extract the emails of the top customers.
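
The idea of moving the date that separates input and target orders can be sketched in plain Python. This is only an illustration under assumptions: the function name `make_training_examples`, the tuple layout of `orders`, and the sample values are hypothetical, and the notebook and scripts do the actual work in BigQuery.

```python
from datetime import date, timedelta


def make_training_examples(orders, threshold_dates, target_days):
    """Build one (inputs, target) example per customer and threshold date.

    orders: list of (customer_id, order_date, value) tuples (hypothetical layout).
    Orders before the threshold feed the inputs; orders in the following
    `target_days` days feed the target value.
    """
    examples = []
    for threshold in threshold_dates:
        target_end = threshold + timedelta(days=target_days)
        per_customer = {}
        for customer_id, order_date, value in orders:
            stats = per_customer.setdefault(
                customer_id,
                {"input_value": 0.0, "input_orders": 0, "target_value": 0.0},
            )
            if order_date < threshold:
                stats["input_value"] += value
                stats["input_orders"] += 1
            elif order_date < target_end:
                stats["target_value"] += value
        for customer_id, stats in sorted(per_customer.items()):
            # Customers without any order before the threshold have no inputs.
            if stats["input_orders"] > 0:
                examples.append(
                    {"customer_id": customer_id, "threshold": threshold, **stats}
                )
    return examples


# Illustrative data only.
orders = [
    ("c1", date(2020, 1, 5), 20.0),
    ("c1", date(2020, 2, 10), 30.0),
    ("c1", date(2020, 3, 15), 50.0),
    ("c2", date(2020, 2, 20), 10.0),
]
thresholds = [date(2020, 2, 1), date(2020, 3, 1)]
examples = make_training_examples(orders, thresholds, target_days=30)
for example in examples:
    print(example)
```

Each threshold date yields a separate example per customer, so two thresholds turn customer `c1` into two training rows instead of one.
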
For more general information about LTV, read the [first part][series_first] of the series.
[series_final]:https://cloud.google.com/solutions/machine-learning/clv-prediction-with-automl-tables
[automl_tables]:https://cloud.google.com/automl-tables
[bq_ml]:https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro
[series_first]:https://cloud.google.com/solutions/machine-learning/clv-prediction-with-offline-training-intro
## Files
There are two main sets of files:
**[1. ./notebooks](./notebooks)**
You can use the notebook in this folder to manually run the flow using example datasets. You can also use your own data.
**[2. ./scripts](./scripts)**
The scripts in this folder facilitate automation through BigQuery scripting, BigQuery stored procedures and bash scripting. Scripts use statements from the notebook to:
1. Transform data.
1. Train and use model to predict LTV.
1. Extract emails of the top LTV customers.
*Note: For production use cases, you can reuse the SQL statements from the scripts folder in pipeline tools such as Kubeflow Pipelines or Cloud Composer.*
The scripts assume that you already have the sales and crm datasets stored in BigQuery.
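
As a rough illustration of the last step, the sketch below picks the customers with the highest predicted value and returns their emails. The names `top_customer_emails` and `normalize_and_hash`, the dictionary inputs, and the sample data are hypothetical; the scripts do this in SQL. The hashing step is an added assumption: advertising platforms commonly expect customer lists to be normalized and SHA-256 hashed before upload, so check the requirements of the tool you use.

```python
import hashlib


def top_customer_emails(predictions, crm, top_n):
    """Return the emails of the top_n customers ranked by predicted value.

    predictions: {customer_id: predicted_monetary_value} (hypothetical layout)
    crm: {customer_id: email} (hypothetical layout)
    """
    ranked = sorted(predictions.items(), key=lambda item: item[1], reverse=True)
    return [crm[customer_id] for customer_id, _ in ranked[:top_n] if customer_id in crm]


def normalize_and_hash(email):
    """Trim and lowercase an email, then return its SHA-256 hex digest."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Illustrative data only.
predictions = {"c1": 120.0, "c2": 15.5, "c3": 300.0, "c4": 42.0}
crm = {
    "c1": "Ana@example.com",
    "c2": "bo@example.com",
    "c3": " cy@example.com ",
    "c4": "di@example.com",
}
emails = top_customer_emails(predictions, crm, top_n=2)
hashed = [normalize_and_hash(email) for email in emails]
print(emails)
```

Normalizing before hashing matters because two spellings of the same address would otherwise produce different digests.
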
## Recommended flow
1. Do research in the Notebook.
1. Extract important SQL.
1. Write SQL scripts.
1. Test end-to-end flow through bash scripts.
1. Integrate into a data pipeline.
1. Run as part of a CI/CD pipeline.
This code covers steps 1 to 4.
## Run code
After you have gone through the notebook, you can run all the steps at once using the [run.sh script][run_script].
1. If you use your own sales table, update the [matching query][matching_query] to transform your table into a table with a schema that the script understands.
1. Make sure that the run.sh script is executable:
```chmod +x run.sh```
1. Check how to set parameters:
```./run.sh --help```
1. Run the script:
```./run.sh --project-id [YOUR_PROJECT_ID] --dataset-id [YOUR_DATASET_ID]```
## Questions? Feedback?
If you have any questions or feedback, please open up a [new issue](https://github.com/GoogleCloudPlatform/analytics-componentized-patterns/issues).
## Disclaimer
This is not an officially supported Google product.
All files in this folder are under the Apache License, Version 2.0 unless noted otherwise.
[run_script]:./scripts/run.sh
[matching_query]:./scripts/10_procedure_match.sql
================================================
FILE: retail/ltv/bqml/notebooks/bqml_automl_ltv_activate_lookalike.ipynb
================================================
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "bqml_automl_ltv_activate_lookalike.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "code",
"metadata": {
"id": "ur8xi4C7S06n",
"colab_type": "code",
"colab": {}
},
"source": [
"# Copyright 2020 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "JAPoU8Sm5E6e",
"colab_type": "text"
},
"source": [
"<table align=\"left\">\n",
" <td>\n",
" <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/ai-platform-samples/blob/master/notebooks/templates/ai_platform_notebooks_template_hybrid.ipynb\"\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
" </a>\n",
" </td>\n",
" <td>\n",
" <a href=\"https://github.com/GoogleCloudPlatform/ai-platform-samples/blob/master/notebooks/templates/ai_platform_notebooks_template_hybrid.ipynb\">\n",
" <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n",
" View on GitHub\n",
" </a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tvgnzT1CKxrO",
"colab_type": "text"
},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a2VgGzVIlLo2",
"colab_type": "text"
},
"source": [
"### Objective\n",
"\n",
"Estimate how much an existing customer will spend in the future based on their historical orders to find similar new customers using lookalike features of advertising tools.\n",
"\n",
"In this tutorial, you will:\n",
"- Define how far in the future you want to predict the monetary value of your customers (ex: 3 months)\n",
"- Use a moving session concept to aggregate multiple inputs and targets per customer (For more details, see the *Create inputs and targets* section of this tutorial).\n",
"- Use primarly inputs such as Recency, Frequency and Monetary which are common values to use in an LTV context, especially in statistical model due to their distribution patterns.\n",
"- Accelerate model development by using AutoML from within BigQuery ML.\n",
"- Predict the monetary value of all existing customers for a predefined period of time in the future.\n",
"- Use first-party data to extract the most valuable customers email in order to run lookalike campaigns using the [Google Ads API][ads_api]. You can extend the concept to do the same using [Facebook API][fb_api].\n",
"\n",
"[ads_api]:https://developers.google.com/adwords/api/docs/samples/python/remarketing#create-and-populate-a-user-list\n",
"[fb_api]:https://www.facebook.com/business/help/341425252616329?id=2469097953376494\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lAeYu9bBlLaZ",
"colab_type": "text"
},
"source": [
"### Sales dataset\n",
"Sales data can be of different forms but generally look like a list of transactions where each record contains at a minimum the following:\n",
"- a customer reference\n",
"- a transaction date\n",
"- a transaction reference\n",
"- a monetary value\n",
"\n",
"Each record usually represents one of the following:\n",
"- An entire order which contains aggregated values across products for that order. You can find the total order value in the record.\n",
"- A part of a transaction which contains a unique product, some of its characteristics including SKU and unit price and the quantity ordered. \n",
"\n",
"This tutorial uses the latter.\n",
"\n",
"You can run this tutorial with your own dataset. The dataset that you provide must meet the following requirements:\n",
"1. Each row represents a transaction related to a product item and linked to a transaction, a date and a customer. A row can be either a transaction (quantity is > 0) or a return (quantity is < 0)\n",
"1. Columns must include the following:\n",
"\n",
"| Field name | Type | Description |\n",
"| :-|:-|:-|\n",
"| customer_id | STRING | First party identifier of the customer.\t |\n",
"| order_id | STRING | First party identitfier of the order. |\n",
"| order_date | DATE | Date of the order. |\n",
"| product_sku | STRING | First party identitifer of the product. |\n",
"| qty | INTEGER | Quantity of the product either ordered or returned. |\n",
"| unit_price | FLOAT | Unit price of the product. |"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "__KXS_qcyNRK",
"colab_type": "text"
},
"source": [
"### Customer dataset\n",
"Customer data often resides in a Customer Relationship Management (CRM). \n",
"\n",
"This first party data is key for companies that want to provide a certain level of customer service.\n",
"\n",
"This tutorial only uses two columns of the customer dataset:\n",
"- customer_id to join with the sales data\n",
"- email to create a marketing list.\n",
"\n",
"The public dataset contains other fields that are not relevant for this tutorial and your data might have other fields. This tutorial focuses on an activation based on email addresses."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LdyWWnIElL0C",
"colab_type": "text"
},
"source": [
"### Costs \n",
"\n",
"This tutorial uses billable components of Google Cloud Platform (GCP):\n",
"\n",
"* BigQuery\n",
"* BigQuery ML\n",
"* Cloud Storage\n",
"\n",
"To learn more about pricing:\n",
"- Read [BigQuery pricing](https://cloud.google.com/bigquery/docs/pricing)\n",
"- Read [BigQuery ML pricing](https://cloud.google.com/bigquery-ml/pricing)\n",
"- Read [Cloud Storage pricing](https://cloud.google.com/storage/pricing)\n",
"- Use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n",
"to generate a cost estimate based on your projected usage."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UDn9SREOlXaN",
"colab_type": "text"
},
"source": [
"### Terminology\n",
"- **'Input' transactions**: The set of transactions that the training task uses to create inputs values for the model.\n",
"- **'Target' transactions**: The set of transactions that the training task uses to create the target value to predict. The target value is an aggregated monetary value per customer for a defined timeline.\n",
"- **Threshold date**: Date that separates 'Input' transactions from 'Target' transactions per customer."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fmvu_N0ovY8N",
"colab_type": "text"
},
"source": [
"## Setup\n",
"This step sets up packages, variables, authentication, APIs clients and resources for Google Cloud and Adwords."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "i7EUnXsZhAGF",
"colab_type": "text"
},
"source": [
"### Install packages and dependencies\n",
"Installs libraries, packages and dependencies to run this tutorial"
]
},
{
"cell_type": "code",
"metadata": {
"id": "wyy5Lbnzg5fi",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "f5fa8b23-a707-4480-a170-061eaf260c01"
},
"source": [
"# Install libraries. \n",
"# The magic cells insures that those libraries can be part of a custom container\n",
"# if moving the code somewhere else.\n",
"%pip install -q googleads\n",
"%pip install -q -U kfp matplotlib Faker --user\n",
"\n",
"# Automatically restart kernel after installs\n",
"# import IPython\n",
"# app = IPython.Application.instance()\n",
"# app.kernel.do_shutdown(True) "
],
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": [
"\u001b[K |████████████████████████████████| 51kB 2.5MB/s \n",
"\u001b[K |████████████████████████████████| 276kB 14.1MB/s \n",
"\u001b[K |████████████████████████████████| 102kB 8.7MB/s \n",
"\u001b[K |████████████████████████████████| 51kB 5.5MB/s \n",
"\u001b[K |████████████████████████████████| 61kB 6.5MB/s \n",
"\u001b[?25h Building wheel for googleads (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for PyYAML (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"\u001b[K |████████████████████████████████| 122kB 4.7MB/s \n",
"\u001b[K |████████████████████████████████| 11.6MB 39.6MB/s \n",
"\u001b[K |████████████████████████████████| 1.0MB 45.3MB/s \n",
"\u001b[K |████████████████████████████████| 1.5MB 48.8MB/s \n",
"\u001b[K |████████████████████████████████| 61kB 6.7MB/s \n",
"\u001b[K |████████████████████████████████| 61kB 6.4MB/s \n",
"\u001b[K |████████████████████████████████| 204kB 45.9MB/s \n",
"\u001b[?25h Building wheel for kfp (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for kfp-server-api (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for strip-hints (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"\u001b[31mERROR: nbclient 0.5.0 has requirement jupyter-client>=6.1.5, but you'll have jupyter-client 5.3.5 which is incompatible.\u001b[0m\n",
"\u001b[31mERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.\u001b[0m\n",
"\u001b[33m WARNING: The script jsonschema is installed in '/root/.local/bin' which is not on PATH.\n",
" Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\n",
"\u001b[33m WARNING: The script strip-hints is installed in '/root/.local/bin' which is not on PATH.\n",
" Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\n",
"\u001b[33m WARNING: The scripts dsl-compile and kfp are installed in '/root/.local/bin' which is not on PATH.\n",
" Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\n",
"\u001b[33m WARNING: The script faker is installed in '/root/.local/bin' which is not on PATH.\n",
" Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vmSgVQ9x4aO0",
"colab_type": "text"
},
"source": [
"### Import packages"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ouL1aMrvVgnL",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "e9eac7e7-a422-4353-d06a-d8a69ec013a7"
},
"source": [
"# Import\n",
"from __future__ import absolute_import\n",
"from __future__ import division\n",
"from __future__ import print_function\n",
"\n",
"import os, json, random\n",
"import hashlib, uuid\n",
"import time, calendar, math\n",
"import pandas as pd, numpy as np\n",
"import matplotlib.pyplot as plt, seaborn as sns\n",
"from datetime import datetime\n",
"from google.cloud import bigquery\n",
"\n",
"from googleads import adwords"
],
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.\n",
" import pandas.util.testing as tm\n"
],
"name": "stderr"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BF1j6f9HApxa",
"colab_type": "text"
},
"source": [
"### Set up your GCP project\n",
"\n",
"**The following steps are required, regardless of your notebook environment.**\n",
"\n",
"1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
"\n",
"2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
"\n",
"3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n",
"\n",
"4. If you are running this notebook locally, you will need to install [Google Cloud SDK](https://cloud.google.com/sdk).\n",
"\n",
"5. Enter your project ID in the cell below. Then run the cell to make sure the\n",
"Cloud SDK uses the right project for all the commands in this notebook."
]
},
{
"cell_type": "code",
"metadata": {
"id": "oM1iC_MfAts1",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "fd734b46-80c3-4149-e48d-1b2d26ad0d6b"
},
"source": [
"PROJECT_ID = \"[YOUR-PROJECT]\" #@param {type:\"string\"}\n",
"REGION = \"US\"\n",
"! gcloud config set project $PROJECT_ID"
],
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": [
"Updated property [core/project].\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dr--iN2kAylZ",
"colab_type": "text"
},
"source": [
"### Authenticate your GCP account\n",
"If you are using AI Platform Notebooks, you are already authenticated so there is no need to run this step."
]
},
{
"cell_type": "code",
"metadata": {
"id": "PyQmSRbKA8r-",
"colab_type": "code",
"colab": {}
},
"source": [
"import sys\n",
"\n",
"if 'google.colab' in sys.modules:\n",
" from google.colab import auth as google_auth\n",
" google_auth.authenticate_user()"
],
"execution_count": 4,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "T4MuyozBoPv9",
"colab_type": "text"
},
"source": [
"### Create a working dataset\n",
"This tutorial mostly uses BigQuery magic cells where the --params field does not support variables for datasets, tables and column names. \n",
"\n",
"This steps hardcode the dataset where all the steps of this tutorial happens."
]
},
{
"cell_type": "code",
"metadata": {
"id": "sGRa9QxXoZMO",
"colab_type": "code",
"colab": {}
},
"source": [
"! bq show $PROJECT_ID:ltv_ecommerce || bq mk $PROJECT_ID:ltv_ecommerce\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "9i6ZRZgqCEPY",
"colab_type": "text"
},
"source": [
"### Load example tables"
]
},
{
"cell_type": "code",
"metadata": {
"id": "pjc30TarCHSp",
"colab_type": "code",
"colab": {}
},
"source": [
"# Loads CRM data\n",
"!bq load \\\n",
" --project_id $PROJECT_ID \\\n",
" --skip_leading_rows 1 \\\n",
" --max_bad_records 100000 \\\n",
" --replace \\\n",
" --field_delimiter \",\" \\\n",
" --autodetect \\\n",
" ltv_ecommerce.00_crm \\\n",
" gs://solutions-public-assets/analytics-componentized-patterns/ltv/crm.csv"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Fe4GT-fWEcUt",
"colab_type": "code",
"colab": {}
},
"source": [
"# Loads Sales data\n",
"!bq load \\\n",
" --project_id $PROJECT_ID \\\n",
" --skip_leading_rows 1 \\\n",
" --max_bad_records 100000 \\\n",
" --replace \\\n",
" --field_delimiter \",\" \\\n",
" --autodetect \\\n",
" ltv_ecommerce.10_orders \\\n",
" gs://solutions-public-assets/analytics-componentized-patterns/ltv/sales_*"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "EVNkkGV8jUN6",
"colab_type": "text"
},
"source": [
"### Create clients"
]
},
{
"cell_type": "code",
"metadata": {
"id": "amZzWHkjjWTi",
"colab_type": "code",
"colab": {}
},
"source": [
"# BigQuery client\n",
"bq_client = bigquery.Client(project=PROJECT_ID)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "TJB0PYUS7sYw",
"colab_type": "text"
},
"source": [
"## [Optional] Match your dataset to template\n",
"If you use the example data, you can skip this step.\n",
"\n",
"This tutorial assumes that you have a dump of your sales data already available in BigQuery located at `[YOUR_PROJECT].[YOUR_DATASET].[YOUR_SOURCE_TABLE]`\n",
"\n",
"You are free to adapt the SQL query in the next cell to a SQL statement that transforms your data according to the template."
]
},
{
"cell_type": "code",
"metadata": {
"id": "NK_wCQt78JjO",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 32
},
"outputId": "b1f6d49d-3d26-46e0-d753-f5b1e55f1a65"
},
"source": [
"%%bigquery --params $MATCH_FIELDS --project $PROJECT_ID\n",
"\n",
"CREATE OR REPLACE TABLE `ltv_ecommerce.10_orders` AS (\n",
"SELECT\n",
" CAST(customer_id AS STRING) AS customer_id,\n",
" order_id AS order_id,\n",
" transaction_date AS transaction_date,\n",
" product_sku AS product_sku,\n",
" qty AS qty,\n",
" unit_price AS unit_price\n",
"FROM\n",
" `[YOUR_PROJECT].[YOUR_DATASET].[YOUR_SOURCE_TABLE]`\n",
");"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"metadata": {
"tags": []
},
"execution_count": 32
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T9q6jKCM4-v8",
"colab_type": "text"
},
"source": [
"## Analyze dataset\n",
"\n",
"**Some charts might use a log scale.**\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LEchqKz1HiOZ",
"colab_type": "text"
},
"source": [
"#### Quantity\n",
"This sections shows how to use the BigQuery [ML BUCKETIZE][bucketize] preprocessing function to create buckets of data for quantity and display a log scaled distribution of the `qty` field.\n",
"\n",
"[bucketize]:https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-preprocessing-functions#bucketize"
]
},
{
"cell_type": "code",
"metadata": {
"id": "keis4kuj5A71",
"colab_type": "code",
"colab": {}
},
"source": [
"%%bigquery df_histo_qty --project $PROJECT_ID\n",
"\n",
"WITH\n",
" min_max AS (\n",
" SELECT\n",
" MIN(qty) min_qty,\n",
" MAX(qty) max_qty,\n",
" CEIL((MAX(qty) - MIN(qty)) / 100) step\n",
" FROM\n",
" `ltv_ecommerce.10_orders` \n",
")\n",
"SELECT\n",
" COUNT(1) c,\n",
" bucket_same_size AS bucket\n",
"FROM (\n",
" SELECT\n",
" -- Creates (1000-100)/100 + 1 buckets of data.\n",
" ML.BUCKETIZE(qty, GENERATE_ARRAY(min_qty, max_qty, step)) AS bucket_same_size,\n",
" -- Creates custom ranges.\n",
" ML.BUCKETIZE(qty, [-1, -1, -2, -3, -4, -5, 0, 1, 2, 3, 4, 5]) AS bucket_specific,\n",
" FROM\n",
" `ltv_ecommerce.10_orders`, min_max )\n",
" # WHERE bucket != \"bin_1\" and bucket != \"bin_2\"\n",
"GROUP BY\n",
" bucket\n",
" -- Ohterwise, orders bin_10 before bin_2\n",
"ORDER BY CAST(SPLIT(bucket, \"_\")[OFFSET(1)] AS INT64)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "O6JqHipXDPto",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 307
},
"outputId": "33d8e932-c3f9-4b1f-d3df-fcf8355bcb42"
},
"source": [
"# Uses a log scale for bucket_same_size.\n",
"# Can remove the log scale when using bucket_specific.\n",
"plt.figure(figsize=(12,5))\n",
"plt.title('Log scaled distribution for qty')\n",
"hqty = sns.barplot( x='bucket', y='c', data=df_histo_qty)\n",
"hqty.set_yscale(\"log\")\n"
],
"execution_count": null,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAtMAAAFOCAYAAABE5JExAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAeZElEQVR4nO3de5hkdX3n8fdHRjSKjijECzCOySAy8W4HTUwiMV7AdWS9ISNeMMjEC8k+a5KNCRpRQ9xo3M1FohldBG8goJud0UlYV0WUQMLgLcAscUJABlSQywjRqOB3/6gza9FO99Scruo6p+f9ep56puqcU7/61nequz99+nfOSVUhSZIkaffdbdoFSJIkSX1lmJYkSZJaMkxLkiRJLRmmJUmSpJYM05IkSVJLhmlJkiSpJcO0JC1QkuOSfGExnpvk6iRPa+7/QZL3tXndOca+PcnPNPdPT/JHYxz7PUneOK7xhsZNkvcnuSXJP457fEnaFcO0pE4ZDouaX1X9cVW9clfbJTk/yS63q6p9quqqhda1s18QqupVVfXWhY69E78EPB04sKoOm8D4d5Hk5CQfmvTrSOoPw7Qk7eGSLJt2DQvwUODqqvq33X1iz9+3pI4wTEvqhST3SPJnSa5vbn+W5B5D6/9Lkm80616ZpJKsmmOs45JcleS2JP+a5NihdSck2dKsuyLJ45vlr0/yL0PLnztPrY9I8qkkNye5MsnRQ+sekGRDku800xJ+dhfv+6VJrklyU5KTZq37/3tJk9wzyYea7W5NckmSByY5Bfhl4F3NNI53NdtXktcm+RrwtaFlwz3br3kftyX5XJKHNtutbLZdNlTL+U3fDwXeA/xC83q3NuvvMm2k6fPWpkcbkjxkaF0leVWSrzXv5dQk2UlvjgfeN/Rabx5x7Lu87131fMdfS5IcAfwB8KLm9b6S5IVJLp31/Ncl+V9z/JdKWmIM05L64iTgScBjgccAhwFvAGhCzuuApwGrgMPnGiTJvYG/AI6sqvsAvwh8uVn3QuBk4GXAfYHnADc1T/0XBqF0OfBm4ENJHjzH+J8CPgL8NHAM8FdJVjebnAr8O/Bg4Neb21y1rgbeDbwUeAjwAODAOTZ/eVPbQc12rwK+V1UnAZ8HTmymcZw49Jz/CDwRWD17sMaxwFuB/Rj06MNz1bpDVW1pXvui5vXut5P39VTgbcDRDPpwDXDWrM2eDfw88Ohmu2fu5LX+x6zXetOIY8/5vufreVX9HfDHwEeb13sMsAF4WPNLxA4vBT6w0wZJWnIM05L64ljgLVV1Q1XdyCDQvrRZdzTw/qq6vKq+yyAQz+dHwCOT/FRVfaOqLm+WvxJ4e1VdUgNbq+oagKo6p6qur6ofVdVHGezV3Nkc3WczmHbw/qq6o6q+BHwMeGGSvYDnA39YVf9WVZcBZ8xT5wuAT1TVBVX1feCNTe0780MGwW9VVd1ZVZdW1Xd20Ye3VdXNVfW9OdZ/cui1T2KwB/igXYw5imOB06rqi83Yv9+MvXJom/9aVbdW1deBzzL4JWpcY8/3vnen5zTbfBR4CUCSnwNWAp8YsV5JPWeYltQXD2Gwl3GHa5plO9ZdO7Ru+P5dNHNrX8Rgj+Y3knwyySOa1Qcx2AP9E5K8LMmXm2kHtwKPZLDHdraHAk/csV2z7bHAg4D9gWWz6rtmJ2PscJf31dR+0xzbfhA4Dzgrg6kub09y93nGhnn6NHt9Vd0O3MyPe74Qd/m/bMa+CThgaJtvDt3/LrDPGMee733vTs93OAN4cTMV5aXA2U3IlrQHMExL6ovrGQTVHVY0ywC+wV2nP8y797SqzquqpzOYBvB/gfc2q65lJ3OYm7nC7wVOBB7QTF24DPiJebzNGJ+rqvsN3fapqlcDNwJ3zKpvxTylfmN42yT3YrD3eWfv6YdV9eaqWs1g6sqzGUxXAag5xp9r+Q7Dr70PcH8GPd9xsN+9hrZ90G
6Me5f/y2ZqzAOA63bxvFGMMvZ89e2q5z/x3Kq6GPgBg2lAL2bwi42kPYRhWlIX3b05oG7HbRlwJvCGJPsn2Q/4Q2DHKcrOBl6R5NAm/Mx5PuPmoLyjmpD1feB2fvxn/PcBv5PkCRlY1QTpezMIUTc2Y7yCwZ7pnfkE8PDmILa7N7efT3JoVd0JfBw4Ocm9mvm5L5+nD+cCz07yS0n2Bt7CHN+3k/xqkkc1U0m+w2Dax4739S3gZ+Z5nbk8a+i13wpcXFXXNtNsrgNekmSvJL/OXX8J+RZwYPO8nTmTwf/XYzM4iPSPgX+oqqtb1DjusXfV828BK5PM/n/4APAu4IdV1eqc45L6yTAtqYs2Ad8bup0M/BGwGfgq8E/AF5tlVNXfMjio8LPAVuDiZpyd/an9bgwOVryewbSFpwCvbsY5BziFwcGDtwF/A9y/qq4A3glcxCBMPQq4cGeFV9VtwDMYHHh4PYPpCn8C7DjzyIkMpix8EzgdeP9cTWjmcr+2qecbwC3Atjk2fxCDIPgdYAvwOX68h/TPgRdkcGGTv5jr9XbiI8CbGPTpCTTzghsnAL/LYArEzwF/P7TuM8DlwDeTfHsn7+v/MPiF52PN+/pZBv1asIWOPULPz2n+vSnJF4eWf5DBL1ieg1raw6RqV3+Nk6R+ac6scBlwj6q6Y9r1qN+SXA28sgnqc23zU8ANwOOraqen3JO0NLlnWtKSkOS5GZyLel8Ge4I3GqS1iF4NXGKQlvY8Xv1J0lLxGwymTdzJYIrDa6ZajfYYzZ7rMDh/taQ9jNM8JEmSpJac5iFJkiS1ZJiWJEmSWur1nOn99tuvVq5cOe0yJEmStMRdeuml366q/Wcv73WYXrlyJZs3b552GZIkSVriklyzs+W9nOaRZE2S9du3b592KZIkSdqD9TJMV9XGqlq3fPnyaZciSZKkPVgvw7QkSZLUBYZpSZIkqSXDtCRJktSSYVqSJElqyTAtSZIktWSYliRJkloyTEuSJEkt9TJMe9EWSZIkdUEvLydeVRuBjTMzMydMuxZJautdv71x2iUsqhPfuWbaJUjS2PVyz7QkSZLUBYZpSZIkqSXDtCRJktSSYVqSJElqyTAtSZIktWSYliRJkloyTEuSJEktGaYlSZKklgzTkiRJUkuduQJikl8GjmVQ0+qq+sUplyRJkiTNa6J7ppOcluSGJJfNWn5EkiuTbE3yeoCq+nxVvQr4BHDGJOuSJEmSxmHS0zxOB44YXpBkL+BU4EhgNbA2yeqhTV4MfGTCdUmSJEkLNtEwXVUXADfPWnwYsLWqrqqqHwBnAUcBJFkBbK+q2yZZlyRJkjQO0zgA8QDg2qHH25plAMcD75/vyUnWJdmcZPONN944oRIlSZKkXevU2Tyq6k1V9fe72GZ9Vc1U1cz++++/WKVJkiRJP2EaYfo64KChxwc2y0aWZE2S9du3bx9rYZIkSdLumEaYvgQ4OMnDkuwNHANs2J0BqmpjVa1bvnz5RAqUJEmSRjHpU+OdCVwEHJJkW5Ljq+oO4ETgPGALcHZVXT7JOiRJkqRJmOhFW6pq7RzLNwGb2o6bZA2wZtWqVW2HkCRJkhasUwcgjsppHpIkSeqCXoZpSZIkqQt6GaY9m4ckSZK6oJdh2mkekiRJ6oJehmlJkiSpC3oZpp3mIUmSpC7oZZh2mockSZK6oJdhWpIkSeoCw7QkSZLUUi/DtHOmJUmS1AW9DNPOmZYkSVIX9DJMS5IkSV1gmJYkSZJaMkxLkiRJLfUyTHsAoiRJkrqgl2HaAxAlSZLUBb0M05IkSVIXGKYlSZKklgzTkiRJUkuGaUmSJKmlXoZpz+YhSZKkLuhlmPZsHpIkSeqCXoZpSZIkqQsM05IkSVJLhmlJkiSpJcO0JEmS1JJhWpIkSWrJMC1JkiS1ZJiWJEmSWuplmPaiLZIkSeqCXoZpL9oiSZKkLlg27QIkLT2f+5WnTLuERfOUCz437RIkSVPUyz3TkiRJUh
cYpiVJkqSWDNOSJElSS4ZpSZIkqSXDtCRJktSSYVqSJElqyTAtSZIktWSYliRJklrqzEVbktwNeCtwX2BzVZ0x5ZIkSZKkeU10z3SS05LckOSyWcuPSHJlkq1JXt8sPgo4EPghsG2SdUmSJEnjMOlpHqcDRwwvSLIXcCpwJLAaWJtkNXAI8PdV9Trg1ROuS5IkSVqwiYbpqroAuHnW4sOArVV1VVX9ADiLwV7pbcAtzTZ3TrIuSZIkaRymcQDiAcC1Q4+3Ncs+DjwzyV8CF8z15CTrkmxOsvnGG2+cbKWSJEnSPDpzAGJVfRc4foTt1gPrAWZmZmrSdUmSJElzmcae6euAg4YeH9gskyRJknplGmH6EuDgJA9LsjdwDLBhdwZIsibJ+u3bt0+kQEmSJGkUkz413pnARcAhSbYlOb6q7gBOBM4DtgBnV9XluzNuVW2sqnXLly8ff9GSJEnSiCY6Z7qq1s6xfBOwqe24SdYAa1atWtV2CEmSJGnBenk5cfdMS5IkqQt6GaYlSZKkLuhlmPYAREmSJHVBL8O00zwkSZLUBb0M05IkSVIX9DJMO81DkiRJXdDLMO00D0mSJHVBL8O0JEmS1AWGaUmSJKmlXoZp50xLkiSpC3oZpp0zLUmSpC7oZZiWJEmSusAwLUmSJLVkmJYkSZJa6mWY9gBESZIkdUEvw7QHIEqSJKkLehmmJUmSpC4wTEuSJEktGaYlSZKklgzTkiRJUku9DNOezUOSJEld0Msw7dk8JEmS1AW9DNOSJElSFximJUmSpJYM05IkSVJLhmlJkiSpJcO0JEmS1JJhWpIkSWrJMC1JkiS11Msw7UVbJEmS1AW9DNNetEWSJEld0MswLUmSJHWBYVqSJElqyTAtSZIktWSYliRJkloyTEuSJEktGaYlSZKklgzTkiRJUkuGaUmSJKklw7QkSZLUUmfCdJLDk3w+yXuSHD7teiRJkqRdmWiYTnJakhuSXDZr+RFJrkyyNcnrm8UF3A7cE9g2ybokSZKkcZj0nunTgSOGFyTZCzgVOBJYDaxNshr4fFUdCfwe8OYJ1yVJkiQt2ETDdFVdANw8a/FhwNaquqqqfgCcBRxVVT9q1t8C3GOSdUmSJEnjsGwKr3kAcO3Q423AE5M8D3gmcD/gXXM9Ock6YB3AihUrJlimJEmSNL9phOmdqq
SYMBOL INDEX (67 symbols across 16 files)
FILE: retail/recommendation-system/bqml-mlops/part_2/pipeline.py
function run_bigquery_ddl (line 6) | def run_bigquery_ddl(project_id: str, query_string: str, location: str) ...
function train_matrix_factorization_model (line 35) | def train_matrix_factorization_model(ddlop, project_id: str, dataset: str):
function evaluate_matrix_factorization_model (line 56) | def evaluate_matrix_factorization_model(project_id:str, mf_model:str, lo...
function create_user_features (line 74) | def create_user_features(ddlop, project_id:str, dataset:str, mf_model:str):
function create_hotel_features (line 99) | def create_hotel_features(ddlop, project_id:str, dataset:str, mf_model:s...
function combine_features (line 124) | def combine_features(ddlop, project_id:str, dataset:str, mf_model:str, h...
function train_xgboost_model (line 146) | def train_xgboost_model(ddlop, project_id:str, dataset:str, total_featur...
function evaluate_class (line 162) | def evaluate_class(project_id:str, dataset:str, class_model:str, total_f...
function export_bqml_model (line 187) | def export_bqml_model(project_id:str, model:str, destination:str) -> Nam...
function training_pipeline (line 205) | def training_pipeline(project_id:str, dataset_name:str, model_storage:st...
function main (line 266) | def main(**args):
FILE: retail/recommendation-system/bqml-scann/ann_grpc/match_pb2_grpc.py
class MatchServiceStub (line 8) | class MatchServiceStub(object):
method __init__ (line 13) | def __init__(self, channel):
class MatchServiceServicer (line 31) | class MatchServiceServicer(object):
method Match (line 36) | def Match(self, request, context):
method BatchMatch (line 44) | def BatchMatch(self, request, context):
function add_MatchServiceServicer_to_server (line 53) | def add_MatchServiceServicer_to_server(servicer, server):
class MatchService (line 72) | class MatchService(object):
method Match (line 78) | def Match(request,
method BatchMatch (line 95) | def BatchMatch(request,
FILE: retail/recommendation-system/bqml-scann/embeddings_exporter/pipeline.py
function get_query (line 21) | def get_query(dataset_name, table_name):
function to_csv (line 32) | def to_csv(entry):
function run (line 40) | def run(bq_dataset_name, embeddings_table_name, output_dir, pipeline_args):
FILE: retail/recommendation-system/bqml-scann/embeddings_exporter/runner.py
function get_args (line 22) | def get_args(argv):
function main (line 41) | def main(argv=None):
FILE: retail/recommendation-system/bqml-scann/embeddings_lookup/lookup_creator.py
class EmbeddingLookup (line 22) | class EmbeddingLookup(tf.keras.Model):
method __init__ (line 24) | def __init__(self, embedding_files_prefix, **kwargs):
method __call__ (line 64) | def __call__(self, inputs):
function export_saved_model (line 77) | def export_saved_model(embedding_files_path, model_output_dir):
FILE: retail/recommendation-system/bqml-scann/index_builder/builder/indexer.py
function load_embeddings (line 31) | def load_embeddings(embedding_files_pattern):
function build_index (line 56) | def build_index(embeddings, num_leaves):
function save_index (line 78) | def save_index(index, tokens, output_dir):
function build (line 93) | def build(embedding_files_pattern, output_dir, num_leaves=None):
FILE: retail/recommendation-system/bqml-scann/index_builder/builder/task.py
function get_args (line 18) | def get_args():
function main (line 49) | def main():
FILE: retail/recommendation-system/bqml-scann/index_server/lookup.py
class EmbeddingLookup (line 19) | class EmbeddingLookup(object):
method __init__ (line 21) | def __init__(self, project, region, model_name, version):
method lookup (line 29) | def lookup(self, instances):
FILE: retail/recommendation-system/bqml-scann/index_server/main.py
function health (line 39) | def health(model, version):
function predict (line 44) | def predict(model, version):
function validate_request (line 68) | def validate_request(query, show):
FILE: retail/recommendation-system/bqml-scann/index_server/matching.py
class ScaNNMatcher (line 24) | class ScaNNMatcher(object):
method __init__ (line 26) | def __init__(self, index_dir):
method match (line 35) | def match(self, vector, num_matches=10):
FILE: retail/recommendation-system/bqml-scann/tfx_pipeline/bq_components.py
function compute_pmi (line 33) | def compute_pmi(
function train_item_matching_model (line 66) | def train_item_matching_model(
function extract_embeddings (line 96) | def extract_embeddings(
FILE: retail/recommendation-system/bqml-scann/tfx_pipeline/item_matcher.py
class ScaNNMatcher (line 26) | class ScaNNMatcher(object):
method __init__ (line 28) | def __init__(self, index_dir):
method match (line 37) | def match(self, vector, num_matches=10):
class ExactMatcher (line 45) | class ExactMatcher(object):
method __init__ (line 47) | def __init__(self, embeddings, tokens):
method match (line 53) | def match(self, vector, num_matches=10):
FILE: retail/recommendation-system/bqml-scann/tfx_pipeline/lookup_creator.py
class EmbeddingLookup (line 24) | class EmbeddingLookup(tf.keras.Model):
method __init__ (line 26) | def __init__(self, embedding_files_prefix, schema_file_path, **kwargs):
method __call__ (line 75) | def __call__(self, inputs):
function run_fn (line 89) | def run_fn(params):
FILE: retail/recommendation-system/bqml-scann/tfx_pipeline/pipeline.py
function create_pipeline (line 44) | def create_pipeline(pipeline_name: Text,
FILE: retail/recommendation-system/bqml-scann/tfx_pipeline/scann_evaluator.py
class IndexEvaluatorSpec (line 52) | class IndexEvaluatorSpec(tfx.types.ComponentSpec):
class ScaNNIndexEvaluatorExecutor (line 71) | class ScaNNIndexEvaluatorExecutor(base_executor.BaseExecutor):
method Do (line 73) | def Do(self,
class IndexEvaluator (line 172) | class IndexEvaluator(base_component.BaseComponent):
method __init__ (line 177) | def __init__(self,
FILE: retail/recommendation-system/bqml-scann/tfx_pipeline/scann_indexer.py
function load_embeddings (line 36) | def load_embeddings(embedding_files_pattern, schema_file_path):
function build_index (line 71) | def build_index(embeddings, num_leaves):
function save_index (line 91) | def save_index(index, tokens, output_dir):
function run_fn (line 108) | def run_fn(params):
Condensed preview — 83 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (4,578K chars).
[
{
"path": ".gitignore",
"chars": 140,
"preview": ".ipynb_checkpoints/\n.DS_Store\n.vscode/\n**/*.cpython-37..pyc\n**/*.sqllite\n**/*.tar.gz\nretail/recommendation-system/bqml-s"
},
{
"path": "CONTRIBUTING.md",
"chars": 1097,
"preview": "# How to Contribute\n\nWe'd love to accept your patches and contributions to this project. There are\njust a few small guid"
},
{
"path": "LICENSE",
"chars": 11358,
"preview": "\n Apache License\n Version 2.0, January 2004\n "
},
{
"path": "README.md",
"chars": 4233,
"preview": "> [!NOTE]\n> This repository has been archived and is no longer actively maintained.\n\n[;\nyou may not us"
},
{
"path": "gaming/propensity-model/bqml/bqml_ga4_gaming_propensity_to_churn.ipynb",
"chars": 59304,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {\n \"id\": \"XM4xjzQNzHwz\"\n },\n "
},
{
"path": "retail/clustering/bqml/README.md",
"chars": 702,
"preview": "A common marketing analytics challenge is to understand consumer behavior and develop customer attributes or archetypes."
},
{
"path": "retail/clustering/bqml/bqml_scaled_clustering.ipynb",
"chars": 85854,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": 1,\n \"metadata\": {\n \"colab\": {},\n \"colab_type\": \""
},
{
"path": "retail/ltv/bqml/README.md",
"chars": 3695,
"preview": "# Activate on LTV predictions.\nThis guide refactors the [final part][series_final] of an existing series about predictin"
},
{
"path": "retail/ltv/bqml/notebooks/bqml_automl_ltv_activate_lookalike.ipynb",
"chars": 317671,
"preview": "{\n \"nbformat\": 4,\n \"nbformat_minor\": 0,\n \"metadata\": {\n \"colab\": {\n \"name\": \"bqml_automl_ltv_activate_lookali"
},
{
"path": "retail/ltv/bqml/scripts/00_procedure_persist.sql",
"chars": 878,
"preview": "-- Copyright 2020 Google LLC\n--\n-- Licensed under the Apache License, Version 2.0 (the \"License\");\n-- you may not use th"
},
{
"path": "retail/ltv/bqml/scripts/10_procedure_match.sql",
"chars": 1056,
"preview": "-- Copyright 2020 Google LLC\n--\n-- Licensed under the Apache License, Version 2.0 (the \"License\");\n-- you may not use th"
},
{
"path": "retail/ltv/bqml/scripts/20_procedure_prepare.sql",
"chars": 7218,
"preview": "-- Copyright 2020 Google LLC\n--\n-- Licensed under the Apache License, Version 2.0 (the \"License\");\n-- you may not use th"
},
{
"path": "retail/ltv/bqml/scripts/30_procedure_train.sql",
"chars": 990,
"preview": "-- Copyright 2020 Google LLC\n--\n-- Licensed under the Apache License, Version 2.0 (the \"License\");\n-- you may not use th"
},
{
"path": "retail/ltv/bqml/scripts/40_procedure_predict.sql",
"chars": 2750,
"preview": "-- Copyright 2020 Google LLC\n--\n-- Licensed under the Apache License, Version 2.0 (the \"License\");\n-- you may not use th"
},
{
"path": "retail/ltv/bqml/scripts/50_procedure_top.sql",
"chars": 1590,
"preview": "-- Copyright 2020 Google LLC\n--\n-- Licensed under the Apache License, Version 2.0 (the \"License\");\n-- you may not use th"
},
{
"path": "retail/ltv/bqml/scripts/run.sh",
"chars": 7500,
"preview": "#!/bin/bash\n# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may no"
},
{
"path": "retail/propensity-model/bqml/README.md",
"chars": 8352,
"preview": "## License\n```\nCopyright 2020 Google LLC\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not us"
},
{
"path": "retail/propensity-model/bqml/bqml_kfp_retail_propensity_to_purchase.ipynb",
"chars": 133605,
"preview": "{\n \"nbformat\": 4,\n \"nbformat_minor\": 0,\n \"metadata\": {\n \"kernelspec\": {\n \"display_name\": \"Python 3\",\n \"l"
},
{
"path": "retail/recommendation-system/bqml/README.md",
"chars": 1246,
"preview": "## License\n```\nCopyright 2020 Google LLC\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not us"
},
{
"path": "retail/recommendation-system/bqml/bqml_retail_recommendation_system.ipynb",
"chars": 35659,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {\n \"colab\": {},\n \"colab_type\""
},
{
"path": "retail/recommendation-system/bqml-mlops/README.md",
"chars": 2400,
"preview": "```python\n# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not "
},
{
"path": "retail/recommendation-system/bqml-mlops/dockerfile",
"chars": 156,
"preview": "FROM gcr.io/deeplearning-platform-release/base-cpu \n\nRUN apt-get update -y && apt-get -y install kubectl \n\nRUN python -m"
},
{
"path": "retail/recommendation-system/bqml-mlops/kfp_tutorial.ipynb",
"chars": 34271,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": "
},
{
"path": "retail/recommendation-system/bqml-mlops/part_2/Dockerfile",
"chars": 217,
"preview": "FROM python:3.7\n\n# ensure local python is preferred over distribution python\nENV PATH /usr/local/bin:$PATH\n\nRUN pip3 ins"
},
{
"path": "retail/recommendation-system/bqml-mlops/part_2/README.md",
"chars": 3999,
"preview": "```\n# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use th"
},
{
"path": "retail/recommendation-system/bqml-mlops/part_2/cloudbuild.yaml",
"chars": 893,
"preview": "# Copyright 2020 Google Inc. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# "
},
{
"path": "retail/recommendation-system/bqml-mlops/part_2/dockerbuild.sh",
"chars": 306,
"preview": "export PROJECT_ID=$(gcloud config list project --format \"value(core.project)\")\nexport IMAGE_REPO_NAME=hotel_recommender_"
},
{
"path": "retail/recommendation-system/bqml-mlops/part_2/pipeline.py",
"chars": 12891,
"preview": "from typing import NamedTuple\nimport json\nimport os\nimport fire\n\ndef run_bigquery_ddl(project_id: str, query_string: str"
},
{
"path": "retail/recommendation-system/bqml-mlops/part_3/Dockerfile",
"chars": 679,
"preview": "FROM python:3.7\n\n# Install Scikit-Learn\n# Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4.\n#"
},
{
"path": "retail/recommendation-system/bqml-mlops/part_3/README.md",
"chars": 1614,
"preview": "```python\n# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not "
},
{
"path": "retail/recommendation-system/bqml-mlops/part_3/dockerbuild.sh",
"chars": 305,
"preview": "export PROJECT_ID=$(gcloud config list project --format \"value(core.project)\")\nexport IMAGE_REPO_NAME=hotel_recommender_"
},
{
"path": "retail/recommendation-system/bqml-mlops/part_3/vertex_ai_pipeline.ipynb",
"chars": 35144,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {},\n \"outputs\": [],\n \"source\": "
},
{
"path": "retail/recommendation-system/bqml-scann/.gitignore",
"chars": 65,
"preview": ".ipynb_checkpoints/\n.DS_Store\n.idea/\n*.pyc\n*.egg-info/\nworkspace/"
},
{
"path": "retail/recommendation-system/bqml-scann/00_prep_bq_and_datastore.ipynb",
"chars": 335545,
"preview": "{\"nbformat\":4,\"nbformat_minor\":0,\"metadata\":{\"colab\":{\"name\":\"00_prep_bq_and_datastore.ipynb\",\"provenance\":[],\"collapsed"
},
{
"path": "retail/recommendation-system/bqml-scann/00_prep_bq_procedures.ipynb",
"chars": 7682,
"preview": "{\"nbformat\":4,\"nbformat_minor\":0,\"metadata\":{\"environment\":{\"name\":\"tf2-gpu.2-3.m61\",\"type\":\"gcloud\",\"uri\":\"gcr.io/deepl"
},
{
"path": "retail/recommendation-system/bqml-scann/01_train_bqml_mf_pmi.ipynb",
"chars": 20419,
"preview": "{\"nbformat\":4,\"nbformat_minor\":0,\"metadata\":{\"colab\":{\"name\":\"01_train_bqml_mf_pmi.ipynb\",\"provenance\":[],\"collapsed_sec"
},
{
"path": "retail/recommendation-system/bqml-scann/02_export_bqml_mf_embeddings.ipynb",
"chars": 9740,
"preview": "{\"nbformat\":4,\"nbformat_minor\":0,\"metadata\":{\"colab\":{\"name\":\"02_export_bqml_mf_embeddings.ipynb\",\"provenance\":[],\"colla"
},
{
"path": "retail/recommendation-system/bqml-scann/03_create_embedding_lookup_model.ipynb",
"chars": 9183,
"preview": "{\"nbformat\":4,\"nbformat_minor\":0,\"metadata\":{\"colab\":{\"name\":\"03_create_embedding_lookup_model.ipynb\",\"provenance\":[],\"c"
},
{
"path": "retail/recommendation-system/bqml-scann/04_build_embeddings_scann.ipynb",
"chars": 8866,
"preview": "{\"nbformat\":4,\"nbformat_minor\":0,\"metadata\":{\"environment\":{\"name\":\"tf2-2-3-gpu.2-3.m59\",\"type\":\"gcloud\",\"uri\":\"gcr.io/d"
},
{
"path": "retail/recommendation-system/bqml-scann/05_deploy_lookup_and_scann_caip.ipynb",
"chars": 23303,
"preview": "{\"nbformat\":4,\"nbformat_minor\":0,\"metadata\":{\"environment\":{\"name\":\"tf2-2-3-gpu.2-3.m59\",\"type\":\"gcloud\",\"uri\":\"gcr.io/d"
},
{
"path": "retail/recommendation-system/bqml-scann/README.md",
"chars": 23198,
"preview": "# Real-time Item-to-item Recommendation with BigQuery ML Matrix Factorization and ScaNN\n\nThis directory contains code sa"
},
{
"path": "retail/recommendation-system/bqml-scann/ann01_create_index.ipynb",
"chars": 107864,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Low-latency item-to-item recommen"
},
{
"path": "retail/recommendation-system/bqml-scann/ann02_run_pipeline.ipynb",
"chars": 54602,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# Low-latency item-to-item recommen"
},
{
"path": "retail/recommendation-system/bqml-scann/ann_grpc/match_pb2.py",
"chars": 20139,
"preview": "# -*- coding: utf-8 -*-\n# Generated by the protocol buffer compiler. DO NOT EDIT!\n# source: match.proto\n\"\"\"Generated pr"
},
{
"path": "retail/recommendation-system/bqml-scann/ann_grpc/match_pb2_grpc.py",
"chars": 4364,
"preview": "# Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT!\n\"\"\"Client and server classes corresponding to prot"
},
{
"path": "retail/recommendation-system/bqml-scann/ann_setup.md",
"chars": 2358,
"preview": "## Setting up the ANN Service Experimental release\n\nThis document outlines the steps required to enable and configure th"
},
{
"path": "retail/recommendation-system/bqml-scann/embeddings_exporter/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "retail/recommendation-system/bqml-scann/embeddings_exporter/pipeline.py",
"chars": 1809,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/embeddings_exporter/runner.py",
"chars": 1545,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/embeddings_exporter/setup.py",
"chars": 829,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/embeddings_lookup/lookup_creator.py",
"chars": 3101,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/index_builder/builder/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "retail/recommendation-system/bqml-scann/index_builder/builder/indexer.py",
"chars": 3305,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/index_builder/builder/task.py",
"chars": 1407,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/index_builder/config.yaml",
"chars": 62,
"preview": "trainingInput:\n scaleTier: CUSTOM\n masterType: n1-standard-8"
},
{
"path": "retail/recommendation-system/bqml-scann/index_builder/setup.py",
"chars": 858,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/index_server/Dockerfile",
"chars": 199,
"preview": "FROM python:3.8-slim\n\nCOPY requirements.txt .\nRUN pip install -r requirements.txt\n\nCOPY . ./\n\nARG PORT\nENV PORT=$PORT\n\nC"
},
{
"path": "retail/recommendation-system/bqml-scann/index_server/cloudbuild.yaml",
"chars": 173,
"preview": "steps:\n\n- name: 'gcr.io/cloud-builders/docker'\n args: ['build', '--tag', '${_IMAGE_URL}', '.', '--build-arg=PORT=${_POR"
},
{
"path": "retail/recommendation-system/bqml-scann/index_server/lookup.py",
"chars": 1428,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/index_server/main.py",
"chars": 2243,
"preview": "# Copyright 2019 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/index_server/matching.py",
"chars": 1436,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/index_server/requirements.txt",
"chars": 87,
"preview": "pip==20.2.4\nFlask==1.1.2\ngunicorn==20.0.4\ngoogle-api-python-client==1.12.5\nscann==1.1.1"
},
{
"path": "retail/recommendation-system/bqml-scann/perf_test.ipynb",
"chars": 5879,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"## Test the Retrieval Latency of Ap"
},
{
"path": "retail/recommendation-system/bqml-scann/requirements.txt",
"chars": 149,
"preview": "tensorflow==2.4.0\ntfx==0.25.0\napache-beam[gcp]\ngoogle-cloud-bigquery \npyarrow\ngoogle-auth \ngoogle-api-python-client \ngoo"
},
{
"path": "retail/recommendation-system/bqml-scann/sql_scripts/sp_ComputePMI.sql",
"chars": 2583,
"preview": "CREATE OR REPLACE PROCEDURE @DATASET_NAME.sp_ComputePMI(\n IN min_item_frequency INT64,\n IN max_group_size INT64\n)\n\nBEG"
},
{
"path": "retail/recommendation-system/bqml-scann/sql_scripts/sp_ExractEmbeddings.sql",
"chars": 818,
"preview": "CREATE OR REPLACE PROCEDURE @DATASET_NAME.sp_ExractEmbeddings() \nBEGIN\n CREATE OR REPLACE TABLE @DATASET_NAME.item_emb"
},
{
"path": "retail/recommendation-system/bqml-scann/sql_scripts/sp_TrainItemMatchingModel.sql",
"chars": 508,
"preview": "CREATE OR REPLACE PROCEDURE @DATASET_NAME.sp_TrainItemMatchingModel(\n IN dimensions INT64\n)\n\nBEGIN\n\n CREATE OR REPLACE"
},
{
"path": "retail/recommendation-system/bqml-scann/tfx01_interactive.ipynb",
"chars": 28045,
"preview": "{\"nbformat\":4,\"nbformat_minor\":0,\"metadata\":{\"environment\":{\"name\":\"tf2-2-3-gpu.2-3.m59\",\"type\":\"gcloud\",\"uri\":\"gcr.io/d"
},
{
"path": "retail/recommendation-system/bqml-scann/tfx02_deploy_run.ipynb",
"chars": 11980,
"preview": "{\"nbformat\":4,\"nbformat_minor\":0,\"metadata\":{\"environment\":{\"name\":\"tf2-2-3-gpu.2-3.m59\",\"type\":\"gcloud\",\"uri\":\"gcr.io/d"
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/Dockerfile",
"chars": 174,
"preview": "FROM tensorflow/tfx:0.25.0\n\nRUN pip install scann==1.1.1 google-cloud-bigquery==1.26.1 protobuf==3.13.0\n\nWORKDIR /pipeli"
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/bq_components.py",
"chars": 4043,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/config.py",
"chars": 1468,
"preview": "# Copyright 2020 Google LLC. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# "
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/item_matcher.py",
"chars": 2086,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/lookup_creator.py",
"chars": 3722,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/pipeline.py",
"chars": 8109,
"preview": "# Copyright 2020 Google Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# "
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/runner.py",
"chars": 3261,
"preview": "# Copyright 2020 Google Inc. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# "
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/scann_evaluator.py",
"chars": 7449,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/scann_indexer.py",
"chars": 3875,
"preview": "# Copyright 2020 Google LLC\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this f"
},
{
"path": "retail/recommendation-system/bqml-scann/tfx_pipeline/schema/schema.pbtxt",
"chars": 452,
"preview": "feature {\n name: \"item_Id\"\n type: BYTES\n int_domain {\n }\n presence {\n min_fraction: 1.0\n min_count: 1\n }\n s"
},
{
"path": "retail/time-series/bqml-demand-forecasting/README.md",
"chars": 8010,
"preview": "# How to build a time series demand forecasting model using BigQuery ML\n\nThe goal of this repo is to provide an end-to-e"
},
{
"path": "retail/time-series/bqml-demand-forecasting/bqml_retail_demand_forecasting.ipynb",
"chars": 2967874,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"code\",\n \"execution_count\": null,\n \"metadata\": {\n \"id\": \"ur8xi4C7S06n\"\n },\n "
}
]