Repository: DataTalksClub/data-engineering-zoomcamp
Branch: main
Commit: ef44b885b9bb
Files: 381
Total size: 12.5 MB

Directory structure:
gitextract_ffam8vtv/

├── .github/
│   └── FUNDING.yml
├── .gitignore
├── 01-docker-terraform/
│   ├── README.md
│   ├── docker-sql/
│   │   ├── 01-introduction.md
│   │   ├── 02-virtual-environment.md
│   │   ├── 03-dockerizing-pipeline.md
│   │   ├── 04-postgres-docker.md
│   │   ├── 05-data-ingestion.md
│   │   ├── 06-ingestion-script.md
│   │   ├── 07-pgadmin.md
│   │   ├── 08-dockerizing-ingestion.md
│   │   ├── 09-docker-compose.md
│   │   ├── 10-sql-refresher.md
│   │   ├── 11-cleanup.md
│   │   ├── README.md
│   │   └── pipeline/
│   │       ├── .python-version
│   │       ├── Dockerfile
│   │       ├── docker-compose.yaml
│   │       ├── docker-helper-scripts/
│   │       │   ├── docker-ingest.sh
│   │       │   ├── docker-pgadmin.sh
│   │       │   └── docker-postgres.sh
│   │       ├── ingest_data.py
│   │       └── pyproject.toml
│   └── terraform/
│       ├── 1_terraform_overview.md
│       ├── 2_gcp_overview.md
│       ├── README.md
│       ├── terraform/
│       │   ├── README.md
│       │   ├── terraform_basic/
│       │   │   └── main.tf
│       │   ├── terraform_with_variable_AWS/
│       │   │   ├── README.md
│       │   │   ├── main.tf
│       │   │   ├── terraform.tfvars
│       │   │   └── variables.tf
│       │   └── terraform_with_variables/
│       │       ├── main.tf
│       │       └── variables.tf
│       └── windows.md
├── 02-workflow-orchestration/
│   ├── README.md
│   ├── docker-compose.yml
│   └── flows/
│       ├── 01_hello_world.yaml
│       ├── 02_python.yaml
│       ├── 03_getting_started_data_pipeline.yaml
│       ├── 04_postgres_taxi.yaml
│       ├── 05_postgres_taxi_scheduled.yaml
│       ├── 06_gcp_kv.yaml
│       ├── 07_gcp_setup.yaml
│       ├── 08_gcp_taxi.yaml
│       ├── 09_gcp_taxi_scheduled.yaml
│       ├── 10_chat_without_rag.yaml
│       └── 11_chat_with_rag.yaml
├── 03-data-warehouse/
│   ├── README.md
│   ├── big_query.sql
│   ├── big_query_hw.sql
│   ├── big_query_ml.sql
│   ├── extract_model.md
│   └── extras/
│       ├── .env-example
│       ├── .gitignore
│       ├── README.md
│       ├── pyproject.toml
│       ├── web_to_gcs.py
│       └── web_to_gcs_with_progress_bar.py
├── 04-analytics-engineering/
│   ├── README.md
│   ├── class_notes/
│   │   ├── 4_1_1_analytics_engineering_basics.md
│   │   ├── 4_1_2_what_is_dbt.md
│   │   ├── 4_2_1_dbt_core_vs_dbt_cloud.md
│   │   ├── 4_3_1_dbt_project_structure.md
│   │   ├── 4_3_2_dbt_sources.md
│   │   ├── 4_4_1_dbt_models.md
│   │   ├── 4_4_2_dbt_seeds_and_macros.md
│   │   ├── 4_5_1_documentation.md
│   │   ├── 4_5_2_dbt_tests.md
│   │   ├── 4_5_3_dbt_packages.md
│   │   └── 4_6_1_dbt_commands.md
│   ├── refreshers/
│   │   └── SQL.md
│   ├── setup/
│   │   ├── cloud_setup.md
│   │   ├── duckdb_troubleshooting.md
│   │   └── local_setup.md
│   └── taxi_rides_ny/
│       ├── .gitignore
│       ├── dbt_project.yml
│       ├── macros/
│       │   ├── get_trip_duration_minutes.sql
│       │   ├── get_vendor_data.sql
│       │   ├── macros_properties.yml
│       │   └── safe_cast.sql
│       ├── models/
│       │   ├── intermediate/
│       │   │   ├── int_trips.sql
│       │   │   ├── int_trips_unioned.sql
│       │   │   └── schema.yml
│       │   ├── marts/
│       │   │   ├── dim_vendors.sql
│       │   │   ├── dim_zones.sql
│       │   │   ├── fct_trips.sql
│       │   │   ├── reporting/
│       │   │   │   ├── fct_monthly_zone_revenue.sql
│       │   │   │   └── schema.yml
│       │   │   └── schema.yml
│       │   └── staging/
│       │       ├── schema.yml
│       │       ├── sources.yml
│       │       ├── stg_green_tripdata.sql
│       │       └── stg_yellow_tripdata.sql
│       ├── package-lock.yml
│       ├── packages.yml
│       ├── seeds/
│       │   └── seeds_properties.yml
│       ├── snapshots/
│       │   └── .gitkeep
│       └── tests/
│           └── .gitkeep
├── 05-data-platforms/
│   ├── README.md
│   └── notes/
│       ├── 01-introduction.md
│       ├── 02-getting-started.md
│       ├── 03-nyc-taxi-pipeline.md
│       ├── 04-bruin-mcp.md
│       ├── 05-bruin-cloud.md
│       ├── 06-core-01-projects.md
│       ├── 06-core-02-pipelines.md
│       ├── 06-core-03-assets.md
│       ├── 06-core-04-variables.md
│       └── 06-core-05-commands.md
├── 06-batch/
│   ├── .gitignore
│   ├── README.md
│   ├── code/
│   │   ├── 03_test.ipynb
│   │   ├── 04_pyspark.ipynb
│   │   ├── 05_taxi_schema.ipynb
│   │   ├── 06_spark_sql.ipynb
│   │   ├── 06_spark_sql.py
│   │   ├── 06_spark_sql_big_query.py
│   │   ├── 07_groupby_join.ipynb
│   │   ├── 08_rdds.ipynb
│   │   ├── 09_spark_gcs.ipynb
│   │   ├── cloud.md
│   │   ├── download_data.sh
│   │   └── homework.ipynb
│   └── setup/
│       ├── config/
│       │   ├── core-site.xml
│       │   ├── spark-defaults.conf
│       │   └── spark.dockerfile
│       ├── hadoop-yarn.md
│       ├── linux.md
│       ├── macos.md
│       └── windows.md
├── 07-streaming/
│   ├── .gitignore
│   ├── README.md
│   ├── extras/
│   │   ├── README.md
│   │   ├── ksqldb/
│   │   │   └── commands.md
│   │   ├── pyflink/
│   │   │   ├── .gitignore
│   │   │   ├── Dockerfile.flink
│   │   │   ├── LICENSE
│   │   │   ├── Makefile
│   │   │   ├── README.md
│   │   │   ├── docker-compose.yml
│   │   │   ├── homework.md
│   │   │   ├── requirements.txt
│   │   │   └── src/
│   │   │       ├── job/
│   │   │       │   ├── aggregation_job.py
│   │   │       │   ├── start_job.py
│   │   │       │   └── taxi_job.py
│   │   │       └── producers/
│   │   │           ├── load_taxi_data.py
│   │   │           └── producer.py
│   │   └── python/
│   │       ├── README.md
│   │       ├── avro_example/
│   │       │   ├── consumer.py
│   │       │   ├── producer.py
│   │       │   ├── ride_record.py
│   │       │   ├── ride_record_key.py
│   │       │   └── settings.py
│   │       ├── docker/
│   │       │   ├── README.md
│   │       │   ├── docker-compose.yml
│   │       │   ├── kafka/
│   │       │   │   └── docker-compose.yml
│   │       │   └── spark/
│   │       │       ├── build.sh
│   │       │       ├── cluster-base.Dockerfile
│   │       │       ├── docker-compose.yml
│   │       │       ├── jupyterlab.Dockerfile
│   │       │       ├── spark-base.Dockerfile
│   │       │       ├── spark-master.Dockerfile
│   │       │       └── spark-worker.Dockerfile
│   │       ├── json_example/
│   │       │   ├── consumer.py
│   │       │   ├── producer.py
│   │       │   ├── ride.py
│   │       │   └── settings.py
│   │       ├── redpanda_example/
│   │       │   ├── README.md
│   │       │   ├── consumer.py
│   │       │   ├── docker-compose.yaml
│   │       │   ├── producer.py
│   │       │   ├── ride.py
│   │       │   └── settings.py
│   │       ├── requirements.txt
│   │       ├── resources/
│   │       │   └── schemas/
│   │       │       ├── taxi_ride_key.avsc
│   │       │       └── taxi_ride_value.avsc
│   │       └── streams-example/
│   │           ├── faust/
│   │           │   ├── branch_price.py
│   │           │   ├── producer_taxi_json.py
│   │           │   ├── stream.py
│   │           │   ├── stream_count_vendor_trips.py
│   │           │   ├── taxi_rides.py
│   │           │   └── windowing.py
│   │           ├── pyspark/
│   │           │   ├── README.md
│   │           │   ├── consumer.py
│   │           │   ├── producer.py
│   │           │   ├── settings.py
│   │           │   ├── spark-submit.sh
│   │           │   ├── streaming-notebook.ipynb
│   │           │   └── streaming.py
│   │           └── redpanda/
│   │               ├── README.md
│   │               ├── consumer.py
│   │               ├── docker-compose.yaml
│   │               ├── producer.py
│   │               ├── settings.py
│   │               ├── spark-submit.sh
│   │               ├── streaming-notebook.ipynb
│   │               └── streaming.py
│   ├── theory/
│   │   ├── README.md
│   │   └── java/
│   │       └── kafka_examples/
│   │           ├── .gitignore
│   │           ├── build/
│   │           │   └── generated-main-avro-java/
│   │           │       └── schemaregistry/
│   │           │           ├── RideRecord.java
│   │           │           ├── RideRecordCompatible.java
│   │           │           └── RideRecordNoneCompatible.java
│   │           ├── build.gradle
│   │           ├── gradle/
│   │           │   └── wrapper/
│   │           │       ├── gradle-wrapper.jar
│   │           │       └── gradle-wrapper.properties
│   │           ├── gradlew
│   │           ├── gradlew.bat
│   │           ├── settings.gradle
│   │           └── src/
│   │               ├── main/
│   │               │   ├── avro/
│   │               │   │   ├── rides.avsc
│   │               │   │   ├── rides_compatible.avsc
│   │               │   │   └── rides_non_compatible.avsc
│   │               │   └── java/
│   │               │       └── org/
│   │               │           └── example/
│   │               │               ├── AvroProducer.java
│   │               │               ├── JsonConsumer.java
│   │               │               ├── JsonKStream.java
│   │               │               ├── JsonKStreamJoins.java
│   │               │               ├── JsonKStreamWindow.java
│   │               │               ├── JsonProducer.java
│   │               │               ├── JsonProducerPickupLocation.java
│   │               │               ├── Secrets.java
│   │               │               ├── Topics.java
│   │               │               ├── customserdes/
│   │               │               │   └── CustomSerdes.java
│   │               │               └── data/
│   │               │                   ├── PickupLocation.java
│   │               │                   ├── Ride.java
│   │               │                   └── VendorInfo.java
│   │               └── test/
│   │                   └── java/
│   │                       └── org/
│   │                           └── example/
│   │                               ├── JsonKStreamJoinsTest.java
│   │                               ├── JsonKStreamTest.java
│   │                               └── helper/
│   │                                   └── DataGeneratorHelper.java
│   └── workshop/
│       ├── .python-version
│       ├── Dockerfile.flink
│       ├── Dockerfile_ARM64.flink
│       ├── Makefile
│       ├── README.md
│       ├── docker-compose.yml
│       ├── flink-config.yaml
│       ├── live/
│       │   ├── .gitignore
│       │   ├── .python-version
│       │   ├── Dockerfile.flink
│       │   ├── README.md
│       │   ├── docker-compose.yaml
│       │   ├── flink-config.yaml
│       │   ├── main.py
│       │   ├── notebooks/
│       │   │   ├── consumer_db.ipynb
│       │   │   ├── models.py
│       │   │   └── producer.ipynb
│       │   ├── pyproject.flink.toml
│       │   ├── pyproject.toml
│       │   └── src/
│       │       ├── job/
│       │       │   ├── aggregation_job.py
│       │       │   └── pass_through_job.py
│       │       └── producers/
│       │           ├── models.py
│       │           └── producer_realtime.py
│       ├── pyproject.flink.toml
│       ├── pyproject.toml
│       └── src/
│           ├── consumers/
│           │   ├── consumer.py
│           │   └── consumer_postgres.py
│           ├── job/
│           │   ├── aggregation_job.py
│           │   ├── aggregation_job_demo.py
│           │   └── pass_through_job.py
│           ├── models.py
│           └── producers/
│               ├── producer.py
│               └── producer_realtime.py
├── README.md
├── after-sign-up.md
├── asking-questions.md
├── awesome-data-engineering.md
├── certificates.md
├── cohorts/
│   ├── 2022/
│   │   ├── README.md
│   │   ├── project.md
│   │   ├── week_1_basics_n_setup/
│   │   │   └── homework.md
│   │   ├── week_2_data_ingestion/
│   │   │   ├── README.md
│   │   │   ├── airflow/
│   │   │   │   ├── .env_example
│   │   │   │   ├── 1_setup_official.md
│   │   │   │   ├── 2_setup_nofrills.md
│   │   │   │   ├── Dockerfile
│   │   │   │   ├── README.md
│   │   │   │   ├── dags/
│   │   │   │   │   └── data_ingestion_gcs_dag.py
│   │   │   │   ├── dags_local/
│   │   │   │   │   ├── data_ingestion_local.py
│   │   │   │   │   └── ingest_script.py
│   │   │   │   ├── docker-compose-nofrills.yml
│   │   │   │   ├── docker-compose.yaml
│   │   │   │   ├── docker-compose_2.3.4.yaml
│   │   │   │   ├── docs/
│   │   │   │   │   └── 1_concepts.md
│   │   │   │   ├── extras/
│   │   │   │   │   ├── data_ingestion_gcs_dag_ex2.py
│   │   │   │   │   └── web_to_gcs.sh
│   │   │   │   ├── requirements.txt
│   │   │   │   └── scripts/
│   │   │   │       └── entrypoint.sh
│   │   │   ├── homework/
│   │   │   │   ├── homework.md
│   │   │   │   └── solution.py
│   │   │   └── transfer_service/
│   │   │       └── README.md
│   │   ├── week_3_data_warehouse/
│   │   │   └── airflow/
│   │   │       ├── .env_example
│   │   │       ├── 1_setup_official.md
│   │   │       ├── 2_setup_nofrills.md
│   │   │       ├── README.md
│   │   │       ├── dags/
│   │   │       │   └── gcs_to_bq_dag.py
│   │   │       ├── docker-compose-nofrills.yml
│   │   │       ├── docker-compose.yaml
│   │   │       └── scripts/
│   │   │           └── entrypoint.sh
│   │   ├── week_5_batch_processing/
│   │   │   └── homework.md
│   │   └── week_6_stream_processing/
│   │       └── homework.md
│   ├── 2023/
│   │   ├── README.md
│   │   ├── leaderboard.md
│   │   ├── project.md
│   │   ├── week_1_docker_sql/
│   │   │   └── homework.md
│   │   ├── week_1_terraform/
│   │   │   └── homework.md
│   │   ├── week_2_workflow_orchestration/
│   │   │   ├── README.md
│   │   │   └── homework.md
│   │   ├── week_3_data_warehouse/
│   │   │   └── homework.md
│   │   ├── week_4_analytics_engineering/
│   │   │   └── homework.md
│   │   ├── week_5_batch_processing/
│   │   │   └── homework.md
│   │   ├── week_6_stream_processing/
│   │   │   ├── client.properties
│   │   │   ├── homework.md
│   │   │   ├── producer_confluent.py
│   │   │   ├── settings.py
│   │   │   ├── spark-submit.sh
│   │   │   └── streaming_confluent.py
│   │   └── workshops/
│   │       └── piperider.md
│   ├── 2024/
│   │   ├── 01-docker-terraform/
│   │   │   ├── homework.md
│   │   │   └── solutions.md
│   │   ├── 02-workflow-orchestration/
│   │   │   ├── README.md
│   │   │   └── homework.md
│   │   ├── 03-data-warehouse/
│   │   │   └── homework.md
│   │   ├── 04-analytics-engineering/
│   │   │   └── homework.md
│   │   ├── 05-batch/
│   │   │   └── homework.md
│   │   ├── 06-streaming/
│   │   │   ├── docker-compose.yml
│   │   │   └── homework.md
│   │   ├── README.md
│   │   ├── leaderboard.md
│   │   ├── project.md
│   │   └── workshops/
│   │       ├── dlt.md
│   │       ├── dlt_resources/
│   │       │   ├── data_ingestion_workshop.md
│   │       │   ├── homework_solution.ipynb
│   │       │   ├── homework_starter.ipynb
│   │       │   └── workshop.ipynb
│   │       └── rising-wave.md
│   ├── 2025/
│   │   ├── 01-docker-terraform/
│   │   │   └── homework.md
│   │   ├── 02-workflow-orchestration/
│   │   │   ├── README.md
│   │   │   ├── flows/
│   │   │   │   ├── 01_getting_started_data_pipeline.yaml
│   │   │   │   ├── 02_postgres_taxi.yaml
│   │   │   │   ├── 02_postgres_taxi_scheduled.yaml
│   │   │   │   ├── 03_postgres_dbt.yaml
│   │   │   │   ├── 04_gcp_kv.yaml
│   │   │   │   ├── 05_gcp_setup.yaml
│   │   │   │   ├── 06_gcp_taxi.yaml
│   │   │   │   ├── 06_gcp_taxi_scheduled.yaml
│   │   │   │   └── 07_gcp_dbt.yaml
│   │   │   └── homework.md
│   │   ├── 03-data-warehouse/
│   │   │   ├── DLT_upload_to_GCP.ipynb
│   │   │   ├── homework.md
│   │   │   └── load_yellow_taxi_data.py
│   │   ├── 04-analytics-engineering/
│   │   │   └── homework.md
│   │   ├── 05-batch/
│   │   │   └── homework.md
│   │   ├── 06-streaming/
│   │   │   ├── homework/
│   │   │   │   └── homework.ipynb
│   │   │   └── homework.md
│   │   ├── README.md
│   │   ├── project.md
│   │   └── workshops/
│   │       ├── dlt/
│   │       │   ├── README.md
│   │       │   ├── data_ingestion_workshop.md
│   │       │   └── dlt_homework.md
│   │       └── dynamic_load_dlt.py
│   └── 2026/
│       ├── 01-docker-terraform/
│       │   └── homework.md
│       ├── 02-workflow-orchestration/
│       │   └── homework.md
│       ├── 03-data-warehouse/
│       │   ├── DLT_upload_to_GCP.ipynb
│       │   ├── homework.md
│       │   └── load_yellow_taxi_data.py
│       ├── 04-analytics-engineering/
│       │   └── homework.md
│       ├── 05-data-platforms/
│       │   └── homework.md
│       ├── 06-batch/
│       │   └── homework.md
│       ├── 07-streaming/
│       │   └── homework.md
│       ├── README.md
│       ├── project.md
│       └── workshops/
│           ├── dlt/
│           │   ├── README.md
│           │   ├── analysis.py
│           │   ├── dlt_Pipeline_Overview.ipynb
│           │   ├── dlt_homework.md
│           │   ├── open_library_pipeline.py
│           │   └── pyproject.toml
│           └── dlt.md
├── learning-in-public.md
├── projects/
│   ├── README.md
│   └── datasets.md
└── workshop-best-practices.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/FUNDING.yml
================================================
github: alexeygrigorev


================================================
FILE: .gitignore
================================================

.DS_Store
.idea
*.tfstate
*.tfstate.*
**.terraform
**.terraform.lock.*
**google_credentials.json
**logs/
**.env
**__pycache__/
.history
**/ny_taxi_postgres_data/*
serving_dir
.ipynb_checkpoints/
!week_6_stream_processing/avro_example/data/rides.csv
*.parquet
*.csv
*.duckdb


================================================
FILE: 01-docker-terraform/README.md
================================================
# Introduction

[![](https://markdown-videos-api.jorgenkh.no/youtube/JgspdlKXS-w)](https://www.youtube.com/watch?v=JgspdlKXS-w)


We suggest watching videos in the same order as in this document.


# Docker + Postgres

## Workshop

[![](https://markdown-videos-api.jorgenkh.no/youtube/lP8xXebHmuE)](https://youtu.be/lP8xXebHmuE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10)

* Video: https://www.youtube.com/watch?v=lP8xXebHmuE
* Follow the instructions here: [docker-sql/](docker-sql/)

## :movie_camera: SQL refresher


[![](https://markdown-videos-api.jorgenkh.no/youtube/QEcps_iskgg)](https://youtu.be/QEcps_iskgg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10)

* Video: https://www.youtube.com/watch?v=QEcps_iskgg
* SQL queries: [10-sql-refresher.md](docker-sql/10-sql-refresher.md)


# GCP

## :movie_camera: Introduction to GCP (Google Cloud Platform)

[![](https://markdown-videos-api.jorgenkh.no/youtube/18jIzE41fJ4)](https://youtu.be/18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=3)

# Terraform

[Code and notes](terraform/)

## :movie_camera: Introduction to Terraform: Concepts and Overview, a primer

[![](https://markdown-videos-api.jorgenkh.no/youtube/s2bOYDCKl_M)](https://youtu.be/s2bOYDCKl_M&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=11)

## :movie_camera: Terraform Basics: Simple one file Terraform Deployment

[![](https://markdown-videos-api.jorgenkh.no/youtube/Y2ux7gq3Z0o)](https://youtu.be/Y2ux7gq3Z0o&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=12)

## :movie_camera: Deployment with a Variables File

[![](https://markdown-videos-api.jorgenkh.no/youtube/PBi0hHjLftk)](https://youtu.be/PBi0hHjLftk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=13)

## Configuring Terraform and GCP SDK on Windows

* [Instructions](terraform/windows.md)


# Homework

* [Homework](../cohorts/2026/01-docker-terraform/homework.md)


# Community notes

<details>
<summary>Did you take notes? You can share them here</summary>

* [Notes from Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/1_intro.md)
* [Notes from Abd](https://itnadigital.notion.site/Week-1-Introduction-f18de7e69eb4453594175d0b1334b2f4)
* [Notes from Aaron](https://github.com/ABZ-Aaron/DataEngineerZoomCamp/blob/master/week_1_basics_n_setup/README.md)
* [Notes from Faisal](https://github.com/FaisalMohd/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/Notes/DE%20Zoomcamp%20Week-1.pdf)
* [Michael Harty's Notes](https://github.com/mharty3/data_engineering_zoomcamp_2022/tree/main/week01)
* [Blog post from Isaac Kargar](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/18/data-engineering-w1.html)
* [Handwritten Notes By Mahmoud Zaher](https://github.com/zaherweb/DataEngineering/blob/master/week%201.pdf)
* [Notes from Candace Williams](https://teacherc.github.io/data-engineering/2023/01/18/zoomcamp1.html)
* [Notes from Marcos Torregrosa](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-1/)
* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)
* [Notes from Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week1)
* [Notes from froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_1_basics_n_setup/notes/notes_week_01.md)
* [Notes from adamiaonr](https://github.com/adamiaonr/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/2_docker_sql/NOTES.md)
* [Notes from Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/01/week-1-data-engineering-zoomcamp-notes/)
* [Notes from Balaji](https://github.com/Balajirvp/DE-Zoomcamp/blob/main/Week%201/Detailed%20Week%201%20Notes.ipynb)
* [Notes from Erik](https://twitter.com/ehub96/status/1621351266281730049)
* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week1.md)
* Notes on [Docker, Docker Compose, and setting up a proper Python environment](https://medium.com/@verazabeida/zoomcamp-2023-week-1-f4f94cb360ae), by Vera
* [Setting up the development environment on Google Virtual Machine](https://itsadityagupta.hashnode.dev/setting-up-the-development-environment-on-google-virtual-machine), blog post by Aditya Gupta
* [Notes from Zharko Cekovski](https://www.zharconsulting.com/contents/data/data-engineering-bootcamp-2024/week-1-postgres-docker-and-ingestion-scripts/)
* [2024 Module-01 Walkthrough video by ellacharmed on YouTube](https://youtu.be/VUZshlVAnk4)
* [2024 Companion Module Walkthrough slides by ellacharmed](https://github.com/ellacharmed/data-engineering-zoomcamp/blob/ella2024/cohorts/2024/01-docker-terraform/walkthrough-01.pdf)
* [2024 Module-01 Environment setup video by ellacharmed on YouTube](https://youtu.be/Zce_Hd37NGs)
* [Docker Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/1a-docker_sql/readme.md) • [Terraform Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/1b-terraform_gcp/readme.md)
* [Notes from Hammad Tariq](https://github.com/hamad-tariq/HammadTariq-ZoomCamp2024/blob/9c8b4908416eb8cade3d7ec220e7664c003e9b11/week_1_basics_n_setup/README.md)
* [Hung's Notes](https://hung.bearblog.dev/docker/) & [Docker Cheatsheet](https://github.com/HangenYuu/docker-cheatsheet)
* [Kemal's Notes](https://github.com/kemaldahha/data-engineering-course/blob/main/week_1_notes.md)
* [Notes from Manuel Guerra (Windows+WSL2 Environment)](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/1_Containerization-and-Infrastructure-as-Code/README.md)
* [Notes from Horeb SEIDOU](https://spotted-hardhat-eea.notion.site/Week-1-Containerization-and-Infrastructure-as-Code-15729780dc4a80a08288e497ba937a37)
* [2025 Gitbook Notes from Tinker0425](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/introduction/introduction-and-set-up)
* [Alex's Docker Notes](https://github.com/alexg9010/2025_data_engineering_zoomcamp/blob/master/01_docker/README.md) | [Alex's Terraform Notes](https://github.com/alexg9010/2025_data_engineering_zoomcamp/blob/master/01_3_terraform/README.md)
* [2025 SQL Refresher - Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/01_docker_postgress/0_sql_refresh.ipynb)
* [2025 Setting up the Environment - Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/01_docker_postgress/_setting_up.md)
* [Notes from Mercy Markus: Linux/Fedora Tweaks and Tips](https://mercymarkus.com/posts/2025/series/dtc-dez-jan-2025/dtc-dez-2025-module-1/)
* [[2026 tutorial video - Khanh Nguyen] Setting up the environment for homework-w1](https://youtu.be/_iqCWi_UoOc)
* Add your notes above this line

</details>


================================================
FILE: 01-docker-terraform/docker-sql/01-introduction.md
================================================
# Introduction to Docker

**[↑ Up](README.md)** | **[← Previous](README.md)** | **[Next →](02-virtual-environment.md)**

Docker is _containerization software_ that lets us isolate applications much as virtual machines do, but in a far more lightweight way.

A Docker image is a _snapshot_ of an environment that we define to run our software, or in this case our data pipelines. Containers are started from images. By pushing our Docker images to cloud providers such as Amazon Web Services or Google Cloud Platform, we can run our containers there.

## Why Docker?

Docker provides the following advantages:

- Reproducibility: Same environment everywhere
- Isolation: Applications run independently
- Portability: Run anywhere Docker is installed

Containers are used in many situations:

- Integration tests: CI/CD pipelines
- Running pipelines on the cloud: AWS Batch, Kubernetes jobs
- Spark: Analytics engine for large-scale data processing
- Serverless: AWS Lambda, Google Cloud Functions

## Basic Docker Commands

Check Docker version:

```bash
docker --version
```

Run a simple container:

```bash
docker run hello-world
```

Run something more complex:

```bash
docker run ubuntu
```

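Nothing seems to happen: the container starts and exits immediately because no terminal is attached. We need to run it in interactive mode with `-it`: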

```bash
docker run -it ubuntu
```

We don't have `python` there so let's install it:

```bash
apt update && apt install python3
python3 -V
```

## Stateless Containers

Important: Docker containers are stateless - any changes made inside a container will NOT be saved when the container is killed and started again.

When you exit the container and run the image again, the changes are gone:

```bash
docker run -it ubuntu
python3 -V
```

This is good, because it doesn't affect your host system. Let's say you do something crazy like this:

```bash
docker run -it ubuntu
rm -rf / # don't run it on your computer!
```

Next time we run it, all the files are back.

## Managing Containers

But this is not _completely_ correct. The state of a stopped container is kept on disk until the container is removed. We can see stopped containers:

```bash
docker ps -a
```

We can restart one of them, but we won't do it, because it's not a good practice. They take space, so let's delete them:

```bash
docker rm $(docker ps -aq)
```

Next time we run something, we add `--rm` so that the container is removed automatically when it exits:

```bash
docker run -it --rm ubuntu
```

## Different Base Images

There are other base images besides `hello-world` and `ubuntu`. For example, Python:

```bash
docker run -it --rm python:3.9.16
# add -slim to get a smaller version
```

This one starts `python`. If we want bash, we need to override `entrypoint`:

```bash
docker run -it \
    --rm \
    --entrypoint=bash \
    python:3.9.16-slim
```

## Volumes

So, we know that with Docker we can restore any container to its initial state in a reproducible manner. But what about data that should survive, or that we want to share with the host? A common way to handle it is with _volumes_.

Let's create some data in `test`:

```bash
mkdir test
cd test
touch file1.txt file2.txt file3.txt
echo "Hello from host" > file1.txt
cd ..
```

Now let's create a simple script `test/list_files.py` that shows the files in the folder:

```python
from pathlib import Path

current_dir = Path.cwd()
current_file = Path(__file__).name

print(f"Files in {current_dir}:")

for filepath in current_dir.iterdir():
    if filepath.name == current_file:
        continue

    print(f"  - {filepath.name}")

    if filepath.is_file():
        content = filepath.read_text(encoding='utf-8')
        print(f"    Content: {content}")
```

Now let's mount this folder into a Python container:

```bash
docker run -it \
    --rm \
    -v $(pwd)/test:/app/test \
    --entrypoint=bash \
    python:3.9.16-slim
```

Inside the container, run:

```bash
cd /app/test
ls -la
cat file1.txt
python list_files.py
```

You'll see the files from your host machine are accessible in the container!
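The `-v` option above follows the pattern `host_path:container_path`. As a minimal sketch, the same command can be assembled in Python so that the host path is resolved without relying on the shell's `$(pwd)`. The function name `docker_run_args` is hypothetical and only illustrates the argument layout; it does not start Docker by itself.

```python
from pathlib import Path


def docker_run_args(host_dir: str, container_dir: str, image: str) -> list[str]:
    """Build the argument list for the bind-mount command shown above.

    Hypothetical helper: it only assembles arguments, it does not run Docker.
    """
    # -v expects an absolute host path, so resolve it relative to the current directory.
    host_path = Path(host_dir).resolve()
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{host_path}:{container_dir}",
        "--entrypoint=bash",
        image,
    ]


args = docker_run_args("test", "/app/test", "python:3.9.16-slim")
print(" ".join(args))
```

To actually start the container, the list could be passed to `subprocess.run(args, check=True)` on a machine where Docker is installed.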

**[↑ Up](README.md)** | **[← Previous](README.md)** | **[Next →](02-virtual-environment.md)**


================================================
FILE: 01-docker-terraform/docker-sql/02-virtual-environment.md
================================================
# Virtual Environments and Data Pipelines

**[↑ Up](README.md)** | **[← Previous](01-introduction.md)** | **[Next →](03-dockerizing-pipeline.md)**

A **data pipeline** is a service that receives data as input and outputs more data. For example, reading a CSV file, transforming the data somehow and storing it as a table in a PostgreSQL database.

```mermaid
graph LR
    A[CSV File] --> B[Data Pipeline]
    B --> C[Parquet File]
    B --> D[PostgreSQL Database]
    B --> E[Data Warehouse]
    style B fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
```

In this workshop, we'll build pipelines that:
- Download CSV data from the web
- Transform and clean the data with pandas
- Load it into PostgreSQL for querying
- Process data in chunks to handle large files
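
The last point, processing in chunks, can be sketched with the standard library alone. This is an illustration of the idea rather than the workshop's ingestion code: the helper `read_in_chunks` and the sample columns `trip_id` and `distance` are assumptions made for the example.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def read_in_chunks(handle: TextIO, chunk_size: int) -> Iterator[list[dict[str, str]]]:
    """Yield lists of at most chunk_size rows from an open CSV file."""
    reader = csv.DictReader(handle)
    while True:
        # islice pulls the next chunk_size rows without loading the whole file.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# Hypothetical in-memory data standing in for a large downloaded file.
sample = io.StringIO("trip_id,distance\n1,2.5\n2,0.8\n3,4.1\n")
chunk_sizes = [len(chunk) for chunk in read_in_chunks(sample, chunk_size=2)]
print(chunk_sizes)  # [2, 1]
```

Only one chunk is held in memory at a time, which is what makes this approach suitable for files that do not fit in RAM.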

## Creating a Simple Pipeline

Let's create an example pipeline. First, create a directory `pipeline` and inside, create a file `pipeline.py`:

```python
import sys
print("arguments", sys.argv)

day = int(sys.argv[1])
print(f"Running pipeline for day {day}")
```
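
The script above fails with an `IndexError` when no argument is given and with a `ValueError` when the argument is not a number. A minimal sketch of a friendlier variant follows; the helper name `parse_day` and the error messages are assumptions, not part of the workshop code.

```python
def parse_day(argv: list[str]) -> int:
    """Return the day passed as the first command-line argument."""
    if len(argv) < 2:
        # SystemExit with a string prints the message and exits with status 1.
        raise SystemExit("usage: pipeline.py DAY")
    try:
        return int(argv[1])
    except ValueError:
        raise SystemExit(f"DAY must be an integer, got {argv[1]!r}") from None


# Explicit argument list used here in place of sys.argv for illustration.
day = parse_day(["pipeline.py", "10"])
print(f"Running pipeline for day {day}")
```

In the real script, the call would be `parse_day(sys.argv)`.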

Now let's add pandas:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
print(df.head())

df.to_parquet(f"output_day_{sys.argv[1]}.parquet")
```

## Why Virtual Environments?

We need pandas, but we don't have it. We want to test it before we run things in a container.

We can install it with `pip`:

```bash
pip install pandas pyarrow
```

But this installs it globally on your system. This can cause conflicts if different projects need different versions of the same package.

Instead, we want to use a **virtual environment** - an isolated Python environment that keeps dependencies for this project separate from other projects and from your system Python.

## Using uv - Modern Python Package Manager

We'll use `uv` - a modern, fast Python package and project manager written in Rust. It's much faster than pip and handles virtual environments automatically.

```bash
pip install uv
```

Now initialize a Python project with uv:

```bash
uv init --python=3.13
```

This creates a `pyproject.toml` file for managing dependencies and a `.python-version` file.

### Comparing Python Versions

```bash
uv run which python  # Python in the virtual environment
uv run python -V

which python        # System Python
python -V
```

You'll see they're different - `uv run` uses the isolated environment.
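
The same check can be made from inside Python. In a virtual environment created with the standard `venv` mechanism, `sys.prefix` points at the environment while `sys.base_prefix` points at the interpreter it was created from. The sketch below assumes the environment uv creates follows that convention; the function name `in_virtualenv` is just for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Outside a virtual environment both prefixes are identical.
    return sys.prefix != sys.base_prefix


print("executable:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```

Saving this as a file and running it once with `uv run python` and once with plain `python` should show different executables.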

### Adding Dependencies

Now let's add pandas:

```bash
uv add pandas pyarrow
```

This adds pandas and pyarrow to your `pyproject.toml` and installs them in the virtual environment.

### Running the Pipeline

Now we can execute the file:

```bash
uv run python pipeline.py 10
```

The output includes:

* `arguments ['pipeline.py', '10']`
* `Running pipeline for day 10`

## Git Configuration

This script produces a binary (parquet) file, so let's make sure we don't accidentally commit it to git by adding the parquet extension to `.gitignore`:

```
*.parquet
```

**[↑ Up](README.md)** | **[← Previous](01-introduction.md)** | **[Next →](03-dockerizing-pipeline.md)**


================================================
FILE: 01-docker-terraform/docker-sql/03-dockerizing-pipeline.md
================================================
# Dockerizing the Pipeline

**[↑ Up](README.md)** | **[← Previous](02-virtual-environment.md)** | **[Next →](04-postgres-docker.md)**

Now let's containerize the script. Create a file named `Dockerfile` with the following content:

## Simple Dockerfile with pip

```dockerfile
# base Docker image that we will build on
FROM python:3.13.11-slim

# set up our image by installing prerequisites; pandas in this case
RUN pip install pandas pyarrow

# set up the working directory inside the container
WORKDIR /app
# copy the script to the container. 1st name is source file, 2nd is destination
COPY pipeline.py pipeline.py

# define what to do first when the container runs
# in this example, we will just run the script
ENTRYPOINT ["python", "pipeline.py"]
```

**Explanation:**

- `FROM`: Base image (Python 3.13)
- `RUN`: Execute commands during build
- `WORKDIR`: Set working directory
- `COPY`: Copy files into the image
- `ENTRYPOINT`: Default command to run

### Build and Run

Let's build the image:

```bash
docker build -t test:pandas .
```

* The image name will be `test` and its tag will be `pandas`. If the tag isn't specified it will default to `latest`.
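
The name/tag rule can be sketched in a few lines of Python. This is a simplified illustration, not Docker's actual parser: `parse_image_ref` is a hypothetical helper and it ignores digest references (`@sha256:...`).

```python
def parse_image_ref(ref: str) -> tuple[str, str]:
    """Split an image reference into (name, tag), defaulting the tag to 'latest'."""
    # A colon only introduces a tag when it appears after the last slash;
    # otherwise it belongs to a registry host such as localhost:5000.
    last_slash = ref.rfind("/")
    colon = ref.rfind(":")
    if colon > last_slash:
        return ref[:colon], ref[colon + 1:]
    return ref, "latest"


print(parse_image_ref("test:pandas"))  # ('test', 'pandas')
print(parse_image_ref("test"))         # ('test', 'latest')
```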

We can now run the container and pass an argument to it, so that our pipeline will receive it:

```bash
docker run -it test:pandas some_number
```

You should get the same output you did when you ran the pipeline script by itself.

> Note: these instructions assume that `pipeline.py` and `Dockerfile` are in the same directory. The Docker commands should also be run from the same directory as these files.

## Dockerfile with uv

We can also install the dependencies with uv instead of pip:

```dockerfile
# Start with slim Python 3.13 image
FROM python:3.13.11-slim

# Copy uv binary from official uv image (multi-stage build pattern)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

# Set working directory
WORKDIR /app

# Add virtual environment to PATH so we can use installed packages
ENV PATH="/app/.venv/bin:$PATH"

# Copy dependency files first (better layer caching)
COPY "pyproject.toml" "uv.lock" ".python-version" ./
# Install dependencies from lock file (ensures reproducible builds)
RUN uv sync --locked

# Copy application code
COPY pipeline.py pipeline.py

# Set entry point
ENTRYPOINT ["uv", "run", "python", "pipeline.py"]
```

**[↑ Up](README.md)** | **[← Previous](02-virtual-environment.md)** | **[Next →](04-postgres-docker.md)**


================================================
FILE: 01-docker-terraform/docker-sql/04-postgres-docker.md
================================================
# Running PostgreSQL with Docker

**[↑ Up](README.md)** | **[← Previous](03-dockerizing-pipeline.md)** | **[Next →](05-data-ingestion.md)**

Now we want to do real data engineering. Let's use a Postgres database for that.

You can run a containerized version of Postgres that doesn't require any installation steps. You only need to provide a few _environment variables_ to it as well as a _volume_ for storing data.

## Running PostgreSQL in a Container

Postgres needs somewhere to store its data. We will use a Docker volume named `ny_taxi_postgres_data` for that. Here's how to run the container:

```bash
docker run -it --rm \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  postgres:18
```

### Explanation of Parameters

* `-e` sets environment variables (user, password, database name)
* `-v ny_taxi_postgres_data:/var/lib/postgresql` creates a **named volume**
  * Docker manages this volume automatically
  * Data persists even after container is removed
  * Volume is stored in Docker's internal storage
* `-p 5432:5432` maps port 5432 from container to host
* `postgres:18` uses PostgreSQL version 18 (latest as of Dec 2025)

### Alternative Approach - Bind Mount

First create the directory, then map it:

```bash
mkdir ny_taxi_postgres_data

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  postgres:18
```

### Named Volume vs Bind Mount

* **Named volume** (`name:/path`): Managed by Docker, easier
* **Bind mount** (`/host/path:/container/path`): Direct mapping to host filesystem, more control
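
As a rough rule of thumb, Docker treats the part before the colon as a bind mount when it is an absolute path and as a volume name otherwise. The sketch below illustrates that rule under simplifying assumptions (Unix-style paths only; `classify_mount` is a made-up helper, not part of any Docker API):

```python
def classify_mount(spec: str) -> str:
    """Classify the source of a `-v source:target` argument (simplified)."""
    source = spec.split(":", 1)[0]
    # Assumption: an absolute host path means a bind mount,
    # anything else is taken as the name of a Docker-managed volume.
    if source.startswith("/"):
        return "bind mount"
    return "named volume"


print(classify_mount("ny_taxi_postgres_data:/var/lib/postgresql"))
# named volume
print(classify_mount("/home/user/ny_taxi_postgres_data:/var/lib/postgresql"))
# bind mount
```

This is why the bind mount example uses `$(pwd)/...`: the shell expands it to an absolute path before Docker sees it.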

## Connecting to PostgreSQL

Once the container is running, we can log into our database with [pgcli](https://www.pgcli.com/).

Install pgcli:

```bash
uv add --dev pgcli
```

The `--dev` flag marks this as a development dependency (not needed in production). It will be added to the `[dependency-groups]` section of `pyproject.toml` instead of the main `dependencies` section.

Now use it to connect to Postgres:

```bash
uv run pgcli -h localhost -p 5432 -u root -d ny_taxi
```

* `uv run` executes a command in the context of the virtual environment
* `-h` is the host. Since we're running locally we can use `localhost`.
* `-p` is the port.
* `-u` is the username.
* `-d` is the database name.
* The password is not provided; it will be requested after running the command.

When prompted, enter the password: `root`

## Basic SQL Commands

Try some SQL commands:

```sql
-- List tables
\dt

-- Create a test table
CREATE TABLE test (id INTEGER, name VARCHAR(50));

-- Insert data
INSERT INTO test VALUES (1, 'Hello Docker');

-- Query data
SELECT * FROM test;

-- Exit
\q
```

**[↑ Up](README.md)** | **[← Previous](03-dockerizing-pipeline.md)** | **[Next →](05-data-ingestion.md)**


================================================
FILE: 01-docker-terraform/docker-sql/05-data-ingestion.md
================================================
# NY Taxi Dataset and Data Ingestion

**[↑ Up](README.md)** | **[← Previous](04-postgres-docker.md)** | **[Next →](06-ingestion-script.md)**

We will now create a Jupyter Notebook `notebook.ipynb` file which we will use to read a CSV file and export it to Postgres.

## Setting up Jupyter

Install Jupyter:

```bash
uv add --dev jupyter
```

Start Jupyter, then create a notebook to explore the data:

```bash
uv run jupyter notebook
```

## The NYC Taxi Dataset

We will use data from the [NYC TLC Trip Record Data website](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

Specifically, we will use the [Yellow taxi trip records CSV file for January 2021](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz).

TLC originally published this data as CSV and later switched to Parquet. We keep using the CSV version because it needs a bit of extra pre-processing, which is useful for learning.

A dictionary to understand each field is available [here](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).

> Note: The CSV data is stored as gzipped files. Pandas can read them directly.

## Explore the Data

Create a new notebook and run:

```python
import pandas as pd

# Read a sample of the data
prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/'
df = pd.read_csv(prefix + 'yellow_tripdata_2021-01.csv.gz', nrows=100)

# Display first rows
df.head()

# Check data types
df.dtypes

# Check data shape
df.shape
```
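
The `read_csv` call above works on a `.csv.gz` URL because pandas infers gzip compression from the file extension. To see what that amounts to, here is a small standard-library sketch that decompresses and parses a gzipped CSV by hand. The two rows are made up for illustration, not real trip data:

```python
import csv
import gzip
import io

# Build a tiny gzipped CSV in memory (hypothetical rows, not real trip data).
raw = "VendorID,trip_distance\n1,2.5\n2,0.8\n"
compressed = gzip.compress(raw.encode("utf-8"))

# Decompress on the fly and parse the text as CSV.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as fh:
    rows = list(csv.DictReader(fh))

print(rows[0])  # {'VendorID': '1', 'trip_distance': '2.5'}
```

Note that every value comes back as a string here. Deciding the column types is exactly the part pandas handles for us, which is the topic of the next section.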

### Handling Data Types

When reading the file, pandas may show a warning like the one below. It might not appear with only 100 rows and pop up later on the full file, so it's best to follow the instructions below either way:

```
/tmp/ipykernel_25483/2933316018.py:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
```

So we need to specify the types:

```python
dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}

parse_dates = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime"
]

df = pd.read_csv(
    prefix + 'yellow_tripdata_2021-01.csv.gz',
    nrows=100,
    dtype=dtype,
    parse_dates=parse_dates
)
```

## Ingesting Data into Postgres

In the Jupyter notebook, we create code to:

1. Download the CSV file
2. Read it in chunks with pandas
3. Convert datetime columns
4. Insert data into PostgreSQL using SQLAlchemy

### Install SQLAlchemy

```bash
uv add sqlalchemy "psycopg[binary,pool]"
```

### Create Database Connection

```python
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg://root:root@localhost:5432/ny_taxi')
```
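
The connection string follows the pattern `dialect+driver://user:password@host:port/database`. If the credentials contain special characters such as `@` or `/`, they need to be URL-encoded first. A minimal sketch of assembling the URL, where `build_pg_url` is a hypothetical helper and not part of SQLAlchemy:

```python
from urllib.parse import quote_plus


def build_pg_url(user: str, password: str, host: str, port: int, db: str) -> str:
    """Assemble a SQLAlchemy URL for the psycopg (v3) driver."""
    # Percent-encode credentials so characters like '@' or '/' are not
    # mistaken for URL separators.
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


print(build_pg_url("root", "root", "localhost", 5432, "ny_taxi"))
# postgresql+psycopg://root:root@localhost:5432/ny_taxi
```

The resulting string can be passed to `create_engine` as-is.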

### Get DDL Schema

```python
print(pd.io.sql.get_schema(df, name='yellow_taxi_data', con=engine))
```

Output:

```sql
CREATE TABLE yellow_taxi_data (
    "VendorID" BIGINT,
    tpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE,
    tpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE,
    passenger_count BIGINT,
    trip_distance FLOAT(53),
    "RatecodeID" BIGINT,
    store_and_fwd_flag TEXT,
    "PULocationID" BIGINT,
    "DOLocationID" BIGINT,
    payment_type BIGINT,
    fare_amount FLOAT(53),
    extra FLOAT(53),
    mta_tax FLOAT(53),
    tip_amount FLOAT(53),
    tolls_amount FLOAT(53),
    improvement_surcharge FLOAT(53),
    total_amount FLOAT(53),
    congestion_surcharge FLOAT(53)
)
```

### Create the Table

```python
df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')
```

`head(n=0)` returns an empty DataFrame with the same columns, so this call only creates the table without adding any data yet.

## Ingesting Data in Chunks

We don't want to insert all the data at once. Let's do it in batches and use an iterator for that:

```python
df_iter = pd.read_csv(
    prefix + 'yellow_tripdata_2021-01.csv.gz',
    dtype=dtype,
    parse_dates=parse_dates,
    iterator=True,
    chunksize=100000
)
```

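### Iterate Over Chunks

Each iteration yields a DataFrame with up to 100,000 rows. The loop consumes `df_iter`, so re-create the iterator before running the ingestion loops below.
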
```python
for df_chunk in df_iter:
    print(len(df_chunk))
```
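
Conceptually, chunked reading just hands out consecutive slices of the input until nothing is left. The sketch below shows the idea with plain Python; `iter_chunks` is a made-up helper and this is not how pandas implements `chunksize` internally:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(rows: Iterable[T], chunksize: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `chunksize` items."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


sizes = [len(chunk) for chunk in iter_chunks(range(250), 100)]
print(sizes)  # [100, 100, 50]
```

Only one chunk is held in memory at a time, and the last chunk is simply smaller than the rest.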

### Inserting Data

```python
df_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')
```

### Complete Ingestion Loop

```python
first = True

for df_chunk in df_iter:

    if first:
        # Create table schema (no data)
        df_chunk.head(0).to_sql(
            name="yellow_taxi_data",
            con=engine,
            if_exists="replace"
        )
        first = False
        print("Table created")

    # Insert chunk
    df_chunk.to_sql(
        name="yellow_taxi_data",
        con=engine,
        if_exists="append"
    )

    print("Inserted:", len(df_chunk))
```

### Alternative Approach (Without First Flag)

```python
first_chunk = next(df_iter)

first_chunk.head(0).to_sql(
    name="yellow_taxi_data",
    con=engine,
    if_exists="replace"
)

print("Table created")

first_chunk.to_sql(
    name="yellow_taxi_data",
    con=engine,
    if_exists="append"
)

print("Inserted first chunk:", len(first_chunk))

for df_chunk in df_iter:
    df_chunk.to_sql(
        name="yellow_taxi_data",
        con=engine,
        if_exists="append"
    )
    print("Inserted chunk:", len(df_chunk))
```
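
Both variants follow the same pattern: replace the table once, then append every chunk. To try the pattern without a running Postgres container, here is a sketch that uses stand-ins throughout: an in-memory SQLite database instead of Postgres, and small lists of made-up tuples instead of DataFrame chunks.

```python
import sqlite3

# Hypothetical stand-ins: in-memory SQLite instead of Postgres, and
# lists of (trip_id, total_amount) tuples instead of DataFrame chunks.
chunks = [
    [(1, 10.5), (2, 7.0)],
    [(3, 22.3)],
]

conn = sqlite3.connect(":memory:")
first = True
for chunk in chunks:
    if first:
        # Equivalent of if_exists="replace": start from an empty table.
        conn.execute("DROP TABLE IF EXISTS yellow_taxi_data")
        conn.execute(
            "CREATE TABLE yellow_taxi_data (trip_id INTEGER, total_amount REAL)"
        )
        first = False
    # Equivalent of if_exists="append": add rows to the existing table.
    conn.executemany("INSERT INTO yellow_taxi_data VALUES (?, ?)", chunk)
    conn.commit()

total_rows = conn.execute("SELECT COUNT(*) FROM yellow_taxi_data").fetchone()[0]
print("rows in table:", total_rows)  # rows in table: 3
```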

## Adding Progress Bar

Add `tqdm` to see progress:

```bash
uv add tqdm
```

Put it around the iterable:

```python
from tqdm.auto import tqdm

for df_chunk in tqdm(df_iter):
    ...
```

To see progress as a fraction of the total number of chunks, pass the `total` argument to `tqdm`. The iterator does not know its own length, so the pragmatic option in our scenario is to hardcode a value based on the number of rows in the data and the chunk size.
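
The value for `total` is the row count divided by the chunk size, rounded up. A small sketch, where `total_chunks` is a hypothetical helper and the row count is made up:

```python
import math


def total_chunks(n_rows: int, chunksize: int) -> int:
    """Number of chunks needed to cover n_rows rows at the given chunk size."""
    return math.ceil(n_rows / chunksize)


# 250,001 is a made-up row count used only for illustration.
print(total_chunks(250_001, 100_000))  # 3
```

With the real row count in a variable `n_rows`, the loop would become `for df_chunk in tqdm(df_iter, total=total_chunks(n_rows, 100000)):`.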

## Verify the Data

Connect to the database using pgcli:

```bash
uv run pgcli -h localhost -p 5432 -u root -d ny_taxi
```

And explore the data.

**[↑ Up](README.md)** | **[← Previous](04-postgres-docker.md)** | **[Next →](06-ingestion-script.md)**


================================================
FILE: 01-docker-terraform/docker-sql/06-ingestion-script.md
================================================
# Creating the Data Ingestion Script

**[↑ Up](README.md)** | **[← Previous](05-data-ingestion.md)** | **[Next →](07-pgadmin.md)**

Now let's convert the notebook to a Python script.

## Convert Notebook to Script

```bash
uv run jupyter nbconvert --to=script notebook.ipynb
mv notebook.py ingest_data.py
```

## The Complete Ingestion Script

See the `pipeline/` directory for the complete script with click integration. Here's the core structure:

```python
import pandas as pd
from sqlalchemy import create_engine
from tqdm.auto import tqdm

dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}

parse_dates = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime"
]
```

## Click Integration

The script uses `click` for command-line argument parsing:

```python
import click

@click.command()
@click.option('--pg-user', default='root', help='PostgreSQL user')
@click.option('--pg-pass', default='root', help='PostgreSQL password')
@click.option('--pg-host', default='localhost', help='PostgreSQL host')
@click.option('--pg-port', default=5432, type=int, help='PostgreSQL port')
@click.option('--pg-db', default='ny_taxi', help='PostgreSQL database name')
@click.option('--target-table', default='yellow_taxi_data', help='Target table name')
def run(pg_user, pg_pass, pg_host, pg_port, pg_db, target_table):
    # Ingestion logic here
    pass
```

## Running the Script

The script reads data in chunks (100,000 rows at a time) to handle large files efficiently without running out of memory.

Example usage:

```bash
uv run python ingest_data.py \
  --pg-user=root \
  --pg-pass=root \
  --pg-host=localhost \
  --pg-port=5432 \
  --pg-db=ny_taxi \
  --target-table=yellow_taxi_trips
```

**[↑ Up](README.md)** | **[← Previous](05-data-ingestion.md)** | **[Next →](07-pgadmin.md)**


================================================
FILE: 01-docker-terraform/docker-sql/07-pgadmin.md
================================================
# pgAdmin - Database Management Tool

**[↑ Up](README.md)** | **[← Previous](06-ingestion-script.md)** | **[Next →](08-dockerizing-ingestion.md)**

`pgcli` is a handy tool but it's cumbersome to use for complex queries and database management. [`pgAdmin` is a web-based tool](https://www.pgadmin.org/) that makes it more convenient to access and manage our databases.

It's possible to run pgAdmin as a container along with the Postgres container, but both containers will have to be in the same _virtual network_ so that they can find each other.

## Run pgAdmin Container

```bash
docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -v pgadmin_data:/var/lib/pgadmin \
  -p 8085:80 \
  dpage/pgadmin4
```

The `-v pgadmin_data:/var/lib/pgadmin` volume mapping saves pgAdmin settings (server connections, preferences) so you don't have to reconfigure it every time you restart the container.

### Parameters Explained

* The container needs 2 environment variables: a login email and a password. We use `admin@admin.com` and `root` in this example.
* pgAdmin is a web app and its default port is 80; we map it to port 8085 on localhost to avoid any possible conflicts.
* The actual image name is `dpage/pgadmin4`.

**Note:** This won't work yet because pgAdmin can't see the PostgreSQL container. They need to be on the same Docker network!

## Docker Networks

Let's create a virtual Docker network called `pg-network`:

```bash
docker network create pg-network
```

> You can remove the network later with the command `docker network rm pg-network`. You can look at the existing networks with `docker network ls`.

### Run Containers on the Same Network

Stop both containers and re-run them with the network configuration:

```bash
# Run PostgreSQL on the network
docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  --network=pg-network \
  --name pgdatabase \
  postgres:18

# In another terminal, run pgAdmin on the same network
docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -v pgadmin_data:/var/lib/pgadmin \
  -p 8085:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4
```

* Just like with the Postgres container, we specify a network and a name for pgAdmin.
* The container names (`pgdatabase` and `pgadmin`) allow the containers to find each other within the network.
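
If a connection fails later, it helps to check whether the Postgres port is reachable at all. Here is a minimal sketch using only the standard library; `can_connect` is a made-up helper name:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # From the host, the published port is reachable on localhost. Inside
    # pg-network, another container would use the name pgdatabase instead.
    print("postgres reachable:", can_connect("localhost", 5432))
```

This only tests that something accepts TCP connections on that port, not that the credentials are valid.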

## Connect pgAdmin to PostgreSQL

You should now be able to load pgAdmin on a web browser by browsing to `http://localhost:8085`. Use the same email and password you used for running the container to log in.

1. Open browser and go to `http://localhost:8085`
2. Login with email: `admin@admin.com`, password: `root`
3. Right-click "Servers" → Register → Server
4. Configure:
   - **General tab**: Name: `Local Docker`
   - **Connection tab**:
     - Host: `pgdatabase` (the container name)
     - Port: `5432`
     - Username: `root`
     - Password: `root`
5. Save

Now you can explore the database using the pgAdmin interface!

**[↑ Up](README.md)** | **[← Previous](06-ingestion-script.md)** | **[Next →](08-dockerizing-ingestion.md)**


================================================
FILE: 01-docker-terraform/docker-sql/08-dockerizing-ingestion.md
================================================
# Dockerizing the Ingestion Script

**[↑ Up](README.md)** | **[← Previous](07-pgadmin.md)** | **[Next →](09-docker-compose.md)**

Now let's containerize the ingestion script so we can run it in Docker.

## The Dockerfile

The `pipeline/Dockerfile` shows how to containerize the ingestion script:

```dockerfile
FROM python:3.13.11-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

WORKDIR /code
ENV PATH="/code/.venv/bin:$PATH"

COPY pyproject.toml .python-version uv.lock ./
RUN uv sync --locked

COPY ingest_data.py .

ENTRYPOINT ["python", "ingest_data.py"]
```

### Explanation

- `FROM python:3.13.11-slim`: Start with slim Python 3.13 image for smaller size
- `COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/`: Copy uv binary from official uv image
- `WORKDIR /code`: Set working directory inside container
- `ENV PATH="/code/.venv/bin:$PATH"`: Add virtual environment to PATH
- `COPY pyproject.toml .python-version uv.lock ./`: Copy dependency files first (better caching)
- `RUN uv sync --locked`: Install all dependencies from lock file (ensures reproducible builds)
- `COPY ingest_data.py .`: Copy ingestion script
- `ENTRYPOINT ["python", "ingest_data.py"]`: Set entry point to run the ingestion script; `python` resolves to the virtual environment's interpreter because `.venv/bin` comes first on `PATH`

## Build the Docker Image

```bash
cd pipeline
docker build -t taxi_ingest:v001 .
```

## Run the Containerized Ingestion

```bash
docker run -it \
  --network=pg-network \
  taxi_ingest:v001 \
    --pg-user=root \
    --pg-pass=root \
    --pg-host=pgdatabase \
    --pg-port=5432 \
    --pg-db=ny_taxi \
    --target-table=yellow_taxi_trips
```

### Important Notes

* We need to provide the network for Docker to find the Postgres container. It goes before the name of the image.
* Since Postgres is running on a separate container, the host argument will have to point to the container name of Postgres (`pgdatabase`).
* You can drop the table in pgAdmin beforehand if you want, but the script will automatically replace the pre-existing table.

**[↑ Up](README.md)** | **[← Previous](07-pgadmin.md)** | **[Next →](09-docker-compose.md)**


================================================
FILE: 01-docker-terraform/docker-sql/09-docker-compose.md
================================================
# Docker Compose

**[↑ Up](README.md)** | **[← Previous](08-dockerizing-ingestion.md)** | **[Next →](10-sql-refresher.md)**

`docker-compose` (or `docker compose`, with a space, if you use the Compose v2 plugin) allows us to launch multiple containers using a single configuration file, so that we don't have to run multiple complex `docker run` commands separately.

Docker compose makes use of YAML files. Here's the `docker-compose.yaml` file:

```yaml
services:
  pgdatabase:
    image: postgres:18
    environment:
      POSTGRES_USER: "root"
      POSTGRES_PASSWORD: "root"
      POSTGRES_DB: "ny_taxi"
    volumes:
      - "ny_taxi_postgres_data:/var/lib/postgresql"
    ports:
      - "5432:5432"

  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: "admin@admin.com"
      PGADMIN_DEFAULT_PASSWORD: "root"
    volumes:
      - "pgadmin_data:/var/lib/pgadmin"
    ports:
      - "8085:80"


volumes:
  ny_taxi_postgres_data:
  pgadmin_data:
```

### Explanation

* We don't have to specify a network because `docker compose` takes care of it: every single container (or "service", as the file states) will run within the same network and will be able to find each other according to their names (`pgdatabase` and `pgadmin` in this example).
* All other details from the `docker run` commands (environment variables, volumes and ports) are mentioned accordingly in the file following YAML syntax.

## Start Services with Docker Compose

Make sure the containers from the previous sections aren't running anymore, then run the following command from the directory that contains `docker-compose.yaml`:

```bash
docker-compose up
```

### Detached Mode

To run the containers in the background rather than in the foreground (thus freeing up your terminal), start them in detached mode:

```bash
docker-compose up -d
```

## Stop Services

In foreground mode, pressing `Ctrl+C` stops the containers. The proper way of shutting them down, which also works after starting in detached mode, is with this command:

```bash
docker-compose down
```

## Other Useful Commands

```bash
# View logs
docker-compose logs

# Stop and remove volumes
docker-compose down -v
```

## Benefits of Docker Compose

- Single command to start all services
- Automatic network creation
- Easy configuration management
- Declarative infrastructure

## Running the Ingestion Script with Docker Compose

If you want to re-run the dockerized ingest script when you run Postgres and pgAdmin with `docker compose`, you will have to find the name of the virtual network that Docker compose created for the containers.

```bash
# list the networks to find the one Compose created:
docker network ls

# it's pipeline_default (or similar, based on the directory name)
# now run the script:
docker run -it --rm \
  --network=pipeline_default \
  taxi_ingest:v001 \
    --pg-user=root \
    --pg-pass=root \
    --pg-host=pgdatabase \
    --pg-port=5432 \
    --pg-db=ny_taxi \
    --target-table=yellow_taxi_trips
```
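
The network name comes from the Compose project name, which defaults to the name of the directory containing `docker-compose.yaml`. The sketch below approximates that derivation. The normalization rule (lowercase, drop characters other than letters, digits, `_` and `-`) is a simplification, and both the helper `default_network_name` and the example path are made up, so treat the output of `docker network ls` as the source of truth:

```python
import re
from pathlib import PurePosixPath


def default_network_name(project_dir: str) -> str:
    """Approximate the default network name Compose derives from a directory."""
    # Assumption: the project name is the directory's base name, lowercased,
    # with characters other than a-z, 0-9, '_' and '-' dropped.
    project = re.sub(r"[^a-z0-9_-]", "", PurePosixPath(project_dir).name.lower())
    return f"{project}_default"


print(default_network_name("/home/user/pipeline"))  # pipeline_default
```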

**[↑ Up](README.md)** | **[← Previous](08-dockerizing-ingestion.md)** | **[Next →](10-sql-refresher.md)**


================================================
FILE: 01-docker-terraform/docker-sql/10-sql-refresher.md
================================================
# SQL Refresher

**[↑ Up](README.md)** | **[← Previous](09-docker-compose.md)** | **[Next →](11-cleanup.md)**

[![](https://markdown-videos-api.jorgenkh.no/youtube/QEcps_iskgg)](https://youtu.be/QEcps_iskgg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10)

Prerequisites: if you followed the course in the given order, Docker Compose should already be running with the `pgdatabase` and `pgadmin` services.

You can then go to http://localhost:8085/browser/ to access pgAdmin.
If you don't see a new table, right-click the server or database and refresh it.

Now start querying!

## Inner Joins

### Implicit INNER JOIN

Joining Yellow Taxi table with Zones Lookup table (implicit INNER JOIN):

```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t,
    zones zpu,
    zones zdo
WHERE
    t."PULocationID" = zpu."LocationID"
    AND t."DOLocationID" = zdo."LocationID"
LIMIT 100;
```

### Explicit INNER JOIN

```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
JOIN
-- JOIN is shorthand for INNER JOIN; PostgreSQL treats the two as identical
    zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN
    zones zdo ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;
```
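
The implicit and explicit forms return the same rows. You can check this on a toy example without the taxi data. The sketch below uses an in-memory SQLite database as a stand-in for Postgres, and two tiny made-up tables that mirror only the columns used in the join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tiny made-up tables that mirror only the columns used in the join.
conn.executescript("""
    CREATE TABLE zones ("LocationID" INTEGER, "Zone" TEXT);
    CREATE TABLE yellow_taxi_trips ("PULocationID" INTEGER, total_amount REAL);
    INSERT INTO zones VALUES (1, 'Zone A'), (2, 'Zone B');
    INSERT INTO yellow_taxi_trips VALUES (1, 10.0), (2, 20.0), (3, 30.0);
""")

implicit = conn.execute("""
    SELECT t.total_amount, z."Zone"
    FROM yellow_taxi_trips t, zones z
    WHERE t."PULocationID" = z."LocationID"
    ORDER BY t.total_amount
""").fetchall()

explicit = conn.execute("""
    SELECT t.total_amount, z."Zone"
    FROM yellow_taxi_trips t
    JOIN zones z ON t."PULocationID" = z."LocationID"
    ORDER BY t.total_amount
""").fetchall()

print(implicit == explicit)  # True
print(explicit)              # [(10.0, 'Zone A'), (20.0, 'Zone B')]
```

The trip with location ID 3 has no matching zone, so an inner join drops it in both forms.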

## Data Quality Checks

### Checking for NULL Location IDs

```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_trips
WHERE
    "PULocationID" IS NULL
    OR "DOLocationID" IS NULL
LIMIT 100;
```

### Checking for Location IDs NOT IN Zones Table

```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_trips
WHERE
    "DOLocationID" NOT IN (SELECT "LocationID" from zones)
    OR "PULocationID" NOT IN (SELECT "LocationID" from zones)
LIMIT 100;
```

## LEFT, RIGHT, and OUTER JOINS

LEFT, RIGHT, and FULL OUTER JOINs differ in which rows they keep when a Location ID is missing from one of the tables. To create such a case, we first delete a zone:

```sql
DELETE FROM zones WHERE "LocationID" = 142;

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
LEFT JOIN
    zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN
    zones zdo ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;
```

```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
RIGHT JOIN
    zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN
    zones zdo ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;
```

```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
FULL OUTER JOIN
    zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN
    zones zdo ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;
```
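
To see the effect of the deleted zone on a small scale, the sketch below compares an inner join with a LEFT JOIN. It again uses in-memory SQLite and made-up rows as stand-ins. RIGHT and FULL OUTER JOIN are left out because older SQLite versions do not support them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up rows; zone 142 is deleted to mimic the example above.
conn.executescript("""
    CREATE TABLE zones ("LocationID" INTEGER, "Zone" TEXT);
    CREATE TABLE yellow_taxi_trips ("PULocationID" INTEGER, total_amount REAL);
    INSERT INTO zones VALUES (141, 'Zone A'), (142, 'Zone B');
    INSERT INTO yellow_taxi_trips VALUES (141, 10.0), (142, 20.0);
    DELETE FROM zones WHERE "LocationID" = 142;
""")

query = """
    SELECT t.total_amount, z."Zone"
    FROM yellow_taxi_trips t
    {join} zones z ON t."PULocationID" = z."LocationID"
    ORDER BY t.total_amount
"""
inner_rows = conn.execute(query.format(join="JOIN")).fetchall()
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()

print(inner_rows)  # [(10.0, 'Zone A')]
print(left_rows)   # [(10.0, 'Zone A'), (20.0, None)]
```

The LEFT JOIN keeps the trip whose zone is missing and fills the zone columns with NULL (`None` in Python), while the inner join drops it.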

## GROUP BY

### Calculate Number of Trips Per Day

```sql
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1)
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
LIMIT 100;
```

## ORDER BY

### Ordering by Day

```sql
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1)
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
ORDER BY
    "day" ASC
LIMIT 100;
```

### Ordering by Count

```sql
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1) AS "count"
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
ORDER BY
    "count" DESC
LIMIT 100;
```
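
If it helps to see what this query computes, here is the same logic in plain Python on a few made-up timestamps. This illustrates the grouping and ordering only, not how Postgres executes the query:

```python
from collections import Counter
from datetime import datetime

# Made-up drop-off timestamps standing in for tpep_dropoff_datetime.
dropoffs = [
    datetime(2021, 1, 1, 0, 36),
    datetime(2021, 1, 2, 9, 15),
    datetime(2021, 1, 2, 18, 40),
]

# CAST(... AS DATE) keeps only the calendar day; GROUP BY + COUNT tallies per day.
trips_per_day = Counter(ts.date() for ts in dropoffs)

# ORDER BY "count" DESC
for day, count in trips_per_day.most_common():
    print(day, count)
# 2021-01-02 2
# 2021-01-01 1
```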

## Other Aggregations

```sql
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1) AS "count",
    MAX(total_amount) AS "total_amount",
    MAX(passenger_count) AS "passenger_count"
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
ORDER BY
    "count" DESC
LIMIT 100;
```

## Grouping by Multiple Fields

```sql
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    "DOLocationID",
    COUNT(1) AS "count",
    MAX(total_amount) AS "total_amount",
    MAX(passenger_count) AS "passenger_count"
FROM
    yellow_taxi_trips
GROUP BY
    1, 2
ORDER BY
    "day" ASC,
    "DOLocationID" ASC
LIMIT 100;
```

**[↑ Up](README.md)** | **[← Previous](09-docker-compose.md)** | **[Next →](11-cleanup.md)**


================================================
FILE: 01-docker-terraform/docker-sql/11-cleanup.md
================================================
# Cleanup

**[↑ Up](README.md)** | **[← Previous](10-sql-refresher.md)** | **[Next →](../README.md)**

When you're done with the workshop, clean up Docker resources to free up disk space.

## Stop All Running Containers

```bash
docker-compose down
```

## Remove Specific Containers

```bash
# List all containers
docker ps -a

# Remove specific container
docker rm <container_id>

# Remove all stopped containers
docker container prune
```

## Remove Docker Images

```bash
# List all images
docker images

# Remove specific image
docker rmi taxi_ingest:v001

# Remove all unused images
docker image prune -a
```

## Remove Docker Volumes

```bash
# List volumes
docker volume ls

# Remove specific volumes
docker volume rm ny_taxi_postgres_data
docker volume rm pgadmin_data

# Remove unused volumes (recent Docker versions only prune anonymous ones unless you add --all)
docker volume prune
```

## Remove Docker Networks

```bash
# List networks
docker network ls

# Remove specific network
docker network rm pg-network

# Remove all unused networks
docker network prune
```

## Complete Cleanup

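This removes every unused Docker resource on your machine (stopped containers, unused networks, images, and volumes), not just the ones from this workshop. Use with caution!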

```bash
# ⚠️ Warning: This removes ALL unused Docker resources, not only the workshop's!
docker system prune -a --volumes
```

## Clean Up Local Files

```bash
# Remove parquet files
rm *.parquet

# Remove Python cache
rm -rf __pycache__ .pytest_cache

# Remove virtual environment (uv creates it as .venv)
rm -rf .venv
```

---

That's all for today. Happy learning! 🐳📊

**[↑ Up](README.md)** | **[← Previous](10-sql-refresher.md)** | **[Next →](../README.md)**


================================================
FILE: 01-docker-terraform/docker-sql/README.md
================================================
# Docker and PostgreSQL: Data Engineering Workshop

* Video: [link](https://www.youtube.com/watch?v=lP8xXebHmuE)
* Slides: [link](https://docs.google.com/presentation/d/19pXcInDwBnlvKWCukP5sDoCAb69SPqgIoxJ_0Bikr00/edit?usp=sharing)
* Code: [pipeline/](pipeline/)

In this workshop, we will explore Docker fundamentals and data engineering workflows using Docker containers. This workshop is part of Module 1 of the [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp).

**Data Engineering** is the design and development of systems for collecting, storing and analyzing data at scale.

## Prerequisites

- Basic understanding of Python
- Basic SQL knowledge (helpful but not required)
- Docker and Python installed on your machine
- Git (optional)

## Workshop Contents

1. [Introduction to Docker](01-introduction.md) - What is Docker, why use it, basic commands
2. [Virtual Environments and Data Pipelines](02-virtual-environment.md) - Setting up Python environments with uv
3. [Dockerizing the Pipeline](03-dockerizing-pipeline.md) - Creating a Dockerfile for a simple pipeline
4. [Running PostgreSQL with Docker](04-postgres-docker.md) - Dockerizing PostgreSQL database
5. [NY Taxi Dataset and Data Ingestion](05-data-ingestion.md) - Working with real data, pandas, SQLAlchemy
6. [Creating the Data Ingestion Script](06-ingestion-script.md) - Converting notebook to Python script
7. [pgAdmin - Database Management Tool](07-pgadmin.md) - Web-based database management
8. [Dockerizing the Ingestion Script](08-dockerizing-ingestion.md) - Containerizing the pipeline
9. [Docker Compose](09-docker-compose.md) - Multi-container orchestration
10. [SQL Refresher](10-sql-refresher.md) - SQL joins, aggregations, and queries
11. [Cleanup](11-cleanup.md) - Cleaning up Docker resources


================================================
FILE: 01-docker-terraform/docker-sql/pipeline/.python-version
================================================
3.13


================================================
FILE: 01-docker-terraform/docker-sql/pipeline/Dockerfile
================================================
FROM python:3.13.11-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

WORKDIR /code
ENV PATH="/code/.venv/bin:$PATH"

COPY pyproject.toml .python-version uv.lock ./
RUN uv sync --locked

COPY ingest_data.py .

ENTRYPOINT ["python", "ingest_data.py"]

================================================
FILE: 01-docker-terraform/docker-sql/pipeline/docker-compose.yaml
================================================
services:
  pgdatabase:
    image: postgres:18
    environment:
      POSTGRES_USER: "root"
      POSTGRES_PASSWORD: "root"
      POSTGRES_DB: "ny_taxi"
    volumes:
      - ny_taxi_postgres_data:/var/lib/postgresql
    ports:
      - "5432:5432"

  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: "admin@admin.com"
      PGADMIN_DEFAULT_PASSWORD: "root"
    volumes:
      - pgadmin_data:/var/lib/pgadmin
    ports:
      - "8085:80"


volumes:
  ny_taxi_postgres_data:
  pgadmin_data:


================================================
FILE: 01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-ingest.sh
================================================
#!/usr/bin/env bash

## bash script to run the ingestion container
echo "Running data ingestion for January 2021..."

docker run -it --rm \
  --network=pg-network \
  taxi_ingest:v001 \
  --year=2021 \
  --month=1 \
  --pg-user=root \
  --pg-pass=root \
  --pg-host=pgdatabase \
  --pg-port=5432 \
  --pg-db=ny_taxi \
  --chunksize=100000 \
  --target-table=yellow_taxi_trips

================================================
FILE: 01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-pgadmin.sh
================================================
#!/usr/bin/env bash

## bash script to start pgadmin
echo "Starting pgAdmin container..."
mkdir -p ../pgadmin_data

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -v "$(pwd)/../pgadmin_data":/var/lib/pgadmin \
  -p 8085:80 \
  --network=pg-network \
  --name pgadmin \
  dpage/pgadmin4

================================================
FILE: 01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-postgres.sh
================================================
#!/usr/bin/env bash

## bash script to start the Postgres container
mkdir -p ../ny_taxi_postgres_data

echo "Starting PostgreSQL container..."

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v "$(pwd)/../ny_taxi_postgres_data":/var/lib/postgresql \
  -p 5432:5432 \
  --network=pg-network \
  --name pgdatabase \
  postgres:18

# to use the pgcli
# pgcli -h localhost -p 5432 -u root -d ny_taxi

================================================
FILE: 01-docker-terraform/docker-sql/pipeline/ingest_data.py
================================================
#!/usr/bin/env python
# coding: utf-8

import click
import pandas as pd
from sqlalchemy import create_engine
from tqdm.auto import tqdm

dtype = {
    "VendorID": "Int64",
    "passenger_count": "Int64",
    "trip_distance": "float64",
    "RatecodeID": "Int64",
    "store_and_fwd_flag": "string",
    "PULocationID": "Int64",
    "DOLocationID": "Int64",
    "payment_type": "Int64",
    "fare_amount": "float64",
    "extra": "float64",
    "mta_tax": "float64",
    "tip_amount": "float64",
    "tolls_amount": "float64",
    "improvement_surcharge": "float64",
    "total_amount": "float64",
    "congestion_surcharge": "float64"
}

parse_dates = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime"
]


@click.command()
@click.option('--pg-user', default='root', help='PostgreSQL user')
@click.option('--pg-pass', default='root', help='PostgreSQL password')
@click.option('--pg-host', default='localhost', help='PostgreSQL host')
@click.option('--pg-port', default=5432, type=int, help='PostgreSQL port')
@click.option('--pg-db', default='ny_taxi', help='PostgreSQL database name')
@click.option('--year', default=2021, type=int, help='Year of the data')
@click.option('--month', default=1, type=int, help='Month of the data')
@click.option('--target-table', default='yellow_taxi_data', help='Target table name')
@click.option('--chunksize', default=100000, type=int, help='Chunk size for reading CSV')
def run(pg_user, pg_pass, pg_host, pg_port, pg_db, year, month, target_table, chunksize):
    """Ingest NYC taxi data into PostgreSQL database."""
    prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow'
    url = f'{prefix}/yellow_tripdata_{year}-{month:02d}.csv.gz'

    # pyproject.toml installs psycopg2-binary, so use the psycopg2 driver explicitly
    engine = create_engine(f'postgresql+psycopg2://{pg_user}:{pg_pass}@{pg_host}:{pg_port}/{pg_db}')

    df_iter = pd.read_csv(
        url,
        dtype=dtype,
        parse_dates=parse_dates,
        iterator=True,
        chunksize=chunksize,
    )

    first = True

    for df_chunk in tqdm(df_iter):
        if first:
            df_chunk.head(0).to_sql(
                name=target_table,
                con=engine,
                if_exists='replace'
            )
            first = False

        df_chunk.to_sql(
            name=target_table,
            con=engine,
            if_exists='append'
        )

if __name__ == '__main__':
    run()


================================================
FILE: 01-docker-terraform/docker-sql/pipeline/pyproject.toml
================================================
[project]
name = "pipeline"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "click>=8.3.1",
    "pandas>=2.3.3",
    "psycopg2-binary>=2.9.11",
    "pyarrow>=22.0.0",
    "sqlalchemy>=2.0.44",
    "tqdm>=4.67.1",
]

[dependency-groups]
dev = [
    "jupyter>=1.1.1",
    "pgcli>=4.3.0",
]


================================================
FILE: 01-docker-terraform/terraform/1_terraform_overview.md
================================================
## Terraform Overview

[Video](https://www.youtube.com/watch?v=18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=2)

### Concepts

#### Introduction

1. What is [Terraform](https://www.terraform.io)?
   * open-source tool by [HashiCorp](https://www.hashicorp.com), used for provisioning infrastructure resources
   * supports DevOps best practices for change management
   * manages configuration files in source control to maintain an ideal provisioning state
     for testing and production environments
2. What is IaC?
   * Infrastructure-as-Code
   * build, change, and manage your infrastructure in a safe, consistent, and repeatable way 
     by defining resource configurations that you can version, reuse, and share.
3. Some advantages
   * Infrastructure lifecycle management
   * Version control commits
   * Very useful for stack-based deployments, and with cloud providers such as AWS, GCP, Azure, K8S…
   * State-based approach to track resource changes throughout deployments


#### Files

* `main.tf`
* `variables.tf`
* Optional: `resources.tf`, `output.tf`
* `.tfstate`

#### Declarations
* `terraform`: configure basic Terraform settings to provision your infrastructure
   * `required_version`: minimum Terraform version to apply to your configuration
   * `backend`: stores Terraform's "state" snapshots, to map real-world resources to your configuration.
      * `local`: stores state file locally as `terraform.tfstate`
   * `required_providers`: specifies the providers required by the current module
* `provider`:
   * adds a set of resource types and/or data sources that Terraform can manage
   * The Terraform Registry is the main directory of publicly available providers from most major infrastructure platforms.
* `resource`
  * blocks to define components of your infrastructure
  * Project modules/resources: google_storage_bucket, google_bigquery_dataset, google_bigquery_table
* `variable` & `locals`
  * runtime arguments and constants


#### Execution steps
1. `terraform init`:
    * Initializes and configures the backend, installs plugins/providers, and checks out an existing configuration from version control
2. `terraform plan`:
    * Compares local changes against the remote state and proposes an execution plan.
3. `terraform apply`:
    * Asks for approval of the proposed plan and applies the changes to the cloud
4. `terraform destroy`
    * Removes your stack from the Cloud
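
To make the state-based idea behind `terraform plan` more concrete, here is a toy Python sketch: it compares a desired configuration with a recorded state and lists what would be created, updated, or deleted. This is only a conceptual illustration, not Terraform's actual algorithm; the `plan` function, the dictionary layout, and the resource addresses are hypothetical.

```python
def plan(desired: dict, state: dict) -> dict:
    """Compare desired resources with the recorded state and propose actions.

    Both arguments map a resource address to a dict of attributes.
    This layout is an assumption made for illustration only.
    """
    actions = {"create": [], "update": [], "delete": []}
    for address, attrs in desired.items():
        if address not in state:
            actions["create"].append(address)   # declared but not yet provisioned
        elif state[address] != attrs:
            actions["update"].append(address)   # provisioned with different attributes
    for address in state:
        if address not in desired:
            actions["delete"].append(address)   # provisioned but no longer declared
    return actions


desired = {
    "google_storage_bucket.demo": {"storage_class": "STANDARD"},
    "google_bigquery_dataset.demo": {"location": "US"},
}
state = {
    "google_storage_bucket.demo": {"storage_class": "NEARLINE"},
    "google_bigquery_table.old": {"dataset": "demo"},
}
proposed = plan(desired, state)
print(proposed)
```

Nothing is changed by computing the plan; applying it would be a separate step, which mirrors the split between `terraform plan` and `terraform apply`.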


### Terraform Workshop to create GCP Infra
Continue [here](./terraform): `01-docker-terraform/terraform/terraform`


### References
https://learn.hashicorp.com/collections/terraform/gcp-get-started


================================================
FILE: 01-docker-terraform/terraform/2_gcp_overview.md
================================================
## GCP Overview

[Video](https://www.youtube.com/watch?v=18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=2)


### Project infrastructure modules in GCP:
* Google Cloud Storage (GCS): Data Lake
* BigQuery: Data Warehouse

(Concepts explained in Week 2 - Data Ingestion)

### Initial Setup

For this course, we'll use a free version (up to EUR 300 credits).

1. Create an account with your Google email ID 
2. Set up your first [project](https://console.cloud.google.com/) if you haven't already
    * e.g. "DTC DE Course"; note down the "Project ID" (we'll use this later when deploying infra with TF)
3. Set up a [service account & authentication](https://cloud.google.com/docs/authentication/getting-started) for this project
    * Grant `Viewer` role to begin with.
    * Download service-account-keys (.json) for auth.
4. Download [SDK](https://cloud.google.com/sdk/docs/quickstart) for local setup
5. Set environment variable to point to your downloaded GCP keys:
   ```shell
   export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
   
   # Refresh token/session, and verify authentication
   gcloud auth application-default login
   ```
   
### Setup for Access
 
1. [IAM Roles](https://cloud.google.com/storage/docs/access-control/iam-roles) for Service account:
   * Go to the *IAM* section of *IAM & Admin* https://console.cloud.google.com/iam-admin/iam
   * Click the *Edit principal* icon for your service account.
   * Add these roles in addition to *Viewer* : **Storage Admin** + **Storage Object Admin** + **BigQuery Admin**
   
2. Enable these APIs for your project:
   * https://console.cloud.google.com/apis/library/iam.googleapis.com
   * https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com
   
3. Please ensure `GOOGLE_APPLICATION_CREDENTIALS` env-var is set.
   ```shell
   export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
   ```
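
Before running Terraform, it can help to confirm that the variable really points at a usable key file. The following is a minimal sketch, assuming a standard service-account JSON key with the `type`, `project_id`, `client_email`, and `private_key` fields; the `check_credentials` helper is hypothetical and only inspects the file locally, it does not contact GCP.

```python
import json
import os


def check_credentials(env=os.environ) -> list:
    """Return a list of problems found with GOOGLE_APPLICATION_CREDENTIALS."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    if not os.path.isfile(path):
        return [f"key file not found: {path}"]
    try:
        with open(path, encoding="utf-8") as f:
            key = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read key file: {exc}"]
    if not isinstance(key, dict):
        return ["key file does not contain a JSON object"]

    problems = []
    if key.get("type") != "service_account":
        problems.append("key file 'type' is not 'service_account'")
    for field in ("project_id", "client_email", "private_key"):
        if not key.get(field):
            problems.append(f"key file is missing '{field}'")
    return problems


if __name__ == "__main__":
    # An empty list means no problems were detected.
    for problem in check_credentials() or ["credentials look plausible"]:
        print(problem)
```

An empty result only means the file looks well-formed; whether the service account has the roles listed above still has to be checked in the console.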
 
### Terraform Workshop to create GCP Infra
Continue [here](./terraform): `01-docker-terraform/terraform/terraform`


================================================
FILE: 01-docker-terraform/terraform/README.md
================================================
## Local Setup for Terraform and GCP

### Pre-Requisites
1. Terraform client installation: https://www.terraform.io/downloads
2. Cloud Provider account: https://console.cloud.google.com/ 

### Terraform Concepts
[Terraform Overview](1_terraform_overview.md)

### GCP setup

1. [Setup for First-time](2_gcp_overview.md#initial-setup)
    * [Only for Windows](windows.md) - Steps 4 & 5
2. [IAM / Access specific to this course](2_gcp_overview.md#setup-for-access)

### Terraform Workshop for GCP Infra
Your setup is ready!
Now head to the [terraform](terraform) directory, and perform the execution steps to create your infrastructure.


================================================
FILE: 01-docker-terraform/terraform/terraform/README.md
================================================
### Concepts
* [Terraform_overview](../1_terraform_overview.md)
* If you were unable to generate a service account keyfile due to organizational policies, refer to the instructions [below](#fallback)

### Execution

```shell
# Refresh service-account's auth-token for this session
gcloud auth application-default login

# Initialize state file (.tfstate)
terraform init

# Check changes to new infra plan
terraform plan -var="project=<your-gcp-project-id>"
```

```shell
# Create new infra
terraform apply -var="project=<your-gcp-project-id>"
```

```shell
# Delete infra after your work, to avoid costs on any running services
terraform destroy
```
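
If you would rather drive the `init`, `plan`, and `apply` steps from a script, the sequence above can be sketched in Python. This is a minimal sketch, assuming `terraform` is on your `PATH`; the helper names `terraform_commands` and `run_all` and the project ID are placeholders, and `destroy` is deliberately left out so that nothing is torn down by accident.

```python
import subprocess


def terraform_commands(project_id: str) -> list:
    """Build the argv lists for the init/plan/apply sequence."""
    var = f"-var=project={project_id}"
    return [
        ["terraform", "init"],
        ["terraform", "plan", var],
        ["terraform", "apply", var],
    ]


def run_all(project_id: str) -> None:
    """Run each step in order, stopping at the first failure."""
    for argv in terraform_commands(project_id):
        print("+", " ".join(argv))
        # check=True raises CalledProcessError if the command exits non-zero
        subprocess.run(argv, check=True)


# Placeholder project ID: replace it with your own before calling run_all().
commands = terraform_commands("my-project-id")
for argv in commands:
    print(" ".join(argv))
```

Passing the arguments as a list avoids shell quoting issues, and `terraform apply` still asks for interactive approval because the child process inherits the terminal.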

### Warning
Remember to use a [proper gitignore](https://github.com/github/gitignore/blob/main/Terraform.gitignore) file before publishing your code on GitHub

### Fallback
1. Give yourself the token creator role on the pertinent service account
    ```bash
    gcloud iam service-accounts add-iam-policy-binding \
        <SERVICE_ACCOUNT_EMAIL> \
        --member="user:YOUR_EMAIL@gmail.com" \
        --role="roles/iam.serviceAccountTokenCreator"
    ```
2. Add the blocks that follow the first `provider` block below to your main Terraform configuration
   ```terraform
    # Connect to gcp using ADC (identity verification)
    provider "google" {
      project = var.project
      region  = var.region
      zone    = var.zone
    }

    /* add these data blocks */
    
    # This data source gets a temporary token for the service account
    data "google_service_account_access_token" "default" {
      provider               = google
      target_service_account = "<SERVICE_ACCOUNT_EMAIL>"
      scopes                 = ["https://www.googleapis.com/auth/cloud-platform"]
      lifetime               = "3600s"
    }
    
    # This second provider block uses that temporary token and does the real work
    provider "google" {
      alias        = "impersonated"
      access_token = data.google_service_account_access_token.default.access_token
      project      = var.project
      region       = var.region
      zone         = var.zone
    }
   ```

3. Now, you can follow the instructions [above](#execution). Note that a resource only uses the aliased provider if it sets `provider = google.impersonated`.


================================================
FILE: 01-docker-terraform/terraform/terraform/terraform_basic/main.tf
================================================
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.51.0"
    }
  }
}

provider "google" {
# Credentials only needs to be set if you do not have the GOOGLE_APPLICATION_CREDENTIALS set
#  credentials = 
  project = "<Your Project ID>"
  region  = "us-central1"
}


resource "google_storage_bucket" "data-lake-bucket" {
  name          = "<Your Unique Bucket Name>"
  location      = "US"

  # Optional, but recommended settings:
  storage_class = "STANDARD"
  uniform_bucket_level_access = true

  versioning {
    enabled     = true
  }

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      age = 30  // days
    }
  }

  force_destroy = true
}


resource "google_bigquery_dataset" "dataset" {
  dataset_id = "<The Dataset Name You Want to Use>"
  project    = "<Your Project ID>"
  location   = "US"
}

================================================
FILE: 01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/README.md
================================================
# AWS Terraform Data Lake (GCP Equivalent)

## 📌 Overview

This repository contains an **AWS-based Terraform implementation** that mirrors the **Google Cloud Platform (GCP)** infrastructure used in the Data Engineering course (e.g. GCS + BigQuery), but implemented using **AWS services**.

The goal is to help learners who:
- Are enrolled in a **GCP-focused Data Engineering course**
- Prefer or need to work with **AWS**
- Want to understand **cloud-agnostic data engineering concepts**

This setup focuses on building a **basic data lake foundation** using:
- **Amazon S3** (equivalent to GCS)
- **AWS Glue Data Catalog** (equivalent to BigQuery datasets / metadata layer)
- **Terraform** as Infrastructure as Code (IaC)

---

## 🏗️ Architecture Mapping (GCP → AWS)

| GCP Service | AWS Equivalent | Purpose |
|------------|---------------|---------|
| Google Cloud Storage (GCS) | Amazon S3 | Data Lake storage |
| Uniform Bucket Level Access | S3 Public Access Block | Secure bucket access |
| Object Lifecycle Rules | S3 Lifecycle Configuration | Automatic data expiration |
| BigQuery Dataset | AWS Glue Catalog Database | Metadata & query layer |
| Terraform (GCP provider) | Terraform (AWS provider) | Infrastructure as Code |

---

## 📁 Project Structure

```text
.
├── main.tf            # Core infrastructure resources
├── variables.tf       # Input variable definitions
├── terraform.tfvars   # Environment-specific values
└── README.md          # Project documentation
```


================================================
FILE: 01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/main.tf
================================================
terraform {
    required_providers {
        aws = {
            source  = "hashicorp/aws"
            version = "~> 5.0"
        }
    }
}

provider "aws" {
    region = var.aws_region
}

# S3 bucket to store data (equivalent to a GCS bucket in GCP)
resource "aws_s3_bucket" "data_lake_bucket" {
  bucket        = var.bucket_name
  force_destroy = true
}

# Bucket versioning
resource "aws_s3_bucket_versioning" "versioning" {
  bucket = aws_s3_bucket.data_lake_bucket.id # Reference the S3 bucket created above

  versioning_configuration {
    status = "Enabled" # Enable versioning
  }
}

# "Uniform bucket level access" ~ controlled via policy/ACL; recommended: block public access
resource "aws_s3_bucket_public_access_block" "block_public_access" {
  bucket = aws_s3_bucket.data_lake_bucket.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lifecycle: delete objects older than 30 days (equivalent to lifecycle_rule age=30)
resource "aws_s3_bucket_lifecycle_configuration" "lifecycle_rules" {
  bucket = aws_s3_bucket.data_lake_bucket.id

  rule {
    id     = "Delete_old_older_than_30_days"
    status = "Enabled"

    expiration {
      days = 30
    }
    filter {
      prefix = "" # Apply to all objects in the bucket
    }
  }
}

resource "aws_glue_catalog_database" "dataset" {
  name = var.dataset_name
}


================================================
FILE: 01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/terraform.tfvars
================================================
bucket_name  = "my-unique-data-lake-bucket-12345"
dataset_name = "ny_taxi_dataset"


================================================
FILE: 01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/variables.tf
================================================
# Specifies the geographic location for AWS resource deployment.
# Defaulting to Stockholm (eu-north-1) to keep latency low for European users.
variable "aws_region" {
  description = "AWS region to deploy resources in"
  type        = string
  default     = "eu-north-1"
}

# The unique identifier for the S3 bucket where raw data will be stored.
# S3 bucket names must be globally unique across all AWS accounts.
variable "bucket_name" {
  description = "Name of the S3 bucket"
  type        = string
  default     = "data-engineering-zoomcamp-1568692036"
}

# Defines the logical grouping for metadata in the AWS Glue Catalog.
# This allows tools like Athena to query the S3 data using SQL.
variable "dataset_name" {
  description = "Glue Catalog database name (logical dataset for Athena/Glue)"
  type        = string
  default     = "ny_taxi_database"
}


================================================
FILE: 01-docker-terraform/terraform/terraform/terraform_with_variables/main.tf
================================================
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "5.6.0"
    }
  }
}

provider "google" {
  credentials = file(var.credentials)
  project     = var.project
  region      = var.region
}


resource "google_storage_bucket" "demo-bucket" {
  name          = var.gcs_bucket_name
  location      = var.location
  force_destroy = true


  lifecycle_rule {
    condition {
      age = 1
    }
    action {
      type = "AbortIncompleteMultipartUpload"
    }
  }
}


resource "google_bigquery_dataset" "demo_dataset" {
  dataset_id = var.bq_dataset_name
  location   = var.location
}

================================================
FILE: 01-docker-terraform/terraform/terraform/terraform_with_variables/variables.tf
================================================
variable "credentials" {
  description = "My Credentials"
  default     = "<Path to your Service Account json file>"
  #ex: if you have a directory where this file is called keys with your service account json file
  #saved there as my-creds.json you could use default = "./keys/my-creds.json"
}


variable "project" {
  description = "Project"
  default     = "<Your Project ID>"
}

variable "region" {
  description = "Region"
  #Update the below to your desired region
  default     = "us-central1"
}

variable "location" {
  description = "Project Location"
  #Update the below to your desired location
  default     = "US"
}

variable "bq_dataset_name" {
  description = "My BigQuery Dataset Name"
  #Update the below to what you want your dataset to be called
  default     = "demo_dataset"
}

variable "gcs_bucket_name" {
  description = "My Storage Bucket Name"
  #Update the below to a unique bucket name
  default     = "terraform-demo-terra-bucket"
}

variable "gcs_storage_class" {
  description = "Bucket Storage Class"
  default     = "STANDARD"
}

================================================
FILE: 01-docker-terraform/terraform/windows.md
================================================
## GCP and Terraform on Windows

You don't need these instructions if you use WSL. They are only for "plain Windows".

### Google Cloud SDK

* For this tutorial, you'll need a Linux-like environment, e.g. [GitBash](https://gitforwindows.org/), [MinGW](https://www.mingw-w64.org/) or [cygwin](https://www.cygwin.com/)
  * PowerShell should also work, but will require adjustments
* Download SDK in zip: https://dl.google.com/dl/cloudsdk/channels/rapid/google-cloud-sdk.zip
  * source: https://cloud.google.com/sdk/docs/downloads-interactive
* Unzip it and run the `install.sh` script

When installing it, you might see something like this:

```
The installer is unable to automatically update your system PATH. Please add
  C:\tools\google-cloud-sdk\bin
```

* To fix that, adjust your `.bashrc` to include this in `PATH` ([instructions](https://unix.stackexchange.com/questions/26047/how-to-correctly-add-a-path-to-path))
* You can also do it system-wide ([instructions](https://gist.github.com/nex3/c395b2f8fd4b02068be37c961301caa7))

Now we need to point it to the correct Python installation. Assuming you use [Anaconda](https://www.anaconda.com/products/individual):

```bash
export CLOUDSDK_PYTHON=~/Anaconda3/python
```

Now let's check that it works:

```bash
$ gcloud version
Google Cloud SDK 367.0.0
bq 2.0.72
core 2021.12.10
gsutil 5.5
```

### Google Cloud SDK Authentication 

* Now create a service account and generate keys as shown in the videos
* Download the key and put it in some location, e.g. `.gc/ny-rides.json`
* Set `GOOGLE_APPLICATION_CREDENTIALS` to point to the file

```bash
export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/ny-rides.json
```

Now authenticate: 

```bash
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
```

Alternatively, you can authenticate using OAuth as shown in the video:

```bash
gcloud auth application-default login
```

If you get a warning about a missing quota project like this:

> WARNING:
> Cannot find a quota project to add to ADC. You might receive a "quota exceeded" or "API not enabled" error. 
> Run `$ gcloud auth application-default set-quota-project` to add a quota project.

Then run this:

```bash
PROJECT_NAME="ny-rides-alexey"
gcloud auth application-default set-quota-project ${PROJECT_NAME}
```


### Terraform 

* [Download Terraform](https://www.terraform.io/downloads)
* Put it in a folder on your [PATH](https://gist.github.com/nex3/c395b2f8fd4b02068be37c961301caa7)
* Go to the location with Terraform files and initialize it

```bash
terraform init
```

Optionally you can configure your terraform files (`variables.tf`) to include your project id:

```terraform
variable "project" {
  description = "Your GCP Project ID"
  default = "ny-rides-alexey"
  type = string
}
```

* Now [follow the instructions](1_terraform_overview.md#execution-steps)
  * Run `terraform plan`
  * Next, run `terraform apply`

If you get an error like this:

> Error: googleapi: Error 403: terraform@ny-rides-alexey.iam.gserviceaccount.com does not have
> storage.buckets.create access to the Google Cloud project., forbidden


Then you need to grant your service account the required permissions (see [Setup for Access](2_gcp_overview.md#setup-for-access)). Make sure you follow the instructions in the videos.

* You can also use [this file](https://docs.google.com/document/d/e/2PACX-1vSZapy7gIj0TP-EFzub2OpAlAkuifGEVJ4XpkA1RvxZ45NjiQi29b6OhLuetdXXHWAn2lbbKxnbzMdd/pub), but it doesn't list all the required permissions


================================================
FILE: 02-workflow-orchestration/README.md
================================================
# Workflow Orchestration

Welcome to Module 2 of the Data Engineering Zoomcamp! This week, we’ll dive into workflow orchestration using [Kestra](https://go.kestra.io/de-zoomcamp/github). 

Kestra is an open-source, event-driven orchestration platform that simplifies building both scheduled and event-driven workflows. By adopting Infrastructure as Code practices for data and process orchestration, Kestra enables you to build reliable workflows with just a few lines of YAML.

> [!NOTE]  
>You can find all videos for this week in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist).

---

## Course Structure

- [2.1 - Introduction to Workflow Orchestration](#21-introduction-to-workflow-orchestration)
- [2.2 - Getting Started With Kestra](#22-getting-started-with-kestra)
- [2.3 - Hands-On Coding Project: Build ETL Data Pipelines with Kestra](#23-hands-on-coding-project-build-data-pipelines-with-kestra)
- [2.4 - ELT Pipelines in Kestra: Google Cloud Platform](#24-elt-pipelines-in-kestra-google-cloud-platform)
- [2.5 - Using AI for Data Engineering in Kestra](#25-using-ai-for-data-engineering-in-kestra)
- [2.6 - Bonus](#26-bonus-deploy-to-the-cloud-optional)


## 2.1 Introduction to Workflow Orchestration

In this section, you’ll learn the foundations of workflow orchestration, its importance, and how Kestra fits into the orchestration landscape.

### 2.1.1 - What is Workflow Orchestration?
  
Think of a music orchestra. It has a variety of instruments, some more numerous than others, all with different roles in playing the music. To make sure they all come together at the right time, the players follow a conductor who helps the orchestra play together.

Now replace the instruments with tools and the conductor with an orchestrator. We often have multiple tools and platforms that we need to work together. Sometimes on a routine schedule, other times based on events that happen. That's where the orchestrator comes in to help all of these tools work together.

A workflow orchestrator might do the following tasks:
- Run workflows which contain a number of predefined steps
- Monitor and log errors, and take extra steps when they occur
- Automatically run workflows based on schedules and events
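
The first two tasks in this list can be sketched in a few lines of Python. This is a toy illustration of the idea only, not how Kestra works internally; `run_workflow`, the status strings, and the example steps are all hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("toy-orchestrator")


def run_workflow(steps, on_error=None):
    """Run (name, function) steps in order, passing each result to the next.

    Returns a dict mapping step names to "SUCCESS", "FAILED" or "SKIPPED".
    """
    status = {name: "SKIPPED" for name, _ in steps}
    data = None
    for name, func in steps:
        try:
            data = func(data)
            status[name] = "SUCCESS"
            log.info("step %s succeeded", name)
        except Exception as exc:
            status[name] = "FAILED"
            log.error("step %s failed: %s", name, exc)
            if on_error is not None:
                on_error(name, exc)  # extra step taken when an error occurs
            break  # later steps keep their "SKIPPED" status
    return status


def extract(_):
    return [" 1", "2 ", "oops"]


def transform(rows):
    return [int(row) for row in rows]  # raises ValueError on "oops"


def load(rows):
    log.info("loaded %d rows", len(rows))


alerts = []
result = run_workflow(
    [("extract", extract), ("transform", transform), ("load", load)],
    on_error=lambda name, exc: alerts.append(name),
)
print(result)
```

A real orchestrator adds scheduling, event triggers, retries, and a UI on top of this loop, which is why we use a dedicated tool instead of hand-written scripts.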

In data engineering, you often need to move data from one place to another, sometimes modifying it along the way. This is where a workflow orchestrator can help by managing these steps while giving us visibility into them.

In this module, we're going to build our own data pipeline using ETL (Extract, Transform, Load) with Kestra at the core of the operation, but first we need to understand a bit more about how Kestra works before we can get building!

#### Videos
- **2.1.1 - What is Workflow Orchestration?**  
  [![2.1.1 - What is Workflow Orchestration?](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F-JLnp-iLins)](https://youtu.be/-JLnp-iLins)


### 2.1.2 - What is Kestra?

Kestra is an open-source, infinitely-scalable orchestration platform that enables all engineers to manage business-critical workflows. 

Kestra is a great choice for workflow orchestration:
- Build with Flow code (YAML), No-code or with the AI Copilot - flexibility in how you build your workflows
- 1000+ Plugins - integrate with all the tools you use
- Support for any programming language - pick the right tool for the job
- Schedule or Event Based Triggers - have your workflows respond to data

#### Videos

- **2.1.2 - What is Kestra?**  
  [![2.1.2 - What is Kestra?](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FZvVN_NmB_1s)](https://youtu.be/ZvVN_NmB_1s)

### Resources
- [Quickstart Guide](https://go.kestra.io/de-zoomcamp/quickstart)
- [What is an Orchestrator?](https://go.kestra.io/de-zoomcamp/what-is-an-orchestrator)

---

## 2.2 Getting Started with Kestra

In this section, you'll learn how to install Kestra, as well as the key concepts required to build your first workflow. Once our first workflow is built, we can extend this further by executing a Python script inside of a workflow. 

You will:
1. Install Kestra using Docker Compose
2. Learn the concepts of Kestra to build your first workflow
3. Execute a Python script inside of a Kestra Flow

### 2.2.1 - Installing Kestra

To install Kestra, we are going to use Docker Compose. We already have a Postgres database set up, along with pgAdmin from Module 1. We can continue to use these with Kestra but we'll need to make a few modifications to our Docker Compose file.

Use [this example Docker Compose file](docker-compose.yml) to add the 2 new services and set up the volumes correctly.

We'll set up Kestra using Docker Compose containing one container for the Kestra server and another for the Postgres database:

```bash
cd 02-workflow-orchestration
docker compose up -d
```

**Note:** Check that `pgAdmin` isn't running on the same ports as Kestra. If so, check out the [FAQ](#troubleshooting-tips) at the bottom of the README.

Once the container starts, you can access the Kestra UI at [http://localhost:8080](http://localhost:8080).

To shut down Kestra, go to the same directory and run the following command:

```bash
docker compose down
```
#### Add Flows to Kestra

Flows can be added to Kestra by copying and pasting the YAML directly into the editor, or by importing them via Kestra's API. See below for adding them programmatically.

<details>
<summary>Add Flows to Kestra programmatically</summary>

If you prefer to add flows programmatically using Kestra's API, run the following commands:

```bash
# Import all flows: assuming username admin@kestra.io and password Admin1234! (adjust to match your username and password)
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/01_hello_world.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/02_python.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/03_getting_started_data_pipeline.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/04_postgres_taxi.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/05_postgres_taxi_scheduled.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/06_gcp_kv.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/07_gcp_setup.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/08_gcp_taxi.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/09_gcp_taxi_scheduled.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/10_chat_without_rag.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/11_chat_with_rag.yaml
```
</details>
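
The repeated `curl` calls above can also be driven from a short script. The following is a minimal sketch, not part of the course repository: the helper names `build_import_command` and `import_all_flows` are hypothetical, and the URL and credentials are the same defaults assumed in the commands above.

```python
import subprocess
from pathlib import Path


def build_import_command(flow_file, base_url="http://localhost:8080",
                         user="admin@kestra.io", password="Admin1234!"):
    """Return the curl argv that imports one flow file (hypothetical helper)."""
    return [
        "curl", "-X", "POST",
        "-u", f"{user}:{password}",
        f"{base_url}/api/v1/flows/import",
        "-F", f"fileUpload=@{flow_file}",
    ]


def import_all_flows(flows_dir="flows"):
    """Import every YAML file in flows_dir, in file-name order."""
    for flow_file in sorted(Path(flows_dir).glob("*.yaml")):
        # check=True raises if curl itself exits with a non-zero status.
        subprocess.run(build_import_command(flow_file), check=True)
```

Calling `import_all_flows()` from the `02-workflow-orchestration` directory would issue one request per file, equivalent to the commands listed above.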

#### Videos

- **2.2.1 - Installing Kestra**  
  [![2.2.1 - Installing Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FwgPxC4UjoLM)](https://youtu.be/wgPxC4UjoLM)

#### Resources
- [Install Kestra with Docker Compose](https://go.kestra.io/de-zoomcamp/docker-compose)


### 2.2.2 - Kestra Concepts

To start building workflows in Kestra, we need to understand a number of concepts.
- [Flow](https://go.kestra.io/de-zoomcamp/flow) - a container for tasks and their orchestration logic. 
- [Tasks](https://go.kestra.io/de-zoomcamp/tasks) - the steps within a flow.
- [Inputs](https://go.kestra.io/de-zoomcamp/inputs) - dynamic values passed to the flow at runtime.
- [Outputs](https://go.kestra.io/de-zoomcamp/outputs) - pass data between tasks and flows.
- [Triggers](https://go.kestra.io/de-zoomcamp/triggers) - mechanism that automatically starts the execution of a flow.
- [Execution](https://go.kestra.io/de-zoomcamp/execution) - a single run of a flow with a specific state.
- [Variables](https://go.kestra.io/de-zoomcamp/variables) - key–value pairs that let you reuse values across tasks.
- [Plugin Defaults](https://go.kestra.io/de-zoomcamp/plugin-defaults) - default values applied to every task of a given type within one or more flows.
- [Concurrency](https://go.kestra.io/de-zoomcamp/concurrency) - control how many executions of a flow can run at the same time.

While there are more concepts used for building powerful workflows, these are the ones we're going to use to build our data pipelines.

The flow [`01_hello_world.yaml`](flows/01_hello_world.yaml) showcases all of these concepts inside of one workflow:
- The flow has 5 tasks: 3 log tasks, a return task, and a sleep task.
- The flow takes an input called `name`.
- There is a variable that takes the `name` input to generate a full welcome message.
- An output is generated from the return task and is logged in a later log task.
- There is a trigger to execute this flow every day at 10am. It ships with `disabled: true`, so enable it if you want to see scheduled runs.
- Plugin Defaults are used to make all three log tasks send their messages at the `ERROR` level.
- We have a concurrency limit of 2 executions. Any further executions started while 2 are running will fail.
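
To make the concurrency rule concrete, here is a toy model of a limit of 2 with fail-on-overflow behavior. It is only an illustration of the idea: the `ConcurrencyGate` class is hypothetical and says nothing about how Kestra implements the limit internally.

```python
class ConcurrencyGate:
    """Toy model of a flow-level concurrency limit with FAIL behavior."""

    def __init__(self, limit):
        self.limit = limit
        self.running = set()

    def start(self, execution_id):
        # With FAIL behavior, an execution over the limit is rejected, not queued.
        if len(self.running) >= self.limit:
            return "FAILED"
        self.running.add(execution_id)
        return "RUNNING"

    def finish(self, execution_id):
        self.running.discard(execution_id)


gate = ConcurrencyGate(limit=2)
states = [gate.start(f"exec-{i}") for i in range(3)]
print(states)  # ['RUNNING', 'RUNNING', 'FAILED']

gate.finish("exec-0")
print(gate.start("exec-3"))  # RUNNING
```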

#### Videos
- **2.2.2 - Kestra Concepts**  
  [![2.2.2 - Kestra Concepts](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FMNOKVx8780E)](https://youtu.be/MNOKVx8780E)

#### Resources
- [Tutorial](https://go.kestra.io/de-zoomcamp/tutorial)
- [Workflow Components Documentation](https://go.kestra.io/de-zoomcamp/workflow-components)

### 2.2.3 - Orchestrate Python Code

Now that we've built our first workflow, we can take it a step further by adding Python code into our flow. In Kestra, we can run Python code from a dedicated file or write it directly inside of our workflow.

While Kestra has a huge variety of plugins available for building your workflows, you also have the option to write your own code and have Kestra execute it based on schedules or events. This means you can pick the right tools for your pipelines instead of being limited to what the built-in plugins offer.

In our example Python workflow, [`02_python.yaml`](flows/02_python.yaml), our code fetches the number of Docker image pulls from DockerHub and returns it as an output to Kestra. This is useful as we can access this output with other tasks, even though it was generated inside of our Python script.
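
When adapting a script like this, it can help to keep the parsing logic separate from the network call so it can be checked without internet access. The sketch below uses only the standard library; the function names are hypothetical, and the `pull_count` field and URL are the ones used in `02_python.yaml`.

```python
import json
from urllib.request import urlopen


def extract_pull_count(payload):
    """Return the pull count from a Docker Hub repository payload, or None."""
    return payload.get("pull_count")


def fetch_pull_count(image_name="kestra/kestra", timeout=10):
    """Query Docker Hub for one image; this call needs network access."""
    url = f"https://hub.docker.com/v2/repositories/{image_name}/"
    with urlopen(url, timeout=timeout) as response:
        return extract_pull_count(json.load(response))
```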

#### Videos
- **2.2.3 - Orchestrate Python Code**  
  [![2.2.3 - Orchestrate Python Code](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FVAHm0R_XjqI)](https://youtu.be/VAHm0R_XjqI)

#### Resources
- [How-to Guide: Python](https://go.kestra.io/de-zoomcamp/python)


## 2.3 Hands-On Coding Project: Build Data Pipelines with Kestra

Next, we're going to build ETL pipelines for Yellow and Green Taxi data from NYC’s Taxi and Limousine Commission (TLC). You will:
1. Extract data from [CSV files](https://github.com/DataTalksClub/nyc-tlc-data/releases).
2. Load it into Postgres or Google Cloud (GCS + BigQuery).
3. Explore scheduling and backfilling workflows.

### 2.3.1 Getting Started Pipeline

This introductory flow demonstrates a simple data pipeline that extracts data from an HTTP REST API, transforms it in Python, and then queries it with DuckDB. For this stage, a new separate Postgres database is created for the exercises.


```mermaid
graph LR
  Extract[Extract Data via HTTP REST API] --> Transform[Transform Data in Python]
  Transform --> Query[Query Data with DuckDB]
```

Add the flow [`03_getting_started_data_pipeline.yaml`](flows/03_getting_started_data_pipeline.yaml) from the UI if you haven't already and execute it to see the results. Inspect the Gantt and Logs tabs to understand the flow execution.
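
To see what the transform and query steps compute, here is a plain-Python sketch of the same logic: keep a subset of columns, then average the price per brand and sort descending. The sample products are made up for illustration, and the function names are hypothetical; the real flow runs the aggregation in DuckDB.

```python
from collections import defaultdict


def keep_columns(products, columns):
    """Mirror the transform step: keep only the requested columns."""
    return [{c: p.get(c, "N/A") for c in columns} for p in products]


def average_price_by_brand(rows):
    """Mirror the query step: average price per brand, highest first."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["brand"]].append(row["price"])
    averages = [
        {"brand": brand, "avg_price": round(sum(values) / len(values), 2)}
        for brand, values in prices.items()
    ]
    return sorted(averages, key=lambda r: r["avg_price"], reverse=True)


products = [
    {"brand": "A", "price": 10.0, "title": "first"},
    {"brand": "A", "price": 20.0, "title": "second"},
    {"brand": "B", "price": 5.5, "title": "third"},
]
rows = keep_columns(products, ["brand", "price"])
print(average_price_by_brand(rows))
# [{'brand': 'A', 'avg_price': 15.0}, {'brand': 'B', 'avg_price': 5.5}]
```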

#### Videos

- **2.3.1 - Getting Started Pipeline**   
  [![Create an ETL Pipeline with Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F-KmwrCqRhic)](https://youtu.be/-KmwrCqRhic)

#### Resources
- [ETL Tutorial Video](https://go.kestra.io/de-zoomcamp/etl-tutorial)
- [ETL in 3 Minutes](https://go.kestra.io/de-zoomcamp/etl-get-started)

### 2.3.2 Local DB: Load Taxi Data to Postgres

Before we start loading data to GCP, we'll first play with the Yellow and Green Taxi data using a local Postgres database running in a Docker container. We will use the same database from Module 1 which should be in the same Docker Compose file as Kestra.

The flow will extract CSV data partitioned by year and month, create tables, load data to the monthly table, and finally merge the data to the final destination table.

```mermaid
graph LR
  Start[Select Year & Month] --> SetLabel[Set Labels]
  SetLabel --> Extract[Extract CSV Data]
  Extract -->|Taxi=Yellow| YellowFinalTable[Create Yellow Final Table]:::yellow
  Extract -->|Taxi=Green| GreenFinalTable[Create Green Final Table]:::green
  YellowFinalTable --> YellowMonthlyTable[Create Yellow Monthly Table]:::yellow
  GreenFinalTable --> GreenMonthlyTable[Create Green Monthly Table]:::green
  YellowMonthlyTable --> YellowCopyIn[Load Data to Monthly Table]:::yellow
  GreenMonthlyTable --> GreenCopyIn[Load Data to Monthly Table]:::green
  YellowCopyIn --> YellowMerge[Merge Yellow Data]:::yellow
  GreenCopyIn --> GreenMerge[Merge Green Data]:::green

  classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px,color:#000;
  classDef green fill:#32CD32,stroke:#000,stroke-width:1px,color:#000;

```

The flow code: [`04_postgres_taxi.yaml`](flows/04_postgres_taxi.yaml).
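
The staging-then-merge pattern described above can be sketched with SQLite from the Python standard library as a stand-in for Postgres. This is an illustration under assumptions, not the flow's actual SQL: the table names, the `row_id` key, and the `load_month` helper are hypothetical. The point is that merging from a monthly table by key makes reruns idempotent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (row_id TEXT PRIMARY KEY, fare REAL)")
conn.execute("CREATE TABLE trips_monthly (row_id TEXT, fare REAL)")


def load_month(rows):
    """Refill the monthly table, then merge unseen rows into the final table."""
    conn.execute("DELETE FROM trips_monthly")
    conn.executemany("INSERT INTO trips_monthly VALUES (?, ?)", rows)
    # Insert only rows whose key is not already present in the final table.
    conn.execute(
        """
        INSERT INTO trips (row_id, fare)
        SELECT s.row_id, s.fare
        FROM trips_monthly AS s
        WHERE NOT EXISTS (SELECT 1 FROM trips AS t WHERE t.row_id = s.row_id)
        """
    )
    conn.commit()


january = [("a1", 12.5), ("a2", 7.0)]
load_month(january)
load_month(january)  # rerunning the same month adds nothing
total = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(total)  # 2
```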


> [!NOTE]  
> The NYC Taxi and Limousine Commission (TLC) Trip Record Data provided on the [nyc.gov](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) website is currently available only in Parquet format, but this is NOT the dataset we're going to use in this course. For the purpose of this course, we'll use the **CSV files** available [here on GitHub](https://github.com/DataTalksClub/nyc-tlc-data/releases). This is because the Parquet format can be challenging for newcomers to understand, and we want to make the course as accessible as possible. The CSV format can be easily inspected using tools like Excel or Google Sheets, or even a simple text editor.

#### Videos

- **2.3.2 - Local DB: Load Taxi Data to Postgres**   
  [![Local DB: Load Taxi Data to Postgres](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FZ9ZmmwtXDcU)](https://youtu.be/Z9ZmmwtXDcU)

#### Resources
- [Docker Compose with Kestra, Postgres and pgAdmin](docker-compose.yml)

### 2.3.3 Local DB: Learn Scheduling and Backfills

We can now schedule the same pipeline shown above to run daily at 9 AM UTC. We'll also demonstrate how to backfill the data pipeline to run on historical data.

Note: given the large dataset, we'll backfill only data for the green taxi dataset for the year 2019.

The flow code: [`05_postgres_taxi_scheduled.yaml`](flows/05_postgres_taxi_scheduled.yaml).
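
A backfill over a date range amounts to running the flow once per period in that range. The sketch below lists the monthly files a 2019 green taxi backfill would cover, assuming one execution per monthly file and the `<taxi>_tripdata_<year>-<month>.csv` naming used by the course dataset. The `monthly_files` helper is hypothetical.

```python
from datetime import date


def monthly_files(taxi, start, end):
    """List one CSV file name per month from start to end, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{taxi}_tripdata_{year}-{month:02d}.csv")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


files = monthly_files("green", date(2019, 1, 1), date(2019, 12, 31))
print(len(files), files[0], files[-1])
# 12 green_tripdata_2019-01.csv green_tripdata_2019-12.csv
```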

#### Videos

- **2.3.3 - Scheduling and Backfills**  
  [![Scheduling and Backfills](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F1pu_C_oOAMA)](https://youtu.be/1pu_C_oOAMA)

---

## 2.4 ELT Pipelines in Kestra: Google Cloud Platform

Now that you've learned how to build ETL pipelines locally using Postgres, we are ready to move to the cloud. In this section, we'll load the same Yellow and Green Taxi data to Google Cloud Platform (GCP) using: 
1. Google Cloud Storage (GCS) as a data lake  
2. BigQuery as a data warehouse.

### 2.4.1 - ETL vs ELT

In 2.3, we made an ETL pipeline inside of Kestra:
- **Extract:** Firstly, we extract the dataset from GitHub
- **Transform:** Next, we transform it with Python
- **Load:** Finally, we load it into our Postgres database

While this is very standard across the industry, sometimes it makes sense to change the order when working with the cloud. If you're working with a large dataset, like the Yellow Taxi data, there can be benefits to extracting and loading straight into a data warehouse, and then performing transformations directly in the data warehouse. When working with BigQuery, we will use ELT:
- **Extract:** Firstly, we extract the dataset from GitHub
- **Load:** Next, we load this dataset (in this case, a CSV file) into a data lake (Google Cloud Storage)
- **Transform:** Finally, we can create a table inside of our data warehouse (BigQuery) which uses the data from our data lake to perform our transformations.

Loading into the data warehouse before transforming lets us use the cloud's compute for transforming large datasets. What might take a long time on a local machine can take a fraction of the time in the cloud.
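
The difference in ordering can be shown on a tiny scale with SQLite standing in for the database. This is only a sketch of the idea: the table and column names are hypothetical, and in the course the ELT transform runs in BigQuery rather than SQLite.

```python
import sqlite3

raw_rows = [("2019-01-01", "12.50"), ("2019-01-02", "7.00")]

# ETL: transform in Python first, then load the cleaned rows.
etl = sqlite3.connect(":memory:")
etl.execute("CREATE TABLE trips (pickup_date TEXT, fare REAL)")
etl.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(day, float(fare)) for day, fare in raw_rows],
)

# ELT: load the raw rows as-is, then transform with SQL inside the database.
elt = sqlite3.connect(":memory:")
elt.execute("CREATE TABLE trips_raw (pickup_date TEXT, fare TEXT)")
elt.executemany("INSERT INTO trips_raw VALUES (?, ?)", raw_rows)
elt.execute(
    "CREATE TABLE trips AS "
    "SELECT pickup_date, CAST(fare AS REAL) AS fare FROM trips_raw"
)

query = "SELECT pickup_date, fare FROM trips ORDER BY pickup_date"
etl_result = etl.execute(query).fetchall()
elt_result = elt.execute(query).fetchall()
print(etl_result == elt_result)  # True
```

Both paths end with the same table; what changes is where the transformation work happens.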

Over the next few videos, we'll look at setting up BigQuery and transforming the Yellow Taxi dataset.

#### Videos

- **2.4.1 - ETL vs ELT**  
  [![ETL vs ELT](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FE04yurp1tSU)](https://youtu.be/E04yurp1tSU)

#### Resources
- [ETL vs ELT Video](https://go.kestra.io/de-zoomcamp/etl-vs-elt)
- [Data Warehouse 101 Video](https://go.kestra.io/de-zoomcamp/data-warehouse-101)
- [Data Lakes 101 Video](https://go.kestra.io/de-zoomcamp/data-lakes-101)

### 2.4.2 Setup Google Cloud Platform (GCP)

Before we start loading data to GCP, we need to set up the Google Cloud Platform. 

First, adjust the following flow [`06_gcp_kv.yaml`](flows/06_gcp_kv.yaml) to include your service account, GCP project ID, BigQuery dataset and GCS bucket name (_along with their location_) as KV Store values:
- GCP_PROJECT_ID
- GCP_LOCATION
- GCP_BUCKET_NAME
- GCP_DATASET

#### Create GCP Resources

If you haven't already created the GCS bucket and BigQuery dataset in the first week of the course, you can use this flow to create them: [`07_gcp_setup.yaml`](flows/07_gcp_setup.yaml).

> [!WARNING]  
> The `GCP_CREDS` service account contains sensitive information. Ensure you keep it secure and do not commit it to Git. Keep it as secure as your passwords.


#### Videos

- **2.4.2 - Setup Google Cloud Platform**  
  [![Setup Google Cloud Platform](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FTLGFAOHpOYM)](https://youtu.be/TLGFAOHpOYM)

#### Resources
- [Set up Google Cloud Service Account in Kestra](https://go.kestra.io/de-zoomcamp/google-sa)

### 2.4.3 GCP Workflow: Load Taxi Data to BigQuery

Now that Google Cloud is set up with a storage bucket, we can start the ELT process.

```mermaid
graph LR
  SetLabel[Set Labels] --> Extract[Extract CSV Data]
  Extract --> UploadToGCS[Upload Data to GCS]
  UploadToGCS -->|Taxi=Yellow| BQYellowTripdata[Main Yellow Tripdata Table]:::yellow
  UploadToGCS -->|Taxi=Green| BQGreenTripdata[Main Green Tripdata Table]:::green
  BQYellowTripdata --> BQYellowTableExt[External Table]:::yellow
  BQGreenTripdata --> BQGreenTableExt[External Table]:::green
  BQYellowTableExt --> BQYellowTableTmp[Monthly Table]:::yellow
  BQGreenTableExt --> BQGreenTableTmp[Monthly Table]:::green
  BQYellowTableTmp --> BQYellowMerge[Merge to Main Table]:::yellow
  BQGreenTableTmp --> BQGreenMerge[Merge to Main Table]:::green
  BQYellowMerge --> PurgeFiles[Purge Files]
  BQGreenMerge --> PurgeFiles[Purge Files]

  classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px,color:#000
  classDef green fill:#32CD32,stroke:#000,stroke-width:1px,color:#000
```

The flow code: [`08_gcp_taxi.yaml`](flows/08_gcp_taxi.yaml).
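
Each execution works with a file name, a GCS object, and a monthly table derived from the selected taxi type, year, and month. The sketch below shows one way to derive them, assuming the naming pattern seen in the example error message in the troubleshooting tips of this README (`yellow_tripdata_2020-01.csv` and `yellow_tripdata_2020_01`). The `gcp_names` helper and the bucket, project, and dataset values are hypothetical.

```python
def gcp_names(taxi, year, month, bucket, project, dataset):
    """Derive the file, GCS URI, and monthly table name for one taxi/month."""
    file_name = f"{taxi}_tripdata_{year}-{month:02d}.csv"
    return {
        "file": file_name,
        "gcs_uri": f"gs://{bucket}/{file_name}",
        # The table name uses underscores where the file name uses a hyphen.
        "monthly_table": f"{project}.{dataset}.{taxi}_tripdata_{year}_{month:02d}",
    }


names = gcp_names("yellow", 2020, 1, "my-bucket", "my-project", "zoomcamp")
print(names["gcs_uri"])        # gs://my-bucket/yellow_tripdata_2020-01.csv
print(names["monthly_table"])  # my-project.zoomcamp.yellow_tripdata_2020_01
```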

#### Videos

- **2.4.3 - Create an ETL Pipeline with GCS and BigQuery in Kestra**  
  [![Create an ETL Pipeline with GCS and BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F52u9X_bfTAo)](https://youtu.be/52u9X_bfTAo)

### 2.4.4 GCP Workflow: Schedule and Backfill Full Dataset

We can now schedule the same pipeline shown above to run daily at 9 AM UTC for the green dataset and at 10 AM UTC for the yellow dataset. You can backfill historical data directly from the Kestra UI.

Since we now process data in a cloud environment where storage and compute scale far beyond a local machine, we can backfill the entire dataset for both the yellow and green taxi data without the risk of running out of resources locally.

The flow code: [`09_gcp_taxi_scheduled.yaml`](flows/09_gcp_taxi_scheduled.yaml).

#### Videos

- **2.4.4 - GCP Workflow: Schedule and Backfills**  
  [![GCP Workflow: Schedule and Backfills](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fb-6KhfWfk2M)](https://youtu.be/b-6KhfWfk2M)

---

## 2.5 Using AI for Data Engineering in Kestra

This section builds on what you learned earlier in Module 2 to show you how AI can speed up workflow development.

By the end of this section, you will:
- Understand why context engineering matters when collaborating with LLMs
- Use AI Copilot to build Kestra flows faster
- Use Retrieval Augmented Generation (RAG) in data pipelines

### Prerequisites

- Completion of earlier sections in Module 2 (Workflow Orchestration with Kestra)
- Kestra running locally
- Google Cloud account with access to Gemini API (there's a generous free tier!)

---

### 2.5.1 Introduction: Why AI for Workflows?

As data engineers, we spend significant time writing boilerplate code, searching documentation, and structuring data pipelines. AI tools can help us:

- **Generate workflows faster**: Describe what you want to accomplish in natural language instead of writing YAML from scratch
- **Avoid errors**: Get syntax-correct, up-to-date workflow code that follows best practices

However, AI is only as good as the context we provide. This section teaches you how to engineer that context for reliable, production-ready data workflows.

#### Videos

- **2.5.1 - Using AI for Data Engineering**  
  [![Using AI for Data Engineering](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FGHPtRDAv044)](https://youtu.be/GHPtRDAv044)

---

### 2.5.2 Context Engineering with ChatGPT

Let's start by seeing what happens when AI lacks proper context.

#### Experiment: ChatGPT Without Context

1. **Open ChatGPT in a private browser window** (to avoid any existing chat context): https://chatgpt.com

2. **Enter this prompt:**
   ```
   Create a Kestra flow that loads NYC taxi data from a CSV file to BigQuery. The flow should extract data, upload to GCS, and load to BigQuery.
   ```

3. **Observe the results:**
   - ChatGPT will generate a Kestra flow, but it likely contains:
     - **Outdated plugin syntax** e.g., old task types that have been renamed
     - **Incorrect property names** e.g., properties that don't exist in current versions
     - **Hallucinated features** e.g., tasks, triggers or properties that never existed

#### Why Does This Happen?

Large Language Models (LLMs) like GPT models from OpenAI are trained on data up to a specific point in time (knowledge cutoff). They don't automatically know about:
- Software updates and new releases
- Renamed plugins or changed APIs

This is the fundamental challenge of using AI: **the model can only work with information it has access to.**

#### Key Learning: Context is Everything

Without proper context:
- ❌ Generic AI assistants hallucinate outdated or incorrect code
- ❌ You can't trust the output for production use

With proper context:
- ✅ AI generates accurate, current, production-ready code
- ✅ You can iterate faster by letting AI generate boilerplate workflow code

In the next section, we'll see how Kestra's AI Copilot solves this problem.

#### Videos

- **2.5.2 - Context Engineering with ChatGPT**  
  [![Context Engineering with ChatGPT](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FLmnfjGKwnVU)](https://youtu.be/LmnfjGKwnVU)

---

### 2.5.3 AI Copilot in Kestra

Kestra's AI Copilot is specifically designed to generate and modify Kestra flows with full context about the latest plugins, workflow syntax, and best practices.

#### Setup AI Copilot

Before using AI Copilot, you need to configure Gemini API access in your Kestra instance.

**Step 1: Get Your Gemini API Key**

1. Visit Google AI Studio: https://aistudio.google.com/app/apikey
2. Sign in with your Google account
3. Click "Create API Key"
4. Copy the generated key (keep it secure!)

> [!WARNING]  
> Never commit API keys to Git. Always use environment variables or Kestra's KV Store.

**Step 2: Configure Kestra AI Copilot**

Add the following to your Kestra configuration. You can do this by modifying your `docker-compose.yml` file from 2.2:

```yaml
services:
  kestra:
    environment:
      KESTRA_CONFIGURATION: |
        kestra:
          ai:
            type: gemini
            gemini:
              model-name: gemini-2.5-flash
              api-key: ${GEMINI_API_KEY}
```

Then restart Kestra:
```bash
cd 02-workflow-orchestration
export GEMINI_API_KEY="your-api-key-here"
docker compose up -d
```

#### Exercise: ChatGPT vs AI Copilot Comparison

**Objective:** Learn why context engineering matters.

1. **Open Kestra UI** at http://localhost:8080
2. **Create a new flow** and open the Code editor panel
3. **Click the AI Copilot button** (sparkle icon ✨) in the top-right corner
4. **Enter the same exact prompt** we used with ChatGPT:
   ```
   Create a Kestra flow that loads NYC taxi data from a CSV file to BigQuery. The flow should extract data, upload to GCS, and load to BigQuery.
   ```
5. **Compare the outputs:**
   - ✅ Copilot generates executable, working YAML
   - ✅ Copilot uses correct plugin types and properties
   - ✅ Copilot follows current Kestra best practices

**Key Learning:** Context matters! AI Copilot has access to current Kestra documentation, generating Kestra flows better than a generic ChatGPT assistant.

#### Videos

- **2.5.3 - AI Copilot in Kestra**  
  [![AI Copilot in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F3IbjHfC8bMg)](https://youtu.be/3IbjHfC8bMg)


### 2.5.4 Bonus: Retrieval Augmented Generation (RAG)

To further learn how to provide context to your prompts, this bonus section demonstrates how to use RAG.

#### What is RAG?

**RAG (Retrieval Augmented Generation)** is a technique that:
1. **Retrieves** relevant information from your data sources
2. **Augments** the AI prompt with this context
3. **Generates** a response grounded in real data

This reduces hallucinations by giving the AI access to current, accurate information at query time.

#### How RAG Works in Kestra

```mermaid
graph LR
    A[Ask AI] --> B[Fetch Docs]
    B --> C[Create Embeddings]
    C --> D[Find Similar Content]
    D --> E[Add Context to Prompt]
    E --> F[LLM Answer]
```

**The Process:**
1. **Ingest documents**: Load documentation, release notes, or other data sources
2. **Create embeddings**: Convert text into vector representations using an embedding model
3. **Store embeddings**: Save vectors in Kestra's KV Store (or a vector database)
4. **Query with context**: When you ask a question, retrieve relevant embeddings and include them in the prompt
5. **Generate response**: The LLM has real context and provides accurate answers
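
The retrieval and prompt-building steps can be sketched without any external service. The example below uses a bag-of-words count vector as a toy stand-in for real embeddings and cosine similarity for lookup. Everything here is an assumption for illustration: the function names are hypothetical, the two sample documents are made up, and the flows in this module use Kestra's AI plugin with a hosted model instead.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (real systems use a model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(
        documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True
    )
    return ranked[:top_k]


def build_prompt(question, documents):
    """Augment the question with the retrieved context."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "Release notes: the scheduler now supports backfills from the UI",
    "Cooking tips: how to bake sourdough bread at home",
]
print(retrieve("how do backfills work in the scheduler", docs))
```

The prompt returned by `build_prompt` is what would be sent to the LLM in the final step.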

#### Exercise: Retrieval With vs Without Context

**Objective:** Understand how RAG reduces hallucinations by grounding LLM responses in real data.

**Part A: Without RAG**
1. Navigate to the [`10_chat_without_rag.yaml`](flows/10_chat_without_rag.yaml) flow in your Kestra UI
2. Click **Execute**
3. Wait for the execution to complete
4. Open the **Logs** tab
5. Read the output - notice how the response about "Kestra 1.1 features" is:
   - Vague or generic
   - Potentially incorrect
   - Missing specific details
   - Based only on the model's training data (which may be outdated)

**Part B: With RAG**
1. Navigate to the [`11_chat_with_rag.yaml`](flows/11_chat_with_rag.yaml) flow
2. Click **Execute**
3. Watch the execution:
   - First task: **Ingests** Kestra 1.1 release documentation, creates **embeddings** and stores them
   - Second task: **Prompts LLM** with context retrieved from stored embeddings
4. Open the **Logs** tab
5. Compare this output with the previous one - notice how it's:
   - ✅ Specific and detailed
   - ✅ Accurate with real features from the release
   - ✅ Grounded in actual documentation

**Key Learning:** RAG (Retrieval Augmented Generation) grounds AI responses in current documentation, which reduces hallucinations and yields more accurate, context-aware answers.

#### RAG Best Practices

1. **Keep documents updated**: Regularly re-ingest to ensure current information
2. **Chunk appropriately**: Break large documents into meaningful chunks
3. **Test retrieval quality**: Verify that the right documents are retrieved

#### Additional AI Resources

Kestra Documentation:
- [AI Tools Overview](https://go.kestra.io/de-zoomcamp/ai-tools)
- [AI Copilot](https://go.kestra.io/de-zoomcamp/ai-copilot)
- [RAG Workflows](https://go.kestra.io/de-zoomcamp/rag-workflows)
- [AI Workflows](https://go.kestra.io/de-zoomcamp/ai-workflows)
- [Kestra Blueprints](https://go.kestra.io/de-zoomcamp/blueprints) - Pre-built workflow examples

Kestra Plugin Documentation:
- [AI Plugin](https://go.kestra.io/de-zoomcamp/ai-plugin)
- [RAG Tasks](https://go.kestra.io/de-zoomcamp/ai-rag-task)

External Documentation:
- [Google Gemini](https://go.kestra.io/de-zoomcamp/gemini-docs)
- [Google AI Studio](https://go.kestra.io/de-zoomcamp/ai-studio)

#### Videos

- **2.5.4 (Bonus) - Retrieval Augmented Generation**  
  [![Retrieval Augmented Generation](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FXuPDQ1UcNyI)](https://youtu.be/XuPDQ1UcNyI)

## 2.6 Bonus: Deploy to the Cloud (Optional)

Now that we've got all our pipelines working and we know how to quickly create new flows with Kestra's AI Copilot, we can deploy Kestra to the cloud so it can continue to orchestrate our scheduled pipelines. 

In this bonus section, we'll cover how you can deploy Kestra on Google Cloud and automatically sync your workflows from a Git repository.

Note: When committing your workflows to Git, make sure they don't contain any sensitive information. You can use [Secrets](https://go.kestra.io/de-zoomcamp/secret) and the [KV Store](https://go.kestra.io/de-zoomcamp/kv-store) to keep sensitive data out of your workflow logic.

#### Resources

- [Install Kestra on Google Cloud](https://go.kestra.io/de-zoomcamp/gcp-install)
- [Moving from Development to Production](https://go.kestra.io/de-zoomcamp/dev-to-prod)
- [Using Git in Kestra](https://go.kestra.io/de-zoomcamp/git)
- [Deploy Flows with GitHub Actions](https://go.kestra.io/de-zoomcamp/deploy-github-actions)

## 2.7 Additional Resources 📚

- Check [Kestra Docs](https://go.kestra.io/de-zoomcamp/docs)
- Explore our [Blueprints](https://go.kestra.io/de-zoomcamp/blueprints) library
- Browse over 600 [plugins](https://go.kestra.io/de-zoomcamp/plugins) available in Kestra
- Give us a star on [GitHub](https://go.kestra.io/de-zoomcamp/github)
- Join our [Slack community](https://go.kestra.io/de-zoomcamp/slack) if you have any questions
- Find all the videos in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist)


### Troubleshooting tips

If you face any issues with Kestra flows in Module 2, make sure to use the following Docker images/ports:
- `image: kestra/kestra:v1.1` - pin your Kestra Docker image to this version so we can ensure reproducibility; do NOT use `kestra/kestra:develop` as this is a bleeding-edge development version that might contain bugs
- `postgres:18` — make sure to pin your Postgres image to version 18
- If you run `pgAdmin` or something else on port 8080, you can adjust Kestra `docker-compose` to use a different port, e.g. change port mapping to 18080 instead of 8080, and then access Kestra UI in your browser from http://localhost:18080/ instead of from http://localhost:8080/

If you are still facing any issues, stop and remove your existing Kestra + Postgres containers and start them again using `docker compose up -d`. If this doesn't help, post your question on the DataTalksClub Slack or on [Kestra's Slack](http://kestra.io/slack).

If you encounter an error similar to:
```
BigQueryError{reason=invalid, location=null, 
message=Error while reading table: kestra-sandbox.zooomcamp.yellow_tripdata_2020_01, 
error message: CSV table references column position 17, but line contains only 14 columns.; 
line_number: 2103925 byte_offset_to_start_of_line: 194863028 
column_index: 17 column_name: "congestion_surcharge" column_type: NUMERIC 
File: gs://anna-geller/yellow_tripdata_2020-01.csv}
```

It means that the CSV file you're trying to load into BigQuery has a different number of columns in the external source table (i.e. the file in GCS) than in the destination table in BigQuery. This can happen when, due to network or transfer issues, the file is not fully downloaded from GitHub or not correctly uploaded to GCS. The error suggests a schema issue, but that's not the case. Simply rerun the entire execution, including redownloading the CSV file and reuploading it to GCS. This should resolve the issue.
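
If you want to confirm that a downloaded file is truncated before rerunning, a quick local check is to look for rows with fewer columns than the header. The sketch below uses the standard `csv` module; the `find_short_rows` helper is hypothetical and is not part of the course flows.

```python
import csv
import io


def find_short_rows(csv_file, max_report=5):
    """Return (line_number, column_count) for rows narrower than the header."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None:
        return []  # empty file: nothing to compare against
    problems = []
    for row in reader:
        if len(row) < len(header):
            problems.append((reader.line_num, len(row)))
            if len(problems) >= max_report:
                break
    return problems


sample = io.StringIO("a,b,c\n1,2,3\n4,5\n")
print(find_short_rows(sample))  # [(3, 2)]
```

For a real file, pass an open file handle, for example `open("yellow_tripdata_2020-01.csv", newline="")`.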

---

## Homework 

See the [2026 cohort folder](../cohorts/2026/02-workflow-orchestration/homework.md)

---

# Community notes

Did you take notes? You can share them by creating a PR to this file! 

* Add your notes above this line

---

# Previous Cohorts

* 2022: [notes](../cohorts/2022/week_2_data_ingestion#community-notes) and [videos](../cohorts/2022/week_2_data_ingestion)
* 2023: [notes](../cohorts/2023/week_2_workflow_orchestration#community-notes) and [videos](../cohorts/2023/week_2_workflow_orchestration)
* 2024: [notes](../cohorts/2024/02-workflow-orchestration#community-notes) and [videos](../cohorts/2024/02-workflow-orchestration)
* 2025: [notes](../cohorts/2025/02-workflow-orchestration/README.md#community-notes) and [videos](../cohorts/2025/02-workflow-orchestration)


================================================
FILE: 02-workflow-orchestration/docker-compose.yml
================================================
volumes:
  ny_taxi_postgres_data:
    driver: local
  kestra_postgres_data:
    driver: local
  kestra_data:
    driver: local
  kestra_tmp:
    driver: local

services:
  pgdatabase:
    image: postgres:18
    environment:
      POSTGRES_USER: root
      POSTGRES_PASSWORD: root
      POSTGRES_DB: ny_taxi
    ports:
      - "5432:5432"
    volumes:
      - ny_taxi_postgres_data:/var/lib/postgresql
    depends_on:
      kestra:
        condition: service_started

  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8085:80"
    depends_on:
      pgdatabase:
        condition: service_started

  kestra_postgres:
    image: postgres:18
    volumes:
      - kestra_postgres_data:/var/lib/postgresql
    environment:
      POSTGRES_DB: kestra
      POSTGRES_USER: kestra
      POSTGRES_PASSWORD: k3str4
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -d $${POSTGRES_DB} -U $${POSTGRES_USER}"]
      interval: 30s
      timeout: 10s
      retries: 10

  kestra:
    image: kestra/kestra:v1.1
    pull_policy: always
    # Note that this setup with a root user is intended for development purpose.
    # Our base image runs without root, but the Docker Compose implementation needs root to access the Docker socket
    # To run Kestra in a rootless mode in production, see: https://kestra.io/docs/installation/podman-compose
    user: "root"
    command: server standalone
    volumes:
      - kestra_data:/app/storage
      - /var/run/docker.sock:/var/run/docker.sock
      - kestra_tmp:/tmp/kestra-wd
    environment:
      KESTRA_CONFIGURATION: |
        datasources:
          postgres:
            url: jdbc:postgresql://kestra_postgres:5432/kestra
            driverClassName: org.postgresql.Driver
            username: kestra
            password: k3str4
        kestra:
          server:
            basicAuth:
              username: "admin@kestra.io" # it must be a valid email address
              password: Admin1234!
          repository:
            type: postgres
          storage:
            type: local
            local:
              basePath: "/app/storage"
          queue:
            type: postgres
          tasks:
            tmpDir:
              path: /tmp/kestra-wd/tmp
          url: http://localhost:8080/
    ports:
      - "8080:8080"
      - "8081:8081"
    depends_on:
      kestra_postgres:
        condition: service_started
    

================================================
FILE: 02-workflow-orchestration/flows/01_hello_world.yaml
================================================
id: 01_hello_world
namespace: zoomcamp

inputs:
  - id: name
    type: STRING
    defaults: Will

concurrency:
  behavior: FAIL
  limit: 2

variables:
  welcome_message: "Hello, {{ inputs.name }}!"
  
tasks:
  - id: hello_message
    type: io.kestra.plugin.core.log.Log
    message: "{{ render(vars.welcome_message) }}"
  
  - id: generate_output
    type: io.kestra.plugin.core.debug.Return
    format: I was generated during this workflow.

  - id: sleep
    type: io.kestra.plugin.core.flow.Sleep
    duration: PT15S

  - id: log_output
    type: io.kestra.plugin.core.log.Log
    message: "This is an output: {{ outputs.generate_output.value }}"

  - id: goodbye_message
    type: io.kestra.plugin.core.log.Log
    message: "Goodbye, {{ inputs.name }}!"

pluginDefaults:
  - type: io.kestra.plugin.core.log.Log
    values:
      level: ERROR

triggers:
  - id: schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 * * *"
    inputs:
      name: Sarah
    disabled: true


================================================
FILE: 02-workflow-orchestration/flows/02_python.yaml
================================================
id: 02_python
namespace: zoomcamp

description: This flow will install the pip package in a Docker container, and use kestra's Python library to generate outputs (number of downloads of the Kestra Docker image) and metrics (duration of the script).

tasks:
  - id: collect_stats
    type: io.kestra.plugin.scripts.python.Script
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: python:slim
    dependencies:
      - requests
      - kestra
    script: |
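      from kestra import Kestra
      import requests


      def get_docker_image_downloads(image_name: str = "kestra/kestra"):
          """Queries the Docker Hub API to get the number of downloads for a specific Docker image."""
          url = f"https://hub.docker.com/v2/repositories/{image_name}/"
          response = requests.get(url, timeout=30)
          # Fail the task on an HTTP error instead of parsing an error body as data
          response.raise_for_status()
          data = response.json()
          # pull_count holds the total number of pulls reported by Docker Hub
          downloads = data.get('pull_count', 'Not available')
          return downloads


      downloads = get_docker_image_downloads()
      # Expose the value as a task output so downstream tasks can reference it
      outputs = {
          'downloads': downloads
      }
      Kestra.outputs(outputs)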

================================================
FILE: 02-workflow-orchestration/flows/03_getting_started_data_pipeline.yaml
================================================
id: 03_getting_started_data_pipeline
namespace: zoomcamp

inputs:
  - id: columns_to_keep
    type: ARRAY
    itemType: STRING
    defaults:
      - brand
      - price

tasks:
  - id: extract
    type: io.kestra.plugin.core.http.Download
    uri: https://dummyjson.com/products

  - id: transform
    type: io.kestra.plugin.scripts.python.Script
    containerImage: python:3.11-alpine
    inputFiles:
      data.json: "{{outputs.extract.uri}}"
    outputFiles:
      - "*.json"
    env:
      COLUMNS_TO_KEEP: "{{inputs.columns_to_keep}}"
    script: |
      import json
      import os

      columns_to_keep_str = os.getenv("COLUMNS_TO_KEEP")
      columns_to_keep = json.loads(columns_to_keep_str)

      with open("data.json", "r") as file:
          data = json.load(file)

      filtered_data = [
          {column: product.get(column, "N/A") for column in columns_to_keep}
          for product in data["products"]
      ]

      with open("products.json", "w") as file:
          json.dump(filtered_data, file, indent=4)

  - id: query
    type: io.kestra.plugin.jdbc.duckdb.Queries
    inputFiles:
      products.json: "{{outputs.transform.outputFiles['products.json']}}"
    sql: |
      INSTALL json;
      LOAD json;
      SELECT brand, round(avg(price), 2) as avg_price
      FROM read_json_auto('{{workingDir}}/products.json')
      GROUP BY brand
      ORDER BY avg_price DESC;
    fetchType: STORE


================================================
FILE: 02-workflow-orchestration/flows/04_postgres_taxi.yaml
================================================
id: 04_postgres_taxi
namespace: zoomcamp
description: |
  The CSV data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases

inputs:
  - id: taxi
    type: SELECT
    displayName: Select taxi type
    values: [yellow, green]
    defaults: yellow

  - id: year
    type: SELECT
    displayName: Select year
    values: ["2019", "2020"]
    defaults: "2019"

  - id: month
    type: SELECT
    displayName: Select month
    values: ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
    defaults: "01"

variables:
  file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv"
  staging_table: "public.{{inputs.taxi}}_tripdata_staging"
  table: "public.{{inputs.taxi}}_tripdata"
  data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}"

tasks:
  - id: set_label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      file: "{{render(vars.file)}}"
      taxi: "{{inputs.taxi}}"

  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}

  - id: if_yellow_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'yellow'}}"
    then:
      - id: yellow_create_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              tpep_pickup_datetime   timestamp,
              tpep_dropoff_datetime  timestamp,
              passenger_count        integer,
              trip_distance          double precision,
              RatecodeID             text,
              store_and_fwd_flag     text,
              PULocationID           text,
              DOLocationID           text,
              payment_type           integer,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              congestion_surcharge   double precision
          );

      - id: yellow_create_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              tpep_pickup_datetime   timestamp,
              tpep_dropoff_datetime  timestamp,
              passenger_count        integer,
              trip_distance          double precision,
              RatecodeID             text,
              store_and_fwd_flag     text,
              PULocationID           text,
              DOLocationID           text,
              payment_type           integer,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              congestion_surcharge   double precision
          );

      - id: yellow_truncate_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          TRUNCATE TABLE {{render(vars.staging_table)}};

      - id: yellow_copy_in_to_staging_table
        type: io.kestra.plugin.jdbc.postgresql.CopyIn
        format: CSV
        from: "{{render(vars.data)}}"
        table: "{{render(vars.staging_table)}}"
        header: true
        columns: [VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge]

      - id: yellow_add_unique_id_and_filename
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          UPDATE {{render(vars.staging_table)}}
          SET 
            unique_row_id = md5(
              COALESCE(CAST(VendorID AS text), '') ||
              COALESCE(CAST(tpep_pickup_datetime AS text), '') || 
              COALESCE(CAST(tpep_dropoff_datetime AS text), '') || 
              COALESCE(PULocationID, '') || 
              COALESCE(DOLocationID, '') || 
              COALESCE(CAST(fare_amount AS text), '') || 
              COALESCE(CAST(trip_distance AS text), '')      
            ),
            filename = '{{render(vars.file)}}';

      - id: yellow_merge_data
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
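          -- Insert only staging rows whose unique_row_id is not yet in the target table.
          -- Note: MERGE requires PostgreSQL 15 or later.
          MERGE INTO {{render(vars.table)}} AS T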
          USING {{render(vars.staging_table)}} AS S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (
              unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,
              passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID,
              DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,
              improvement_surcharge, total_amount, congestion_surcharge
            )
            VALUES (
              S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime,
              S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID,
              S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount,
              S.improvement_surcharge, S.total_amount, S.congestion_surcharge
            );

  - id: if_green_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'green'}}"
    then:
      - id: green_create_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              lpep_pickup_datetime   timestamp,
              lpep_dropoff_datetime  timestamp,
              store_and_fwd_flag     text,
              RatecodeID             text,
              PULocationID           text,
              DOLocationID           text,
              passenger_count        integer,
              trip_distance          double precision,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              ehail_fee              double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              payment_type           integer,
              trip_type              integer,
              congestion_surcharge   double precision
          );

      - id: green_create_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              lpep_pickup_datetime   timestamp,
              lpep_dropoff_datetime  timestamp,
              store_and_fwd_flag     text,
              RatecodeID             text,
              PULocationID           text,
              DOLocationID           text,
              passenger_count        integer,
              trip_distance          double precision,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              ehail_fee              double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              payment_type           integer,
              trip_type              integer,
              congestion_surcharge   double precision
          );

      - id: green_truncate_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          TRUNCATE TABLE {{render(vars.staging_table)}};

      - id: green_copy_in_to_staging_table
        type: io.kestra.plugin.jdbc.postgresql.CopyIn
        format: CSV
        from: "{{render(vars.data)}}"
        table: "{{render(vars.staging_table)}}"
        header: true
        columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge]

      - id: green_add_unique_id_and_filename
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          UPDATE {{render(vars.staging_table)}}
          SET 
            unique_row_id = md5(
              COALESCE(CAST(VendorID AS text), '') ||
              COALESCE(CAST(lpep_pickup_datetime AS text), '') || 
              COALESCE(CAST(lpep_dropoff_datetime AS text), '') || 
              COALESCE(PULocationID, '') || 
              COALESCE(DOLocationID, '') || 
              COALESCE(CAST(fare_amount AS text), '') || 
              COALESCE(CAST(trip_distance AS text), '')      
            ),
            filename = '{{render(vars.file)}}';

      - id: green_merge_data
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
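          -- Insert only staging rows whose unique_row_id is not yet in the target table.
          -- Note: MERGE requires PostgreSQL 15 or later.
          MERGE INTO {{render(vars.table)}} AS T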
          USING {{render(vars.staging_table)}} AS S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (
              unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime,
              store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count,
              trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee,
              improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge
            )
            VALUES (
              S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime,
              S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count,
              S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee,
              S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge
            );
  
  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    description: Removes the output files of this execution. Disable this task if you'd like to explore the Kestra outputs.

pluginDefaults:
  - type: io.kestra.plugin.jdbc.postgresql
    values:
      url: jdbc:postgresql://pgdatabase:5432/ny_taxi
      username: root
      password: root


================================================
FILE: 02-workflow-orchestration/flows/05_postgres_taxi_scheduled.yaml
================================================
id: 05_postgres_taxi_scheduled
namespace: zoomcamp
description: |
  When running a backfill from the UI, add the label `backfill:true` so that executions created by the backfill are easy to track.
  CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases

concurrency:
  limit: 1

inputs:
  - id: taxi
    type: SELECT
    displayName: Select taxi type
    values: [yellow, green]
    defaults: yellow

variables:
  file: "{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv"
  staging_table: "public.{{inputs.taxi}}_tripdata_staging"
  table: "public.{{inputs.taxi}}_tripdata"
  data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}"

tasks:
  - id: set_label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      file: "{{render(vars.file)}}"
      taxi: "{{inputs.taxi}}"

  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}

  - id: if_yellow_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'yellow'}}"
    then:
      - id: yellow_create_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              tpep_pickup_datetime   timestamp,
              tpep_dropoff_datetime  timestamp,
              passenger_count        integer,
              trip_distance          double precision,
              RatecodeID             text,
              store_and_fwd_flag     text,
              PULocationID           text,
              DOLocationID           text,
              payment_type           integer,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              congestion_surcharge   double precision
          );

      - id: yellow_create_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              tpep_pickup_datetime   timestamp,
              tpep_dropoff_datetime  timestamp,
              passenger_count        integer,
              trip_distance          double precision,
              RatecodeID             text,
              store_and_fwd_flag     text,
              PULocationID           text,
              DOLocationID           text,
              payment_type           integer,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              congestion_surcharge   double precision
          );

      - id: yellow_truncate_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          TRUNCATE TABLE {{render(vars.staging_table)}};

      - id: yellow_copy_in_to_staging_table
        type: io.kestra.plugin.jdbc.postgresql.CopyIn
        format: CSV
        from: "{{render(vars.data)}}"
        table: "{{render(vars.staging_table)}}"
        header: true
        columns: [VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge]

      - id: yellow_add_unique_id_and_filename
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          UPDATE {{render(vars.staging_table)}}
          SET 
            unique_row_id = md5(
              COALESCE(CAST(VendorID AS text), '') ||
              COALESCE(CAST(tpep_pickup_datetime AS text), '') || 
              COALESCE(CAST(tpep_dropoff_datetime AS text), '') || 
              COALESCE(PULocationID, '') || 
              COALESCE(DOLocationID, '') || 
              COALESCE(CAST(fare_amount AS text), '') || 
              COALESCE(CAST(trip_distance AS text), '')      
            ),
            filename = '{{render(vars.file)}}';

      - id: yellow_merge_data
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
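          -- Insert only staging rows whose unique_row_id is not yet in the target table.
          -- Note: MERGE requires PostgreSQL 15 or later.
          MERGE INTO {{render(vars.table)}} AS T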
          USING {{render(vars.staging_table)}} AS S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (
              unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,
              passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID,
              DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,
              improvement_surcharge, total_amount, congestion_surcharge
            )
            VALUES (
              S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime,
              S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID,
              S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount,
              S.improvement_surcharge, S.total_amount, S.congestion_surcharge
            );

  - id: if_green_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'green'}}"
    then:
      - id: green_create_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              lpep_pickup_datetime   timestamp,
              lpep_dropoff_datetime  timestamp,
              store_and_fwd_flag     text,
              RatecodeID             text,
              PULocationID           text,
              DOLocationID           text,
              passenger_count        integer,
              trip_distance          double precision,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              ehail_fee              double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              payment_type           integer,
              trip_type              integer,
              congestion_surcharge   double precision
          );

      - id: green_create_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              lpep_pickup_datetime   timestamp,
              lpep_dropoff_datetime  timestamp,
              store_and_fwd_flag     text,
              RatecodeID             text,
              PULocationID           text,
              DOLocationID           text,
              passenger_count        integer,
              trip_distance          double precision,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              ehail_fee              double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              payment_type           integer,
              trip_type              integer,
              congestion_surcharge   double precision
          );

      - id: green_truncate_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          TRUNCATE TABLE {{render(vars.staging_table)}};

      - id: green_copy_in_to_staging_table
        type: io.kestra.plugin.jdbc.postgresql.CopyIn
        format: CSV
        from: "{{render(vars.data)}}"
        table: "{{render(vars.staging_table)}}"
        header: true
        columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge]

      - id: green_add_unique_id_and_filename
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          UPDATE {{render(vars.staging_table)}}
          SET 
            unique_row_id = md5(
              COALESCE(CAST(VendorID AS text), '') ||
              COALESCE(CAST(lpep_pickup_datetime AS text), '') || 
              COALESCE(CAST(lpep_dropoff_datetime AS text), '') || 
              COALESCE(PULocationID, '') || 
              COALESCE(DOLocationID, '') || 
              COALESCE(CAST(fare_amount AS text), '') || 
              COALESCE(CAST(trip_distance AS text), '')      
            ),
            filename = '{{render(vars.file)}}';

      - id: green_merge_data
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
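          -- Insert only staging rows whose unique_row_id is not yet in the target table.
          -- Note: MERGE requires PostgreSQL 15 or later.
          MERGE INTO {{render(vars.table)}} AS T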
          USING {{render(vars.staging_table)}} AS S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (
              unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime,
              store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count,
              trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee,
              improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge
            )
            VALUES (
              S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime,
              S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count,
              S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee,
              S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge
            );
  
  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    description: To avoid cluttering your storage, we will remove the downloaded files

pluginDefaults:
  - type: io.kestra.plugin.jdbc.postgresql
    values:
      url: jdbc:postgresql://pgdatabase:5432/ny_taxi
      username: root
      password: root

triggers:
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"
    inputs:
      taxi: green

  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"
    inputs:
      taxi: yellow


================================================
FILE: 02-workflow-orchestration/flows/06_gcp_kv.yaml
================================================
id: 06_gcp_kv
namespace: zoomcamp

tasks:
  - id: gcp_project_id
    type: io.kestra.plugin.core.kv.Set
    key: GCP_PROJECT_ID
    kvType: STRING
    value: kestra-sandbox # TODO replace with your project id

  - id: gcp_location
    type: io.kestra.plugin.core.kv.Set
    key: GCP_LOCATION
    kvType: STRING
    value: europe-west2

  - id: gcp_bucket_name
    type: io.kestra.plugin.core.kv.Set
    key: GCP_BUCKET_NAME
    kvType: STRING
    value: your-name-kestra # TODO make sure it's globally unique!

  - id: gcp_dataset
    type: io.kestra.plugin.core.kv.Set
    key: GCP_DATASET
    kvType: STRING
    value: zoomcamp


================================================
FILE: 02-workflow-orchestration/flows/07_gcp_setup.yaml
================================================
id: 07_gcp_setup
namespace: zoomcamp

tasks:
  - id: create_gcs_bucket
    type: io.kestra.plugin.gcp.gcs.CreateBucket
    ifExists: SKIP
    storageClass: REGIONAL
    name: "{{kv('GCP_BUCKET_NAME')}}" # make sure it's globally unique!

  - id: create_bq_dataset
    type: io.kestra.plugin.gcp.bigquery.CreateDataset
    name: "{{kv('GCP_DATASET')}}"
    ifExists: SKIP

pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{secret('GCP_CREDS')}}"
      projectId: "{{kv('GCP_PROJECT_ID')}}"
      location: "{{kv('GCP_LOCATION')}}"
      bucket: "{{kv('GCP_BUCKET_NAME')}}"


================================================
FILE: 02-workflow-orchestration/flows/08_gcp_taxi.yaml
================================================
id: 08_gcp_taxi
namespace: zoomcamp
description: |
  The CSV data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases

inputs:
  - id: taxi
    type: SELECT
    displayName: Select taxi type
    values: [yellow, green]
    defaults: green

  - id: year
    type: SELECT
    displayName: Select year
    values: ["2019", "2020"]
    defaults: "2019"
    allowCustomValue: true # allows you to type 2021 from the UI for the homework 🤗

  - id: month
    type: SELECT
    displayName: Select month
    values: ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
    defaults: "01"

variables:
  file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv"
  gcs_file: "gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}"
  table: "{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{inputs.year}}_{{inputs.month}}"
  data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}"

tasks:
  - id: set_label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      file: "{{render(vars.file)}}"
      taxi: "{{inputs.taxi}}"

  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}

  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{render(vars.data)}}"
    to: "{{render(vars.gcs_file)}}"

  - id: if_yellow_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'yellow'}}"
    then:
      - id: bq_yellow_tripdata
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata`
          (
              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),
              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),
              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          PARTITION BY DATE(tpep_pickup_datetime);

      - id: bq_yellow_table_ext
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`
          (
              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          OPTIONS (
              format = 'CSV',
              uris = ['{{render(vars.gcs_file)}}'],
              skip_leading_rows = 1,
              ignore_unknown_values = TRUE
          );

      - id: bq_yellow_table_tmp
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`
          AS
          SELECT
            MD5(CONCAT(
              COALESCE(CAST(VendorID AS STRING), ""),
              COALESCE(CAST(tpep_pickup_datetime AS STRING), ""),
              COALESCE(CAST(tpep_dropoff_datetime AS STRING), ""),
              COALESCE(CAST(PULocationID AS STRING), ""),
              COALESCE(CAST(DOLocationID AS STRING), "")
            )) AS unique_row_id,
            "{{render(vars.file)}}" AS filename,
            *
          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;

      - id: bq_yellow_merge
        # Inserts only rows whose unique_row_id is not yet in the target table, so reloading a month does not create duplicates.
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T
          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge)
            VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge);

  - id: if_green_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'green'}}"
    then:
      - id: bq_green_tripdata
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata`
          (
              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),
              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),
              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              ehail_fee NUMERIC,
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          PARTITION BY DATE(lpep_pickup_datetime);

      - id: bq_green_table_ext
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`
          (
              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              ehail_fee NUMERIC,
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          OPTIONS (
              format = 'CSV',
              uris = ['{{render(vars.gcs_file)}}'],
              skip_leading_rows = 1,
              ignore_unknown_values = TRUE
          );

      - id: bq_green_table_tmp
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`
          AS
          SELECT
            MD5(CONCAT(
              COALESCE(CAST(VendorID AS STRING), ""),
              COALESCE(CAST(lpep_pickup_datetime AS STRING), ""),
              COALESCE(CAST(lpep_dropoff_datetime AS STRING), ""),
              COALESCE(CAST(PULocationID AS STRING), ""),
              COALESCE(CAST(DOLocationID AS STRING), "")
            )) AS unique_row_id,
            "{{render(vars.file)}}" AS filename,
            *
          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;

      - id: bq_green_merge
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T
          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge)
            VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge);

  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    description: To explore Kestra outputs, disable this task so the downloaded files are kept.
    disabled: false

pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{secret('GCP_CREDS')}}"
      projectId: "{{kv('GCP_PROJECT_ID')}}"
      location: "{{kv('GCP_LOCATION')}}"
      bucket: "{{kv('GCP_BUCKET_NAME')}}"


================================================
FILE: 02-workflow-orchestration/flows/09_gcp_taxi_scheduled.yaml
================================================

id: 09_gcp_taxi_scheduled
namespace: zoomcamp
description: |
  Add the label `backfill:true` from the UI when running a backfill so that the executions it creates are easy to track.
  CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases

inputs:
  - id: taxi
    type: SELECT
    displayName: Select taxi type
    values: [yellow, green]
    defaults: green

variables:
  # trigger.date is the date of the schedule that started the execution, so every run (including backfills) targets the file for its own month.
  file: "{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv"
  gcs_file: "gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}"
  table: "{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy_MM')}}"
  data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}"

tasks:
  - id: set_label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      file: "{{render(vars.file)}}"
      taxi: "{{inputs.taxi}}"

  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}

  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{render(vars.data)}}"
    to: "{{render(vars.gcs_file)}}"

  - id: if_yellow_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'yellow'}}"
    then:
      - id: bq_yellow_tripdata
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata`
          (
              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),
              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),
              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          PARTITION BY DATE(tpep_pickup_datetime);

      - id: bq_yellow_table_ext
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`
          (
              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          OPTIONS (
              format = 'CSV',
              uris = ['{{render(vars.gcs_file)}}'],
              skip_leading_rows = 1,
              ignore_unknown_values = TRUE
          );

      - id: bq_yellow_table_tmp
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`
          AS
          SELECT
            MD5(CONCAT(
              COALESCE(CAST(VendorID AS STRING), ""),
              COALESCE(CAST(tpep_pickup_datetime AS STRING), ""),
              COALESCE(CAST(tpep_dropoff_datetime AS STRING), ""),
              COALESCE(CAST(PULocationID AS STRING), ""),
              COALESCE(CAST(DOLocationID AS STRING), "")
            )) AS unique_row_id,
            "{{render(vars.file)}}" AS filename,
            *
          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;

      - id: bq_yellow_merge
        # Inserts only rows whose unique_row_id is not yet in the target table, so reloading a month does not create duplicates.
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T
          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge)
            VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge);

  - id: if_green_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'green'}}"
    then:
      - id: bq_green_tripdata
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata`
          (
              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),
              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),
              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              ehail_fee NUMERIC,
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          PARTITION BY DATE(lpep_pickup_datetime);

      - id: bq_green_table_ext
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`
          (
              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              ehail_fee NUMERIC,
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          OPTIONS (
              format = 'CSV',
              uris = ['{{render(vars.gcs_file)}}'],
              skip_leading_rows = 1,
              ignore_unknown_values = TRUE
          );

      - id: bq_green_table_tmp
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`
          AS
          SELECT
            MD5(CONCAT(
              COALESCE(CAST(VendorID AS STRING), ""),
              COALESCE(CAST(lpep_pickup_datetime AS STRING), ""),
              COALESCE(CAST(lpep_dropoff_datetime AS STRING), ""),
              COALESCE(CAST(PULocationID AS STRING), ""),
              COALESCE(CAST(DOLocationID AS STRING), "")
            )) AS unique_row_id,
            "{{render(vars.file)}}" AS filename,
            *
          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;

      - id: bq_green_merge
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T
          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge)
            VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge);

  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    description: To avoid cluttering your storage, we will remove the downloaded files

pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{secret('GCP_CREDS')}}"
      projectId: "{{kv('GCP_PROJECT_ID')}}"
      location: "{{kv('GCP_LOCATION')}}"
      bucket: "{{kv('GCP_BUCKET_NAME')}}"

triggers:
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"
    inputs:
      taxi: green

  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"
    inputs:
      taxi: yellow


================================================
FILE: 02-workflow-orchestration/flows/10_chat_without_rag.yaml
================================================
id: 10_chat_without_rag
namespace: zoomcamp

description: |
  This flow demonstrates what happens when you query an LLM WITHOUT RAG.
  The model can only rely on its training data, which may be outdated or incomplete.
  
  After running this, check out 11_chat_with_rag.yaml to see how RAG fixes these issues.

tasks:
  - id: chat_without_rag
    type: io.kestra.plugin.ai.completion.ChatCompletion
    description: Query about Kestra 1.1 features WITHOUT RAG
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-2.5-flash
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    messages:
      - type: USER
        content: |
          Which features were released in Kestra 1.1? 
          Please list at least 5 major features with brief descriptions.

  - id: log_results
    type: io.kestra.plugin.core.log.Log
    message: |
      ❌ Response WITHOUT RAG (no retrieved context):
      {{ outputs.chat_without_rag.textOutput }}
      
      🤔 Did you notice that this response seems to be:
      - Incorrect
      - Vague/generic
      - Listing features that were added long before this version, not in it
      
      👉 This is why context matters. Run `11_chat_with_rag.yaml` to see the accurate, context-grounded response.


================================================
FILE: 02-workflow-orchestration/flows/11_chat_with_rag.yaml
================================================
id: 11_chat_with_rag
namespace: zoomcamp

description: |
  This flow demonstrates RAG (Retrieval Augmented Generation) by ingesting Kestra release documentation and using it to answer questions accurately.
  
  Compare this with 10_chat_without_rag.yaml to see the difference RAG makes.

tasks:
  - id: ingest_release_notes
    type: io.kestra.plugin.ai.rag.IngestDocument
    description: Ingest Kestra 1.1 release notes to create embeddings
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-001
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/src/contents/blogs/release-1-1/index.md

  - id: chat_with_rag
    type: io.kestra.plugin.ai.rag.ChatCompletion
    description: Query about Kestra 1.1 features with RAG context
    chatProvider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-2.5-flash
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddingProvider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-001
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    systemMessage: |
      You are a helpful assistant that answers questions about Kestra.
      Use the provided documentation to give accurate, specific answers.
      If you don't find the information in the context, say so.
    prompt: |
      Which features were released in Kestra 1.1? 
      Please list at least 5 major features with brief descriptions.

  - id: log_results
    type: io.kestra.plugin.core.log.Log
    message: |
      ✅ RAG Response (with retrieved context):
      {{ outputs.chat_with_rag.textOutput }}
      
      Note that this response is detailed, accurate, and grounded in the actual release documentation. Compare this with the output from 10_chat_without_rag.yaml.


================================================
FILE: 03-data-warehouse/README.md
================================================
# Data Warehouse and BigQuery

- [Slides](https://docs.google.com/presentation/d/1a3ZoBAXFk8-EhUsd7rAZd-5p_HpltkzSeujjRGB2TAI/edit?usp=sharing)  
- [BigQuery basic SQL](big_query.sql)

# Videos

## Data Warehouse

- Data Warehouse and BigQuery

[![](https://markdown-videos-api.jorgenkh.no/youtube/jrHljAoD6nM)](https://youtu.be/jrHljAoD6nM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=34)

## :movie_camera: Partitioning and clustering

- Partitioning vs Clustering

[![](https://markdown-videos-api.jorgenkh.no/youtube/-CqXf7vhhDs)](https://youtu.be/-CqXf7vhhDs?si=p1sYQCAs8dAa7jIm&t=193&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=35)

## :movie_camera: Best practices

[![](https://markdown-videos-api.jorgenkh.no/youtube/k81mLJVX08w)](https://youtu.be/k81mLJVX08w&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=36)

## :movie_camera: Internals of BigQuery

[![](https://markdown-videos-api.jorgenkh.no/youtube/eduHi1inM4s)](https://youtu.be/eduHi1inM4s&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=37)

## Advanced topics

### :movie_camera: Machine Learning in BigQuery

[![](https://markdown-videos-api.jorgenkh.no/youtube/B-WtpB0PuG4)](https://youtu.be/B-WtpB0PuG4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=34)

* [SQL for ML in BigQuery](big_query_ml.sql)

**Important links**

- [BigQuery ML Tutorials](https://cloud.google.com/bigquery-ml/docs/tutorials)
- [BigQuery ML Reference Patterns](https://cloud.google.com/bigquery-ml/docs/analytics-reference-patterns)
- [Hyperparameter tuning](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm)
- [Feature preprocessing](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-preprocess-overview)

### :movie_camera: Deploying Machine Learning model from BigQuery

[![](https://markdown-videos-api.jorgenkh.no/youtube/BjARzEWaznU)](https://youtu.be/BjARzEWaznU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=39)

- [Steps to extract and deploy model with docker](extract_model.md)  


# Homework

* [2026 Homework](../cohorts/2026/03-data-warehouse/homework.md)


# Community notes

<details>
<summary>Did you take notes? You can share them here</summary>

* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/3_data_warehouse.md)
* [Isaac Kargar's blog post](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/30/data-engineering-w3.html)
* [Marcos Torregrosa's blog post](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-3/) 
* [Notes by Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week3)
* [Notes from Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/02/week-3-data-engineering-zoomcamp-notes-data-warehouse-and-bigquery/)
* [Bigger picture summary on Data Lakes, Data Warehouses, and tooling](https://medium.com/@verazabeida/zoomcamp-week-4-b8bde661bf98), by Vera
* [Notes by froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_3_data_warehouse/notes/notes_week_03.md)
* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week3.md)
* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)
* [2024 videos transcript week3](https://drive.google.com/drive/folders/1quIiwWO-tJCruqvtlqe_Olw8nvYSmmDJ?usp=sharing) by Maria Fisher 
* [Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/3a-data-warehouse/readme.md)
* [Jonah Oliver's blog post](https://www.jonahboliver.com/blog/de-zc-w3)
* [2024 - steps to send data from Mage to GCS + creating external table](https://drive.google.com/file/d/1GIi6xnS4070a8MUlIg-ozITt485_-ePB/view?usp=drive_link) by Maria Fisher
* [2024 - mage dataloader script to load the parquet files from a remote URL and push it to Google bucket as parquet file](https://github.com/amohan601/dataengineering-zoomcamp2024/blob/main/week_3_data_warehouse/mage_scripts/green_taxi_2022_v2.py) by Anju Mohan
* [Notes by HongWei](https://github.com/hwchua0209/data-engineering-zoomcamp-submission/blob/main/03-data-warehouse/README.md)
* [2025 Notes by Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/3_Data-Warehouse/README.md)
* [Notes from Horeb SEIDOU](https://spotted-hardhat-eea.notion.site/Week-3-Data-Warehouse-and-BigQuery-17c29780dc4a80c8a226f372543ae388)
* [2025 - Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/03_data_warehouse/00_notes.md)
* [2025 Gitbook Notes Tinker0425](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/module-3/introduction-to-module-3)
* [2025 Notes from Daniel Lachner](https://drive.google.com/file/d/105zjtLFi0sRqqFFgdMSCTzfcLPx2rfv4/view?usp=sharing)
* [2026 Notes from Catherine Frost](https://docs.google.com/document/d/1j3jeNnBI2fw1nq7JwEauPx2G8FybDfTqmMk7eRu0vSo/edit?tab=t.0)
* Add your notes here (above this line)

</details>


================================================
FILE: 03-data-warehouse/big_query.sql
================================================
-- Query a publicly available table
SELECT station_id, name FROM
    bigquery-public-data.new_york_citibike.citibike_stations
LIMIT 100;


-- Create an external table referring to a GCS path
CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.external_yellow_tripdata`
OPTIONS (
  format = 'CSV',
  uris = ['gs://nyc-tl-data/trip data/yellow_tripdata_2019-*.csv', 'gs://nyc-tl-data/trip data/yellow_tripdata_2020-*.csv']
);

-- Check yellow trip data
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata limit 10;

-- Create a non partitioned table from external table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_non_partitioned AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;


-- Create a partitioned table from external table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitioned
PARTITION BY
  DATE(tpep_pickup_datetime) AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;

-- Impact of partition
-- Scanning 1.6GB of data
SELECT DISTINCT(VendorID)
FROM taxi-rides-ny.nytaxi.yellow_tripdata_non_partitioned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

-- Scanning ~106 MB of data
SELECT DISTINCT(VendorID)
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitioned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

-- Let's look into the partitions
SELECT table_name, partition_id, total_rows
FROM `nytaxi.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'yellow_tripdata_partitioned'
ORDER BY total_rows DESC;

-- Create a partitioned and clustered table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;

-- Query scans 1.1 GB
SELECT count(*) as trips
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitioned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'
  AND VendorID=1;

-- Query scans 864.5 MB
SELECT count(*) as trips
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'
  AND VendorID=1;


================================================
FILE: 03-data-warehouse/big_query_hw.sql
================================================
CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.fhv_tripdata`
OPTIONS (
  format = 'CSV',
  uris = ['gs://nyc-tl-data/trip data/fhv_tripdata_2019-*.csv']
);


SELECT count(*) FROM `taxi-rides-ny.nytaxi.fhv_tripdata`;


SELECT COUNT(DISTINCT(dispatching_base_num)) FROM `taxi-rides-ny.nytaxi.fhv_tripdata`;


CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.fhv_nonpartitioned_tripdata`
AS SELECT * FROM `taxi-rides-ny.nytaxi.fhv_tripdata`;

CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.fhv_partitioned_tripdata`
PARTITION BY DATE(dropoff_datetime)
CLUSTER BY dispatching_base_num AS (
  SELECT * FROM `taxi-rides-ny.nytaxi.fhv_tripdata`
);

SELECT count(*) FROM  `taxi-rides-ny.nytaxi.fhv_nonpartitioned_tripdata`
WHERE DATE(dropoff_datetime) BETWEEN '2019-01-01' AND '2019-03-31'
  AND dispatching_base_num IN ('B00987', 'B02279', 'B02060');


SELECT count(*) FROM `taxi-rides-ny.nytaxi.fhv_partitioned_tripdata`
WHERE DATE(dropoff_datetime) BETWEEN '2019-01-01' AND '2019-03-31'
  AND dispatching_base_num IN ('B00987', 'B02279', 'B02060');


================================================
FILE: 03-data-warehouse/big_query_ml.sql
================================================
-- SELECT THE COLUMNS OF INTEREST
SELECT passenger_count, trip_distance, PULocationID, DOLocationID, payment_type, fare_amount, tolls_amount, tip_amount
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned` WHERE fare_amount != 0;

-- CREATE A ML TABLE WITH APPROPRIATE TYPE
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_ml` (
`passenger_count` INTEGER,
`trip_distance` FLOAT64,
`PULocationID` STRING,
`DOLocationID` STRING,
`payment_type` STRING,
`fare_amount` FLOAT64,
`tolls_amount` FLOAT64,
`tip_amount` FLOAT64
) AS (
SELECT passenger_count, trip_distance, cast(PULocationID AS STRING), CAST(DOLocationID AS STRING),
CAST(payment_type AS STRING), fare_amount, tolls_amount, tip_amount
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned` WHERE fare_amount != 0
);

-- CREATE MODEL WITH DEFAULT SETTING
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_model`
OPTIONS
(model_type='linear_reg',
input_label_cols=['tip_amount'],
DATA_SPLIT_METHOD='AUTO_SPLIT') AS
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL;

-- CHECK FEATURES
SELECT * FROM ML.FEATURE_INFO(MODEL `taxi-rides-ny.nytaxi.tip_model`);

-- EVALUATE THE MODEL
SELECT
*
FROM
ML.EVALUATE(MODEL `taxi-rides-ny.nytaxi.tip_model`,
(
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL
));

-- PREDICT THE MODEL
SELECT
*
FROM
ML.PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,
(
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL
));

-- PREDICT AND EXPLAIN
SELECT
*
FROM
ML.EXPLAIN_PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,
(
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL
), STRUCT(3 as top_k_features));

-- HYPERPARAMETER TUNING
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_hyperparam_model`
OPTIONS
(model_type='linear_reg',
input_label_cols=['tip_amount'],
DATA_SPLIT_METHOD='AUTO_SPLIT',
num_trials=5,
max_parallel_trials=2,
l1_reg=hparam_range(0, 20),
l2_reg=hparam_candidates([0, 0.1, 1, 10])) AS
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL;


================================================
FILE: 03-data-warehouse/extract_model.md
================================================
## Model deployment
[Tutorial](https://cloud.google.com/bigquery-ml/docs/export-model-tutorial)
### Steps
- gcloud auth login
- bq --project_id taxi-rides-ny extract -m nytaxi.tip_model gs://taxi_ml_model/tip_model
- mkdir /tmp/model
- gsutil cp -r gs://taxi_ml_model/tip_model /tmp/model
- mkdir -p serving_dir/tip_model/1
- cp -r /tmp/model/tip_model/* serving_dir/tip_model/1
- docker pull tensorflow/serving
- docker run -p 8501:8501 --mount type=bind,source=`pwd`/serving_dir/tip_model,target=/models/tip_model -e MODEL_NAME=tip_model -t tensorflow/serving &
- curl -d '{"instances": [{"passenger_count":1, "trip_distance":12.2, "PULocationID":"193", "DOLocationID":"264", "payment_type":"2","fare_amount":20.4,"tolls_amount":0.0}]}' -X POST http://localhost:8501/v1/models/tip_model:predict
- Check the model status at http://localhost:8501/v1/models/tip_model
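
The `curl` prediction call in the steps above can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes the TensorFlow Serving container from the previous steps is listening on `localhost:8501` and serves the model as `tip_model`. The helper names `build_predict_request` and `predict` are hypothetical and not part of this repository.

```python
import json
import urllib.request

# Assumption: TensorFlow Serving runs locally on port 8501 and the model
# is named "tip_model", as in the docker run command above.
PREDICT_URL = "http://localhost:8501/v1/models/tip_model:predict"


def build_predict_request(instances, url=PREDICT_URL):
    """Build (but do not send) a POST request carrying the given instances."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(instances, url=PREDICT_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(instances, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same feature values as the curl example. tip_amount is the label the
# model predicts, so it is not part of the input.
example_trip = {
    "passenger_count": 1,
    "trip_distance": 12.2,
    "PULocationID": "193",
    "DOLocationID": "264",
    "payment_type": "2",
    "fare_amount": 20.4,
    "tolls_amount": 0.0,
}

# With the container running, call:
#   predict([example_trip])
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running server.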

================================================
FILE: 03-data-warehouse/extras/.env-example
================================================
GCP_GCS_BUCKET="your_bucket_name"
GOOGLE_APPLICATION_CREDENTIALS=Path/to/key/GCP_service_account_key.json

================================================
FILE: 03-data-warehouse/extras/.gitignore
================================================
*.env
*.parquet
*.csv*

================================================
FILE: 03-data-warehouse/extras/README.md
================================================
Quick hack to load files directly to GCS, without Airflow. Downloads `.csv.gz` files from https://github.com/DataTalksClub/nyc-tlc-data/releases/download/ and uploads them to your Cloud Storage bucket as parquet files.

1. Install pre-reqs with `uv sync`
2. Run: `uv run python web_to_gcs_with_progress_bar.py`
3. Alternatively, run `uv run python web_to_gcs.py` for less verbose output (if you have a fast upload connection)
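
To see which files a run will touch, the naming scheme used by the scripts in this folder can be sketched as below. The base URL and the file name patterns are taken from the scripts. The helper `monthly_files` is a hypothetical name introduced only for illustration.

```python
# Base URL as defined in the scripts in this folder.
INIT_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"


def monthly_files(year, service):
    """Yield (download_url, gcs_object_name) for each month of the given year."""
    for month in range(1, 13):
        csv_name = f"{service}_tripdata_{year}-{month:02d}.csv.gz"
        parquet_name = csv_name.replace(".csv.gz", ".parquet")
        # Source CSVs are grouped per service; parquet files are uploaded
        # under a folder named after the service.
        yield f"{INIT_URL}{service}/{csv_name}", f"{service}/{parquet_name}"


for url, object_name in monthly_files("2019", "green"):
    print(url, "->", object_name)
```

Not every month exists for every year, so a run can stop at the first missing file.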


================================================
FILE: 03-data-warehouse/extras/pyproject.toml
================================================
[project]
name = "extras"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.14"
dependencies = [
    "google-cloud-storage>=3.8.0",
    "pandas>=3.0.0",
    "pyarrow>=23.0.0",
    "python-dotenv>=1.2.1",
    "requests>=2.32.5",
    "tqdm>=4.67.1",
]


================================================
FILE: 03-data-warehouse/extras/web_to_gcs.py
================================================
import os
import requests
import pandas as pd
from google.cloud import storage
from dotenv import load_dotenv


"""
Pre-reqs: 
1. run `uv sync` from this 'extras' folder (create venv and install dependencies from pyproject.toml)
2. rename .env-example to .env (not committed thanks to .gitignore)
3. in .env, 
    - set GCP_GCS_BUCKET as your bucket or change default value of BUCKET
    - Set GOOGLE_APPLICATION_CREDENTIALS to your project/service-account json key 
    (or don't set it if you use google ADC)
"""
# load env vars from .env
load_dotenv()

# services = ['fhv','green','yellow']
init_url = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"
# if not done in .env, switch out the default bucketname
BUCKET = os.environ.get("GCP_GCS_BUCKET", "dtc-data-lake-bucketname")


def upload_to_gcs(bucket, object_name, local_file):
    """
    Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
    """
    # # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.
    # # (Ref: https://github.com/googleapis/python-storage/issues/74)
    # storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
    # storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB

    client = storage.Client()
    bucket = client.bucket(bucket)
    blob = bucket.blob(object_name)
    blob.upload_from_filename(local_file)


def web_to_gcs(year, service):
    for i in range(12):
        # sets the month part of the file_name string
        month = "0" + str(i + 1)
        month = month[-2:]

        # csv file_name
        file_name = f"{service}_tripdata_{year}-{month}.csv.gz"

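        # download it using requests and save it to disk
        request_url = f"{init_url}{service}/{file_name}"
        r = requests.get(request_url)
        r.raise_for_status()  # stop early on a missing file (e.g. HTTP 404)
        with open(file_name, "wb") as f:
            f.write(r.content)
        print(f"Local: {file_name}")

        # read the CSV back with pandas and convert it to a parquet file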
        # enforce types so parquet columns will directly have good types
        # (as we did in module 1 in ingest.py script)
        dtypes = {
            "VendorID": "Int64",
            "RatecodeID": "Int64",
            "PULocationID": "Int64",
            "DOLocationID": "Int64",
            "passenger_count": "Int64",
            "payment_type": "Int64",
            "trip_type": "Int64",  # only in green but ignored if missing column
            "store_and_fwd_flag": "string",
            "trip_distance": "float64",
            "fare_amount": "float64",
            "extra": "float64",
            "mta_tax": "float64",
            "tip_amount": "float64",
            "tolls_amount": "float64",
            "ehail_fee": "float64",  # only in green but ignored if missing column
            "improvement_surcharge": "float64",
            "total_amount": "float64",
            "congestion_surcharge": "float64",
        }

        if service == "yellow":
            parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
        else:
            parse_dates = ["lpep_pickup_datetime", "lpep_dropoff_datetime"]

        df = pd.read_csv(
            file_name, dtype=dtypes, parse_dates=parse_dates, compression="gzip"
        )
        file_name = file_name.replace(".csv.gz", ".parquet")
        df.to_parquet(file_name, engine="pyarrow")
        print(f"Parquet: {file_name}")

        # upload it to gcs
        upload_to_gcs(BUCKET, f"{service}/{file_name}", file_name)
        print(f"GCS: {service}/{file_name}")


web_to_gcs("2019", "green")
web_to_gcs("2020", "green")
web_to_gcs("2021", "green")  # fails when reaching 08 (expected: the file is not on GitHub)
# web_to_gcs("2019", "yellow")
# web_to_gcs("2020", "yellow")
# web_to_gcs("2021", "yellow") # fails when reaching 08 (expected: the file is not on GitHub)


================================================
FILE: 03-data-warehouse/extras/web_to_gcs_with_progress_bar.py
================================================
import os
import requests
import pandas as pd
from google.cloud import storage
from dotenv import load_dotenv
from tqdm import tqdm
import gzip
import pyarrow as pa
import pyarrow.parquet as pq


"""
Pre-reqs: 
1. run `uv sync` from this 'extras' folder (create venv and install dependencies from pyproject.toml)
2. rename .env-example to .env (not committed thanks to .gitignore)
3. in .env, 
    - set GCP_GCS_BUCKET as your bucket or change default value of BUCKET
    - Set GOOGLE_APPLICATION_CREDENTIALS to your project/service-account json key 
    (or don't set it if you use google ADC)
"""
# load env vars from .env
load_dotenv()

# services = ['fhv','green','yellow']
init_url = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"
# if not done in .env, switch out the default bucketname
BUCKET = os.environ.get("GCP_GCS_BUCKET", "dtc-data-lake-bucketname")


def download_with_progress(url: str, local_path: str, desc: str = "Downloading"):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total = int(r.headers.get("content-length", 0))
        # Configure tqdm for bytes
        with (
            open(local_path, "wb") as f,
            tqdm(
                total=total,
                unit="B",
                unit_scale=True,
                unit_divisor=1024,
                desc=desc,
            ) as bar,
        ):
            for chunk in r.iter_content(chunk_size=1024 * 1024):  # 1 MB
                if not chunk:
                    continue
                size = f.write(chunk)
                bar.update(size)


def csv_to_parquet_with_progress(
    csv_path: str, parquet_path: str, service_color: str, chunksize: int = 100_000
):
    # 1) Count rows (gzip-aware)
    with gzip.open(csv_path, mode="rt") as f:
        total_rows = sum(1 for _ in f) - 1  # minus header
    if total_rows <= 0:
        raise ValueError("CSV appears to be empty")

    # 2) Read in chunks with fixed dtypes so parquet columns will directly have good types
    # (as we did in module 1 in ingest.py script)
    dtypes = {
        "VendorID": "Int64",
        "RatecodeID": "Int64",
        "PULocationID": "Int64",
        "DOLocationID": "Int64",
        "passenger_count": "Int64",
        "payment_type": "Int64",
        "trip_type": "Int64",  # only in green but ignored if missing column
        "store_and_fwd_flag": "string",
        "trip_distance": "float64",
        "fare_amount": "float64",
        "extra": "float64",
        "mta_tax": "float64",
        "tip_amount": "float64",
        "tolls_amount": "float64",
        "ehail_fee": "float64",  # only in green but ignored if missing column
        "improvement_surcharge": "float64",
        "total_amount": "float64",
        "congestion_surcharge": "float64",
    }

    if service_color == "yellow":
        parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
    else:
        parse_dates = ["lpep_pickup_datetime", "lpep_dropoff_datetime"]

    reader = pd.read_csv(
        csv_path,
        dtype=dtypes,
        parse_dates=parse_dates,
        compression="gzip",
        chunksize=chunksize,
        low_memory=False,
    )

    writer = None

    try:
        with tqdm(total=total_rows, unit="rows", desc=f"Parquet {csv_path}") as bar:
            for chunk in reader:
                table = pa.Table.from_pandas(chunk)
                if writer is None:
                    # The first chunk defines the schema of the parquet file
                    writer = pq.ParquetWriter(parquet_path, table.schema)
                else:
                    # Optional safety: align to first schema
                    table = table.cast(writer.schema)
                writer.write_table(table)
                bar.update(len(chunk))
    finally:
        # Close the writer even if a chunk fails, so the file handle is released
        if writer is not None:
            writer.close()


def upload_to_gcs_with_progress(bucket: str, object_name: str, local_file: str):
    # WORKAROUND to prevent timeouts for files > 6 MB on an 800 kbps upload speed
    # (Ref: https://github.com/googleapis/python-storage/issues/74).
    # Optional: tune chunk size (must be a multiple of 256 KiB)
    storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
    storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB

    client = storage.Client()
    bucket_obj = client.bucket(bucket)
    blob = bucket_obj.blob(object_name)

    if blob.exists(client):
        print(f"Skipping upload, already in GCS: gs://{bucket}/{object_name}")
        return

    file_size = os.path.getsize(local_file)

    with open(local_file, "rb") as f:
        with tqdm.wrapattr(
            f,
            "read",
            total=file_size,
            miniters=1,
            unit="B",
            unit_scale=True,
            unit_divisor=1024,
            desc=f"Uploading {os.path.basename(local_file)}",
        ) as wrapped_file:
            blob.upload_from_file(
                wrapped_file,
                size=file_size,  # important so the library knows total bytes
            )

    print(f"Uploaded to GCS: gs://{bucket}/{object_name}")


def web_to_gcs(year, service):
    client = storage.Client()
    bucket_obj = client.bucket(BUCKET)

    for i in tqdm(range(12), desc=f"{service} {year}", unit="month"):
        month = f"{i + 1:02d}"

        csv_file_name = f"{service}_tripdata_{year}-{month}.csv.gz"
        parquet_file_name = csv_file_name.replace(".csv.gz", ".parquet")
        object_name = f"{service}/{parquet_file_name}"

        # 1) Check if parquet already in GCS
        blob = bucket_obj.blob(object_name)
        if blob.exists(client):
            print(f"Already in GCS, skipping: gs://{BUCKET}/{object_name}")
            continue

        # 2) Check if CSV already downloaded locally
        if os.path.exists(csv_file_name):
            print(f"CSV already exists locally, skipping download: {csv_file_name}")
        else:
            request_url = f"{init_url}{service}/{csv_file_name}"
            download_with_progress(
                request_url, csv_file_name, desc=f"Downloading {csv_file_name}"
            )

        # 3) Check if Parquet already exists locally
        if os.path.exists(parquet_file_name):
            print(
                f"Parquet already exists locally, skipping conversion: {parquet_file_name}"
            )
        else:
            csv_to_parquet_with_progress(csv_file_name, parquet_file_name, service)
            print(f"Parquet: {parquet_file_name}")

        # 4) Upload with per-byte progress bar
        upload_to_gcs_with_progress(BUCKET, object_name, parquet_file_name)


web_to_gcs("2019", "green")
web_to_gcs("2020", "green")
web_to_gcs(
    "2021", "green"
)  # will fail when reaching 08 (expected: the file does not exist on GitHub)
# web_to_gcs("2019", "yellow")
# web_to_gcs("2020", "yellow")
# web_to_gcs("2021", "yellow") # will fail when reaching 08 (expected: the file does not exist on GitHub)


================================================
FILE: 04-analytics-engineering/README.md
================================================
# Module 4: Analytics Engineering

Goal: Transform the data loaded into the data warehouse into analytical views by developing a [dbt project](taxi_rides_ny/README.md).

### Prerequisites

The prerequisites depend on which setup path you choose:

**For Cloud Setup (BigQuery):**

- Completed [Module 3: Data Warehouse](../03-data-warehouse/) with:
  - A GCP project with BigQuery enabled
  - Service account with BigQuery permissions
  - NYC taxi data loaded into BigQuery (yellow and green taxi data for 2019-2020)

**For Local Setup (DuckDB):**

- No prerequisites! The local setup guide will walk you through downloading and loading the data.

> [!NOTE]
> This module focuses on **yellow and green taxi data** (2019-2020). While Module 3 may have included FHV data, it is not used in this dbt project.

## Setting up your environment

Choose your setup path:

### 🏠 [Local Setup](setup/local_setup.md)

- **Stack**: DuckDB + dbt Core
- **Cost**: Free
- [→ Get Started](setup/local_setup.md)

### ☁️ [Cloud Setup](setup/cloud_setup.md)

- **Stack**: BigQuery + dbt Cloud
- **Cost**: Free tier available (dbt Cloud Developer), BigQuery costs vary
- **Requires**: Completed Module 3 with BigQuery data
- [→ Get Started](setup/cloud_setup.md)

## Content

### Introduction to Analytics Engineering

[![](https://markdown-videos-api.jorgenkh.no/youtube/HxMIsPrIyGQ)](https://www.youtube.com/watch?v=HxMIsPrIyGQ)

### Introduction to data modeling

[![](https://markdown-videos-api.jorgenkh.no/youtube/uF76d5EmdtU)](https://www.youtube.com/watch?v=uF76d5EmdtU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=40)

### What is dbt?

[![](https://markdown-videos-api.jorgenkh.no/youtube/gsKuETFJr54)](https://www.youtube.com/watch?v=gsKuETFJr54&list=PLaNLNpjZpzwgneiI-Gl8df8GCsPYp_6Bs&index=5)

### Differences between dbt Core and dbt Cloud

[![](https://markdown-videos-api.jorgenkh.no/youtube/auzcdLRyEIk)](https://www.youtube.com/watch?v=auzcdLRyEIk)

### Project Setup

| Alternative A  | Alternative B   |
|-----------------------------|--------------------------------|
| BigQuery + dbt Platform | DuckDB + dbt Core |
| [![](https://markdown-videos-api.jorgenkh.no/youtube/GFbwlrt6f54)](https://www.youtube.com/watch?v=GFbwlrt6f54) | [![](https://markdown-videos-api.jorgenkh.no/youtube/GoFAbJYfvlw)](https://www.youtube.com/watch?v=GoFAbJYfvlw) |

### dbt Course

| dbt Project Structure | dbt Sources | dbt Models | Seeds and Macros |
|-----------------------|-------------|------------|------------------|
| [![](https://markdown-videos-api.jorgenkh.no/youtube/2dYDS4OQbT0)](https://www.youtube.com/watch?v=2dYDS4OQbT0) | [![](https://markdown-videos-api.jorgenkh.no/youtube/7CrrXazV_8k)](https://www.youtube.com/watch?v=7CrrXazV_8k) | [![](https://markdown-videos-api.jorgenkh.no/youtube/JQYz-8sl1aQ)](https://www.youtube.com/watch?v=JQYz-8sl1aQ) | [![](https://markdown-videos-api.jorgenkh.no/youtube/lT4fmTDEqVk)](https://www.youtube.com/watch?v=lT4fmTDEqVk) |

| dbt Tests | Documentation | dbt Packages | dbt Commands |
|-----------|---------------|----------------------|---------------|
| [![](https://markdown-videos-api.jorgenkh.no/youtube/bvZ-rJm7uMU)](https://www.youtube.com/watch?v=bvZ-rJm7uMU) | [![](https://markdown-videos-api.jorgenkh.no/youtube/UqoWyMjcqrA)](https://www.youtube.com/watch?v=UqoWyMjcqrA) | [![](https://markdown-videos-api.jorgenkh.no/youtube/KfhUA9Kfp8Y)](https://www.youtube.com/watch?v=KfhUA9Kfp8Y) | [![](https://markdown-videos-api.jorgenkh.no/youtube/t4OeWHW3SsA)](https://www.youtube.com/watch?v=t4OeWHW3SsA) |

## Troubleshooting

- [DuckDB Troubleshooting Guide](setup/duckdb_troubleshooting.md) — If you're getting OOM errors during `dbt build` with DuckDB

## Extra resources

> [!NOTE]
> If you find the videos above overwhelming, we recommend completing the [dbt Fundamentals](https://learn.getdbt.com/courses/dbt-fundamentals) course and then rewatching the module. It provides a solid foundation for all the key concepts you need in this module.

## SQL refresher

The homework for this module focuses heavily on window functions and CTEs. If you need a refresher on these topics, you can refer to these notes.

* [SQL refresher](refreshers/SQL.md)

## Homework

* [2026 Homework](../cohorts/2026/04-analytics-engineering/homework.md)

## Community notes

<details>
<summary>Did you take notes? You can share them here</summary>

* [Slides used in previous years](https://docs.google.com/presentation/d/1xSll_jv0T8JF4rYZvLHfkJXYqUjPtThA/edit?usp=sharing&ouid=114544032874539580154&rtpof=true&sd=true)
* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/4_analytics.md)
* [Sandy's DE learning blog](https://learningdataengineering540969211.wordpress.com/2022/02/17/week-4-setting-up-dbt-cloud-with-bigquery/)
* [Notes by Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week4)
* [Marcos Torregrosa's blog (Spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-4/)
* [Notes by froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_4_analytics_engineering/notes/notes_week_04.md)
* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week4.md)
* [Setting up Prefect with dbt by Vera](https://medium.com/@verazabeida/zoomcamp-week-5-5b6a9d53a3a0)
* [Blog by Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/02/week-4-data-engineering-zoomcamp-notes-analytics-engineering-and-dbt/)
* [Setting up DBT with BigQuery by Tofag](https://medium.com/@fagbuyit/setting-up-your-dbt-cloud-dej-9-d18e5b7c96ba)
* [Blog post by Dewi Oktaviani](https://medium.com/@oktavianidewi/de-zoomcamp-2023-learning-week-4-analytics-engineering-with-dbt-53f781803d3e)
* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)
* [Notes from Balaji](https://github.com/Balajirvp/DE-Zoomcamp/blob/main/Week%204/Data%20Engineering%20Zoomcamp%20Week%204.ipynb)
* [Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/4-analytics-engineering/readme.md)
* [2024 - Videos transcript week4](https://drive.google.com/drive/folders/1V2sHWOotPEMQTdMT4IMki1fbMPTn3jOP?usp=drive)
* [Blog Post](https://www.jonahboliver.com/blog/de-zc-w4) by Jonah Oliver
* [2025 Notes by Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/4_Analytics-Engineering/README.md)
* [2025 Notes by Horeb SEIDOU](https://spotted-hardhat-eea.notion.site/Week-4-Analytics-Engineering-18929780dc4a808692e4e0ee488bf49c?pvs=74)
* [2025 Notes by Daniel Lachner](https://github.com/mossdet/dlp_data_eng/blob/main/Notes/04_01_Analytics_Engineering.pdf)
* [2026 Notes by Sharad K. Gupta](https://github.com/sharadgupta27/data-engineering/blob/main/Notes/dbt_commands.md)
* [Analytical Engineering overview](https://github.com/khanhnguyen7802/DataEngineer101/tree/main/week4-analytics-engineering#readme) 
* [2026 Notes about dbt](https://github.com/khanhnguyen7802/DataEngineer101/blob/main/week4-analytics-engineering/dbt_installation.md) | [dbt + DuckDB setup using Docker](https://github.com/khanhnguyen7802/DataEngineer101/blob/main/week4-analytics-engineering/dbt_installation.md) by Khanh Nguyen
* Add your notes here (above this line)

</details>


================================================
FILE: 04-analytics-engineering/class_notes/4_1_1_analytics_engineering_basics.md
================================================
# DE Zoomcamp 4.1.1 — Analytics Engineering Basics

> 📄 Video: [Analytics Engineering Basics](https://www.youtube.com/watch?v=uF76d5EmdtU)  
> 📄 Further reading: [What is Analytics Engineering?](https://docs.getdbt.com/docs/introduction)  
> 📄 Kimball's Dimensional Modeling: *The Data Warehouse Toolkit* (Ralph Kimball & Margy Ross)

This is the kickoff video for Module 4. There's no hands-on coding here; it's all about setting the stage: why analytics engineering exists, what it actually does, and which data modeling concepts we'll be leaning on for the rest of the module. Worth sitting with before diving into the dbt stuff.

---

## Why analytics engineering exists

A few shifts in the data world created a gap that nobody was filling:

- **Cloud data warehouses** (BigQuery, Snowflake, Redshift) made storage and compute cheap. You no longer have to be surgical about what data you load.
- **EL tools** like Fivetran and Stitch made getting data into the warehouse almost trivial — the extract and load steps are basically automated now.
- **SQL-first BI tools** like Looker brought version control into the data workflow. And tools like Mode enabled self-service analytics for business users.
- **Data governance** became a bigger conversation as more people started touching data.

All of this changed how data teams work and how stakeholders consume data. But it left a gap between the people building the infrastructure and the people using the data.

### The traditional data team

In the old model you had three roles and a pretty clean split:

- **Data Engineer** — builds and maintains the infrastructure. Great software engineer, but not necessarily close to how the business actually uses the data.
- **Data Analyst** — uses the data to answer questions and solve business problems. Understands the business well, but not trained as a software engineer.
- **Data Scientist** — similar story to the analyst. Writing more and more code these days, but software engineering best practices weren't part of the training.

### The gap

Analysts and scientists are writing more code, but they weren't trained for it. Engineers are great at building systems, but they don't always know how the data gets consumed downstream. Nobody was bridging that gap.

### Analytics Engineer

The analytics engineer is the bridge. They bring software engineering best practices — version control, testing, documentation, modularity — into the work that analysts and scientists are already doing. It's a role that sits at the intersection of the data engineer and the data analyst.

In terms of the toolchain, an analytics engineer might touch:

- **Data loading** — tools like Fivetran, Stitch (the EL layer)
- **Data storing** — cloud data warehouses, shared territory with data engineers
- **Data modeling** — this is the core of it. Tools like dbt or Dataform. This is where most of Module 4 lives.
- **Data presentation** — BI tools like Google Looker Studio. The end product that business users actually see.

The focus this week is on modeling and presentation — everything in between "data is in the warehouse" and "business user sees a dashboard."

---

## ETL vs ELT — a quick recap

Two philosophies for getting data transformed and ready:

**ETL (Extract → Transform → Load)** — you transform the data *before* it hits the warehouse. Takes longer to set up because the transformation logic has to be built first, but the data in the warehouse is clean and stable from day one.

**ELT (Extract → Load → Transform)** — you load the raw data first, then transform it *inside* the warehouse. Faster and more flexible. This is the approach that cloud warehouses made possible — storage is cheap, so just load everything and figure out the transformations later.

ELT is the dominant approach now, and it's the one we'll be working with. dbt fits squarely into the "T" of ELT — it runs transformations inside the warehouse using SQL.

---

## Dimensional Modeling — the key concepts

This is Kimball's framework, and it's the main mental model for how we'll structure our data this week. The goal is twofold: make the data **understandable to business users**, and make **queries fast**.

Note: unlike third normal form (3NF), dimensional modeling deliberately allows some data redundancy. The priority is usability and performance, not eliminating duplication.

### Fact tables vs Dimension tables (Star Schema)

The two building blocks:

- **Fact tables** — measurements, metrics, business events. Think of them as **verbs**. "A sale happened." "An order was placed." They correspond to a business process.
- **Dimension tables** — the context around those facts. Think of them as **nouns**. "Who bought it? What product? When?" They correspond to a business entity like a customer or a product.

Together they form a **star schema** — the fact table in the center, dimension tables radiating out around it. It's the classic layout you'll see in most data warehouses.
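
As a minimal sketch of what querying a star schema looks like, the query below joins a fact table to one of its dimensions. The names `fct_trips`, `dim_zones`, and their columns are hypothetical, chosen only for illustration.

```sql
-- Hypothetical star schema:
--   fct_trips is the fact table (one row per trip),
--   dim_zones is a dimension table (one row per taxi zone).
select
    z.borough,                          -- context from the dimension (the "noun")
    count(*)            as trips,       -- measurements from the fact (the "verb")
    sum(f.total_amount) as revenue
from fct_trips as f
join dim_zones as z
    on f.pickup_locationid = z.locationid
group by z.borough
order by revenue desc;
```

The fact table carries the numbers and the dimension supplies the labels to group by, which is the usual shape of a star-schema query.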

### The Kitchen Analogy

Kimball's book uses a restaurant analogy to describe how data flows through a warehouse. It maps pretty cleanly onto what we'll be doing in the project:

- **Staging area (the pantry)** — raw data lands here. Not meant for business users. Only people who know what they're doing should be poking around in it.
- **Processing area (the kitchen)** — this is where raw data gets transformed into proper data models. Again, limited to the people doing the cooking — the data engineers and analytics engineers. The focus here is on efficiency and following standards.
- **Presentation area (the dining hall)** — the final, polished output. This is what business stakeholders actually see and interact with. Clean, structured, ready to consume.

We'll be building exactly this layered structure in our dbt project throughout the module.

================================================
FILE: 04-analytics-engineering/class_notes/4_1_2_what_is_dbt.md
================================================
# DE Zoomcamp 4.1.2 — What is dbt?

> 📄 Video: [What is dbt?](https://www.youtube.com/watch?v=gsKuETFJr54)  
> 📄 Official docs: [Introduction to dbt](https://docs.getdbt.com/docs/introduction)  
> 📄 dbt Cloud vs Core: [Choose your dbt](https://docs.getdbt.com/docs/cloud/about-cloud/dbt-cloud-features)

This is the big-picture overview of dbt before we start building anything. What it is, what problems it solves, and how we'll be using it in the course. No hands-on work yet — just the framing.

---

## What is dbt?

dbt is a transformation workflow tool. It sits on top of your data warehouse and helps you turn raw data into something useful for downstream consumers (analysts, BI tools, ML pipelines, whatever needs clean, structured data).

You write SQL (or Python) to define your transformations, and dbt handles the rest: compiling it, running it against the warehouse, managing dependencies, and persisting the results as tables or views.

In a real company setup, you'd have data flowing in from all over the place — backend systems, frontend apps, third-party APIs like weather data. All of that gets loaded into your warehouse (BigQuery, Snowflake, Databricks, whatever), and dbt is the layer that transforms that raw data into something the business can actually consume.

---

## What problems it solves

The transformation step has always existed. What dbt brings to the table is **software engineering best practices for analytics code**. Things that software engineers have been doing for years but didn't have a clear path into the analytics world:

- **Version control** — your transformations live in git, just like any other code
- **Modularity** — break complex logic into reusable pieces instead of massive spaghetti queries
- **Testing** — automated data quality checks that run with every deployment
- **Documentation** — generated from your code, not a separate wiki that gets out of date
- **Environments** — separate dev and prod. Each developer gets their own sandbox to work in without stepping on each other's toes
- **CI/CD** — automated deployments with validation and rollback

The result is higher-quality pipelines that are easier to maintain and less prone to breaking in production.

---

## How it works — the mechanics

You write a SQL file. It looks like a normal `SELECT` statement. dbt takes that file, figures out where it should go in the warehouse (which schema, which dataset, what environment), wraps it in the necessary DDL/DML, compiles it with any Jinja templating you've used, and runs it.

When you run `dbt run`, it:
1. Compiles your SQL (resolves `ref()` calls, `source()` calls, Jinja macros, everything)
2. Sends the compiled SQL to your warehouse
3. Materializes the result as a table, view, incremental table, or ephemeral CTE — whatever you configured

You don't write `CREATE TABLE` statements yourself. You just write the `SELECT`, and dbt handles the rest.
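
To make this concrete, here is a sketch of a model file followed, in the trailing comment, by roughly the statement dbt would send to the warehouse. The model name `trips_per_vendor`, the upstream model `stg_green_tripdata`, and the schema `dbt_dev` are assumptions, and the generated statement varies by adapter and materialization.

```sql
-- models/trips_per_vendor.sql (hypothetical model)
{{ config(materialized='table') }}

select
    vendorid,
    count(*) as trips
from {{ ref('stg_green_tripdata') }}
group by vendorid

-- Roughly what gets executed after compilation (illustrative only):
--   create table dbt_dev.trips_per_vendor as (
--       select vendorid, count(*) as trips
--       from dbt_dev.stg_green_tripdata
--       group by vendorid
--   );
```

Switching `materialized` to `'view'` would change the wrapper statement but not the `SELECT` you wrote.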

---

## dbt Core vs dbt Cloud

There are two ways to use dbt, and it's worth understanding the difference:

### dbt Core

Open source. Free. You install it locally on your machine (or wherever) and run commands from the terminal. You're responsible for:

- Setting up your dev environment
- Orchestrating production runs (Airflow, cron jobs, whatever you want)
- Hosting documentation if you want it accessible
- Managing logs and metadata

It's the raw engine. You get full control, but you also have to build the surrounding infrastructure yourself.

### dbt Cloud

SaaS product that runs dbt Core under the hood. It gives you:

- A web-based IDE for writing transformations (or you can use a Cloud CLI if you prefer local development)
- Environment management — dev/staging/prod, all handled for you
- Built-in orchestration (job scheduling, triggers, dependencies)
- Hosted documentation (automatically generated and served)
- Logging and observability
- APIs for administration and metadata access
- A semantic layer for metrics (if you need it)

There's a free Developer plan that works for small teams or individual learning. For anything bigger, it's a paid product.

---

## The course setup — two paths

The Zoomcamp gives you two options, and the videos will alternate between them (version A and version B):

### Option A: BigQuery + dbt Cloud (recommended)

- Data warehouse: BigQuery (assuming you set this up in previous weeks)
- dbt: dbt Cloud Developer plan (free account, web IDE)
- No local installation needed

This is the path most of the videos will follow. It's the fastest way to get started and closest to how teams actually use dbt in production.

### Option B: DuckDB + dbt Core

- Data warehouse: DuckDB (local or however you've got it set up)
- dbt: dbt Core installed locally
- Dev environment: your own IDE (VS Code, etc.)
- Orchestration: you'll need to handle this separately (Airflow, Prefect, whatever)

This path gives you more hands-on control but requires more setup.

---

## The project flow

By the time we get to the end of the module, here's what we'll have built:

1. Raw data sitting in the warehouse — trip data from previous weeks, plus a lookup table to demonstrate joining multiple sources
2. dbt transformations that turn that raw data into properly modeled tables following the dimensional modeling concepts from 4.1.1
3. Dashboards that consume the final output and make it useful for business stakeholders

The next videos will walk through actually setting this up and building it out step by step.

================================================
FILE: 04-analytics-engineering/class_notes/4_2_1_dbt_core_vs_dbt_cloud.md
================================================
# DE Zoomcamp 4.2.1 — dbt Core vs dbt Cloud

> 📄 Official feature comparison: [dbt Core vs dbt Cloud](https://www.getdbt.com/product/dbt-core-vs-dbt-cloud)

## dbt Core
- Born in **2016** as a fully **open-source, command-line tool**
- 100% free, runs locally on your own machine
- All code is available on GitHub (can fork, modify, etc.)

## dbt Cloud
- Introduced **two years after dbt Core** (~2018) by dbt Labs (originally called Fishtown Analytics)
- Sold as a **paid SaaS platform** — no need to manage infrastructure yourself
- Handles the heavy lifting:
  - Hosting dbt documentation
  - Orchestration
  - Environment setup
  - Backups of dbt artifacts (e.g. for Slim CI)
- Comes with **collaboration and security features** useful for teams/companies

## How They Were Used Together (Hybrid Approach)
- Common pattern: more technical users worked with dbt Core; less technical users used dbt Cloud
- The two were designed to be **compatible** — e.g. developers could work locally with dbt Core while production runs were executed through dbt Cloud
- dbt Labs published an article in **October 2024** outlining how both products were meant to coexist side by side → [How we think about dbt Core and dbt Cloud](https://www.getdbt.com/blog/how-we-think-about-dbt-core-and-dbt-cloud)

## dbt Fusion — The Future
- In **May 2025**, dbt Labs announced a **full rewrite of the code base** using a new engine called **Fusion**
- Key improvements:
  - **Faster compilation** of dbt code (up to 30x faster in some cases)
  - **Better developer experience** — catches many errors *before* running/building, saving time and money
- dbt Core will continue to be maintained, but **Fusion is the future direction** for both Core and Cloud

### Fusion Limitations
- **Not supported by all adapters** — as of early 2026, Fusion supports major adapters like Snowflake, Databricks, Postgres (and derivatives), BigQuery, and Redshift
- Notably **does not support DuckDB** (yet) or many community-maintained adapters
- If you use a less common adapter, dbt Fusion and the newest versions of dbt Cloud may not work for you
- Adapter support is being actively expanded — check the official docs for the current list

> 📄 Fusion upgrade guide: [Upgrading to the dbt Fusion engine](https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-fusion)  
> 📄 Full adapter support list: [Supported features](https://docs.getdbt.com/docs/fusion/supported-features)

## New Vision: Unified License
- Instead of splitting users between Core and Cloud, Fusion envisions **everyone having a dbt license**
- Users can choose to work in:
  - The **dbt Cloud IDE**, or
  - **VS Code** using the official dbt Labs extension
- Both options are backed by the same Fusion engine

## Course Decisions & Recommendations
- This course uses **DuckDB + dbt Core** (local, via VS Code) because:
  - It forces learners to understand what's actually happening under the hood
  - dbt Cloud abstracts a lot away — understanding Core first makes Cloud easier to pick up later
- If you follow along with dbt Cloud + BigQuery, the concepts transfer well
- dbt Labs' own documentation and courses are excellent resources for learning dbt Cloud specifically → [dbt Developer Hub](https://docs.getdbt.com)
- **Bottom line:** It doesn't matter much which one you learn first — especially as a consultant, you'll likely use both. Focus on the shared fundamentals.

---

*Note: This document was last updated February 2026. For the latest information on dbt Fusion and adapter support, always consult the official dbt documentation.*

================================================
FILE: 04-analytics-engineering/class_notes/4_3_1_dbt_project_structure.md
================================================
# DE Zoomcamp 4.3.1 — dbt Project Structure

> 📄 Video: [dbt Project Structure](https://www.youtube.com/watch?v=2dYDS4OQbT0)  
> 📄 Official docs: [About dbt projects](https://docs.getdbt.com/docs/build/projects)  
> 📄 Best practices: [How we structure our dbt projects](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview)

When you run `dbt init`, dbt automatically creates a set of files and folders. This video walks through each one and explains its purpose. The structure below applies to both dbt Core and dbt Cloud (the DuckDB database file and `data/` folder are local-only artifacts and can be ignored here).

---

## Top-Level Files & Folders

### `analyses/`
- A place for **ad-hoc SQL scripts** that you don't necessarily want to share with stakeholders
- Not heavily used by everyone, but handy for things like **data quality reports** or **administrative checks**
- Think of it as a scratchpad — if you want to investigate how bad a data quality issue is, drop a SQL script here

### `dbt_project.yml`
- **The most important file in a dbt project**
- Every time you run a dbt command, dbt looks for this file first — if it's missing, the command fails
- Key things it contains:
  - Project name
  - Profile name (must match your `profiles.yml` — critical for dbt Core users)
  - Default materializations
  - Variables
- Also a place to set project-wide defaults and configuration

> 📄 [dbt_project.yml reference](https://docs.getdbt.com/reference/dbt_project.yml)

### `macros/`
- Macros behave like **reusable functions** (similar to Python functions or UDFs)
- Use them when you find yourself **repeating the same SQL logic** in multiple places, or when you want to **encapsulate a piece of logic** in one place
- Benefits:
  - Easier to test (you're testing a small, isolated chunk)
  - If a definition changes, you only update it in one place
- Common use cases:
  - **Calendar conversions** (e.g. converting standard dates to a company's fiscal calendar)
  - **Tax rates or regulatory definitions** that might change over time
  - Any reusable business logic that shouldn't be duplicated across models
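
As a sketch, a macro that turns a numeric payment code into a readable label could look like the following. The macro name matches the one called in the staging model in 4.3.2, but the body here is an assumption written for illustration; the labels follow the TLC trip record data dictionary.

```sql
{# macros/get_payment_type_description.sql (illustrative sketch) #}
{% macro get_payment_type_description(payment_type) %}
    case {{ payment_type }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        when 6 then 'Voided trip'
        else 'EMPTY'
    end
{% endmacro %}
```

Calling `{{ get_payment_type_description('payment_type') }}` inside a model expands to this `case` expression at compile time, so the mapping is defined in exactly one place.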

> 📄 [Jinja and macros](https://docs.getdbt.com/docs/build/jinja-macros)

### `models/`
- The **most important directory** — this is where all your SQL transformation logic lives
- dbt suggests breaking it into **three subfolders** (see below)

### `README.md`
- Standard project documentation — the first thing someone sees when they open your project
- dbt creates a default one, but most teams customize it
- Good things to include:
  - How to run the project
  - Whether you need credentials or onboarding
  - Contact information
  - Installation/setup guides

### `seeds/`
- A place to keep **small CSV files** that `dbt seed` loads into your warehouse as tables, which models can then reference with `ref()`
- Considered a **quick-and-dirty** approach — if you have the option, it's better to load data properly at the source
- Useful for:
  - **Lookup tables**
  - Quick experiments or prototypes
  - Showing a stakeholder something before fully committing to a data load
- Use when you don't have the right permissions, or the data is expected to change frequently during experimentation
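
For example, assuming a seed file `seeds/taxi_zone_lookup.csv` with `locationid`, `borough`, and `zone` columns (hypothetical names), `dbt seed` would load it as a table, and a model could then read it with `ref()` like any other model:

```sql
-- Hypothetical model that reads the table created by `dbt seed`
select
    locationid,
    borough,
    zone
from {{ ref('taxi_zone_lookup') }}
where borough != 'Unknown'
```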

> 📄 [Seeds](https://docs.getdbt.com/docs/build/seeds)

### `snapshots/`
- Solves a specific problem: a source table has a column that **overwrites itself**, but you need to **keep the history**
- Example: an `orders` table with a `current_status` column that only ever shows the latest status. For analytics, you want to know *when* each status changed
- How it works: a snapshot takes a **"picture" of a table at a point in time**. Each time you run it, if a value has changed, a new row is recorded with a timestamp — without overwriting the previous value
- Like seeds, this is a **workaround** — ideally you'd solve this at the source. But if you don't control the source, snapshots work well
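
A sketch of the `orders` example, using the block-style snapshot syntax with the `check` strategy. The source `shop.orders`, the key `order_id`, and the target schema are hypothetical; newer dbt versions also support configuring snapshots in YAML, so check the docs for the syntax your version expects.

```sql
{# snapshots/orders_snapshot.sql (illustrative; names are hypothetical) #}
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='check',
        check_cols=['current_status']
    )
}}

select * from {{ source('shop', 'orders') }}

{% endsnapshot %}
```

Each `dbt snapshot` run compares `current_status` per `order_id`. When the value has changed, dbt closes the previous row and adds a new one, tracking validity in the `dbt_valid_from` and `dbt_valid_to` columns.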

> 📄 [Snapshots](https://docs.getdbt.com/docs/build/snapshots)

### `tests/`
- A place for **singular tests** written as SQL assertions
- The logic is simple: **if the query returns any rows, the test fails**, and `dbt build` reports an error
- Example from the course: a client needed to ensure that vehicle timestamps always covered exactly 24 hours per day. A test query checked for any day where the total hours deviated from 24 — catching logic errors like accidental filters or bad joins early
- This is one of several ways to test in dbt, but singular tests are especially good for **custom business rules** that don't fit standard schema tests
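
A minimal singular test, assuming a staging model named `stg_green_tripdata` and a hypothetical business rule that `total_amount` must never be negative. The query selects the rows that violate the rule, so an empty result means the test passes:

```sql
-- tests/assert_total_amount_not_negative.sql (hypothetical test)
select
    vendorid,
    pickup_datetime,
    total_amount
from {{ ref('stg_green_tripdata') }}
where total_amount < 0
```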

> 📄 [Data tests (singular & generic)](https://docs.getdbt.com/docs/build/data-tests)

---

## The `models/` Subfolders

dbt suggests organizing models into three layers:

### `staging/`
- Contains two things:
  - **Source definitions** — telling dbt where your raw data lives in the database
  - **Staging models** — a **1:1 copy** of each source table with only **minimal cleaning** applied
- Minimal cleaning means things like:
  - Fixing data types
  - Renaming columns
  - Filtering out clearly empty rows
  - Removing unnecessary columns
  - Standardizing values
- Keep it **1:1**: one staging model per source table, with as close to the same rows and columns as the light cleaning above allows. Breaking this rule is occasionally convenient but should be the exception

### `intermediate/`
- Everything that is **not raw** and **not ready to expose** to end users
- A catch-all for:
  - Complex joins
  - Heavy-duty cleaning or standardization
  - Data quality processing
- No strict guidelines on what goes here — if it doesn't fit neatly into staging or marts, it belongs in intermediate
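
As an illustration, an intermediate model might stack two staging models into one. The model name `int_trips_unioned`, the staging model names, and the `service_type` column are assumptions; only columns shared by both inputs are selected.

```sql
-- models/intermediate/int_trips_unioned.sql (hypothetical)
with green as (
    select
        vendorid,
        pickup_datetime,
        dropoff_datetime,
        pickup_locationid,
        dropoff_locationid,
        total_amount,
        'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow as (
    select
        vendorid,
        pickup_datetime,
        dropoff_datetime,
        pickup_locationid,
        dropoff_locationid,
        total_amount,
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

select * from green
union all
select * from yellow
```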

### `marts/`
- Where all the **final, consumption-ready** tables live
- If it's in marts, it's **ready for end users**
- In a well-governed dbt project, **only marts tables should be exposed** to BI tools, analysts, and business stakeholders — nothing else
- Typically contains:
  - Tables ready for dashboards
  - Properly modeled, clean tables
  - Often star schemas, but not necessarily
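
A sketch of a mart model: a fact table enriched with a dimension and materialized as a table for BI tools. `fct_trips`, `int_trips_unioned`, `dim_zones`, and the column names are hypothetical.

```sql
-- models/marts/fct_trips.sql (hypothetical)
{{ config(materialized='table') }}

select
    t.vendorid,
    t.service_type,
    t.pickup_datetime,
    t.dropoff_datetime,
    z.borough as pickup_borough,
    z.zone    as pickup_zone,
    t.total_amount
from {{ ref('int_trips_unioned') }} as t
left join {{ ref('dim_zones') }} as z
    on t.pickup_locationid = z.locationid
```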

---

## A Note on Conventions

The `staging → intermediate → marts` structure is dbt's recommendation, but it's not mandatory. The instructor has seen teams use:
- **Medallion architecture** naming: `bronze`, `silver`, `gold`
- Numbered layers: `first`, `second`, `third`, `last`
- Other custom conventions

If your organization already has a convention, follow it. Otherwise, stick with dbt's default structure — it's well thought out and what this course uses.

================================================
FILE: 04-analytics-engineering/class_notes/4_3_2_dbt_sources.md
================================================
# DE Zoomcamp 4.3.2 — dbt Sources

> 📄 Video: [dbt Sources](https://www.youtube.com/watch?v=7CrrXazV_8k)  
> 📄 Official docs: [Sources](https://docs.getdbt.com/docs/build/sources)  
> 📄 Best practices: [How we structure our dbt projects — Staging](https://docs.getdbt.com/best-practices/how-we-structure/03-staging)

This video is about telling dbt where your raw data actually lives. Sources are how dbt knows which tables to pull from before any transformation happens. Everything in this video takes place inside the `models/staging/` folder that we set up in 4.3.1.

---

## Defining Sources

### `sources.yml`
- A **YAML file** inside `models/staging/` that tells dbt where your raw data is
- The **name** of the file is arbitrary — common choices are `sources.yml`, `_sources.yml` (underscore so it sorts to the top), or something named after the origin like `bigquery_sources.yml`
- You give your source a **name** — this is arbitrary too. Think of it as a label: `raw`, `raw_data`, or something more descriptive like `google_analytics_data` or `finance_data`
- Then you provide three fields that are **not** arbitrary — they must exactly match your warehouse:
  - **database** — the database name or GCP project
  - **schema** — the schema inside that database or BigQuery dataset
  - **tables** — the individual tables you want to reference

```yaml
sources:
  - name: staging # Must match the first argument you pass to source()
    database: taxi_rides_ny # Or name of your GCP project
    schema: prod # Or name of your BigQuery dataset
    
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
```

> 📄 [Sources — full reference](https://docs.getdbt.com/docs/build/sources)

### Local (DuckDB) vs BigQuery — what goes where

The meaning of database, schema, and tables changes depending on your setup:

| Field | Local (DuckDB) | BigQuery |
|---|---|---|
| **database** | `taxi_rides_ny` | Your GCP Project ID |
| **schema** | `main` | Your BigQuery Dataset name (e.g. `trips_data_all`) |
| **tables** | `green_tripdata`, `yellow_tripdata` | Same table names |

- If you followed the default local setup, these names should be exactly right out of the box
- If you're on BigQuery, just double-check that your table names match what you actually have in your dataset

---

## Using Sources in Your Models

### The `source()` function
- Instead of hard-coding the full path to your table (e.g. `FROM production.trips_data_all.green_tripdata`), you use the **`source()`** function
- It's a **Jinja macro** — you'll recognize it by the double curly brackets `{{ }}`
- It takes two arguments:
  - The **source name** — the one you defined in your YAML (e.g. `staging`)
  - The **table name** — must match exactly what you put under `tables` in the YAML
- As long as there's a YAML file somewhere in your project with a matching source declaration, this will resolve correctly at compile time

```sql
select * from {{ source('staging', 'green_tripdata') }}
```

- Run a preview and you should see the raw table data come back
- If it works, that's the foundation — everything else builds on this
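
For reference, this is roughly what the compiled SQL looks like once dbt resolves the `source()` call, assuming the source is declared with `database: taxi_rides_ny` and `schema: prod` as in the YAML above. Quoting style differs by adapter, so treat the output as illustrative; the real compiled files end up under `target/compiled/` after a `dbt compile`.

```sql
-- Compiled form of the select above (illustrative; quoting varies by adapter)
select * from "taxi_rides_ny"."prod"."green_tripdata"

-- On BigQuery the same call would resolve to something like:
--   select * from `your-gcp-project`.`trips_data_all`.`green_tripdata`
```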

---

## Building a Proper Staging Model

### Naming convention
- Prefix your staging model files with **`stg_`** to make it clear what layer they belong to
- So `green_tripdata.sql` becomes `stg_green_tripdata.sql`
- Other common prefixes: `int_` for intermediate, and sometimes nothing at all for final mart models

### Rename and reorder columns
- List out every column explicitly and give them **cleaner aliases**
- Be purposeful about the **order** — it should follow a logical grouping:
  - **Identifiers first** — `vendor_id`, `trip_id`, anything that's an ID
  - **Timestamps next** — `pickup_datetime`, `dropoff_datetime`
  - **Trip details** — `passenger_count`, `trip_distance`, `trip_type`
  - **Payment info last** — `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`, `total_amount`, `payment_type`

### Cast data types explicitly
- Don't rely on whatever the source gave you — cast everything to the type you actually want:
  - IDs → `integer`
  - Timestamps → `timestamp`
  - Counts → `integer`
  - Monetary values → `numeric` or `float` (depends on your platform)

```sql
with tripdata as (
  select *
  from {{ source('staging','green_tripdata') }}
  where vendorid is not null 
),

renamed as (
  select
      -- identifiers
      cast(vendorid as integer) as vendorid,
      cast(ratecodeid as integer) as ratecodeid,
      cast(pulocationid as integer) as pickup_locationid,
      cast(dolocationid as integer) as dropoff_locationid,
      
      -- timestamps
      cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
      cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
      
      -- trip info
      store_and_fwd_flag,
      cast(passenger_count as integer) as passenger_count,
      cast(trip_distance as numeric) as trip_distance,
      cast(trip_type as integer) as trip_type,
      
      -- payment info
      cast(fare_amount as numeric) as fare_amount,
      cast(extra as numeric) as extra,
      cast(mta_tax as numeric) as mta_tax,
      cast(tip_amount as numeric) as tip_amount,
      cast(tolls_amount as numeric) as tolls_amount,
      cast(ehail_fee as numeric) as ehail_fee,
      cast(improvement_surcharge as numeric) as improvement_surcharge,
      cast(total_amount as numeric) as total_amount,
      cast(payment_type as integer) as payment_type,
      {{ get_payment_type_description('payment_type') }} as payment_type_description
  from tripdata
)

select * from renamed
```

---

## A Note on Filtering

- The general recommendation is to keep staging models as **1:1 copies** of the source — same number of rows, same number of columns, just cleaned up
- That said, this dataset has some data quality issues (we'll cover those later), so it makes sense to filter out rows where the vendor ID is null (**`vendorid IS NULL`** on the raw column) right here in staging
- It's a deviation from convention, but a practical one for this project

---

## Your Exercise

Do the same thing for the **yellow tripdata** table. The columns are almost identical to green, so it shouldn't be too painful. By the end you should have:
- A `sources.yml` that declares both tables
- A `stg_green_tripdata.sql` staging model
- A `stg_yellow_tripdata.sql` staging model

================================================
FILE: 04-analytics-engineering/class_notes/4_4_1_dbt_models.md
================================================
# DE Zoomcamp 4.4.1 — dbt Models

> 📄 Video: [dbt Models](https://www.youtube.com/watch?v=JQYz-8sl1aQ)  
> 📄 Official docs: [SQL models](https://docs.getdbt.com/docs/build/sql-models)  
> 📄 ref() function: [About ref](https://docs.getdbt.com/reference/dbt-jinja-functions/ref)

Staging is done. From here on it's not just typing SQL: you need to actually **explore the data**, understand what's in it, and get some **business context**. In a real org that means querying the data until you understand its common quality issues and what a normal row looks like, and talking to people about what the codes mean and what event creates a row. All of that understanding eventually gets encoded as SQL.

---

## What are we building?

Before writing any code, it helps to think about what the end result should look like. There are generally two things you want in your marts:

### Reports and dashboards
- If there's an important dashboard or data application out there — especially one that requires a lot of manual work or spreadsheet maintenance — that's a sign it should become a dbt model
- Example: imagine there's a dashboard with a dataset called **monthly revenue per location**. That's something we want to build and version-control properly

### A dimensional model
- Beyond reports, you want a proper **star schema** — the kind of structure you see in data warehouses
- Two key table types to know:
  - **Fact tables**: one row per event/process. One row per trip, one row per sale, one row per order. Named with a `fct_` prefix (e.g. `fct_trips`)
  - **Dimension tables**: attributes of an entity. Named with a `dim_` prefix (e.g. `dim_zones`; a `dim_vendors` table is discussed later in these notes but not built in this project)
- The power of a good star schema: answering "how many?" questions becomes trivial. *How many zones do we have?* → `COUNT(*)` on `dim_zones`. *How many trips?* → `COUNT(*)` on `fct_trips`. Simple, focused tables that you join when you need something more complex
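
To make the "how many?" point concrete, here is a minimal Python sketch that uses the standard-library `sqlite3` module as a stand-in warehouse. The rows are made up for illustration, and the column names (`location_id`, `pickup_location_id`, `trip_id`) are assumptions rather than the course's final schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension: one row per zone (illustrative rows only)
    create table dim_zones (location_id integer primary key, borough text, zone text);
    insert into dim_zones values
        (1, 'EWR', 'Newark Airport'),
        (4, 'Manhattan', 'Alphabet City');

    -- Fact: one row per trip (illustrative rows only)
    create table fct_trips (trip_id text primary key, pickup_location_id integer, total_amount real);
    insert into fct_trips values
        ('t1', 4, 12.5),
        ('t2', 4, 7.0),
        ('t3', 1, 60.0);
""")

# "How many?" questions are a plain COUNT(*) on the relevant table.
zone_count = con.execute("select count(*) from dim_zones").fetchone()[0]
trip_count = con.execute("select count(*) from fct_trips").fetchone()[0]

# Anything richer is a join from the fact table to a dimension.
trips_per_borough = con.execute("""
    select z.borough, count(*) as trips
    from fct_trips t
    join dim_zones z on z.location_id = t.pickup_location_id
    group by z.borough
    order by z.borough
""").fetchall()

print(zone_count, trip_count, trips_per_borough)
```

The fact table only stores the zone ID; names and boroughs live once in the dimension and are joined in when a question needs them.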

### What we're building in this course
- `dim_zones` — zone/location attributes  
- `fct_trips` — one row per trip (yellow + green combined)
- A report model for monthly revenue per zone (inside a `models/core/` folder)

---

## source() vs ref() — the key distinction

This is an important moment in the course. Up until now we've been using `{{ source() }}` to pull in raw data. But that's **only** for things declared in your sources YAML — i.e. raw tables that live outside of dbt.

If the input to your model is **another dbt model**, you use `{{ ref() }}` instead.

- `{{ source('name', 'table') }}` → raw data defined in your YAML
- `{{ ref('model_name') }}` → another dbt model

> 📄 [ref() — full reference](https://docs.getdbt.com/reference/dbt-jinja-functions/ref)

This distinction matters because `ref()` also does something useful under the hood: it automatically builds the **dependency graph**. dbt knows that if model B refs model A, then A has to run first. You never have to manage run order yourself.
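
The run-order idea can be sketched in a few lines of Python. This is not how dbt is implemented internally; it only illustrates the principle, using the standard-library `graphlib` module and a hand-written `refs` mapping (an assumption for this example) from each model to the models it references.

```python
from graphlib import TopologicalSorter

# Each model maps to the set of models it ref()s.
refs = {
    "stg_green_tripdata": set(),
    "stg_yellow_tripdata": set(),
    "int_trips_unioned": {"stg_green_tripdata", "stg_yellow_tripdata"},
    "fct_trips": {"int_trips_unioned"},
}

# static_order() yields every model only after all of its dependencies.
run_order = list(TopologicalSorter(refs).static_order())
print(run_order)
```

The two staging models can come out in either order because neither depends on the other; the union model always follows both, and `fct_trips` always comes last.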

---

## The intermediate layer — why it exists

We want `fct_trips` to be a union of yellow and green trip data. But doing that union directly inside the fact model would make it messy. So we put it in an **intermediate model** instead — something that's not raw, and not ready to expose to end users.

- Convention: prefix intermediate models with `int_`  
- In this case: `int_trips_unioned.sql`
- The idea is to keep intermediate work out of marts. Marts should only contain things that are consumption-ready

```sql
with green_data as (
    select *, 
        'Green' as service_type 
    from {{ ref('stg_green_tripdata') }}
), 

yellow_data as (
    select *, 
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
), 

trips_unioned as (
    select * from green_data
    union all
    select * from yellow_data
)

select * from trips_unioned
```

---

## The union problem — yellow and green aren't identical

When you try to union the two staging models, it fails. The error: *set operation can only be applied with expressions with the same number of columns*. Turns out green has **two extra columns** that yellow doesn't:

### `trip_type`
- Values are `1` or `2`
- `1` = street hail (you flag down the taxi)
- `2` = booked via phone or app
- Yellow taxis **don't have this column** because by law you can only get a yellow taxi by hailing it on the street — it's always type 1
- Fix: add `trip_type` to the yellow staging model and hard-code it as `1` (street hail)

### `ehail_fee` (e-hail fee)
- An extra fee that can apply when you request a taxi through an app
- In practice, most of this data is null — the feature isn't consistently implemented across vendors
- Yellow taxis by definition **never** have an e-hail fee
- Fix: add `ehail_fee` to the yellow staging model and hard-code it as `0`

```sql
-- Updated stg_yellow_tripdata.sql to match green schema
with tripdata as (
  select *
  from {{ source('staging','yellow_tripdata') }}
  where vendorid is not null 
),

renamed as (
    select
        -- identifiers
        cast(vendorid as integer) as vendor_id,
        cast(ratecodeid as integer) as ratecode_id,
        cast(pulocationid as integer) as pickup_location_id,
        cast(dolocationid as integer) as dropoff_location_id,
        
        -- timestamps
        cast(tpep_pickup_datetime as timestamp) as pickup_datetime,
        cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,
        
        -- trip info
        store_and_fwd_flag,
        cast(passenger_count as integer) as passenger_count,
        cast(trip_distance as numeric) as trip_distance,
        cast(1 as integer) as trip_type,  -- Yellow only does street-hail
        
        -- payment info
        cast(fare_amount as numeric) as fare_amount,
        cast(extra as numeric) as extra,
        cast(mta_tax as numeric) as mta_tax,
        cast(tip_amount as numeric) as tip_amount,
        cast(tolls_amount as numeric) as tolls_amount,
        cast(0 as numeric) as ehail_fee,  -- Yellow doesn't have ehail
        cast(improvement_surcharge as numeric) as improvement_surcharge,
        cast(total_amount as numeric) as total_amount,
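        cast(payment_type as integer) as payment_type,
        {{ get_payment_type_description('payment_type') }} as payment_type_description  -- same final column as stg_green_tripdata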
    from tripdata
)

select * from renamed
```

A note: adding these columns directly in staging is technically a break from the "1:1 copy" rule. It's done here to keep things simple, but in a stricter project you'd handle this in the intermediate layer.

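**The union after schema alignment.** The SQL itself is unchanged; it now runs because both staging models expose the same number of columns in the same order (`union all` matches columns by position, not by name):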

```sql
-- models/staging/int_trips_unioned.sql
with green_data as (
    select *, 
        'Green' as service_type 
    from {{ ref('stg_green_tripdata') }}
), 

yellow_data as (
    select *, 
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
), 

trips_unioned as (
    select * from green_data
    union all
    select * from yellow_data
)

select * from trips_unioned
```
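
The failure and the fix can be reproduced outside dbt. The sketch below uses Python's standard-library `sqlite3` module with two deliberately tiny tables; the table names, columns, and rows are illustrative assumptions, and the exact error text differs from one warehouse to another.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table green (vendor_id integer, trip_type integer, ehail_fee real);
    create table yellow (vendor_id integer);
    insert into green values (2, 1, null);
    insert into yellow values (1);
""")

# Mismatched column counts: the database rejects the set operation.
try:
    con.execute("select * from green union all select * from yellow").fetchall()
    union_failed = False
except sqlite3.OperationalError:
    union_failed = True

# Padding the narrower side with constants aligns the two shapes.
rows = con.execute("""
    select vendor_id, trip_type, ehail_fee from green
    union all
    select vendor_id, 1 as trip_type, 0 as ehail_fee from yellow
""").fetchall()

print(union_failed, rows)
```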

---

## Why the business context matters

The column discrepancy between yellow and green isn't just a technical problem; it's a **business story**. Yellow and green taxis exist because of how NYC taxi licensing works: yellow cabs may pick up anywhere but cluster in central Manhattan and at the airports, so green cabs were created to give upper Manhattan and the outer boroughs street-hail service too. Understanding that context is what lets you make the right call on how to handle `trip_type` and `ehail_fee`, not just technically but semantically.

This is the part of analytics engineering where you stop just writing SQL and start understanding what the data actually represents.

================================================
FILE: 04-analytics-engineering/class_notes/4_4_2_dbt_seeds_and_macros.md
================================================
# DE Zoomcamp 4.4.2 — dbt Seeds and Macros

> 📄 Video: [dbt Seeds and Macros](https://www.youtube.com/watch?v=lT4fmTDEqVk)  
> 📄 Seeds docs: [Seeds](https://docs.getdbt.com/docs/build/seeds)  
> 📄 Macros docs: [Jinja and macros](https://docs.getdbt.com/docs/build/jinja-macros)

The union model is done, but right now vendor IDs and location IDs are just numbers — meaningless codes. This video is about enriching that data. Two dbt features come in: **seeds** for bringing in lookup data, and **macros** for turning reusable SQL logic into something you don't have to copy-paste everywhere.

---

## The problem — codes everywhere

If you query `vendor_id`, you get values: 1 and 2. Those map to real companies:
- **1** → Creative Mobile Technologies
- **2** → VeriFone Inc.

Same story with locations — 265 location IDs that could have names, boroughs, coordinates, and more. The raw data just doesn't have any of that. So how do we add it?

---

## Seeds — bringing in lookup data

### What seeds are
- A way to **upload a CSV file** and make it available as a dbt model
- You drop the CSV into the `seeds/` directory, run `dbt seed`, and it becomes queryable just like any other model
- You reference it with `{{ ref('filename') }}` — same as any other model

### When to use them
- **Lookup tables** that don't exist anywhere in your warehouse yet
- Cases where you don't have write permissions to load data properly
- Quick experiments or local testing before committing to a proper data load
- Small, static datasets

### When NOT to use them
- **Never commit confidential data** — seeds go into your git repo
- Keep the data **small** — large CSVs in git will slow down pulls and pushes
- If you have the option to load the data properly at the source, do that instead. Seeds are a quick-and-dirty workaround

> 📄 [Seeds — full reference](https://docs.getdbt.com/docs/build/seeds)

---

## dim_zones — using a seed in practice

The taxi zone lookup CSV has exactly what we need: location ID, borough, zone name, and service area. Drop it into `seeds/`, run `dbt seed`, and it's live.

Now we build `dim_zones`. The model simply selects from the seed and renames columns to something cleaner.

```sql
select
    locationid as location_id,
    borough,
    zone,
    service_zone
from {{ ref('taxi_zone_lookup') }}
```

That's it — first dimension table done. The seed did the heavy lifting.
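
As a rough mental model of what the seed plus `dim_zones` amount to, here is a Python sketch using only the standard library. It is an illustration, not dbt's implementation: the in-memory CSV stands in for `seeds/taxi_zone_lookup.csv` with two sample rows, and SQLite stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Stand-in for seeds/taxi_zone_lookup.csv (a couple of sample rows).
seed_csv = io.StringIO(
    "locationid,borough,zone,service_zone\n"
    "1,EWR,Newark Airport,EWR\n"
    "4,Manhattan,Alphabet City,Yellow Zone\n"
)

con = sqlite3.connect(":memory:")

# Roughly what `dbt seed` does: create a table named after the file and load the rows.
con.execute(
    "create table taxi_zone_lookup "
    "(locationid integer, borough text, zone text, service_zone text)"
)
con.executemany(
    "insert into taxi_zone_lookup values (:locationid, :borough, :zone, :service_zone)",
    csv.DictReader(seed_csv),
)

# Roughly what dim_zones does: select from the seed and rename.
dim_zones = con.execute(
    "select locationid as location_id, borough, zone, service_zone "
    "from taxi_zone_lookup order by location_id"
).fetchall()
print(dim_zones)
```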

---

## dim_vendors — the CASE WHEN problem (not implemented in this project, but shown for learning)

For vendors, we could pull distinct `vendor_id` from the intermediate union model using `ref()`. Easy enough. But we want to enrich it with vendor **names**.

### The naive approach: CASE WHEN
You could just write it inline:

```sql
with vendors as (
    select distinct vendorid
    from {{ ref('int_trips_unioned') }}
)

select
    vendorid,
    case 
        when vendorid = 1 then 'Creative Mobile Technologies, LLC'
        when vendorid = 2 then 'VeriFone Inc.'
        else 'Unknown'
    end as vendor_name
from vendors
```

This works. But it has a real problem: **what happens when a new vendor appears, or a vendor changes its name?** You have to open this file, find the CASE block, and add another line. And if you need the same mapping somewhere else in the project, you copy-paste the whole thing. Eventually someone forgets to update one of the copies.

### The better approach: macros

Macros are dbt's answer to this. Think of them as **reusable SQL functions** — same idea as a Python function, but for SQL snippets.

> 📄 [Jinja and macros — full reference](https://docs.getdbt.com/docs/build/jinja-macros)

### How macros work
- Defined in `.sql` files inside the `macros/` directory
- You wrap your SQL logic in `{% macro macro_name(argument) %}` ... `{% endmacro %}`
- The argument works just like a function parameter — you pass in a value when you call it
- You call it in your models with `{{ macro_name(argument) }}`
- dbt compiles it down — the final SQL looks exactly like you typed the CASE block inline, but your source code stays clean

```sql
{% macro get_vendor_data(vendor_id_column) %}

{% set vendors = {
    1: 'Creative Mobile Technologies',
    2: 'VeriFone Inc.',
    4: 'Unknown/Other'
} %}

case {{ vendor_id_column }}
    {% for vendor_id, vendor_name in vendors.items() %}
    when {{ vendor_id }} then '{{ vendor_name }}'
    {% endfor %}
end

{% endmacro %}

```

**Using the macro in a model:**

```sql
with trips as (
    select * from {{ ref('fct_trips') }}
),

vendors as (
    select distinct
        vendor_id,
        {{ get_vendor_data('vendor_id') }} as vendor_name
    from trips
)

select * from vendors
```

### Why this is better
- **Reusable**: need the same vendor mapping somewhere else? Just call the macro again
- **Single source of truth**: a vendor is added or renamed? Update the macro in one place and it's fixed everywhere
- **Testable**: the logic is isolated in its own file, which makes it easier to reason about
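
To see what "compiles down" means, the following Python sketch mimics the vendor macro with plain string building. It does not use Jinja or dbt; the function simply returns SQL text, which is the essential behaviour of a macro. The function name mirrors the macro above, and the surrounding `select ... from trips` string is a made-up model for illustration.

```python
def get_vendor_data(vendor_id_column: str) -> str:
    """Mimic the macro: return a SQL CASE expression as text."""
    vendors = {
        1: "Creative Mobile Technologies",
        2: "VeriFone Inc.",
        4: "Unknown/Other",
    }
    branches = "\n".join(
        f"    when {vendor_id} then '{vendor_name}'"
        for vendor_id, vendor_name in vendors.items()
    )
    return f"case {vendor_id_column}\n{branches}\nend"


# The "model" is a template; the macro call is replaced by plain SQL text.
compiled_sql = f"select vendor_id, {get_vendor_data('vendor_id')} as vendor_name from trips"
print(compiled_sql)
```

The warehouse only ever sees the expanded `case` expression; the dictionary exists solely at compile time.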

---

## Homework preview — fct_trips

The fact trips model is left as an exercise. Here's what's expected:

- **One row per trip** — yellow and green combined (the union is already done in the intermediate model)
- **Add a primary key** (`trip_id`) — it has to be **unique**
- **Find and fix duplicates** — there are quite a few in this dataset. Some come from the source, some get introduced during the union. Find them, understand why they happen, and fix them
- **Enrich `payment_type`** (there is a seed for this in the repo).

================================================
FILE: 04-analytics-engineering/class_notes/4_5_1_documentation.md
================================================
# DE Zoomcamp 4.5.1 — Documentation

> 📄 Video: [Documentation](https://www.youtube.com/watch?v=UqoWyMjcqrA)  
> 📄 Official docs: [Documentation](https://docs.getdbt.com/docs/build/documentation)  
> 📄 Model properties: [Model properties](https://docs.getdbt.com/reference/model-properties)

The models are built. Now it's time to make sure other people can actually understand what they do. This video covers how dbt's documentation system works — what you write, where you write it, and what dbt does with it.

---

## Where documentation lives — YAML files

You've already seen YAML files in the context of sources. But they do more than just declare where raw data lives — they're also the **primary place to document your entire project**.

The most common convention is to have a single file called `schema.yml` per directory. Some teams prefer **one YAML file per model** — that's fine too, it keeps things from getting unwieldy when projects get large. For this course we stick with `schema.yml`.

> 📄 [Model properties — full reference](https://docs.getdbt.com/reference/model-properties)

---

## What you can document

Almost everything in dbt can be documented. The structure is the same pattern regardless of what you're documenting:

### Sources

You already have a `sources.yml` — you can add descriptions to the source itself and to each table inside it.

```yaml
version: 2

sources:
  - name: staging
    description: >
      Raw NYC taxi trip data loaded from BigQuery external tables.
      Contains both yellow and green taxi trip records for 2019-2020.
    database: production
    schema: trips_data_all
    
    tables:
      - name: green_tripdata
        description: >
          Green taxi trip records. Green taxis operate primarily in
          outer boroughs (outside Manhattan).
          
      - name: yellow_tripdata
        description: Yellow taxi trips, primarily from Manhattan
```

### Models

In `schema.yml`, you switch from `sources:` to `models:`. Same idea — give each model a name and a description, then drill down into columns.

```yaml
version: 2

models:
  - name: dim_zones
    description: >
      Zone lookup table containing LocationID, borough, zone name and service zone.
      One row per taxi zone in NYC.
    columns:
      - name: location_id
        description: Primary key for taxi zones
        tests:
          - unique
          - not_null
      
      - name: borough
        description: NYC borough name (Manhattan, Queens, Brooklyn, Bronx, Staten Island, EWR)
      
      - name: zone
        description: Taxi zone name/neighborhood
      
      - name: service_zone
        description: Service zone type (Yellow, Green, or Airports)
```

### Columns
Under each model, you can list every column with:
- **name**: must match the actual column name
- **description**: what it means
- **data_type**: what type it should be (informational unless the model enforces a contract)
- **tests**: we'll cover these in the next video, but the slot is here (dbt 1.8+ also accepts the key spelled `data_tests`)
- **meta**: custom key-value tags (more on this below)

### Macros and seeds
You can document these too, using the same YAML pattern. Same `version: 2` header, just different top-level keys (`macros:` and `seeds:`).

---

## Multi-line descriptions

If you need more than one line for a description, use the YAML **pipe operator** (`|`) or **greater-than operator** (`>`). Everything indented under it becomes part of the description. The `>` folds newlines into spaces, while `|` preserves them.

```yaml
version: 2

models:
  - name: fct_trips
    description: |
      Fact table containing all taxi trips from both yellow and green taxis.
      
      This is the core analytical table for trip-level analysis.
      Each row represents a single trip with:
      - Trip identifiers and service type
      - Pickup and dropoff locations and timestamps
      - Trip details (distance, passenger count, etc.)
      - Payment information and amounts
      
      Data is filtered for 2019-2020 only and excludes records
      with unknown pickup or dropoff locations.
```

---

## Meta tags — custom metadata

The `meta` field lets you attach arbitrary key-value pairs to any column or model. There's no predefined set — you and your team decide what matters. Common examples:

- **PII** — flag columns that contain personally identifiable information
- **owner** — who's responsible for this data asset, who to contact if something breaks
- **importance** — mark which columns or models are critical vs. informational

These don't affect how dbt runs anything. They're purely for governance, discoverability, and helping your team navigate the project.

---

## Generating and viewing the docs

Two commands, run them in order:

### `dbt docs generate`
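- Compiles everything (your YAML descriptions, your model code, and metadata from the warehouse such as actual column types and table sizes) into JSON artifacts in the `target/` directory: `manifest.json` for the project and `catalog.json` for the warehouse metadata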
- In **dbt Cloud**, this happens automatically. There's even a checkbox for it
- In **dbt Core**, you have to run it yourself

### `dbt docs serve`
- Takes the generated JSON and spins up a local website (defaults to `localhost:8080`)
- Only needed if you're on **dbt Core** — dbt Cloud hosts the docs for you
- If you want other people to see it, you'll need to host it somewhere (S3, Netlify, etc.)

### What the docs site shows you
- **Model code** — both the Jinja version you wrote and the compiled SQL that actually hits the database
- **Column info** — types, descriptions, anything you added
- **Lineage graph** — a visual DAG showing sources in green, all the way through to your final mart models. You can see exactly what depends on what, and whether a change might break something downstream
- **Project structure** — toggle between a folder view and a database view

It's more of a **technical documentation** tool than a pretty data catalog. It's not going to replace something like Looker or Confluent's data catalog for non-technical stakeholders. But for the people building the models, it's genuinely useful — you can see at a glance what data assets exist, how they connect, and how they work.

================================================
FILE: 04-analytics-engineering/class_notes/4_5_2_dbt_tests.md
================================================
# DE Zoomcamp 4.5.2 — dbt Tests

> 📄 Video: [dbt Tests](https://www.youtube.com/watch?v=bvZ-rJm7uMU)  
> 📄 Official docs: [Data tests](https://docs.getdbt.com/docs/build/data-tests) | [Unit tests](https://docs.getdbt.com/docs/build/unit-tests) | [Model contracts](https://docs.getdbt.com/docs/mesh/govern/model-contracts)

Wrong KPIs in dashboards, bad numbers in reports — there are really only two causes: the underlying data wasn't what you expected, or you messed up the SQL. As an analytics engineer, if you can't tell which one it is, both are technically your fault. Tests are how you stay on top of this proactively. dbt ships with a pretty large suite of testing options, and this video walks through all of them.

---

## 1. Singular tests

The simplest kind of test. You write a plain SQL query, stick it in the `tests/` directory, and that's it — it's now a test.

The logic is straightforward: **if the query returns any rows, the test fails.** You're writing a query that selects for the "bad" cases. Zero rows back means everything checks out.

```sql
-- tests/assert_positive_fare_amount.sql
-- Fare amounts should always be positive

select
    tripid,
    fare_amount
from {{ ref('fct_trips') }}
where fare_amount <= 0
```

These are great for one-off business rules that are very specific to your organization — the kind of thing no generic test is going to cover out of the box.
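
The pass/fail rule is easy to reproduce by hand. Below is a minimal Python sketch with the standard-library `sqlite3` module; the `run_singular_test` helper and the three sample rows are assumptions made for this illustration, not part of dbt.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table fct_trips (tripid text, fare_amount real);
    insert into fct_trips values ('a', 12.5), ('b', 0.0), ('c', -3.0);
""")

def run_singular_test(connection, sql):
    """Return (passed, failing_rows): the test passes only if no rows come back."""
    failing_rows = connection.execute(sql).fetchall()
    return len(failing_rows) == 0, failing_rows

passed, failures = run_singular_test(
    con,
    "select tripid, fare_amount from fct_trips where fare_amount <= 0 order by tripid",
)
print(passed, failures)
```

The returned rows double as a debugging aid: they are exactly the records that broke the rule.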

> 📄 [Singular data tests — docs](https://docs.getdbt.com/docs/build/data-tests#singular-data-tests)

---

## 2. Source freshness tests

These live in your source YAML, not in a separate file. You add a `freshness` block to a source and tell dbt which column indicates when data was last loaded. Then you run `dbt source freshness` and dbt checks whether that timestamp is recent enough.

You can set both `warn_after` and `error_after` thresholds — one to flag it, one to actually fail.

```yaml
version: 2

sources:
  - name: staging
    database: production
    schema: trips_data_all
    tables:
      - name: green_tripdata
        loaded_at_field: lpep_pickup_datetime
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
      
      - name: yellow_tripdata
        loaded_at_field: tpep_pickup_datetime
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
```

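Not something you see everywhere, but for pipelines where stale data would cause real problems it's a lifesaver. One caveat about the example above: a pickup timestamp is only a stand-in for a true load timestamp, and on a static 2019-2020 dataset the newest row is years old, so this check would always report an error.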
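
The threshold logic can be sketched in plain Python. The `freshness_status` function is hypothetical and only illustrates the idea of comparing the newest `loaded_at_field` value against the two thresholds; how dbt treats a row that is exactly on a threshold is not covered here and the strict comparison below is an assumption.

```python
from datetime import datetime, timedelta, timezone

def freshness_status(max_loaded_at, now, warn_after, error_after):
    """Classify a source as 'pass', 'warn', or 'error' by the age of its newest row."""
    age = now - max_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
warn_after = timedelta(hours=6)    # mirrors warn_after: {count: 6, period: hour}
error_after = timedelta(hours=12)  # mirrors error_after: {count: 12, period: hour}

for hours_old in (1, 8, 24):
    newest_row = now - timedelta(hours=hours_old)
    print(hours_old, freshness_status(newest_row, now, warn_after, error_after))
```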

> 📄 [Source freshness — docs](https://docs.getdbt.com/reference/resource-properties/freshness)

---

## 3. Generic tests

This is the big one — the most common type of test you'll see in dbt projects. Generic tests are defined in your YAML right alongside your column descriptions. They're parameterized and reusable, so you write the logic once and apply it across as many columns and models as you need.

### The four built-in generic tests

dbt ships with exactly four:

- **unique** — no duplicate values in this column
- **not_null** — no nulls allowed
- **accepted_values** — column values must be within a defined list
- **relationships** — every value in this column must exist in another model (referential integrity)

```yaml
version: 2

models:
  - name: stg_green_tripdata
    description: Staged green taxi data
    columns:
      - name: tripid
        description: Primary key for trips
        tests:
          - unique
          - not_null
      
      - name: vendorid
        tests:
          - not_null
      
      - name: payment_type
        description: Payment method code
        tests:
          - accepted_values:
              values: [1, 2, 3, 4, 5, 6]
              quote: false  # payment_type is an integer; values are quoted as strings by default
      
      - name: pickup_locationid
        description: Taxi zone where trip started
        tests:
          - relationships:
              to: ref('taxi_zone_lookup')
              field: locationid
```
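
Each built-in test boils down to a query that selects the offending rows. The Python sketch below writes one such query per test against a small SQLite database. The queries are simplified illustrations of the idea, not the SQL dbt actually generates, and the `zones` and `trips` tables with their rows are made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table zones (locationid integer);
    insert into zones values (1), (4);

    create table trips (tripid text, payment_type integer, pickup_locationid integer);
    insert into trips values
        ('a', 1, 1),
        ('a', 2, 4),      -- duplicate tripid
        (null, 9, 4),     -- null tripid, unexpected payment_type
        ('d', 1, 99);     -- pickup zone missing from zones
""")

# Each query selects the rows that violate the rule; an empty result means "pass".
checks = {
    "unique": """
        select tripid from trips
        where tripid is not null
        group by tripid having count(*) > 1
    """,
    "not_null": "select * from trips where tripid is null",
    "accepted_values": "select * from trips where payment_type not in (1, 2, 3, 4, 5, 6)",
    "relationships": """
        select t.pickup_locationid from trips t
        left join zones z on z.locationid = t.pickup_locationid
        where t.pickup_locationid is not null and z.locationid is null
    """,
}

failures = {name: con.execute(sql).fetchall() for name, sql in checks.items()}
for name, rows in failures.items():
    print(name, "FAIL" if rows else "PASS", rows)
```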

> 📄 [Generic data tests — docs](https://docs.getdbt.com/docs/build/data-tests#generic-data-tests)

### Writing your own custom generic tests

Four tests won't cover everything. You can write your own — they're SQL files that live in `tests/generic/`. The syntax uses Jinja test blocks, and dbt will pick them up and make them available just like the built-ins.

```sql
-- tests/generic/test_positive_values.sql
{% test positive_values(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} <= 0  -- zero is not positive, so it fails as well

{% endtest %}
```

**Usage in schema.yml:**
```yaml
models:
  - name: fct_trips
    columns:
      - name: fare_amount
        tests:
          - positive_values
      
      - name: trip_distance
        tests:
          - positive_values
```

And here's the thing — you probably don't need to write as many custom tests as you'd expect. The dbt community has already built a ton of them in open-source packages (dbt-utils, dbt-expectations, etc.). Worth checking those before rolling your own.

> 📄 [Writing custom generic tests — docs](https://docs.getdbt.com/best-practices/writing-custom-generic-tests)

---

## 4. Unit tests

Available from dbt v1.8 onwards (released in mid-2024). Unit tests let you test your SQL logic in isolation, without hitting the warehouse with real data.

The idea: you define a small set of mock input rows and the expected output rows. dbt runs your model's SQL against those mocks and checks whether the output matches what you said it should be. This is especially handy for complex logic — rolling windows, regex, edge cases — because you can test for scenarios that haven't even shown up in your real data yet.

```yaml
version: 2

unit_tests:
  - name: test_payment_type_mapping
    description: Test that payment type codes map to correct descriptions
    model: stg_green_tripdata
    given:
      - input: source('staging', 'green_tripdata')
        rows:
          # vendorid is set because the model drops rows where it is null
          - {vendorid: 1, payment_type: 1}
          - {vendorid: 1, payment_type: 2}
          - {vendorid: 1, payment_type: 5}
    expect:
      rows:
        - {payment_type: 1, payment_type_description: 'Credit card'}
        - {payment_type: 2, payment_type_description: 'Cash'}
        - {payment_type: 5, payment_type_description: 'Unknown'}
```

Unit tests are defined in YAML in your `models/` directory, and currently only support SQL models. Since the inputs are static, there's no reason to run them in production — use them in development and CI.
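
The mechanics (mock rows in, model SQL run, output compared with expectations) can be imitated in a few lines of Python. This is a conceptual sketch only: `MODEL_SQL` is a simplified, hypothetical stand-in for the staging model's payment-type logic, and `run_unit_test` is a helper invented for this example.

```python
import sqlite3

# Hypothetical, simplified stand-in for the model logic under test.
MODEL_SQL = """
    select
        payment_type,
        case payment_type
            when 1 then 'Credit card'
            when 2 then 'Cash'
            else 'Unknown'
        end as payment_type_description
    from green_tripdata
    order by payment_type
"""

def run_unit_test(given_rows, expected_rows):
    """Load mock rows into a throwaway database, run the model SQL, compare outputs."""
    con = sqlite3.connect(":memory:")
    con.execute("create table green_tripdata (payment_type integer)")
    con.executemany("insert into green_tripdata values (?)", [(r,) for r in given_rows])
    actual_rows = con.execute(MODEL_SQL).fetchall()
    return actual_rows == expected_rows, actual_rows

passed, actual = run_unit_test(
    given_rows=[1, 2, 5],
    expected_rows=[(1, "Credit card"), (2, "Cash"), (5, "Unknown")],
)
print(passed, actual)
```

No real table is touched: the inputs are fixed, so the result depends only on the SQL logic.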

As of early 2026, unit tests have been available for a little under two years. They're particularly useful in CI/CD pipelines, where they catch logic errors before they reach production data, and for teams with complex transformation logic or strict data quality requirements.

> 📄 [Unit tests — docs](https://docs.getdbt.com/docs/build/unit-tests)

---

## 5. Model contracts

The last type covered in this video, and a bit different from the others. Model contracts aren't about catching bad data after the fact — they're about **preventing your model from building at all** if it doesn't match a defined shape.

You define the expected columns, data types, and optionally constraints in your YAML. Then you flip on `contract: enforced: true` in the model's config. From that point on, if your model's output doesn't match — wrong column name, wrong type, missing column — dbt will error out before anything gets materialized.

```yaml
version: 2

models:
  - name: fct_trips
    config:
      contract:
        enforced: true
    columns:
      - name: tripid
        data_type: string
        constraints:
          - type: not_null
          - type: unique
      
      - name: pickup_datetime
        data_type: timestamp
        constraints:
          - type: not_null
      
      - name: service_type
        data_type: string
      
      - name: total_amount
        data_type: numeric
```

The idea behind this comes from the concept of **data contracts** — you sit down with your stakeholder, agree on what the output dataset should look like (column names, types, freshness expectations), and the contract enforces that agreement automatically. If someone changes the model in a way that breaks it, they'll know immediately.
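
The "refuse to build on a shape mismatch" behaviour can be pictured as a pre-flight comparison. The Python sketch below is an illustration under assumptions: `check_contract` is a hypothetical helper, column types are plain strings, and treating an unexpected extra column as a violation is this sketch's choice. dbt performs its own check against the warehouse.

```python
def check_contract(contract, actual_columns):
    """Return a list of mismatches between the contracted and actual column types."""
    problems = []
    for name, expected_type in contract.items():
        if name not in actual_columns:
            problems.append(f"missing column: {name}")
        elif actual_columns[name] != expected_type:
            problems.append(
                f"type mismatch on {name}: expected {expected_type}, got {actual_columns[name]}"
            )
    for name in actual_columns:
        if name not in contract:
            problems.append(f"unexpected column: {name}")
    return problems

contract = {"tripid": "string", "pickup_datetime": "timestamp", "total_amount": "numeric"}

# Hypothetical model output: total_amount drifted to float and a column was added.
actual = {
    "tripid": "string",
    "pickup_datetime": "timestamp",
    "total_amount": "float",
    "tip": "numeric",
}

problems = check_contract(contract, actual)
if problems:
    print("build blocked:", problems)
```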

> 📄 [Model contracts — docs](https://docs.getdbt.com/docs/mesh/govern/model-contracts)

================================================
FILE: 04-analytics-engineering/class_notes/4_5_3_dbt_packages.md
================================================
# DE Zoomcamp 4.5.3 — dbt Packages

> 📄 Video: [dbt Packages](https://www.youtube.com/watch?v=KfhUA9Kfp8Y)  
> 📄 Official docs: [Packages](https://docs.getdbt.com/docs/build/packages)  
> 📄 Package Hub: [hub.getdbt.com](https://hub.getdbt.com)

One of the things that makes dbt's community so strong is packages. A dbt package is basically a self-contained dbt project — it has its own macros, tests, models, sources — but instead of using it yourself, you distribute it so other people can drop it into their own projects. Think Python libraries, but for dbt. This video covers the most useful packages out there and how to actually install and use them.

---

## Packages worth knowing about

### dbt-utils

The big one. Maintained by dbt Labs, so it's well-kept and safe to use. It bundles a ton of common SQL utilities as macros — things like generating surrogate keys, deduplicating, pivoting, safe division, extracting URL parameters. Stuff most of us have written ourselves at some point.

The real kicker is **cross-database compatibility**. dbt-utils macros compile down to the correct SQL dialect depending on your warehouse. So the same macro works on BigQuery, DuckDB, Snowflake, etc. — no need to maintain separate versions of your code.

### dbt-codegen

A massive time-saver for the YAML grind. Codegen does two things:

- **YAML from SQL** — point it at a model or source and it auto-generates the `schema.yml` with all the columns listed out. No more manually typing hundreds of column names.
- **SQL from YAML** — the reverse. Give it a YAML spec and it generates a staging model SQL file following dbt conventions (single CTE for renaming, proper file naming, etc.).

### dbt-project-evaluator

Scores your dbt project against best practices. Good for teams that want a quick sanity check on whether they're following conventions.

### dbt-audit-helper

Handy when you're refactoring. It compares an old model against a new one and validates that they produce the same results — same columns, same row counts, same values. Takes the anxiety out of rewriting existing SQL.
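
The core idea of such a comparison is a two-way set difference between the old and new results. The following Python sketch shows that idea with SQLite's `except`; it is not the package's implementation, the table names and rows are made up, and `except` ignores duplicate rows, which a real audit would also need to account for.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table old_model (tripid text, total_amount real);
    create table new_model (tripid text, total_amount real);
    insert into old_model values ('a', 10.0), ('b', 20.0);
    insert into new_model values ('a', 10.0), ('b', 25.0);
""")

# Rows present on one side but not the other, in both directions.
only_in_old = con.execute(
    "select * from old_model except select * from new_model"
).fetchall()
only_in_new = con.execute(
    "select * from new_model except select * from old_model"
).fetchall()

models_match = not only_in_old and not only_in_new
print(models_match, only_in_old, only_in_new)
```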

### dbt-expectations

This is the one that makes custom tests almost unnecessary. It's a massive library of pre-built generic tests covering almost every assertion you can think of — row counts, value ranges, consistent casing, regex matching, approximate equality, and way more. In practice, if you need to test something, there's a very good chance dbt-expectations already has it.

> 📄 [dbt-expectations on the Package Hub](https://hub.getdbt.com/calogica/dbt_expectations/latest/)

### Warehouse-specific packages

The hub has plenty of packages tailored to specific platforms — Snowflake, BigQuery, etc. These typically come with models or macros for monitoring spend, evaluating best practices, applying constraints, or working with platform-specific features like semantic views.

---

## A note on trust

Packages on the dbt Hub have gone through a vetting process by dbt Labs — they're generally safe to use. Packages you find floating around on GitHub that aren't on the Hub? Take a closer look at what they actually do before dropping them into your project.

---

## How to install a package — the demo

The video walks through installing dbt-utils and using it to generate surrogate keys. Here's the workflow:

### 1. Create packages.yml

At the root of your dbt project (same level as `dbt_project.yml`), create a file called `packages.yml`. Declare the package and pin the version.

```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
```

### 2. Run `dbt deps`

This downloads and installs the package. After it runs, two things appear:

- A `package-lock.yml` file — contains a hash of exactly what was installed. Commit this to version control so everyone on your team gets the same versions.
- A `dbt_packages/` directory — this is where the installed package code lives. It's git-ignored by default (you don't want to commit other people's source code into your repo), but you can browse it if you're curious how the macros work.

### 3. Use it

Once installed, the package's macros are immediately available. You call them with the standard Jinja syntax, prefixing with the package name.

**Before (manual surrogate key):**
```sql
select
    -- Manual concatenation approach
    concat(
        cast(vendorid as string), '-',
        cast(lpep_pickup_datetime as string)
    ) as tripid,
    vendorid,
    lpep_pickup_datetime as pickup_datetime
from {{ source('staging', 'green_tripdata') }}
```

**After (using dbt_utils.generate_surrogate_key):**
```sql
select
    -- Clean, cross-database macro
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    vendorid,
    lpep_pickup_datetime as pickup_datetime
from {{ source('staging', 'green_tripdata') }}
```

That's it. The macro handles the rest: it casts each column to a string, concatenates them, and hashes the result with MD5, compiling to the right casting and hashing syntax for whatever warehouse you're targeting (BigQuery, Snowflake, DuckDB, etc.).
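
To make the macro less of a black box, here's a rough Python sketch of what `generate_surrogate_key` boils down to, as far as we can tell from the dbt-utils source: stringify each column, swap nulls for a placeholder, join with `-`, and take the MD5 hash. The function name `surrogate_key` is made up for illustration, and the placeholder value is an assumption based on the package default. Real keys also depend on how your warehouse casts timestamps to strings, so don't expect these hashes to match your warehouse output.

```python
import hashlib

# Placeholder used for NULLs (assumed to match the dbt-utils default).
NULL_PLACEHOLDER = "_dbt_utils_surrogate_key_null_"


def surrogate_key(values):
    """Hypothetical helper: MD5 over the '-'-joined string form of the values."""
    parts = [NULL_PLACEHOLDER if v is None else str(v) for v in values]
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()


# The same inputs always give the same 32-character hex key.
key = surrogate_key([2, "2019-01-01 00:10:16"])
print(key)
```

The null placeholder is the reason a row with a missing value still gets a stable key that differs from a row with an empty string.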

> 📄 [dbt deps command — docs](https://docs.getdbt.com/reference/commands/deps)

================================================
FILE: 04-analytics-engineering/class_notes/4_6_1_dbt_commands.md
================================================
# DE Zoomcamp 4.6.1 — dbt Commands

> 📄 Video: [dbt Commands](https://www.youtube.com/watch?v=t4OeWHW3SsA)  
> 📄 Official docs: [dbt command reference](https://docs.getdbt.com/reference/dbt-commands)  
> 📄 Selection syntax: [Node selection syntax](https://docs.getdbt.com/reference/node-selection/syntax)

We've been using dbt commands throughout the series without really stopping to talk about all of them. This video is the full tour — every command you'll actually use, plus the flags that make them powerful. Good one to bookmark.

---

## The setup commands — run these once (or when needed)

### dbt init

Creates your dbt project from scratch. Generates the full directory structure: `models/`, `seeds/`, `snapshots/`, `tests/`, `analyses/`, all of it. You only ever run this once, at the very start.

### dbt debug

Checks that your `profiles.yml` is valid and that dbt can actually connect to your warehouse. Run this whenever you're setting up a new environment or something feels off with your connection.

### dbt deps

Installs packages from your `packages.yml`. We covered this in 4.5.3 — just know it lives here in the command lineup too.

### dbt clean

Deletes the directories listed under `clean-targets` in your `dbt_project.yml`. A project created with `dbt init` lists `target/` and `dbt_packages/` there. Useful for a fresh start, but remember you'll need to run `dbt deps` again after cleaning if you deleted `dbt_packages/`. You can add other directories to `clean-targets` if you want.

> 📄 [dbt clean — docs](https://docs.getdbt.com/reference/commands/clean)

---

## The feature-specific commands

These are tied to specific dbt features rather than being general-purpose.

### dbt seed

Loads all the CSVs in your `seeds/` directory into the warehouse. Quick and simple — great for reference data or small lookup tables.

### dbt snapshot

Runs any snapshots you've defined in your project. Snapshots are dbt's way of tracking how source data changes over time (think SCD Type 2). Not something you use every day, but it's there when you need it.

### dbt source freshness

Checks whether your source data is stale. If you've defined `freshness` blocks in your source YAML (we covered this in 4.5.2), this is the command that actually runs the check.

### dbt docs generate / dbt docs serve

`dbt docs generate` compiles your YAML documentation, model code, and warehouse metadata into a `catalog.json` artifact in `target/`. `dbt docs serve` spins up a local website (localhost:8080) so you can browse it. On dbt Cloud, `docs serve` isn't needed — it's handled automatically. For dbt Core users, finding a scalable way to host that docs site is something you'll need to sort out yourself.

> 📄 [dbt docs commands — docs](https://docs.getdbt.com/reference/commands/cmd-docs)

---

## The big four — these are your daily drivers

### dbt compile

Looks like it's doing nothing, but it's actually super useful. Takes all your models — with their Jinja, `ref()`, `source()` calls and everything — and outputs the fully resolved SQL into `target/compiled/`. No data moves, nothing hits the warehouse. It's just pure SQL sitting there for you to inspect.

Why bother? Two reasons. First, it's the fastest way to catch Jinja errors — way quicker than waiting for a full `dbt run`. Second, it's completely free — no compute, no warehouse cost. Good habit to run after making changes.

> 📄 [dbt compile — docs](https://docs.getdbt.com/reference/commands/compile)

### dbt run

Materializes every model in your project. Views become views, tables become tables, incremental models get incremental logic applied — whatever you configured. Models run in dependency order, so dbt figures out the sequence for you.

This is your go-to during active development when you just want to see your models built.

> 📄 [dbt run — docs](https://docs.getdbt.com/reference/commands/run)

### dbt test

Runs all the tests in your project — generic tests, singular tests, unit tests, all of it. Reports pass/fail at the end. Nothing gets built here, it just validates what's already in the warehouse.

> 📄 [dbt test — docs](https://docs.getdbt.com/reference/commands/test)

### dbt build ⭐

The most important command. It's a smart combination of `dbt run` + `dbt test` + `dbt seed` + `dbt snapshot`, all in one. But it's not just running them sequentially — it's DAG-aware. It knows the right order, and if something fails along the way, it skips everything downstream of that failure rather than wasting compute on models that are going to break anyway.

This is what you want for CI, production runs, or any time you need confidence that your whole project is solid.
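
To make the "DAG-aware" part concrete, here's a small Python sketch of the skip-on-failure idea. It is not dbt's actual scheduler: the DAG, the `build` function, and the model names `stg_yellow_tripdata` and `int_trips_unioned` are made up for illustration, and a real build also interleaves tests, seeds, and snapshots.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each node maps to the set of nodes it depends on.
dag = {
    "stg_green_tripdata": set(),
    "stg_yellow_tripdata": set(),
    "int_trips_unioned": {"stg_green_tripdata", "stg_yellow_tripdata"},
    "fct_trips": {"int_trips_unioned"},
    "dim_zones": set(),
}


def build(dag, failing):
    """Walk the DAG in dependency order; skip anything downstream of a failure."""
    status = {}
    for node in TopologicalSorter(dag).static_order():
        if any(status[parent] != "success" for parent in dag[node]):
            status[node] = "skipped"
        elif node in failing:
            status[node] = "error"
        else:
            status[node] = "success"
    return status


result = build(dag, failing={"stg_green_tripdata"})
for node, state in result.items():
    print(f"{node}: {state}")
```

Note that `dim_zones` and `stg_yellow_tripdata` still build: only the nodes downstream of the failure are skipped.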

> 📄 [dbt build — docs](https://docs.getdbt.com/reference/commands/build)

### dbt retry

If a `dbt build` or `dbt run` fails partway through, don't just re-run the whole thing from scratch. `dbt retry` re-executes from the point of failure by reading the `run_results.json` file from the previous run. It automatically identifies which nodes failed and re-runs those nodes plus everything downstream of them.

How it works:
- dbt looks at `target/run_results.json` from the last command
- It identifies failed nodes and skipped nodes (anything downstream of a failure)
- It re-runs only those nodes, reusing the same selection criteria from the original command
- If the previous command completed successfully, `dbt retry` finishes as a no-op

Saves a lot of time on big projects, especially when a single model fails deep in the DAG.
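
The steps above can be sketched in a few lines of Python. This is only an approximation of the idea, not dbt's implementation: the JSON below is a trimmed-down, made-up stand-in for `target/run_results.json`, `nodes_to_retry` is a hypothetical helper, and the set of retryable statuses is an assumption.

```python
import json

# A trimmed-down, made-up stand-in for target/run_results.json.
run_results_text = """
{
  "results": [
    {"unique_id": "model.taxi_rides_ny.stg_green_tripdata", "status": "success"},
    {"unique_id": "model.taxi_rides_ny.int_trips_unioned", "status": "error"},
    {"unique_id": "model.taxi_rides_ny.fct_trips", "status": "skipped"}
  ]
}
"""

# Statuses we assume should be picked up again on a retry.
RETRYABLE = {"error", "fail", "skipped"}


def nodes_to_retry(run_results):
    """Return the unique_ids of nodes that failed or were skipped last time."""
    return [r["unique_id"] for r in run_results["results"] if r["status"] in RETRYABLE]


to_retry = nodes_to_retry(json.loads(run_results_text))
print(to_retry)
```

If every result is a success, the list is empty, which matches the no-op behaviour described above.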

---

## Flags — the important ones

### --help / -h

Works on any command. `dbt --help` gives you the full list, `dbt run --help` gives you flags specific to `run`. Standard stuff, but worth knowing it's there.

### --version / -V

Tells you which version of dbt you have installed. Also lets you know if there's an update available.

### --full-refresh / -f

Used with `dbt run` or `dbt build`. When you have an incremental model, it normally just appends new rows. `--full-refresh` drops the whole thing and rebuilds from scratch. Handy when historical data has changed, you've got duplicates, or you just want to make sure everything is clean. Most teams do this on a regular schedule — maybe once a month — just to keep things tidy.

```bash
dbt run --full-refresh
```

### --fail-fast

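Makes dbt stop at the first failure. Normally, when one model or test fails, dbt skips its downstream nodes but keeps building everything unrelated; with `--fail-fast` (short form `-x`) it exits as soon as a single node fails. Good for CI, where you want to fail loud and early instead of waiting for the rest of the run. Note that this is about failures, not warnings: if you want warnings treated as errors, that's the separate `--warn-error` flag.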

### --target / -t

Controls which target from your `profiles.yml` dbt runs against. By default dbt uses the target named in the profile's `target:` key, which is usually `dev`. But you can override it:

```bash
dbt run --target prod
```

Works with `dbt run`, `dbt build`, `dbt test`, `dbt snapshot` — basically any command that touches the warehouse. Best practice: developers work in `dev`, production runs use `--target prod`.

### --select / -s

This is the big one. Lets you run only specific parts of your project instead of everything. There are a few ways to use it:

**By model name** — just give it the model name (no `.sql` needed):

```bash
dbt run --select stg_green_tripdata
```

**By directory path** — everything in a folder:

```bash
dbt run --select models/staging
```

**By tag:**

```bash
dbt run --select tag:nightly
```

**With graph operators (the + sign)** — this is where it gets really useful. The `+` lets you pull in upstream or downstream dependencies:

```bash
# Run stg_green_tripdata and all upstream dependencies
dbt run --select +stg_green_tripdata

# Run fct_trips and all downstream dependencies
dbt run --select fct_trips+

# Run dim_zones plus everything upstream AND downstream
dbt run --select +dim_zones+
```

- `+my_model` — builds `my_model` and everything upstream of it (all its ancestors)
- `my_model+` — builds `my_model` and everything downstream of it (all its descendants)
- `+my_model+` — both directions. Everything upstream, the model itself, and everything downstream
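
If the `+` rules feel abstract, here's a tiny Python sketch of how they resolve on a toy DAG. The `parents` mapping, the `select` helper, and the `dm_monthly_revenue` model are made up for illustration; dbt's real selector supports much more (depth limits, `@`, tags, paths).

```python
# Hypothetical DAG: each model maps to the models it selects from (its parents).
parents = {
    "stg_green_tripdata": [],
    "stg_yellow_tripdata": [],
    "dim_zones": [],
    "fct_trips": ["stg_green_tripdata", "stg_yellow_tripdata", "dim_zones"],
    "dm_monthly_revenue": ["fct_trips"],
}


def ancestors(model):
    """Everything upstream of `model` (what the leading + adds)."""
    found = set()
    stack = list(parents[model])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents[node])
    return found


def descendants(model):
    """Everything downstream of `model` (what the trailing + adds)."""
    return {m for m in parents if model in ancestors(m)}


def select(spec):
    """Resolve '+model', 'model+' or '+model+' to a set of model names."""
    model = spec.strip("+")
    selected = {model}
    if spec.startswith("+"):
        selected |= ancestors(model)
    if spec.endswith("+"):
        selected |= descendants(model)
    return selected


print(sorted(select("+fct_trips")))
print(sorted(select("dim_zones+")))
```

One thing the sketch makes visible: `dim_zones+` pulls in `fct_trips`, but not the other parents of `fct_trips`.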

> 📄 [Graph operators — docs](https://docs.getdbt.com/reference/node-selection/graph-operators)

**With state selectors** — instead of guessing what changed, let dbt figure it out:

```bash
dbt build --select state:modified+ --state ./prod-artifacts
```

- `state:new`: nodes that don't exist in the comparison manifest, i.e. models you just created
- `state:modified`: new nodes plus anything whose code or config differs from the comparison manifest
- Add `+` after to include downstream dependencies of modified models

How state comparison works:
- You need artifacts from a **previous run** stored somewhere persistent (not the same `target/` directory you're currently writing to)
- On **dbt Cloud**, this is handled automatically — production artifacts are stored and accessible for comparison
- On **dbt Core**, you need to manually store artifacts (especially `manifest.json`) somewhere — a cloud bucket, a separate directory, version control, etc.
- Point `--state` to where those previous artifacts live
- dbt compares your current code against those artifacts to determine what's new or modified

The key is that you're comparing against a *different environment's artifacts* (usually production) or a *previous point in time* — not against the directory you're currently building into. This lets you run only what's changed since your last production deployment, which is incredibly useful for CI/CD workflows.
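
Here's a heavily simplified Python sketch of the comparison. The two dictionaries are made-up stand-ins for the production and current `manifest.json` (reduced to a node name and a checksum), and `state_new` / `state_modified` are hypothetical helpers. Real dbt looks at more than a file checksum, for example configs and macros.

```python
# Made-up, heavily simplified manifests: node name -> checksum of its SQL file.
prod_manifest = {
    "stg_green_tripdata": "a1f3",
    "fct_trips": "9c2e",
}
current_manifest = {
    "stg_green_tripdata": "a1f3",  # unchanged
    "fct_trips": "77b0",           # edited since the production run
    "dim_zones": "4d5a",           # did not exist in production
}


def state_new(current, previous):
    """Nodes present now but absent from the comparison manifest."""
    return {name for name in current if name not in previous}


def state_modified(current, previous):
    """New nodes plus nodes whose checksum differs from the comparison manifest."""
    changed = {n for n in current if n in previous and current[n] != previous[n]}
    return changed | state_new(current, previous)


print(sorted(state_new(current_manifest, prod_manifest)))
print(sorted(state_modified(current_manifest, prod_manifest)))
```

This is why the comparison manifest has to come from somewhere other than the `target/` directory you're writing to: otherwise both sides would be identical and nothing would ever count as modified.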

Storing those JSON artifacts persistently is also just good practice in general — you can use them to analyze how your project evolves over time.

> 📄 [Node selection syntax — docs](https://docs.getdbt.com/reference/node-selection/syntax)

================================================
FILE: 04-analytics-engineering/refreshers/SQL.md
================================================
# SQL Refresher

### Table of contents


- [Window Functions](#window-functions)
    - [Row Number](#row-number)
    - [Rank and Dense Rank](#rank-and-dense-rank)    
    - [Lag and Lead](#lag-and-lead)   
    - [Percentile Cont](#percentile-cont)         
- [Common Table Expression](#common-table-expression)
- [dbt models and CTEs](#dbt-models-and-ctes)


## Window Functions    

A window function performs a calculation across a set of table rows that are related to the current row within a specific "window" or subset of data. This is comparable to the type of calculation that can be done with an aggregate function (such as SUM(), AVG(), COUNT(), etc.).

But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.


**Syntax:**

```sql
FUNCTION() OVER (PARTITION BY column_name ORDER BY column_name)
```

A window function call always has two components: the function itself and the OVER clause. The OVER clause defines your window:

```sql
OVER (PARTITION BY column_name ORDER BY column_name)
```

Your window is how you want to view your data when you apply the function:

- PARTITION BY: divides the result set into groups (optional).

- ORDER BY: defines the order of processing rows within the partition.


**Common Window Functions:**

Ranking Functions:

- ROW_NUMBER(): Assigns a unique row number within a partition.
- RANK(): Similar to ROW_NUMBER(), but assigns the same rank to duplicate values, skipping numbers.
- DENSE_RANK(): Like RANK(), but without gaps in numbering.

Aggregate Functions as Window Functions:

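- SUM() OVER(): Computes a total over the window (a running total when the window has an ORDER BY).
- AVG() OVER(): Computes an average over the window (a moving average when the window is ordered and framed).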

Lag and Lead Functions:

- LAG(): Retrieves the value from a previous row.
- LEAD(): Retrieves the value from the next row.

### Row Number

ROW_NUMBER() does just what it sounds like—displays the number of a given row. It starts at 1 and numbers the rows according to the ORDER BY part of the window statement. Using the PARTITION BY clause will allow you to begin counting 1 again in each partition.

**Syntax:**

```sql
ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name)
```

**Common Uses:**

- Removing Duplicates: You can use ROW_NUMBER() to identify duplicate rows and keep only one by filtering out rows with a row number greater than 1.

- Ranking Data: Used when ranking rows based on specific criteria but requiring unique row numbers.

- Selecting the Latest Record: Helps in selecting the most recent entry per category when combined with PARTITION BY.

**Example 1:**

```sql

SELECT
  total_amount,
  ROW_NUMBER() OVER (ORDER BY total_amount DESC) AS ranking
FROM `greentaxi_trips`
-- Sort by the row number so that LIMIT keeps the 10 highest amounts
ORDER BY ranking
LIMIT 10;

```

The query returns the top 10 highest total_amount values from the table, along with a row number indicating their ranking.


| total_amount | ranking |
|--------|--------|
| 4012.3 | 1      |
| 2878.3 | 2      |
| 2438.8 | 3      |
| 2156.3 | 4      |
| 2109.8 | 5      |
| 2017.3 | 6      |
| 1971.05| 7      |
| 1958.8 | 8      |
| 1762.8 | 9      |
| 1600.8 | 10     |

The column generated with ROW_NUMBER() is temporary and does not modify the original table. It is just a calculation applied to the data in the query result.

**Example 2:**

Let's modify the previous query to add a partition by pickup location ID:

```sql

SELECT 

  total_amount,
  PULocationID,
  ROW_NUMBER() OVER (PARTITION BY PULocationID ORDER BY total_amount DESC) AS ranking

FROM `greentaxi_trips` 
LIMIT 10;

```

This SQL query assigns a ranking to each row based on total_amount in descending order within each PULocationID group:

| total_amount | PULocationID | ranking |
|-----------|-----------|-----------|
| 8.51      | 224       | 432       |
| 8.3       | 224       | 433       |
| 8.3       | 224       | 434       |
| 7.3       | 224       | 435       |
| 3.3       | 224       | 436       |
| 86.42     | 234       | 1         |
| 73.5      | 234       | 2         |
| 62.7      | 234       | 3         |
| 61.94     | 234       | 4         |
| 61.94     | 234       | 5         |

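Notice that the ranking restarts at 1 when PULocationID changes from 224 to 234. Because this query has no final ORDER BY, LIMIT 10 returns an arbitrary slice of the rows, which is why the rows shown for location 224 start at ranking 432.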
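
One of the common uses listed above, selecting the latest record per category, can be tried end to end without a warehouse. The sketch below runs the pattern against an in-memory SQLite database from Python (SQLite 3.25 or newer is assumed for window function support). The `trips` table and its four rows are made up for illustration; they are not the course dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (PULocationID INTEGER, pickup TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (224, "2019-01-01 08:00:00", 8.3),
        (224, "2019-01-02 09:30:00", 7.3),
        (234, "2019-01-01 10:15:00", 61.94),
        (234, "2019-01-03 11:45:00", 86.42),
    ],
)

# Number rows per location, newest first, then keep only row number 1.
latest = conn.execute(
    """
    SELECT PULocationID, pickup, total_amount
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY PULocationID ORDER BY pickup DESC) AS rn
        FROM trips
    )
    WHERE rn = 1
    ORDER BY PULocationID
    """
).fetchall()
print(latest)
```

The same `rn = 1` filter removes duplicates when the partition is the set of columns that define a duplicate.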

### Rank and Dense Rank

ROW_NUMBER(), RANK(), and DENSE_RANK() are window functions used to assign a ranking to rows based on a specified order. However, they behave differently when there are duplicate values in the ranking column.

RANK() assigns a ranking, but skips numbers if there are ties. DENSE_RANK() is similar to RANK(), but does not skip numbers when there are ties.

For example:

| Score | ROW_NUMBER() | RANK() | DENSE_RANK() |
|-------|--------------|--------|--------------|
| 95    | 1            | 1      | 1            |
| 90    | 2            | 2      | 2            |
| 90    | 3            | 2      | 2            |
| 85    | 4            | 4      | 3            |
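
The table can be reproduced with a short script. This is a sketch that uses Python's built-in SQLite (3.25 or newer assumed) as a stand-in for the warehouse; the `scores` table is created only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?)", [(95,), (90,), (90,), (85,)])

# Apply the three ranking functions over the same window.
rows = conn.execute(
    """
    SELECT
        score,
        ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
        RANK()       OVER (ORDER BY score DESC) AS rnk,
        DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
    FROM scores
    ORDER BY row_num
    """
).fetchall()
for row in rows:
    print(row)
```

Which of the two tied rows receives row number 2 and which receives 3 is arbitrary, because the ORDER BY in the window does not distinguish them.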


### Lag and Lead

It can often be useful to compare rows to preceding or following rows. You can use LAG or LEAD to create columns that pull values from other rows without the need for a self-join. All you need to do is specify which column to pull from and how many rows away the value is. LAG pulls from previous rows and LEAD pulls from following rows.


**Syntax:**

```sql

LAG(expression [, offset]) OVER (PARTITION BY partition_expression ORDER BY order_expression)
```

- expression: The column whose value you want to retrieve from the previous row
- offset (optional): The number of rows back from the current row to look. The default is 1, meaning it looks at the immediate previous row.
- PARTITION BY (optional): Divides the result set into partitions to apply the function to each partition separately.
- ORDER BY: Specifies the order in which the rows are processed.

**Example:**

```sql

SELECT 

lpep_pickup_datetime,
total_amount,
LAG(total_amount) OVER (ORDER BY lpep_pickup_datetime) as prev_total_amount,
LEAD(total_amount) OVER (ORDER BY lpep_pickup_datetime) as next_total_amount

FROM `greentaxi_trips` 
ORDER BY lpep_pickup_datetime

```

The query retrieves the lpep_pickup_datetime, total_amount, the previous trip's total_amount, and the next trip's total_amount.

| lpep_pickup_datetime      | total_amount | prev_total_amount | next_total_amount |
|---------------------------|--------------|-------------------|-------------------|
| 2008-12-31 23:33:38 UTC   | 7.3          | 6.3               | 5.3               |
| 2008-12-31 23:42:31 UTC   | 5.3          | 7.3               | 14.55             |
| 2008-12-31 23:47:51 UTC   | 14.55        | 5.3               | 19.55             |
| 2008-12-31 23:57:46 UTC   | 19.55        | 14.55             | 9.8               |
| 2009-01-01 00:00:00 UTC   | 9.8          | 19.55             | 81.3              |
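
The optional offset and the behaviour at the edges of the window are easier to see on a tiny table. The sketch below uses Python's built-in SQLite (3.25 or newer assumed) with three made-up rows; the table and column names are chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        ("2019-01-01 08:00:00", 7.3),
        ("2019-01-01 08:05:00", 5.3),
        ("2019-01-01 08:10:00", 14.55),
    ],
)

# LAG/LEAD with the default offset of 1, plus LAG with an explicit offset of 2.
rows = conn.execute(
    """
    SELECT
        total_amount,
        LAG(total_amount)    OVER (ORDER BY pickup) AS prev_amount,
        LEAD(total_amount)   OVER (ORDER BY pickup) AS next_amount,
        LAG(total_amount, 2) OVER (ORDER BY pickup) AS two_back
    FROM trips
    ORDER BY pickup
    """
).fetchall()
for row in rows:
    print(row)
```

The first row has no previous row and the last row has no next row, so LAG and LEAD return NULL there (`None` in Python).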


### Percentile Cont

Computes the specified percentile value for the value_expression, with linear interpolation.

**Syntax:**

```sql

PERCENTILE_CONT(value_expression, percentile ) OVER (PARTITION BY partition_expression)
```

**Example:**

Let's calculate the 90th percentile of total_amount for each unique pickup location (PULocationID)

```sql

SELECT 
  PULocationID,
  total_amount,
  PERCENTILE_CONT(total_amount, 0.9 ) OVER (PARTITION BY PULocationID) AS p90

FROM `greentaxi_trips` 

```

- PERCENTILE_CONT(total_amount, 0.9): calculates the 90th percentile (p90) of total_amount
- PARTITION BY PULocationID: This groups the calculations by PULocationID, so the 90th percentile is computed separately for each location.


The query result looks like this:

| PULocationID | total_amount | p90  |
|--------------|--------------|------|
| 224          | 17.3         | 51.9 |
| 224          | 20.67        | 51.9 |
| 224          | 21           | 51.9 |
| 224          | 26.06        | 51.9 |
| 224          | 27.13        | 51.9 |
| 224          | 40.14        | 51.9 |
| 224          | 55.46        | 51.9 |
| 224          | 25.74        | 51.9 |
| 224          | 27.02        | 51.9 |
| 224          | 37           | 51.9 |


The P90 value is essentially the amount below which 90% of the values fall. In this table, the P90 
is constant at 51.9, which means that for location "224", 90% of the total amounts are below 51.9.
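
"Linear interpolation" means that when the requested percentile falls between two sorted values, the result is a weighted mix of both. The following Python function is a minimal sketch of that calculation, assuming the usual definition where the position in the sorted list is `percentile * (n - 1)`. The function and the sample list are made up for illustration and are not BigQuery's implementation.

```python
import math


def percentile_cont(values, p):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)  # fractional index into the sorted list
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower          # how far the position is towards the upper value
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


amounts = [10.0, 20.0, 30.0, 40.0, 50.0]
p90 = percentile_cont(amounts, 0.9)
print(p90)
```

With five values the position is 3.6, so the result lies 60% of the way from the fourth value (40.0) to the fifth (50.0), which is approximately 46.0.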


## Common Table Expression

A CTE, short for Common Table Expression, is like a query within a query. With the WITH clause, you define a named, temporary result set that the main query can select from, which makes complex queries more readable and maintainable. A CTE exists only for the duration of the query that defines it.

CTEs and subqueries are both powerful tools and can be used to achieve similar goals, but they have different use cases and advantages. The main differences are that a CTE can be referenced several times within the same query and is usually easier to read.

By declaring CTEs at the beginning of the query, you enhance code readability, enabling a clearer grasp of your analysis logic. 

**Syntax:**

```sql

WITH cte_name AS (
    SELECT column1, column2
    FROM some_table
    WHERE condition
)
SELECT * FROM cte_name;
```

**Example: Let's find the trip with the second largest total_amount**

```sql

WITH cte AS(

  SELECT
  lpep_pickup_datetime,
  total_amount,
  RANK() OVER (ORDER BY total_amount DESC) AS rank

  FROM `greentaxi_trips` 

)


SELECT * FROM cte WHERE rank = 2;

```

The query starts with a Common Table Expression (CTE) named cte. We use the RANK() window function to 
assign a ranking (rank) to each row based on total_amount in descending order (from highest to lowest).

Now, we use the CTE in the main query: ```SELECT * FROM cte WHERE rank = 2;```

Result of the query:


| lpep_pickup_datetime      | total_amount | rank | 
|---------------------------|--------------|-------------------|
| 2019-10-10 15:22:49 UTC  | 2878.3        | 2             | 
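
A CTE can also be referenced more than once inside the query that defines it. The sketch below shows this with Python's built-in SQLite (3.25 or newer assumed): the `ranked` CTE is joined to itself to compare the second-ranked trip with the top one. The table is a made-up three-row sample; the amounts are borrowed from the earlier result tables, while the dates other than 2019-10-10 are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("2019-10-01", 4012.3), ("2019-10-10", 2878.3), ("2019-10-12", 2438.8)],
)

# The CTE `ranked` is defined once and used twice (as runner_up and as leader).
rows = conn.execute(
    """
    WITH ranked AS (
        SELECT
            pickup,
            total_amount,
            RANK() OVER (ORDER BY total_amount DESC) AS rnk
        FROM trips
    )
    SELECT
        runner_up.pickup,
        runner_up.total_amount,
        leader.total_amount - runner_up.total_amount AS gap_to_leader
    FROM ranked AS runner_up
    JOIN ranked AS leader ON leader.rnk = 1
    WHERE runner_up.rnk = 2
    """
).fetchall()
print(rows)
```

Writing the same logic with subqueries would mean repeating the ranking query twice.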

## dbt models and CTEs

CTEs and window functions are used a lot in Module 4 on dbt. Let's look at an example of how they are applied in a dbt model.

**Example:**

Suppose we start from the FHV dataset and we want to create a dbt model that enriches the data by calculating the trip duration and the 90th percentile.

```sql

WITH trip_duration_calculated AS (

    SELECT
        *,
        timestamp_diff(dropOff_datetime, pickup_datetime, second) as trip_duration

    FROM `fhv_trips`
)

SELECT 

    PUlocationID,
    trip_duration,
    PERCENTILE_CONT(trip_duration, 0.90) OVER (PARTITION BY PUlocationID) AS trip_duration_p90


FROM trip_duration_calculated


```

**Step 1: Understanding the CTE**

The WITH clause creates a CTE named trip_duration_calculated. This CTE acts as a temporary result set that contains all columns from the fhv_trips table. Additionally, it calculates the trip duration in seconds for each ride.

**Step 2: Main Query using the CTE and Window Function**

This query computes the 90th percentile of trip duration for each PUlocationID using a window function:

The PARTITION BY PUlocationID clause ensures that the percentile calculation is performed separately 
for each unique PUlocationID.

The 90th percentile means that 90% of the trips have a duration equal to or below this value.

**Query result looks like this:**

| PUlocationID | trip_duration | trip_duration_p90 |
|-------------|---------------|--------------------|
| 190         | 451           | 2170.0            |
| 190         | 1373          | 2170.0            |
| 190         | 817           | 2170.0            |
| 190         | 589           | 2170.0            |
| 190         | 1648          | 2170.0            |
| 32          | 546           | 1988.0            |
| 32          | 151           | 1988.0            |
| 32          | 1752          | 1988.0            |
| 32          | 2426          | 1988.0            |
| 32          | 888           | 1988.0            |


- For PUlocationID = 190, 90% of trips have a duration ≤ 2170.0 seconds.
- For PUlocationID = 32, 90% of trips have a duration ≤ 1988.0 seconds.


================================================
FILE: 04-analytics-engineering/setup/cloud_setup.md
================================================
# Cloud Setup Guide

This guide walks you through setting up dbt to work with the BigQuery data warehouse you created in Module 3.

<div align="center">

[![dbt](https://img.shields.io/badge/dbt-FF694B?style=for-the-badge&logo=dbt&logoColor=white)](https://www.getdbt.com/)
[![BigQuery](https://img.shields.io/badge/BigQuery-4285F4?style=for-the-badge&logo=google-cloud&logoColor=white)](https://cloud.google.com/bigquery)

</div>

> [!NOTE]
> This guide assumes you've completed [Module 3: Data Warehouse](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/03-data-warehouse) where you:
> - Created a GCP project and enabled the BigQuery API
> - Created a service account with BigQuery permissions
> - Learned how to load data into BigQuery (in the `nytaxi` dataset)
>
> Module 4 uses **different data** than Module 3 (green and yellow taxi data for 2019-2020 instead of yellow-only 2024). You'll load the new data in [Step 1](#load-the-taxi-data) below.

## Step 1: Verify Your BigQuery Setup

Before setting up dbt Cloud, confirm you have the required data and credentials from Module 3.

### Check Your Service Account

You should already have a service account JSON key file from Module 3. Make sure it has these permissions:

- **BigQuery Data Editor**
- **BigQuery Job User**
- **BigQuery User**

If you need to create a new service account or download a new key, follow the instructions below.

### How to Download Service Account JSON Key

If you don't have the JSON key file or need to download a new one:

1. Go to [Google Cloud Console](https://console.cloud.google.com/)

2. Navigate to **IAM & Admin** > **Service Accounts**
   - Or use the search bar and type "Service Accounts"

3. Find your service account in the list
   - It should look like: `service-account-name@project-id.iam.gserviceaccount.com`
   - If you don't have a service account yet, click **+ CREATE SERVICE ACCOUNT** and:
     - Enter a name (e.g., `dbt-bigquery-service-account`)
     - Click **CREATE AND CONTINUE**
     - Add these roles:
       - **BigQuery Admin** (or at minimum: BigQuery Data Editor, BigQuery Job User, BigQuery User)
     - Click **CONTINUE** > **DONE**

4. Click on your service account name to open its details

5. Go to the **KEYS** tab

6. Click **ADD KEY** > **Create new key**

7. Select **JSON** as the key type

8. Click **CREATE**

9. The JSON key file will automatically download to your computer
   - Save it in a secure location
   - **Never commit this file to Git or share it publicly** - it contains credentials to access your GCP resources

The downloaded JSON file will look something like this:

```json
{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "...",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "service-account-name@project-id.iam.gserviceaccount.com",
  ...
}
```

You'll use this JSON file in Step 4 to connect dbt Cloud to BigQuery.

### Load the Taxi Data

This module uses **yellow and green taxi data for 2019-2020**, which is different from the data you loaded in Module 3. Using the same approach you learned in Module 3, load the following data into your BigQuery `nytaxi` dataset:

- **Yellow taxi trip records** for all months of 2019 and 2020
- **Green taxi trip records** for all months of 2019 and 2020

> [!IMPORTANT]
> Download the data from the [DataTalksClub NYC TLC Data repository](https://github.com/DataTalksClub/nyc-tlc-data/releases), **not** from the official NYC TLC website. The official site has been retroactively updated over the years, so its data differs from what the homework answers are based on.

After loading, verify your data:

1. Go to [BigQuery Console](https://console.cloud.google.com/bigquery)
2. In the Explorer panel on the left, expand your project
3. You should see the `nytaxi` dataset
4. Expand the `nytaxi` dataset - you should see tables:
   - `green_tripdata`
   - `yellow_tripdata`

### Note Your Dataset Location

When you created your BigQuery datasets in Module 3, you chose a location (e.g., `US`, `EU`, `us-central1`). You'll need to use the same location when configuring dbt.

**To check your dataset location:**
1. In BigQuery Console, click on the `nytaxi` dataset
2. Look for **Data location** in the dataset details

## Step 2: Sign Up for dbt Platform

dbt Platform is dbt's cloud-based development environment with a web IDE, scheduler, and collaboration features. dbt offers a **free Developer plan**. This should be more than enough to learn dbt and follow the course.

## Step 3: Create a New dbt Project

Now you'll create a fresh dbt project from scratch in dbt Cloud.

1. Navigate to **Account settings** (gear icon in the top-right corner) and click **+ New Project**

2. Enter a project name:
   - Project name: `taxi_rides_ny`

3. Click **Continue**

## Step 4: Configure BigQuery Connection

After clicking **Continue** in the previous step, dbt Cloud will prompt you to configure your data warehouse connection.

> [!TIP]
> If you're not automatically taken to the connection setup, you can also configure it from **Account settings** > **Projects** > **taxi_rides_ny** > **Connection**.

### Upload Service Account JSON

1. For the connection type, select **BigQuery**

2. Click **Upload a Service Account JSON file**

3. Select the service account JSON key file from Module 3

4. dbt will automatically extract:
   - Your GCP project ID
   - Authentication credentials

### Configure Connection Settings

1. **Dataset**: Enter `dbt_prod`
   - This is the base schema name where dbt will create datasets
   - dbt will organize your models into schemas like:
     - `dbt_prod_staging` - for staging models
     - `dbt_prod_intermediate` - for intermediate models
     - `dbt_prod_marts` - for final analytics tables

2. **Location**: Select the same location as your `nytaxi` dataset from Module 3
   - Example: `US`, `EU`, or `us-central1`
   - **This must match your nytaxi dataset location**
   - You can find this under **Optional Settings** or **Advanced Settings** depending on your UI version

3. **Timeout**: `300` seconds

4. **Maximum Bytes Billed**: (optional)
   - Leave blank for unlimited, OR
   - Set a limit like `1000000000` (1 GB) to prevent runaway queries
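
The `dbt_prod_staging`-style names listed under **Dataset** above follow dbt's default schema naming rule: the base dataset, an underscore, and the custom schema configured for a group of models. The Python function below is a sketch of that rule, not dbt's own code. It assumes your project assigns the custom schemas `staging`, `intermediate`, and `marts` to the corresponding model folders, which this guide does not show.

```python
def generate_schema_name(target_schema, custom_schema=None):
    """Sketch of dbt's default rule: with no custom schema, use the target schema;
    otherwise use '<target_schema>_<custom_schema>'."""
    if custom_schema is None:
        return target_schema
    return f"{target_schema}_{custom_schema.strip()}"


# Assumed custom schemas; None stands for models without one.
for folder_schema in (None, "staging", "intermediate", "marts"):
    print(generate_schema_name("dbt_prod", folder_schema))
```

Models without a custom schema land directly in the base dataset, here `dbt_prod`.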

### Test the Connection

1. Click **Test Connection**

2. You should see a success message: "Connection test succeeded"

3. Click **Continue**

## Step 5: Set Up Your Repository

dbt Cloud needs a Git repository to store your project code. You have two options:

- Let dbt Manage the Repository (Recommended for Beginners)
- Connect Your Own GitHub Repository (Recommended for Production)

It doesn't matter which one you prefer for this course.

## Step 6: Verify Your Development Environment

### What Are Environments in dbt?

In dbt, **environments** define different contexts where your data transformations run:

- **Development Environment**: Your personal workspace for building and testing models
  - Uses your personal credentials
  - Creates temporary schemas with your name (e.g., `dbt_<your_name>`)
  - Changes only affect your work, not production
  - Used when working in the dbt Cloud IDE

- **Deployment Environment**: The production workspace where final models run on schedule
  - Uses service account credentials
  - Creates production schemas (e.g., `dbt_prod_staging`, `dbt_prod_marts`)
  - Used by scheduled jobs that keep your data warehouse updated

Think of it like having a draft folder (development) and a published folder (deployment) for your analytics code.

### Check Your Development Environment

dbt Cloud **automatically creates a development environment** when you set up a project. You don't need to create one manually.

To verify it was created:

1. Navigate to **Deploy** > **Environments** in the top navigation bar
2. You should see a **Development** environment already listed

### Customize Your Development Credentials (Optional)

If you need to change how dbt connects to BigQuery during development, or adjust your development schema:

1. Click your profile icon (bottom-left corner) > **Your Profile** > **Credentials**
2. Select the credential linked to your project
3. From here you can update:
   - **Development Schema**: Where your personal development models will be created
     - dbt automatically suggests: `dbt_<your_name>` (e.g., `dbt_john_smith`)
     - This schema is separate from production (`dbt_prod`)
   - **Target Name**: Leave as `dev` (default)

## Step 7: Start Developing

Once your project, connection, and repository are configured, you're ready to start building dbt models.

1. Click **Start developing in the Studio IDE**
   - If you don't see this option, navigate to **Develop** in the top navigation bar

2. dbt Cloud will initialize your workspace (this may take a minute)

3. Once the IDE loads, you'll have a fresh project ready for development!

## Additional Resources

* [BigQuery Documentation](https://cloud.google.com/bigquery/docs)
* [dbt Documentation](https://docs.getdbt.com/docs/cloud/about-cloud/dbt-cloud-features)
* [BigQuery Best Practices](https://cloud.google.com/bigquery/docs/best-practices)
* [NYC Taxi Data Dictionary](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)


================================================
FILE: 04-analytics-engineering/setup/duckdb_troubleshooting.md
================================================
# Troubleshooting DuckDB Out of Memory Errors

If you're getting `Out of Memory` errors while running dbt build commands, don't panic. This is a common issue, especially on machines with limited RAM. This guide explains why it happens and what you can do about it.

## Why does this happen?

DuckDB is an **in-process database**, which means it runs inside your computer's memory (RAM) rather than on a remote server. The NYC taxi dataset we use in this project contains **tens of millions of rows** across 24 months of yellow and green taxi data. When dbt builds models, DuckDB needs to load, transform, and write this data (all using your local RAM).

Some operations are more memory-intensive than others:

| Operation | Why it's expensive | Where it happens |
|---|---|---|
| `QUALIFY` with window functions | Requires sorting and partitioning the entire dataset in memory | `int_trips.sql` (deduplication) |
| `UNION ALL` on large tables | Combines two large datasets into one | `int_trips_unioned.sql` |
| Surrogate key generation (`generate_surrogate_key`) | Computes hashes across the full dataset | `int_trips.sql` |
| `JOIN` on large fact tables | Expands memory footprint when enriching trips with zones | `fct_trips.sql` |
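
To make the first row of the table concrete, here is a minimal Python sketch of what the deduplication in `int_trips.sql` does: for each combination of vendor, pickup time, pickup location, and service type, it keeps the trip with the earliest dropoff. The sample rows and the `deduplicate` helper are hypothetical and only illustrate the idea; DuckDB's actual execution strategy is different and far more optimized.

```python
from datetime import datetime

# Hypothetical sample rows; the real model reads tens of millions of trips.
trips = [
    {"vendor_id": 1, "pickup_datetime": datetime(2019, 1, 1, 0, 5),
     "pickup_location_id": 142, "service_type": "Yellow",
     "dropoff_datetime": datetime(2019, 1, 1, 0, 30)},
    # Same partition key as the row above, but an earlier dropoff.
    {"vendor_id": 1, "pickup_datetime": datetime(2019, 1, 1, 0, 5),
     "pickup_location_id": 142, "service_type": "Yellow",
     "dropoff_datetime": datetime(2019, 1, 1, 0, 20)},
    {"vendor_id": 2, "pickup_datetime": datetime(2019, 1, 1, 0, 7),
     "pickup_location_id": 75, "service_type": "Green",
     "dropoff_datetime": datetime(2019, 1, 1, 0, 15)},
]


def deduplicate(rows):
    """Keep the row with the earliest dropoff for each partition key."""
    best = {}
    for row in rows:
        key = (row["vendor_id"], row["pickup_datetime"],
               row["pickup_location_id"], row["service_type"])
        current = best.get(key)
        if current is None or row["dropoff_datetime"] < current["dropoff_datetime"]:
            best[key] = row
    return list(best.values())


unique_trips = deduplicate(trips)
print(len(unique_trips))
```

Note that `best` has to hold one entry per distinct key until the last row has been read. That is the intuition behind the memory cost: the result cannot be finalized until the whole dataset has been seen.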

## Check your available RAM

Before troubleshooting, check how much RAM your machine has. You can usually find this in your operating system's system settings.

As a rule of thumb:

- **4 GB RAM**: You will very likely hit OOM. Consider using GitHub Codespaces or the Cloud Setup instead.
- **8 GB RAM**: You might hit OOM on some models. Adjust memory settings or use GitHub Codespaces.
- **16+ GB RAM**: You should be fine with default settings.
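
If you prefer to check from a terminal, the sketch below reads the total RAM with Python's standard library and maps it to the rule of thumb above. It assumes `os.sysconf` is available, which is the case on most Linux and macOS systems but not on Windows (the function returns `None` there). The `total_ram_gb` and `recommendation` helpers are hypothetical names, and the handling of values between the listed sizes (for example 6 GB or 12 GB) is an assumption.

```python
import os


def total_ram_gb():
    """Return total physical RAM in GB, or None if it cannot be detected."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows; some platforms lack these names.
        return None
    return page_size * page_count / 1024**3


def recommendation(ram_gb):
    """Map a RAM size to the rule of thumb from this guide."""
    # Machines often report slightly less than their nominal size, so round first.
    rounded = round(ram_gb)
    if rounded <= 4:
        return "very likely OOM: use GitHub Codespaces or the Cloud Setup"
    if rounded < 16:
        return "possible OOM: adjust memory settings or use GitHub Codespaces"
    return "default settings should be fine"


detected = total_ram_gb()
if detected is None:
    print("Could not detect RAM; check your system settings instead.")
else:
    print(f"{detected:.1f} GB RAM -> {recommendation(detected)}")
```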

## Option A: Use GitHub Codespaces or Cloud Setup

If your local machine doesn't have enough RAM, the easiest solution is to avoid running DuckDB locally altogether.

### GitHub Codespaces

Run the project in a **GitHub Codespace**. The free tier includes machines with **4 cores / 8 GB RAM**, and **8 cores / 16 GB RAM** is available within the free monthly quota for personal accounts. A 16 GB machine can comfortably run this entire project without any of the workarounds below.

To get started:

1. Go to the [course repository on GitHub](https://github.com/DataTalksClub/data-engineering-zoomcamp).
2. Click **Code** > **Codespaces** > **Create codespace on main**.
3. Select the **8-core** machine type for the best experience.

Codespaces come with Python, pip, and git pre-installed, so setup is minimal.

### Cloud Setup (BigQuery)

Alternatively, use the **Cloud Setup (BigQuery)** path. BigQuery runs on Google's servers, so your local RAM doesn't matter. See the [Cloud Setup Guide](cloud_setup.md).

## Option B: Make it work on your local machine

If you prefer to run the project locally, follow the steps below to reduce memory usage.

### Step 1: Adjust DuckDB memory settings in `profiles.yml`

Your `~/.dbt/profiles.yml` controls how much memory DuckDB can use. Here's what you can tune:

- **`memory_limit`**: By default, DuckDB will try to use up to 80% of your system's RAM. That sounds reasonable, but your operating system, browser, IDE, and other apps also need memory. If DuckDB claims too much, the OS may kill the process — that's your OOM error. Setting an explicit limit (roughly **50% of your total RAM**) leaves enough room for everything else. So if you have 8 GB, try `'4GB'`.
- **`threads`**: This controls how many **dbt models** are built in parallel. Lowering `threads` to `1` means fewer concurrent models, which reduces overall memory pressure.
- **`preserve_insertion_order: false`**: Tells DuckDB it doesn't need to maintain row order, which saves memory.
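
As a rough illustration of the 50% guideline, the following sketch computes the values you would put under your target in `profiles.yml`. The `suggest_duckdb_settings` helper is hypothetical, and the 1 GB floor is an assumption; treat the result as a starting point rather than a guaranteed fix.

```python
def suggest_duckdb_settings(total_ram_gb):
    """Suggest conservative dbt/DuckDB settings for a machine with the given RAM."""
    # Roughly half of total RAM, with an assumed floor of 1 GB.
    memory_gb = max(1, int(total_ram_gb // 2))
    return {
        "threads": 1,
        "settings": {
            "memory_limit": f"{memory_gb}GB",
            "preserve_insertion_order": False,
        },
    }


print(suggest_duckdb_settings(8))
```

The keys mirror the `threads` and `settings` entries of a dbt-duckdb output, so the values can be copied into the matching lines of your profile by hand.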

### Step 2: Use `dbt retry` after a failure

If your `dbt build` fails partway through, you **don't need to rebuild everything from scratch**. Use:

```bash
dbt retry
```

This command picks up where the last run left off, only running the models that failed or were skipped. This is very useful when an OOM error kills a single model — fix the issue, then retry without re-running the models that already succeeded.

### Step 3: Build models selectively with `--select`

Instead of building the entire project at once, build one model at a time to reduce peak memory usage:

```bash
dbt build --select stg_yellow_tripdata --target prod
dbt build --select stg_green_tripdata --target prod
dbt build --select int_trips_unioned --target prod
dbt build --select int_trips --target prod
dbt build --select fct_trips --target prod
```

This way, DuckDB only needs to handle one model at a time.
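
If you would rather not type each command, a small wrapper can run them in order and stop at the first failure. This is a minimal sketch that assumes `dbt` is on your `PATH` and that you run it from inside `taxi_rides_ny/`; the `build_command` and `build_one_by_one` helpers are hypothetical names, not part of the project.

```python
import subprocess

# Same models and order as the commands above.
MODELS = [
    "stg_yellow_tripdata",
    "stg_green_tripdata",
    "int_trips_unioned",
    "int_trips",
    "fct_trips",
]


def build_command(model, target="prod"):
    """Return the argument list for building a single model."""
    return ["dbt", "build", "--select", model, "--target", target]


def build_one_by_one(models=MODELS, target="prod"):
    """Build models sequentially; stop at the first failure."""
    for model in models:
        print(f"Building {model}...")
        result = subprocess.run(build_command(model, target))
        if result.returncode != 0:
            print(f"{model} failed; fix the issue before building later models.")
            return False
    return True
```

Call `build_one_by_one()` from a Python shell or add it to the bottom of a script. Stopping early matters because later models depend on earlier ones.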

### Step 4: Leverage incremental models

The `fct_trips` model in this project is already configured as **incremental**. This means that after the first full build, subsequent runs only process **new records** instead of reprocessing the entire dataset.

If your first full build fails due to OOM but some models succeeded, use `dbt retry` (Step 2). Once `fct_trips` is built for the first time, future runs will be much lighter on memory.
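
The filter in `fct_trips.sql` only selects trips whose `pickup_datetime` is later than the latest one already in the table. The sketch below mimics that high-watermark logic in plain Python to show why later runs touch far fewer rows. The `select_new_trips` helper and the sample rows are hypothetical.

```python
from datetime import datetime


def select_new_trips(source_rows, existing_rows):
    """Return only the source rows newer than anything already loaded."""
    if not existing_rows:
        # First run: nothing is loaded yet, so everything is processed.
        return list(source_rows)
    high_watermark = max(row["pickup_datetime"] for row in existing_rows)
    return [row for row in source_rows if row["pickup_datetime"] > high_watermark]


# Hypothetical data: two trips already loaded, one new trip in the source.
existing = [
    {"trip_id": "a", "pickup_datetime": datetime(2020, 12, 30, 8, 0)},
    {"trip_id": "b", "pickup_datetime": datetime(2020, 12, 31, 9, 0)},
]
source = existing + [
    {"trip_id": "c", "pickup_datetime": datetime(2021, 1, 1, 10, 0)},
]

first_run = select_new_trips(source, [])
later_run = select_new_trips(source, existing)
print(len(first_run), len(later_run))
```

With the static 2019-2020 dataset used in this course, a rerun after a successful full build typically finds few or no new rows, which is why it needs much less memory.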

## DuckDB performance best practices

These tips come from [DuckDB's official performance guide](https://duckdb.org/docs/guides/performance/environment.html):

1. **Close other applications**: Browsers, IDEs, and other apps compete for RAM. Close what you don't need before running `dbt build`.
2. **Use an SSD**: DuckDB spills to disk when it runs out of memory. An SSD makes this spill-to-disk process much faster than an HDD.
3. **Avoid running inside Docker** (if possible): Docker containers have memory limits that may be lower than your system's total RAM. If you must use Docker, increase the container's memory limit.

## Still stuck?

If you've tried everything above and still can't build the project, ask for help in the [course Slack channel](https://datatalks-club.slack.com/). Include your RAM, OS, and the exact error message.


================================================
FILE: 04-analytics-engineering/setup/local_setup.md
================================================
# Local Setup Guide

This guide walks you through setting up a local analytics engineering environment using DuckDB and dbt.

<div align="center">

[![dbt Core](https://img.shields.io/badge/dbt-FF694B?style=for-the-badge&logo=dbt&logoColor=white)](https://www.getdbt.com/)
[![DuckDB](https://img.shields.io/badge/DuckDB-FFF000?style=for-the-badge&logo=duckdb&logoColor=black)](https://duckdb.org/)

</div>

>[!NOTE]
>*This guide will explain how to do the setup manually. If you want an additional challenge, try to run this setup using Docker Compose or a Python virtual environment.*

**Important**: All dbt commands must be run from inside the `taxi_rides_ny/` directory. The setup steps below will guide you through:

1. Installing the necessary tools
2. Configuring your connection to DuckDB
3. Loading the NYC taxi data
4. Verifying everything works

## Step 1: Install DuckDB

DuckDB is a fast, in-process SQL database that works great for local analytics workloads. To install DuckDB, follow the instructions on the [official site](https://duckdb.org/docs/installation) for your specific operating system.

> [!TIP]
> *You can install DuckDB in two ways. You can install the CLI or install the client API for your favorite programming language (in the case of Python, you can use `pip install duckdb`). I personally prefer installing the CLI, but either way is fine.*

## Step 2: Install dbt

```bash
pip install dbt-duckdb
```

This installs:

* `dbt-core`: The core dbt framework
* `dbt-duckdb`: The DuckDB adapter for dbt

## Step 3: Configure dbt Profile

Since this repository already contains a dbt project (`taxi_rides_ny/`), you don't need to run `dbt init`. Instead, you need to configure your dbt profile to connect to DuckDB.

### Create or Update `~/.dbt/profiles.yml`

The dbt profile tells dbt how to connect to your database. Create or update the file `~/.dbt/profiles.yml` with the following content:

```yaml
taxi_rides_ny:
  target: dev
  outputs:
    # DuckDB Development profile
    dev:
      type: duckdb
      path: taxi_rides_ny.duckdb
      schema: dev
      threads: 1
      extensions:
        - parquet
      settings:
        memory_limit: '2GB'
        preserve_insertion_order: false

    # DuckDB Production profile
    prod:
      type: duckdb
      path: taxi_rides_ny.duckdb
      schema: prod
      threads: 1
      extensions:
        - parquet
      settings:
        memory_limit: '2GB'
        preserve_insertion_order: false

# Troubleshooting:
# - If you have less than 4GB RAM, try setting memory_limit to '1GB'
# - If you have 16GB+ RAM, you can increase to '4GB' for faster builds
# - Expected build time: 5-10 minutes on most systems
```

## Step 4: Download and Ingest Data

Now that your dbt profile is configured, let's load the taxi data into DuckDB. Navigate to the dbt project directory and run the following ingestion script:

```python
import duckdb
import requests
from pathlib import Path

BASE_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"

def download_and_convert_files(taxi_type):
    data_dir = Path("data") / taxi_type
    data_dir.mkdir(exist_ok=True, parents=True)

    for year in [2019, 2020]:
        for month in range(1, 13):
            parquet_filename = f"{taxi_type}_tripdata_{year}-{month:02d}.parquet"
            parquet_filepath = data_dir / parquet_filename

            if parquet_filepath.exists():
                print(f"Skipping {parquet_filename} (already exists)")
                continue

            # Download CSV.gz file
            csv_gz_filename = f"{taxi_type}_tripdata_{year}-{month:02d}.csv.gz"
            csv_gz_filepath = data_dir / csv_gz_filename

            response = requests.get(f"{BASE_URL}/{taxi_type}/{csv_gz_filename}", stream=True)
            response.raise_for_status()

            with open(csv_gz_filepath, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)

            print(f"Converting {csv_gz_filename} to Parquet...")
            con = duckdb.connect()
            con.execute(f"""
                COPY (SELECT * FROM read_csv_auto('{csv_gz_filepath}'))
                TO '{parquet_filepath}' (FORMAT PARQUET)
            """)
            con.close()

            # Remove the CSV.gz file to save space
            csv_gz_filepath.unlink()
            print(f"Completed {parquet_filename}")

def update_gitignore():
    gitignore_path = Path(".gitignore")

    # Read existing content or start with empty string
    content = gitignore_path.read_text() if gitignore_path.exists() else ""

    # Add data/ if not already present
    if 'data/' not in content:
        with open(gitignore_path, 'a') as f:
            f.write('\n# Data directory\ndata/\n' if content else '# Data directory\ndata/\n')

if __name__ == "__main__":
    # Update .gitignore to exclude data directory
    update_gitignore()

    for taxi_type in ["yellow", "green"]:
        download_and_convert_files(taxi_type)

    con = duckdb.connect("taxi_rides_ny.duckdb")
    con.execute("CREATE SCHEMA IF NOT EXISTS prod")

    for taxi_type in ["yellow", "green"]:
        con.execute(f"""
            CREATE OR REPLACE TABLE prod.{taxi_type}_tripdata AS
            SELECT * FROM read_parquet('data/{taxi_type}/*.parquet', union_by_name=true)
        """)

    con.close()
```

This script downloads yellow and green taxi data from 2019-2020, creates the `prod` schema, and loads the raw data into DuckDB. The download may take several minutes depending on your internet connection.
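
If the download is interrupted, it can help to see which Parquet files are still missing before rerunning the script. The following sketch uses the same naming scheme and `data/<taxi_type>/` layout as the ingestion script; the `expected_parquet_files` and `missing_parquet_files` helpers are hypothetical and are not part of the repository.

```python
from pathlib import Path


def expected_parquet_files(taxi_type):
    """List the 24 monthly file names the ingestion script produces per taxi type."""
    return [
        f"{taxi_type}_tripdata_{year}-{month:02d}.parquet"
        for year in (2019, 2020)
        for month in range(1, 13)
    ]


def missing_parquet_files(data_root="data"):
    """Return the paths of expected Parquet files that do not exist yet."""
    missing = []
    for taxi_type in ("yellow", "green"):
        for name in expected_parquet_files(taxi_type):
            path = Path(data_root) / taxi_type / name
            if not path.exists():
                missing.append(path)
    return missing


still_missing = missing_parquet_files()
print(f"{len(still_missing)} of 48 files missing")
```

Because the ingestion script skips files that already exist, rerunning it only downloads the files reported here.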

## Step 5: Test the dbt Connection

Verify dbt can connect to your DuckDB database:

```bash
dbt debug
```

## Step 6: Install dbt Power User Extension (VS Code Users)

If you're using Visual Studio Code, install the **dbt Power User** extension to enhance your dbt development experience.

### What is dbt Power User?

dbt Power User is a VS Code extension that provides:

* SQL syntax highlighting and formatting for dbt models
* Inline column-level lineage visualization
* Auto-completion for dbt models, sources, and macros
* Interactive documentation preview
* Model compilation and execution directly from the editor

### Why Not Use the Official dbt Extension?

dbt Labs released an official VS Code extension called [dbt Extension](https://marketplace.visualstudio.com/items?itemName=dbtLabsInc.dbt) powered by the new dbt Fusion engine. However, this extension **requires dbt Fusion** and does not support dbt Core.

Since we're using **dbt Core** with DuckDB for local development, we need the community-maintained **dbt Power User by AltimateAI** extension instead. This extension:

* Works seamlessly with dbt Core (not just dbt Cloud)
* Supports all dbt adapters, including DuckDB
* Is actively maintained and open source
* Provides a rich feature set for local development

### Installation

1. Open VS Code
2. Go to Extensions (Ctrl+Shift+X / Cmd+Shift+X)
3. Search for "dbt Power User"
4. Install **dbt Power User by AltimateAI** (not the dbt Labs version)

Alternatively, install it from the [VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user).

> [!NOTE]
> At this point, your local dbt environment is fully configured and ready to use. The next steps (running models, tests, and building documentation) will be covered in the tutorial videos.

## Additional Resources

* [DuckDB Documentation](https://duckdb.org/docs/)
* [dbt Documentation](https://docs.getdbt.com/)
* [dbt-duckdb Adapter](https://github.com/duckdb/dbt-duckdb)
* [NYC Taxi Data Dictionary](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/.gitignore
================================================
# you shouldn't commit these into source control
# these are the default directory names, adjust/add to fit your needs
target/
dbt_packages/
logs/
profiles.yml
.user.yml

# Data files for DuckDB
data/green_tripdata/
data/yellow_tripdata/
data/
*.duckdb
*.duckdb.wal
.duckdb_temp/

# Parquet data files
*.parquet

# Python artifacts
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Virtual environments
venv/
env/
ENV/
env.bak/
venv.bak/
.venv/

# PyCharm
.idea/

# VS Code
.vscode/

# Jupyter Notebook
.ipynb_checkpoints
*.ipynb

# pyenv
.python-version

# pytest
.pytest_cache/
.coverage
htmlcov/

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# GCP credentials and service account keys
*-key.json
*-keys.json
*key*.json
*credential*.json
*service-account*.json
*serviceaccount*.json
service-account.json
serviceaccount.json
gcp-*.json
google-*.json

# Environment variables
.env
.env.local
.env.*.local
*.env
dbt_internal_packages/


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/dbt_project.yml
================================================
name: 'taxi_rides_ny'
version: '1.0.0'

# Require a specific dbt version for reproducibility
require-dbt-version: [">=1.7.0", "<3.0.0"]

# This setting configures which "profile" dbt uses for this project.
profile: 'taxi_rides_ny'

# These configurations specify where dbt should look for different types of files.
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets:
  - "target"
  - "dbt_packages"

# Project-level variables
vars:
  # Date range for dev environment sampling
  dev_start_date: '2019-01-01'
  dev_end_date: '2019-02-01'

# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
models:
  taxi_rides_ny:
    staging:
      +materialized: view
    intermediate:
      +materialized: table
    marts:
      +materialized: table
flags:
  require_generic_test_arguments_property: true

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/macros/get_trip_duration_minutes.sql
================================================
{#
    Calculate trip duration in minutes from pickup and dropoff timestamps.

    Uses dbt's built-in cross-database datediff macro.
    This works seamlessly across DuckDB, BigQuery, Snowflake, Redshift, PostgreSQL, etc.

    Returns: Trip duration as a numeric value in minutes
#}

{% macro get_trip_duration_minutes(pickup_datetime, dropoff_datetime) %}
    {{ dbt.datediff(pickup_datetime, dropoff_datetime, 'minute') }}
{% endmacro %}


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/macros/get_vendor_data.sql
================================================
{#
    Macro to generate vendor_name column using Jinja dictionary.

    This approach works seamlessly across BigQuery, DuckDB, Snowflake, etc.
    by generating a CASE statement at compile time.

    Usage: {{ get_vendor_data('vendor_id') }}
    Returns: SQL CASE expression that maps vendor_id to vendor_name
#}

{% macro get_vendor_data(vendor_id_column) %}

{% set vendors = {
    1: 'Creative Mobile Technologies',
    2: 'VeriFone Inc.',
    4: 'Unknown/Other'
} %}

case {{ vendor_id_column }}
    {% for vendor_id, vendor_name in vendors.items() %}
    when {{ vendor_id }} then '{{ vendor_name }}'
    {% endfor %}
end

{% endmacro %}


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/macros/macros_properties.yml
================================================
macros:
  - name: get_trip_duration_minutes
    description: >
      Calculates trip duration in minutes from pickup and dropoff timestamps.
      This macro is cross-database compatible, supporting both DuckDB and BigQuery.
      Returns a numeric value representing the duration in minutes.
    arguments:
      - name: pickup_datetime
        type: timestamp
        description: The pickup timestamp
      - name: dropoff_datetime
        type: timestamp
        description: The dropoff timestamp

  - name: get_vendor_data
    description: >
      Generates a CASE statement that maps vendor_id to vendor_name.
      This macro is cross-database compatible and generates SQL at compile time using a Jinja dictionary.
      Supports vendor IDs: 1 (Creative Mobile Technologies), 2 (VeriFone Inc.), 4 (Unknown/Other).
    arguments:
      - name: vendor_id_column
        type: integer
        description: The column name containing the vendor ID


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/macros/safe_cast.sql
================================================
{% macro safe_cast(column, data_type) %}
    {% if target.type == 'bigquery' %}
        safe_cast({{ column }} as {{ data_type }})
    {% else %}
        cast({{ column }} as {{ data_type }})
    {% endif %}
{% endmacro %}


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/intermediate/int_trips.sql
================================================
-- Enrich and deduplicate trip data
-- Demonstrates enrichment and surrogate key generation
-- Note: Data quality analysis available in analyses/trips_data_quality.sql

with unioned as (
    select * from {{ ref('int_trips_unioned') }}
),

payment_types as (
    select * from {{ ref('payment_type_lookup') }}
),

cleaned_and_enriched as (
    select
        -- Generate unique trip identifier (surrogate key pattern)
        {{ dbt_utils.generate_surrogate_key(['u.vendor_id', 'u.pickup_datetime', 'u.pickup_location_id', 'u.service_type']) }} as trip_id,

        -- Identifiers
        u.vendor_id,
        u.service_type,
        u.rate_code_id,

        -- Location IDs
        u.pickup_location_id,
        u.dropoff_location_id,

        -- Timestamps
        u.pickup_datetime,
        u.dropoff_datetime,

        -- Trip details
        u.store_and_fwd_flag,
        u.passenger_count,
        u.trip_distance,
        u.trip_type,

        -- Payment breakdown
        u.fare_amount,
        u.extra,
        u.mta_tax,
        u.tip_amount,
        u.tolls_amount,
        u.ehail_fee,
        u.improvement_surcharge,
        u.total_amount,

        -- Enrich with payment type description
        coalesce(u.payment_type, 0) as payment_type,
        coalesce(pt.description, 'Unknown') as payment_type_description

    from unioned u
    left join payment_types pt
        on coalesce(u.payment_type, 0) = pt.payment_type
)

select * from cleaned_and_enriched

-- Deduplicate: if multiple trips match (same vendor, second, location, service), keep first
qualify row_number() over(
    partition by vendor_id, pickup_datetime, pickup_location_id, service_type
    order by dropoff_datetime
) = 1


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/intermediate/int_trips_unioned.sql
================================================
-- Union green and yellow taxi data into a single dataset
-- Demonstrates how to combine data from multiple sources with slightly different schemas

with green_trips as (
    select
        vendor_id,
        rate_code_id,
        pickup_location_id,
        dropoff_location_id,
        pickup_datetime,
        dropoff_datetime,
        store_and_fwd_flag,
        passenger_count,
        trip_distance,
        trip_type,
        fare_amount,
        extra,
        mta_tax,
        tip_amount,
        tolls_amount,
        ehail_fee,
        improvement_surcharge,
        total_amount,
        payment_type,
        'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_trips as (
    select
        vendor_id,
        rate_code_id,
        pickup_location_id,
        dropoff_location_id,
        pickup_datetime,
        dropoff_datetime,
        store_and_fwd_flag,
        passenger_count,
        trip_distance,
        cast(1 as integer) as trip_type,  -- Yellow taxis only do street-hail (code 1)
        fare_amount,
        extra,
        mta_tax,
        tip_amount,
        tolls_amount,
        cast(0 as numeric) as ehail_fee,  -- Yellow taxis don't have ehail_fee
        improvement_surcharge,
        total_amount,
        payment_type,
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

select * from green_trips
union all
select * from yellow_trips


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/intermediate/schema.yml
================================================
models:
  - name: int_trips_unioned
    description: Union of green and yellow taxi trip data with normalized schema
    columns:
      - name: vendor_id
        description: Taxi technology provider ID
      - name: rate_code_id
        description: Rate code at end of trip (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)
      - name: pickup_location_id
        description: TLC Taxi Zone where trip started
      - name: dropoff_location_id
        description: TLC Taxi Zone where trip ended
      - name: pickup_datetime
        description: Timestamp when meter was engaged
      - name: dropoff_datetime
        description: Timestamp when meter was disengaged
      - name: store_and_fwd_flag
        description: Trip record stored in vehicle memory (Y/N)
      - name: passenger_count
        description: Number of passengers in the vehicle
      - name: trip_distance
        description: Trip distance in miles
      - name: trip_type
        description: Trip type (1=Street-hail, 2=Dispatch)
      - name: fare_amount
        description: Time and distance fare
      - name: extra
        description: Miscellaneous extras and surcharges
      - name: mta_tax
        description: MTA tax
      - name: tip_amount
        description: Tip amount (credit card only)
      - name: tolls_amount
        description: Total tolls paid
      - name: ehail_fee
        description: E-hail service fee
      - name: improvement_surcharge
        description: Improvement surcharge
      - name: total_amount
        description: Total amount charged to passenger
      - name: payment_type
        description: Payment method code
      - name: service_type
        description: Type of taxi service (Green or Yellow)

  - name: int_trips
    description: Cleaned, enriched, and deduplicated trip data ready for marts
    columns:
      - name: trip_id
        description: Unique trip identifier (surrogate key)
        data_tests:
          - unique
          - not_null
      - name: vendor_id
        description: Taxi technology provider ID
        data_tests:
          - not_null
      - name: service_type
        description: Type of taxi service (Green or Yellow)
        data_tests:
          - not_null
          - accepted_values:
              arguments:
                values: ['Green', 'Yellow']
      - name: rate_code_id
        description: Rate code at end of trip
      - name: pickup_location_id
        description: TLC Taxi Zone where trip started
      - name: dropoff_location_id
        description: TLC Taxi Zone where trip ended
      - name: pickup_datetime
        description: Timestamp when meter was engaged
        data_tests:
          - not_null
      - name: dropoff_datetime
        description: Timestamp when meter was disengaged
      - name: store_and_fwd_flag
        description: Trip record stored in vehicle memory (Y/N)
      - name: passenger_count
        description: Number of passengers in the vehicle
      - name: trip_distance
        description: Trip distance in miles
      - name: trip_type
        description: Trip type (1=Street-hail, 2=Dispatch)
      - name: fare_amount
        description: Time and distance fare
      - name: extra
        description: Miscellaneous extras and surcharges
      - name: mta_tax
        description: MTA tax
      - name: tip_amount
        description: Tip amount (credit card only)
      - name: tolls_amount
        description: Total tolls paid
      - name: ehail_fee
        description: E-hail service fee
      - name: improvement_surcharge
        description: Improvement surcharge
      - name: total_amount
        description: Total amount charged to passenger
        data_tests:
          - not_null
      - name: payment_type
        description: Payment method code
      - name: payment_type_description
        description: Human-readable payment method description


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/dim_vendors.sql
================================================
-- Dimension table for taxi technology vendors
-- Small static dimension defining vendor codes and their company names

with trips as (
    select * from {{ ref('fct_trips') }}
),

vendors as (
    select distinct
        vendor_id,
        {{ get_vendor_data('vendor_id') }} as vendor_name
    from trips
)

select * from vendors


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/dim_zones.sql
================================================
-- Dimension table for NYC taxi zones
-- This is a simple pass-through from the seed file, but having it as a model
-- allows for future enhancements (e.g., adding calculated fields, filtering)

select
    locationid as location_id,
    borough,
    zone,
    service_zone
from {{ ref('taxi_zone_lookup') }}

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/fct_trips.sql
================================================
{{
  config(
    materialized='incremental',
    unique_key='trip_id',
    incremental_strategy='merge',
    on_schema_change='append_new_columns'
  )
}}

-- Fact table containing all taxi trips enriched with zone information
-- This is a classic star schema design: fact table (trips) joined to dimension table (zones)
-- Materialized incrementally to handle large datasets efficiently

select
    -- Trip identifiers
    trips.trip_id,
    trips.vendor_id,
    trips.service_type,
    trips.rate_code_id,

    -- Location details (enriched with human-readable zone names from dimension)
    trips.pickup_location_id,
    pz.borough as pickup_borough,
    pz.zone as pickup_zone,
    trips.dropoff_location_id,
    dz.borough as dropoff_borough,
    dz.zone as dropoff_zone,

    -- Trip timing
    trips.pickup_datetime,
    trips.dropoff_datetime,
    trips.store_and_fwd_flag,

    -- Trip metrics
    trips.passenger_count,
    trips.trip_distance,
    trips.trip_type,
    {{ get_trip_duration_minutes('trips.pickup_datetime', 'trips.dropoff_datetime') }} as trip_duration_minutes,

    -- Payment breakdown
    trips.fare_amount,
    trips.extra,
    trips.mta_tax,
    trips.tip_amount,
    trips.tolls_amount,
    trips.ehail_fee,
    trips.improvement_surcharge,
    trips.total_amount,
    trips.payment_type,
    trips.payment_type_description

from {{ ref('int_trips') }} as trips
-- LEFT JOIN preserves all trips even if zone information is missing or unknown
left join {{ ref('dim_zones') }} as pz
    on trips.pickup_location_id = pz.location_id
left join {{ ref('dim_zones') }} as dz
    on trips.dropoff_location_id = dz.location_id

{% if is_incremental() %}
  -- Only process new trips based on pickup datetime
  where trips.pickup_datetime > (select max(pickup_datetime) from {{ this }})
{% endif %}


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/reporting/fct_monthly_zone_revenue.sql
================================================
-- Data mart for monthly revenue analysis by pickup zone and service type
-- This aggregation is optimized for business reporting and dashboards
-- Enables analysis of revenue trends across different zones and taxi types

select
    -- Grouping dimensions
    coalesce(pickup_zone, 'Unknown Zone') as pickup_zone,
    {% if target.type == 'bigquery' %}cast(date_trunc(pickup_datetime, month) as date)
    {% elif target.type == 'duckdb' %}date_trunc('month', pickup_datetime)
    {% endif %} as revenue_month,
    service_type,

    -- Revenue breakdown (summed by zone, month, and service type)
    sum(fare_amount) as revenue_monthly_fare,
    sum(extra) as revenue_monthly_extra,
    sum(mta_tax) as revenue_monthly_mta_tax,
    sum(tip_amount) as revenue_monthly_tip_amount,
    sum(tolls_amount) as revenue_monthly_tolls_amount,
    sum(ehail_fee) as revenue_monthly_ehail_fee,
    sum(improvement_surcharge) as revenue_monthly_improvement_surcharge,
    sum(total_amount) as revenue_monthly_total_amount,

    -- Additional metrics for operational analysis
    count(trip_id) as total_monthly_trips,
    avg(passenger_count) as avg_monthly_passenger_count,
    avg(trip_distance) as avg_monthly_trip_distance

from {{ ref('fct_trips') }}
group by pickup_zone, revenue_month, service_type

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/reporting/schema.yml
================================================
models:
  - name: fct_monthly_zone_revenue
    description: Monthly revenue aggregation by pickup zone and service type for business reporting
    data_tests:
      - dbt_utils.unique_combination_of_columns:
          arguments:
            combination_of_columns:
              - pickup_zone
              - revenue_month
              - service_type
    columns:
      - name: pickup_zone
        description: Pickup zone where revenue was generated
        data_tests:
          - not_null
      - name: revenue_month
        description: Month for revenue aggregation
        data_tests:
          - not_null
      - name: service_type
        description: Service type (Green or Yellow)
        data_tests:
          - not_null
          - accepted_values:
              arguments:
                values: ['Green', 'Yellow']
      - name: revenue_monthly_total_amount
        description: Monthly sum of total amounts charged
        data_tests:
          - not_null
      - name: total_monthly_trips
        description: Count of trips in the month
        data_tests:
          - not_null


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/schema.yml
================================================
models:
  - name: dim_zones
    description: Taxi zone dimension table with location details
    columns:
      - name: location_id
        description: Unique identifier for each taxi zone
        data_tests:
          - unique
          - not_null
      - name: borough
        description: NYC borough name
      - name: zone
        description: Specific zone name within the borough
      - name: service_zone
        description: Service zone classification

  - name: dim_vendors
    description: Taxi technology vendor dimension table
    columns:
      - name: vendor_id
        description: Unique vendor identifier
        data_tests:
          - unique
          - not_null
      - name: vendor_name
        description: Company name of the vendor

  - name: fct_trips
    description: Fact table with all taxi trips including trip and payment details
    config:
      contract:
        enforced: true
    columns:
      - name: trip_id
        description: Unique trip identifier
        data_type: string
        data_tests:
          - unique
          - not_null
      - name: vendor_id
        description: Taxi technology provider
        data_type: integer
        data_tests:
          - not_null
      - name: service_type
        description: Type of taxi service (Green or Yellow)
        data_type: string
        data_tests:
          - accepted_values:
              arguments:
                values: ['Green', 'Yellow']
          - not_null
      - name: rate_code_id
        description: Final rate code
        data_type: integer
      - name: pickup_location_id
        description: TLC Taxi Zone where trip started
        data_type: integer
        data_tests:
          - relationships:
              arguments:
                to: ref('dim_zones')
                field: location_id
      - name: pickup_borough
        description: NYC borough where trip started
        data_type: string
      - name: pickup_zone
        description: Specific zone where trip started
        data_type: string
      - name: dropoff_location_id
        description: TLC Taxi Zone where trip ended
        data_type: integer
        data_tests:
          - relationships:
              arguments:
                to: ref('dim_zones')
                field: location_id
      - name: dropoff_borough
        description: NYC borough where trip ended
        data_type: string
      - name: dropoff_zone
        description: Specific zone where trip ended
        data_type: string
      - name: pickup_datetime
        description: Timestamp when meter was engaged
        data_type: timestamp
        data_tests:
          - not_null
      - name: dropoff_datetime
        description: Timestamp when meter was disengaged
        data_type: timestamp
      - name: store_and_fwd_flag
        description: Trip record stored in vehicle memory (Y/N)
        data_type: string
      - name: passenger_count
        description: Number of passengers
        data_type: integer
      - name: trip_distance
        description: Trip distance in miles
        data_type: numeric
      - name: trip_type
        description: Trip type (1=Street-hail, 2=Dispatch)
        data_type: integer
      - name: trip_duration_minutes
        description: Trip duration in minutes (calculated using cross-database macro)
        data_type: bigint
      - name: fare_amount
        description: Time and distance fare
        data_type: numeric
      - name: extra
        description: Miscellaneous extras and surcharges
        data_type: numeric
      - name: mta_tax
        description: MTA tax
        data_type: numeric
      - name: tip_amount
        description: Tip amount (credit card only)
        data_type: numeric
      - name: tolls_amount
        description: Total tolls paid
        data_type: numeric
      - name: ehail_fee
        description: E-hail service fee
        data_type: numeric
      - name: improvement_surcharge
        description: Improvement surcharge
        data_type: numeric
      - name: total_amount
        description: Total amount charged
        data_type: numeric
        data_tests:
          - not_null
      - name: payment_type
        description: Payment method code
        data_type: integer
      - name: payment_type_description
        description: Human-readable payment method description
        data_type: string

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/staging/schema.yml
================================================
models:
  - name: stg_green_tripdata
    description: >
      Staging model for green taxi trip data. This model standardizes column names
      and data types from the raw green_tripdata source, providing a clean foundation
      for downstream transformations.
    columns:
      - name: vendor_id
        description: Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)
        data_tests:
          - not_null
      - name: rate_code_id
        description: Rate code at end of trip (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)
      - name: pickup_location_id
        description: TLC Taxi Zone where the meter was engaged
      - name: dropoff_location_id
        description: TLC Taxi Zone where the meter was disengaged
      - name: pickup_datetime
        description: Date and time when the meter was engaged
        data_tests:
          - not_null
      - name: dropoff_datetime
        description: Date and time when the meter was disengaged
      - name: store_and_fwd_flag
        description: Flag indicating if trip record was held in vehicle memory (Y/N)
      - name: passenger_count
        description: Number of passengers in the vehicle (driver-entered value)
      - name: trip_distance
        description: Trip distance in miles reported by the taximeter
      - name: trip_type
        description: Code for trip type (1=Street-hail, 2=Dispatch)
      - name: fare_amount
        description: Time and distance fare calculated by the meter
      - name: extra
        description: Miscellaneous extras and surcharges (rush hour, overnight)
      - name: mta_tax
        description: $0.50 MTA tax automatically triggered based on meter rate
      - name: tip_amount
        description: Tip amount (credit card tips only, cash tips not included)
      - name: tolls_amount
        description: Total amount of all tolls paid during the trip
      - name: ehail_fee
        description: E-hail service fee
      - name: improvement_surcharge
        description: Improvement surcharge assessed on hailed trips
      - name: total_amount
        description: Total amount charged to passengers (does not include cash tips)
      - name: payment_type
        description: Payment method code (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)

  - name: stg_yellow_tripdata
    description: >
      Staging model for yellow taxi trip data. This model standardizes column names
      and data types from the raw yellow_tripdata source, providing a clean foundation
      for downstream transformations.
    columns:
      - name: vendor_id
        description: Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)
        data_tests:
          - not_null
      - name: rate_code_id
        description: Rate code at end of trip (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)
      - name: pickup_location_id
        description: TLC Taxi Zone where the meter was engaged
      - name: dropoff_location_id
        description: TLC Taxi Zone where the meter was disengaged
      - name: pickup_datetime
        description: Date and time when the meter was engaged
        data_tests:
          - not_null
      - name: dropoff_datetime
        description: Date and time when the meter was disengaged
      - name: store_and_fwd_flag
        description: Flag indicating if trip record was held in vehicle memory (Y/N)
      - name: passenger_count
        description: Number of passengers in the vehicle (driver-entered value)
      - name: trip_distance
        description: Trip distance in miles reported by the taximeter
      - name: fare_amount
        description: Time and distance fare calculated by the meter
      - name: extra
        description: Miscellaneous extras and surcharges (rush hour, overnight)
      - name: mta_tax
        description: $0.50 MTA tax automatically triggered based on meter rate
      - name: tip_amount
        description: Tip amount (credit card tips only, cash tips not included)
      - name: tolls_amount
        description: Total amount of all tolls paid during the trip
      - name: improvement_surcharge
        description: Improvement surcharge assessed on hailed trips
      - name: total_amount
        description: Total amount charged to passengers (does not include cash tips)
      - name: payment_type
        description: Payment method code (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/staging/sources.yml
================================================
sources:
  - name: raw
    description: Raw taxi trip data from NYC TLC
    database: |
      {%- if target.type == 'bigquery' -%}
        {{ env_var('GCP_PROJECT_ID', 'please-add-your-gcp-project-id-here') }}
      {%- else -%}
        taxi_rides_ny
      {%- endif -%}
    schema: |
      {%- if target.type == 'bigquery' -%}
        nytaxi
      {%- else -%}
        prod
      {%- endif -%}
    tables:
      - name: green_tripdata
        description: Raw green taxi trip records
        columns:
          - name: vendorid
            description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.) - Note: Raw data may contain nulls, filtered in staging"
          - name: lpep_pickup_datetime
            description: Date and time when the meter was engaged
          - name: lpep_dropoff_datetime
            description: Date and time when the meter was disengaged
          - name: passenger_count
            description: Number of passengers in the vehicle
          - name: trip_distance
            description: Trip distance in miles
          - name: pulocationid
            description: TLC Taxi Zone where the meter was engaged
          - name: dolocationid
            description: TLC Taxi Zone where the meter was disengaged
          - name: ratecodeid
            description: Rate code (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)
          - name: store_and_fwd_flag
            description: Trip record held in vehicle memory (Y/N)
          - name: payment_type
            description: Payment method (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)
          - name: fare_amount
            description: Time and distance fare
          - name: extra
            description: Miscellaneous extras and surcharges
          - name: mta_tax
            description: MTA tax
          - name: tip_amount
            description: Tip amount (credit card only)
          - name: tolls_amount
            description: Total tolls paid
          - name: improvement_surcharge
            description: Improvement surcharge
          - name: total_amount
            description: Total amount charged
          - name: trip_type
            description: Trip type (1=Street-hail, 2=Dispatch)
          - name: ehail_fee
            description: E-hail fee

        config:
          loaded_at_field: lpep_pickup_datetime
      - name: yellow_tripdata
        description: Raw yellow taxi trip records
        columns:
          - name: vendorid
            description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.) - Note: Raw data may contain nulls, filtered in staging"
          - name: tpep_pickup_datetime
            description: Date and time when the meter was engaged
          - name: tpep_dropoff_datetime
            description: Date and time when the meter was disengaged
          - name: passenger_count
            description: Number of passengers in the vehicle
          - name: trip_distance
            description: Trip distance in miles
          - name: pulocationid
            description: TLC Taxi Zone where the meter was engaged
          - name: dolocationid
            description: TLC Taxi Zone where the meter was disengaged
          - name: ratecodeid
            description: Rate code (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)
          - name: store_and_fwd_flag
            description: Trip record held in vehicle memory (Y/N)
          - name: payment_type
            description: Payment method (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)
          - name: fare_amount
            description: Time and distance fare
          - name: extra
            description: Miscellaneous extras and surcharges
          - name: mta_tax
            description: MTA tax
          - name: tip_amount
            description: Tip amount (credit card only)
          - name: tolls_amount
            description: Total tolls paid
          - name: improvement_surcharge
            description: Improvement surcharge
          - name: total_amount
            description: Total amount charged
        config:
          loaded_at_field: tpep_pickup_datetime
    config:
      freshness:
        warn_after: {count: 24, period: hour}
        error_after: {count: 48, period: hour}


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/staging/stg_green_tripdata.sql
================================================
with source as (
    select * from {{ source('raw', 'green_tripdata') }}
),

renamed as (
    select
        -- identifiers
        cast(vendorid as integer) as vendor_id,
        {{ safe_cast('ratecodeid', 'integer') }} as rate_code_id,
        cast(pulocationid as integer) as pickup_location_id,
        cast(dolocationid as integer) as dropoff_location_id,

        -- timestamps
        cast(lpep_pickup_datetime as timestamp) as pickup_datetime,  -- lpep = Livery Passenger Enhancement Program (green taxis)
        cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,

        -- trip info
        cast(store_and_fwd_flag as string) as store_and_fwd_flag,
        cast(passenger_count as integer) as passenger_count,
        cast(trip_distance as numeric) as trip_distance,
        {{ safe_cast('trip_type', 'integer') }} as trip_type,

        -- payment info
        cast(fare_amount as numeric) as fare_amount,
        cast(extra as numeric) as extra,
        cast(mta_tax as numeric) as mta_tax,
        cast(tip_amount as numeric) as tip_amount,
        cast(tolls_amount as numeric) as tolls_amount,
        cast(ehail_fee as numeric) as ehail_fee,
        cast(improvement_surcharge as numeric) as improvement_surcharge,
        cast(total_amount as numeric) as total_amount,
        {{ safe_cast('payment_type', 'integer') }} as payment_type
    from source
    -- Filter out records with null vendor_id (data quality requirement)
    where vendorid is not null
)

select * from renamed

-- Sample records for dev environment using deterministic date filter
{% if target.name == 'dev' %}
where pickup_datetime >= '2019-01-01' and pickup_datetime < '2019-02-01'
{% endif %}


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/staging/stg_yellow_tripdata.sql
================================================
with source as (
    select * from {{ source('raw', 'yellow_tripdata') }}
),

renamed as (
    select
        -- identifiers (standardized naming for consistency across yellow/green)
        cast(vendorid as integer) as vendor_id,
        cast(ratecodeid as integer) as rate_code_id,
        cast(pulocationid as integer) as pickup_location_id,
        cast(dolocationid as integer) as dropoff_location_id,

        -- timestamps (standardized naming)
        cast(tpep_pickup_datetime as timestamp) as pickup_datetime,  -- tpep = Taxicab Passenger Enhancement Program (yellow taxis)
        cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,

        -- trip info
        cast(store_and_fwd_flag as string) as store_and_fwd_flag,
        cast(passenger_count as integer) as passenger_count,
        cast(trip_distance as numeric) as trip_distance,

        -- payment info
        cast(fare_amount as numeric) as fare_amount,
        cast(extra as numeric) as extra,
        cast(mta_tax as numeric) as mta_tax,
        cast(tip_amount as numeric) as tip_amount,
        cast(tolls_amount as numeric) as tolls_amount,
        cast(improvement_surcharge as numeric) as improvement_surcharge,
        cast(total_amount as numeric) as total_amount,
        cast(payment_type as integer) as payment_type

    from source
    -- Filter out records with null vendor_id (data quality requirement)
    where vendorid is not null
)

select * from renamed

-- Sample records for dev environment using deterministic date filter
{% if target.name == 'dev' %}
where pickup_datetime >= '2019-01-01' and pickup_datetime < '2019-02-01'
{% endif %}


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/package-lock.yml
================================================
packages:
  - name: dbt_utils
    package: dbt-labs/dbt_utils
    version: 1.3.3
  - name: codegen
    package: dbt-labs/codegen
    version: 0.14.0
sha1_hash: 01f31e0d658d76121f50e62b998342ebf138df11


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/packages.yml
================================================
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.3.0", "<2.0.0"]
  - package: dbt-labs/codegen
    version: [">=0.14.0", "<1.0.0"]

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/seeds/seeds_properties.yml
================================================
seeds:
  - name: taxi_zone_lookup
    description: >
      Taxi zones are roughly based on the NYC Department of City Planning's Neighborhood
      Tabulation Areas (NTAs) and are meant to approximate neighborhoods, so you can see which
      neighborhood a passenger was picked up in and which neighborhood they were dropped off in.
      Includes the associated service_zone (EWR, Boro Zone, Yellow Zone).

  - name: payment_type_lookup
    description: >
      Payment type reference data mapping payment type codes to their descriptions.
      Used as a dimension table for payment method analysis.
    columns:
      - name: payment_type
        description: Numeric code for payment type
        data_tests:
          - unique
          - not_null
      - name: description
        description: Human-readable description of payment method

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/snapshots/.gitkeep
================================================


================================================
FILE: 04-analytics-engineering/taxi_rides_ny/tests/.gitkeep
================================================


================================================
FILE: 05-data-platforms/README.md
================================================
# Module 5: Data Platforms

## Overview

In this module, you'll learn about data platforms - tools that help you manage the entire data lifecycle from ingestion to analytics.

We'll use [Bruin](https://getbruin.com/) as an example of a data platform. Bruin combines several capabilities in one platform:

- Data ingestion (extract from sources to your warehouse)
- Data transformation (cleaning, modeling, aggregating)
- Data orchestration (scheduling and dependency management)
- Data quality (built-in checks and validation)
- Metadata management (lineage, documentation)

## Tutorial

Follow the complete hands-on tutorial at:

[Bruin Data Engineering Zoomcamp Template](https://github.com/bruin-data/bruin/tree/main/templates/zoomcamp)

The template is a TODO-based learning exercise — run `bruin init zoomcamp my-taxi-pipeline` and fill in the configuration and code guided by inline comments. The [notes](notes/) contain completed reference implementations.

## Videos

### :movie_camera: 5.1 - Introduction to Bruin

[![](https://markdown-videos-api.jorgenkh.no/youtube/f6vg7lGqZx0)](https://youtu.be/f6vg7lGqZx0?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=1)

Introduction to the Bruin data platform: what it is, what a modern data stack looks like (ETL/ELT, orchestration, data quality), and how Bruin brings all of these together into a single project.

- [Notes](notes/01-introduction.md)


### :movie_camera: 5.2 - Getting Started with Bruin

[![](https://markdown-videos-api.jorgenkh.no/youtube/JJwHKSidX_c)](https://youtu.be/JJwHKSidX_c?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=2)

Install Bruin, set up the VS Code/Cursor extension and Bruin MCP, and create a first project using `bruin init`. Walk through environments, connections (DuckDB, Chess.com), pipeline YAML configuration, and running Python, YAML ingestor, and SQL assets.

- [Notes](notes/02-getting-started.md)


### :movie_camera: 5.3 - Building an End-to-End Pipeline with NYC Taxi Data

[![](https://markdown-videos-api.jorgenkh.no/youtube/q0k_iz9kWsI)](https://youtu.be/q0k_iz9kWsI?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=3)

Build a full pipeline with a three-layered architecture (ingestion, staging, reports) using NYC taxi data and DuckDB.

- [Notes](notes/03-nyc-taxi-pipeline.md)


### :movie_camera: 5.4 - Using Bruin MCP with AI Agents

[![](https://markdown-videos-api.jorgenkh.no/youtube/224xH7h8OaQ)](https://youtu.be/224xH7h8OaQ?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=4)

Install the Bruin MCP in Cursor/VS Code and use an AI agent to build the entire NYC taxi pipeline end to end. Query data conversationally, ask questions about pipeline logic, and troubleshoot issues — all through natural language.

- [Notes](notes/04-bruin-mcp.md)


### :movie_camera: 5.5 - Deploying to Bruin Cloud

[![](https://markdown-videos-api.jorgenkh.no/youtube/uBqjLEwF8rc)](https://youtu.be/uBqjLEwF8rc?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=5)

Register for Bruin Cloud, connect your GitHub repository, set up data warehouse connections, and deploy and monitor your pipelines on fully managed infrastructure.

- [Notes](notes/05-bruin-cloud.md)


## Bruin Core Concepts

Short videos covering the fundamental concepts of Bruin: projects, pipelines, assets, variables, and commands.

### :movie_camera: Projects

[![](https://markdown-videos-api.jorgenkh.no/youtube/YWDjnSxbBtY)](https://www.youtube.com/watch?v=YWDjnSxbBtY)

The root directory where you create your Bruin data pipeline. Learn about project initialization, the `.bruin.yml` configuration file, environments, and connections.

- [Notes](notes/06-core-01-projects.md)


### :movie_camera: Pipelines

[![](https://markdown-videos-api.jorgenkh.no/youtube/uzp_DiR4Sok)](https://www.youtube.com/watch?v=uzp_DiR4Sok)

A grouping mechanism for organizing assets based on their execution schedule. Each pipeline has a single schedule and its own configuration file.

- [Notes](notes/06-core-02-pipelines.md)


### :movie_camera: Assets

[![](https://markdown-videos-api.jorgenkh.no/youtube/ZElY5SoqrwI)](https://www.youtube.com/watch?v=ZElY5SoqrwI)

Single files that perform specific tasks, creating or updating tables/views in your database. Covers SQL, Python, and YAML asset types with examples.

- [Notes](notes/06-core-03-assets.md)


### :movie_camera: Variables

[![](https://markdown-videos-api.jorgenkh.no/youtube/XCx0nDmhhxA)](https://www.youtube.com/watch?v=XCx0nDmhhxA)

Dynamic values initialized at each pipeline run. Learn about built-in variables (`start_date`, `end_date`) and custom variables for parameterizing your pipelines.

- [Notes](notes/06-core-04-variables.md)


### :movie_camera: Commands

[![](https://markdown-videos-api.jorgenkh.no/youtube/3nykPEs_V7E)](https://www.youtube.com/watch?v=3nykPEs_V7E)

CLI commands for interacting with your Bruin project: `bruin run`, `bruin validate`, `bruin lineage`, and more with practical examples.

- [Notes](notes/06-core-05-commands.md)


## Resources

- [Bruin Documentation](https://getbruin.com/docs)
- [Bruin GitHub Repository](https://github.com/bruin-data/bruin)
- [Bruin MCP (AI Integration)](https://getbruin.com/docs/bruin/getting-started/bruin-mcp)
- [Bruin Cloud](https://getbruin.com/) — managed deployment and monitoring

# Homework

* [2026 Homework](../cohorts/2026/05-data-platforms/homework.md)

# Community notes

<details>
<summary>Did you take notes? You can share them here</summary>

* Add your notes here (above this line)

</details>


================================================
FILE: 05-data-platforms/notes/01-introduction.md
================================================
# 5.1 - Introduction to Bruin

## What is Bruin?

Bruin is an end-to-end data platform that combines ingestion, transformations, orchestration, data quality checks, metadata, and lineage into a single tool.

Instead of using five or six different tools configured separately, Bruin lets you have your code logic, configurations, dependencies, and quality checks all in the same place.

## The modern data stack

A typical data stack involves several components:

- Extract/ingest data from third-party sources or databases into a data warehouse or data lake
- Run transformations: clean data, create reports, push results to a warehouse, lake, or third-party application
- Orchestrate: tell different scripts and services when to run, how to run, and how to communicate with each other
- Data quality and governance: ensure accuracy, completeness, and consistency of data before delivering it to consumers

Bruin brings all of these together so you don't need to be a DevOps person, data infrastructure engineer, and data architect just to build a pipeline.

## Learning goals for the tutorial series

- Bruin project structure
- What is a pipeline and what are assets
- How to configure pipelines
- Materialization strategies supported by Bruin
- Lineage and how to build dependencies between assets
- Metadata created automatically and manually
- Parameterizing pipelines with custom variables


================================================
FILE: 05-data-platforms/notes/02-getting-started.md
================================================
# 5.2 - Getting Started with Bruin

## Installation

Install Bruin CLI:

```bash
curl -LsSf https://getbruin.com/install/cli | sh
bruin version
```

Install the Bruin extension for VS Code or Cursor. This adds a Bruin render panel that lets you run assets and pipelines directly from the IDE.

## Bruin MCP

Bruin provides an MCP (Model Context Protocol) server that you can add to your IDE (Cursor, VS Code) to use AI agents for creating pipelines. Add the Bruin MCP under your IDE settings > Tools and MCP.

### Bruin MCP Integration for VS Code

Create the configuration file: VS Code reads workspace-level MCP settings from `.vscode/mcp.json`. In the root directory of your project (the same level as your `.git` folder or `package.json`), create a `.vscode` folder if it does not exist yet, and add a new file named `mcp.json` inside it.

Add the configuration: open the `mcp.json` file and paste the following JSON into it:

```json
{
  "servers": {
    "bruin": {
      "type": "stdio",
      "command": "bruin",
      "args": [
        "mcp"
      ]
    }
  },
  "inputs": []
}
```

This configuration instructs VS Code to launch the `bruin mcp` command, establishing a standard input/output connection with the Bruin MCP server.

## Initializing a project

```bash
bruin init default my-first-pipeline
cd my-first-pipeline
```

This creates a project from a template, initializes git, adds a `.gitignore`, and creates the `.bruin.yml` file.

Bruin requires the project to be git-initialized. The `bruin init` command handles this automatically.

## Project structure

```text
my-first-pipeline/
├── .bruin.yml              # Environment and connection configuration
├── pipeline.yml            # Pipeline name, schedule, default connections
└── assets/
    ├── players.asset.yml   # Ingestr asset (data ingestion)
    ├── player_stats.sql    # SQL asset with quality checks
    └── my_python_asset.py  # Python asset
```

### .bruin.yml

- Stays local only (auto-added to `.gitignore`)
- Never push this to your repo — it contains database connections and secrets
- Defines environments (default, production, staging, etc.)
- Under each environment, define connections (e.g. DuckDB, Chess.com, custom secrets)

```yaml
default_environment: default

environments:
  default:
    connections:
      duckdb:
        - name: duckdb-default
          path: duckdb.db
```

### pipeline.yml

Configures the pipeline: name, schedule, default connection, start date.

```yaml
name: my-pipeline
schedule: daily
start_date: "2022-01-01"
default_connections:
  duckdb: duckdb-default
```

## Asset types

### Python asset

The simplest asset type: a named Python script that prints or processes data. Run it from the Bruin panel in your IDE.

### YAML ingestor asset

Uses Bruin's built-in ingestor. Define the source connection, the destination, and the table. It supports many built-in sources and destinations (Redshift, MySQL, Postgres, MotherDuck, BigQuery, etc.) and automatically creates the destination database/table if it doesn't exist.

### SQL asset

Runs SQL queries against your database. Define dependencies to other assets — when a dependency finishes, this asset runs automatically.

## Intervals and incremental ingestion

- Set `start_date` and `end_date` parameters to ingest data for a specific time range
- Bruin provides these as variables you can inject into your code
- Built-in ingestion assets automatically use the start/end dates
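
As a rough illustration of how a Python asset can pick up the run window, the sketch below reads the two dates from the environment. The variable names `BRUIN_START_DATE` and `BRUIN_END_DATE` are the ones used in the NYC taxi notes (5.3); the `read_interval` helper and the `YYYY-MM-DD` format are assumptions made for this example.

```python
import os
from datetime import date


def read_interval(env=os.environ):
    """Return the (start, end) dates of the current run window."""
    # Assumption: both values are ISO dates such as "2022-01-01".
    start = date.fromisoformat(env["BRUIN_START_DATE"])
    end = date.fromisoformat(env["BRUIN_END_DATE"])
    if end < start:
        raise ValueError("end date is before start date")
    return start, end


# Simulated environment, so the sketch also runs outside of Bruin.
start, end = read_interval(
    {"BRUIN_START_DATE": "2022-01-01", "BRUIN_END_DATE": "2022-02-01"}
)
print(start, end, (end - start).days)
```

Passing the environment as a parameter keeps the helper easy to test without a real pipeline run.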

## Dependencies and lineage

- Define dependencies between assets so they run in the correct order
- When the first asset completes, it automatically triggers the next dependent asset
- Bruin builds a lineage graph from these dependencies

## Key CLI commands

| Command | Purpose |
|---------|---------|
| `bruin validate <path>` | Check syntax and dependencies without running |
| `bruin run <path>` | Execute pipeline or individual asset |
| `bruin run --downstream` | Run asset and all downstream dependencies |
| `bruin run --full-refresh` | Truncate and rebuild tables from scratch |
| `bruin lineage <path>` | View asset dependencies |
| `bruin query --connection <conn> --query "..."` | Execute ad-hoc SQL queries |


================================================
FILE: 05-data-platforms/notes/03-nyc-taxi-pipeline.md
================================================
# 5.3 - Building an End-to-End Pipeline with NYC Taxi Data

## Architecture

Three-layered pipeline using DuckDB as a locally hosted database:

1. Ingestion layer: extract data and store in raw format
2. Staging layer: pre-process, clean, transform, join with lookup tables
3. Reports layer: aggregate data and run calculations

All assets have dependencies that create the data lineage Bruin uses for orchestration.

## Project setup

Initialize from the zoomcamp template:

```bash
bruin init zoomcamp my-taxi-pipeline
cd my-taxi-pipeline
```

Project structure:

```text
my-taxi-pipeline/
├── .bruin.yml
├── README.md
└── pipeline/
    ├── pipeline.yml
    └── assets/
        ├── ingestion/
        │   ├── trips.py
        │   ├── requirements.txt
        │   ├── payment_lookup.asset.yml
        │   └── payment_lookup.csv
        ├── staging/
        │   └── trips.sql
        └── reports/
            └── trips_report.sql
```

### .bruin.yml

```yaml
default_environment: default

environments:
  default:
    connections:
      duckdb:
        - name: duckdb-default
          path: duckdb.db
```

### pipeline.yml

```yaml
name: nyc_taxi
schedule: daily
start_date: "2022-01-01"
default_connections:
  duckdb: duckdb-default
variables:
  taxi_types:
    type: array
    items:
      type: string
    default: ["yellow"]
```

- `start_date`: when running a full refresh, process data starting from this date
- Custom variables: `taxi_types` lets you control which taxi types to ingest (yellow, green, or both)
- Variables can be overridden at runtime with `--var`

## Ingestion layer

### Python asset: trips.py

The Python asset downloads the monthly NYC taxi trip data (Parquet files) for the requested time window.

```python
"""@bruin
name: ingestion.trips
type: python
image: python:3.11

materialization:
  type: table
  strategy: append

columns:
  - name: pickup_datetime
    type: timestamp
    description: "When the meter was engaged"
  - name: dropoff_datetime
    type: timestamp
    description: "When the meter was disengaged"
@bruin"""

import os
import json
import pandas as pd

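def materialize():
    start_date = os.environ["BRUIN_START_DATE"]
    end_date = os.environ["BRUIN_END_DATE"]
    taxi_types = json.loads(os.environ["BRUIN_VARS"]).get("taxi_types", ["yellow"])

    # One period per month touched by the run window
    months = pd.period_range(start=start_date, end=end_date, freq="M")

    frames = []
    for taxi_type in taxi_types:
        for month in months:
            # Monthly parquet file published for this taxi type
            url = (
                "https://d37ci6vzurychx.cloudfront.net/trip-data/"
                f"{taxi_type}_tripdata_{month.year}-{month.month:02d}.parquet"
            )
            df = pd.read_parquet(url)
            df["taxi_type"] = taxi_type
            frames.append(df)

    # Source column names differ from the ones declared above
    # (e.g. tpep_pickup_datetime for yellow taxis); rename them as needed.
    return pd.concat(frames, ignore_index=True)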
```

- `materialize()` returns a DataFrame; Bruin handles inserting it into the destination
- `append` strategy: each run inserts data without touching existing rows
- Uses `BRUIN_START_DATE` / `BRUIN_END_DATE` environment variables for the time window
- Uses `BRUIN_VARS` to read the `taxi_types` pipeline variable
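
The month-by-month file lookup in `trips.py` can be sketched as a standalone helper that needs no network access. The URL pattern is the one quoted in the asset; the helper names `month_starts` and `parquet_urls`, and the choice to treat the end date as exclusive, are assumptions for this illustration.

```python
from datetime import date

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def month_starts(start, end):
    """Yield the first day of every month touched by [start, end)."""
    current = date(start.year, start.month, 1)
    while current < end:
        yield current
        # Move to the first day of the following month.
        carry, month_index = divmod(current.month, 12)
        current = date(current.year + carry, month_index + 1, 1)


def parquet_urls(start, end, taxi_types):
    """Build one URL per taxi type and month in the window."""
    return [
        f"{BASE_URL}/{taxi_type}_tripdata_{m.year}-{m.month:02d}.parquet"
        for taxi_type in taxi_types
        for m in month_starts(start, end)
    ]


urls = parquet_urls(date(2022, 1, 1), date(2022, 3, 1), ["yellow", "green"])
for url in urls:
    print(url)
```

Each URL can then be passed to `pandas.read_parquet` inside `materialize()`.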

### Seed file: payment_lookup.asset.yml

Seed files ingest data from local CSV files into the database.

```yaml
name: ingestion.payment_lookup
type: duckdb.seed
parameters:
  path: payment_lookup.csv
columns:
  - name: payment_type_id
    type: integer
    description: "Numeric code for payment type"
    primary_key: true
    checks:
      - name: not_null
      - name: unique
  - name: payment_type_name
    type: string
    description: "Human-readable payment type"
    checks:
      - name: not_null
```

payment_lookup.csv:

```csv
payment_type_id,payment_type_name
0,flex_fare
1,credit_card
2,cash
3,no_charge
4,dispute
5,unknown
6,voided_trip
```

Quality checks (`not_null`, `unique`) run automatically after the asset finishes.
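
To make the two checks concrete, here is a minimal pure-Python sketch of what `not_null` and `unique` assert about a column. This is not how Bruin implements them; the function names and the shortened CSV sample are only for illustration.

```python
import csv
import io

CSV_TEXT = """payment_type_id,payment_type_name
0,flex_fare
1,credit_card
2,cash
"""


def not_null(rows, column):
    """True when no row has a missing or empty value in the column."""
    return all(row[column] not in (None, "") for row in rows)


def unique(rows, column):
    """True when every value in the column appears exactly once."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
results = {
    "payment_type_id.not_null": not_null(rows, "payment_type_id"),
    "payment_type_id.unique": unique(rows, "payment_type_id"),
    "payment_type_name.not_null": not_null(rows, "payment_type_name"),
}
print(results)
```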

### requirements.txt

```
pandas
requests
pyarrow
python-dateutil
```

Bruin handles the environment and installs dependencies locally within the pipeline.

## Staging layer

### SQL asset: staging/trips.sql

```sql
/* @bruin
name: staging.trips
type: duckdb.sql

depends:
  - ingestion.trips
  - ingestion.payment_lookup

materialization:
  type: table
  strategy: time_interval
  incremental_key: pickup_datetime
  time_granularity: timestamp

columns:
  - name: pickup_datetime
    type: timestamp
    primary_key: true
    checks:
      - name: not_null

custom_checks:
  - name: row_count_greater_than_zero
    query: |
      SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END
      FROM staging.trips
    value: 1
@bruin */

SELECT
    t.pickup_datetime,
    t.dropoff_datetime,
    t.pickup_location_id,
    t.dropoff_location_id,
    t.fare_amount,
    t.taxi_type,
    p.payment_type_name
FROM ingestion.trips t
LEFT JOIN ingestion.payment_lookup p
    ON t.payment_type = p.payment_type_id
WHERE t.pickup_datetime >= '{{ start_datetime }}'
  AND t.pickup_datetime < '{{ end_datetime }}'
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY t.pickup_datetime, t.dropoff_datetime,
                 t.pickup_location_id, t.dropoff_location_id, t.fare_amount
    ORDER BY t.pickup_datetime
) = 1
```

- `time_interval` strategy: deletes rows in the time window, then inserts the query result
- The `WHERE` clause must filter to the same time window to avoid duplicates
- `QUALIFY ROW_NUMBER()` deduplicates using a composite key
- Dependencies on both `ingestion.trips` and `ingestion.payment_lookup` ensure this runs after ingestion
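
The delete-then-insert behavior of `time_interval` can be sketched with Python's built-in `sqlite3` module standing in for DuckDB. The table layout and the `refresh_window` helper are assumptions for this example; the point is only that re-running the same window does not duplicate rows.

```python
import sqlite3


def refresh_window(conn, rows, start, end):
    """Delete the rows inside [start, end), then insert the new rows."""
    with conn:  # one transaction: both statements apply or neither does
        conn.execute(
            "DELETE FROM trips WHERE pickup_datetime >= ? AND pickup_datetime < ?",
            (start, end),
        )
        conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_datetime TEXT, fare_amount REAL)")
# A row outside the January window, which must survive every refresh.
conn.execute("INSERT INTO trips VALUES ('2021-12-31 23:50:00', 9.5)")

january = [("2022-01-01 08:00:00", 12.0), ("2022-01-02 09:30:00", 7.5)]
for _ in range(2):  # running twice must not duplicate January
    refresh_window(conn, january, "2022-01-01", "2022-02-01")

count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(count)
```

If the inserted rows were not limited to the same window as the delete, each run would leave extra rows behind, which is why the `WHERE` clause in the asset matters.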

## Reports layer

### SQL asset: reports/trips_report.sql

```sql
/* @bruin
name: reports.trips_report
type: duckdb.sql

depends:
  - staging.trips

materialization:
  type: table
  strategy: time_interval
  incremental_key: trip_date
  time_granularity: date

columns:
  - name: trip_date
    type: date
    primary_key: true
  - name: taxi_type
    type: string
    primary_key: true
  - name: payment_type
    type: string
    primary_key: true
  - name: trip_count
    type: bigint
    checks:
      - name: non_negative
@bruin */

SELECT
    CAST(pickup_datetime AS DATE) AS trip_date,
    taxi_type,
    payment_type_name AS payment_type,
    COUNT(*) AS trip_count,
    SUM(fare_amount) AS total_fare,
    AVG(fare_amount) AS avg_fare
FROM staging.trips
WHERE pickup_datetime >= '{{ start_datetime }}'
  AND pickup_datetime < '{{ end_datetime }}'
GROUP BY 1, 2, 3
```

## Running the full pipeline

```bash
# Validate structure and definitions
bruin validate ./pipeline/pipeline.yml

# Run with a small date range for testing
bruin run ./pipeline/pipeline.yml --start-date 2022-01-01 --end-date 2022-02-01

# Full refresh
bruin run ./pipeline/pipeline.yml --full-refresh

# Query results
bruin query --connection duckdb-default --query "SELECT COUNT(*) FROM ingestion.trips"
```

Open the pipeline YAML file in the Bruin panel and view the lineage tab to see all assets and their dependencies. Execution order:

1. Ingestion assets run first (trips + lookup, in parallel)
2. Staging asset runs after both ingestion assets complete
3. Report asset runs after staging completes
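
The execution order above is what a topological sort of the `depends` declarations produces. The sketch below uses Python's standard `graphlib` module to show this; it illustrates the idea and is not Bruin's scheduler.

```python
from graphlib import TopologicalSorter

# asset -> assets it depends on, copied from the `depends` blocks above
depends = {
    "ingestion.trips": [],
    "ingestion.payment_lookup": [],
    "staging.trips": ["ingestion.trips", "ingestion.payment_lookup"],
    "reports.trips_report": ["staging.trips"],
}

sorter = TopologicalSorter(depends)
sorter.prepare()

stages = []
while sorter.is_active():
    # Everything whose dependencies are finished can run in parallel.
    ready = sorted(sorter.get_ready())
    stages.append(ready)
    sorter.done(*ready)

for number, stage in enumerate(stages, start=1):
    print(number, stage)
```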

## Materialization strategies summary

| Strategy | Behavior |
|----------|----------|
| `table` | Drop and recreate the table each time |
| `append` | Insert new data without touching existing rows |
| `merge` | Upsert based on key columns |
| `time_interval` | Delete rows in date range, then re-insert |
| `delete+insert` | Delete matching rows, then insert |
| `create+replace` | Create or replace the table |
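
To contrast two of these strategies, the following sketch models a table as a list of rows and shows how `append` and `merge` treat an incoming batch that overlaps existing keys. The helper functions and the sample rows are hypothetical and only mirror the behavior described in the table.

```python
def append(table, new_rows):
    """Insert new rows and leave the existing ones untouched."""
    return table + new_rows


def merge(table, new_rows, key):
    """Upsert: replace rows with a matching key, insert the rest."""
    by_key = {row[key]: row for row in table}
    for row in new_rows:
        by_key[row[key]] = row
    return list(by_key.values())


existing = [{"id": 1, "fare": 10.0}, {"id": 2, "fare": 20.0}]
incoming = [{"id": 2, "fare": 25.0}, {"id": 3, "fare": 30.0}]

appended = append(existing, incoming)
merged = merge(existing, incoming, key="id")
print(len(appended), len(merged))
```

With `append`, the row for `id` 2 exists twice afterwards; with `merge`, it is updated in place.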


================================================
FILE: 05-data-platforms/notes/04-bruin-mcp.md
================================================
# 5.4 - Using Bruin MCP with AI Agents

## What is Bruin MCP?

MCP stands for **Model Context Protocol**. Bruin MCP is a way for AI agents (in Cursor, VS Code, Claude, etc.) to communicate with Bruin — querying documentation, running commands on your behalf, going through your code, troubleshooting, and analyzing data.

With the Bruin MCP and an AI agent, you can:

- Write pipeline code and asset configurations
- Write documentation and metadata
- Troubleshoot errors and debug issues
- Run queries and analyze data using natural language
- Ask questions about your pipeline logic and structure

## Installing Bruin MCP

Make sure you have [Bruin CLI installed](https://getbruin.com/docs/bruin/getting-started/introduction/installation) first.

### Cursor

Go to **Settings → Tools & MCP → New MCP Server** and add:

```json
{
  "mcpServers": {
    "bruin": {
      "command": "bruin",
      "args": ["mcp"]
    }
  }
}
```

If it shows a failure/error, close and reopen your IDE — you should see "Bruin enabled".

### VS Code (Copilot)

Create `.vscode/mcp.json` in your project folder:

```json
{
  "servers": {
    "bruin": {
      "command": "bruin",
      "args": ["mcp"]
    }
  }
}
```

### Claude Code

```bash
claude mcp add bruin -- bruin mcp
```

See the full [Bruin MCP documentation](https://getbruin.com/docs/bruin/getting-started/bruin-mcp) for other agents and troubleshooting.

## Building a pipeline with MCP

### Using the template prompt

The zoomcamp template includes an example prompt in its README that you can give to the AI agent to create the entire pipeline end-to-end:

```bash
bruin init zoomcamp my-taxi-pipeline
```

Open the generated `README.md` — it contains a prompt you can paste into the agent to scaffold the entire pipeline automatically.

### What the agent does

When given the pipeline prompt, the agent will:

1. Create all pipeline assets (ingestion, staging, reports)
2. Configure materialization strategies and dependencies
3. Set up quality checks and column metadata
4. Validate the pipeline with `bruin validate`
5. Run the pipeline with a test date range
6. Run custom checks to validate query logic
7. Execute verification queries using `bruin query`

### Working incrementally

In practice, you may prefer working asset by asset rather than generating everything at once. This lets you be involved in every design choice:

- Create and test the ingestion asset first
- Then build the staging layer
- Then add the reports layer
- Review and adjust quality checks at each step

## Querying data with the agent

Once your pipeline has run, you can use the agent conversationally to query your data:

**Example queries:**
- "Query the staging table and tell me how many days of data we have"
- "Which day had the highest number of trips and total fare?"
- "In which asset are we aggregating data?"

The agent understands the context of your pipeline — it knows the table structures, can write SQL queries, and can explain the logic behind each asset. This is useful for:

- Ad hoc analysis without writing SQL manually
- Understanding unfamiliar pipeline logic
- Data validation and troubleshooting
- Onboarding new team members to an existing pipeline


================================================
FILE: 05-data-platforms/notes/05-bruin-cloud.md
================================================
# 5.5 - Deploying to Bruin Cloud

## What is Bruin Cloud?

Bruin Cloud is fully managed infrastructure for your data pipelines, powered by the same open-source CLI tool you use locally for development. Everything lives in the same place:

- Ingestions and transformations
- Quality checks and monitoring
- Lineage and metadata
- Data governance
- AI-powered features (automatic metadata generation, conversational data analysis)

## Registration

1. Go to [Bruin Cloud](https://getbruin.com/) and sign up
2. Fill out your name, email, and set a password
3. Verify your email by clicking the link in the verification email
4. Choose to join an existing team or create a new organization
5. Give your organization a name

## Connecting your GitHub repository

You have two options:

1. **Direct GitHub connection** (recommended) — connect your GitHub account directly and select your repo from a dropdown
2. **Personal Access Token** — provide a GitHub personal access token and your repo link manually

## Setting up connections

After connecting your repo, set up your data warehouse connections. These are the same connections you configure locally in `.bruin.yml`, but stored securely in the cloud.

1. Go to the connections page
2. Select your connection type (MotherDuck, BigQuery, Redshift, etc.)
3. Give it the same connection name you use locally
4. Provide the required credentials (e.g., service token, database name)
5. The connection will be validated and tested automatically

Read the Bruin documentation for details on how secrets are stored securely.

## Deploying pipelines

1. Navigate to the **Pipelines** page to see the list of pipelines from your repository
2. Bruin will validate every asset and ensure lineage and connections work (this takes a moment)
3. Once ready, **enable** the pipeline

When you enable a pipeline with a schedule, Bruin automatically creates a run for the last interval. For example, a monthly pipeline will immediately process the previous month's data.

## Monitoring

After a pipeline runs:

- Check the status of each asset (success/failure)
- Review quality check results
- View lineage across all assets
- Use AI-powered features to analyze data or ask questions about your pipelines

## Getting help

- Join the [Bruin Slack community](https://getbruin.com/) for questions and feature requests
- Submit issues on [GitHub](https://github.com/bruin-data/bruin)


================================================
FILE: 05-data-platforms/notes/06-core-01-projects.md
================================================
# 5.6 - Core Concepts: Projects

🎥 [Bruin Core Concepts | Projects](https://www.youtube.com/watch?v=YWDjnSxbBtY) (3:03)

## What is a Project?

A **Project** is the root directory where you create your entire Bruin data pipeline. It serves as the foundation for organizing all your data assets, configurations, and connections.

## Project Initialization

The project must be initialized with `bruin init` so the CLI tool can understand the directory structure and navigate files correctly.

```bash
bruin init zoomcamp my-pipeline
cd my-pipeline
```

## The `.bruin.yml` File

Located at the root of your project, this file defines environments, connections, and secrets.

**Important:** This file is always added to `.gitignore` to protect secrets. It stays local only and should never be pushed to your repo.

### Environments

Define different environments for various stages:

```yaml
default_environment: default

environments:
  default:
    connections:
      duckdb:
        - name: duckdb-default
          path: duckdb.db
      motherduck:
        - name: motherduck
          token: <your-token>

  production:
    connections:
      bigquery:
        - name: bq-prod
          project: my-project
          dataset: production
```

**Benefits:**
- Run pipelines locally or on servers without exposing production credentials
- Different teams can have different connection access
- Default to a non-production environment (`default` in the example above) to prevent accidental production runs

### Connection Types

Built-in connections include:
- DuckDB, MotherDuck
- PostgreSQL, MySQL
- BigQuery, Redshift, Snowflake
- Custom connections (for API keys, secrets, etc.)

### Default Environment

Set which environment is used by default:

```yaml
default_environment: dev
```

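This ensures pipelines run against the development environment unless explicitly told to use production (for example with `--environment production`). The value must match an environment defined under `environments`; the example above would use `default`.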

## Quick Reference

```bash
# Initialize a new project
bruin init zoomcamp my-pipeline

# Navigate to your project
cd my-pipeline

# Check project is valid
bruin validate .
```

## Further Reading

- [Bruin Documentation - Projects](https://getbruin.com/docs/bruin/core-concepts/project.html)
- [Bruin GitHub - Templates](https://github.com/bruin-data/bruin/tree/main/templates)


================================================
FILE: 05-data-platforms/notes/06-core-02-pipelines.md
================================================
# 5.6 - Core Concepts: Pipelines

🎥 [Bruin Core Concepts | Pipelines](https://www.youtube.com/watch?v=uzp_DiR4Sok) (3:13)

## What is a Pipeline?

A **Pipeline** is a grouping mechanism for organizing assets based on their execution schedule and configuration requirements. Within a project, you can have multiple pipelines.

## Key Characteristics

### Single Schedule

Each pipeline has **one schedule** - this is the primary reason to group assets together:
- Assets with the same schedule belong in the same pipeline
- Common schedules: `hourly`, `daily`, `monthly`, or cron expressions

### Pipeline Structure

Each pipeline has its own folder containing a `pipeline.yml` file:

```text
project/
├── .bruin.yml
├── pipelines/
│   ├── nyc-taxi/
│   │   ├── pipeline.yml
│   │   └── assets/
│   └── another-pipeline/
│       ├── pipeline.yml
│       └── assets/
```

## The `pipeline.yml` File

```yaml
name: nyc_taxi
schedule: monthly
start_date: "2019-01-01"
default_connections:
  duckdb: duckdb-default
```

### Configuration Options

| Setting | Description |
|---------|-------------|
| `name` | Pipeline identifier |
| `schedule` | When to run (cron, daily, monthly, etc.) |
| `start_date` | When the pipeline starts being active |
| `default_connections` | Which connections to use |
| `variables` | Custom variables for the pipeline |

### Connection Scoping

Even though connections are defined at the project level (`.bruin.yml`), each pipeline specifies **which connections it uses**.

**Why this matters:**
- In large organizations, different teams may need different credentials
- Prevents unnecessary exposure of secrets
- Only initializes connections needed for the specific pipeline run
- Security isolation between departments

## Quick Reference

```bash
# Validate a pipeline
bruin validate ./pipelines/nyc-taxi/pipeline.yml

# View pipeline lineage
bruin lineage ./pipelines/nyc-taxi/pipeline.yml

# Run the entire pipeline
bruin run ./pipelines/nyc-taxi/pipeline.yml
```

## Further Reading

- [Bruin Documentation - Pipelines](https://getbruin.com/docs/bruin/pipelines/definition.html)
- [Pipeline Configuration Reference](https://getbruin.com/docs/bruin/pipelines/definition.html)


================================================
FILE: 05-data-platforms/notes/06-core-03-assets.md
================================================
# 5.6 - Core Concepts: Assets

🎥 [Bruin Core Concepts | Assets](https://www.youtube.com/watch?v=ZElY5SoqrwI) (6:11)

## What is an Asset?

An **Asset** is a single file that performs a specific task, almost always related to creating or updating a table or view in the destination database.

Each asset file contains two parts:

1. **Definition** (Configuration) - Metadata, name, type, connection
2. **Content** (Code) - The actual SQL, Python, or R code to execute

## Asset Types

| Type | Description | Use Case |
|------|-------------|----------|
| **Python** | Python scripts | Ingestion, data processing, ML models |
| **SQL** | SQL queries | Transformations, aggregations |
| **YAML/Seed** | File-based tables | Reference data, static lookups |
| **R** | R scripts | Statistical analysis, R-specific workflows |

## Asset Naming

The asset name can be:
1. **Explicitly defined** in the asset definition (the `name` field)
2. **Inferred from file path** (default behavior)

**Convention:** Group assets by schema/dataset:
- `assets/raw/trips_raw.py` → Creates table `raw.trips_raw`
- `assets/staging/trips_summary.sql` → Creates table `staging.trips_summary`

## SQL Asset Example

```sql
/* @bruin
name: staging.trips_summary
type: duckdb.sql
connection: duckdb-default

materialization:
  type: table
@bruin */

SELECT
    pickup_date,
    COUNT(*) as trip_count,
    SUM(fare_amount) as total_fare
FROM raw.trips_raw
WHERE pickup_date >= '{{ start_date }}'
  AND pickup_date < '{{ end_date }}'
GROUP BY pickup_date
```

### Materialization Strategies

| Materialization | Behavior |
|-----------------|----------|
| `type: table` | Recreates the table on each run (unless a strategy is set) |
| `type: view` | Creates a view (no data stored) |
| `strategy: append` | Appends new data to the existing table |
| `strategy: merge` | Upserts based on key columns |

## Python Asset Example (Ingestion)

```python
"""@bruin
name: raw.trips_raw
type: python
connection: duckdb-default

materialization:
  type: table
@bruin"""

import pandas as pd
import requests


def materialize():
    # Connect to the API and fetch data (placeholder URL)
    response = requests.get("https://api.example.com/trips", timeout=30)
    response.raise_for_status()
    data = response.json()

    # Return a pandas DataFrame;
    # Bruin handles materialization to the database
    return pd.DataFrame(data)
```

## YAML/Seed Asset Example

```yaml
name: lookup.taxi_types
type: duckdb.seed
connection: duckdb-default

parameters:
  path: reference_data/taxi_types.csv
```

Simply loads a local CSV file and creates a table in the destination database.

## Lineage & Dependencies

Dependencies follow from what each asset reads and are listed under `depends` in the asset definition:

- If Asset B reads from Asset A's table, **B depends on A**
- Visualized in the VS Code extension
- Used for execution ordering during runs

```sql
/* @bruin
name: staging.trips_summary
type: duckdb.sql
depends:
  - raw.trips_raw
@bruin */
SELECT * FROM raw.trips_raw  -- reads the table produced by raw.trips_raw
```

## Quick Reference

```bash
# Run a specific asset
bruin run ./pipeline.yml --asset raw.trips_raw

# Run asset with all downstream dependencies
bruin run ./pipeline.yml --asset raw.trips_raw --downstream

# Run asset with all upstream dependencies
bruin run ./pipeline.yml --asset staging.trips_summary --upstream

# View lineage for an asset
bruin lineage ./pipeline.yml --asset raw.trips_raw
```

## Further Reading

- [Bruin Documentation - Assets](https://getbruin.com/docs/bruin/assets/definition-schema.html)
- [Materialization Strategies](https://getbruin.com/docs/bruin/assets/materialization.html)


================================================
FILE: 05-data-platforms/notes/06-core-04-variables.md
================================================
# 5.6 - Core Concepts: Variables

🎥 [Bruin Core Concepts | Variables](https://www.youtube.com/watch?v=XCx0nDmhhxA) (6:03)

## What are Variables?

**Variables** are dynamically initialized each time a pipeline run is created. They allow you to parameterize your pipelines and pass dynamic values at runtime.

## Variable Types

### 1. Built-in Variables

Always provided by Bruin automatically:

| Variable | Description |
|----------|-------------|
| `start_date` | Beginning of the scheduled interval |
| `end_date` | End of the scheduled interval |

These dates are determined by the pipeline's schedule:

| Schedule | Start Date | End Date |
|----------|------------|----------|
| **Monthly** | First day of month | Last day of month |
| **Daily** | Start of day | End of day |
| **Hourly** | Start of hour | End of hour |
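
The period boundaries in the table can be computed as in the sketch below. Which period a given run covers is decided by Bruin; this hypothetical `interval_for` helper only shows the boundaries, and it represents the end as the start of the next period (an exclusive bound), which is an assumption.

```python
from datetime import datetime, timedelta


def interval_for(schedule, moment):
    """Return (start, end) of the schedule period containing `moment`."""
    if schedule == "hourly":
        start = moment.replace(minute=0, second=0, microsecond=0)
        end = start + timedelta(hours=1)
    elif schedule == "daily":
        start = moment.replace(hour=0, minute=0, second=0, microsecond=0)
        end = start + timedelta(days=1)
    elif schedule == "monthly":
        start = moment.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        # First day of the next month, rolling over the year in December.
        if start.month == 12:
            end = start.replace(year=start.year + 1, month=1)
        else:
            end = start.replace(month=start.month + 1)
    else:
        raise ValueError(f"unsupported schedule: {schedule}")
    return start, end


moment = datetime(2021, 12, 15, 10, 30)
for schedule in ("hourly", "daily", "monthly"):
    print(schedule, *interval_for(schedule, moment))
```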

#### SQL Assets - Jinja Format

In SQL, variables are injected using Jinja templating:

```sql
/* @bruin
name: staging.monthly_trips
type: duckdb.sql
@bruin */

SELECT *
FROM raw.trips
WHERE pickup_date >= '{{ start_date }}'
  AND pickup_date < '{{ end_date }}'
```

Use the **Bruin Render panel** in VS Code to see the compiled query with actual values.

#### Python Assets - Environment Variables

In Python, variables are accessed via environment variables:

```python
"""@bruin
name: raw.monthly_data
type: python
@bruin"""

import os
from datetime import datetime


def materialize():
    start_date = os.environ["BRUIN_START_DATE"]
    end_date = os.environ["BRUIN_END_DATE"]

    # Parse and use dates to fetch data for specific period
    start = datetime.fromisoformat(start_date)
    end = datetime.fromisoformat(end_date)

    # Loop through months in range
    # ...
```

### 2. Custom Variables

User-defined variables set at the pipeline level.

#### Definition in `pipeline.yml`

```yaml
variables:
  taxi_types:
    type: array
    items:
      type: string
    default: ["yellow"]

#### Override at Runtime

Change default values when creating a run:

```bash
bruin run ./pipeline.yml --var 'taxi_types=["green","fhv"]'
```
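
When a JSON array is passed through `--var`, the shell removes unquoted double quotes before the CLI receives the argument, so the whole `KEY=VALUE` pair should be wrapped in single quotes. The sketch below uses Python's `shlex` module to approximate POSIX shell word splitting (it does not model glob expansion) and shows the difference. The `var_value` helper is hypothetical, and treating the value as JSON is an assumption based on how `BRUIN_VARS` is parsed with `json.loads` in these notes.

```python
import json
import shlex


def var_value(command):
    """Return the VALUE part of the first --var KEY=VALUE in a command line."""
    args = shlex.split(command)  # POSIX-style word splitting and quote removal
    assignment = args[args.index("--var") + 1]
    return assignment.split("=", 1)[1]


unquoted = var_value('bruin run ./pipeline.yml --var taxi_types=["green","fhv"]')
quoted = var_value("""bruin run ./pipeline.yml --var 'taxi_types=["green","fhv"]'""")

print(unquoted)            # the inner double quotes are gone
print(json.loads(quoted))  # still valid JSON

try:
    json.loads(unquoted)
except json.JSONDecodeError as error:
    print("not valid JSON:", error)
```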

#### Accessing Custom Variables in Python

```python
"""@bruin
name: example.asset
type: python
@bruin"""

import json
import os

# Custom variables arrive together as a JSON object in BRUIN_VARS
variables = json.loads(os.environ["BRUIN_VARS"])
taxi_types = variables.get("taxi_types", ["yellow"])

# Use the variable in your code
for taxi_type in taxi_types:
    # Process each taxi type
    print(f"Processing {taxi_type} trips")
```

## VS Code Extension Panel

From the Bruin panel in VS Code/Cursor:

1. **Variable Override** - Set custom variable values before running
2. **Bruin Render** - See how Jinja templates are compiled with actual values
3. **Run Configuration** - Set dates, environment, and variables

## Practical Use Cases

| Use Case | Description |
|----------|-------------|
| **Date-based partitioning** | Extract data for specific time periods |
| **Multi-tenant processing** | Run same pipeline for different customers |
| **Parameterized transformations** | Change logic based on variables |
| **A/B testing** | Test different configurations without code changes |

## Quick Reference

```bash
# Run with custom dates
bruin run ./pipeline.yml --start-date 2020-01-01 --end-date 2020-01-31

# Run with variable override (array)
bruin run ./pipeline.yml --var 'taxi_types=["green","fhv"]'

# Run with variable override (string)
bruin run ./pipeline.yml --var customer_id=12345

# Run with full refresh (affects materialization)
bruin run ./pipeline.yml --full-refresh

# Set end date as exclusive
bruin run ./pipeline.yml --exclusive-end-date
```

## Further Reading

- [Bruin Documentation - Variables](https://getbruin.com/docs/bruin/core-concepts/variables.html)
- [Pipeline Runtime Options](https://getbruin.com/docs/bruin/commands/run.html)


================================================
FILE: 05-data-platforms/notes/06-core-05-commands.md
================================================
# 5.6 - Core Concepts: Commands

🎥 [Bruin Core Concepts | Commands](https://www.youtube.com/watch?v=3nykPEs_V7E) (6:46)

## Bruin CLI Commands

Commands are how you interact with your Bruin project - running pipelines, validating configurations, querying data, and more.

## `bruin run` - Execute a Pipeline

Creates a **single execution instance** (a "run") of your pipeline.

### Basic Usage

```bash
bruin run ./pipelines/nyc-taxi/pipeline.yml
```

### Run Scope Options

| Option | Description |
|--------|-------------|
| Entire pipeline | Runs all assets in dependency order |
| Single asset | `--asset staging.trips_summary` |
| With upstream | `--asset X --upstream` - Runs X plus all dependencies |
| With downstream | `--asset X --downstream` - Runs X plus all dependents |

### Common Run Flags

| Flag | Description |
|------|-------------|
| `--start-date DATE` | Set execution start date |
| `--end-date DATE` | Set execution end date |
| `--full-refresh` | Drop and recreate tables (overrides incremental) |
| `--exclusive-end-date` | End date is exclusive (default: inclusive) |
| `--environment ENV` | Use specific environment (dev/prod) |
| `--var KEY=VALUE` | Override custom variables |

### Example Run Commands

```bash
# Simple run
bruin run ./pipelines/nyc-taxi/pipeline.yml

# With date range
bruin run ./pipelines/nyc-taxi/pipeline.yml \
  --start-date 2020-01-01 \
  --end-date 2020-01-31

# Full refresh with variables
bruin run ./pipelines/nyc-taxi/pipeline.yml \
  --full-refresh \
  --var 'taxi_types=["yellow","green"]' \
  --environment default
```

## `bruin validate` - Validate Pipeline

Checks for configuration issues before running:

```bash
bruin validate ./pipelines/nyc-taxi/pipeline.yml
```

**Validates:**
- No circular dependencies in lineage
- Asset definitions are correct
- Connections exist and are properly configured
- No broken references

**Always validate before running!**

## `bruin lineage` - View Dependency Graph

Visualize how assets are connected:

```bash
bruin lineage ./pipelines/nyc-taxi/pipeline.yml
```

Shows upstream and downstream relationships between assets.

## `bruin query` - Query Data

Run ad-hoc queries against your connections:

```bash
bruin query --connection duckdb-default \
  --query "SELECT * FROM ingestion.trips LIMIT 10"
```

## What is a "Run"?

A **run** is a single instance of pipeline execution:
- Has unique start/end times
- May run all assets or a subset
- Has its own variable values
- Creates execution logs and results

## Putting It All Together

The complete Bruin workflow:

```text
1. Project (root, initialized)
   └── .bruin.yml (environments, connections)

2. Pipeline (scheduled grouping)
   └── pipeline.yml (schedule, default connection, variables)

3. Assets (the actual work)
   ├── Python (ingestion, processing)
   ├── SQL (transformations)
   └── YAML/Seed (static data)

4. Commands (make it happen)
   ├── bruin run (execute)
   ├── bruin validate (check)
   └── bruin query (inspect)
```

## Quick Reference

```bash
# Initialize new project
bruin init zoomcamp my-pipeline

# Validate before running
bruin validate ./pipeline/pipeline.yml

# Run entire pipeline
bruin run ./pipeline/pipeline.yml

# Run with date range
bruin run ./pipeline/pipeline.yml \
  --start-date 2020-01-01 \
  --end-date 2020-01-31

# Run single asset with downstream
bruin run ./pipeline/pipeline.yml \
  --asset ingestion.trips \
  --downstream

# View lineage
bruin lineage ./pipeline/pipeline.yml

# Query a table
bruin query --connection duckdb-default \
  --query "SELECT COUNT(*) FROM staging.trips"
```

## Further Reading

- [Bruin Documentation - CLI Reference](https://getbruin.com/docs/bruin/commands/overview.html)
- [Bruin GitHub Repository](https://github.com/bruin-data/bruin)


================================================
FILE: 06-batch/.gitignore
================================================


================================================
FILE: 06-batch/README.md
================================================
# Module 6: Batch Processing

## 6.1 Introduction

* :movie_camera: 6.1.1 Introduction to Batch Processing

[![](https://markdown-videos-api.jorgenkh.no/youtube/dcHe5Fl3MF8)](https://youtu.be/dcHe5Fl3MF8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=51)

* :movie_camera: 6.1.2 Introduction to Spark

[![](https://markdown-videos-api.jorgenkh.no/youtube/FhaqbEOuQ8U)](https://youtu.be/FhaqbEOuQ8U&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=52)


## 6.2 Installation

Follow [these instructions](setup/) to install Spark:

* [Windows](setup/windows.md)
* [Linux](setup/linux.md)
* [MacOS](setup/macos.md)

:movie_camera: 6.2.1 (Optional) Installing Spark (Linux)

[![](https://markdown-videos-api.jorgenkh.no/youtube/hqUbB9c8sKg)](https://youtu.be/hqUbB9c8sKg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=53)

Alternatively, if the setups above don't work, you can run Spark in Google Colab.
> [!NOTE]  
> It's advisable to invest some time in setting things up locally rather than immediately jumping into this solution

* [Google Colab Instructions](https://medium.com/gitconnected/launch-spark-on-google-colab-and-connect-to-sparkui-342cad19b304)
* [Google Colab Starter Notebook](https://github.com/aaalexlit/medium_articles/blob/main/Spark_in_Colab.ipynb)


## 6.3 Spark SQL and DataFrames

* :movie_camera: 6.3.1 First Look at Spark/PySpark

[![](https://markdown-videos-api.jorgenkh.no/youtube/r_Sf6fCB40c)](https://youtu.be/r_Sf6fCB40c&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=54)

* :movie_camera: 6.3.2 Spark Dataframes

[![](https://markdown-videos-api.jorgenkh.no/youtube/ti3aC1m3rE8)](https://youtu.be/ti3aC1m3rE8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=55)

* :movie_camera: 6.3.3 (Optional) Preparing Yellow and Green Taxi Data

[![](https://markdown-videos-api.jorgenkh.no/youtube/CI3P4tAtru4)](https://youtu.be/CI3P4tAtru4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=56)

Script to prepare the dataset: [download_data.sh](code/download_data.sh)

> [!NOTE]  
> Besides pandas, another way to infer the schema of the CSV files is to set the `inferSchema` option to `true` when reading them in Spark.

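The `inferSchema` route mentioned in the note above can be sketched as follows. This is a minimal illustration, not code from the course notebooks: the helper name `read_csv_with_inferred_schema` is hypothetical, and `spark` is assumed to be an existing `SparkSession`. Schema inference makes Spark take an extra pass over the data, so an explicit schema is usually the better choice for large files.

```python
def read_csv_with_inferred_schema(spark, path):
    """Read a CSV file and let Spark infer column types from the data.

    `spark` is assumed to be an existing SparkSession; the function name
    is hypothetical and only used for illustration.
    """
    return (
        spark.read
        .option("header", "true")       # first line holds the column names
        .option("inferSchema", "true")  # infer types instead of reading strings
        .csv(path)
    )


# Example usage (requires a local PySpark installation):
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
#   df = read_csv_with_inferred_schema(spark, "taxi_zone_lookup.csv")
#   df.printSchema()
```

Without `inferSchema`, Spark reads every CSV column as a string, which is the behaviour the pandas-based approach in the video works around.
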
* :movie_camera: 6.3.4 SQL with Spark

[![](https://markdown-videos-api.jorgenkh.no/youtube/uAlp2VuZZPY)](https://youtu.be/uAlp2VuZZPY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=57)


## 6.4 Spark Internals

* :movie_camera: 6.4.1 Anatomy of a Spark Cluster

[![](https://markdown-videos-api.jorgenkh.no/youtube/68CipcZt7ZA)](https://youtu.be/68CipcZt7ZA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=58)

* :movie_camera: 6.4.2 GroupBy in Spark

[![](https://markdown-videos-api.jorgenkh.no/youtube/9qrDsY_2COo)](https://youtu.be/9qrDsY_2COo&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=59)

* :movie_camera: 6.4.3 Joins in Spark

[![](https://markdown-videos-api.jorgenkh.no/youtube/lu7TrqAWuH4)](https://youtu.be/lu7TrqAWuH4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=60)

## 6.5 (Optional) Resilient Distributed Datasets

* :movie_camera: 6.5.1 Operations on Spark RDDs

[![](https://markdown-videos-api.jorgenkh.no/youtube/Bdu-xIrF3OM)](https://youtu.be/Bdu-xIrF3OM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=61)

* :movie_camera: 6.5.2 Spark RDD mapPartition

[![](https://markdown-videos-api.jorgenkh.no/youtube/k3uB2K99roI)](https://youtu.be/k3uB2K99roI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=62)


## 6.6 Running Spark in the Cloud

* :movie_camera: 6.6.1 Connecting to Google Cloud Storage

[![](https://markdown-videos-api.jorgenkh.no/youtube/Yyz293hBVcQ)](https://youtu.be/Yyz293hBVcQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=63)

* :movie_camera: 6.6.2 Creating a Local Spark Cluster

[![](https://markdown-videos-api.jorgenkh.no/youtube/HXBwSlXo5IA)](https://youtu.be/HXBwSlXo5IA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=64)

* :movie_camera: 6.6.3 Setting up a Dataproc Cluster

[![](https://markdown-videos-api.jorgenkh.no/youtube/osAiAYahvh8)](https://youtu.be/osAiAYahvh8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=65)

* :movie_camera: 6.6.4 Connecting Spark to BigQuery

[![](https://markdown-videos-api.jorgenkh.no/youtube/HIm2BOj8C0Q)](https://youtu.be/HIm2BOj8C0Q&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=66)


# Homework

* [2026 Homework](../cohorts/2026/06-batch/homework.md)


# Community notes

<details>
<summary>Did you take notes? You can share them here</summary>

* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/5_batch_processing.md)
* [Sandy's DE Learning Blog](https://learningdataengineering540969211.wordpress.com/2022/02/24/week-5-de-zoomcamp-5-2-1-installing-spark-on-linux/)
* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week5.md)
* [Alternative: Using docker-compose to launch Spark, by rafik](https://gist.github.com/rafik-rahoui/f98df941c4ccced9c46e9ccbdef63a03)
* [Marcos Torregrosa's blog (Spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-5-batch-spark)
* [Notes by Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week5)
* [Notes by Oscar Garcia](https://github.com/ozkary/Data-Engineering-Bootcamp/tree/main/Step5-Batch-Processing)
* [Notes by HongWei](https://github.com/hwchua0209/data-engineering-zoomcamp-submission/blob/main/05-batch-processing/README.md)
* [2024 videos transcript](https://drive.google.com/drive/folders/1XMmP4H5AMm1qCfMFxc_hqaPGw31KIVcb?usp=drive_link) by Maria Fisher 
* [2025 Notes by Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/5_Batch-Processing-Spark/README.md)
* [2025 Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/05_batch_processing/00_notes.md)
* [2025 Notes on Installing Spark on MacOS (with Anaconda + brew) by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/05_batch_processing/01_env_setup.md)
* [2025 Notes by Daniel Lachner](https://github.com/mossdet/dlp_data_eng/blob/main/Notes/05_01_Batch_Processing_Spark_GCP.pdf)
* [2026 Notes by Ajay Katte](https://github.com/mushroomsandchai/dtdez/tree/main/06_batch_processing/notes)
* Add your notes here (above this line)

</details>


================================================
FILE: 06-batch/code/03_test.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "72505747",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyspark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "bd55afbe",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/__init__.py'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pyspark.__file__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "29f1cf4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import SparkSession"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "cf6d80ad",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: An illegal reflective access operation has occurred\n",
      "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
      "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
      "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
      "WARNING: All illegal access operations will be denied in a future release\n",
      "22/02/15 22:22:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    }
   ],
   "source": [
    "spark = SparkSession.builder \\\n",
    "    .master(\"local[*]\") \\\n",
    "    .appName('test') \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f604529",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2022-02-15 22:23:22--  https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv\n",
      "Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.196.8\n",
      "Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.196.8|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 12322 (12K) [application/octet-stream]\n",
      "Saving to: ‘taxi+_zone_lookup.csv’\n",
      "\n",
      "taxi+_zone_lookup.c 100%[===================>]  12.03K  --.-KB/s    in 0s      \n",
      "\n",
      "2022-02-15 22:23:23 (114 MB/s) - ‘taxi+_zone_lookup.csv’ saved [12322/12322]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12342345",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"LocationID\",\"Borough\",\"Zone\",\"service_zone\"\r\n",
      "\r\n",
      "1,\"EWR\",\"Newark Airport\",\"EWR\"\r\n",
      "\r\n",
      "2,\"Queens\",\"Jamaica Bay\",\"Boro Zone\"\r\n",
      "\r\n",
      "3,\"Bronx\",\"Allerton/Pelham Gardens\",\"Boro Zone\"\r\n",
      "\r\n",
      "4,\"Manhattan\",\"Alphabet City\",\"Yellow Zone\"\r\n",
      "\r\n",
      "5,\"Staten Island\",\"Arden Heights\",\"Boro Zone\"\r\n",
      "\r\n",
      "6,\"Staten Island\",\"Arrochar/Fort Wadsworth\",\"Boro Zone\"\r\n",
      "\r\n",
      "7,\"Queens\",\"Astoria\",\"Boro Zone\"\r\n",
      "\r\n",
      "8,\"Queens\",\"Astoria Park\",\"Boro Zone\"\r\n",
      "\r\n",
      "9,\"Queens\",\"Auburndale\",\"Boro Zone\"\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!head taxi_zone_lookup.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "809464d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = spark.read \\\n",
    "    .option(\"header\", \"true\") \\\n",
    "    .csv('taxi_zone_lookup.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "e36dd996",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+----------+-------------+--------------------+------------+\n",
      "|LocationID|      Borough|                Zone|service_zone|\n",
      "+----------+-------------+--------------------+------------+\n",
      "|         1|          EWR|      Newark Airport|         EWR|\n",
      "|         2|       Queens|         Jamaica Bay|   Boro Zone|\n",
      "|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|\n",
      "|         4|    Manhattan|       Alphabet City| Yellow Zone|\n",
      "|         5|Staten Island|       Arden Heights|   Boro Zone|\n",
      "|         6|Staten Island|Arrochar/Fort Wad...|   Boro Zone|\n",
      "|         7|       Queens|             Astoria|   Boro Zone|\n",
      "|         8|       Queens|        Astoria Park|   Boro Zone|\n",
      "|         9|       Queens|          Auburndale|   Boro Zone|\n",
      "|        10|       Queens|        Baisley Park|   Boro Zone|\n",
      "|        11|     Brooklyn|          Bath Beach|   Boro Zone|\n",
      "|        12|    Manhattan|        Battery Park| Yellow Zone|\n",
      "|        13|    Manhattan|   Battery Park City| Yellow Zone|\n",
      "|        14|     Brooklyn|           Bay Ridge|   Boro Zone|\n",
      "|        15|       Queens|Bay Terrace/Fort ...|   Boro Zone|\n",
      "|        16|       Queens|             Bayside|   Boro Zone|\n",
      "|        17|     Brooklyn|             Bedford|   Boro Zone|\n",
      "|        18|        Bronx|        Bedford Park|   Boro Zone|\n",
      "|        19|       Queens|           Bellerose|   Boro Zone|\n",
      "|        20|        Bronx|             Belmont|   Boro Zone|\n",
      "+----------+-------------+--------------------+------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "cb547351",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r\n",
      "[Stage 4:>                                                          (0 + 1) / 1]\r\n",
      "\r\n",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df.write.parquet('zones')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "02fe2bdb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 28K\r\n",
      "-rw-rw-r-- 1 alexey alexey 6.8K Feb 15 22:25 Untitled.ipynb\r\n",
      "-rw-rw-r-- 1 alexey alexey  13K Aug 17  2016 taxi+_zone_lookup.csv\r\n",
      "drwxr-xr-x 2 alexey alexey 4.0K Feb 15 22:25 zones\r\n"
     ]
    }
   ],
   "source": [
    "!ls -lh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "659f0812",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 06-batch/code/04_pyspark.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "07de9dc3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyspark\n",
    "from pyspark.sql import SparkSession"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ca5bbb06",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: An illegal reflective access operation has occurred\n",
      "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
      "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
      "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
      "WARNING: All illegal access operations will be denied in a future release\n",
      "22/02/16 21:11:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    }
   ],
   "source": [
    "spark = SparkSession.builder \\\n",
    "    .master(\"local[*]\") \\\n",
    "    .appName('test') \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "cf8de204",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2022-02-16 21:13:50--  https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-01.csv\n",
      "Resolving nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)... 52.217.84.132\n",
      "Connecting to nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)|52.217.84.132|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 752335705 (717M) [text/csv]\n",
      "Saving to: ‘fhvhv_tripdata_2021-01.csv’\n",
      "\n",
      "fhvhv_tripdata_2021 100%[===================>] 717.48M  35.6MB/s    in 21s     \n",
      "\n",
      "2022-02-16 21:14:11 (34.4 MB/s) - ‘fhvhv_tripdata_2021-01.csv’ saved [752335705/752335705]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "201a5957",
   "metadata": {},
   "outputs": [],
   "source": [
    "!gzip -dc fhvhv_tripdata_2021-01.csv.gz > fhvhv_tripdata_2021-01.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2a52087c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "11908469 fhvhv_tripdata_2021-01.csv\r\n"
     ]
    }
   ],
   "source": [
    "!wc -l fhvhv_tripdata_2021-01.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "931021a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = spark.read \\\n",
    "    .option(\"header\", \"true\") \\\n",
    "    .csv('fhvhv_tripdata_2021-01.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d44b7839",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "StructType(List(StructField(hvfhs_license_num,StringType,true),StructField(dispatching_base_num,StringType,true),StructField(pickup_datetime,StringType,true),StructField(dropoff_datetime,StringType,true),StructField(PULocationID,StringType,true),StructField(DOLocationID,StringType,true),StructField(SR_Flag,StringType,true)))"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "4249e790",
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -n 1001 fhvhv_tripdata_2021-01.csv > head.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "6894312c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "f3ca771b",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_pandas = pd.read_csv('head.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "f1066b4f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "hvfhs_license_num        object\n",
       "dispatching_base_num     object\n",
       "pickup_datetime          object\n",
       "dropoff_datetime         object\n",
       "PULocationID              int64\n",
       "DOLocationID              int64\n",
       "SR_Flag                 float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_pandas.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "f8413c9d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "StructType(List(StructField(hvfhs_license_num,StringType,true),StructField(dispatching_base_num,StringType,true),StructField(pickup_datetime,StringType,true),StructField(dropoff_datetime,StringType,true),StructField(PULocationID,LongType,true),StructField(DOLocationID,LongType,true),StructField(SR_Flag,DoubleType,true)))"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "spark.createDataFrame(df_pandas).schema"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80f252c1",
   "metadata": {},
   "source": [
    "* Integer - 4 bytes\n",
    "* Long - 8 bytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "16937bfd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "fc61a99a",
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = types.StructType([\n",
    "    types.StructField('hvfhs_license_num', types.StringType(), True),\n",
    "    types.StructField('dispatching_base_num', types.StringType(), True),\n",
    "    types.StructField('pickup_datetime', types.TimestampType(), True),\n",
    "    types.StructField('dropoff_datetime', types.TimestampType(), True),\n",
    "    types.StructField('PULocationID', types.IntegerType(), True),\n",
    "    types.StructField('DOLocationID', types.IntegerType(), True),\n",
    "    types.StructField('SR_Flag', types.StringType(), True)\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "f94052ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = spark.read \\\n",
    "    .option(\"header\", \"true\") \\\n",
    "    .schema(schema) \\\n",
    "    .csv('fhvhv_tripdata_2021-01.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "c270d9d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.repartition(24)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7796c2b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.write.parquet('fhvhv/2021/01/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "c3cab876",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = spark.read.parquet('fhvhv/2021/01/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "203b5627",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- hvfhs_license_num: string (nullable = true)\n",
      " |-- dispatching_base_num: string (nullable = true)\n",
      " |-- pickup_datetime: timestamp (nullable = true)\n",
      " |-- dropoff_datetime: timestamp (nullable = true)\n",
      " |-- PULocationID: integer (nullable = true)\n",
      " |-- DOLocationID: integer (nullable = true)\n",
      " |-- SR_Flag: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64172a47",
   "metadata": {},
   "source": [
    "SELECT * FROM df WHERE hvfhs_license_num = 'HV0003'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "d24840a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import functions as F"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "3ab1ca44",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+\n",
      "|hvfhs_license_num|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|\n",
      "+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+\n",
      "|           HV0005|              B02510|2021-01-07 06:43:22|2021-01-07 06:55:06|         142|         230|   null|\n",
      "|           HV0005|              B02510|2021-01-01 16:01:26|2021-01-01 16:20:20|         133|          91|   null|\n",
      "|           HV0003|              B02764|2021-01-01 00:23:13|2021-01-01 00:30:35|         147|         159|   null|\n",
      "|           HV0003|              B02869|2021-01-06 11:43:12|2021-01-06 11:55:07|          79|         164|   null|\n",
      "|           HV0003|              B02884|2021-01-04 15:35:32|2021-01-04 15:52:02|         174|          18|   null|\n",
      "|           HV0003|              B02875|2021-01-04 13:42:15|2021-01-04 14:04:57|         201|         180|   null|\n",
      "|           HV0005|              B02510|2021-01-04 18:57:31|2021-01-04 19:09:55|         230|         142|   null|\n",
      "|           HV0003|              B02872|2021-01-03 18:42:03|2021-01-03 19:12:22|         132|          72|   null|\n",
      "|           HV0004|              B02800|2021-01-01 05:31:50|2021-01-01 05:40:03|         188|          61|   null|\n",
      "|           HV0005|              B02510|2021-01-04 20:21:47|2021-01-04 20:26:03|          97|         189|   null|\n",
      "|           HV0003|              B02764|2021-01-01 01:51:18|2021-01-01 02:05:32|         174|         235|   null|\n",
      "|           HV0003|              B02871|2021-01-05 10:20:54|2021-01-05 10:32:44|          35|          76|   null|\n",
      "|           HV0005|              B02510|2021-01-06 02:32:09|2021-01-06 02:43:35|          35|          39|   null|\n",
      "|           HV0003|              B02882|2021-01-04 12:34:52|2021-01-04 12:38:59|         231|          13|   null|\n",
      "|           HV0003|              B02617|2021-01-02 20:12:56|2021-01-02 20:41:18|          87|         127|   null|\n",
      "|           HV0005|              B02510|2021-01-02 16:55:48|2021-01-02 17:20:40|          17|          89|   null|\n",
      "|           HV0003|              B02869|2021-01-02 15:14:38|2021-01-02 15:23:27|          11|          14|   null|\n",
      "|           HV0005|              B02510|2021-01-01 05:54:50|2021-01-01 06:03:46|          21|          26|   null|\n",
      "|           HV0003|              B02869|2021-01-04 12:40:42|2021-01-04 12:48:34|          83|         260|   null|\n",
      "|           HV0005|              B02510|2021-01-01 14:58:57|2021-01-01 15:09:53|         189|          52|   null|\n",
      "+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "6d98c2ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "def crazy_stuff(base_num):\n",
    "    # Toy transformation used to demonstrate a UDF: drop the leading letter,\n",
    "    # then prefix the hex form of the number based on its divisibility.\n",
    "    num = int(base_num[1:])\n",
    "    if num % 7 == 0:\n",
    "        return f's/{num:03x}'\n",
    "    elif num % 3 == 0:\n",
    "        return f'a/{num:03x}'\n",
    "    else:\n",
    "        return f'e/{num:03x}'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "f3175419",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'s/b44'"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "crazy_stuff('B02884')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "9bb5d503",
   "metadata": {},
   "outputs": [],
   "source": [
    "crazy_stuff_udf = F.udf(crazy_stuff, returnType=types.StringType())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "b38f0465",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+-----------+------------+------------+------------+\n",
      "|base_id|pickup_date|dropoff_date|PULocationID|DOLocationID|\n",
      "+-------+-----------+------------+------------+------------+\n",
      "|  e/9ce| 2021-01-07|  2021-01-07|         142|         230|\n",
      "|  e/9ce| 2021-01-01|  2021-01-01|         133|          91|\n",
      "|  e/acc| 2021-01-01|  2021-01-01|         147|         159|\n",
      "|  e/b35| 2021-01-06|  2021-01-06|          79|         164|\n",
      "|  s/b44| 2021-01-04|  2021-01-04|         174|          18|\n",
      "|  e/b3b| 2021-01-04|  2021-01-04|         201|         180|\n",
      "|  e/9ce| 2021-01-04|  2021-01-04|         230|         142|\n",
      "|  e/b38| 2021-01-03|  2021-01-03|         132|          72|\n",
      "|  s/af0| 2021-01-01|  2021-01-01|         188|          61|\n",
      "|  e/9ce| 2021-01-04|  2021-01-04|          97|         189|\n",
      "|  e/acc| 2021-01-01|  2021-01-01|         174|         235|\n",
      "|  a/b37| 2021-01-05|  2021-01-05|          35|          76|\n",
      "|  e/9ce| 2021-01-06|  2021-01-06|          35|          39|\n",
      "|  e/b42| 2021-01-04|  2021-01-04|         231|          13|\n",
      "|  e/a39| 2021-01-02|  2021-01-02|          87|         127|\n",
      "|  e/9ce| 2021-01-02|  2021-01-02|          17|          89|\n",
      "|  e/b35| 2021-01-02|  2021-01-02|          11|          14|\n",
      "|  e/9ce| 2021-01-01|  2021-01-01|          21|          26|\n",
      "|  e/b35| 2021-01-04|  2021-01-04|          83|         260|\n",
      "|  e/9ce| 2021-01-01|  2021-01-01|         189|          52|\n",
      "+-------+-----------+------------+------------+------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df \\\n",
    "    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \\\n",
    "    .withColumn('dropoff_date', F.to_date(df.dropoff_datetime)) \\\n",
    "    .withColumn('base_id', crazy_stuff_udf(df.dispatching_base_num)) \\\n",
    "    .select('base_id', 'pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID') \\\n",
    "    .show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "00921644",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Row(pickup_datetime=datetime.datetime(2021, 1, 1, 0, 23, 13), dropoff_datetime=datetime.datetime(2021, 1, 1, 0, 30, 35), PULocationID=147, DOLocationID=159),\n",
       " Row(pickup_datetime=datetime.datetime(2021, 1, 6, 11, 43, 12), dropoff_datetime=datetime.datetime(2021, 1, 6, 11, 55, 7), PULocationID=79, DOLocationID=164),\n",
       " Row(pickup_datetime=datetime.datetime(2021, 1, 4, 15, 35, 32), dropoff_datetime=datetime.datetime(2021, 1, 4, 15, 52, 2), PULocationID=174, DOLocationID=18),\n",
       " Row(pickup_datetime=datetime.datetime(2021, 1, 4, 13, 42, 15), dropoff_datetime=datetime.datetime(2021, 1, 4, 14, 4, 57), PULocationID=201, DOLocationID=180),\n",
       " Row(pickup_datetime=datetime.datetime(2021, 1, 3, 18, 42, 3), dropoff_datetime=datetime.datetime(2021, 1, 3, 19, 12, 22), PULocationID=132, DOLocationID=72)]"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.filter(df.hvfhs_license_num == 'HV0003') \\\n",
    "  .select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "0866f9c0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hvfhs_license_num,dispatching_base_num,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,SR_Flag\r\n",
      "\r\n",
      "HV0003,B02682,2021-01-01 00:33:44,2021-01-01 00:49:07,230,166,\r\n",
      "\r\n",
      "HV0003,B02682,2021-01-01 00:55:19,2021-01-01 01:18:21,152,167,\r\n",
      "\r\n",
      "HV0003,B02764,2021-01-01 00:23:56,2021-01-01 00:38:05,233,142,\r\n",
      "\r\n",
      "HV0003,B02764,2021-01-01 00:42:51,2021-01-01 00:45:50,142,143,\r\n",
      "\r\n",
      "HV0003,B02764,2021-01-01 00:48:14,2021-01-01 01:08:42,143,78,\r\n",
      "\r\n",
      "HV0005,B02510,2021-01-01 00:06:59,2021-01-01 00:43:01,88,42,\r\n",
      "\r\n",
      "HV0005,B02510,2021-01-01 00:50:00,2021-01-01 01:04:57,42,151,\r\n",
      "\r\n",
      "HV0003,B02764,2021-01-01 00:14:30,2021-01-01 00:50:27,71,226,\r\n",
      "\r\n",
      "HV0003,B02875,2021-01-01 00:22:54,2021-01-01 00:30:20,112,255,\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!head -n 10 head.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa1b0e18",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 06-batch/code/05_taxi_schema.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8c1d0c08",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyspark\n",
    "from pyspark.sql import SparkSession"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "96a248f5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: An illegal reflective access operation has occurred\n",
      "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
      "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
      "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
      "WARNING: All illegal access operations will be denied in a future release\n",
      "22/02/17 21:59:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    }
   ],
   "source": [
    "spark = SparkSession.builder \\\n",
    "    .master(\"local[*]\") \\\n",
    "    .appName('test') \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c53274b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "5d8434e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import types"
   ]
  },
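  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "a84c6c6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Explicit schemas: with only the header option set, Spark would read every CSV column as a string\n",
    "green_schema = types.StructType([\n",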
    "    types.StructField(\"VendorID\", types.IntegerType(), True),\n",
    "    types.StructField(\"lpep_pickup_datetime\", types.TimestampType(), True),\n",
    "    types.StructField(\"lpep_dropoff_datetime\", types.TimestampType(), True),\n",
    "    types.StructField(\"store_and_fwd_flag\", types.StringType(), True),\n",
    "    types.StructField(\"RatecodeID\", types.IntegerType(), True),\n",
    "    types.StructField(\"PULocationID\", types.IntegerType(), True),\n",
    "    types.StructField(\"DOLocationID\", types.IntegerType(), True),\n",
    "    types.StructField(\"passenger_count\", types.IntegerType(), True),\n",
    "    types.StructField(\"trip_distance\", types.DoubleType(), True),\n",
    "    types.StructField(\"fare_amount\", types.DoubleType(), True),\n",
    "    types.StructField(\"extra\", types.DoubleType(), True),\n",
    "    types.StructField(\"mta_tax\", types.DoubleType(), True),\n",
    "    types.StructField(\"tip_amount\", types.DoubleType(), True),\n",
    "    types.StructField(\"tolls_amount\", types.DoubleType(), True),\n",
    "    types.StructField(\"ehail_fee\", types.DoubleType(), True),\n",
    "    types.StructField(\"improvement_surcharge\", types.DoubleType(), True),\n",
    "    types.StructField(\"total_amount\", types.DoubleType(), True),\n",
    "    types.StructField(\"payment_type\", types.IntegerType(), True),\n",
    "    types.StructField(\"trip_type\", types.IntegerType(), True),\n",
    "    types.StructField(\"congestion_surcharge\", types.DoubleType(), True)\n",
    "])\n",
    "\n",
    "yellow_schema = types.StructType([\n",
    "    types.StructField(\"VendorID\", types.IntegerType(), True),\n",
    "    types.StructField(\"tpep_pickup_datetime\", types.TimestampType(), True),\n",
    "    types.StructField(\"tpep_dropoff_datetime\", types.TimestampType(), True),\n",
    "    types.StructField(\"passenger_count\", types.IntegerType(), True),\n",
    "    types.StructField(\"trip_distance\", types.DoubleType(), True),\n",
    "    types.StructField(\"RatecodeID\", types.IntegerType(), True),\n",
    "    types.StructField(\"store_and_fwd_flag\", types.StringType(), True),\n",
    "    types.StructField(\"PULocationID\", types.IntegerType(), True),\n",
    "    types.StructField(\"DOLocationID\", types.IntegerType(), True),\n",
    "    types.StructField(\"payment_type\", types.IntegerType(), True),\n",
    "    types.StructField(\"fare_amount\", types.DoubleType(), True),\n",
    "    types.StructField(\"extra\", types.DoubleType(), True),\n",
    "    types.StructField(\"mta_tax\", types.DoubleType(), True),\n",
    "    types.StructField(\"tip_amount\", types.DoubleType(), True),\n",
    "    types.StructField(\"tolls_amount\", types.DoubleType(), True),\n",
    "    types.StructField(\"improvement_surcharge\", types.DoubleType(), True),\n",
    "    types.StructField(\"total_amount\", types.DoubleType(), True),\n",
    "    types.StructField(\"congestion_surcharge\", types.DoubleType(), True)\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "3f7e0cb9",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/1\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/3\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/5\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/6\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/7\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/8\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/9\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/10\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/11\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/12\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "year = 2020\n",
    "\n",
    "for month in range(1, 13):\n",
    "    print(f'processing data for {year}/{month}')\n",
    "\n",
    "    input_path = f'data/raw/green/{year}/{month:02d}/'\n",
    "    output_path = f'data/pq/green/{year}/{month:02d}/'\n",
    "\n",
    "    df_green = spark.read \\\n",
    "        .option(\"header\", \"true\") \\\n",
    "        .schema(green_schema) \\\n",
    "        .csv(input_path)\n",
    "\n",
    "    df_green \\\n",
    "        .repartition(4) \\\n",
    "        .write.parquet(output_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "96ac2ad7",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/1\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/3\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/5\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/6\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/7\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 15:>                                                         (0 + 1) / 1]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/8\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "                                                                                \r"
     ]
    },
    {
     "ename": "AnalysisException",
     "evalue": "Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/green/2021/08;",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mAnalysisException\u001b[0m                         Traceback (most recent call last)",
      "\u001b[0;32m/tmp/ipykernel_129101/906373977.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      7\u001b[0m     \u001b[0moutput_path\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34mf'data/pq/green/{year}/{month:02d}/'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m     \u001b[0mdf_green\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mspark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     10\u001b[0m         \u001b[0;34m.\u001b[0m\u001b[0moption\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"header\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"true\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     11\u001b[0m         \u001b[0;34m.\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mgreen_schema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/readwriter.py\u001b[0m in \u001b[0;36mcsv\u001b[0;34m(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup)\u001b[0m\n\u001b[1;32m    536\u001b[0m             \u001b[0mpath\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    537\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 538\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_df\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jreader\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcsv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_spark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_sc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jvm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPythonUtils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtoSeq\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    539\u001b[0m         \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mRDD\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    540\u001b[0m             \u001b[0;32mdef\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miterator\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m   1302\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1303\u001b[0m         \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1304\u001b[0;31m         return_value = get_return_value(\n\u001b[0m\u001b[1;32m   1305\u001b[0m             answer, self.gateway_client, self.target_id, self.name)\n\u001b[1;32m   1306\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m    132\u001b[0m                 \u001b[0;31m# Hide where the exception came from that shows a non-Pythonic\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    133\u001b[0m                 \u001b[0;31m# JVM exception message.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 134\u001b[0;31m                 \u001b[0mraise_from\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconverted\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    135\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    136\u001b[0m                 \u001b[0;32mraise\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mraise_from\u001b[0;34m(e)\u001b[0m\n",
      "\u001b[0;31mAnalysisException\u001b[0m: Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/green/2021/08;"
     ]
    }
   ],
   "source": [
    "year = 2021\n",
    "\n",
    "for month in range(1, 13):  # raw data for 2021 ends at month 07, so month 08 raises AnalysisException\n",
    "    print(f'processing data for {year}/{month}')\n",
    "\n",
    "    input_path = f'data/raw/green/{year}/{month:02d}/'\n",
    "    output_path = f'data/pq/green/{year}/{month:02d}/'\n",
    "\n",
    "    df_green = spark.read \\\n",
    "        .option(\"header\", \"true\") \\\n",
    "        .schema(green_schema) \\\n",
    "        .csv(input_path)\n",
    "\n",
    "    df_green \\\n",
    "        .repartition(4) \\\n",
    "        .write.parquet(output_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "19326bc9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/1\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/3\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/5\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/6\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/7\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/8\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/9\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/10\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/11\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2020/12\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "year = 2020\n",
    "\n",
    "for month in range(1, 13):\n",
    "    print(f'processing data for {year}/{month}')\n",
    "\n",
    "    input_path = f'data/raw/yellow/{year}/{month:02d}/'\n",
    "    output_path = f'data/pq/yellow/{year}/{month:02d}/'\n",
    "\n",
    "    df_yellow = spark.read \\\n",
    "        .option(\"header\", \"true\") \\\n",
    "        .schema(yellow_schema) \\\n",
    "        .csv(input_path)\n",
    "\n",
    "    df_yellow \\\n",
    "        .repartition(4) \\\n",
    "        .write.parquet(output_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "aeca811a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/1\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/3\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/5\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/6\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/7\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Stage 78:===========================================>              (3 + 1) / 4]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing data for 2021/8\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "                                                                                \r"
     ]
    },
    {
     "ename": "AnalysisException",
     "evalue": "Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/yellow/2021/08;",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mAnalysisException\u001b[0m                         Traceback (most recent call last)",
      "\u001b[0;32m/tmp/ipykernel_129101/2088663510.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      7\u001b[0m     \u001b[0moutput_path\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34mf'data/pq/yellow/{year}/{month:02d}/'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m     \u001b[0mdf_yellow\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mspark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     10\u001b[0m         \u001b[0;34m.\u001b[0m\u001b[0moption\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"header\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"true\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     11\u001b[0m         \u001b[0;34m.\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0myellow_schema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/readwriter.py\u001b[0m in \u001b[0;36mcsv\u001b[0;34m(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup)\u001b[0m\n\u001b[1;32m    536\u001b[0m             \u001b[0mpath\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    537\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 538\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_df\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jreader\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcsv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_spark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_sc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jvm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPythonUtils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtoSeq\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    539\u001b[0m         \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mRDD\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    540\u001b[0m             \u001b[0;32mdef\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miterator\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m   1302\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1303\u001b[0m         \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1304\u001b[0;31m         return_value = get_return_value(\n\u001b[0m\u001b[1;32m   1305\u001b[0m             answer, self.gateway_client, self.target_id, self.name)\n\u001b[1;32m   1306\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m    132\u001b[0m                 \u001b[0;31m# Hide where the exception came from that shows a non-Pythonic\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    133\u001b[0m                 \u001b[0;31m# JVM exception message.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 134\u001b[0;31m                 \u001b[0mraise_from\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconverted\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    135\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    136\u001b[0m                 \u001b[0;32mraise\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mraise_from\u001b[0;34m(e)\u001b[0m\n",
      "\u001b[0;31mAnalysisException\u001b[0m: Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/yellow/2021/08;"
     ]
    }
   ],
   "source": [
    "year = 2021\n",
    "\n",
    "for month in range(1, 13):  # raw data for 2021 ends at month 07, so month 08 raises AnalysisException\n",
    "    print(f'processing data for {year}/{month}')\n",
    "\n",
    "    input_path = f'data/raw/yellow/{year}/{month:02d}/'\n",
    "    output_path = f'data/pq/yellow/{year}/{month:02d}/'\n",
    "\n",
    "    df_yellow = spark.read \\\n",
    "        .option(\"header\", \"true\") \\\n",
    "        .schema(yellow_schema) \\\n",
    "        .csv(input_path)\n",
    "\n",
    "    df_yellow \\\n",
    "        .repartition(4) \\\n",
    "        .write.parquet(output_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7eb0da9",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 06-batch/code/06_spark_sql.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3307b886",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: An illegal reflective access operation has occurred\n",
      "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
      "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
      "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
      "WARNING: All illegal access operations will be denied in a future release\n",
      "22/02/17 22:43:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    }
   ],
   "source": [
    "import pyspark\n",
    "from pyspark.sql import SparkSession\n",
    "\n",
    "spark = SparkSession.builder \\\n",
    "    .master(\"local[*]\") \\\n",
    "    .appName('test') \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1ee1eb1d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_green = spark.read.parquet('data/pq/green/*/*')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ca5ee99",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "649bb4da",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_green = df_green \\\n",
    "    .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \\\n",
    "    .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "90cd6845",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_yellow = spark.read.parquet('data/pq/yellow/*/*')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "88822efd",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_yellow = df_yellow \\\n",
    "    .withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \\\n",
    "    .withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "610167a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "common_columns = []\n",
    "\n",
    "yellow_columns = set(df_yellow.columns)\n",
    "\n",
    "for col in df_green.columns:\n",
    "    if col in yellow_columns:\n",
    "        common_columns.append(col)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "839d773f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import functions as F"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "2498810a",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_green_sel = df_green \\\n",
    "    .select(common_columns) \\\n",
    "    .withColumn('service_type', F.lit('green'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "19032efc",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_yellow_sel = df_yellow \\\n",
    "    .select(common_columns) \\\n",
    "    .withColumn('service_type', F.lit('yellow'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "f5b0f3d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_trips_data = df_green_sel.unionAll(df_yellow_sel)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "1bed8b33",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+--------+\n",
      "|service_type|   count|\n",
      "+------------+--------+\n",
      "|       green| 2304517|\n",
      "|      yellow|39649199|\n",
      "+------------+--------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_trips_data.groupBy('service_type').count().show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "28cc8fa3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['VendorID',\n",
       " 'pickup_datetime',\n",
       " 'dropoff_datetime',\n",
       " 'store_and_fwd_flag',\n",
       " 'RatecodeID',\n",
       " 'PULocationID',\n",
       " 'DOLocationID',\n",
       " 'passenger_count',\n",
       " 'trip_distance',\n",
       " 'fare_amount',\n",
       " 'extra',\n",
       " 'mta_tax',\n",
       " 'tip_amount',\n",
       " 'tolls_amount',\n",
       " 'improvement_surcharge',\n",
       " 'total_amount',\n",
       " 'payment_type',\n",
       " 'congestion_surcharge',\n",
       " 'service_type']"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_trips_data.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "36e90cbc",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_trips_data.createOrReplaceTempView('trips_data')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "d0e01bf1",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+--------+\n",
      "|service_type|count(1)|\n",
      "+------------+--------+\n",
      "|       green| 2304517|\n",
      "|      yellow|39649199|\n",
      "+------------+--------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "spark.sql(\"\"\"\n",
    "SELECT\n",
    "    service_type,\n",
    "    count(1)\n",
    "FROM\n",
    "    trips_data\n",
    "GROUP BY \n",
    "    service_type\n",
    "\"\"\").show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2ee7038",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_result = spark.sql(\"\"\"\n",
    "SELECT \n",
    "    -- Revenue grouping \n",
    "    PULocationID AS revenue_zone,\n",
    "    date_trunc('month', pickup_datetime) AS revenue_month, \n",
    "    service_type, \n",
    "\n",
    "    -- Revenue calculation \n",
    "    SUM(fare_amount) AS revenue_monthly_fare,\n",
    "    SUM(extra) AS revenue_monthly_extra,\n",
    "    SUM(mta_tax) AS revenue_monthly_mta_tax,\n",
    "    SUM(tip_amount) AS revenue_monthly_tip_amount,\n",
    "    SUM(tolls_amount) AS revenue_monthly_tolls_amount,\n",
    "    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,\n",
    "    SUM(total_amount) AS revenue_monthly_total_amount,\n",
    "    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,\n",
    "\n",
    "    -- Additional calculations\n",
    "    AVG(passenger_count) AS avg_monthly_passenger_count,\n",
    "    AVG(trip_distance) AS avg_monthly_trip_distance\n",
    "FROM\n",
    "    trips_data\n",
    "GROUP BY\n",
    "    1, 2, 3\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "f67eeb92",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_result.coalesce(1).write.parquet('data/report/revenue/', mode='overwrite')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f56a885d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 06-batch/code/06_spark_sql.py
================================================
#!/usr/bin/env python
# coding: utf-8

import argparse

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


parser = argparse.ArgumentParser()

parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)

args = parser.parse_args()

input_green = args.input_green
input_yellow = args.input_yellow
output = args.output


spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()

df_green = spark.read.parquet(input_green)

df_green = df_green \
    .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \
    .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')

df_yellow = spark.read.parquet(input_yellow)


df_yellow = df_yellow \
    .withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \
    .withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')


common_columns = [
    'VendorID',
    'pickup_datetime',
    'dropoff_datetime',
    'store_and_fwd_flag',
    'RatecodeID',
    'PULocationID',
    'DOLocationID',
    'passenger_count',
    'trip_distance',
    'fare_amount',
    'extra',
    'mta_tax',
    'tip_amount',
    'tolls_amount',
    'improvement_surcharge',
    'total_amount',
    'payment_type',
    'congestion_surcharge'
]


df_green_sel = df_green \
    .select(common_columns) \
    .withColumn('service_type', F.lit('green'))

df_yellow_sel = df_yellow \
    .select(common_columns) \
    .withColumn('service_type', F.lit('yellow'))


df_trips_data = df_green_sel.unionAll(df_yellow_sel)

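# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
df_trips_data.createOrReplaceTempView('trips_data')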


df_result = spark.sql("""
SELECT 
    -- Revenue grouping 
    PULocationID AS revenue_zone,
    date_trunc('month', pickup_datetime) AS revenue_month, 
    service_type, 

    -- Revenue calculation 
    SUM(fare_amount) AS revenue_monthly_fare,
    SUM(extra) AS revenue_monthly_extra,
    SUM(mta_tax) AS revenue_monthly_mta_tax,
    SUM(tip_amount) AS revenue_monthly_tip_amount,
    SUM(tolls_amount) AS revenue_monthly_tolls_amount,
    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,
    SUM(total_amount) AS revenue_monthly_total_amount,
    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,

    -- Additional calculations
    AVG(passenger_count) AS avg_monthly_passenger_count,
    AVG(trip_distance) AS avg_monthly_trip_distance
FROM
    trips_data
GROUP BY
    1, 2, 3
""")


df_result.coalesce(1) \
    .write.parquet(output, mode='overwrite')


================================================
FILE: 06-batch/code/06_spark_sql_big_query.py
================================================
#!/usr/bin/env python
# coding: utf-8

import argparse

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


parser = argparse.ArgumentParser()

parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)

args = parser.parse_args()

input_green = args.input_green
input_yellow = args.input_yellow
output = args.output


spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()

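# Temporary GCS bucket used when writing to BigQuery.
# The name below is specific to the original author's Dataproc cluster; replace it with your own bucket.
spark.conf.set('temporaryGcsBucket', 'dataproc-temp-europe-west6-828225226997-fckhkym8')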

df_green = spark.read.parquet(input_green)

df_green = df_green \
    .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \
    .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')

df_yellow = spark.read.parquet(input_yellow)


df_yellow = df_yellow \
    .withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \
    .withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')


common_columns = [
    'VendorID',
    'pickup_datetime',
    'dropoff_datetime',
    'store_and_fwd_flag',
    'RatecodeID',
    'PULocationID',
    'DOLocationID',
    'passenger_count',
    'trip_distance',
    'fare_amount',
    'extra',
    'mta_tax',
    'tip_amount',
    'tolls_amount',
    'improvement_surcharge',
    'total_amount',
    'payment_type',
    'congestion_surcharge'
]


df_green_sel = df_green \
    .select(common_columns) \
    .withColumn('service_type', F.lit('green'))

df_yellow_sel = df_yellow \
    .select(common_columns) \
    .withColumn('service_type', F.lit('yellow'))


df_trips_data = df_green_sel.unionAll(df_yellow_sel)

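# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
df_trips_data.createOrReplaceTempView('trips_data')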


df_result = spark.sql("""
SELECT 
    -- Revenue grouping 
    PULocationID AS revenue_zone,
    date_trunc('month', pickup_datetime) AS revenue_month, 
    service_type, 

    -- Revenue calculation 
    SUM(fare_amount) AS revenue_monthly_fare,
    SUM(extra) AS revenue_monthly_extra,
    SUM(mta_tax) AS revenue_monthly_mta_tax,
    SUM(tip_amount) AS revenue_monthly_tip_amount,
    SUM(tolls_amount) AS revenue_monthly_tolls_amount,
    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,
    SUM(total_amount) AS revenue_monthly_total_amount,
    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,

    -- Additional calculations
    AVG(passenger_count) AS avg_monthly_passenger_count,
    AVG(trip_distance) AS avg_monthly_trip_distance
FROM
    trips_data
GROUP BY
    1, 2, 3
""")


# Write the aggregated result to the BigQuery table passed via --output
df_result.write.format('bigquery') \
    .option('table', output) \
    .save()

================================================
FILE: 06-batch/code/07_groupby_join.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4341e0e6",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: An illegal reflective access operation has occurred\n",
      "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
      "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
      "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
      "WARNING: All illegal access operations will be denied in a future release\n",
      "22/02/18 21:41:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    }
   ],
   "source": [
    "import pyspark\n",
    "from pyspark.sql import SparkSession\n",
    "\n",
    "spark = SparkSession.builder \\\n",
    "    .master(\"local[*]\") \\\n",
    "    .appName('test') \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cd304aec",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_green = spark.read.parquet('data/pq/green/*/*')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "243991f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_green.createOrReplaceTempView('green')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "e43764a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_green_revenue = spark.sql(\"\"\"\n",
    "SELECT \n",
    "    date_trunc('hour', lpep_pickup_datetime) AS hour, \n",
    "    PULocationID AS zone,\n",
    "\n",
    "    SUM(total_amount) AS amount,\n",
    "    COUNT(1) AS number_records\n",
    "FROM\n",
    "    green\n",
    "WHERE\n",
    "    lpep_pickup_datetime >= '2020-01-01 00:00:00'\n",
    "GROUP BY\n",
    "    1, 2\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "3e00310e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_green_revenue \\\n",
    "    .repartition(20) \\\n",
    "    .write.parquet('data/report/revenue/green', mode='overwrite')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "07ebb68c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_yellow = spark.read.parquet('data/pq/yellow/*/*')\n",
    "df_yellow.createOrReplaceTempView('yellow')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "9d5be29d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_yellow_revenue = spark.sql(\"\"\"\n",
    "SELECT \n",
    "    date_trunc('hour', tpep_pickup_datetime) AS hour, \n",
    "    PULocationID AS zone,\n",
    "\n",
    "    SUM(total_amount) AS amount,\n",
    "    COUNT(1) AS number_records\n",
    "FROM\n",
    "    yellow\n",
    "WHERE\n",
    "    tpep_pickup_datetime >= '2020-01-01 00:00:00'\n",
    "GROUP BY\n",
    "    1, 2\n",
    "\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "8bd9264e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_yellow_revenue \\\n",
    "    .repartition(20) \\\n",
    "    .write.parquet('data/report/revenue/yellow', mode='overwrite')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "fd5d74d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_green_revenue = spark.read.parquet('data/report/revenue/green')\n",
    "df_yellow_revenue = spark.read.parquet('data/report/revenue/yellow')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "35015ee6",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_green_revenue_tmp = df_green_revenue \\\n",
    "    .withColumnRenamed('amount', 'green_amount') \\\n",
    "    .withColumnRenamed('number_records', 'green_number_records')\n",
    "\n",
    "df_yellow_revenue_tmp = df_yellow_revenue \\\n",
    "    .withColumnRenamed('amount', 'yellow_amount') \\\n",
    "    .withColumnRenamed('number_records', 'yellow_number_records')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "ec9f34ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_join = df_green_revenue_tmp.join(df_yellow_revenue_tmp, on=['hour', 'zone'], how='outer')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "10238be7",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_join.write.parquet('data/report/revenue/total', mode='overwrite')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "c3af7169",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_join = spark.read.parquet('data/report/revenue/total')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "bc2a6680",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DataFrame[hour: timestamp, zone: int, green_amount: double, green_number_records: bigint, yellow_amount: double, yellow_number_records: bigint]"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_join"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "abb46398",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_zones = spark.read.parquet('zones/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "b3cf98a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_result = df_join.join(df_zones, df_join.zone == df_zones.LocationID)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "5e0614ba",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_result.drop('LocationID', 'zone').write.parquet('tmp/revenue-zones')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f5ca913",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 06-batch/code/08_rdds.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d66f42fd",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: An illegal reflective access operation has occurred\n",
      "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
      "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
      "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
      "WARNING: All illegal access operations will be denied in a future release\n",
      "22/02/21 22:25:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    }
   ],
   "source": [
    "import pyspark\n",
    "from pyspark.sql import SparkSession\n",
    "\n",
    "spark = SparkSession.builder \\\n",
    "    .master(\"local[*]\") \\\n",
    "    .appName('test') \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "646fc343",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 0:>                                                          (0 + 1) / 1]\r",
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_green = spark.read.parquet('data/pq/green/*/*')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "196cccd5",
   "metadata": {},
   "source": [
    "```\n",
    "SELECT \n",
    "    date_trunc('hour', lpep_pickup_datetime) AS hour, \n",
    "    PULocationID AS zone,\n",
    "\n",
    "    SUM(total_amount) AS amount,\n",
    "    COUNT(1) AS number_records\n",
    "FROM\n",
    "    green\n",
    "WHERE\n",
    "    lpep_pickup_datetime >= '2020-01-01 00:00:00'\n",
    "GROUP BY\n",
    "    1, 2\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "74fe52cb",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "rdd = df_green \\\n",
    "    .select('lpep_pickup_datetime', 'PULocationID', 'total_amount') \\\n",
    "    .rdd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "1a0bf382",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "fa2b00f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "start = datetime(year=2020, month=1, day=1)\n",
    "\n",
    "def filter_outliers(row):\n",
    "    return row.lpep_pickup_datetime >= start"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "69dd326d",
   "metadata": {},
   "outputs": [],
   "source": [
    "rows = rdd.take(10)\n",
    "row = rows[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "cd4b7006",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Row(lpep_pickup_datetime=datetime.datetime(2020, 1, 16, 19, 49, 27), PULocationID=260, total_amount=14.3)"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "row"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "d99eb089",
   "metadata": {},
   "outputs": [],
   "source": [
    "def prepare_for_grouping(row): \n",
    "    hour = row.lpep_pickup_datetime.replace(minute=0, second=0, microsecond=0)\n",
    "    zone = row.PULocationID\n",
    "    key = (hour, zone)\n",
    "    \n",
    "    amount = row.total_amount\n",
    "    count = 1\n",
    "    value = (amount, count)\n",
    "\n",
    "    return (key, value)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "cb328a44",
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_revenue(left_value, right_value):\n",
    "    left_amount, left_count = left_value\n",
    "    right_amount, right_count = right_value\n",
    "    \n",
    "    output_amount = left_amount + right_amount\n",
    "    output_count = left_count + right_count\n",
    "    \n",
    "    return (output_amount, output_count)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "2ea260f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import namedtuple"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "7dae6064",
   "metadata": {},
   "outputs": [],
   "source": [
    "RevenueRow = namedtuple('RevenueRow', ['hour', 'zone', 'revenue', 'count'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "e0a98ee4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def unwrap(row):\n",
    "    return RevenueRow(\n",
    "        hour=row[0][0], \n",
    "        zone=row[0][1],\n",
    "        revenue=row[1][0],\n",
    "        count=row[1][1]\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "a09200b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "5c14d15e",
   "metadata": {},
   "outputs": [],
   "source": [
    "result_schema = types.StructType([\n",
    "    types.StructField('hour', types.TimestampType(), True),\n",
    "    types.StructField('zone', types.IntegerType(), True),\n",
    "    types.StructField('revenue', types.DoubleType(), True),\n",
    "    types.StructField('count', types.IntegerType(), True)\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "56ea72ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_result = rdd \\\n",
    "    .filter(filter_outliers) \\\n",
    "    .map(prepare_for_grouping) \\\n",
    "    .reduceByKey(calculate_revenue) \\\n",
    "    .map(unwrap) \\\n",
    "    .toDF(result_schema) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "4675bd3f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_result.write.parquet('tmp/green-revenue')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "255b5503",
   "metadata": {},
   "outputs": [],
   "source": [
    "columns = ['VendorID', 'lpep_pickup_datetime', 'PULocationID', 'DOLocationID', 'trip_distance']\n",
    "\n",
    "duration_rdd = df_green \\\n",
    "    .select(columns) \\\n",
    "    .rdd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "645c3190",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "921e4ef9",
   "metadata": {},
   "outputs": [],
   "source": [
    "rows = duration_rdd.take(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "id": "f50db3eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(rows, columns=columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "5b8ecc53",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['VendorID',\n",
       " 'lpep_pickup_datetime',\n",
       " 'PULocationID',\n",
       " 'DOLocationID',\n",
       " 'trip_distance']"
      ]
     },
     "execution_count": 74,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "6766c0f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# model = ...\n",
    "\n",
    "def model_predict(df):\n",
    "    # Placeholder for a real model call such as model.predict(df)\n",
    "    y_pred = df.trip_distance * 5\n",
    "    return y_pred"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "7437b848",
   "metadata": {},
   "outputs": [],
   "source": [
    "def apply_model_in_batch(rows):\n",
    "    df = pd.DataFrame(rows, columns=columns)\n",
    "    predictions = model_predict(df)\n",
    "    df['predicted_duration'] = predictions\n",
    "\n",
    "    for row in df.itertuples():\n",
    "        yield row"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "id": "580b5845",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_predicts = duration_rdd \\\n",
    "    .mapPartitions(apply_model_in_batch)\\\n",
    "    .toDF() \\\n",
    "    .drop('Index')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "id": "6055d543",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 48:>                                                         (0 + 1) / 1]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------------+\n",
      "|predicted_duration|\n",
      "+------------------+\n",
      "|             12.95|\n",
      "|             31.25|\n",
      "|              14.0|\n",
      "|             12.75|\n",
      "|               0.1|\n",
      "|             11.05|\n",
      "|11.299999999999999|\n",
      "|54.349999999999994|\n",
      "|             15.25|\n",
      "|             91.75|\n",
      "|             12.25|\n",
      "|               3.1|\n",
      "|               7.5|\n",
      "|11.899999999999999|\n",
      "| 78.89999999999999|\n",
      "|              4.45|\n",
      "|              23.2|\n",
      "|              4.85|\n",
      "|              6.65|\n",
      "|              15.1|\n",
      "+------------------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_predicts.select('predicted_duration').show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e91d243",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 06-batch/code/09_spark_gcs.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3307b886",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyspark\n",
    "from pyspark.sql import SparkSession\n",
    "from pyspark.conf import SparkConf\n",
    "from pyspark.context import SparkContext"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9f0ddbff",
   "metadata": {},
   "outputs": [],
   "source": [
    "credentials_location = '/home/alexey/.google/credentials/google_credentials.json'\n",
    "\n",
    "conf = SparkConf() \\\n",
    "    .setMaster('local[*]') \\\n",
    "    .setAppName('test') \\\n",
    "    .set(\"spark.jars\", \"./lib/gcs-connector-hadoop3-2.2.5.jar\") \\\n",
    "    .set(\"spark.hadoop.google.cloud.auth.service.account.enable\", \"true\") \\\n",
    "    .set(\"spark.hadoop.google.cloud.auth.service.account.json.keyfile\", credentials_location)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b83404e8",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: An illegal reflective access operation has occurred\n",
      "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
      "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
      "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
      "WARNING: All illegal access operations will be denied in a future release\n",
      "22/03/30 12:25:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    }
   ],
   "source": [
    "sc = SparkContext(conf=conf)\n",
    "\n",
    "hadoop_conf = sc._jsc.hadoopConfiguration()\n",
    "\n",
    "hadoop_conf.set(\"fs.AbstractFileSystem.gs.impl\",  \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS\")\n",
    "hadoop_conf.set(\"fs.gs.impl\", \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem\")\n",
    "hadoop_conf.set(\"fs.gs.auth.service.account.json.keyfile\", credentials_location)\n",
    "hadoop_conf.set(\"fs.gs.auth.service.account.enable\", \"true\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "c4713e2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "spark = SparkSession.builder \\\n",
    "    .config(conf=sc.getConf()) \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "1ee1eb1d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df_green = spark.read.parquet('gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/*/*')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "104b40ab",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "data": {
      "text/plain": [
       "2304517"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_green.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f56a885d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 06-batch/code/cloud.md
================================================
## Running Spark in the Cloud

### Connecting to Google Cloud Storage 

Uploading data to GCS:

```bash
gsutil -m cp -r pq/ gs://dtc_data_lake_de-zoomcamp-nytaxi/pq
```

Download the jar for connecting to GCS to any location (e.g. the `lib` folder):

```bash
gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar ./lib/
```

**Note**: For other versions of the GCS connector for Hadoop, see [Cloud Storage connector](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#connector-setup-on-non-dataproc-clusters).

See the notebook with the configuration in [09_spark_gcs.ipynb](09_spark_gcs.ipynb).

(Thanks Alvin Do for the instructions!)


### Local Cluster and Spark-Submit

Creating a stand-alone cluster ([docs](https://spark.apache.org/docs/latest/spark-standalone.html)):

```bash
./sbin/start-master.sh
```

Creating a worker:

```bash
URL="spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077"
./sbin/start-slave.sh ${URL}

# for newer versions of Spark, use this instead:
#./sbin/start-worker.sh ${URL}
```

Turn the notebook into a script:

```bash
jupyter nbconvert --to=script 06_spark_sql.ipynb
```

Edit the script and then run it:

```bash 
python 06_spark_sql.py \
    --input_green=data/pq/green/2020/*/ \
    --input_yellow=data/pq/yellow/2020/*/ \
    --output=data/report-2020
```
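
The edit to the exported script is mostly about replacing hard-coded paths with command-line arguments. A minimal sketch of that part, assuming `argparse` is used; the flag names come from the command above, while the helper name `parse_args` is hypothetical:

```python
import argparse


def parse_args(argv=None):
    """Parse the three paths that the commands in this document pass to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_green', required=True)
    parser.add_argument('--input_yellow', required=True)
    parser.add_argument('--output', required=True)
    # argv=None makes argparse read sys.argv, which is what the real script needs
    return parser.parse_args(argv)


# Passing an explicit list here only to make the example reproducible
args = parse_args([
    '--input_green=data/pq/green/2020/*/',
    '--input_yellow=data/pq/yellow/2020/*/',
    '--output=data/report-2020',
])

# These values would replace the hard-coded paths in the exported notebook code
print(args.input_green, args.input_yellow, args.output)
```

In the script itself, `args.input_green` and `args.input_yellow` would then go to `spark.read.parquet(...)` and `args.output` to the final write.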

Use `spark-submit` to run the script on the cluster:

```bash
URL="spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077"

spark-submit \
    --master="${URL}" \
    06_spark_sql.py \
        --input_green=data/pq/green/2021/*/ \
        --input_yellow=data/pq/yellow/2021/*/ \
        --output=data/report-2021
```

### Dataproc

Upload the script to GCS:

```bash
gsutil -m cp -r 06_spark_sql.py gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py
```

Params for the job:

* `--input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2021/*/`
* `--input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2021/*/`
* `--output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2021`


Submit the job to Dataproc with the Google Cloud SDK
([docs](https://cloud.google.com/dataproc/docs/guides/submit-job#dataproc-submit-job-gcloud)):

```bash
gcloud dataproc jobs submit pyspark \
    --cluster=de-zoomcamp-cluster \
    --region=europe-west6 \
    gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py \
    -- \
        --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \
        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \
        --output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2020
```
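
If you want to trigger the same submission from Python (for example, from an orchestration task), the command can be assembled as an argument list. This is a sketch under assumptions: the helper name `dataproc_submit_command` is hypothetical, and it only builds the command rather than running it. Note the standalone `--`: everything after it is passed to the PySpark script, not to `gcloud`.

```python
import shlex


def dataproc_submit_command(cluster, region, script_uri, script_args):
    """Build the argv list for `gcloud dataproc jobs submit pyspark`."""
    command = [
        'gcloud', 'dataproc', 'jobs', 'submit', 'pyspark',
        f'--cluster={cluster}',
        f'--region={region}',
        script_uri,
        '--',  # separator: the remaining arguments go to the script
    ]
    command.extend(script_args)
    return command


bucket = 'gs://dtc_data_lake_de-zoomcamp-nytaxi'
command = dataproc_submit_command(
    cluster='de-zoomcamp-cluster',
    region='europe-west6',
    script_uri=f'{bucket}/code/06_spark_sql.py',
    script_args=[
        f'--input_green={bucket}/pq/green/2020/*/',
        f'--input_yellow={bucket}/pq/yellow/2020/*/',
        f'--output={bucket}/report-2020',
    ],
)

# Shell-quoted form for inspection; pass `command` to subprocess.run to execute it
print(shlex.join(command))
```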

### BigQuery

Upload the script to GCS:

```bash
gsutil -m cp -r 06_spark_sql_big_query.py gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql_big_query.py
```

Write the results to BigQuery ([docs](https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#pyspark)):

```bash
gcloud dataproc jobs submit pyspark \
    --cluster=de-zoomcamp-cluster \
    --region=europe-west6 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql_big_query.py \
    -- \
        --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \
        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \
        --output=trips_data_all.reports-2020
```

There can be issues between the latest Spark version and the BigQuery connector. Download links to the jar file for each Spark version can be found at:
[Spark and BigQuery connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector)

**Note**: Dataproc on GCE 2.1+ images pre-install the Spark BigQuery connector: [Dataproc Release 2.2](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2). Therefore, there is no need to include the jar file in the job submission.

================================================
FILE: 06-batch/code/download_data.sh
================================================

set -e

TAXI_TYPE=$1 # "yellow"
YEAR=$2 # 2020

URL_PREFIX="https://github.com/DataTalksClub/nyc-tlc-data/releases/download"

for MONTH in {1..12}; do
  FMONTH=$(printf "%02d" "${MONTH}")

  URL="${URL_PREFIX}/${TAXI_TYPE}/${TAXI_TYPE}_tripdata_${YEAR}-${FMONTH}.csv.gz"

  LOCAL_PREFIX="data/raw/${TAXI_TYPE}/${YEAR}/${FMONTH}"
  LOCAL_FILE="${TAXI_TYPE}_tripdata_${YEAR}_${FMONTH}.csv.gz"
  LOCAL_PATH="${LOCAL_PREFIX}/${LOCAL_FILE}"

  echo "downloading ${URL} to ${LOCAL_PATH}"
  mkdir -p "${LOCAL_PREFIX}"
  wget "${URL}" -O "${LOCAL_PATH}"

done


================================================
FILE: 06-batch/code/homework.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "00bc6543",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyspark\n",
    "from pyspark.sql import SparkSession\n",
    "from pyspark.sql import types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cd4a0f3d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: An illegal reflective access operation has occurred\n",
      "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n",
      "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n",
      "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n",
      "WARNING: All illegal access operations will be denied in a future release\n",
      "22/03/07 21:55:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    }
   ],
   "source": [
    "spark = SparkSession.builder \\\n",
    "    .master(\"local[*]\") \\\n",
    "    .appName('test') \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "eb3e4c36",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'3.0.3'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "spark.version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5236cebd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-rw-r-- 1 alexey alexey 700M Oct 29 18:53 fhvhv_tripdata_2021-02.csv\r\n"
     ]
    }
   ],
   "source": [
    "!ls -lh fhvhv_tripdata_2021-02.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "0a3399a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = types.StructType([\n",
    "    types.StructField('hvfhs_license_num', types.StringType(), True),\n",
    "    types.StructField('dispatching_base_num', types.StringType(), True),\n",
    "    types.StructField('pickup_datetime', types.TimestampType(), True),\n",
    "    types.StructField('dropoff_datetime', types.TimestampType(), True),\n",
    "    types.StructField('PULocationID', types.IntegerType(), True),\n",
    "    types.StructField('DOLocationID', types.IntegerType(), True),\n",
    "    types.StructField('SR_Flag', types.StringType(), True)\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "68bc8b72",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = spark.read \\\n",
    "    .option(\"header\", \"true\") \\\n",
    "    .schema(schema) \\\n",
    "    .csv('fhvhv_tripdata_2021-02.csv')\n",
    "\n",
    "df = df.repartition(24)\n",
    "\n",
    "df.write.parquet('data/pq/fhvhv/2021/02/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "58989b55",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 0:>                                                          (0 + 1) / 1]\r",
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df = spark.read.parquet('data/pq/fhvhv/2021/02/')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48b01d2f",
   "metadata": {},
   "source": [
    "**Q3**: How many taxi trips were there on February 15?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "f7489aea",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.sql import functions as F"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "6c2500fd",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "data": {
      "text/plain": [
       "367170"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df \\\n",
    "    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \\\n",
    "    .filter(\"pickup_date = '2021-02-15'\") \\\n",
    "    .count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "dd7ae60d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.createOrReplaceTempView('fhvhv_2021_02')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "6d47c147",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 20:>                                                         (0 + 4) / 4]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------+\n",
      "|count(1)|\n",
      "+--------+\n",
      "|  367170|\n",
      "+--------+\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 20:==============>                                           (1 + 3) / 4]\r",
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "spark.sql(\"\"\"\n",
    "SELECT\n",
    "    COUNT(1)\n",
    "FROM \n",
    "    fhvhv_2021_02\n",
    "WHERE\n",
    "    to_date(pickup_datetime) = '2021-02-15';\n",
    "\"\"\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae3f533b",
   "metadata": {},
   "source": [
    "**Q4**: Longest trip for each day"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "7befe422",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['hvfhs_license_num',\n",
       " 'dispatching_base_num',\n",
       " 'pickup_datetime',\n",
       " 'dropoff_datetime',\n",
       " 'PULocationID',\n",
       " 'DOLocationID',\n",
       " 'SR_Flag']"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "279d9161",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Stage 37:==============>                                           (1 + 3) / 4]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------+-------------+\n",
      "|pickup_date|max(duration)|\n",
      "+-----------+-------------+\n",
      "| 2021-02-11|        75540|\n",
      "| 2021-02-17|        57221|\n",
      "| 2021-02-20|        44039|\n",
      "| 2021-02-03|        40653|\n",
      "| 2021-02-19|        37577|\n",
      "+-----------+-------------+\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 38:==================================================>   (187 + 4) / 200]\r",
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df \\\n",
    "    .withColumn('duration', df.dropoff_datetime.cast('long') - df.pickup_datetime.cast('long')) \\\n",
    "    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \\\n",
    "    .groupBy('pickup_date') \\\n",
    "        .max('duration') \\\n",
    "    .orderBy('max(duration)', ascending=False) \\\n",
    "    .limit(5) \\\n",
    "    .show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "74cf0e8b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 43:>                                                         (0 + 4) / 4]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------+-----------------+\n",
      "|pickup_date|         duration|\n",
      "+-----------+-----------------+\n",
      "| 2021-02-11|           1259.0|\n",
      "| 2021-02-17|953.6833333333333|\n",
      "| 2021-02-20|733.9833333333333|\n",
      "| 2021-02-03|           677.55|\n",
      "| 2021-02-19|626.2833333333333|\n",
      "| 2021-02-25|            583.5|\n",
      "| 2021-02-18|576.8666666666667|\n",
      "| 2021-02-10|569.4833333333333|\n",
      "| 2021-02-21|           537.05|\n",
      "| 2021-02-09|534.7833333333333|\n",
      "+-----------+-----------------+\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 44:================================================>     (180 + 4) / 200]\r",
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "spark.sql(\"\"\"\n",
    "SELECT\n",
    "    to_date(pickup_datetime) AS pickup_date,\n",
    "    MAX((CAST(dropoff_datetime AS LONG) - CAST(pickup_datetime AS LONG)) / 60) AS duration\n",
    "FROM \n",
    "    fhvhv_2021_02\n",
    "GROUP BY\n",
    "    1\n",
    "ORDER BY\n",
    "    2 DESC\n",
    "LIMIT 10;\n",
    "\"\"\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d915096b",
   "metadata": {},
   "source": [
    "**Q5**: Most frequent `dispatching_base_num`\n",
    "\n",
    "How many stages this spark job has?\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "25816aa2",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 73:>                                                         (0 + 4) / 4]\r",
      "\r",
      "[Stage 73:==============>                                           (1 + 3) / 4]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+--------+\n",
      "|dispatching_base_num|count(1)|\n",
      "+--------------------+--------+\n",
      "|              B02510| 3233664|\n",
      "|              B02764|  965568|\n",
      "|              B02872|  882689|\n",
      "|              B02875|  685390|\n",
      "|              B02765|  559768|\n",
      "+--------------------+--------+\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 74:===================================================>  (189 + 5) / 200]\r",
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "spark.sql(\"\"\"\n",
    "SELECT\n",
    "    dispatching_base_num,\n",
    "    COUNT(1)\n",
    "FROM \n",
    "    fhvhv_2021_02\n",
    "GROUP BY\n",
    "    1\n",
    "ORDER BY\n",
    "    2 DESC\n",
    "LIMIT 5;\n",
    "\"\"\").show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "a78f9fe3",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 86:>                                                         (0 + 4) / 4]\r",
      "\r",
      "[Stage 86:=============================>                            (2 + 2) / 4]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+-------+\n",
      "|dispatching_base_num|  count|\n",
      "+--------------------+-------+\n",
      "|              B02510|3233664|\n",
      "|              B02764| 965568|\n",
      "|              B02872| 882689|\n",
      "|              B02875| 685390|\n",
      "|              B02765| 559768|\n",
      "+--------------------+-------+\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "[Stage 87:===========================================>          (161 + 5) / 200]\r",
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "df \\\n",
    "    .groupBy('dispatching_base_num') \\\n",
    "        .count() \\\n",
    "    .orderBy('count', ascending=False) \\\n",
    "    .limit(5) \\\n",
    "    .show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d10173a",
   "metadata": {},
   "source": [
    "**Q6**: Most common locations pair"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "74b7f664",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_zones = spark.read.parquet('zones')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "81642d3b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['LocationID', 'Borough', 'Zone', 'service_zone']"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_zones.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "4f460dda",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['hvfhs_license_num',\n",
       " 'dispatching_base_num',\n",
       " 'pickup_datetime',\n",
       " 'dropoff_datetime',\n",
       " 'PULocationID',\n",
       " 'DOLocationID',\n",
       " 'SR_Flag']"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "ad8f0101",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_zones.createOrReplaceTempView('zones')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "6f738414",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Stage 103:==============================================>      (176 + 4) / 200]\r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+--------+\n",
      "|          pu_do_pair|count(1)|\n",
      "+--------------------+--------+\n",
      "|East New York / E...|   45041|\n",
      "|Borough Park / Bo...|   37329|\n",
      "| Canarsie / Canarsie|   28026|\n",
      "|Crown Heights Nor...|   25976|\n",
      "|Bay Ridge / Bay R...|   17934|\n",
      "+--------------------+--------+\n",
      "\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "spark.sql(\"\"\"\n",
    "SELECT\n",
    "    CONCAT(pul.Zone, ' / ', dol.Zone) AS pu_do_pair,\n",
    "    COUNT(1)\n",
    "FROM \n",
    "    fhvhv_2021_02 fhv LEFT JOIN zones pul ON fhv.PULocationID = pul.LocationID\n",
    "                      LEFT JOIN zones dol ON fhv.DOLocationID = dol.LocationID\n",
    "GROUP BY \n",
    "    1\n",
    "ORDER BY\n",
    "    2 DESC\n",
    "LIMIT 5;\n",
    "\"\"\").show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4b754d1",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 06-batch/setup/config/core-site.xml
================================================
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.json.keyfile</name>
    <value>/home/alexey/.google/credentials/google_credentials.json</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.enable</name>
    <value>true</value>
  </property>
</configuration>

================================================
FILE: 06-batch/setup/config/spark-defaults.conf
================================================
spark.master    yarn
spark.hadoop.google.cloud.auth.service.account.enable        true
spark.hadoop.google.cloud.auth.service.account.json.keyfile  /home/alexey


================================================
FILE: 06-batch/setup/config/spark.dockerfile
================================================
FROM library/openjdk:11

================================================
FILE: 06-batch/setup/hadoop-yarn.md
================================================
## Spark on YARN 

For the Spark and Docker module, we need YARN, which
comes together with Hadoop, so we need to install Hadoop.

In this document, we assume you use Linux. On Windows, use WSL. It should (supposedly) work on macOS as well.

We'll run Hadoop in pseudo-distributed mode.


### Configuring ssh

You need to be able to `ssh` to your localhost without typing a password. In other words, you execute

```bash
ssh localhost
```

And you get ssh access. 

If you don't have it, add your `id_rsa.pub` key to the list of keys authorized to access your computer:

```bash
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
```

(This assumes you already have `id_rsa.pub` in `~/.ssh`)

On WSL, you may need to start the ssh service:

```bash
sudo service ssh start
```

### Download Hadoop binaries

Our Spark build expects Hadoop 3.2, so that's the version we'll install.

Go to [Hadoop's website](https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz) to find the closest mirror, then download the archive:

```bash
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
```

Unpack it and go to the resulting directory:

```bash
tar xzfv hadoop-3.2.3.tar.gz
cd hadoop-3.2.3/
```


### YARN on a Single Node

Set `JAVA_HOME` in `etc/hadoop/hadoop-env.sh`:

```bash
echo "export JAVA_HOME=${JAVA_HOME}" >> etc/hadoop/hadoop-env.sh
```

Start YARN:

```bash
./sbin/start-yarn.sh
```

The YARN web UI should be available on port 8088: http://localhost:8088/


### Running Spark on YARN

To submit Spark jobs, we need to use `master="yarn"`.

Spark needs to know where to look for YARN config files, so we need to set it:


```bash
export HADOOP_HOME="${HOME}/spark/hadoop-3.2.3"
export YARN_CONF_DIR="${HADOOP_HOME}/etc/hadoop"
```

Then run Jupyter or use spark-submit.


### Connecting Spark and YARN to GCS

Download the GCS connector:

```bash
gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar .
```

Config changes:

* Change `${SPARK_HOME}/conf/spark-defaults.conf` (see [here]())
* Change `${YARN_CONF_DIR}/core-site.xml` (see [here](config/core-site.xml))

Template for hadoop properties:

```xml
  <property>
    <name></name>
    <value></value>
  </property>
```
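
If you prefer to generate the properties instead of typing them by hand, the template can be filled from a mapping. A sketch using only the standard library; the helper name `build_configuration` is hypothetical, and the two properties are taken from `config/core-site.xml`:

```python
import xml.etree.ElementTree as ET


def build_configuration(properties):
    """Build a Hadoop <configuration> element from a name -> value mapping."""
    configuration = ET.Element('configuration')
    for name, value in properties.items():
        # Each entry becomes one <property> block, as in the template
        prop = ET.SubElement(configuration, 'property')
        ET.SubElement(prop, 'name').text = name
        ET.SubElement(prop, 'value').text = value
    return configuration


properties = {
    'fs.gs.impl': 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem',
    'fs.gs.auth.service.account.enable': 'true',
}

configuration = build_configuration(properties)

# Unindented XML; paste the <property> blocks into core-site.xml
print(ET.tostring(configuration, encoding='unicode'))
```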

### Spark and YARN with Docker

Copy the config from [here](https://hadoop.apache.org/docs/r3.2.3/hadoop-yarn/hadoop-yarn-site/DockerContainers.html).

Running spark-submit:

```bash
MOUNTS="$HADOOP_HOME:$HADOOP_HOME:ro,/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro"
IMAGE_ID="pyspark-docker:test"

spark-submit \
    --master yarn \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE_ID} \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE_ID} \
    06_spark_sql.py \
        --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2021/*/ \
        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2021/*/ \
        --output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2021
```


### Sources

* https://hadoop.apache.org/docs/r3.2.3/hadoop-project-dist/hadoop-common/SingleCluster.html
* https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration


================================================
FILE: 06-batch/setup/linux.md
================================================

## Linux

Here we'll show you how to install Spark 4.x on Linux.
We tested it on Ubuntu 24.04 (including WSL), but it should work
on other Linux distros as well.


### Installing Java

Spark 4.x requires Java 17 or 21. The simplest way is to install it via your package manager:

```bash
sudo apt update
sudo apt install default-jdk
```

Check that it works:

```bash
java --version
```

Output (example):

```
openjdk 21.0.10 2026-01-20
OpenJDK Runtime Environment (build 21.0.10+7-Ubuntu-124.04)
OpenJDK 64-Bit Server VM (build 21.0.10+7-Ubuntu-124.04, mixed mode, sharing)
```

Set `JAVA_HOME` (add to your `.bashrc` or `.zshrc`):

```bash
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
export PATH="${JAVA_HOME}/bin:${PATH}"
```
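
The `export JAVA_HOME=...` line resolves the `java` symlink and then strips the trailing `bin/java`. The Python sketch below spells out the same steps, which can help when debugging an unexpected `JAVA_HOME`. The function names are illustrative only.

```python
import shutil
from pathlib import Path


def java_home_from_binary(java_binary):
    """Resolve symlinks, then drop the trailing bin/java path components."""
    return Path(java_binary).resolve().parent.parent


def detect_java_home():
    """Return the JDK root for the java found on PATH, or None if there is none."""
    java_binary = shutil.which("java")
    return None if java_binary is None else java_home_from_binary(java_binary)


if __name__ == "__main__":
    print(f"Detected JAVA_HOME: {detect_java_home()}")
```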


### PySpark

We recommend using [uv](https://docs.astral.sh/uv/) for managing Python packages:

```bash
uv init
uv add pyspark
```

Then run your scripts with `uv run`:

```bash
uv run python your_script.py
```

Alternatively, you can use pip:

```bash
pip install pyspark
```

Both approaches install PySpark along with a bundled Spark distribution - no separate Spark download needed.


### Testing it

Create a test script `test_spark.py`:

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

print(f"Spark version: {spark.version}")

df = spark.range(10)
df.show()

spark.stop()
```

Run it:

```bash
uv run python test_spark.py
```


================================================
FILE: 06-batch/setup/macos.md
================================================

## macOS

Here we'll show you how to install Spark 4.x on macOS.
We tested it on macOS 15 (Sequoia), but it should work
for other versions as well.


### Installing Java

Spark 4.x requires Java 17. Ensure [Homebrew](https://brew.sh/) is installed, then install OpenJDK 17:

```bash
brew install openjdk@17
```

Add the following environment variables to your `.zshrc` (or `.bash_profile`):

```bash
export JAVA_HOME=$(brew --prefix openjdk@17)
export PATH="$JAVA_HOME/bin:$PATH"
```

Check that Java works correctly:

```bash
java --version
```

Output (example):

```
openjdk 17.0.14 2026-01-21
OpenJDK Runtime Environment Homebrew (build 17.0.14+0)
OpenJDK 64-Bit Server VM Homebrew (build 17.0.14+0, mixed mode, sharing)
```


### PySpark

We recommend using [uv](https://docs.astral.sh/uv/) for managing Python packages:

```bash
uv init
uv add pyspark
```

Then run your scripts with `uv run`:

```bash
uv run python your_script.py
```

Alternatively, you can use pip:

```bash
pip install pyspark
```

Both approaches install PySpark along with a bundled Spark distribution — no separate Spark download needed.

> If you previously installed Spark 3.x and have `SPARK_HOME` set in your `.zshrc` or `.bash_profile` (e.g. pointing to a local Spark directory), remove that line. PySpark 4.x bundles its own Spark, so `SPARK_HOME` is no longer needed. If the old `SPARK_HOME` is still set, PySpark 4.x will load the old JARs and fail.
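
A quick way to confirm that no stale `SPARK_HOME` is left in your environment is a check like the one below. It is a minimal sketch; `stale_spark_home` is a hypothetical helper, not part of PySpark.

```python
import os


def stale_spark_home(environ=None):
    """Return the SPARK_HOME value if it is set and non-empty, else None."""
    environ = os.environ if environ is None else environ
    return environ.get("SPARK_HOME") or None


if __name__ == "__main__":
    leftover = stale_spark_home()
    if leftover:
        print(f"SPARK_HOME is still set to {leftover}; remove it from your shell profile.")
    else:
        print("SPARK_HOME is not set; PySpark will use its bundled Spark.")
```

Run it in a fresh terminal so that changes to your shell profile have taken effect.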


### Testing it

Create a test script `test_spark.py`:

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

print(f"Spark version: {spark.version}")

df = spark.range(10)
df.show()

spark.stop()
```

Run it:

```bash
uv run python test_spark.py
```

You may see a warning like `WARNING: Using incubator modules: jdk.incubator.vector` — you can safely ignore it.


================================================
FILE: 06-batch/setup/windows.md
================================================
## Windows

Here we'll show you how to install Spark 4.x on Windows.
We tested it on Windows 10 and 11, but it should work
for other versions as well.

In this tutorial, we'll use [MINGW](https://www.mingw-w64.org/)/[Git Bash](https://gitforwindows.org/) for the command line.

If you use WSL, follow the instructions from [linux.md](linux.md).


### Installing Java

Spark 4.x requires Java 17. Download and unpack the Adoptium JDK 17:

```bash
wget https://github.com/adoptium/temurin17-binaries/releases/download/jdk-17.0.18%2B8/OpenJDK17U-jdk_x64_windows_hotspot_17.0.18_8.zip
unzip OpenJDK17U-jdk_x64_windows_hotspot_17.0.18_8.zip -d /c/tools/
```

The full path to JDK will be `/c/tools/jdk-17.0.18+8`.

Now let's configure it and add it to `PATH` (add to your `.bashrc`):

```bash
export JAVA_HOME="/c/tools/jdk-17.0.18+8"
export PATH="${JAVA_HOME}/bin:${PATH}"
```

Check that Java works correctly:

```bash
java --version
```

Output:

```
openjdk 17.0.18 2026-01-20 LTS
OpenJDK Runtime Environment Temurin-17.0.18+8 (build 17.0.18+8-LTS)
OpenJDK 64-Bit Server VM Temurin-17.0.18+8 (build 17.0.18+8-LTS, mixed mode, sharing)
```


### PySpark

We recommend using [uv](https://docs.astral.sh/uv/) for managing Python packages:

```bash
uv init
uv add pyspark
```

Then run your scripts with `uv run`:

```bash
uv run python your_script.py
```

Alternatively, you can use pip:

```bash
pip install pyspark
```

Both approaches install PySpark along with a bundled Spark distribution — no separate Spark or Hadoop download needed.

> If you previously installed Spark 3.x and have `SPARK_HOME` set in your `.bashrc` (e.g. pointing to `C:/tools/spark-3.3.2-bin-hadoop3`), remove that line. PySpark 4.x bundles its own Spark, so `SPARK_HOME` is no longer needed. If the old `SPARK_HOME` is still set, PySpark 4.x will load the old JARs and fail.


### Testing it

Create a test script `test_spark.py`:

```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

print(f"Spark version: {spark.version}")

df = spark.range(10)
df.show()

spark.stop()
```

Run it:

```bash
uv run python test_spark.py
```

At this point you may get a message from Windows Firewall — allow it.

You may see a warning like `WARNING: Using incubator modules: jdk.incubator.vector` — you can safely ignore it.


================================================
FILE: 07-streaming/.gitignore
================================================
week6_venv

================================================
FILE: 07-streaming/README.md
================================================
# Module 7: Stream Processing

Video: https://www.youtube.com/live/YDUgFeHQzJU

- [PyFlink workshop](workshop/) - build a real-time streaming pipeline step by step (Redpanda, Python, Flink, PostgreSQL)
- [Homework](../cohorts/2026/07-streaming/homework.md)
- [Kafka theory](theory/) - video lectures on Kafka concepts with Java code examples (optional)
- [Extras](extras/) - supplementary Python and PyFlink examples from previous years (optional)


## Community notes

<details>
<summary>Did you take notes? You can share them here</summary>

* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/6_streaming.md)
* [Marcos Torregrosa's blog (Spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-6-stream-processing/)
* [Notes by Oscar Garcia](https://github.com/ozkary/Data-Engineering-Bootcamp/tree/main/Step6-Streaming)
* [Notes by Shayan Shafiee Moghadam](https://github.com/shayansm2/eng-notebook/blob/main/kafka/readme.md)
* Add your notes here (above this line)
</details>


================================================
FILE: 07-streaming/extras/README.md
================================================
# Supplementary streaming examples

Additional stream processing examples from previous course years. These are
not part of the main workshop but may be useful as reference material.


## python/

Python Kafka examples by Irem Erturk, using various libraries.

- [json_example/](python/json_example) - producer and consumer using
  `kafka-python` with JSON serialization
- [avro_example/](python/avro_example) - producer and consumer using
  `confluent-kafka` with Avro serialization and Schema Registry
- [redpanda_example/](python/redpanda_example) - same as the JSON example
  but running against Redpanda instead of Kafka, with a local
  docker-compose setup
- [streams-example/faust/](python/streams-example/faust) - stream processing
  with [Faust](https://faust-streaming.github.io/faust/), a Python library
  for Kafka Streams. Includes windowing, branching, and counting examples.
- [streams-example/pyspark/](python/streams-example/pyspark) - Spark
  Structured Streaming consuming from Kafka, with a Jupyter notebook
- [streams-example/redpanda/](python/streams-example/redpanda) - same as
  the PySpark example but using Redpanda as the broker
- [docker/](python/docker) - Docker Compose files for running Kafka and
  Spark clusters locally
- [resources/](python/resources) - sample data (rides.csv) and Avro schemas


## pyflink/

PyFlink workshop by Irem Erturk. Uses Apache Flink 1.x with a
Makefile-based workflow, PostgreSQL sink, and Docker Compose setup. The
[2025 stream with Zach Wilson](https://www.youtube.com/watch?v=P2loELMUUeI)
was rewritten into the current [2026 workshop](../workshop/) by Alexey,
using Flink 2.2, uv, and a step-by-step README.


## ksqldb/

[commands.md](ksqldb/commands.md) - example ksqlDB queries for creating
streams, filtering, grouping, and windowed aggregations over Kafka topics.
Companion to the [ksqlDB and Connect video](../theory/#kafka-streams) in
the theory section.


================================================
FILE: 07-streaming/extras/ksqldb/commands.md
================================================
## KSQL DB Examples
### Create streams
```sql
CREATE STREAM ride_streams (
    VendorId varchar, 
    trip_distance double,
    payment_type varchar
)  WITH (KAFKA_TOPIC='rides',
        VALUE_FORMAT='JSON');
```

### Query stream
```sql
select * from RIDE_STREAMS 
EMIT CHANGES;
```

### Query stream count
```sql
SELECT VENDORID, count(*) FROM RIDE_STREAMS 
GROUP BY VENDORID
EMIT CHANGES;
```

### Query stream with filters
```sql
SELECT payment_type, count(*) FROM RIDE_STREAMS 
WHERE payment_type IN ('1', '2')
GROUP BY payment_type
EMIT CHANGES;
```

### Query stream with window functions
```sql
CREATE TABLE payment_type_sessions AS
  SELECT payment_type,
         count(*)
  FROM  RIDE_STREAMS 
  WINDOW SESSION (60 SECONDS)
  GROUP BY payment_type
  EMIT CHANGES;
```
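
To make the session-window semantics concrete, here is a small plain-Python model of what `WINDOW SESSION (60 SECONDS)` computes per key: events that arrive no more than 60 seconds apart fall into one session, and a longer pause starts a new one. This is a simplified illustration, not ksqlDB code; in particular, treating a gap of exactly 60 seconds as still inside the session is an assumption.

```python
from collections import defaultdict


def session_counts(events, gap_seconds=60):
    """Group (key, timestamp) events into per-key sessions and count each one.

    A session for a key stays open as long as consecutive events arrive
    no more than `gap_seconds` apart; a larger gap starts a new session.
    Returns {key: [(session_start, session_end, count), ...]}.
    """
    by_key = defaultdict(list)
    for key, timestamp in events:
        by_key[key].append(timestamp)

    sessions = {}
    for key, timestamps in by_key.items():
        timestamps.sort()
        result = []
        start = end = timestamps[0]
        count = 1
        for ts in timestamps[1:]:
            if ts - end <= gap_seconds:
                # Still within the inactivity gap: extend the current session.
                end = ts
                count += 1
            else:
                # Gap exceeded: close the session and open a new one.
                result.append((start, end, count))
                start = end = ts
                count = 1
        result.append((start, end, count))
        sessions[key] = result
    return sessions


if __name__ == "__main__":
    # (payment_type, event time in seconds); made-up sample data.
    sample = [("1", 0), ("1", 30), ("1", 200), ("2", 10)]
    print(session_counts(sample))
    # {'1': [(0, 30, 2), (200, 200, 1)], '2': [(10, 10, 1)]}
```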

## KSQL documentation for details
[KSQL DB Documentation](https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/quick-reference/)

[KSQL DB Java client](https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-clients/java-client/)

================================================
FILE: 07-streaming/extras/pyflink/.gitignore
================================================
data/
postgres-data
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

dump.sql

# Personal workspace files
.idea/*
.vscode/*

================================================
FILE: 07-streaming/extras/pyflink/Dockerfile.flink
================================================
FROM --platform=linux/amd64 flink:1.16.0-scala_2.12-java8

# Install Python 3: Debian 11 ships Python 3.9, but PyFlink 1.16 officially
# supports only Python 3.6, 3.7 and 3.8, so we build Python 3.7 from source.

# ref: https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker

RUN apt-get update -y && \
    apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev liblzma-dev && \
    wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz && \
    tar -xvf Python-3.7.9.tgz && \
    cd Python-3.7.9 && \
    ./configure --without-tests --enable-shared && \
    make -j6 && \
    make install && \
    ldconfig /usr/local/lib && \
    cd .. && rm -f Python-3.7.9.tgz && rm -rf Python-3.7.9 && \
    ln -s /usr/local/bin/python3 /usr/local/bin/python && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# install PyFlink
COPY requirements.txt .
RUN python -m pip install --upgrade pip && \
    pip3 install --upgrade google-api-python-client && \
    pip3 install -r requirements.txt --no-cache-dir

# Download connector libraries
RUN wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/1.16.0/flink-json-1.16.0.jar && \
    wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka/1.16.0/flink-sql-connector-kafka-1.16.0.jar && \
    wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc/1.16.0/flink-connector-jdbc-1.16.0.jar && \
    wget -P /opt/flink/lib/ https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.24/postgresql-42.2.24.jar

RUN echo "taskmanager.memory.jvm-metaspace.size: 512m" >> /opt/flink/conf/flink-conf.yaml;

WORKDIR /opt/flink


================================================
FILE: 07-streaming/extras/pyflink/LICENSE
================================================
MIT License

Copyright (c) 2025 Sreela Das, Julie Scherer, Zach Wilson

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: 07-streaming/extras/pyflink/Makefile
================================================
PLATFORM ?= linux/amd64

# COLORS
GREEN  := $(shell tput -Txterm setaf 2)
YELLOW := $(shell tput -Txterm setaf 3)
WHITE  := $(shell tput -Txterm setaf 7)
RESET  := $(shell tput -Txterm sgr0)


TARGET_MAX_CHAR_NUM=20

## Show help with `make help`
help:
	@echo ''
	@echo 'Usage:'
	@echo '  ${YELLOW}make${RESET} ${GREEN}<target>${RESET}'
	@echo ''
	@echo 'Targets:'
	@awk '/^[a-zA-Z\-\_0-9]+:/ { \
		helpMessage = match(lastLine, /^## (.*)/); \
		if (helpMessage) { \
			helpCommand = substr($$1, 0, index($$1, ":")-1); \
			helpMessage = substr(lastLine, RSTART + 3, RLENGTH); \
			printf "  ${YELLOW}%-$(TARGET_MAX_CHAR_NUM)s${RESET} ${GREEN}%s${RESET}\n", helpCommand, helpMessage; \
		} \
	} \
	{ lastLine = $$0 }' $(MAKEFILE_LIST)

.PHONY: build
## Builds the Flink base image with pyFlink and connectors installed
build:
	docker build .

.PHONY: up
## Builds the base Docker image and starts Flink cluster
up:
	docker compose up --build --remove-orphans  -d

.PHONY: down
## Shuts down the Flink cluster
down:
	docker compose down --remove-orphans

.PHONY: job
## Submit the Flink job
job:
	docker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d

.PHONY: aggregation_job
## Submit the Flink aggregation job
aggregation_job:
	docker compose exec jobmanager ./bin/flink run -py /opt/src/job/aggregation_job.py --pyFiles /opt/src -d

.PHONY: stop
## Stops all services in Docker compose
stop:
	docker compose stop

.PHONY: start
## Starts all services in Docker compose
start:
	docker compose start


================================================
FILE: 07-streaming/extras/pyflink/README.md
================================================
# Apache Flink Training
Apache Flink Streaming Pipelines

## :pushpin: Getting started 

### :whale: Installations

To run this repo, the following components will need to be installed:

1. [Docker](https://docs.docker.com/get-docker/) (required)
2. [Docker compose](https://docs.docker.com/compose/install/#installation-scenarios) (required)
3. Make (recommended) -- see below
    - On most Linux distributions and macOS, `make` is typically pre-installed by default. To check if `make` is installed on your system, you can run the `make --version` command in your terminal or command prompt. If it's installed, it will display the version information. 
    - Otherwise, you can try following the instructions below, or you can just copy+paste the commands from the `Makefile` into your terminal or command prompt and run manually.

        ```bash
        # On Ubuntu or Debian:
        sudo apt-get update
        sudo apt-get install build-essential

        # On CentOS or Fedora:
        sudo dnf install make

        # On macOS:
        xcode-select --install

        # On Windows:
        choco install make # uses Chocolatey, https://chocolatey.org/install
        ```

### :computer: Local setup

Make sure you're in the `pyflink` folder:

```bash
cd 07-streaming/extras/pyflink
```

## :boom: Running the pipeline

1. Build the Docker image and deploy the services defined in `docker-compose.yml`, including the PostgreSQL database and the Flink cluster. This should also create the sink table, `processed_events`, to which Flink will write the Kafka messages.

    ```bash
    make up

    # If you don't have make, you can run:
    # docker compose up --build --remove-orphans  -d
    ```

    **:star: Wait until the Flink UI is running at [http://localhost:8081/](http://localhost:8081/) before proceeding to the next step.** _Note that the first time you build the Docker image, it can take anywhere from 5 to 30 minutes. Future builds should take only a few seconds, assuming you haven't deleted the image since._

    :information_source: After the image is built, Docker will automatically start the job manager and task manager services. This takes a minute or so. Check the container logs in Docker Desktop; when you see the line below, you're ready to move on to the next step.

    ```
    taskmanager Successful registration at resource manager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_* under registration id <id_number>
    ```

2. Now that the Flink cluster is up and running, it's time to finally run the PyFlink job! :smile:

    ```bash
    make job

    # If you don't have make, you can run:
    # docker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d
    ```

    After about a minute, you should see a prompt that the job's been submitted (e.g., `Job has been submitted with JobID <job_id_number>`). Now go back to the [Flink UI](http://localhost:8081/#/job/running) to see the job running! :tada:


3. When you're done, you can stop and/or clean up the Docker resources by running the commands below.

    ```bash
    make stop # to stop running services in docker compose
    make down # to stop and remove docker compose services
    make clean # to remove the docker container and dangling images
    ```

    :grey_exclamation: Note the `/var/lib/postgresql/data` directory inside the PostgreSQL container is mounted to the `./postgres-data` directory on your local machine. This means the data will persist across container restarts or removals, so even if you stop/remove the container, you won't lose any data written within the container.
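
Instead of refreshing the browser while waiting for the Flink UI in step 1, you can poll it from a script. The sketch below repeatedly requests `/overview`, an endpoint of Flink's REST API served on the same port as the UI. The helper names, attempt count and delay are arbitrary choices for this example.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=3.0):
    """Return True if `url` answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, attempts=60, delay_seconds=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return False if all attempts fail."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


if __name__ == "__main__":
    # Short settings for a quick check; raise `attempts` for the first image build.
    ready = wait_until_ready("http://localhost:8081/overview", attempts=3, delay_seconds=1.0)
    print("Flink is up" if ready else "Flink is not reachable yet")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running cluster.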

------

:information_source: To see all the make commands that are available and what they do, run:

```bash
make help
```

As of the time of writing this, the available commands are:

```bash

Usage:
  make <target>

Targets:
  help                 Show help with `make help`
  db-init              Builds and runs the PostgreSQL database service
  build                Builds the Flink base image with pyFlink and connectors installed
  up                   Builds the base Docker image and starts Flink cluster
  down                 Shuts down the Flink cluster
  job                  Submit the Flink job
  stop                 Stops all services in Docker compose
  start                Starts all services in Docker compose
  clean                Stops and removes the Docker container as well as images with tag `<none>`
  psql                 Runs psql to query containerized postgreSQL database in CLI
  postgres-die-mac     Removes mounted postgres data dir on local machine (mac users) and in Docker
  postgres-die-pc      Removes mounted postgres data dir on local machine (PC users) and in Docker
```


================================================
FILE: 07-streaming/extras/pyflink/docker-compose.yml
================================================
version: "3.9"
services:
  redpanda-1:
    image: redpandadata/redpanda:v24.2.18
    container_name: redpanda-1
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda-1:33145
    ports:
      # - 8081:8081
      - 8082:8082
      - 9092:9092
      - 28082:28082
      - 29092:29092

  jobmanager:
    build:
      context: .
      dockerfile: ./Dockerfile.flink
    image: pyflink:1.16.0
    container_name: "flink-jobmanager"
    pull_policy: never
    platform: "linux/amd64"
    hostname: "jobmanager"
    expose:
      - "6123"
    ports:
      - "8081:8081"
    volumes:
      - ./:/opt/flink/usrlib
      - ./keys/:/var/private/ssl/
      - ./src/:/opt/src
    command: jobmanager 
    extra_hosts:
      - "host.docker.internal:127.0.0.1" #// Linux
      - "host.docker.internal:host-gateway" #// Access services on the host machine from within the Docker container
    environment:
      - POSTGRES_URL=${POSTGRES_URL:-jdbc:postgresql://host.docker.internal:5432/postgres}
      - POSTGRES_USER=${POSTGRES_USER:-postgres}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-postgres}
      - POSTGRES_DB=${POSTGRES_DB:-postgres}
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager        
  
  # Flink task manager
  taskmanager:
    image: pyflink:1.16.0
    container_name: "flink-taskmanager"
    pull_policy: never
    platform: "linux/amd64"
    expose:
      - "6121"
      - "6122"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    depends_on:
      - jobmanager
    command: taskmanager --taskmanager.registration.timeout 5 min
    extra_hosts:
      - "host.docker.internal:127.0.0.1" #// Linux
      - "host.docker.internal:host-gateway" #// Access services on the host machine from within the Docker container
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 15
        parallelism.default: 3
  postgres:
    image: postgres:14
    restart: on-failure
    container_name: "postgres"
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - "5432:5432"
    extra_hosts:
     - "host.docker.internal:127.0.0.1" #// Linux
     - "host.docker.internal:host-gateway" #// Access services on the host machine from within the Docker container


================================================
FILE: 07-streaming/extras/pyflink/homework.md
================================================
# Homework

For this homework we will be using the Taxi data:
- Green 2019-10 data from [here](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz)


## Start Redpanda, Flink Job Manager, Flink Task Manager, and Postgres

There's a `docker-compose.yml` file in the homework folder (taken from [here](https://github.com/redpanda-data-blog/2023-python-gsg/blob/main/docker-compose.yml))

Copy this file to your homework directory and run:

```bash
docker-compose up
```

(Add `-d` if you want to run in detached mode)

Visit `localhost:8081` to see the Flink Job Manager.

Connect to Postgres with [DBeaver](https://dbeaver.io/).

The connection credentials are:
- Username `postgres`
- Password `postgres`
- Database `postgres`
- Host `localhost`
- Port `5432`


In DBeaver, run this query to create the Postgres landing zone for the first events:
```sql
CREATE TABLE processed_events (
    test_data INTEGER,
    event_timestamp TIMESTAMP
);
```


## Question 1. Connecting to the Kafka server

We need to make sure we can connect to the server, so that
later we can send some data to its topics.

First, let's install the Kafka client library (it's up to you whether
to use a separate virtual environment for that):

```bash
pip install kafka-python
```

You can start a Jupyter notebook in your solution folder or
create a script.

Let's try to connect to our server:

```python
import json
import time 

from kafka import KafkaProducer

def json_serializer(data):
    return json.dumps(data).encode('utf-8')

server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=json_serializer
)

producer.bootstrap_connected()
```
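
To see what the `value_serializer` actually puts on the wire, the sketch below round-trips a test event without needing a broker. The event shape (`test_data` plus `event_timestamp` in epoch milliseconds) is an assumption based on the `processed_events` table above; `json_deserializer` and `make_test_event` are hypothetical helpers.

```python
import json
import time


def json_serializer(data):
    # Same serializer as above: dict -> JSON text -> UTF-8 bytes.
    return json.dumps(data).encode('utf-8')


def json_deserializer(payload):
    # The inverse operation, as a consumer would apply it.
    return json.loads(payload.decode('utf-8'))


def make_test_event(test_data, now=None):
    # Assumed event shape; the timestamp is expressed in epoch milliseconds.
    now = time.time() if now is None else now
    return {'test_data': test_data, 'event_timestamp': int(now * 1000)}


if __name__ == '__main__':
    event = make_test_event(1)
    payload = json_serializer(event)
    print(payload)
    print(json_deserializer(payload) == event)
```

With a connected producer, such an event could then be sent with `producer.send(topic_name, value=event)` followed by `producer.flush()`.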

## Question 3: Sending the Trip Data

* Read the green csv.gz file
* We will only need these columns:
  * `lpep_pickup_datetime`
  * `lpep_dropoff_datetime`
  * `PULocationID`
  * `DOLocationID`
  * `passenger_count`
  * `trip_distance`
  * `tip_amount`

* Create a topic `green-trips` and send the data there with `load_taxi_data.py`
* How much time in seconds did it take? (You can round it to a whole number)
* Make sure you don't include sleeps in your code
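
One possible shape for the reading and timing part, using only the standard library, is sketched below. It is not the course's `load_taxi_data.py`; `iter_trips` and `send_all` are hypothetical helpers, and `send` is any callable that delivers one message.

```python
import csv
import gzip
import time

COLUMNS = [
    'lpep_pickup_datetime',
    'lpep_dropoff_datetime',
    'PULocationID',
    'DOLocationID',
    'passenger_count',
    'trip_distance',
    'tip_amount',
]


def iter_trips(path, columns=COLUMNS):
    """Yield one dict per CSV row, keeping only the requested columns."""
    with gzip.open(path, mode='rt', newline='', encoding='utf-8') as handle:
        for row in csv.DictReader(handle):
            yield {column: row[column] for column in columns}


def send_all(trips, send):
    """Pass every trip to `send` and return (rows_sent, elapsed_seconds)."""
    started = time.time()
    sent = 0
    for trip in trips:
        send(trip)
        sent += 1
    return sent, time.time() - started
```

With the producer from Question 1, `send` could be `lambda trip: producer.send('green-trips', value=trip)`. Remember to call `producer.flush()` before you stop the clock, since sends are buffered.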

## Question 4: Build a Sessionization Window

* Copy `aggregation_job.py` and rename it to `session_job.py`
* Have it read from `green-trips`, adjusting the schema to match the trip data
* Use a [session window](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/) with a gap of 5 minutes
* Use `lpep_dropoff_datetime` as your watermark, with a 5-second tolerance
* Which pickup and drop off locations have the longest unbroken streak of taxi trips?
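
If the "5 second tolerance" is unclear, the plain-Python model below shows the idea behind a bounded-out-of-orderness watermark: the watermark trails the largest event time seen so far by the tolerance, and an event that arrives with a time below the watermark counts as late. This is a simplified illustration only; Flink emits watermarks periodically and handles boundaries slightly differently.

```python
def classify_events(event_times, tolerance_seconds=5):
    """Label each event 'on-time' or 'late' under a trailing watermark.

    The watermark equals the largest event time seen so far minus
    `tolerance_seconds`; an event is late if its time is below the
    watermark at the moment it arrives.
    """
    max_seen = None
    labelled = []
    for event_time in event_times:
        watermark = None if max_seen is None else max_seen - tolerance_seconds
        is_late = watermark is not None and event_time < watermark
        labelled.append((event_time, 'late' if is_late else 'on-time'))
        if max_seen is None or event_time > max_seen:
            max_seen = event_time
    return labelled


if __name__ == '__main__':
    # Event times in seconds, in arrival order (made-up data).
    for event_time, label in classify_events([100, 103, 98, 110, 104, 106]):
        print(event_time, label)
```

In this sample only the event at 104 is late: it arrives after 110 has moved the watermark to 105.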


================================================
FILE: 07-streaming/extras/pyflink/requirements.txt
================================================
apache-flink==1.16.0
psycopg2-binary==2.9.1
requests
kafka-python

================================================
FILE: 07-streaming/extras/pyflink/src/job/aggregation_job.py
================================================
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, DataTypes, TableEnvironment, StreamTableEnvironment
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.common.time import Duration

def create_events_aggregated_sink(t_env):
    table_name = 'processed_events_aggregated'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            event_hour TIMESTAMP(3),
            test_data INT,
            num_hits BIGINT,
            PRIMARY KEY (event_hour, test_data) NOT ENFORCED
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name

def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            test_data INTEGER,
            event_timestamp BIGINT,
            event_watermark AS TO_TIMESTAMP_LTZ(event_timestamp, 3),
            WATERMARK for event_watermark as event_watermark - INTERVAL '1' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda-1:29092',
            'topic' = 'test-topic',
            'scan.startup.mode' = 'earliest-offset',
            'properties.auto.offset.reset' = 'earliest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name


def log_aggregation():
    # Set up the execution environment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)
    env.set_parallelism(3)

    # Set up the table environment
    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    # NOTE: this DataStream-style watermark strategy is never applied below;
    # the watermark that takes effect is the one declared in the source table DDL.
    watermark_strategy = (
        WatermarkStrategy
        .for_bounded_out_of_orderness(Duration.of_seconds(5))
        .with_timestamp_assigner(
            # This lambda is the timestamp assigner:
            #   event -> the data record
            #   timestamp -> the previously assigned (or default) timestamp
            lambda event, timestamp: event[2]  # Index 2 is the third tuple element, used as the event time (ms).
        )
    )
    try:
        # Create Kafka table
        source_table = create_events_source_kafka(t_env)
        aggregated_table = create_events_aggregated_sink(t_env)

        t_env.execute_sql(f"""
        INSERT INTO {aggregated_table}
        SELECT
            window_start as event_hour,
            test_data,
            COUNT(*) AS num_hits
        FROM TABLE(
            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_watermark), INTERVAL '1' MINUTE)
        )
        GROUP BY window_start, test_data;
        
        """).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_aggregation()


================================================
FILE: 07-streaming/extras/pyflink/src/job/start_job.py
================================================
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, DataTypes, TableEnvironment, StreamTableEnvironment


def create_processed_events_sink_postgres(t_env):
    table_name = 'processed_events'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            test_data INTEGER,
            event_timestamp TIMESTAMP
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name


def create_events_source_kafka(t_env):
    table_name = "events"
    pattern = "yyyy-MM-dd HH:mm:ss.SSS"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            test_data INTEGER,
            event_timestamp BIGINT,
            event_watermark AS TO_TIMESTAMP_LTZ(event_timestamp, 3),
            WATERMARK for event_watermark as event_watermark - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda-1:29092',
            'topic' = 'test-topic',
            'scan.startup.mode' = 'latest-offset',
            'properties.auto.offset.reset' = 'latest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name

def log_processing():
    # Set up the execution environment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)
    # env.set_parallelism(1)

    # Set up the table environment
    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)
    try:
        # Create Kafka table
        source_table = create_events_source_kafka(t_env)
        postgres_sink = create_processed_events_sink_postgres(t_env)
        # write records to postgres too!
        t_env.execute_sql(
            f"""
                    INSERT INTO {postgres_sink}
                    SELECT
                        test_data,
                        TO_TIMESTAMP_LTZ(event_timestamp, 3) as event_timestamp
                    FROM {source_table}
                    """
        ).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_processing()


================================================
FILE: 07-streaming/extras/pyflink/src/job/taxi_job.py
================================================
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, DataTypes, TableEnvironment, StreamTableEnvironment


def create_taxi_events_sink_postgres(t_env):
    table_name = 'taxi_events'
    sink_ddl = f"""
        CREATE OR REPLACE TABLE {table_name} (
            VendorID INTEGER,
            lpep_pickup_datetime VARCHAR,
            lpep_dropoff_datetime VARCHAR,
            store_and_fwd_flag VARCHAR,
            RatecodeID INTEGER ,
            PULocationID INTEGER,
            DOLocationID INTEGER,
            passenger_count INTEGER,
            trip_distance DOUBLE,
            fare_amount DOUBLE,
            extra DOUBLE,
            mta_tax DOUBLE,
            tip_amount DOUBLE,
            tolls_amount DOUBLE,
            ehail_fee DOUBLE,
            improvement_surcharge DOUBLE,
            total_amount DOUBLE,
            payment_type INTEGER,
            trip_type INTEGER,
            congestion_surcharge DOUBLE
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name


def create_events_source_kafka(t_env):
    table_name = "taxi_events"
    pattern = "yyyy-MM-dd HH:mm:ss"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            VendorID INTEGER,
            lpep_pickup_datetime VARCHAR,
            lpep_dropoff_datetime VARCHAR,
            store_and_fwd_flag VARCHAR,
            RatecodeID INTEGER ,
            PULocationID INTEGER,
            DOLocationID INTEGER,
            passenger_count INTEGER,
            trip_distance DOUBLE,
            fare_amount DOUBLE,
            extra DOUBLE,
            mta_tax DOUBLE,
            tip_amount DOUBLE,
            tolls_amount DOUBLE,
            ehail_fee DOUBLE,
            improvement_surcharge DOUBLE,
            total_amount DOUBLE,
            payment_type INTEGER,
            trip_type INTEGER,
            congestion_surcharge DOUBLE,
            pickup_timestamp AS TO_TIMESTAMP(lpep_pickup_datetime, '{pattern}'),
            WATERMARK FOR pickup_timestamp AS pickup_timestamp - INTERVAL '15' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda-1:29092',
            'topic' = 'green-data',
            'scan.startup.mode' = 'earliest-offset',
            'properties.auto.offset.reset' = 'earliest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name

def log_processing():
    # Set up the execution environment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)
    # env.set_parallelism(1)

    # Set up the table environment
    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)
    try:
        # Create Kafka table
        source_table = create_events_source_kafka(t_env)
        postgres_sink = create_taxi_events_sink_postgres(t_env)
        # write records to postgres too!
        t_env.execute_sql(
            f"""
                    INSERT INTO {postgres_sink}
                    SELECT
                        *
                    FROM {source_table}
                    """
        ).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_processing()


================================================
FILE: 07-streaming/extras/pyflink/src/producers/load_taxi_data.py
================================================
import csv
import json
from kafka import KafkaProducer

def main():
    # Create a Kafka producer
    producer = KafkaProducer(
        bootstrap_servers='localhost:9092',
        value_serializer=lambda v: json.dumps(v).encode('utf-8')
    )

    csv_file = 'data/green_tripdata_2019-10.csv'  # change to your CSV file path if needed

    with open(csv_file, 'r', newline='', encoding='utf-8') as file:
        reader = csv.DictReader(file)

        for row in reader:
            # Each row will be a dictionary keyed by the CSV headers
            # Send data to Kafka topic "green-data"
            producer.send('green-data', value=row)

    # Make sure any remaining messages are delivered
    producer.flush()
    producer.close()


if __name__ == "__main__":
    main()

================================================
FILE: 07-streaming/extras/pyflink/src/producers/producer.py
================================================
import json
import time
from kafka import KafkaProducer

def json_serializer(data):
    return json.dumps(data).encode('utf-8')

server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=json_serializer
)
t0 = time.time()

topic_name = 'test-topic'

for i in range(10, 1000):
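    # The Flink source table declares event_timestamp as BIGINT (epoch milliseconds), so send an integer
    message = {'test_data': i, 'event_timestamp': int(time.time() * 1000)}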
    producer.send(topic_name, value=message)
    print(f"Sent: {message}")
    time.sleep(0.05)

producer.flush()

t1 = time.time()
print(f'took {(t1 - t0):.2f} seconds')

================================================
FILE: 07-streaming/extras/python/README.md
================================================
### Stream-Processing with Python

This document describes stream processing
with several Python libraries (`kafka-python`, `confluent-kafka`, `pyspark`, `faust`).

It is organized into the following modules.

#### 1. Docker
The Docker module includes the Dockerfiles and docker-compose definitions
needed to run Kafka and Spark in Docker containers. Setting up these services is
a prerequisite for running the modules below.

#### 2. Kafka Producer - Consumer Examples
- [Json Producer-Consumer Example](json_example) using `kafka-python` library
- [Avro Producer-Consumer Example](avro_example) using `confluent-kafka` library

Both examples require the Kafka services to be up and running, so first complete
the steps in the [docker-README](docker/README.md).

To run the producer-consumer examples, run the following commands in the respective example folder:
```bash
# Start producer script
python3 producer.py
# Start consumer script
python3 consumer.py
```
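
As a rough sketch of what the JSON example puts on the wire, the snippet below round-trips a key and a value through UTF-8 encoded JSON without needing a running broker. The function names and the sample record are hypothetical and only for illustration; the actual scripts pass comparable callables to `kafka-python` as serializers and deserializers.

```python
import json


def serialize_key(vendor_id: int) -> bytes:
    # Keys are sent as UTF-8 text so the consumer can parse them back with int()
    return str(vendor_id).encode("utf-8")


def serialize_value(record: dict) -> bytes:
    # Values are sent as UTF-8 encoded JSON documents
    return json.dumps(record).encode("utf-8")


def deserialize_key(raw: bytes) -> int:
    return int(raw.decode("utf-8"))


def deserialize_value(raw: bytes) -> dict:
    return json.loads(raw.decode("utf-8"))


# Hypothetical sample record, not taken from the dataset
ride = {"vendor_id": 1, "passenger_count": 2, "trip_distance": 3.5}
key_bytes = serialize_key(ride["vendor_id"])
value_bytes = serialize_value(ride)

restored_key = deserialize_key(key_bytes)
restored_ride = deserialize_value(value_bytes)
```

Keeping serialization in small standalone functions like these makes it possible to check the message format before connecting to Kafka.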


================================================
FILE: 07-streaming/extras/python/avro_example/consumer.py
================================================
import os
from typing import Dict, List

from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField

from ride_record_key import dict_to_ride_record_key
from ride_record import dict_to_ride_record
from settings import BOOTSTRAP_SERVERS, SCHEMA_REGISTRY_URL, \
    RIDE_KEY_SCHEMA_PATH, RIDE_VALUE_SCHEMA_PATH, KAFKA_TOPIC


class RideAvroConsumer:
    def __init__(self, props: Dict):

        # Schema Registry and Serializer-Deserializer Configurations
        key_schema_str = self.load_schema(props['schema.key'])
        value_schema_str = self.load_schema(props['schema.value'])
        schema_registry_props = {'url': props['schema_registry.url']}
        schema_registry_client = SchemaRegistryClient(schema_registry_props)
        self.avro_key_deserializer = AvroDeserializer(schema_registry_client=schema_registry_client,
                                                      schema_str=key_schema_str,
                                                      from_dict=dict_to_ride_record_key)
        self.avro_value_deserializer = AvroDeserializer(schema_registry_client=schema_registry_client,
                                                        schema_str=value_schema_str,
                                                        from_dict=dict_to_ride_record)

        consumer_props = {'bootstrap.servers': props['bootstrap.servers'],
                          'group.id': 'datatalkclubs.taxirides.avro.consumer.2',
                          'auto.offset.reset': "earliest"}
        self.consumer = Consumer(consumer_props)

    @staticmethod
    def load_schema(schema_path: str):
        path = os.path.realpath(os.path.dirname(__file__))
        with open(f"{path}/{schema_path}") as f:
            schema_str = f.read()
        return schema_str

    def consume_from_kafka(self, topics: List[str]):
        self.consumer.subscribe(topics=topics)
        while True:
            try:
                # SIGINT can't be handled when polling, limit timeout to 1 second.
                msg = self.consumer.poll(1.0)
                if msg is None:
                    continue
                key = self.avro_key_deserializer(msg.key(), SerializationContext(msg.topic(), MessageField.KEY))
                record = self.avro_value_deserializer(msg.value(),
                                                      SerializationContext(msg.topic(), MessageField.VALUE))
                if record is not None:
                    print("{}, {}".format(key, record))
            except KeyboardInterrupt:
                break

        self.consumer.close()


if __name__ == "__main__":
    config = {
        'bootstrap.servers': BOOTSTRAP_SERVERS,
        'schema_registry.url': SCHEMA_REGISTRY_URL,
        'schema.key': RIDE_KEY_SCHEMA_PATH,
        'schema.value': RIDE_VALUE_SCHEMA_PATH,
    }
    avro_consumer = RideAvroConsumer(props=config)
    avro_consumer.consume_from_kafka(topics=[KAFKA_TOPIC])


================================================
FILE: 07-streaming/extras/python/avro_example/producer.py
================================================
import os
import csv
from time import sleep
from typing import Dict

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

from ride_record_key import RideRecordKey, ride_record_key_to_dict
from ride_record import RideRecord, ride_record_to_dict
from settings import RIDE_KEY_SCHEMA_PATH, RIDE_VALUE_SCHEMA_PATH, \
    SCHEMA_REGISTRY_URL, BOOTSTRAP_SERVERS, INPUT_DATA_PATH, KAFKA_TOPIC


def delivery_report(err, msg):
    if err is not None:
        print("Delivery failed for record {}: {}".format(msg.key(), err))
        return
    print('Record {} successfully produced to {} [{}] at offset {}'.format(
        msg.key(), msg.topic(), msg.partition(), msg.offset()))


class RideAvroProducer:
    def __init__(self, props: Dict):
        # Schema Registry and Serializer-Deserializer Configurations
        key_schema_str = self.load_schema(props['schema.key'])
        value_schema_str = self.load_schema(props['schema.value'])
        schema_registry_props = {'url': props['schema_registry.url']}
        schema_registry_client = SchemaRegistryClient(schema_registry_props)
        self.key_serializer = AvroSerializer(schema_registry_client, key_schema_str, ride_record_key_to_dict)
        self.value_serializer = AvroSerializer(schema_registry_client, value_schema_str, ride_record_to_dict)

        # Producer Configuration
        producer_props = {'bootstrap.servers': props['bootstrap.servers']}
        self.producer = Producer(producer_props)

    @staticmethod
    def load_schema(schema_path: str):
        path = os.path.realpath(os.path.dirname(__file__))
        with open(f"{path}/{schema_path}") as f:
            schema_str = f.read()
        return schema_str

    @staticmethod
    def delivery_report(err, msg):
        if err is not None:
            print("Delivery failed for record {}: {}".format(msg.key(), err))
            return
        print('Record {} successfully produced to {} [{}] at offset {}'.format(
            msg.key(), msg.topic(), msg.partition(), msg.offset()))

    @staticmethod
    def read_records(resource_path: str):
        ride_records, ride_keys = [], []
        with open(resource_path, 'r') as f:
            reader = csv.reader(f)
            header = next(reader)  # skip the header
            for row in reader:
                ride_records.append(RideRecord(arr=[row[0], row[3], row[4], row[9], row[16]]))
                ride_keys.append(RideRecordKey(vendor_id=int(row[0])))
        return zip(ride_keys, ride_records)

    def publish(self, topic: str, records: [RideRecordKey, RideRecord]):
        for key_value in records:
            key, value = key_value
            try:
                self.producer.produce(topic=topic,
                                      key=self.key_serializer(key, SerializationContext(topic=topic,
                                                                                        field=MessageField.KEY)),
                                      value=self.value_serializer(value, SerializationContext(topic=topic,
                                                                                              field=MessageField.VALUE)),
                                      on_delivery=delivery_report)
            except KeyboardInterrupt:
                break
            except Exception as e:
                print(f"Exception while producing record - {value}: {e}")

        self.producer.flush()
        sleep(1)


if __name__ == "__main__":
    config = {
        'bootstrap.servers': BOOTSTRAP_SERVERS,
        'schema_registry.url': SCHEMA_REGISTRY_URL,
        'schema.key': RIDE_KEY_SCHEMA_PATH,
        'schema.value': RIDE_VALUE_SCHEMA_PATH
    }
    producer = RideAvroProducer(props=config)
    ride_records = producer.read_records(resource_path=INPUT_DATA_PATH)
    producer.publish(topic=KAFKA_TOPIC, records=ride_records)


================================================
FILE: 07-streaming/extras/python/avro_example/ride_record.py
================================================
from typing import List, Dict


class RideRecord:

    def __init__(self, arr: List[str]):
        self.vendor_id = int(arr[0])
        self.passenger_count = int(arr[1])
        self.trip_distance = float(arr[2])
        self.payment_type = int(arr[3])
        self.total_amount = float(arr[4])

    @classmethod
    def from_dict(cls, d: Dict):
        return cls(arr=[
            d['vendor_id'],
            d['passenger_count'],
            d['trip_distance'],
            d['payment_type'],
            d['total_amount']
        ]
        )

    def __repr__(self):
        return f'{self.__class__.__name__}: {self.__dict__}'


def dict_to_ride_record(obj, ctx):
    if obj is None:
        return None

    return RideRecord.from_dict(obj)


def ride_record_to_dict(ride_record: RideRecord, ctx):
    return ride_record.__dict__


================================================
FILE: 07-streaming/extras/python/avro_example/ride_record_key.py
================================================
from typing import Dict


class RideRecordKey:
    def __init__(self, vendor_id):
        self.vendor_id = vendor_id

    @classmethod
    def from_dict(cls, d: Dict):
        return cls(vendor_id=d['vendor_id'])

    def __repr__(self):
        return f'{self.__class__.__name__}: {self.__dict__}'


def dict_to_ride_record_key(obj, ctx):
    if obj is None:
        return None

    return RideRecordKey.from_dict(obj)


def ride_record_key_to_dict(ride_record_key: RideRecordKey, ctx):
    return ride_record_key.__dict__


================================================
FILE: 07-streaming/extras/python/avro_example/settings.py
================================================
INPUT_DATA_PATH = '../resources/rides.csv'

RIDE_KEY_SCHEMA_PATH = '../resources/schemas/taxi_ride_key.avsc'
RIDE_VALUE_SCHEMA_PATH = '../resources/schemas/taxi_ride_value.avsc'

SCHEMA_REGISTRY_URL = 'http://localhost:8081'
BOOTSTRAP_SERVERS = 'localhost:9092'
KAFKA_TOPIC = 'rides_avro'


================================================
FILE: 07-streaming/extras/python/docker/README.md
================================================

# Running Spark and Kafka Clusters on Docker

### 1. Build Required Images for running Spark

The details of how the Spark images are built in separate layers are described in 
the blog post written by André Perez on [Medium blog - Towards Data Science](https://towardsdatascience.com/apache-spark-cluster-on-docker-ft-a-juyterlab-interface-418383c95445)

```bash
# Build Spark Images
./build.sh 
```

### 2. Create Docker Network & Volume

```bash
# Create Network
docker network create kafka-spark-network

# Create Volume
docker volume create --name=hadoop-distributed-file-system
```

### 3. Run Services on Docker
```bash
# Start Docker Compose (run inside both the kafka and spark folders)
docker compose up -d
```
For an in-depth explanation of the listener configuration, see [Kafka Listeners](https://www.confluent.io/blog/kafka-listeners-explained/).
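
As a minimal sketch of what the listener setup means for clients: the compose files in this folder advertise `broker:29092` for clients on the Docker network and `localhost:9092` for clients on the host machine. The helper name `pick_bootstrap_servers` is hypothetical and only illustrates the choice a client has to make.

```python
# Addresses taken from KAFKA_ADVERTISED_LISTENERS in the compose files
INTERNAL_BOOTSTRAP = "broker:29092"    # reachable from containers on the Docker network
EXTERNAL_BOOTSTRAP = "localhost:9092"  # reachable from the host via the published port


def pick_bootstrap_servers(inside_docker_network: bool) -> str:
    """Return the bootstrap address that matches where the client runs (hypothetical helper)."""
    return INTERNAL_BOOTSTRAP if inside_docker_network else EXTERNAL_BOOTSTRAP


# A script started from your laptop's terminal
host_address = pick_bootstrap_servers(inside_docker_network=False)
# A Spark job or notebook running in a container on kafka-spark-network
container_address = pick_bootstrap_servers(inside_docker_network=True)
```

Using the wrong address is a common cause of clients that connect once and then time out, because the broker hands back an advertised address the client cannot resolve.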

### 4. Stop Services on Docker
```bash
# Stop Docker Compose (run inside both the kafka and spark folders)
docker compose down
```

### 5. Helpful Commands
```bash
# Delete all Containers
docker rm -f $(docker ps -a -q)

# Delete all volumes
docker volume rm $(docker volume ls -q)
```


================================================
FILE: 07-streaming/extras/python/docker/docker-compose.yml
================================================
version: "3.6"
volumes:
  shared-workspace:
    name: "hadoop-distributed-file-system"
    driver: local
services:
  jupyterlab:
    image: jupyterlab
    container_name: jupyterlab
    ports:
      - 8888:8888
    volumes:
      - shared-workspace:/opt/workspace
  spark-master:
    image: spark-master
    container_name: spark-master
    environment:
      SPARK_LOCAL_IP: 'spark-master'
    ports:
      - 8080:8080
      - 7077:7077
    volumes:
      - shared-workspace:/opt/workspace
  spark-worker-1:
    image: spark-worker
    container_name: spark-worker-1
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=4g
    ports:
      - 8083:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
  spark-worker-2:
    image: spark-worker
    container_name: spark-worker-2
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=4g
    ports:
      - 8082:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master

  broker:
    image: confluentinc/cp-kafka:7.2.0
    hostname: broker
    container_name: broker
    depends_on:
      - zookeeper
    ports:
      - '9092:9092'
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      # KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      # KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:9092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_BOB:PLAINTEXT,LISTENER_FRED:PLAINTEXT
      KAFKA_LISTENERS: LISTENER_BOB://broker:29092,LISTENER_FRED://broker:9092
      KAFKA_ADVERTISED_LISTENERS: LISTENER_BOB://broker:29092,LISTENER_FRED://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_BOB
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
  schema-registry:
    image: confluentinc/cp-schema-registry:7.2.0
    hostname: schema-registry
    container_name: schema-registry
    depends_on:
      - zookeeper
      - broker
    ports:
      - "8081:8081"
    environment:
      # SCHEMA_REGISTRY_HOST_NAME: schema-registry # used for intercommunication
      # SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: "zookeeper:2181" #(deprecated)
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: "broker:29092"
      SCHEMA_REGISTRY_HOST_NAME: "localhost"
      SCHEMA_REGISTRY_LISTENERS: "http://0.0.0.0:8081" #(default: http://0.0.0.0:8081)
  zookeeper:
    image: confluentinc/cp-zookeeper:7.2.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - '2181:2181'
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  control-center:
    image: confluentinc/cp-enterprise-control-center:7.2.0
    hostname: control-center
    container_name: control-center
    depends_on:
      - zookeeper
      - broker
      - schema-registry
    ports:
      - "9021:9021"
    environment:
      CONTROL_CENTER_BOOTSTRAP_SERVERS: 'broker:29092'
      CONTROL_CENTER_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      # Inside the container, localhost is the control-center itself, so use the service hostname
      CONTROL_CENTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      CONTROL_CENTER_REPLICATION_FACTOR: 1
      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1
      CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1
      CONFLUENT_METRICS_TOPIC_REPLICATION: 1
      PORT: 9021


================================================
FILE: 07-streaming/extras/python/docker/kafka/docker-compose.yml
================================================
version: '3.6'
networks:
  default:
    name: kafka-spark-network
    external: true
services:
  broker:
    image: confluentinc/cp-kafka:7.2.0
    hostname: broker
    container_name: broker
    depends_on:
      - zookeeper
    ports:
      - '9092:9092'
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://broker:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
  schema-registry:
    image: confluentinc/cp-schema-registry:7.2.0
    hostname: schema-registry
    container_name: schema-registry
    depends_on:
      - zookeeper
      - broker
    ports:
      - "8081:8081"
    environment:
      # SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: "zookeeper:2181" #(deprecated)
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: "broker:29092"
      SCHEMA_REGISTRY_HOST_NAME: "localhost"
      SCHEMA_REGISTRY_LISTENERS: "http://0.0.0.0:8081" #(default: http://0.0.0.0:8081)
  zookeeper:
    image: confluentinc/cp-zookeeper:7.2.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - '2181:2181'
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  control-center:
    image: confluentinc/cp-enterprise-control-center:7.2.0
    hostname: control-center
    container_name: control-center
    depends_on:
      - zookeeper
      - broker
      - schema-registry
    ports:
      - "9021:9021"
    environment:
      CONTROL_CENTER_BOOTSTRAP_SERVERS: 'broker:29092'
      CONTROL_CENTER_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      # Inside the container, localhost is the control-center itself, so use the service hostname
      CONTROL_CENTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
      CONTROL_CENTER_REPLICATION_FACTOR: 1
      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1
      CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1
      CONFLUENT_METRICS_TOPIC_REPLICATION: 1
      PORT: 9021

  kafka-rest:
    image: confluentinc/cp-kafka-rest:7.2.0
    hostname: kafka-rest
    ports:
      - "8082:8082"
    depends_on:
      - schema-registry
      - broker
    environment:
      KAFKA_REST_BOOTSTRAP_SERVERS: 'broker:29092'
      KAFKA_REST_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      # Inside the container, localhost is kafka-rest itself, so use the service hostname
      KAFKA_REST_SCHEMA_REGISTRY_URL: 'http://schema-registry:8081'
      KAFKA_REST_HOST_NAME: localhost
      KAFKA_REST_LISTENERS: 'http://0.0.0.0:8082'

================================================
FILE: 07-streaming/extras/python/docker/spark/build.sh
================================================
# -- Software Stack Version

SPARK_VERSION="3.3.1"
HADOOP_VERSION="3"
JUPYTERLAB_VERSION="3.6.1"

# -- Building the Images

docker build \
  -f cluster-base.Dockerfile \
  -t cluster-base .

docker build \
  --build-arg spark_version="${SPARK_VERSION}" \
  --build-arg hadoop_version="${HADOOP_VERSION}" \
  -f spark-base.Dockerfile \
  -t spark-base .

docker build \
  -f spark-master.Dockerfile \
  -t spark-master .

docker build \
  -f spark-worker.Dockerfile \
  -t spark-worker .

docker build \
  --build-arg spark_version="${SPARK_VERSION}" \
  --build-arg jupyterlab_version="${JUPYTERLAB_VERSION}" \
  -f jupyterlab.Dockerfile \
  -t jupyterlab .


================================================
FILE: 07-streaming/extras/python/docker/spark/cluster-base.Dockerfile
================================================
# Based on the official Apache Spark repository Dockerfile for Kubernetes
# https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
ARG java_image_tag=17-jre
FROM eclipse-temurin:${java_image_tag}

# -- Layer: OS + Python

ARG shared_workspace=/opt/workspace

RUN mkdir -p ${shared_workspace} && \
    apt-get update -y && \
    apt-get install -y python3 && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    rm -rf /var/lib/apt/lists/*

ENV SHARED_WORKSPACE=${shared_workspace}

# -- Runtime

VOLUME ${shared_workspace}
CMD ["bash"]

================================================
FILE: 07-streaming/extras/python/docker/spark/docker-compose.yml
================================================
version: "3.6"
volumes:
  shared-workspace:
    name: "hadoop-distributed-file-system"
    driver: local
networks:
  default:
    name: kafka-spark-network
    external: true

services:
  jupyterlab:
    image: jupyterlab
    container_name: jupyterlab
    ports:
      - 8888:8888
    volumes:
      - shared-workspace:/opt/workspace
  spark-master:
    image: spark-master
    container_name: spark-master
    environment:
      SPARK_LOCAL_IP: 'spark-master'
    ports:
      - 8080:8080
      - 7077:7077
    volumes:
      - shared-workspace:/opt/workspace
  spark-worker-1:
    image: spark-worker
    container_name: spark-worker-1
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=4g
    ports:
      - 8083:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
  spark-worker-2:
    image: spark-worker
    container_name: spark-worker-2
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=4g
    ports:
      - 8084:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master


================================================
FILE: 07-streaming/extras/python/docker/spark/jupyterlab.Dockerfile
================================================
FROM cluster-base

# -- Layer: JupyterLab

ARG spark_version=3.3.1
ARG jupyterlab_version=3.6.1

RUN apt-get update -y && \
    apt-get install -y python3-pip && \
    pip3 install wget pyspark==${spark_version} jupyterlab==${jupyterlab_version}

# -- Runtime

EXPOSE 8888
WORKDIR ${SHARED_WORKSPACE}
CMD jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token=


================================================
FILE: 07-streaming/extras/python/docker/spark/spark-base.Dockerfile
================================================
FROM cluster-base

# -- Layer: Apache Spark

ARG spark_version=3.3.1
ARG hadoop_version=3

RUN apt-get update -y && \
    apt-get install -y curl && \
    curl https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-hadoop${hadoop_version}.tgz -o spark.tgz && \
    tar -xf spark.tgz && \
    mv spark-${spark_version}-bin-hadoop${hadoop_version} /usr/bin/ && \
    mkdir /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/logs && \
    rm spark.tgz

ENV SPARK_HOME=/usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}
ENV SPARK_MASTER_HOST=spark-master
ENV SPARK_MASTER_PORT=7077
ENV PYSPARK_PYTHON=python3

# -- Runtime

WORKDIR ${SPARK_HOME}

================================================
FILE: 07-streaming/extras/python/docker/spark/spark-master.Dockerfile
================================================
FROM spark-base

# -- Runtime

ARG spark_master_web_ui=8080

EXPOSE ${spark_master_web_ui} ${SPARK_MASTER_PORT}
CMD bin/spark-class org.apache.spark.deploy.master.Master >> logs/spark-master.out

================================================
FILE: 07-streaming/extras/python/docker/spark/spark-worker.Dockerfile
================================================
FROM spark-base

# -- Runtime

ARG spark_worker_web_ui=8081

EXPOSE ${spark_worker_web_ui}
CMD bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} >> logs/spark-worker.out


================================================
FILE: 07-streaming/extras/python/json_example/consumer.py
================================================
from typing import Dict, List
from json import loads
from kafka import KafkaConsumer

from ride import Ride
from settings import BOOTSTRAP_SERVERS, KAFKA_TOPIC


class JsonConsumer:
    def __init__(self, props: Dict):
        self.consumer = KafkaConsumer(**props)

    def consume_from_kafka(self, topics: List[str]):
        self.consumer.subscribe(topics)
        print('Consuming from Kafka started')
        print('Available topics to consume: ', self.consumer.subscription())
        while True:
            try:
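                # Poll with a 1-second timeout so that KeyboardInterrupt (SIGINT) is handled promptly.
                # kafka-python's poll() takes its timeout in milliseconds and returns a dict of records.
                message = self.consumer.poll(timeout_ms=1000)
                if not message: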
                    continue
                for message_key, message_value in message.items():
                    for msg_val in message_value:
                        print(msg_val.key, msg_val.value)
            except KeyboardInterrupt:
                break

        self.consumer.close()


if __name__ == '__main__':
    config = {
        'bootstrap_servers': BOOTSTRAP_SERVERS,
        'auto_offset_reset': 'earliest',
        'enable_auto_commit': True,
        'key_deserializer': lambda key: int(key.decode('utf-8')),
        'value_deserializer': lambda x: loads(x.decode('utf-8'), object_hook=lambda d: Ride.from_dict(d)),
        'group_id': 'consumer.group.id.json-example.1',
    }

    json_consumer = JsonConsumer(props=config)
    json_consumer.consume_from_kafka(topics=[KAFKA_TOPIC])


================================================
FILE: 07-streaming/extras/python/json_example/producer.py
================================================
import csv
import json
from typing import List, Dict
from kafka import KafkaProducer
from kafka.errors import KafkaTimeoutError

from ride import Ride
from settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, KAFKA_TOPIC


class JsonProducer:
    def __init__(self, props: Dict):
        self.producer = KafkaProducer(**props)

    @staticmethod
    def read_records(resource_path: str):
        records = []
        with open(resource_path, 'r') as f:
            reader = csv.reader(f)
            header = next(reader)  # skip the header row
            for row in reader:
                records.append(Ride(arr=row))
        return records

    def publish_rides(self, topic: str, messages: List[Ride]):
        for ride in messages:
            try:
                record = self.producer.send(topic=topic, key=ride.pu_location_id, value=ride)
                print('Record {} successfully produced at offset {}'.format(ride.pu_location_id, record.get().offset))
            except KafkaTimeoutError as e:
                print(e)


if __name__ == '__main__':
    # Config Should match with the KafkaProducer expectation
    config = {
        'bootstrap_servers': BOOTSTRAP_SERVERS,
        'key_serializer': lambda key: str(key).encode(),
        'value_serializer': lambda x: json.dumps(x.__dict__, default=str).encode('utf-8')
    }
    producer = JsonProducer(props=config)
    rides = producer.read_records(resource_path=INPUT_DATA_PATH)
    producer.publish_rides(topic=KAFKA_TOPIC, messages=rides)


================================================
FILE: 07-streaming/extras/python/json_example/ride.py
================================================
from typing import List, Dict
from decimal import Decimal
from datetime import datetime


class Ride:
    def __init__(self, arr: List[str]):
        self.vendor_id = arr[0]
        self.tpep_pickup_datetime = datetime.strptime(arr[1], "%Y-%m-%d %H:%M:%S")
        self.tpep_dropoff_datetime = datetime.strptime(arr[2], "%Y-%m-%d %H:%M:%S")
        self.passenger_count = int(arr[3])
        self.trip_distance = Decimal(arr[4])
        self.rate_code_id = int(arr[5])
        self.store_and_fwd_flag = arr[6]
        self.pu_location_id = int(arr[7])
        self.do_location_id = int(arr[8])
        self.payment_type = arr[9]
        self.fare_amount = Decimal(arr[10])
        self.extra = Decimal(arr[11])
        self.mta_tax = Decimal(arr[12])
        self.tip_amount = Decimal(arr[13])
        self.tolls_amount = Decimal(arr[14])
        self.improvement_surcharge = Decimal(arr[15])
        self.total_amount = Decimal(arr[16])
        self.congestion_surcharge = Decimal(arr[17])

    @classmethod
    def from_dict(cls, d: Dict):
        # The producer serializes with json.dumps(..., default=str),
        # so datetimes and Decimals arrive here as plain strings.
        return cls(arr=[
            d['vendor_id'],
            d['tpep_pickup_datetime'],
            d['tpep_dropoff_datetime'],
            d['passenger_count'],
            d['trip_distance'],
            d['rate_code_id'],
            d['store_and_fwd_flag'],
            d['pu_location_id'],
            d['do_location_id'],
            d['payment_type'],
            d['fare_amount'],
            d['extra'],
            d['mta_tax'],
            d['tip_amount'],
            d['tolls_amount'],
            d['improvement_surcharge'],
            d['total_amount'],
            d['congestion_surcharge'],
        ]
        )

    def __repr__(self):
        return f'{self.__class__.__name__}: {self.__dict__}'


================================================
FILE: 07-streaming/extras/python/json_example/settings.py
================================================
INPUT_DATA_PATH = '../resources/rides.csv'

BOOTSTRAP_SERVERS = ['localhost:9092']
KAFKA_TOPIC = 'rides_json'


================================================
FILE: 07-streaming/extras/python/redpanda_example/README.md
================================================
# Basic PubSub example with Redpanda

This module aims to give you a solid grasp of the following Kafka/Redpanda concepts, so that you can submit a capstone project that uses streaming:
- clusters
- brokers
- topics
- producers
- consumers and consumer groups
- data serialization and deserialization
- replication and retention
- offsets

## 1. Pre-requisites

If you have been following the [module-07](./../../../README.md) videos, you might already have installed the `kafka-python` library, so you can move on to the [Docker](#2-docker) section.

If you have not, this is the only package you need to install in your virtual environment for this Redpanda lesson.

1. Activate your environment
2. `pip install kafka-python`

## 2. Docker

Start a Redpanda cluster. Redpanda ships as a single binary, so it is an easy way to start learning Kafka concepts.

```bash
cd 07-streaming/extras/python/redpanda_example/
docker-compose up -d
```

## 3. Set RPK alias

`rpk` (short for `Redpanda Keeper`) is the CLI tool that ships with Redpanda and is already available in the Docker image.

Set the following `rpk` alias so that we can run it directly from our terminal, without having to open an interactive shell inside the container.

```bash
alias rpk="docker exec -ti redpanda-1 rpk"
rpk version
```

At the time of writing, the version is shown as `v23.2.26 (rev 328d83a06e)`. The important version number is the major one, `v23`, following the versioning scheme `major.minor[.build[.revision]]`. Matching it should give you the same results as those shared in this document.
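
If you want to check the major version from a script rather than by eye, the output of `rpk version` can be parsed with a few lines of Python. This is only a sketch: the helper name `parse_rpk_version` and the constant `EXPECTED_MAJOR` are made up for illustration, and the regular expression assumes the output contains a `vMAJOR.MINOR.PATCH` token like the one quoted above.

```python
import re

EXPECTED_MAJOR = 23  # major version this document was written against


def parse_rpk_version(output: str) -> tuple:
    """Extract (major, minor, patch) from text such as 'v23.2.26 (rev 328d83a06e)'."""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version found in: {output!r}")
    return tuple(int(part) for part in match.groups())


version = parse_rpk_version("v23.2.26 (rev 328d83a06e)")
print(version, "major matches:", version[0] == EXPECTED_MAJOR)
```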

> [!TIP]
> If you're reading this after March 2024 and want to update the Docker Compose file to use the latest Redpanda image, visit [Docker Hub](https://hub.docker.com/r/vectorized/redpanda/tags) and paste in the new version number.


## 4. Kafka Producer - Consumer Examples

To run the producer-consumer examples, open two terminal tabs side by side and run the following commands. Be sure to activate your virtual environment in each terminal.

```bash
# Start consumer script, in 1st terminal tab
python consumer.py
# Start producer script, in 2nd terminal tab
python producer.py
```

Run the `python producer.py` command again (and again) to see the `consumer` tab pick up the new messages in real time as new `events` occur.
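
Kafka and Redpanda only move bytes, which is why `producer.py` and `consumer.py` configure key and value (de)serializers. The sketch below reproduces that round trip without a broker, so you can see what actually travels over the wire. The function names and the sample `ride` dictionary are illustrative assumptions; the lambdas in the example scripts follow the same idea.

```python
import json
from decimal import Decimal


def serialize_key(key) -> bytes:
    # Keys are sent as the UTF-8 bytes of their string form.
    return str(key).encode("utf-8")


def deserialize_key(raw: bytes) -> int:
    return int(raw.decode("utf-8"))


def serialize_value(record: dict) -> bytes:
    # default=str converts values json cannot encode natively (e.g. Decimal) to strings.
    return json.dumps(record, default=str).encode("utf-8")


def deserialize_value(raw: bytes) -> dict:
    return json.loads(raw.decode("utf-8"))


# Made-up sample record, not taken from rides.csv.
ride = {"pu_location_id": 142, "passenger_count": 1, "total_amount": Decimal("12.30")}

key_bytes = serialize_key(ride["pu_location_id"])
value_bytes = serialize_value(ride)

print(deserialize_key(key_bytes))
print(deserialize_value(value_bytes))
```

Note that `total_amount` comes back as a string rather than a `Decimal`: JSON has no decimal type, so the consumer has to convert it again, which is what `Ride.from_dict` does.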

## 5. Redpanda UI

You can also inspect the cluster, topics, and more in the Redpanda Console UI in your browser at [http://localhost:8080](http://localhost:8080).


## 6. rpk commands glossary

Visit [get-started-rpk blog post](https://redpanda.com/blog/get-started-rpk-manage-streaming-data-clusters) for more.

```bash
# set alias for rpk
alias rpk="docker exec -ti redpanda-1 rpk"

# get info on cluster
rpk cluster info

# create topic_name with m partitions and n replication factor
rpk topic create [topic_name] --partitions m --replicas n

# get list of available topics, without extra details and with details
rpk topic list
rpk topic list --detailed

# inspect topic config
rpk topic describe [topic_name]

# consume [topic_name]
rpk topic consume [topic_name]

# list the consumer groups in a Redpanda cluster
rpk group list

# get additional information about a consumer group, from above listed result
rpk group describe my-group
```
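
The shell alias only exists in your interactive terminal. To call the same commands from a Python script, one option is a thin wrapper around `docker exec`. This is a sketch under assumptions: `rpk_command` and `run_rpk` are hypothetical helper names, the container name is taken from `docker-compose.yaml`, and the `-ti` flags from the alias are dropped because a script usually has no TTY to attach.

```python
import subprocess
from typing import List

CONTAINER = "redpanda-1"  # container name from docker-compose.yaml


def rpk_command(*args: str) -> List[str]:
    """Build the argument list for running rpk inside the container."""
    return ["docker", "exec", CONTAINER, "rpk", *args]


def run_rpk(*args: str) -> str:
    """Run rpk and return its stdout; raises CalledProcessError if rpk fails."""
    result = subprocess.run(rpk_command(*args), capture_output=True, text=True, check=True)
    return result.stdout


# Show the command that would be executed, without requiring Docker to be running.
print(" ".join(rpk_command("topic", "list")))
```

With the cluster up, `run_rpk("topic", "list")` would return the same text that `rpk topic list` prints in the terminal.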

## 7. Additional Resources

Redpanda University (requires a Redpanda account; enrolling in and taking the courses is free):
- [RP101: Getting Started with Redpanda](https://university.redpanda.com/courses/hands-on-redpanda-getting-started)
- [RP102: Stream Processing with Redpanda](https://university.redpanda.com/courses/take/hands-on-redpanda-stream-processing/lessons/37830192-intro)
- [SF101: Streaming Fundamentals](https://university.redpanda.com/courses/streaming-fundamentals)
- [SF102: Kafka building blocks](https://university.redpanda.com/courses/kafka-building-blocks)

If you feel that you already have a solid foundation in streaming and Kafka, feel free to skip these supplementary courses.


================================================
FILE: 07-streaming/extras/python/redpanda_example/consumer.py
================================================
from typing import Dict, List
from json import loads
from kafka import KafkaConsumer

from ride import Ride
from settings import BOOTSTRAP_SERVERS, KAFKA_TOPIC


class JsonConsumer:
    def __init__(self, props: Dict):
        self.consumer = KafkaConsumer(**props)

    def consume_from_kafka(self, topics: List[str]):
        self.consumer.subscribe(topics)
        print('Consuming from Kafka started')
        print('Available topics to consume: ', self.consumer.subscription())
        while True:
            try:
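                # Poll with a 1-second timeout so that KeyboardInterrupt (SIGINT) is handled promptly.
                # kafka-python's poll() takes its timeout in milliseconds and returns a dict of records.
                message = self.consumer.poll(timeout_ms=1000)
                if not message: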
                    continue
                for message_key, message_value in message.items():
                    for msg_val in message_value:
                        print(msg_val.key, msg_val.value)
            except KeyboardInterrupt:
                break

        self.consumer.close()


if __name__ == '__main__':
    config = {
        'bootstrap_servers': BOOTSTRAP_SERVERS,
        'auto_offset_reset': 'earliest',
        'enable_auto_commit': True,
        'key_deserializer': lambda key: int(key.decode('utf-8')),
        'value_deserializer': lambda x: loads(x.decode('utf-8'), object_hook=lambda d: Ride.from_dict(d)),
        'group_id': 'consumer.group.id.json-example.1',
    }

    json_consumer = JsonConsumer(props=config)
    json_consumer.consume_from_kafka(topics=[KAFKA_TOPIC])


# There's no schema in JSON format, so if the schema changes (a column is removed, a new one is added, or a data type is changed), the Ride class would still work and producing/consuming messages would still run without a hitch.
# The issue shows up in the downstream analytics: the dataset would no longer have that column and the dashboards would fail. As a result, trust in our data and processes would erode.

================================================
FILE: 07-streaming/extras/python/redpanda_example/docker-compose.yaml
================================================
version: '3.7'
services:
  # Redpanda cluster
  redpanda-1:
    image: docker.redpanda.com/redpandadata/redpanda:v23.2.26
    container_name: redpanda-1
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda-1:33145
    ports:
      # - 8081:8081
      - 8082:8082
      - 9092:9092
      - 9644:9644
      - 28082:28082
      - 29092:29092

  # Want a two node Redpanda cluster? Uncomment this block :)
  # redpanda-2:
  #   image: docker.redpanda.com/redpandadata/redpanda:v23.1.1
  #   container_name: redpanda-2
  #   command:
  #     - redpanda
  #     - start
  #     - --smp
  #     - '1'
  #     - --reserve-memory
  #     - 0M
  #     - --overprovisioned
  #     - --node-id
  #     - '2'
  #     - --seeds
  #     - redpanda-1:33145
  #     - --kafka-addr
  #     - PLAINTEXT://0.0.0.0:29093,OUTSIDE://0.0.0.0:9093
  #     - --advertise-kafka-addr
  #     - PLAINTEXT://redpanda-2:29093,OUTSIDE://localhost:9093
  #     - --pandaproxy-addr
  #     - PLAINTEXT://0.0.0.0:28083,OUTSIDE://0.0.0.0:8083
  #     - --advertise-pandaproxy-addr
  #     - PLAINTEXT://redpanda-2:28083,OUTSIDE://localhost:8083
  #     - --rpc-addr
  #     - 0.0.0.0:33146
  #     - --advertise-rpc-addr
  #     - redpanda-2:33146
  #   ports:
  #     - 8083:8083
  #     - 9093:9093

  redpanda-console:
    image: docker.redpanda.com/redpandadata/console:v2.2.2
    container_name: redpanda-console
    entrypoint: /bin/sh
    command: -c "echo \"$$CONSOLE_CONFIG_FILE\" > /tmp/config.yml; /app/console"
    environment:
      CONFIG_FILEPATH: /tmp/config.yml
      CONSOLE_CONFIG_FILE: |
        kafka:
          brokers: ["redpanda-1:29092"]
          schemaRegistry:
            enabled: false
        redpanda:
          adminApi:
            enabled: true
            urls: ["http://redpanda-1:9644"]
        connect:
          enabled: false
    ports:
      - 8080:8080
    depends_on:
      - redpanda-1


================================================
FILE: 07-streaming/extras/python/redpanda_example/producer.py
================================================
import csv
import json
from typing import List, Dict
from kafka import KafkaProducer
from kafka.errors import KafkaTimeoutError

from ride import Ride
from settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, KAFKA_TOPIC


class JsonProducer:
    def __init__(self, props: Dict):
        self.producer = KafkaProducer(**props)

    @staticmethod
    def read_records(resource_path: str):
        records = []
        with open(resource_path, 'r') as f:
            reader = csv.reader(f)
            header = next(reader)  # skip the header row
            for row in reader:
                records.append(Ride(arr=row))
        return records

    def publish_rides(self, topic: str, messages: List[Ride]):
        for ride in messages:
            try:
                record = self.producer.send(topic=topic, key=ride.pu_location_id, value=ride)
                print('Record {} successfully produced at offset {}'.format(ride.pu_location_id, record.get().offset))
            except KafkaTimeoutError as e:
                print(e)


if __name__ == '__main__':
    # Config Should match with the KafkaProducer expectation
    # kafka expects binary format for the key-value pair
    config = {
        'bootstrap_servers': BOOTSTRAP_SERVERS,
        'key_serializer': lambda key: str(key).encode(),
        'value_serializer': lambda x: json.dumps(x.__dict__, default=str).encode('utf-8')
    }
    producer = JsonProducer(props=config)
    rides = producer.read_records(resource_path=INPUT_DATA_PATH)
    producer.publish_rides(topic=KAFKA_TOPIC, messages=rides)


================================================
FILE: 07-streaming/extras/python/redpanda_example/ride.py
================================================
from typing import List, Dict
from decimal import Decimal
from datetime import datetime


class Ride:
    def __init__(self, arr: List[str]):
        self.vendor_id = arr[0]
        self.tpep_pickup_datetime = datetime.strptime(arr[1], "%Y-%m-%d %H:%M:%S")
        self.tpep_dropoff_datetime = datetime.strptime(arr[2], "%Y-%m-%d %H:%M:%S")
        self.passenger_count = int(arr[3])
        self.trip_distance = Decimal(arr[4])
        self.rate_code_id = int(arr[5])
        self.store_and_fwd_flag = arr[6]
        self.pu_location_id = int(arr[7])
        self.do_location_id = int(arr[8])
        self.payment_type = arr[9]
        self.fare_amount = Decimal(arr[10])
        self.extra = Decimal(arr[11])
        self.mta_tax = Decimal(arr[12])
        self.tip_amount = Decimal(arr[13])
        self.tolls_amount = Decimal(arr[14])
        self.improvement_surcharge = Decimal(arr[15])
        self.total_amount = Decimal(arr[16])
        self.congestion_surcharge = Decimal(arr[17])

    @classmethod
    def from_dict(cls, d: Dict):
        # The producer serializes with json.dumps(..., default=str),
        # so datetimes and Decimals arrive here as plain strings.
        return cls(arr=[
            d['vendor_id'],
            d['tpep_pickup_datetime'],
            d['tpep_dropoff_datetime'],
            d['passenger_count'],
            d['trip_distance'],
            d['rate_code_id'],
            d['store_and_fwd_flag'],
            d['pu_location_id'],
            d['do_location_id'],
            d['payment_type'],
            d['fare_amount'],
            d['extra'],
            d['mta_tax'],
            d['tip_amount'],
            d['tolls_amount'],
            d['improvement_surcharge'],
            d['total_amount'],
            d['congestion_surcharge'],
        ]
        )

    def __repr__(self):
        return f'{self.__class__.__name__}: {self.__dict__}'


================================================
FILE: 07-streaming/extras/python/redpanda_example/settings.py
================================================
INPUT_DATA_PATH = '../resources/rides.csv'

BOOTSTRAP_SERVERS = ['localhost:9092']
KAFKA_TOPIC = 'rides_json'


================================================
FILE: 07-streaming/extras/python/requirements.txt
================================================
kafka-python==1.4.6
confluent_kafka
requests
avro
faust
fastavro


================================================
FILE: 07-streaming/extras/python/resources/schemas/taxi_ride_key.avsc
================================================
{
  "namespace": "com.datatalksclub.taxi",
  "type": "record",
  "name": "RideRecordKey",
  "fields": [
    {
      "name": "vendor_id",
      "type": "int"
    }
  ]
}

================================================
FILE: 07-streaming/extras/python/resources/schemas/taxi_ride_value.avsc
================================================
{
  "namespace": "com.datatalksclub.taxi",
  "type": "record",
  "name": "RideRecord",
  "fields": [
    {
      "name": "vendor_id",
      "type": "int"
    },
    {
      "name": "passenger_count",
      "type": "int"
    },
    {
      "name": "trip_distance",
      "type": "float"
    },
    {
      "name": "payment_type",
      "type": "int"
    },
    {
      "name": "total_amount",
      "type": "float"
    }
  ]
}

================================================
FILE: 07-streaming/extras/python/streams-example/faust/branch_price.py
================================================
import faust
from taxi_rides import TaxiRide
from faust import current_event

app = faust.App('datatalksclub.stream.v3', broker='kafka://localhost:9092', consumer_auto_offset_reset="earliest")
topic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide)

high_amount_rides = app.topic('datatalks.yellow_taxi_rides.high_amount')
low_amount_rides = app.topic('datatalks.yellow_taxi_rides.low_amount')


@app.agent(topic)
async def process(stream):
    async for event in stream:
        if event.total_amount >= 40.0:
            await current_event().forward(high_amount_rides)
        else:
            await current_event().forward(low_amount_rides)

if __name__ == '__main__':
    app.main()


================================================
FILE: 07-streaming/extras/python/streams-example/faust/producer_taxi_json.py
================================================
import csv
from json import dumps
from kafka import KafkaProducer
from time import sleep


producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         key_serializer=lambda x: dumps(x).encode('utf-8'),
                         value_serializer=lambda x: dumps(x).encode('utf-8'))

# Use a context manager so the CSV file is closed once all rows have been sent.
with open('../../resources/rides.csv') as file:
    csvreader = csv.reader(file)
    next(csvreader)  # skip the header row
    for row in csvreader:
        key = {"vendorId": int(row[0])}
        value = {"vendorId": int(row[0]), "passenger_count": int(row[3]), "trip_distance": float(row[4]), "payment_type": int(row[9]), "total_amount": float(row[16])}
        producer.send('datatalkclub.yellow_taxi_ride.json', value=value, key=key)
        print("producing")
        sleep(1)

# Make sure any buffered messages are delivered before the script exits.
producer.flush()

================================================
FILE: 07-streaming/extras/python/streams-example/faust/stream.py
================================================
import faust
from taxi_rides import TaxiRide


app = faust.App('datatalksclub.stream.v2', broker='kafka://localhost:9092')
topic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide)


@app.agent(topic)
async def start_reading(records):
    async for record in records:
        print(record)


if __name__ == '__main__':
    app.main()


================================================
FILE: 07-streaming/extras/python/streams-example/faust/stream_count_vendor_trips.py
================================================
import faust
from taxi_rides import TaxiRide


app = faust.App('datatalksclub.stream.v2', broker='kafka://localhost:9092')
topic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide)

vendor_rides = app.Table('vendor_rides', default=int)


@app.agent(topic)
async def process(stream):
    async for event in stream.group_by(TaxiRide.vendorId):
        vendor_rides[event.vendorId] += 1

if __name__ == '__main__':
    app.main()


================================================
FILE: 07-streaming/extras/python/streams-example/faust/taxi_rides.py
================================================
import faust


class TaxiRide(faust.Record, validation=True):
    vendorId: str
    passenger_count: int
    trip_distance: float
    payment_type: int
    total_amount: float


================================================
FILE: 07-streaming/extras/python/streams-example/faust/windowing.py
================================================
from datetime import timedelta
import faust
from taxi_rides import TaxiRide


app = faust.App('datatalksclub.stream.v2', broker='kafka://localhost:9092')
topic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide)

vendor_rides = app.Table('vendor_rides_windowed', default=int).tumbling(
    timedelta(minutes=1),
    expires=timedelta(hours=1),
)


@app.agent(topic)
async def process(stream):
    async for event in stream.group_by(TaxiRide.vendorId):
        vendor_rides[event.vendorId] += 1


if __name__ == '__main__':
    app.main()


================================================
FILE: 07-streaming/extras/python/streams-example/pyspark/README.md
================================================

# Running PySpark Streaming 

### Prerequisite

Ensure your Kafka and Spark services are up and running by following the [docker setup readme](./../../docker/README.md).
The network and volume must be created as described in that document, so check that both exist:

```bash
docker volume ls # should list hadoop-distributed-file-system
docker network ls # should list kafka-spark-network 
```


### Running Producer and Consumer
```bash
# Run producer
python3 producer.py

# Run consumer with default settings
python3 consumer.py
# Run consumer for specific topic
python3 consumer.py --topic <topic-name>
```
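
For orientation, `producer.py` sends each ride as a plain comma-separated string keyed by the vendor id, keeping seven of the CSV columns. The sketch below mirrors that layout without needing a broker and shows how such a value can be split back into named fields. The helper names `to_message` and `parse_value` and the sample row are assumptions for illustration; the field names follow `RIDE_SCHEMA` in `settings.py`.

```python
from typing import Dict, List, Tuple

# Column positions of rides.csv that producer.py keeps, in order.
KEPT_COLUMNS = [0, 1, 2, 3, 4, 9, 16]
FIELD_NAMES = [
    "vendor_id", "tpep_pickup_datetime", "tpep_dropoff_datetime",
    "passenger_count", "trip_distance", "payment_type", "total_amount",
]


def to_message(row: List[str]) -> Tuple[str, str]:
    """Build the (key, value) pair for one CSV row, the way producer.py does."""
    key = row[0]
    value = ", ".join(row[i] for i in KEPT_COLUMNS)
    return key, value


def parse_value(value: str) -> Dict[str, str]:
    """Split a message value back into named (still string-typed) fields."""
    return dict(zip(FIELD_NAMES, value.split(", ")))


# Made-up sample row with the same column layout as rides.csv.
sample_row = ["1", "2020-07-01 00:25:32", "2020-07-01 00:33:39", "1", "1.50", "1", "N",
              "238", "75", "2", "8", "0.5", "0.5", "0", "0", "0.3", "9.3", "0"]

key, value = to_message(sample_row)
print(key, "->", value)
print(parse_value(value))
```

All fields are still strings at this point; casting them to the types in `RIDE_SCHEMA` is left to the streaming job.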

### Running Streaming Script

The `spark-submit.sh` script passes the required Kafka and Avro packages to `spark-submit` (via `--packages`), so the necessary jars are resolved before `streaming.py` runs.

```bash
./spark-submit.sh streaming.py 
```

### Additional Resources
- [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streaming-programming-guide)
- [Structured Streaming + Kafka Integration](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#structured-streaming-kafka-integration-guide-kafka-broker-versio)


================================================
FILE: 07-streaming/extras/python/streams-example/pyspark/consumer.py
================================================
import argparse
from typing import Dict, List
from kafka import KafkaConsumer

from settings import BOOTSTRAP_SERVERS, CONSUME_TOPIC_RIDES_CSV


class RideCSVConsumer:
    def __init__(self, props: Dict):
        self.consumer = KafkaConsumer(**props)

    def consume_from_kafka(self, topics: List[str]):
        self.consumer.subscribe(topics=topics)
        print('Consuming from Kafka started')
        print('Available topics to consume: ', self.consumer.subscription())
        while True:
            try:
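                # Poll with a 1-second timeout so that KeyboardInterrupt (SIGINT) is handled promptly.
                # kafka-python's poll() takes its timeout in milliseconds and returns a dict of records.
                msg = self.consumer.poll(timeout_ms=1000)
                if not msg: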
                    continue
                for msg_key, msg_values in msg.items():
                    for msg_val in msg_values:
                        print(f'Key:{msg_val.key}-type({type(msg_val.key)}), '
                              f'Value:{msg_val.value}-type({type(msg_val.value)})')
            except KeyboardInterrupt:
                break

        self.consumer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Kafka Consumer')
    parser.add_argument('--topic', type=str, default=CONSUME_TOPIC_RIDES_CSV)
    args = parser.parse_args()

    topic = args.topic
    config = {
        'bootstrap_servers': [BOOTSTRAP_SERVERS],
        'auto_offset_reset': 'earliest',
        'enable_auto_commit': True,
        'key_deserializer': lambda key: int(key.decode('utf-8')),
        'value_deserializer': lambda value: value.decode('utf-8'),
        'group_id': 'consumer.group.id.csv-example.1',
    }
    csv_consumer = RideCSVConsumer(props=config)
    csv_consumer.consume_from_kafka(topics=[topic])


================================================
FILE: 07-streaming/extras/python/streams-example/pyspark/producer.py
================================================
import csv
from time import sleep
from typing import Dict
from kafka import KafkaProducer

from settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, PRODUCE_TOPIC_RIDES_CSV


def delivery_report(err, msg):
    if err is not None:
        print("Delivery failed for record {}: {}".format(msg.key(), err))
        return
    print('Record {} successfully produced to {} [{}] at offset {}'.format(
        msg.key(), msg.topic(), msg.partition(), msg.offset()))


class RideCSVProducer:
    def __init__(self, props: Dict):
        self.producer = KafkaProducer(**props)
        # self.producer = Producer(producer_props)

    @staticmethod
    def read_records(resource_path: str):
        records, ride_keys = [], []
        i = 0
        with open(resource_path, 'r') as f:
            reader = csv.reader(f)
            header = next(reader)  # skip the header
            for row in reader:
                # vendor_id, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, payment_type, total_amount
                records.append(f'{row[0]}, {row[1]}, {row[2]}, {row[3]}, {row[4]}, {row[9]}, {row[16]}')
                ride_keys.append(str(row[0]))
                i += 1
                if i == 5:
                    break
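        # Return a list (not a one-shot zip iterator) so the records can be printed and then published.
        return list(zip(ride_keys, records))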

    def publish(self, topic: str, records: [str, str]):
        for key_value in records:
            key, value = key_value
            try:
                self.producer.send(topic=topic, key=key, value=value)
                print(f"Producing record for <key: {key}, value:{value}>")
            except KeyboardInterrupt:
                break
            except Exception as e:
                print(f"Exception while producing record - {value}: {e}")

        self.producer.flush()
        sleep(1)


if __name__ == "__main__":
    config = {
        'bootstrap_servers': [BOOTSTRAP_SERVERS],
        'key_serializer': lambda x: x.encode('utf-8'),
        'value_serializer': lambda x: x.encode('utf-8')
    }
    producer = RideCSVProducer(props=config)
    ride_records = producer.read_records(resource_path=INPUT_DATA_PATH)
    print(ride_records)
    producer.publish(topic=PRODUCE_TOPIC_RIDES_CSV, records=ride_records)


================================================
FILE: 07-streaming/extras/python/streams-example/pyspark/settings.py
================================================
import pyspark.sql.types as T

INPUT_DATA_PATH = '../../resources/rides.csv'
BOOTSTRAP_SERVERS = 'localhost:9092'

TOPIC_WINDOWED_VENDOR_ID_COUNT = 'vendor_counts_windowed'

PRODUCE_TOPIC_RIDES_CSV = CONSUME_TOPIC_RIDES_CSV = 'rides_csv'

RIDE_SCHEMA = T.StructType(
    [T.StructField("vendor_id", T.IntegerType()),
     T.StructField('tpep_pickup_datetime', T.TimestampType()),
     T.StructField('tpep_dropoff_datetime', T.TimestampType()),
     T.StructField("passenger_count", T.IntegerType()),
     T.StructField("trip_distance", T.FloatType()),
     T.StructField("payment_type", T.IntegerType()),
     T.StructField("total_amount", T.FloatType()),
     ])


================================================
FILE: 07-streaming/extras/python/streams-example/pyspark/spark-submit.sh
================================================
#!/usr/bin/env bash
# Submit Python code to SparkMaster

if [ $# -lt 1 ]
then
	echo "Usage: $0 <pyspark-job.py> [ executor-memory ]"
	echo "(specify memory in string format such as \"512M\" or \"2G\")"
	exit 1
fi
PYTHON_JOB="$1"

# Default to 1G of executor memory when no second argument is given.
if [ -z "$2" ]
then
	EXEC_MEM="1G"
else
	EXEC_MEM="$2"
fi
spark-submit --master spark://localhost:7077 --num-executors 2 \
             --executor-memory "$EXEC_MEM" --executor-cores 1 \
             --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.3.1 \
             "$PYTHON_JOB"

================================================
FILE: 07-streaming/extras/python/streams-example/pyspark/streaming-notebook.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c4419168-c0e6-4a65-b56e-8454c42060ac",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "### 0. Spark Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32bd7cdd-8504-4a54-a461-244bf7878d2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1 pyspark-shell'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3aab2a7e-a685-4925-9c9a-b5adf201af77",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ":: loading settings :: url = jar:file:/usr/local/lib/python3.10/dist-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Ivy Default Cache set to: /root/.ivy2/cache\n",
      "The jars for the packages stored in: /root/.ivy2/jars\n",
      "org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\n",
      "org.apache.spark#spark-avro_2.12 added as a dependency\n",
      ":: resolving dependencies :: org.apache.spark#spark-submit-parent-5a3a4db6-be91-4d32-9884-8b0f38241b3f;1.0\n",
      "\tconfs: [default]\n",
      "\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 in central\n",
      "\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 in central\n",
      "\tfound org.apache.kafka#kafka-clients;2.8.1 in central\n",
      "\tfound org.lz4#lz4-java;1.8.0 in central\n",
      "\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\n",
      "\tfound org.slf4j#slf4j-api;1.7.32 in central\n",
      "\tfound org.apache.hadoop#hadoop-client-runtime;3.3.2 in central\n",
      "\tfound org.spark-project.spark#unused;1.0.0 in central\n",
      "\tfound org.apache.hadoop#hadoop-client-api;3.3.2 in central\n",
      "\tfound commons-logging#commons-logging;1.1.3 in central\n",
      "\tfound com.google.code.findbugs#jsr305;3.0.0 in central\n",
      "\tfound org.apache.commons#commons-pool2;2.11.1 in central\n",
      "\tfound org.apache.spark#spark-avro_2.12;3.3.1 in central\n",
      "\tfound org.tukaani#xz;1.8 in central\n",
      ":: resolution report :: resolve 544ms :: artifacts dl 11ms\n",
      "\t:: modules in use:\n",
      "\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\n",
      "\tcommons-logging#commons-logging;1.1.3 from central in [default]\n",
      "\torg.apache.commons#commons-pool2;2.11.1 from central in [default]\n",
      "\torg.apache.hadoop#hadoop-client-api;3.3.2 from central in [default]\n",
      "\torg.apache.hadoop#hadoop-client-runtime;3.3.2 from central in [default]\n",
      "\torg.apache.kafka#kafka-clients;2.8.1 from central in [default]\n",
      "\torg.apache.spark#spark-avro_2.12;3.3.1 from central in [default]\n",
      "\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 from central in [default]\n",
      "\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 from central in [default]\n",
      "\torg.lz4#lz4-java;1.8.0 from central in [default]\n",
      "\torg.slf4j#slf4j-api;1.7.32 from central in [default]\n",
      "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n",
      "\torg.tukaani#xz;1.8 from central in [default]\n",
      "\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\n",
      "\t---------------------------------------------------------------------\n",
      "\t|                  |            modules            ||   artifacts   |\n",
      "\t|       conf       | number| search|dwnlded|evicted|| number|dwnlded|\n",
      "\t---------------------------------------------------------------------\n",
      "\t|      default     |   14  |   0   |   0   |   0   ||   14  |   0   |\n",
      "\t---------------------------------------------------------------------\n",
      ":: retrieving :: org.apache.spark#spark-submit-parent-5a3a4db6-be91-4d32-9884-8b0f38241b3f\n",
      "\tconfs: [default]\n",
      "\t0 artifacts copied, 14 already retrieved (0kB/8ms)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "23/02/21 21:20:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "import pyspark.sql.types as T\n",
    "import pyspark.sql.functions as F\n",
    "\n",
    "spark = SparkSession \\\n",
    "    .builder \\\n",
    "    .appName(\"Spark-Notebook\") \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f4b62fa-b3ce-4a1b-a1f4-2ed332a0d55a",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "### 1. Reading from Kafka Stream\n",
    "\n",
    "through `readStream`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f491fa45-4471-4bc5-92f7-48081f687140",
   "metadata": {},
   "source": [
    "#### 1.1 Raw Kafka Stream"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "82c25cb2-2599-4f9b-8849-967fbb604a44",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# default for startingOffsets is \"latest\"\n",
    "df_kafka_raw = spark \\\n",
    "    .readStream \\\n",
    "    .format(\"kafka\") \\\n",
    "    .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n",
    "    .option(\"subscribe\", \"rides_csv\") \\\n",
    "    .option(\"startingOffsets\", \"earliest\") \\\n",
    "    .option(\"checkpointLocation\", \"checkpoint\") \\\n",
    "    .load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d9149ccd-69b2-4f5b-afc0-43567673c634",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- key: binary (nullable = true)\n",
      " |-- value: binary (nullable = true)\n",
      " |-- topic: string (nullable = true)\n",
      " |-- partition: integer (nullable = true)\n",
      " |-- offset: long (nullable = true)\n",
      " |-- timestamp: timestamp (nullable = true)\n",
      " |-- timestampType: integer (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_kafka_raw.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62e5e753-89c7-460f-a8be-16868ce5c680",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "#### 1.2 Encoded Kafka Stream"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0b745eed-7d74-421e-8e4b-c8343fda4de3",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "df_kafka_encoded = df_kafka_raw.selectExpr(\"CAST(key AS STRING)\",\"CAST(value AS STRING)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "6839addc-c7c0-4117-8c9c-d2cd59cbf136",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- key: string (nullable = true)\n",
      " |-- value: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_kafka_encoded.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6749c4de-6f80-4b91-b2b8-b2968c761d75",
   "metadata": {},
   "source": [
    "#### 1.3 Structured Streaming DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ca20ae37-49f0-421f-9859-73fac8d4ca45",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def parse_ride_from_kafka_message(df_raw, schema):\n",
    "    \"\"\" take a Spark Streaming df and parse value col based on <schema>, return streaming df cols in schema \"\"\"\n",
    "    assert df_raw.isStreaming is True, \"DataFrame doesn't receive streaming data\"\n",
    "\n",
    "    df = df_raw.selectExpr(\"CAST(key AS STRING)\", \"CAST(value AS STRING)\")\n",
    "\n",
    "    # split attributes to nested array in one Column\n",
    "    col = F.split(df['value'], ', ')\n",
    "\n",
    "    # expand col to multiple top-level columns\n",
    "    for idx, field in enumerate(schema):\n",
    "        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))\n",
    "    return df.select([field.name for field in schema])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e1737bd0-146f-4ee2-a70f-a4657af5bbc6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ride_schema = T.StructType(\n",
    "    [T.StructField(\"vendor_id\", T.IntegerType()),\n",
    "     T.StructField('tpep_pickup_datetime', T.TimestampType()),\n",
    "     T.StructField('tpep_dropoff_datetime', T.TimestampType()),\n",
    "     T.StructField(\"passenger_count\", T.IntegerType()),\n",
    "     T.StructField(\"trip_distance\", T.FloatType()),\n",
    "     T.StructField(\"payment_type\", T.IntegerType()),\n",
    "     T.StructField(\"total_amount\", T.FloatType()),\n",
    "     ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "ae2ce896-f54b-4166-b01f-b5532ab292fe",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "df_rides = parse_ride_from_kafka_message(df_raw=df_kafka_raw, schema=ride_schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "cd848228-97c5-4325-8457-97f35e533cd8",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- vendor_id: integer (nullable = true)\n",
      " |-- tpep_pickup_datetime: timestamp (nullable = true)\n",
      " |-- tpep_dropoff_datetime: timestamp (nullable = true)\n",
      " |-- passenger_count: integer (nullable = true)\n",
      " |-- trip_distance: float (nullable = true)\n",
      " |-- payment_type: integer (nullable = true)\n",
      " |-- total_amount: float (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_rides.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60277fdc-2797-4b23-9ecf-956b76db5778",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 2. Sink Operation & Streaming Query\n",
    "\n",
    "through `writeStream`\n",
    "\n",
    "---\n",
    "**Output Sinks**\n",
    "- File Sink: stores the output to a directory\n",
    "- Kafka Sink: stores the output to one or more topics in Kafka\n",
    "- Foreach Sink: runs arbitrary computation on the records in the output\n",
    "- (for debugging) Console Sink, Memory Sink\n",
    "\n",
    "Further details can be found in [Output Sinks](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)\n",
    "\n",
    "---\n",
    "There are three types of **Output Modes**:\n",
    "- Complete: The whole Result Table is written to the sink after every trigger. This is supported for aggregation queries.\n",
    "- Append (default): Only the new rows added to the Result Table since the last trigger are written to the sink\n",
    "- Update: Only the rows in the Result Table that were updated since the last trigger are written to the sink\n",
    "\n",
    "The supported [Output Modes](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes) differ based on the set of transformations applied to the streaming data.\n",
    "\n",
    "---\n",
    "**Triggers**\n",
    "\n",
    "The [trigger settings](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers) of a streaming query define the timing of streaming data processing. Spark Structured Streaming supports micro-batch processing, and you can select one of the following options based on your requirements.\n",
    "\n",
    "- default-micro-batch-mode\n",
    "- fixed-interval-micro-batch-mode\n",
    "- one-time-micro-batch-mode\n",
    "- available-now-micro-batch-mode\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02ca9b08-aa61-46cd-b946-4457ce2cdf5d",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Console and Memory Sink"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "74c72469-4c37-417c-a866-a1c1ef75ae8b",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):\n",
    "    write_query = df.writeStream \\\n",
    "        .outputMode(output_mode) \\\n",
    "        .trigger(processingTime=processing_time) \\\n",
    "        .format(\"console\") \\\n",
    "        .option(\"truncate\", False) \\\n",
    "        .start()\n",
    "    return write_query # pyspark.sql.streaming.StreamingQuery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "d866c7ba-f8e9-475d-830a-50ffb2c5472b",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "23/02/21 21:46:12 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-289a958e-f6b6-4b38-a87b-50002d82ec8b. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n",
      "23/02/21 21:46:12 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n",
      "23/02/21 21:46:12 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0-3, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "23/02/21 21:46:12 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0-3, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\n",
      "23/02/21 21:46:13 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-4, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "23/02/21 21:46:13 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-4, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\n",
      "-------------------------------------------\n",
      "Batch: 0\n",
      "-------------------------------------------\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\n",
      "|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\n",
      "|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\n",
      "|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\n",
      "|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\n",
      "|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\n",
      "|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\n",
      "|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\n",
      "|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\n",
      "|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "\n",
      "23/02/21 22:11:05 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-5, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "23/02/21 22:11:05 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-5, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------------------------------\n",
      "Batch: 1\n",
      "-------------------------------------------\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\n",
      "|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\n",
      "|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\n",
      "|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\n",
      "|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "write_query = sink_console(df_rides, output_mode='append')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "a9bfa73f-a8cc-4988-a8cf-bf31ee6c449c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def sink_memory(df, query_name, query_template):\n",
    "    write_query = df \\\n",
    "        .writeStream \\\n",
    "        .queryName(query_name) \\\n",
    "        .format('memory') \\\n",
    "        .start()\n",
    "    query_str = query_template.format(table_name=query_name)\n",
    "    query_results = spark.sql(query_str)\n",
    "    return write_query, query_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "b31d0b76-e917-44e7-a14d-f9ce6901c23a",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "23/02/21 21:31:47 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-b3e2c096-aa06-4083-9cdf-d6f3cf04fc06. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n",
      "23/02/21 21:31:47 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n",
      "23/02/21 21:31:48 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0-1, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "23/02/21 21:31:48 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0-1, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\n",
      "23/02/21 21:31:49 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor-2, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "23/02/21 21:31:49 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor-2, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "query_name = 'vendor_id_counts'\n",
    "query_template = 'select count(distinct(vendor_id)) from {table_name}'\n",
    "write_query, df_vendor_id_counts = sink_memory(df=df_rides, query_name=query_name, query_template=query_template)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "4ba56111-83bf-4028-ac65-565e0190f310",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pyspark.sql.streaming.StreamingQuery'>\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'message': 'Waiting for data to arrive',\n",
       " 'isDataAvailable': False,\n",
       " 'isTriggerActive': True}"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(type(write_query)) # pyspark.sql.streaming.StreamingQuery\n",
    "write_query.status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "7cc37bda-9cfa-402b-9d42-a6ba5271476b",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------------+\n",
      "|count(DISTINCT vendor_id)|\n",
      "+-------------------------+\n",
      "|                        2|\n",
      "+-------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_vendor_id_counts.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "88862ca9-4d89-487e-987f-08a2b9e83efe",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "write_query.stop()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "443d4041-06db-4a4a-89c1-348848cc7ca8",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Kafka Sink\n",
    "\n",
    "To write stream results to a Kafka topic, the streaming DataFrame must have at least a column named `value`.\n",
    "\n",
    "Therefore, before starting `writeStream` in Kafka format, the DataFrame needs to be updated accordingly.\n",
    "\n",
    "More information about the data structure expected by the Kafka sink is available [here](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "8b08a013-d039-41cf-94fd-a1a57571d25f",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def prepare_dataframe_to_kafka_sink(df, value_columns, key_column=None):\n",
    "    # the Kafka sink expects a `value` column and, optionally, a `key` column\n",
    "    df = df.withColumn(\"value\", F.concat_ws(', ', *value_columns))\n",
    "    if key_column:\n",
    "        df = df.withColumnRenamed(key_column, \"key\")\n",
    "        df = df.withColumn(\"key\", df.key.cast('string'))\n",
    "        return df.select(['key', 'value'])\n",
    "    # no key column given: write only the value\n",
    "    return df.select(['value'])\n",
    "\n",
    "def sink_kafka(df, topic, output_mode='append'):\n",
    "    write_query = df.writeStream \\\n",
    "        .format(\"kafka\") \\\n",
    "        .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n",
    "        .outputMode(output_mode) \\\n",
    "        .option(\"topic\", topic) \\\n",
    "        .option(\"checkpointLocation\", \"checkpoint\") \\\n",
    "        .start()\n",
    "    return write_query"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4cb2140-9f2e-4914-b74c-be4c18cdbe8a",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 07-streaming/extras/python/streams-example/pyspark/streaming.py
================================================
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

from settings import RIDE_SCHEMA, CONSUME_TOPIC_RIDES_CSV, TOPIC_WINDOWED_VENDOR_ID_COUNT


def read_from_kafka(consume_topic: str):
    # Spark Streaming DataFrame, connect to Kafka topic served at the hosts in the kafka.bootstrap.servers option
    df_stream = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092,broker:29092") \
        .option("subscribe", consume_topic) \
        .option("startingOffsets", "earliest") \
        .option("checkpointLocation", "checkpoint") \
        .load()
    return df_stream


def parse_ride_from_kafka_message(df, schema):
    """ take a Spark Streaming df and parse value col based on <schema>, return streaming df cols in schema """
    assert df.isStreaming is True, "DataFrame doesn't receive streaming data"

    df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    # split attributes to nested array in one Column
    col = F.split(df['value'], ', ')

    # expand col to multiple top-level columns
    for idx, field in enumerate(schema):
        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))
    return df.select([field.name for field in schema])


def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):
    write_query = df.writeStream \
        .outputMode(output_mode) \
        .trigger(processingTime=processing_time) \
        .format("console") \
        .option("truncate", False) \
        .start()
    return write_query  # pyspark.sql.streaming.StreamingQuery


def sink_memory(df, query_name, query_template):
    query_df = df \
        .writeStream \
        .queryName(query_name) \
        .format("memory") \
        .start()
    query_str = query_template.format(table_name=query_name)
    query_results = spark.sql(query_str)
    return query_results, query_df


def sink_kafka(df, topic):
    write_query = df.writeStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092,broker:29092") \
        .outputMode('complete') \
        .option("topic", topic) \
        .option("checkpointLocation", "checkpoint") \
        .start()
    return write_query


def prepare_df_to_kafka_sink(df, value_columns, key_column=None):
    # the Kafka sink expects a `value` column and, optionally, a `key` column
    df = df.withColumn("value", F.concat_ws(', ', *value_columns))
    if key_column:
        df = df.withColumnRenamed(key_column, "key")
        df = df.withColumn("key", df.key.cast('string'))
        return df.select(['key', 'value'])
    # no key column given: write only the value
    return df.select(['value'])


def op_groupby(df, column_names):
    df_aggregation = df.groupBy(column_names).count()
    return df_aggregation


def op_windowed_groupby(df, window_duration, slide_duration):
    df_windowed_aggregation = df.groupBy(
        F.window(timeColumn=df.tpep_pickup_datetime, windowDuration=window_duration, slideDuration=slide_duration),
        df.vendor_id
    ).count()
    return df_windowed_aggregation


if __name__ == "__main__":
    spark = SparkSession.builder.appName('streaming-examples').getOrCreate()
    spark.sparkContext.setLogLevel('WARN')

    # read_streaming data
    df_consume_stream = read_from_kafka(consume_topic=CONSUME_TOPIC_RIDES_CSV)
    df_consume_stream.printSchema()  # printSchema() prints the schema itself and returns None

    # parse streaming data
    df_rides = parse_ride_from_kafka_message(df_consume_stream, RIDE_SCHEMA)
    df_rides.printSchema()

    sink_console(df_rides, output_mode='append')

    df_trip_count_by_vendor_id = op_groupby(df_rides, ['vendor_id'])
    df_trip_count_by_pickup_date_vendor_id = op_windowed_groupby(df_rides, window_duration="10 minutes",
                                                                 slide_duration='5 minutes')

    # write the output out to the console for debugging / testing
    sink_console(df_trip_count_by_vendor_id)
    # write the output to the kafka topic
    df_trip_count_messages = prepare_df_to_kafka_sink(df=df_trip_count_by_pickup_date_vendor_id,
                                                      value_columns=['count'], key_column='vendor_id')
    kafka_sink_query = sink_kafka(df=df_trip_count_messages, topic=TOPIC_WINDOWED_VENDOR_ID_COUNT)

    spark.streams.awaitAnyTermination()


================================================
FILE: 07-streaming/extras/python/streams-example/redpanda/README.md
================================================

# Running PySpark Streaming with Redpanda

### 1. Prerequisite

The Docker network and volume described in the document must exist before you start. Check that both have been created:

```bash
docker volume ls # should list hadoop-distributed-file-system
docker network ls # should list kafka-spark-network 
```

### 2. Create Docker Network & Volume

If you have not followed any of the other examples and the `ls` commands above do not list the network and volume, create them now.

```bash
# Create Network
docker network create kafka-spark-network

# Create Volume
docker volume create --name=hadoop-distributed-file-system
```

### Running Producer and Consumer
```bash
# Run producer
python producer.py

# Run consumer with default settings
python consumer.py
# Run consumer for specific topic
python consumer.py --topic <topic-name>
```
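
`consumer.py` configures the consumer with a `key_deserializer` that decodes the key bytes as UTF-8 and converts them to `int`, and a `value_deserializer` that decodes the value as a UTF-8 string. The sketch below applies the same two lambdas to hand-written sample bytes, so it runs without a broker. The sample payload is an assumption for illustration; real messages depend on what `producer.py` sends.

```python
# Same deserializers that consumer.py passes to KafkaConsumer.
key_deserializer = lambda key: int(key.decode('utf-8'))
value_deserializer = lambda value: value.decode('utf-8')

# Hypothetical raw message as it could arrive from the broker (bytes).
raw_key = b'1'
raw_value = b'1, 2020-07-01 00:25:32, 2020-07-01 00:33:39, 1, 1.5, 2, 9.3'

key = key_deserializer(raw_key)
value = value_deserializer(raw_value)

# Mirrors the print format used in consumer.py.
print(f'Key:{key}-type({type(key)}), Value:{value}-type({type(value)})')
```

Because the key is passed through `int(...)`, a key that is not a numeric string raises `ValueError`, so this consumer only suits topics whose keys are integer-like.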

### Running Streaming Script

The `spark-submit.sh` script ensures that the necessary jars are installed before running `streaming.py`.

```bash
./spark-submit.sh streaming.py 
```
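
`streaming.py` turns each Kafka message value into typed columns by splitting the string on `', '` and casting each item to the type that the schema gives for that position. The following is a plain-Python sketch of that idea with no Spark dependency. `RIDE_FIELDS`, `parse_ts` and `parse_ride` are hypothetical names that mirror the field order of the ride schema, and the sample row is illustrative.

```python
from datetime import datetime


def parse_ts(text):
    # Timestamps in the sample data look like '2020-07-01 00:25:32'.
    return datetime.strptime(text, '%Y-%m-%d %H:%M:%S')


# (field name, cast function) pairs in the same order as the ride schema.
RIDE_FIELDS = [
    ('vendor_id', int),
    ('tpep_pickup_datetime', parse_ts),
    ('tpep_dropoff_datetime', parse_ts),
    ('passenger_count', int),
    ('trip_distance', float),
    ('payment_type', int),
    ('total_amount', float),
]


def parse_ride(value):
    # Split the message value into items, then cast each item by position.
    items = value.split(', ')
    return {name: cast(items[idx]) for idx, (name, cast) in enumerate(RIDE_FIELDS)}


ride = parse_ride('1, 2020-07-01 00:25:32, 2020-07-01 00:33:39, 1, 1.5, 2, 9.3')
print(ride['vendor_id'], ride['trip_distance'], ride['total_amount'])
```

One difference to keep in mind: this sketch raises an exception on a malformed item, whereas a Spark `cast` with default (non-ANSI) settings yields `null` for values it cannot convert.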

### Additional Resources
- [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streaming-programming-guide)
- [Structured Streaming + Kafka Integration](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#structured-streaming-kafka-integration-guide-kafka-broker-versio)


================================================
FILE: 07-streaming/extras/python/streams-example/redpanda/consumer.py
================================================
import argparse
from typing import Dict, List
from kafka import KafkaConsumer

from settings import BOOTSTRAP_SERVERS, CONSUME_TOPIC_RIDES_CSV


class RideCSVConsumer:
    def __init__(self, props: Dict):
        self.consumer = KafkaConsumer(**props)

    def consume_from_kafka(self, topics: List[str]):
        self.consumer.subscribe(topics=topics)
        print('Consuming from Kafka started')
        print('Subscribed topics: ', self.consumer.subscription())
        while True:
            try:
                # SIGINT can't be handled while polling, so limit each poll to 1 second.
                # kafka-python's poll() takes its timeout in milliseconds.
                msg = self.consumer.poll(timeout_ms=1000)
                if msg is None or msg == {}:
                    continue
                for msg_key, msg_values in msg.items():
                    for msg_val in msg_values:
                        print(f'Key:{msg_val.key}-type({type(msg_val.key)}), '
                              f'Value:{msg_val.value}-type({type(msg_val.value)})')
            except KeyboardInterrupt:
                break

        self.consumer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Kafka Consumer')
    parser.add_argument('--topic', type=str, default=CONSUME_TOPIC_RIDES_CSV)
    args = parser.parse_args()

    topic = args.topic
    config = {
        'bootstrap_servers': [BOOTSTRAP_SERVERS],
        'auto_offset_reset': 'earliest',
        'enable_auto_commit': True,
        'key_deserializer': lambda key: int(key.decode('utf-8')),
        'value_deserializer': lambda value: value.decode('utf-8'),
        'group_id': 'consumer.group.id.csv-example.1',
    }
    csv_consumer = RideCSVConsumer(props=config)
    csv_consumer.consume_from_kafka(topics=[topic])


================================================
FILE: 07-streaming/extras/python/streams-example/redpanda/docker-compose.yaml
================================================
version: '3.7'
volumes:
  shared-workspace:
    name: "hadoop-distributed-file-system"
    driver: local
networks:
  default:
    name: kafka-spark-network
    external: true
services:
  # Redpanda cluster
  redpanda-1:
    image: docker.redpanda.com/redpandadata/redpanda:v23.2.26
    container_name: redpanda-1
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda-1:33145
    ports:
      # - 8081:8081
      - 8082:8082
      - 9092:9092
      - 9644:9644
      - 28082:28082
      - 29092:29092
    volumes:
      - shared-workspace:/opt/workspace

  # Second node of the two-node Redpanda cluster. Comment this block out to run a single node.
  redpanda-2:
    image: docker.redpanda.com/redpandadata/redpanda:v23.1.1
    container_name: redpanda-2
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '2'
      - --seeds
      - redpanda-1:33145
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29093,OUTSIDE://0.0.0.0:9093
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda-2:29093,OUTSIDE://localhost:9093
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28083,OUTSIDE://0.0.0.0:8083
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda-2:28083,OUTSIDE://localhost:8083
      - --rpc-addr
      - 0.0.0.0:33146
      - --advertise-rpc-addr
      - redpanda-2:33146
    ports:
      - 8083:8083
      - 9093:9093
    volumes:
      - shared-workspace:/opt/workspace

  redpanda-console:
    image: docker.redpanda.com/redpandadata/console:v2.2.2
    container_name: redpanda-console
    entrypoint: /bin/sh
    command: -c "echo \"$$CONSOLE_CONFIG_FILE\" > /tmp/config.yml; /app/console"
    environment:
      CONFIG_FILEPATH: /tmp/config.yml
      CONSOLE_CONFIG_FILE: |
        kafka:
          brokers: ["redpanda-1:29092"]
          schemaRegistry:
            enabled: false
        redpanda:
          adminApi:
            enabled: true
            urls: ["http://redpanda-1:9644"]
        connect:
          enabled: false
    ports:
      - 8080:8080
    depends_on:
      - redpanda-1
    volumes:
      - shared-workspace:/opt/workspace


================================================
FILE: 07-streaming/extras/python/streams-example/redpanda/producer.py
================================================
import csv
from time import sleep
from typing import Dict
from kafka import KafkaProducer

from settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, PRODUCE_TOPIC_RIDES_CSV


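# Delivery callback written in the confluent-kafka style (msg.key(), msg.topic(), ...).
# It is not wired into the kafka-python producer below and is kept for reference only.
def delivery_report(err, msg):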
    if err is not None:
        print("Delivery failed for record {}: {}".format(msg.key(), err))
        return
    print('Record {} successfully produced to {} [{}] at offset {}'.format(
        msg.key(), msg.topic(), msg.partition(), msg.offset()))


class RideCSVProducer:
    def __init__(self, props: Dict):
        self.producer = KafkaProducer(**props)
        # self.producer = Producer(producer_props)

    @staticmethod
    def read_records(resource_path: str):
        records, ride_keys = [], []
        i = 0
        with open(resource_path, 'r') as f:
            reader = csv.reader(f)
            header = next(reader)  # skip the header
            for row in reader:
                # vendor_id, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,
                # trip_distance, payment_type, total_amount
                records.append(f'{row[0]}, {row[1]}, {row[2]}, {row[3]}, {row[4]}, {row[9]}, {row[16]}')
                ride_keys.append(str(row[0]))
                i += 1
                if i == 5:
                    break
        return zip(ride_keys, records)

    def publish(self, topic: str, records: [str, str]):
        for key_value in records:
            key, value = key_value
            try:
                self.producer.send(topic=topic, key=key, value=value)
                print(f"Producing record for <key: {key}, value:{value}>")
            except KeyboardInterrupt:
                break
            except Exception as e:
                print(f"Exception while producing record - {value}: {e}")

        self.producer.flush()
        sleep(1)


if __name__ == "__main__":
    config = {
        'bootstrap_servers': [BOOTSTRAP_SERVERS],
        'key_serializer': lambda x: x.encode('utf-8'),
        'value_serializer': lambda x: x.encode('utf-8')
    }
    producer = RideCSVProducer(props=config)
    ride_records = producer.read_records(resource_path=INPUT_DATA_PATH)
    print(ride_records)
    producer.publish(topic=PRODUCE_TOPIC_RIDES_CSV, records=ride_records)


================================================
FILE: 07-streaming/extras/python/streams-example/redpanda/settings.py
================================================
import pyspark.sql.types as T

INPUT_DATA_PATH = '../../resources/rides.csv'
BOOTSTRAP_SERVERS = 'localhost:9092'

TOPIC_WINDOWED_VENDOR_ID_COUNT = 'vendor_counts_windowed'

PRODUCE_TOPIC_RIDES_CSV = CONSUME_TOPIC_RIDES_CSV = 'rides_csv'

RIDE_SCHEMA = T.StructType(
    [T.StructField("vendor_id", T.IntegerType()),
     T.StructField('tpep_pickup_datetime', T.TimestampType()),
     T.StructField('tpep_dropoff_datetime', T.TimestampType()),
     T.StructField("passenger_count", T.IntegerType()),
     T.StructField("trip_distance", T.FloatType()),
     T.StructField("payment_type", T.IntegerType()),
     T.StructField("total_amount", T.FloatType()),
     ])


================================================
FILE: 07-streaming/extras/python/streams-example/redpanda/spark-submit.sh
================================================
# Submit Python code to SparkMaster

if [ $# -lt 1 ]
then
	echo "Usage: $0 <pyspark-job.py> [ executor-memory ]"
	echo "(specify memory in string format such as \"512M\" or \"2G\")"
	exit 1
fi
PYTHON_JOB=$1

if [ -z "$2" ]
then
	EXEC_MEM="1G"
else
	EXEC_MEM=$2
fi
spark-submit --master spark://localhost:7077 --num-executors 2 \
             --executor-memory "$EXEC_MEM" --executor-cores 1 \
             --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.spark:spark-avro_2.12:3.5.1,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.5.1 \
             "$PYTHON_JOB"


================================================
FILE: 07-streaming/extras/python/streams-example/redpanda/streaming-notebook.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c4419168-c0e6-4a65-b56e-8454c42060ac",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 0. Spark Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "32bd7cdd-8504-4a54-a461-244bf7878d2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1 pyspark-shell'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3aab2a7e-a685-4925-9c9a-b5adf201af77",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "your 131072x1 screen size is bogus. expect trouble\n",
      "24/03/11 00:28:48 WARN Utils: Your hostname, Cinders resolves to a loopback address: 127.0.1.1; using 172.17.156.62 instead (on interface eth0)\n",
      "24/03/11 00:28:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ":: loading settings :: url = jar:file:/home/ellabelle/spark/spark-3.5.1-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Ivy Default Cache set to: /home/ellabelle/.ivy2/cache\n",
      "The jars for the packages stored in: /home/ellabelle/.ivy2/jars\n",
      "org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\n",
      "org.apache.spark#spark-avro_2.12 added as a dependency\n",
      ":: resolving dependencies :: org.apache.spark#spark-submit-parent-0c8615d6-fa19-46ec-942b-46e9fe0012aa;1.0\n",
      "\tconfs: [default]\n",
      "\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 in central\n",
      "\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 in central\n",
      "\tfound org.apache.kafka#kafka-clients;2.8.1 in central\n",
      "\tfound org.lz4#lz4-java;1.8.0 in central\n",
      "\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\n",
      "\tfound org.slf4j#slf4j-api;1.7.32 in central\n",
      "\tfound org.apache.hadoop#hadoop-client-runtime;3.3.2 in central\n",
      "\tfound org.spark-project.spark#unused;1.0.0 in central\n",
      "\tfound org.apache.hadoop#hadoop-client-api;3.3.2 in central\n",
      "\tfound commons-logging#commons-logging;1.1.3 in central\n",
      "\tfound com.google.code.findbugs#jsr305;3.0.0 in central\n",
      "\tfound org.apache.commons#commons-pool2;2.11.1 in central\n",
      "\tfound org.apache.spark#spark-avro_2.12;3.3.1 in central\n",
      "\tfound org.tukaani#xz;1.8 in central\n",
      ":: resolution report :: resolve 328ms :: artifacts dl 13ms\n",
      "\t:: modules in use:\n",
      "\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\n",
      "\tcommons-logging#commons-logging;1.1.3 from central in [default]\n",
      "\torg.apache.commons#commons-pool2;2.11.1 from central in [default]\n",
      "\torg.apache.hadoop#hadoop-client-api;3.3.2 from central in [default]\n",
      "\torg.apache.hadoop#hadoop-client-runtime;3.3.2 from central in [default]\n",
      "\torg.apache.kafka#kafka-clients;2.8.1 from central in [default]\n",
      "\torg.apache.spark#spark-avro_2.12;3.3.1 from central in [default]\n",
      "\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 from central in [default]\n",
      "\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 from central in [default]\n",
      "\torg.lz4#lz4-java;1.8.0 from central in [default]\n",
      "\torg.slf4j#slf4j-api;1.7.32 from central in [default]\n",
      "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n",
      "\torg.tukaani#xz;1.8 from central in [default]\n",
      "\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\n",
      "\t---------------------------------------------------------------------\n",
      "\t|                  |            modules            ||   artifacts   |\n",
      "\t|       conf       | number| search|dwnlded|evicted|| number|dwnlded|\n",
      "\t---------------------------------------------------------------------\n",
      "\t|      default     |   14  |   0   |   0   |   0   ||   14  |   0   |\n",
      "\t---------------------------------------------------------------------\n",
      ":: retrieving :: org.apache.spark#spark-submit-parent-0c8615d6-fa19-46ec-942b-46e9fe0012aa\n",
      "\tconfs: [default]\n",
      "\t0 artifacts copied, 14 already retrieved (0kB/8ms)\n",
      "24/03/11 00:28:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
      "24/03/11 00:28:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.\n"
     ]
    }
   ],
   "source": [
    "from pyspark.sql import SparkSession\n",
    "import pyspark.sql.types as T\n",
    "import pyspark.sql.functions as F\n",
    "\n",
    "spark = SparkSession \\\n",
    "    .builder \\\n",
    "    .appName(\"Spark-Notebook\") \\\n",
    "    .getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f4b62fa-b3ce-4a1b-a1f4-2ed332a0d55a",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 1. Reading from Kafka Stream\n",
    "\n",
    "through `readStream`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f491fa45-4471-4bc5-92f7-48081f687140",
   "metadata": {},
   "source": [
    "#### 1.1 Raw Kafka Stream"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "82c25cb2-2599-4f9b-8849-967fbb604a44",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# default for startingOffsets is \"latest\"\n",
    "df_kafka_raw = spark \\\n",
    "    .readStream \\\n",
    "    .format(\"kafka\") \\\n",
    "    .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n",
    "    .option(\"subscribe\", \"rides_csv\") \\\n",
    "    .option(\"startingOffsets\", \"earliest\") \\\n",
    "    .option(\"checkpointLocation\", \"checkpoint\") \\\n",
    "    .load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d9149ccd-69b2-4f5b-afc0-43567673c634",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- key: binary (nullable = true)\n",
      " |-- value: binary (nullable = true)\n",
      " |-- topic: string (nullable = true)\n",
      " |-- partition: integer (nullable = true)\n",
      " |-- offset: long (nullable = true)\n",
      " |-- timestamp: timestamp (nullable = true)\n",
      " |-- timestampType: integer (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_kafka_raw.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62e5e753-89c7-460f-a8be-16868ce5c680",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### 1.2 Encoded Kafka Stream"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0b745eed-7d74-421e-8e4b-c8343fda4de3",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "df_kafka_encoded = df_kafka_raw.selectExpr(\"CAST(key AS STRING)\",\"CAST(value AS STRING)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "6839addc-c7c0-4117-8c9c-d2cd59cbf136",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- key: string (nullable = true)\n",
      " |-- value: string (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_kafka_encoded.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6749c4de-6f80-4b91-b2b8-b2968c761d75",
   "metadata": {},
   "source": [
    "#### 1.3 Structured Streaming DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ca20ae37-49f0-421f-9859-73fac8d4ca45",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def parse_ride_from_kafka_message(df_raw, schema):\n",
    "    \"\"\" take a Spark Streaming df and parse value col based on <schema>, return streaming df cols in schema \"\"\"\n",
    "    assert df_raw.isStreaming is True, \"DataFrame doesn't receive streaming data\"\n",
    "\n",
    "    df = df_raw.selectExpr(\"CAST(key AS STRING)\", \"CAST(value AS STRING)\")\n",
    "\n",
    "    # split attributes to nested array in one Column\n",
    "    col = F.split(df['value'], ', ')\n",
    "\n",
    "    # expand col to multiple top-level columns\n",
    "    for idx, field in enumerate(schema):\n",
    "        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))\n",
    "    return df.select([field.name for field in schema])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e1737bd0-146f-4ee2-a70f-a4657af5bbc6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ride_schema = T.StructType(\n",
    "    [T.StructField(\"vendor_id\", T.IntegerType()),\n",
    "     T.StructField('tpep_pickup_datetime', T.TimestampType()),\n",
    "     T.StructField('tpep_dropoff_datetime', T.TimestampType()),\n",
    "     T.StructField(\"passenger_count\", T.IntegerType()),\n",
    "     T.StructField(\"trip_distance\", T.FloatType()),\n",
    "     T.StructField(\"payment_type\", T.IntegerType()),\n",
    "     T.StructField(\"total_amount\", T.FloatType()),\n",
    "     ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "ae2ce896-f54b-4166-b01f-b5532ab292fe",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "df_rides = parse_ride_from_kafka_message(df_raw=df_kafka_raw, schema=ride_schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "cd848228-97c5-4325-8457-97f35e533cd8",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- vendor_id: integer (nullable = true)\n",
      " |-- tpep_pickup_datetime: timestamp (nullable = true)\n",
      " |-- tpep_dropoff_datetime: timestamp (nullable = true)\n",
      " |-- passenger_count: integer (nullable = true)\n",
      " |-- trip_distance: float (nullable = true)\n",
      " |-- payment_type: integer (nullable = true)\n",
      " |-- total_amount: float (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_rides.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1cdb53e-f477-4137-8412-6915d7772125",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "df_rides.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60277fdc-2797-4b23-9ecf-956b76db5778",
   "metadata": {
    "tags": []
   },
   "source": [
    "### 2 Sink Operation & Streaming Query\n",
    "\n",
    "through `writeStream`\n",
    "\n",
    "---\n",
    "**Output Sinks**\n",
    "- File Sink: stores the output in a directory\n",
    "- Kafka Sink: stores the output to one or more topics in Kafka\n",
    "- Foreach Sink: runs arbitrary computation on the records in the output\n",
    "- (for debugging) Console Sink, Memory Sink\n",
    "\n",
    "Further details can be found in [Output Sinks](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)\n",
    "\n",
    "---\n",
    "There are three types of **Output Modes**:\n",
    "- Complete: The whole Result Table is written to the sink after every trigger. This is supported for aggregation queries.\n",
    "- Append (default): Only the new rows added to the Result Table since the last trigger are written to the sink\n",
    "- Update: Only the rows updated in the Result Table since the last trigger are written to the sink\n",
    "\n",
    "The supported [Output Modes](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes) differ depending on the set of transformations applied to the streaming data. \n",
    "\n",
    "--- \n",
    "**Triggers**\n",
    "\n",
    "The [trigger settings](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers) of a streaming query define the timing of streaming data processing. Spark Structured Streaming supports micro-batch processing, and you can choose one of the following options based on your requirements.\n",
    "\n",
    "- default-micro-batch-mode\n",
    "- fixed-interval-micro-batch-mode\n",
    "- one-time-micro-batch-mode\n",
    "- available-now-micro-batch-mode\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02ca9b08-aa61-46cd-b946-4457ce2cdf5d",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Console and Memory Sink"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "74c72469-4c37-417c-a866-a1c1ef75ae8b",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):\n",
    "    write_query = df.writeStream \\\n",
    "        .outputMode(output_mode) \\\n",
    "        .trigger(processingTime=processing_time) \\\n",
    "        .format(\"console\") \\\n",
    "        .option(\"truncate\", False) \\\n",
    "        .start()\n",
    "    return write_query # pyspark.sql.streaming.StreamingQuery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "d866c7ba-f8e9-475d-830a-50ffb2c5472b",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "24/03/11 00:30:31 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-2b8e8845-1369-4653-8c23-c45a98e194a9. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n",
      "24/03/11 00:30:31 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "24/03/11 00:30:32 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n",
      "24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\n",
      "24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\n",
      "24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\n",
      "24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\n",
      "24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\n",
      "24/03/11 00:30:33 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n",
      "                                                                                \r"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------------------------------\n",
      "Batch: 0\n",
      "-------------------------------------------\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\n",
      "|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\n",
      "|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\n",
      "|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\n",
      "|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "write_query = sink_console(df_rides, output_mode='append')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "a9bfa73f-a8cc-4988-a8cf-bf31ee6c449c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def sink_memory(df, query_name, query_template):\n",
    "    write_query = df \\\n",
    "        .writeStream \\\n",
    "        .queryName(query_name) \\\n",
    "        .format('memory') \\\n",
    "        .start()\n",
    "    query_str = query_template.format(table_name=query_name)\n",
    "    query_results = spark.sql(query_str)\n",
    "    return write_query, query_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b31d0b76-e917-44e7-a14d-f9ce6901c23a",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "24/03/11 00:30:42 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-c7621425-b7fb-47fe-8b42-791c9c5d3186. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n",
      "24/03/11 00:30:42 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n",
      "24/03/11 00:30:43 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n",
      "24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\n",
      "24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\n",
      "24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\n",
      "24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\n",
      "24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "24/03/11 00:30:43 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n",
      "                                                                                \r"
     ]
    }
   ],
   "source": [
    "query_name = 'vendor_id_counts'\n",
    "query_template = 'select count(distinct(vendor_id)) from {table_name}'\n",
    "write_query, df_vendor_id_counts = sink_memory(df=df_rides, query_name=query_name, query_template=query_template)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "4ba56111-83bf-4028-ac65-565e0190f310",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pyspark.sql.streaming.query.StreamingQuery'>\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'message': 'Waiting for data to arrive',\n",
       " 'isDataAvailable': False,\n",
       " 'isTriggerActive': False}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------------------------------\n",
      "Batch: 1\n",
      "-------------------------------------------\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\n",
      "|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\n",
      "|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\n",
      "|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\n",
      "|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "\n",
      "-------------------------------------------\n",
      "Batch: 2\n",
      "-------------------------------------------\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\n",
      "|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\n",
      "|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\n",
      "|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\n",
      "|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\n",
      "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(type(write_query)) # pyspark.sql.streaming.StreamingQuery\n",
    "write_query.status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "7cc37bda-9cfa-402b-9d42-a6ba5271476b",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------------+\n",
      "|count(DISTINCT vendor_id)|\n",
      "+-------------------------+\n",
      "|                        2|\n",
      "+-------------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df_vendor_id_counts.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "88862ca9-4d89-487e-987f-08a2b9e83efe",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "write_query.stop()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "443d4041-06db-4a4a-89c1-348848cc7ca8",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Kafka Sink\n",
    "\n",
    "To write stream results to `kafka-topic`, the stream dataframe has at least a column with name `value`.\n",
    "\n",
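    "Therefore, before starting `writeStream` with the `kafka` format, the DataFrame needs to be transformed accordingly: its columns must be serialized into a single string or binary `value` column, optionally alongside a `key` column.\n",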
    "\n",
    "More information on the data structure expected by the Kafka sink is available [here](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "8b08a013-d039-41cf-94fd-a1a57571d25f",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:36 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:37 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:39 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:40 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:41 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:42 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:43 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:44 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:45 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:46 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:47 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:48 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:49 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:50 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:51 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:52 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:53 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:54 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:55 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:56 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:57 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:34:58 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:00 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:01 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:02 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:03 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:04 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:05 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:06 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:07 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:08 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:09 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:10 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:11 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:12 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:13 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:14 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:16 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:17 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:17 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:19 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:20 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:21 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:22 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:23 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:24 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:25 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:26 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:27 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:28 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:29 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:30 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:31 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:32 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:33 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:34 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:35 WARN KafkaOffsetReaderAdmin: Error in attempt 1 getting Kafka offsets: \n",
      "java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088535000, tries=1, nextAllowedTryMs=1710088535101) timed out at 1710088535001 after 1 attempt(s)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\n",
      "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\n",
      "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\n",
      "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\n",
      "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\n",
      "\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\n",
      "\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n",
      "\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n",
      "\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n",
      "\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\n",
      "\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\n",
      "\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\n",
      "\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\n",
      "\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\n",
      "\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\n",
      "\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\n",
      "Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088535000, tries=1, nextAllowedTryMs=1710088535101) timed out at 1710088535001 after 1 attempt(s)\n",
      "Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\n",
      "24/03/11 00:35:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:36 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n",
      "24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\n",
      "24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\n",
      "24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\n",
      "24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\n",
      "24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\n",
      "24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:37 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:37 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:39 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:40 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:41 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:42 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:43 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:44 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:45 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:46 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:47 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:48 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:49 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:50 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:51 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:52 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:53 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:55 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:55 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:57 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:58 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:35:59 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:00 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:01 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:02 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:03 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:04 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:06 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:07 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:08 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:09 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:10 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:11 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:12 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:13 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:14 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:15 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:16 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:17 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:18 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:19 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:20 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:22 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:23 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:24 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:25 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:26 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:27 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:28 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:29 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:30 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:31 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:32 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:33 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:35 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:35 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:36 WARN KafkaOffsetReaderAdmin: Error in attempt 2 getting Kafka offsets: \n",
      "java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088596058, tries=1, nextAllowedTryMs=1710088596159) timed out at 1710088596059 after 1 attempt(s)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\n",
      "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\n",
      "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\n",
      "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\n",
      "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\n",
      "\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\n",
      "\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n",
      "\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n",
      "\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n",
      "\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\n",
      "\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\n",
      "\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\n",
      "\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\n",
      "\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\n",
      "\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\n",
      "\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\n",
      "Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088596058, tries=1, nextAllowedTryMs=1710088596159) timed out at 1710088596059 after 1 attempt(s)\n",
      "Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\n",
      "24/03/11 00:36:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:37 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n",
      "24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\n",
      "24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\n",
      "24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\n",
      "24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\n",
      "24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\n",
      "24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:38 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:38 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:40 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:41 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:42 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:43 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:44 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:45 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:46 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:47 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:47 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:48 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:49 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:50 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:52 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:52 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:54 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:55 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:56 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:57 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:58 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:36:59 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:00 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:01 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:02 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:03 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:05 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:05 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:06 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:08 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:09 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:10 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:11 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:12 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:13 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:14 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:15 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:16 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:17 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:18 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:19 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:20 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:21 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:22 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:23 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:24 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:25 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:26 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:27 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:28 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:29 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:31 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:32 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:33 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:34 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:35 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:36 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:37 WARN KafkaOffsetReaderAdmin: Error in attempt 3 getting Kafka offsets: \n",
      "java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\n",
      "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\n",
      "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\n",
      "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\n",
      "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\n",
      "\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\n",
      "\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n",
      "\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n",
      "\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n",
      "\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\n",
      "\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\n",
      "\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\n",
      "\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\n",
      "\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\n",
      "\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\n",
      "\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\n",
      "Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\n",
      "Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\n",
      "24/03/11 00:37:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n",
      "24/03/11 00:37:38 ERROR MicroBatchExecution: Query [id = 4dfba771-eff7-49e7-a3ff-f1aa03a6e840, runId = 0f86ad02-1d50-487a-97c7-72790d8857d8] terminated with error\n",
      "java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\n",
      "\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\n",
      "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\n",
      "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\n",
      "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\n",
      "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\n",
      "\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\n",
      "\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\n",
      "\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n",
      "\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n",
      "\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n",
      "\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\n",
      "\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\n",
      "\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\n",
      "\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\n",
      "\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\n",
      "\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\n",
      "\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\n",
      "\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\n",
      "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\n",
      "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n",
      "\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\n",
      "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\n",
      "Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\n",
      "Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\n"
     ]
    }
   ],
   "source": [
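     "def prepare_dataframe_to_kafka_sink(df, value_columns, key_column=None):\n",
     "    # Kafka expects a string `value` column; join the chosen columns into one string\n",
     "    df = df.withColumn(\"value\", F.concat_ws(', ', *value_columns))\n",
     "    if key_column:\n",
     "        df = df.withColumnRenamed(key_column, \"key\")\n",
     "        df = df.withColumn(\"key\", df.key.cast('string'))\n",
     "    # `key` is optional for the Kafka sink, so select it only when it exists\n",
     "    output_columns = ['key', 'value'] if 'key' in df.columns else ['value']\n",
     "    return df.select(output_columns)\n",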
    "    \n",
    "def sink_kafka(df, topic, output_mode='append'):\n",
    "    write_query = df.writeStream \\\n",
    "        .format(\"kafka\") \\\n",
    "        .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n",
    "        .outputMode(output_mode) \\\n",
    "        .option(\"topic\", topic) \\\n",
    "        .option(\"checkpointLocation\", \"checkpoint\") \\\n",
    "        .start()\n",
    "    return write_query"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4cb2140-9f2e-4914-b74c-be4c18cdbe8a",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63abe115-879c-4863-97d3-b22cda7f7469",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 07-streaming/extras/python/streams-example/redpanda/streaming.py
================================================
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

from settings import RIDE_SCHEMA, CONSUME_TOPIC_RIDES_CSV, TOPIC_WINDOWED_VENDOR_ID_COUNT


def read_from_kafka(consume_topic: str):
    # Spark Streaming DataFrame; connects to the Kafka topic served at the hosts in the kafka.bootstrap.servers option
    df_stream = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092,broker:29092") \
        .option("subscribe", consume_topic) \
        .option("startingOffsets", "earliest") \
        .load()
    return df_stream


def parse_ride_from_kafka_message(df, schema):
    """ take a Spark Streaming df and parse value col based on <schema>, return streaming df cols in schema """
    assert df.isStreaming is True, "DataFrame doesn't receive streaming data"

    df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    # split attributes to nested array in one Column
    col = F.split(df['value'], ', ')

    # expand col to multiple top-level columns
    for idx, field in enumerate(schema):
        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))
    return df.select([field.name for field in schema])


def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):
    write_query = df.writeStream \
        .outputMode(output_mode) \
        .trigger(processingTime=processing_time) \
        .format("console") \
        .option("truncate", False) \
        .start()
    return write_query  # pyspark.sql.streaming.StreamingQuery


def sink_memory(df, query_name, query_template):
    query_df = df \
        .writeStream \
        .queryName(query_name) \
        .format("memory") \
        .start()
    query_str = query_template.format(table_name=query_name)
    query_results = spark.sql(query_str)
    return query_results, query_df


def sink_kafka(df, topic):
    write_query = df.writeStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092,broker:29092") \
        .outputMode('complete') \
        .option("topic", topic) \
        .option("checkpointLocation", "checkpoint") \
        .start()
    return write_query


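def prepare_df_to_kafka_sink(df, value_columns, key_column=None):
    # the Kafka sink expects a 'value' column and, optionally, a 'key' column
    df = df.withColumn("value", F.concat_ws(', ', *value_columns))
    if key_column:
        df = df.withColumnRenamed(key_column, "key")
        df = df.withColumn("key", df.key.cast('string'))
        return df.select(['key', 'value'])
    # no key requested: there is no 'key' column to select, so emit only the value
    return df.select(['value'])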


def op_groupby(df, column_names):
    df_aggregation = df.groupBy(column_names).count()
    return df_aggregation


def op_windowed_groupby(df, window_duration, slide_duration):
    df_windowed_aggregation = df.groupBy(
        F.window(timeColumn=df.tpep_pickup_datetime, windowDuration=window_duration, slideDuration=slide_duration),
        df.vendor_id
    ).count()
    return df_windowed_aggregation


if __name__ == "__main__":
    spark = SparkSession.builder.appName('streaming-examples').getOrCreate()
    spark.sparkContext.setLogLevel('WARN')

    # read_streaming data
    df_consume_stream = read_from_kafka(consume_topic=CONSUME_TOPIC_RIDES_CSV)
    # printSchema() prints the schema itself and returns None, so it is not wrapped in print()
    df_consume_stream.printSchema()

    # parse streaming data
    df_rides = parse_ride_from_kafka_message(
        df_consume_stream,
        RIDE_SCHEMA
    )
    df_rides.printSchema()

    sink_console(df_rides, output_mode='append')

    df_trip_count_by_vendor_id = op_groupby(df_rides, ['vendor_id'])
    df_trip_count_by_pickup_date_vendor_id = op_windowed_groupby(
        df_rides, 
        window_duration="10 minutes", 
        slide_duration='5 minutes'
    )

    # write the output out to the console for debugging / testing
    sink_console(df_trip_count_by_vendor_id)
    # write the output to the kafka topic
    df_trip_count_messages = prepare_df_to_kafka_sink(
        df=df_trip_count_by_pickup_date_vendor_id, 
        value_columns=['count'], 
        key_column='vendor_id'
    )
    kafka_sink_query = sink_kafka(
        df=df_trip_count_messages, 
        topic=TOPIC_WINDOWED_VENDOR_ID_COUNT
    )

    spark.streams.awaitAnyTermination()


================================================
FILE: 07-streaming/theory/README.md
================================================
# Kafka theory (optional)

Video lectures covering Kafka concepts, with code examples in Java.

Code: [java/kafka_examples](java/kafka_examples)


## Stream processing

- [7.0.1 Introduction](https://youtu.be/hfvju3iOIP0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=67)
- [7.0.2 What is stream processing](https://youtu.be/WxTxKGcfA-k&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=68)
- [7.3 What is Kafka?](https://youtu.be/zPLZUDPi4AY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=69)
- [7.4 Confluent Cloud](https://youtu.be/ZnEZFEYKppw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=70)
- [7.5 Kafka producer consumer](https://youtu.be/aegTuyxX7Yg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=71)
- [7.6 Kafka configuration](https://youtu.be/SXQtWyRpMKs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=72)

Links:

- [Slides](https://docs.google.com/presentation/d/1bCtdCba8v1HxJ_uMm9pwjRUC-NAMeB-6nOG2ng3KujA/edit?usp=sharing)
- [Kafka Configuration Reference](https://docs.confluent.io/platform/current/installation/configuration/)
- [Confluent Cloud trial](https://www.confluent.io/confluent-cloud/tryfree/)


## Kafka Streams

- [7.7 Kafka stream basics](https://youtu.be/dUyA_63eRb0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=73)
- [7.8 Kafka stream join](https://youtu.be/NcpKlujh34Y&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=74)
- [7.9 Kafka stream testing](https://youtu.be/TNx5rmLY8Pk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=75)
- [7.10 Kafka stream windowing](https://youtu.be/r1OuLdwxbRc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=76)
- [7.11 Kafka ksqlDB and Connect](https://youtu.be/DziQ4a4tn9Y&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=77)
- [7.12 Kafka Schema registry](https://youtu.be/tBY_hBuyzwI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=78)

Links:

- [Slides](https://docs.google.com/presentation/d/1fVi9sFa7fL2ZW3ynS5MAZm0bRSZ4jO10fymPmrfTUjE/edit?usp=sharing)
- [Streams Concepts](https://docs.confluent.io/platform/current/streams/concepts.html)
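
As a rough illustration of the windowing idea covered in 7.10, the sketch below computes which hopping (sliding) windows contain a given event time. It is plain Python, not the Java Kafka Streams API used in the lectures. The function name `hopping_windows` is hypothetical, and aligning window starts to the Unix epoch is an assumption made for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption for this sketch: window starts are aligned to the Unix epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hopping_windows(event_time, window, slide):
    """Return every [start, end) window of length `window`, advancing by
    `slide`, that contains `event_time`, sorted by start time."""
    # Start of the latest window that begins at or before the event.
    latest_start = event_time - ((event_time - EPOCH) % slide)
    windows = []
    start = latest_start
    # Walk backwards while the window [start, start + window) still covers the event.
    while start > event_time - window:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


if __name__ == "__main__":
    event = datetime(2024, 1, 1, 12, 7, tzinfo=timezone.utc)
    for start, end in hopping_windows(event, timedelta(minutes=10), timedelta(minutes=5)):
        print(start.time(), "->", end.time())
```

With a 10-minute window and a 5-minute slide, each event falls into two overlapping windows. Setting `slide` equal to `window` gives non-overlapping (tumbling) windows, so each event falls into exactly one.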


================================================
FILE: 07-streaming/theory/java/kafka_examples/.gitignore
================================================
.gradle
bin
!src/main/resources/rides.csv

build/classes
build/generated
build/libs
build/reports
build/resources
build/test-results
build/tmp


================================================
FILE: 07-streaming/theory/java/kafka_examples/build/generated-main-avro-java/schemaregistry/RideRecord.java
================================================
/**
 * Autogenerated by Avro
 *
 * DO NOT EDIT DIRECTLY
 */
package schemaregistry;

import org.apache.avro.generic.GenericArray;
import org.apache.avro.specific.SpecificData;
import org.apache.avro.util.Utf8;
import org.apache.avro.message.BinaryMessageEncoder;
import org.apache.avro.message.BinaryMessageDecoder;
import org.apache.avro.message.SchemaStore;

@org.apache.avro.specific.AvroGenerated
public class RideRecord extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
  private static final long serialVersionUID = 6805437803204402942L;


  public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"RideRecord\",\"namespace\":\"schemaregistry\",\"fields\":[{\"name\":\"vendor_id\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"passenger_count\",\"type\":\"int\"},{\"name\":\"trip_distance\",\"type\":\"double\"}]}");
  public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }

  private static final SpecificData MODEL$ = new SpecificData();

  private static final BinaryMessageEncoder<RideRecord> ENCODER =
      new BinaryMessageEncoder<>(MODEL$, SCHEMA$);

  private static final BinaryMessageDecoder<RideRecord> DECODER =
      new BinaryMessageDecoder<>(MODEL$, SCHEMA$);

  /**
   * Return the BinaryMessageEncoder instance used by this class.
   * @return the message encoder used by this class
   */
  public static BinaryMessageEncoder<RideRecord> getEncoder() {
    return ENCODER;
  }

  /**
   * Return the BinaryMessageDecoder instance used by this class.
   * @return the message decoder used by this class
   */
  public static BinaryMessageDecoder<RideRecord> getDecoder() {
    return DECODER;
  }

  /**
   * Create a new BinaryMessageDecoder instance for this class that uses the specified {@link SchemaStore}.
   * @param resolver a {@link SchemaStore} used to find schemas by fingerprint
   * @return a BinaryMessageDecoder instance for this class backed by the given SchemaStore
   */
  public static BinaryMessageDecoder<RideRecord> createDecoder(SchemaStore resolver) {
    return new BinaryMessageDecoder<>(MODEL$, SCHEMA$, resolver);
  }

  /**
   * Serializes this RideRecord to a ByteBuffer.
   * @return a buffer holding the serialized data for this instance
   * @throws java.io.IOException if this instance could not be serialized
   */
  public java.nio.ByteBuffer toByteBuffer() throws java.io.IOException {
    return ENCODER.encode(this);
  }

  /**
   * Deserializes a RideRecord from a ByteBuffer.
   * @param b a byte buffer holding serialized data for an instance of this class
   * @return a RideRecord instance decoded from the given buffer
   * @throws java.io.IOException if the given bytes could not be deserialized into an instance of this class
   */
  public static RideRecord fromByteBuffer(
      java.nio.ByteBuffer b) throws java.io.IOException {
    return DECODER.decode(b);
  }

  private java.lang.String vendor_id;
  private int passenger_count;
  private double trip_distance;

  /**
   * Default constructor.  Note that this does not initialize fields
   * to their default values from the schema.  If that is desired then
   * one should use <code>newBuilder()</code>.
   */
  public RideRecord() {}

  /**
   * All-args constructor.
   * @param vendor_id The new value for vendor_id
   * @param passenger_count The new value for passenger_count
   * @param trip_distance The new value for trip_distance
   */
  public RideRecord(java.lang.String vendor_id, java.lang.Integer passenger_count, java.lang.Double trip_distance) {
    this.vendor_id = vendor_id;
    this.passenger_count = passenger_count;
    this.trip_distance = trip_distance;
  }

  @Override
  public org.apache.avro.specific.SpecificData getSpecificData() { return MODEL$; }

  @Override
  public org.apache.avro.Schema getSchema() { return SCHEMA$; }

  // Used by DatumWriter.  Applications should not call.
  @Override
  public java.lang.Object get(int field$) {
    switch (field$) {
    case 0: return vendor_id;
    case 1: return passenger_count;
    case 2: return trip_distance;
    default: throw new IndexOutOfBoundsException("Invalid index: " + field$);
    }
  }

  // Used by DatumReader.  Applications should not call.
  @Override
  @SuppressWarnings(value="unchecked")
  public void put(int field$, java.lang.Object value$) {
    switch (field$) {
    case 0: vendor_id = value$ != null ? value$.toString() : null; break;
    case 1: passenger_count = (java.lang.Integer)value$; break;
    case 2: trip_distance = (java.lang.Double)value$; break;
    default: throw new IndexOutOfBoundsException("Invalid index: " + field$);
    }
  }

  /**
   * Gets the value of the 'vendor_id' field.
   * @return The value of the 'vendor_id' field.
   */
  public java.lang.String getVendorId() {
    return vendor_id;
  }


  /**
   * Sets the value of the 'vendor_id' field.
   * @param value the value to set.
   */
  public void setVendorId(java.lang.String value) {
    this.vendor_id = value;
  }

  /**
   * Gets the value of the 'passenger_count' field.
   * @return The value of the 'passenger_count' field.
   */
  public int getPassengerCount() {
    return passenger_count;
  }


  /**
   * Sets the value of the 'passenger_count' field.
   * @param value the value to set.
   */
  public void setPassengerCount(int value) {
    this.passenger_count = value;
  }

  /**
   * Gets the value of the 'trip_distance' field.
   * @return The value of the 'trip_distance' field.
   */
  public double getTripDistance() {
    return trip_distance;
  }


  /**
   * Sets the value of the 'trip_distance' field.
   * @param value the value to set.
   */
  public void setTripDistance(double value) {
    this.trip_distance = value;
  }

  /**
   * Creates a new RideRecord RecordBuilder.
   * @return A new RideRecord RecordBuilder
   */
  public static schemaregistry.RideRecord.Builder newBuilder() {
    return new schemaregistry.RideRecord.Builder();
  }

  /**
   * Creates a new RideRecord RecordBuilder by copying an existing Builder.
   * @param other The existing builder to copy.
   * @return A new RideRecord RecordBuilder
   */
  public static schemaregistry.RideRecord.Builder newBuilder(schemaregistry.RideRecord.Builder other) {
    if (other == null) {
      return new schemaregistry.RideRecord.Builder();
    } else {
      return new schemaregistry.RideRecord.Builder(other);
    }
  }

  /**
   * Creates a new RideRecord RecordBuilder by copying an existing RideRecord instance.
   * @param other The existing instance to copy.
   * @return A new RideRecord RecordBuilder
   */
  public static schemaregistry.RideRecord.Builder newBuilder(schemaregistry.RideRecord other) {
    if (other == null) {
      return new schemaregistry.RideRecord.Builder();
    } else {
      return new schemaregistry.RideRecord.Builder(other);
    }
  }

  /**
   * RecordBuilder for RideRecord instances.
   */
  @org.apache.avro.specific.AvroGenerated
  public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<RideRecord>
    implements org.apache.avro.data.RecordBuilder<RideRecord> {

    private java.lang.String vendor_id;
    private int passenger_count;
    private double trip_distance;

    /** Creates a new Builder */
    private Builder() {
      super(SCHEMA$, MODEL$);
    }

    /**
     * Creates a Builder by copying an existing Builder.
     * @param other The existing Builder to copy.
     */
    private Builder(schemaregistry.RideRecord.Builder other) {
      super(other);
      if (isValidValue(fields()[0], other.vendor_id)) {
        this.vendor_id = data().deepCopy(fields()[0].schema(), other.vendor_id);
        fieldSetFlags()[0] = other.fieldSetFlags()[0];
      }
      if (isValidValue(fields()[1], other.passenger_count)) {
        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);
        fieldSetFlags()[1] = other.fieldSetFlags()[1];
      }
      if (isValidValue(fields()[2], other.trip_distance)) {
        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);
        fieldSetFlags()[2] = other.fieldSetFlags()[2];
      }
    }

    /**
     * Creates a Builder by copying an existing RideRecord instance
     * @param other The existing instance to copy.
     */
    private Builder(schemaregistry.RideRecord other) {
      super(SCHEMA$, MODEL$);
      if (isValidValue(fields()[0], other.vendor_id)) {
        this.vendor_id = data().deepCopy(fields()[0].schema(), other.vendor_id);
        fieldSetFlags()[0] = true;
      }
      if (isValidValue(fields()[1], other.passenger_count)) {
        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);
        fieldSetFlags()[1] = true;
      }
      if (isValidValue(fields()[2], other.trip_distance)) {
        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);
        fieldSetFlags()[2] = true;
      }
    }

    /**
      * Gets the value of the 'vendor_id' field.
      * @return The value.
      */
    public java.lang.String getVendorId() {
      return vendor_id;
    }


    /**
      * Sets the value of the 'vendor_id' field.
      * @param value The value of 'vendor_id'.
      * @return This builder.
      */
    public schemaregistry.RideRecord.Builder setVendorId(java.lang.String value) {
      validate(fields()[0], value);
      this.vendor_id = value;
      fieldSetFlags()[0] = true;
      return this;
    }

    /**
      * Checks whether the 'vendor_id' field has been set.
      * @return True if the 'vendor_id' field has been set, false otherwise.
      */
    public boolean hasVendorId() {
      return fieldSetFlags()[0];
    }


    /**
      * Clears the value of the 'vendor_id' field.
      * @return This builder.
      */
    public schemaregistry.RideRecord.Builder clearVendorId() {
      vendor_id = null;
      fieldSetFlags()[0] = false;
      return this;
    }

    /**
      * Gets the value of the 'passenger_count' field.
      * @return The value.
      */
    public int getPassengerCount() {
      return passenger_count;
    }


    /**
      * Sets the value of the 'passenger_count' field.
      * @param value The value of 'passenger_count'.
      * @return This builder.
      */
    public schemaregistry.RideRecord.Builder setPassengerCount(int value) {
      validate(fields()[1], value);
      this.passenger_count = value;
      fieldSetFlags()[1] = true;
      return this;
    }

    /**
      * Checks whether the 'passenger_count' field has been set.
      * @return True if the 'passenger_count' field has been set, false otherwise.
      */
    public boolean hasPassengerCount() {
      return fieldSetFlags()[1];
    }


    /**
      * Clears the value of the 'passenger_count' field.
      * @return This builder.
      */
    public schemaregistry.RideRecord.Builder clearPassengerCount() {
      fieldSetFlags()[1] = false;
      return this;
    }

    /**
      * Gets the value of the 'trip_distance' field.
      * @return The value.
      */
    public double getTripDistance() {
      return trip_distance;
    }


    /**
      * Sets the value of the 'trip_distance' field.
      * @param value The value of 'trip_distance'.
      * @return This builder.
      */
    public schemaregistry.RideRecord.Builder setTripDistance(double value) {
      validate(fields()[2], value);
      this.trip_distance = value;
      fieldSetFlags()[2] = true;
      return this;
    }

    /**
      * Checks whether the 'trip_distance' field has been set.
      * @return True if the 'trip_distance' field has been set, false otherwise.
      */
    public boolean hasTripDistance() {
      return fieldSetFlags()[2];
    }


    /**
      * Clears the value of the 'trip_distance' field.
      * @return This builder.
      */
    public schemaregistry.RideRecord.Builder clearTripDistance() {
      fieldSetFlags()[2] = false;
      return this;
    }

    @Override
    @SuppressWarnings("unchecked")
    public RideRecord build() {
      try {
        RideRecord record = new RideRecord();
        record.vendor_id = fieldSetFlags()[0] ? this.vendor_id : (java.lang.String) defaultValue(fields()[0]);
        record.passenger_count = fieldSetFlags()[1] ? this.passenger_count : (java.lang.Integer) defaultValue(fields()[1]);
        record.trip_distance = fieldSetFlags()[2] ? this.trip_distance : (java.lang.Double) defaultValue(fields()[2]);
        return record;
      } catch (org.apache.avro.AvroMissingFieldException e) {
        throw e;
      } catch (java.lang.Exception e) {
        throw new org.apache.avro.AvroRuntimeException(e);
      }
    }
  }

  @SuppressWarnings("unchecked")
  private static final org.apache.avro.io.DatumWriter<RideRecord>
    WRITER$ = (org.apache.avro.io.DatumWriter<RideRecord>)MODEL$.createDatumWriter(SCHEMA$);

  @Override public void writeExternal(java.io.ObjectOutput out)
    throws java.io.IOException {
    WRITER$.write(this, SpecificData.getEncoder(out));
  }

  @SuppressWarnings("unchecked")
  private static final org.apache.avro.io.DatumReader<RideRecord>
    READER$ = (org.apache.avro.io.DatumReader<RideRecord>)MODEL$.createDatumReader(SCHEMA$);

  @Override public void readExternal(java.io.ObjectInput in)
    throws java.io.IOException {
    READER$.read(this, SpecificData.getDecoder(in));
  }

  @Override protected boolean hasCustomCoders() { return true; }

  @Override public void customEncode(org.apache.avro.io.Encoder out)
    throws java.io.IOException
  {
    out.writeString(this.vendor_id);

    out.writeInt(this.passenger_count);

    out.writeDouble(this.trip_distance);

  }

  @Override public void customDecode(org.apache.avro.io.ResolvingDecoder in)
    throws java.io.IOException
  {
    org.apache.avro.Schema.Field[] fieldOrder = in.readFieldOrderIfDiff();
    if (fieldOrder == null) {
      this.vendor_id = in.readString();

      this.passenger_count = in.readInt();

      this.trip_distance = in.readDouble();

    } else {
      for (int i = 0; i < 3; i++) {
        switch (fieldOrder[i].pos()) {
        case 0:
          this.vendor_id = in.readString();
          break;

        case 1:
          this.passenger_count = in.readInt();
          break;

        case 2:
          this.trip_distance = in.readDouble();
          break;

        default:
          throw new java.io.IOException("Corrupt ResolvingDecoder.");
        }
      }
    }
  }
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/build/generated-main-avro-java/schemaregistry/RideRecordCompatible.java
================================================
/**
 * Autogenerated by Avro
 *
 * DO NOT EDIT DIRECTLY
 */
package schemaregistry;

import org.apache.avro.generic.GenericArray;
import org.apache.avro.specific.SpecificData;
import org.apache.avro.util.Utf8;
import org.apache.avro.message.BinaryMessageEncoder;
import org.apache.avro.message.BinaryMessageDecoder;
import org.apache.avro.message.SchemaStore;

@org.apache.avro.specific.AvroGenerated
public class RideRecordCompatible extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
  private static final long serialVersionUID = 7163300507090021229L;


  public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"RideRecordCompatible\",\"namespace\":\"schemaregistry\",\"fields\":[{\"name\":\"vendorId\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"passenger_count\",\"type\":\"int\"},{\"name\":\"trip_distance\",\"type\":\"double\"},{\"name\":\"pu_location_id\",\"type\":[\"null\",\"long\"],\"default\":null}]}");
  public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }

  private static final SpecificData MODEL$ = new SpecificData();

  private static final BinaryMessageEncoder<RideRecordCompatible> ENCODER =
      new BinaryMessageEncoder<>(MODEL$, SCHEMA$);

  private static final BinaryMessageDecoder<RideRecordCompatible> DECODER =
      new BinaryMessageDecoder<>(MODEL$, SCHEMA$);

  /**
   * Return the BinaryMessageEncoder instance used by this class.
   * @return the message encoder used by this class
   */
  public static BinaryMessageEncoder<RideRecordCompatible> getEncoder() {
    return ENCODER;
  }

  /**
   * Return the BinaryMessageDecoder instance used by this class.
   * @return the message decoder used by this class
   */
  public static BinaryMessageDecoder<RideRecordCompatible> getDecoder() {
    return DECODER;
  }

  /**
   * Create a new BinaryMessageDecoder instance for this class that uses the specified {@link SchemaStore}.
   * @param resolver a {@link SchemaStore} used to find schemas by fingerprint
   * @return a BinaryMessageDecoder instance for this class backed by the given SchemaStore
   */
  public static BinaryMessageDecoder<RideRecordCompatible> createDecoder(SchemaStore resolver) {
    return new BinaryMessageDecoder<>(MODEL$, SCHEMA$, resolver);
  }

  /**
   * Serializes this RideRecordCompatible to a ByteBuffer.
   * @return a buffer holding the serialized data for this instance
   * @throws java.io.IOException if this instance could not be serialized
   */
  public java.nio.ByteBuffer toByteBuffer() throws java.io.IOException {
    return ENCODER.encode(this);
  }

  /**
   * Deserializes a RideRecordCompatible from a ByteBuffer.
   * @param b a byte buffer holding serialized data for an instance of this class
   * @return a RideRecordCompatible instance decoded from the given buffer
   * @throws java.io.IOException if the given bytes could not be deserialized into an instance of this class
   */
  public static RideRecordCompatible fromByteBuffer(
      java.nio.ByteBuffer b) throws java.io.IOException {
    return DECODER.decode(b);
  }

  private java.lang.String vendorId;
  private int passenger_count;
  private double trip_distance;
  private java.lang.Long pu_location_id;

  /**
   * Default constructor.  Note that this does not initialize fields
   * to their default values from the schema.  If that is desired then
   * one should use <code>newBuilder()</code>.
   */
  public RideRecordCompatible() {}

  /**
   * All-args constructor.
   * @param vendorId The new value for vendorId
   * @param passenger_count The new value for passenger_count
   * @param trip_distance The new value for trip_distance
   * @param pu_location_id The new value for pu_location_id
   */
  public RideRecordCompatible(java.lang.String vendorId, java.lang.Integer passenger_count, java.lang.Double trip_distance, java.lang.Long pu_location_id) {
    this.vendorId = vendorId;
    this.passenger_count = passenger_count;
    this.trip_distance = trip_distance;
    this.pu_location_id = pu_location_id;
  }

  @Override
  public org.apache.avro.specific.SpecificData getSpecificData() { return MODEL$; }

  @Override
  public org.apache.avro.Schema getSchema() { return SCHEMA$; }

  // Used by DatumWriter.  Applications should not call.
  @Override
  public java.lang.Object get(int field$) {
    switch (field$) {
    case 0: return vendorId;
    case 1: return passenger_count;
    case 2: return trip_distance;
    case 3: return pu_location_id;
    default: throw new IndexOutOfBoundsException("Invalid index: " + field$);
    }
  }

  // Used by DatumReader.  Applications should not call.
  @Override
  @SuppressWarnings(value="unchecked")
  public void put(int field$, java.lang.Object value$) {
    switch (field$) {
    case 0: vendorId = value$ != null ? value$.toString() : null; break;
    case 1: passenger_count = (java.lang.Integer)value$; break;
    case 2: trip_distance = (java.lang.Double)value$; break;
    case 3: pu_location_id = (java.lang.Long)value$; break;
    default: throw new IndexOutOfBoundsException("Invalid index: " + field$);
    }
  }

  /**
   * Gets the value of the 'vendorId' field.
   * @return The value of the 'vendorId' field.
   */
  public java.lang.String getVendorId() {
    return vendorId;
  }


  /**
   * Sets the value of the 'vendorId' field.
   * @param value the value to set.
   */
  public void setVendorId(java.lang.String value) {
    this.vendorId = value;
  }

  /**
   * Gets the value of the 'passenger_count' field.
   * @return The value of the 'passenger_count' field.
   */
  public int getPassengerCount() {
    return passenger_count;
  }


  /**
   * Sets the value of the 'passenger_count' field.
   * @param value the value to set.
   */
  public void setPassengerCount(int value) {
    this.passenger_count = value;
  }

  /**
   * Gets the value of the 'trip_distance' field.
   * @return The value of the 'trip_distance' field.
   */
  public double getTripDistance() {
    return trip_distance;
  }


  /**
   * Sets the value of the 'trip_distance' field.
   * @param value the value to set.
   */
  public void setTripDistance(double value) {
    this.trip_distance = value;
  }

  /**
   * Gets the value of the 'pu_location_id' field.
   * @return The value of the 'pu_location_id' field.
   */
  public java.lang.Long getPuLocationId() {
    return pu_location_id;
  }


  /**
   * Sets the value of the 'pu_location_id' field.
   * @param value the value to set.
   */
  public void setPuLocationId(java.lang.Long value) {
    this.pu_location_id = value;
  }

  /**
   * Creates a new RideRecordCompatible RecordBuilder.
   * @return A new RideRecordCompatible RecordBuilder
   */
  public static schemaregistry.RideRecordCompatible.Builder newBuilder() {
    return new schemaregistry.RideRecordCompatible.Builder();
  }

  /**
   * Creates a new RideRecordCompatible RecordBuilder by copying an existing Builder.
   * @param other The existing builder to copy.
   * @return A new RideRecordCompatible RecordBuilder
   */
  public static schemaregistry.RideRecordCompatible.Builder newBuilder(schemaregistry.RideRecordCompatible.Builder other) {
    if (other == null) {
      return new schemaregistry.RideRecordCompatible.Builder();
    } else {
      return new schemaregistry.RideRecordCompatible.Builder(other);
    }
  }

  /**
   * Creates a new RideRecordCompatible RecordBuilder by copying an existing RideRecordCompatible instance.
   * @param other The existing instance to copy.
   * @return A new RideRecordCompatible RecordBuilder
   */
  public static schemaregistry.RideRecordCompatible.Builder newBuilder(schemaregistry.RideRecordCompatible other) {
    if (other == null) {
      return new schemaregistry.RideRecordCompatible.Builder();
    } else {
      return new schemaregistry.RideRecordCompatible.Builder(other);
    }
  }

  /**
   * RecordBuilder for RideRecordCompatible instances.
   */
  @org.apache.avro.specific.AvroGenerated
  public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<RideRecordCompatible>
    implements org.apache.avro.data.RecordBuilder<RideRecordCompatible> {

    private java.lang.String vendorId;
    private int passenger_count;
    private double trip_distance;
    private java.lang.Long pu_location_id;

    /** Creates a new Builder */
    private Builder() {
      super(SCHEMA$, MODEL$);
    }

    /**
     * Creates a Builder by copying an existing Builder.
     * @param other The existing Builder to copy.
     */
    private Builder(schemaregistry.RideRecordCompatible.Builder other) {
      super(other);
      if (isValidValue(fields()[0], other.vendorId)) {
        this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId);
        fieldSetFlags()[0] = other.fieldSetFlags()[0];
      }
      if (isValidValue(fields()[1], other.passenger_count)) {
        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);
        fieldSetFlags()[1] = other.fieldSetFlags()[1];
      }
      if (isValidValue(fields()[2], other.trip_distance)) {
        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);
        fieldSetFlags()[2] = other.fieldSetFlags()[2];
      }
      if (isValidValue(fields()[3], other.pu_location_id)) {
        this.pu_location_id = data().deepCopy(fields()[3].schema(), other.pu_location_id);
        fieldSetFlags()[3] = other.fieldSetFlags()[3];
      }
    }

    /**
     * Creates a Builder by copying an existing RideRecordCompatible instance
     * @param other The existing instance to copy.
     */
    private Builder(schemaregistry.RideRecordCompatible other) {
      super(SCHEMA$, MODEL$);
      if (isValidValue(fields()[0], other.vendorId)) {
        this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId);
        fieldSetFlags()[0] = true;
      }
      if (isValidValue(fields()[1], other.passenger_count)) {
        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);
        fieldSetFlags()[1] = true;
      }
      if (isValidValue(fields()[2], other.trip_distance)) {
        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);
        fieldSetFlags()[2] = true;
      }
      if (isValidValue(fields()[3], other.pu_location_id)) {
        this.pu_location_id = data().deepCopy(fields()[3].schema(), other.pu_location_id);
        fieldSetFlags()[3] = true;
      }
    }

    /**
      * Gets the value of the 'vendorId' field.
      * @return The value.
      */
    public java.lang.String getVendorId() {
      return vendorId;
    }


    /**
      * Sets the value of the 'vendorId' field.
      * @param value The value of 'vendorId'.
      * @return This builder.
      */
    public schemaregistry.RideRecordCompatible.Builder setVendorId(java.lang.String value) {
      validate(fields()[0], value);
      this.vendorId = value;
      fieldSetFlags()[0] = true;
      return this;
    }

    /**
      * Checks whether the 'vendorId' field has been set.
      * @return True if the 'vendorId' field has been set, false otherwise.
      */
    public boolean hasVendorId() {
      return fieldSetFlags()[0];
    }


    /**
      * Clears the value of the 'vendorId' field.
      * @return This builder.
      */
    public schemaregistry.RideRecordCompatible.Builder clearVendorId() {
      vendorId = null;
      fieldSetFlags()[0] = false;
      return this;
    }

    /**
      * Gets the value of the 'passenger_count' field.
      * @return The value.
      */
    public int getPassengerCount() {
      return passenger_count;
    }


    /**
      * Sets the value of the 'passenger_count' field.
      * @param value The value of 'passenger_count'.
      * @return This builder.
      */
    public schemaregistry.RideRecordCompatible.Builder setPassengerCount(int value) {
      validate(fields()[1], value);
      this.passenger_count = value;
      fieldSetFlags()[1] = true;
      return this;
    }

    /**
      * Checks whether the 'passenger_count' field has been set.
      * @return True if the 'passenger_count' field has been set, false otherwise.
      */
    public boolean hasPassengerCount() {
      return fieldSetFlags()[1];
    }


    /**
      * Clears the value of the 'passenger_count' field.
      * @return This builder.
      */
    public schemaregistry.RideRecordCompatible.Builder clearPassengerCount() {
      fieldSetFlags()[1] = false;
      return this;
    }

    /**
      * Gets the value of the 'trip_distance' field.
      * @return The value.
      */
    public double getTripDistance() {
      return trip_distance;
    }


    /**
      * Sets the value of the 'trip_distance' field.
      * @param value The value of 'trip_distance'.
      * @return This builder.
      */
    public schemaregistry.RideRecordCompatible.Builder setTripDistance(double value) {
      validate(fields()[2], value);
      this.trip_distance = value;
      fieldSetFlags()[2] = true;
      return this;
    }

    /**
      * Checks whether the 'trip_distance' field has been set.
      * @return True if the 'trip_distance' field has been set, false otherwise.
      */
    public boolean hasTripDistance() {
      return fieldSetFlags()[2];
    }


    /**
      * Clears the value of the 'trip_distance' field.
      * @return This builder.
      */
    public schemaregistry.RideRecordCompatible.Builder clearTripDistance() {
      fieldSetFlags()[2] = false;
      return this;
    }

    /**
      * Gets the value of the 'pu_location_id' field.
      * @return The value.
      */
    public java.lang.Long getPuLocationId() {
      return pu_location_id;
    }


    /**
      * Sets the value of the 'pu_location_id' field.
      * @param value The value of 'pu_location_id'.
      * @return This builder.
      */
    public schemaregistry.RideRecordCompatible.Builder setPuLocationId(java.lang.Long value) {
      validate(fields()[3], value);
      this.pu_location_id = value;
      fieldSetFlags()[3] = true;
      return this;
    }

    /**
      * Checks whether the 'pu_location_id' field has been set.
      * @return True if the 'pu_location_id' field has been set, false otherwise.
      */
    public boolean hasPuLocationId() {
      return fieldSetFlags()[3];
    }


    /**
      * Clears the value of the 'pu_location_id' field.
      * @return This builder.
      */
    public schemaregistry.RideRecordCompatible.Builder clearPuLocationId() {
      pu_location_id = null;
      fieldSetFlags()[3] = false;
      return this;
    }

    @Override
    @SuppressWarnings("unchecked")
    public RideRecordCompatible build() {
      try {
        RideRecordCompatible record = new RideRecordCompatible();
        record.vendorId = fieldSetFlags()[0] ? this.vendorId : (java.lang.String) defaultValue(fields()[0]);
        record.passenger_count = fieldSetFlags()[1] ? this.passenger_count : (java.lang.Integer) defaultValue(fields()[1]);
        record.trip_distance = fieldSetFlags()[2] ? this.trip_distance : (java.lang.Double) defaultValue(fields()[2]);
        record.pu_location_id = fieldSetFlags()[3] ? this.pu_location_id : (java.lang.Long) defaultValue(fields()[3]);
        return record;
      } catch (org.apache.avro.AvroMissingFieldException e) {
        throw e;
      } catch (java.lang.Exception e) {
        throw new org.apache.avro.AvroRuntimeException(e);
      }
    }
  }

  @SuppressWarnings("unchecked")
  private static final org.apache.avro.io.DatumWriter<RideRecordCompatible>
    WRITER$ = (org.apache.avro.io.DatumWriter<RideRecordCompatible>)MODEL$.createDatumWriter(SCHEMA$);

  @Override public void writeExternal(java.io.ObjectOutput out)
    throws java.io.IOException {
    WRITER$.write(this, SpecificData.getEncoder(out));
  }

  @SuppressWarnings("unchecked")
  private static final org.apache.avro.io.DatumReader<RideRecordCompatible>
    READER$ = (org.apache.avro.io.DatumReader<RideRecordCompatible>)MODEL$.createDatumReader(SCHEMA$);

  @Override public void readExternal(java.io.ObjectInput in)
    throws java.io.IOException {
    READER$.read(this, SpecificData.getDecoder(in));
  }

  @Override protected boolean hasCustomCoders() { return true; }

  @Override public void customEncode(org.apache.avro.io.Encoder out)
    throws java.io.IOException
  {
    out.writeString(this.vendorId);

    out.writeInt(this.passenger_count);

    out.writeDouble(this.trip_distance);

    if (this.pu_location_id == null) {
      out.writeIndex(0);
      out.writeNull();
    } else {
      out.writeIndex(1);
      out.writeLong(this.pu_location_id);
    }

  }

  @Override public void customDecode(org.apache.avro.io.ResolvingDecoder in)
    throws java.io.IOException
  {
    org.apache.avro.Schema.Field[] fieldOrder = in.readFieldOrderIfDiff();
    if (fieldOrder == null) {
      this.vendorId = in.readString();

      this.passenger_count = in.readInt();

      this.trip_distance = in.readDouble();

      if (in.readIndex() != 1) {
        in.readNull();
        this.pu_location_id = null;
      } else {
        this.pu_location_id = in.readLong();
      }

    } else {
      for (int i = 0; i < 4; i++) {
        switch (fieldOrder[i].pos()) {
        case 0:
          this.vendorId = in.readString();
          break;

        case 1:
          this.passenger_count = in.readInt();
          break;

        case 2:
          this.trip_distance = in.readDouble();
          break;

        case 3:
          if (in.readIndex() != 1) {
            in.readNull();
            this.pu_location_id = null;
          } else {
            this.pu_location_id = in.readLong();
          }
          break;

        default:
          throw new java.io.IOException("Corrupt ResolvingDecoder.");
        }
      }
    }
  }
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/build/generated-main-avro-java/schemaregistry/RideRecordNoneCompatible.java
================================================
/**
 * Autogenerated by Avro
 *
 * DO NOT EDIT DIRECTLY
 */
package schemaregistry;

import org.apache.avro.generic.GenericArray;
import org.apache.avro.specific.SpecificData;
import org.apache.avro.util.Utf8;
import org.apache.avro.message.BinaryMessageEncoder;
import org.apache.avro.message.BinaryMessageDecoder;
import org.apache.avro.message.SchemaStore;

@org.apache.avro.specific.AvroGenerated
public class RideRecordNoneCompatible extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
  private static final long serialVersionUID = -4618980179396772493L;


  public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"RideRecordNoneCompatible\",\"namespace\":\"schemaregistry\",\"fields\":[{\"name\":\"vendorId\",\"type\":\"int\"},{\"name\":\"passenger_count\",\"type\":\"int\"},{\"name\":\"trip_distance\",\"type\":\"double\"}]}");
  public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }

  private static final SpecificData MODEL$ = new SpecificData();

  private static final BinaryMessageEncoder<RideRecordNoneCompatible> ENCODER =
      new BinaryMessageEncoder<>(MODEL$, SCHEMA$);

  private static final BinaryMessageDecoder<RideRecordNoneCompatible> DECODER =
      new BinaryMessageDecoder<>(MODEL$, SCHEMA$);

  /**
   * Return the BinaryMessageEncoder instance used by this class.
   * @return the message encoder used by this class
   */
  public static BinaryMessageEncoder<RideRecordNoneCompatible> getEncoder() {
    return ENCODER;
  }

  /**
   * Return the BinaryMessageDecoder instance used by this class.
   * @return the message decoder used by this class
   */
  public static BinaryMessageDecoder<RideRecordNoneCompatible> getDecoder() {
    return DECODER;
  }

  /**
   * Create a new BinaryMessageDecoder instance for this class that uses the specified {@link SchemaStore}.
   * @param resolver a {@link SchemaStore} used to find schemas by fingerprint
   * @return a BinaryMessageDecoder instance for this class backed by the given SchemaStore
   */
  public static BinaryMessageDecoder<RideRecordNoneCompatible> createDecoder(SchemaStore resolver) {
    return new BinaryMessageDecoder<>(MODEL$, SCHEMA$, resolver);
  }

  /**
   * Serializes this RideRecordNoneCompatible to a ByteBuffer.
   * @return a buffer holding the serialized data for this instance
   * @throws java.io.IOException if this instance could not be serialized
   */
  public java.nio.ByteBuffer toByteBuffer() throws java.io.IOException {
    return ENCODER.encode(this);
  }

  /**
   * Deserializes a RideRecordNoneCompatible from a ByteBuffer.
   * @param b a byte buffer holding serialized data for an instance of this class
   * @return a RideRecordNoneCompatible instance decoded from the given buffer
   * @throws java.io.IOException if the given bytes could not be deserialized into an instance of this class
   */
  public static RideRecordNoneCompatible fromByteBuffer(
      java.nio.ByteBuffer b) throws java.io.IOException {
    return DECODER.decode(b);
  }

  private int vendorId;
  private int passenger_count;
  private double trip_distance;

  /**
   * Default constructor.  Note that this does not initialize fields
   * to their default values from the schema.  If that is desired then
   * one should use <code>newBuilder()</code>.
   */
  public RideRecordNoneCompatible() {}

  /**
   * All-args constructor.
   * @param vendorId The new value for vendorId
   * @param passenger_count The new value for passenger_count
   * @param trip_distance The new value for trip_distance
   */
  public RideRecordNoneCompatible(java.lang.Integer vendorId, java.lang.Integer passenger_count, java.lang.Double trip_distance) {
    this.vendorId = vendorId;
    this.passenger_count = passenger_count;
    this.trip_distance = trip_distance;
  }

  @Override
  public org.apache.avro.specific.SpecificData getSpecificData() { return MODEL$; }

  @Override
  public org.apache.avro.Schema getSchema() { return SCHEMA$; }

  // Used by DatumWriter.  Applications should not call.
  @Override
  public java.lang.Object get(int field$) {
    switch (field$) {
    case 0: return vendorId;
    case 1: return passenger_count;
    case 2: return trip_distance;
    default: throw new IndexOutOfBoundsException("Invalid index: " + field$);
    }
  }

  // Used by DatumReader.  Applications should not call.
  @Override
  @SuppressWarnings(value="unchecked")
  public void put(int field$, java.lang.Object value$) {
    switch (field$) {
    case 0: vendorId = (java.lang.Integer)value$; break;
    case 1: passenger_count = (java.lang.Integer)value$; break;
    case 2: trip_distance = (java.lang.Double)value$; break;
    default: throw new IndexOutOfBoundsException("Invalid index: " + field$);
    }
  }

  /**
   * Gets the value of the 'vendorId' field.
   * @return The value of the 'vendorId' field.
   */
  public int getVendorId() {
    return vendorId;
  }


  /**
   * Sets the value of the 'vendorId' field.
   * @param value the value to set.
   */
  public void setVendorId(int value) {
    this.vendorId = value;
  }

  /**
   * Gets the value of the 'passenger_count' field.
   * @return The value of the 'passenger_count' field.
   */
  public int getPassengerCount() {
    return passenger_count;
  }


  /**
   * Sets the value of the 'passenger_count' field.
   * @param value the value to set.
   */
  public void setPassengerCount(int value) {
    this.passenger_count = value;
  }

  /**
   * Gets the value of the 'trip_distance' field.
   * @return The value of the 'trip_distance' field.
   */
  public double getTripDistance() {
    return trip_distance;
  }


  /**
   * Sets the value of the 'trip_distance' field.
   * @param value the value to set.
   */
  public void setTripDistance(double value) {
    this.trip_distance = value;
  }

  /**
   * Creates a new RideRecordNoneCompatible RecordBuilder.
   * @return A new RideRecordNoneCompatible RecordBuilder
   */
  public static schemaregistry.RideRecordNoneCompatible.Builder newBuilder() {
    return new schemaregistry.RideRecordNoneCompatible.Builder();
  }

  /**
   * Creates a new RideRecordNoneCompatible RecordBuilder by copying an existing Builder.
   * @param other The existing builder to copy.
   * @return A new RideRecordNoneCompatible RecordBuilder
   */
  public static schemaregistry.RideRecordNoneCompatible.Builder newBuilder(schemaregistry.RideRecordNoneCompatible.Builder other) {
    if (other == null) {
      return new schemaregistry.RideRecordNoneCompatible.Builder();
    } else {
      return new schemaregistry.RideRecordNoneCompatible.Builder(other);
    }
  }

  /**
   * Creates a new RideRecordNoneCompatible RecordBuilder by copying an existing RideRecordNoneCompatible instance.
   * @param other The existing instance to copy.
   * @return A new RideRecordNoneCompatible RecordBuilder
   */
  public static schemaregistry.RideRecordNoneCompatible.Builder newBuilder(schemaregistry.RideRecordNoneCompatible other) {
    if (other == null) {
      return new schemaregistry.RideRecordNoneCompatible.Builder();
    } else {
      return new schemaregistry.RideRecordNoneCompatible.Builder(other);
    }
  }

  /**
   * RecordBuilder for RideRecordNoneCompatible instances.
   */
  @org.apache.avro.specific.AvroGenerated
  public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<RideRecordNoneCompatible>
    implements org.apache.avro.data.RecordBuilder<RideRecordNoneCompatible> {

    private int vendorId;
    private int passenger_count;
    private double trip_distance;

    /** Creates a new Builder */
    private Builder() {
      super(SCHEMA$, MODEL$);
    }

    /**
     * Creates a Builder by copying an existing Builder.
     * @param other The existing Builder to copy.
     */
    private Builder(schemaregistry.RideRecordNoneCompatible.Builder other) {
      super(other);
      if (isValidValue(fields()[0], other.vendorId)) {
        this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId);
        fieldSetFlags()[0] = other.fieldSetFlags()[0];
      }
      if (isValidValue(fields()[1], other.passenger_count)) {
        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);
        fieldSetFlags()[1] = other.fieldSetFlags()[1];
      }
      if (isValidValue(fields()[2], other.trip_distance)) {
        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);
        fieldSetFlags()[2] = other.fieldSetFlags()[2];
      }
    }

    /**
     * Creates a Builder by copying an existing RideRecordNoneCompatible instance
     * @param other The existing instance to copy.
     */
    private Builder(schemaregistry.RideRecordNoneCompatible other) {
      super(SCHEMA$, MODEL$);
      if (isValidValue(fields()[0], other.vendorId)) {
        this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId);
        fieldSetFlags()[0] = true;
      }
      if (isValidValue(fields()[1], other.passenger_count)) {
        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);
        fieldSetFlags()[1] = true;
      }
      if (isValidValue(fields()[2], other.trip_distance)) {
        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);
        fieldSetFlags()[2] = true;
      }
    }

    /**
      * Gets the value of the 'vendorId' field.
      * @return The value.
      */
    public int getVendorId() {
      return vendorId;
    }


    /**
      * Sets the value of the 'vendorId' field.
      * @param value The value of 'vendorId'.
      * @return This builder.
      */
    public schemaregistry.RideRecordNoneCompatible.Builder setVendorId(int value) {
      validate(fields()[0], value);
      this.vendorId = value;
      fieldSetFlags()[0] = true;
      return this;
    }

    /**
      * Checks whether the 'vendorId' field has been set.
      * @return True if the 'vendorId' field has been set, false otherwise.
      */
    public boolean hasVendorId() {
      return fieldSetFlags()[0];
    }


    /**
      * Clears the value of the 'vendorId' field.
      * @return This builder.
      */
    public schemaregistry.RideRecordNoneCompatible.Builder clearVendorId() {
      fieldSetFlags()[0] = false;
      return this;
    }

    /**
      * Gets the value of the 'passenger_count' field.
      * @return The value.
      */
    public int getPassengerCount() {
      return passenger_count;
    }


    /**
      * Sets the value of the 'passenger_count' field.
      * @param value The value of 'passenger_count'.
      * @return This builder.
      */
    public schemaregistry.RideRecordNoneCompatible.Builder setPassengerCount(int value) {
      validate(fields()[1], value);
      this.passenger_count = value;
      fieldSetFlags()[1] = true;
      return this;
    }

    /**
      * Checks whether the 'passenger_count' field has been set.
      * @return True if the 'passenger_count' field has been set, false otherwise.
      */
    public boolean hasPassengerCount() {
      return fieldSetFlags()[1];
    }


    /**
      * Clears the value of the 'passenger_count' field.
      * @return This builder.
      */
    public schemaregistry.RideRecordNoneCompatible.Builder clearPassengerCount() {
      fieldSetFlags()[1] = false;
      return this;
    }

    /**
      * Gets the value of the 'trip_distance' field.
      * @return The value.
      */
    public double getTripDistance() {
      return trip_distance;
    }


    /**
      * Sets the value of the 'trip_distance' field.
      * @param value The value of 'trip_distance'.
      * @return This builder.
      */
    public schemaregistry.RideRecordNoneCompatible.Builder setTripDistance(double value) {
      validate(fields()[2], value);
      this.trip_distance = value;
      fieldSetFlags()[2] = true;
      return this;
    }

    /**
      * Checks whether the 'trip_distance' field has been set.
      * @return True if the 'trip_distance' field has been set, false otherwise.
      */
    public boolean hasTripDistance() {
      return fieldSetFlags()[2];
    }


    /**
      * Clears the value of the 'trip_distance' field.
      * @return This builder.
      */
    public schemaregistry.RideRecordNoneCompatible.Builder clearTripDistance() {
      fieldSetFlags()[2] = false;
      return this;
    }

    @Override
    @SuppressWarnings("unchecked")
    public RideRecordNoneCompatible build() {
      try {
        RideRecordNoneCompatible record = new RideRecordNoneCompatible();
        record.vendorId = fieldSetFlags()[0] ? this.vendorId : (java.lang.Integer) defaultValue(fields()[0]);
        record.passenger_count = fieldSetFlags()[1] ? this.passenger_count : (java.lang.Integer) defaultValue(fields()[1]);
        record.trip_distance = fieldSetFlags()[2] ? this.trip_distance : (java.lang.Double) defaultValue(fields()[2]);
        return record;
      } catch (org.apache.avro.AvroMissingFieldException e) {
        throw e;
      } catch (java.lang.Exception e) {
        throw new org.apache.avro.AvroRuntimeException(e);
      }
    }
  }

  @SuppressWarnings("unchecked")
  private static final org.apache.avro.io.DatumWriter<RideRecordNoneCompatible>
    WRITER$ = (org.apache.avro.io.DatumWriter<RideRecordNoneCompatible>)MODEL$.createDatumWriter(SCHEMA$);

  @Override public void writeExternal(java.io.ObjectOutput out)
    throws java.io.IOException {
    WRITER$.write(this, SpecificData.getEncoder(out));
  }

  @SuppressWarnings("unchecked")
  private static final org.apache.avro.io.DatumReader<RideRecordNoneCompatible>
    READER$ = (org.apache.avro.io.DatumReader<RideRecordNoneCompatible>)MODEL$.createDatumReader(SCHEMA$);

  @Override public void readExternal(java.io.ObjectInput in)
    throws java.io.IOException {
    READER$.read(this, SpecificData.getDecoder(in));
  }

  @Override protected boolean hasCustomCoders() { return true; }

  @Override public void customEncode(org.apache.avro.io.Encoder out)
    throws java.io.IOException
  {
    out.writeInt(this.vendorId);

    out.writeInt(this.passenger_count);

    out.writeDouble(this.trip_distance);

  }

  @Override public void customDecode(org.apache.avro.io.ResolvingDecoder in)
    throws java.io.IOException
  {
    org.apache.avro.Schema.Field[] fieldOrder = in.readFieldOrderIfDiff();
    if (fieldOrder == null) {
      this.vendorId = in.readInt();

      this.passenger_count = in.readInt();

      this.trip_distance = in.readDouble();

    } else {
      for (int i = 0; i < 3; i++) {
        switch (fieldOrder[i].pos()) {
        case 0:
          this.vendorId = in.readInt();
          break;

        case 1:
          this.passenger_count = in.readInt();
          break;

        case 2:
          this.trip_distance = in.readDouble();
          break;

        default:
          throw new java.io.IOException("Corrupt ResolvingDecoder.");
        }
      }
    }
  }
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/build.gradle
================================================
plugins {
    id 'java'
    id "com.github.davidmc24.gradle.plugin.avro" version "1.5.0"
}


group 'org.example'
version '1.0-SNAPSHOT'

repositories {
    mavenCentral()
    maven {
        url "https://packages.confluent.io/maven"
    }
}

dependencies {
    implementation 'org.apache.kafka:kafka-clients:3.3.1'
    implementation 'com.opencsv:opencsv:5.7.1'
    implementation 'io.confluent:kafka-json-serializer:7.3.1'
    implementation 'org.apache.kafka:kafka-streams:3.3.1'
    implementation 'io.confluent:kafka-avro-serializer:7.3.1'
    implementation 'io.confluent:kafka-schema-registry-client:7.3.1'
    implementation 'io.confluent:kafka-streams-avro-serde:7.3.1'
    implementation "org.apache.avro:avro:1.11.0"
    testImplementation 'org.junit.jupiter:junit-jupiter-api:5.8.1'
    testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.8.1'
    testImplementation 'org.apache.kafka:kafka-streams-test-utils:3.3.1'
}

sourceSets.main.java.srcDirs = ['build/generated-main-avro-java','src/main/java']

test {
    useJUnitPlatform()
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/gradle/wrapper/gradle-wrapper.properties
================================================
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-7.5.1-bin.zip
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists


================================================
FILE: 07-streaming/theory/java/kafka_examples/gradlew
================================================
#!/bin/sh

#
# Copyright © 2015-2021 the original authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

##############################################################################
#
#   Gradle start up script for POSIX generated by Gradle.
#
#   Important for running:
#
#   (1) You need a POSIX-compliant shell to run this script. If your /bin/sh is
#       noncompliant, but you have some other compliant shell such as ksh or
#       bash, then to run this script, type that shell name before the whole
#       command line, like:
#
#           ksh Gradle
#
#       Busybox and similar reduced shells will NOT work, because this script
#       requires all of these POSIX shell features:
#         * functions;
#         * expansions «$var», «${var}», «${var:-default}», «${var+SET}»,
#           «${var#prefix}», «${var%suffix}», and «$( cmd )»;
#         * compound commands having a testable exit status, especially «case»;
#         * various built-in commands including «command», «set», and «ulimit».
#
#   Important for patching:
#
#   (2) This script targets any POSIX shell, so it avoids extensions provided
#       by Bash, Ksh, etc; in particular arrays are avoided.
#
#       The "traditional" practice of packing multiple parameters into a
#       space-separated string is a well documented source of bugs and security
#       problems, so this is (mostly) avoided, by progressively accumulating
#       options in "$@", and eventually passing that to Java.
#
#       Where the inherited environment variables (DEFAULT_JVM_OPTS, JAVA_OPTS,
#       and GRADLE_OPTS) rely on word-splitting, this is performed explicitly;
#       see the in-line comments for details.
#
#       There are tweaks for specific operating systems such as AIX, CygWin,
#       Darwin, MinGW, and NonStop.
#
#   (3) This script is generated from the Groovy template
#       https://github.com/gradle/gradle/blob/master/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
#       within the Gradle project.
#
#       You can find Gradle at https://github.com/gradle/gradle/.
#
##############################################################################

# Attempt to set APP_HOME

# Resolve links: $0 may be a link
app_path=$0

# Need this for daisy-chained symlinks.
while
    APP_HOME=${app_path%"${app_path##*/}"}  # leaves a trailing /; empty if no leading path
    [ -h "$app_path" ]
do
    ls=$( ls -ld "$app_path" )
    link=${ls#*' -> '}
    case $link in             #(
      /*)   app_path=$link ;; #(
      *)    app_path=$APP_HOME$link ;;
    esac
done

APP_HOME=$( cd "${APP_HOME:-./}" && pwd -P ) || exit

APP_NAME="Gradle"
APP_BASE_NAME=${0##*/}

# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"'

# Use the maximum available, or set MAX_FD != -1 to use that value.
MAX_FD=maximum

warn () {
    echo "$*"
} >&2

die () {
    echo
    echo "$*"
    echo
    exit 1
} >&2

# OS specific support (must be 'true' or 'false').
cygwin=false
msys=false
darwin=false
nonstop=false
case "$( uname )" in                #(
  CYGWIN* )         cygwin=true  ;; #(
  Darwin* )         darwin=true  ;; #(
  MSYS* | MINGW* )  msys=true    ;; #(
  NONSTOP* )        nonstop=true ;;
esac

CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar


# Determine the Java command to use to start the JVM.
if [ -n "$JAVA_HOME" ] ; then
    if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
        # IBM's JDK on AIX uses strange locations for the executables
        JAVACMD=$JAVA_HOME/jre/sh/java
    else
        JAVACMD=$JAVA_HOME/bin/java
    fi
    if [ ! -x "$JAVACMD" ] ; then
        die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME

Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
    fi
else
    JAVACMD=java
    command -v java >/dev/null 2>&1 || die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.

Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi

# Increase the maximum file descriptors if we can.
if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then
    case $MAX_FD in #(
      max*)
        MAX_FD=$( ulimit -H -n ) ||
            warn "Could not query maximum file descriptor limit"
    esac
    case $MAX_FD in  #(
      '' | soft) :;; #(
      *)
        ulimit -n "$MAX_FD" ||
            warn "Could not set maximum file descriptor limit to $MAX_FD"
    esac
fi

# Collect all arguments for the java command, stacking in reverse order:
#   * args from the command line
#   * the main class name
#   * -classpath
#   * -D...appname settings
#   * --module-path (only if needed)
#   * DEFAULT_JVM_OPTS, JAVA_OPTS, and GRADLE_OPTS environment variables.

# For Cygwin or MSYS, switch paths to Windows format before running java
if "$cygwin" || "$msys" ; then
    APP_HOME=$( cygpath --path --mixed "$APP_HOME" )
    CLASSPATH=$( cygpath --path --mixed "$CLASSPATH" )

    JAVACMD=$( cygpath --unix "$JAVACMD" )

    # Now convert the arguments - kludge to limit ourselves to /bin/sh
    for arg do
        if
            case $arg in                                #(
              -*)   false ;;                            # don't mess with options #(
              /?*)  t=${arg#/} t=/${t%%/*}              # looks like a POSIX filepath
                    [ -e "$t" ] ;;                      #(
              *)    false ;;
            esac
        then
            arg=$( cygpath --path --ignore --mixed "$arg" )
        fi
        # Roll the args list around exactly as many times as the number of
        # args, so each arg winds up back in the position where it started, but
        # possibly modified.
        #
        # NB: a `for` loop captures its iteration list before it begins, so
        # changing the positional parameters here affects neither the number of
        # iterations, nor the values presented in `arg`.
        shift                   # remove old arg
        set -- "$@" "$arg"      # push replacement arg
    done
fi

# Collect all arguments for the java command;
#   * $DEFAULT_JVM_OPTS, $JAVA_OPTS, and $GRADLE_OPTS can contain fragments of
#     shell script including quotes and variable substitutions, so put them in
#     double quotes to make sure that they get re-expanded; and
#   * put everything else in single quotes, so that it's not re-expanded.

set -- \
        "-Dorg.gradle.appname=$APP_BASE_NAME" \
        -classpath "$CLASSPATH" \
        org.gradle.wrapper.GradleWrapperMain \
        "$@"

# Stop when "xargs" is not available.
if ! command -v xargs >/dev/null 2>&1
then
    die "xargs is not available"
fi

# Use "xargs" to parse quoted args.
#
# With -n1 it outputs one arg per line, with the quotes and backslashes removed.
#
# In Bash we could simply go:
#
#   readarray ARGS < <( xargs -n1 <<<"$var" ) &&
#   set -- "${ARGS[@]}" "$@"
#
# but POSIX shell has neither arrays nor process substitution, so instead we
# post-process each arg (as a line of input to sed) to backslash-escape any
# character that might be a shell metacharacter, then use eval to reverse
# that process (while maintaining the separation between arguments), and wrap
# the whole thing up as a single "set" statement.
#
# This will of course break if any of these variables contains a newline or
# an unmatched quote.
#

eval "set -- $(
        printf '%s\n' "$DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS" |
        xargs -n1 |
        sed ' s~[^-[:alnum:]+,./:=@_]~\\&~g; ' |
        tr '\n' ' '
    )" '"$@"'

exec "$JAVACMD" "$@"


================================================
FILE: 07-streaming/theory/java/kafka_examples/gradlew.bat
================================================
@rem
@rem Copyright 2015 the original author or authors.
@rem
@rem Licensed under the Apache License, Version 2.0 (the "License");
@rem you may not use this file except in compliance with the License.
@rem You may obtain a copy of the License at
@rem
@rem      https://www.apache.org/licenses/LICENSE-2.0
@rem
@rem Unless required by applicable law or agreed to in writing, software
@rem distributed under the License is distributed on an "AS IS" BASIS,
@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@rem See the License for the specific language governing permissions and
@rem limitations under the License.
@rem

@if "%DEBUG%"=="" @echo off
@rem ##########################################################################
@rem
@rem  Gradle startup script for Windows
@rem
@rem ##########################################################################

@rem Set local scope for the variables with windows NT shell
if "%OS%"=="Windows_NT" setlocal

set DIRNAME=%~dp0
if "%DIRNAME%"=="" set DIRNAME=.
set APP_BASE_NAME=%~n0
set APP_HOME=%DIRNAME%

@rem Resolve any "." and ".." in APP_HOME to make it shorter.
for %%i in ("%APP_HOME%") do set APP_HOME=%%~fi

@rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
set DEFAULT_JVM_OPTS="-Xmx64m" "-Xms64m"

@rem Find java.exe
if defined JAVA_HOME goto findJavaFromJavaHome

set JAVA_EXE=java.exe
%JAVA_EXE% -version >NUL 2>&1
if %ERRORLEVEL% equ 0 goto execute

echo.
echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
echo.
echo Please set the JAVA_HOME variable in your environment to match the
echo location of your Java installation.

goto fail

:findJavaFromJavaHome
set JAVA_HOME=%JAVA_HOME:"=%
set JAVA_EXE=%JAVA_HOME%/bin/java.exe

if exist "%JAVA_EXE%" goto execute

echo.
echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME%
echo.
echo Please set the JAVA_HOME variable in your environment to match the
echo location of your Java installation.

goto fail

:execute
@rem Setup the command line

set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar


@rem Execute Gradle
"%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %*

:end
@rem End local scope for the variables with windows NT shell
if %ERRORLEVEL% equ 0 goto mainEnd

:fail
rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
rem the _cmd.exe /c_ return code!
set EXIT_CODE=%ERRORLEVEL%
if %EXIT_CODE% equ 0 set EXIT_CODE=1
if not ""=="%GRADLE_EXIT_CONSOLE%" exit %EXIT_CODE%
exit /b %EXIT_CODE%

:mainEnd
if "%OS%"=="Windows_NT" endlocal

:omega


================================================
FILE: 07-streaming/theory/java/kafka_examples/settings.gradle
================================================
pluginManagement {
    repositories {
        gradlePluginPortal()
        mavenCentral()
    }
}
rootProject.name = 'kafka_examples'

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/avro/rides.avsc
================================================
{
       "type": "record",
       "name":"RideRecord",
       "namespace": "schemaregistry",
       "fields":[
         {"name":"vendor_id","type":"string"},
         {"name":"passenger_count","type":"int"},
         {"name":"trip_distance","type":"double"}
       ]
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/avro/rides_compatible.avsc
================================================
{
  "type": "record",
  "name": "RideRecordCompatible",
  "namespace": "schemaregistry",
  "fields": [
    {"name": "vendorId", "type": "string"},
    {"name": "passenger_count", "type": "int"},
    {"name": "trip_distance", "type": "double"},
    {"name": "pu_location_id", "type": ["null", "long"], "default": null}
  ]
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/avro/rides_non_compatible.avsc
================================================
{
  "type": "record",
  "name": "RideRecordNoneCompatible",
  "namespace": "schemaregistry",
  "fields": [
    {"name": "vendorId", "type": "int"},
    {"name": "passenger_count", "type": "int"},
    {"name": "trip_distance", "type": "double"}
  ]
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/AvroProducer.java
================================================
package org.example;

import com.opencsv.CSVReader;
import com.opencsv.exceptions.CsvException;
import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.StreamsConfig;
import schemaregistry.RideRecord;

import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;

public class AvroProducer {

    private Properties props = new Properties();

    public AvroProducer() {
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());

        props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "https://psrc-kk5gg.europe-west3.gcp.confluent.cloud");
        props.put("basic.auth.credentials.source", "USER_INFO");
        props.put("basic.auth.user.info", Secrets.SCHEMA_REGISTRY_KEY+":"+Secrets.SCHEMA_REGISTRY_SECRET);
    }

    public List<RideRecord> getRides() throws IOException, CsvException {
        var ridesStream = this.getClass().getResource("/rides.csv");
        var reader = new CSVReader(new FileReader(ridesStream.getFile()));
        reader.skip(1);

        return reader.readAll().stream().map(row ->
            RideRecord.newBuilder()
                    .setVendorId(row[0])
                    .setTripDistance(Double.parseDouble(row[4]))
                    .setPassengerCount(Integer.parseInt(row[3]))
                    .build()
                ).collect(Collectors.toList());
    }

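    public void publishRides(List<RideRecord> rides) throws ExecutionException, InterruptedException {
        // try-with-resources closes the producer (flushing pending sends) even if a send fails.
        try (KafkaProducer<String, RideRecord> kafkaProducer = new KafkaProducer<>(props)) {
            for (RideRecord ride : rides) {
                // send() is asynchronous and returns a Future for the record metadata.
                var metadataFuture = kafkaProducer.send(new ProducerRecord<>("rides_avro", String.valueOf(ride.getVendorId()), ride), (metadata, exception) -> {
                    if (exception != null) {
                        System.out.println(exception.getMessage());
                    }
                });
                // Block until the broker acknowledges the record, then print its offset.
                System.out.println(metadataFuture.get().offset());
                Thread.sleep(500);
            }
        }
    }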

    public static void main(String[] args) throws IOException, CsvException, ExecutionException, InterruptedException {
        var producer = new AvroProducer();
        var rideRecords = producer.getRides();
        producer.publishRides(rideRecords);
    }
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonConsumer.java
================================================
package org.example;

import io.confluent.kafka.serializers.KafkaJsonDeserializerConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.example.data.Ride;

import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.Properties;

public class JsonConsumer {

    private Properties props = new Properties();
    private KafkaConsumer<String, Ride> consumer;
    public JsonConsumer() {
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaJsonDeserializer");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "kafka_tutorial_example.jsonconsumer.v2");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(KafkaJsonDeserializerConfig.JSON_VALUE_TYPE, Ride.class);
        consumer = new KafkaConsumer<String, Ride>(props);
        consumer.subscribe(List.of("rides"));

    }

    public void consumeFromKafka() {
        System.out.println("Consuming from Kafka started");
        try {
            var results = consumer.poll(Duration.of(1, ChronoUnit.SECONDS));
            var i = 0;
            // Keep polling while records arrive, and for at least 10 polls in total.
            do {
                for (ConsumerRecord<String, Ride> result : results) {
                    System.out.println(result.value().DOLocationID);
                }
                results = consumer.poll(Duration.of(1, ChronoUnit.SECONDS));
                System.out.println("RESULTS:::" + results.count());
                i++;
            } while (!results.isEmpty() || i < 10);
        } finally {
            // Leave the consumer group cleanly and release network resources.
            consumer.close();
        }
    }

    public static void main(String[] args) {
        JsonConsumer jsonConsumer = new JsonConsumer();
        jsonConsumer.consumeFromKafka();
    }
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonKStream.java
================================================
package org.example;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.example.customserdes.CustomSerdes;
import org.example.data.Ride;

import java.util.Properties;

public class JsonKStream {
    private Properties props = new Properties();

    public JsonKStream() {
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka_tutorial.kstream.count.plocation.v1");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);

    }

    public Topology createTopology() {
        StreamsBuilder streamsBuilder = new StreamsBuilder();
        var ridesStream = streamsBuilder.stream("rides", Consumed.with(Serdes.String(), CustomSerdes.getSerde(Ride.class)));
        var puLocationCount = ridesStream.groupByKey().count().toStream();
        puLocationCount.to("rides-pulocation-count", Produced.with(Serdes.String(), Serdes.Long()));
        return streamsBuilder.build();
    }

    public void countPLocation() throws InterruptedException {
        var topology = createTopology();
        var kStreams = new KafkaStreams(topology, props);
        kStreams.start();
        while (kStreams.state() != KafkaStreams.State.RUNNING) {
            System.out.println(kStreams.state());
            Thread.sleep(1000);
        }
        System.out.println(kStreams.state());
        Runtime.getRuntime().addShutdownHook(new Thread(kStreams::close));
    }

    public static void main(String[] args) throws InterruptedException {
        var object = new JsonKStream();
        object.countPLocation();
    }
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonKStreamJoins.java
================================================
package org.example;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler;
import org.apache.kafka.streams.kstream.*;
import org.example.customserdes.CustomSerdes;
import org.example.data.PickupLocation;
import org.example.data.Ride;
import org.example.data.VendorInfo;

import java.time.Duration;
import java.util.Optional;
import java.util.Properties;
public class JsonKStreamJoins {
    private Properties props = new Properties();

    public JsonKStreamJoins() {
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka_tutorial.kstream.joined.rides.pickuplocation.v1");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
    }

    public Topology createTopology() {
        StreamsBuilder streamsBuilder = new StreamsBuilder();
        // Rides arrive keyed by drop-off location ID (see JsonProducer).
        KStream<String, Ride> rides = streamsBuilder.stream(Topics.INPUT_RIDE_TOPIC, Consumed.with(Serdes.String(), CustomSerdes.getSerde(Ride.class)));
        KStream<String, PickupLocation> pickupLocations = streamsBuilder.stream(Topics.INPUT_RIDE_LOCATION_TOPIC, Consumed.with(Serdes.String(), CustomSerdes.getSerde(PickupLocation.class)));

        // Re-key pickup locations by PULocationID so both streams share the same join key.
        var pickupLocationsKeyedOnPUId = pickupLocations.selectKey((key, value) -> String.valueOf(value.PULocationID));

        // Windowed stream-stream join: pair a drop-off with a pickup at the same location,
        // keeping only pairs whose timestamps are at most 10 minutes apart.
        var joined = rides.join(pickupLocationsKeyedOnPUId, (ValueJoiner<Ride, PickupLocation, Optional<VendorInfo>>) (ride, pickupLocation) -> {
                    var period = Duration.between(ride.tpep_dropoff_datetime, pickupLocation.tpep_pickup_datetime);
                    if (period.abs().toMinutes() > 10) return Optional.empty();
                    else return Optional.of(new VendorInfo(ride.VendorID, pickupLocation.PULocationID, pickupLocation.tpep_pickup_datetime, ride.tpep_dropoff_datetime));
                }, JoinWindows.ofTimeDifferenceAndGrace(Duration.ofMinutes(20), Duration.ofMinutes(5)),
                StreamJoined.with(Serdes.String(), CustomSerdes.getSerde(Ride.class), CustomSerdes.getSerde(PickupLocation.class)));

        // Drop the rejected pairs and unwrap the remaining Optional values before publishing.
        joined.filter(((key, value) -> value.isPresent())).mapValues(Optional::get)
                .to(Topics.OUTPUT_TOPIC, Produced.with(Serdes.String(), CustomSerdes.getSerde(VendorInfo.class)));

        return streamsBuilder.build();
    }

    public void joinRidesPickupLocation() throws InterruptedException {
        var topology = createTopology();
        var kStreams = new KafkaStreams(topology, props);

        kStreams.setUncaughtExceptionHandler(exception -> {
            System.out.println(exception.getMessage());
            return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.SHUTDOWN_APPLICATION;
        });
        kStreams.start();
        while (kStreams.state() != KafkaStreams.State.RUNNING) {
            System.out.println(kStreams.state());
            Thread.sleep(1000);
        }
        System.out.println(kStreams.state());
        Runtime.getRuntime().addShutdownHook(new Thread(kStreams::close));

    }

    public static void main(String[] args) throws InterruptedException {
        var object = new JsonKStreamJoins();
        object.joinRidesPickupLocation();
    }
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonKStreamWindow.java
================================================
package org.example;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.WindowedSerdes;
import org.example.customserdes.CustomSerdes;
import org.example.data.Ride;

import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.util.Properties;

public class JsonKStreamWindow {
    private Properties props = new Properties();

    public JsonKStreamWindow() {
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
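        // Use an application ID distinct from JsonKStream so the two topologies do not share a consumer group or state.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka_tutorial.kstream.count.plocation.window.v1");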
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);

    }

    public Topology createTopology() {
        StreamsBuilder streamsBuilder = new StreamsBuilder();
        var ridesStream = streamsBuilder.stream("rides", Consumed.with(Serdes.String(), CustomSerdes.getSerde(Ride.class)));
        var puLocationCount = ridesStream.groupByKey()
                .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofSeconds(10), Duration.ofSeconds(5)))
                .count().toStream();
        var windowSerde = WindowedSerdes.timeWindowedSerdeFrom(String.class, 10*1000);

        puLocationCount.to("rides-pulocation-window-count", Produced.with(windowSerde, Serdes.Long()));
        return streamsBuilder.build();
    }

    public void countPLocationWindowed() {
        var topology = createTopology();
        var kStreams = new KafkaStreams(topology, props);
        kStreams.start();

        Runtime.getRuntime().addShutdownHook(new Thread(kStreams::close));
    }

    public static void main(String[] args) {
        var object = new JsonKStreamWindow();
        object.countPLocationWindowed();
    }
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonProducer.java
================================================
package org.example;

import com.opencsv.CSVReader;
import com.opencsv.exceptions.CsvException;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.streams.StreamsConfig;
import org.example.data.Ride;

import java.io.FileReader;
import java.io.IOException;
import java.time.LocalDateTime;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;

public class JsonProducer {
    private Properties props = new Properties();
    public JsonProducer() {
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaJsonSerializer");
    }

    public List<Ride> getRides() throws IOException, CsvException {
        var ridesStream = this.getClass().getResource("/rides.csv");
        var reader = new CSVReader(new FileReader(ridesStream.getFile()));
        reader.skip(1);
        return reader.readAll().stream().map(arr -> new Ride(arr))
                .collect(Collectors.toList());

    }

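    public void publishRides(List<Ride> rides) throws ExecutionException, InterruptedException {
        // try-with-resources closes the producer (flushing pending sends) even if a send fails.
        try (KafkaProducer<String, Ride> kafkaProducer = new KafkaProducer<>(props)) {
            for (Ride ride : rides) {
                // Overwrite the CSV timestamps so every ride looks like it just finished.
                ride.tpep_pickup_datetime = LocalDateTime.now().minusMinutes(20);
                ride.tpep_dropoff_datetime = LocalDateTime.now();
                // send() is asynchronous and returns a Future for the record metadata.
                var metadataFuture = kafkaProducer.send(new ProducerRecord<>("rides", String.valueOf(ride.DOLocationID), ride), (metadata, exception) -> {
                    if (exception != null) {
                        System.out.println(exception.getMessage());
                    }
                });
                // Block until the broker acknowledges the record, then print its offset.
                System.out.println(metadataFuture.get().offset());
                System.out.println(ride.DOLocationID);
                Thread.sleep(500);
            }
        }
    }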

    public static void main(String[] args) throws IOException, CsvException, ExecutionException, InterruptedException {
        var producer = new JsonProducer();
        var rides = producer.getRides();
        producer.publishRides(rides);
    }
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonProducerPickupLocation.java
================================================
package org.example;

import com.opencsv.exceptions.CsvException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.example.data.PickupLocation;

import java.io.IOException;
import java.time.LocalDateTime;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class JsonProducerPickupLocation {
    private Properties props = new Properties();

    public JsonProducerPickupLocation() {
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaJsonSerializer");
    }

    public void publish(PickupLocation pickupLocation) throws ExecutionException, InterruptedException {
        // try-with-resources closes the producer once the single record has been acknowledged.
        try (KafkaProducer<String, PickupLocation> kafkaProducer = new KafkaProducer<>(props)) {
            var metadataFuture = kafkaProducer.send(new ProducerRecord<>("rides_location", String.valueOf(pickupLocation.PULocationID), pickupLocation), (metadata, exception) -> {
                if (exception != null) {
                    System.out.println(exception.getMessage());
                }
            });
            System.out.println(metadataFuture.get().offset());
        }
    }


    public static void main(String[] args) throws IOException, CsvException, ExecutionException, InterruptedException {
        var producer = new JsonProducerPickupLocation();
        producer.publish(new PickupLocation(186, LocalDateTime.now()));
    }
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/Secrets.java
================================================
package org.example;

public class Secrets {
    public static final String KAFKA_CLUSTER_KEY = "REPLACE_WITH_YOUR_KAFKA_CLUSTER_KEY";
    public static final String KAFKA_CLUSTER_SECRET = "REPLACE_WITH_YOUR_KAFKA_CLUSTER_SECRET";

    public static final String SCHEMA_REGISTRY_KEY = "REPLACE_WITH_SCHEMA_REGISTRY_KEY";
    public static final String SCHEMA_REGISTRY_SECRET = "REPLACE_WITH_SCHEMA_REGISTRY_SECRET";

}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/Topics.java
================================================
package org.example;

public class Topics {
    public static final String INPUT_RIDE_TOPIC = "rides";
    public static final String INPUT_RIDE_LOCATION_TOPIC = "rides_location";
    public static final String OUTPUT_TOPIC = "vendor_info";
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/customserdes/CustomSerdes.java
================================================
package org.example.customserdes;

import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaJsonDeserializer;
import io.confluent.kafka.serializers.KafkaJsonSerializer;
import io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde;
import org.apache.avro.specific.SpecificRecordBase;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.Serializer;
import org.example.data.PickupLocation;
import org.example.data.Ride;
import org.example.data.VendorInfo;

import java.util.HashMap;
import java.util.Map;

public class CustomSerdes {

    public static <T> Serde<T> getSerde(Class<T> classOf) {
        Map<String, Object> serdeProps = new HashMap<>();
        serdeProps.put("json.value.type", classOf);
        final Serializer<T> mySerializer = new KafkaJsonSerializer<>();
        mySerializer.configure(serdeProps, false);

        final Deserializer<T> myDeserializer = new KafkaJsonDeserializer<>();
        myDeserializer.configure(serdeProps, false);
        return Serdes.serdeFrom(mySerializer, myDeserializer);
    }

    public static <T extends SpecificRecordBase> SpecificAvroSerde<T> getAvroSerde(boolean isKey, String schemaRegistryUrl) {
        var serde = new SpecificAvroSerde<T>();

        Map<String, Object> serdeProps = new HashMap<>();
        serdeProps.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
        serde.configure(serdeProps, isKey);
        return serde;
    }


}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/data/PickupLocation.java
================================================
package org.example.data;

import java.time.LocalDateTime;

public class PickupLocation {
    public PickupLocation(long PULocationID, LocalDateTime tpep_pickup_datetime) {
        this.PULocationID = PULocationID;
        this.tpep_pickup_datetime = tpep_pickup_datetime;
    }

    public PickupLocation() {
    }

    public long PULocationID;
    public LocalDateTime tpep_pickup_datetime;
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/data/Ride.java
================================================
package org.example.data;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class Ride {
    public Ride(String[] arr) {
        VendorID = arr[0];
        tpep_pickup_datetime = LocalDateTime.parse(arr[1], DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
        tpep_dropoff_datetime = LocalDateTime.parse(arr[2], DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
        passenger_count = Integer.parseInt(arr[3]);
        trip_distance = Double.parseDouble(arr[4]);
        RatecodeID = Long.parseLong(arr[5]);
        store_and_fwd_flag = arr[6];
        PULocationID = Long.parseLong(arr[7]);
        DOLocationID = Long.parseLong(arr[8]);
        payment_type = arr[9];
        fare_amount = Double.parseDouble(arr[10]);
        extra = Double.parseDouble(arr[11]);
        mta_tax = Double.parseDouble(arr[12]);
        tip_amount = Double.parseDouble(arr[13]);
        tolls_amount = Double.parseDouble(arr[14]);
        improvement_surcharge = Double.parseDouble(arr[15]);
        total_amount = Double.parseDouble(arr[16]);
        congestion_surcharge = Double.parseDouble(arr[17]);
    }
    public Ride(){}
    public String VendorID;
    public LocalDateTime tpep_pickup_datetime;
    public LocalDateTime tpep_dropoff_datetime;
    public int passenger_count;
    public double trip_distance;
    public long RatecodeID;
    public String store_and_fwd_flag;
    public long PULocationID;
    public long DOLocationID;
    public String payment_type;
    public double fare_amount;
    public double extra;
    public double mta_tax;
    public double tip_amount;
    public double tolls_amount;
    public double improvement_surcharge;
    public double total_amount;
    public double congestion_surcharge;

}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/data/VendorInfo.java
================================================
package org.example.data;

import java.time.LocalDateTime;

public class VendorInfo {

    public VendorInfo(String vendorID, long PULocationID, LocalDateTime pickupTime, LocalDateTime lastDropoffTime) {
        VendorID = vendorID;
        this.PULocationID = PULocationID;
        this.pickupTime = pickupTime;
        this.lastDropoffTime = lastDropoffTime;
    }

    public VendorInfo() {
    }

    public String VendorID;
    public long PULocationID;
    public LocalDateTime pickupTime;
    public LocalDateTime lastDropoffTime;
}


================================================
FILE: 07-streaming/theory/java/kafka_examples/src/test/java/org/example/JsonKStreamJoinsTest.java
================================================
package org.example;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.internals.Topic;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.example.customserdes.CustomSerdes;
import org.example.data.PickupLocation;
import org.example.data.Ride;
import org.example.data.VendorInfo;
import org.example.helper.DataGeneratorHelper;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

import java.util.Properties;

import static org.junit.jupiter.api.Assertions.*;

class JsonKStreamJoinsTest {
    private Properties props = new Properties();
    private static TopologyTestDriver testDriver;
    private TestInputTopic<String, Ride> ridesTopic;
    private TestInputTopic<String, PickupLocation> pickLocationTopic;
    private TestOutputTopic<String, VendorInfo> outputTopic;

    private Topology topology = new JsonKStreamJoins().createTopology();
    @BeforeEach
    public void setup() {
        props = new Properties();
        props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "testing_count_application");
        props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");
        if (testDriver != null) {
            testDriver.close();
        }
        testDriver = new TopologyTestDriver(topology, props);
        ridesTopic = testDriver.createInputTopic(Topics.INPUT_RIDE_TOPIC, Serdes.String().serializer(), CustomSerdes.getSerde(Ride.class).serializer());
        pickLocationTopic = testDriver.createInputTopic(Topics.INPUT_RIDE_LOCATION_TOPIC, Serdes.String().serializer(), CustomSerdes.getSerde(PickupLocation.class).serializer());
        outputTopic = testDriver.createOutputTopic(Topics.OUTPUT_TOPIC, Serdes.String().deserializer(), CustomSerdes.getSerde(VendorInfo.class).deserializer());
    }

    @Test
    public void testIfJoinWorksOnSameDropOffPickupLocationId() {
        Ride ride = DataGeneratorHelper.generateRide();
        PickupLocation pickupLocation = DataGeneratorHelper.generatePickUpLocation(ride.DOLocationID);
        ridesTopic.pipeInput(String.valueOf(ride.DOLocationID), ride);
        pickLocationTopic.pipeInput(String.valueOf(pickupLocation.PULocationID), pickupLocation);

        // JUnit's assertEquals takes (expected, actual); keeping that order makes failure messages read correctly.
        assertEquals(1, outputTopic.getQueueSize());
        var expected = new VendorInfo(ride.VendorID, pickupLocation.PULocationID, pickupLocation.tpep_pickup_datetime, ride.tpep_dropoff_datetime);
        var result = outputTopic.readKeyValue();
        assertEquals(String.valueOf(ride.DOLocationID), result.key);
        assertEquals(expected.VendorID, result.value.VendorID);
        assertEquals(expected.pickupTime, result.value.pickupTime);
    }


    @AfterAll
    public static void shutdown() {
        testDriver.close();
    }
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/test/java/org/example/JsonKStreamTest.java
================================================
package org.example;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.example.customserdes.CustomSerdes;
import org.example.data.Ride;
import org.example.helper.DataGeneratorHelper;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;
import java.util.Properties;

class JsonKStreamTest {
    private Properties props;
    private static TopologyTestDriver testDriver;
    private TestInputTopic<String, Ride> inputTopic;
    private TestOutputTopic<String, Long> outputTopic;
    private Topology topology = new JsonKStream().createTopology();

    @BeforeEach
    public void setup() {
        props = new Properties();
        props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "testing_count_application");
        props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");
        if (testDriver != null) {
            testDriver.close();
        }
        testDriver = new TopologyTestDriver(topology, props);
        inputTopic = testDriver.createInputTopic("rides", Serdes.String().serializer(), CustomSerdes.getSerde(Ride.class).serializer());
        outputTopic = testDriver.createOutputTopic("rides-pulocation-count", Serdes.String().deserializer(), Serdes.Long().deserializer());
    }

    @Test
    public void testIfOneMessageIsPassedToInputTopicWeGetCountOfOne() {
        Ride ride = DataGeneratorHelper.generateRide();
        inputTopic.pipeInput(String.valueOf(ride.DOLocationID), ride);

        assertEquals(KeyValue.pair(String.valueOf(ride.DOLocationID), 1L), outputTopic.readKeyValue());
        assertTrue(outputTopic.isEmpty());
    }

    @Test
    public void testIfTwoMessageArePassedWithDifferentKey() {
        Ride ride1 = DataGeneratorHelper.generateRide();
        ride1.DOLocationID = 100L;
        inputTopic.pipeInput(String.valueOf(ride1.DOLocationID), ride1);

        Ride ride2 = DataGeneratorHelper.generateRide();
        ride2.DOLocationID = 200L;
        inputTopic.pipeInput(String.valueOf(ride2.DOLocationID), ride2);

        assertEquals(KeyValue.pair(String.valueOf(ride1.DOLocationID), 1L), outputTopic.readKeyValue());
        assertEquals(KeyValue.pair(String.valueOf(ride2.DOLocationID), 1L), outputTopic.readKeyValue());
        assertTrue(outputTopic.isEmpty());
    }

    @Test
    public void testIfTwoMessageArePassedWithSameKey() {
        Ride ride1 = DataGeneratorHelper.generateRide();
        ride1.DOLocationID = 100L;
        inputTopic.pipeInput(String.valueOf(ride1.DOLocationID), ride1);

        Ride ride2 = DataGeneratorHelper.generateRide();
        ride2.DOLocationID = 100L;
        inputTopic.pipeInput(String.valueOf(ride2.DOLocationID), ride2);

        assertEquals(KeyValue.pair("100", 1L), outputTopic.readKeyValue());
        assertEquals(KeyValue.pair("100", 2L), outputTopic.readKeyValue());
        assertTrue(outputTopic.isEmpty());
    }


    @AfterAll
    public static void tearDown() {
        testDriver.close();
    }


}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/test/java/org/example/helper/DataGeneratorHelper.java
================================================
package org.example.helper;

import org.example.data.PickupLocation;
import org.example.data.Ride;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DataGeneratorHelper {
    public static Ride generateRide() {
        var arrivalTime = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
        var departureTime = LocalDateTime.now().minusMinutes(30).format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
        return new Ride(new String[]{"1", departureTime, arrivalTime,"1","1.50","1","N","238","75","2","8","0.5","0.5","0","0","0.3","9.3","0"});
    }

    public static PickupLocation generatePickUpLocation(long pickupLocationId) {
        return new PickupLocation(pickupLocationId, LocalDateTime.now());
    }
}


================================================
FILE: 07-streaming/workshop/.python-version
================================================
3.13


================================================
FILE: 07-streaming/workshop/Dockerfile.flink
================================================
FROM flink:2.2.0-scala_2.12-java17

COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

# ref: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker

WORKDIR /opt/pyflink
COPY pyproject.flink.toml pyproject.toml
RUN uv python install 3.12 && uv sync
ENV PATH="/opt/pyflink/.venv/bin:$PATH"

# Download connector libraries

WORKDIR /opt/flink/lib
RUN wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/2.2.0/flink-json-2.2.0.jar \
    && wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/4.0.1-2.0/flink-sql-connector-kafka-4.0.1-2.0.jar \
    && wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-core/4.0.0-2.0/flink-connector-jdbc-core-4.0.0-2.0.jar \
    && wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-postgres/4.0.0-2.0/flink-connector-jdbc-postgres-4.0.0-2.0.jar \
    && wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.10/postgresql-42.7.10.jar

COPY flink-config.yaml /opt/flink/conf/config.yaml

WORKDIR /opt/flink


================================================
FILE: 07-streaming/workshop/Dockerfile_ARM64.flink
================================================
FROM flink:2.2.0-scala_2.12-java17

COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

USER root

# Install a full JDK (not just a runtime) plus native build tools for pemja
RUN apt-get update && apt-get install -y --no-install-recommends \
    openjdk-17-jdk-headless \
    build-essential \
    python3-dev \
    wget \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Point JAVA_HOME at the full JDK and make /opt/java/openjdk match what pemja expects
RUN JDK_DIR="$(dirname "$(dirname "$(readlink -f "$(command -v javac)")")")" \
    && rm -rf /opt/java/openjdk \
    && ln -s "${JDK_DIR}" /opt/java/openjdk \
    && test -d /opt/java/openjdk/include

ENV JAVA_HOME=/opt/java/openjdk
ENV PATH="${JAVA_HOME}/bin:${PATH}"

# ref: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker

WORKDIR /opt/pyflink
COPY pyproject.flink.toml pyproject.toml
RUN uv python install 3.12 && uv sync
ENV PATH="/opt/pyflink/.venv/bin:$PATH"

# Download connector libraries
# flink-json-2.2.0.jar is already bundled in the base image -- do NOT re-download it.
WORKDIR /opt/flink/lib
RUN wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/4.0.1-2.0/flink-sql-connector-kafka-4.0.1-2.0.jar \
    && wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-core/4.0.0-2.0/flink-connector-jdbc-core-4.0.0-2.0.jar \
    && wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-postgres/4.0.0-2.0/flink-connector-jdbc-postgres-4.0.0-2.0.jar \
    && wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.10/postgresql-42.7.10.jar

COPY flink-config.yaml /opt/flink/conf/config.yaml

WORKDIR /opt/flink

================================================
FILE: 07-streaming/workshop/Makefile
================================================
.PHONY: build up down job aggregation_job stop start

build:
	docker compose build

up:
	docker compose up --build --remove-orphans -d

down:
	docker compose down --remove-orphans

job:
	docker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d

aggregation_job:
	docker compose exec jobmanager ./bin/flink run -py /opt/src/job/aggregation_job.py --pyFiles /opt/src -d

stop:
	docker compose stop

start:
	docker compose start


================================================
FILE: 07-streaming/workshop/README.md
================================================
# PyFlink: Stream Processing Workshop

Video: https://www.youtube.com/watch?v=YDUgFeHQzJU

This workshop is based on the
[2025 stream with Zach Wilson](https://www.youtube.com/watch?v=P2loELMUUeI).

In this workshop, we build a real-time streaming pipeline step by step.
We start with the basics - a message broker, a producer, and a consumer -
then add a database and finally a stream processing framework.

We'll use NYC yellow taxi trip data as our data source.

What we'll build by the end:

```
Producer (Python) -> Kafka (Redpanda) -> Flink -> PostgreSQL
```

Prerequisites:

- Docker and Docker Compose
- [uv](https://docs.astral.sh/uv/)
- A SQL client - [pgcli](https://www.pgcli.com/) (`uvx pgcli`), DBeaver, pgAdmin, or DataGrip

Code:

- [Reference code](./) in this directory (`07-streaming/workshop/`)
- [Code created during the workshop](live/) by Alexey

The README walks through building everything from scratch - you can follow
along step by step or study the existing files and run the commands.


## Redpanda - a Kafka-compatible broker

Before we can produce or consume messages, we need a message broker -
a service that receives messages from producers, stores them, and delivers
them to consumers.

We use [Redpanda](https://redpanda.com/), a drop-in replacement for
Apache Kafka. Redpanda implements the same protocol, so any Kafka client
library works with it unchanged. The `kafka-python` library we'll use
doesn't know or care that Redpanda is running instead of Kafka.

Why Redpanda instead of Kafka?

- No JVM - Kafka runs on Java and needs significant memory for the JVM.
  Redpanda is written in C++ and starts in seconds with far less overhead.
- No ZooKeeper - Kafka traditionally required a separate ZooKeeper cluster
  for coordination (metadata, leader election). Redpanda handles this
  internally using the Raft consensus protocol - one less service to run.
- Single binary - just one container, nothing else to configure.

For this workshop, every time we say "Kafka" we mean the Kafka protocol
and concepts. Redpanda is the actual broker running underneath.

Create `docker-compose.yml` with the Redpanda service:

```yaml
services:
  redpanda:
    image: redpandadata/redpanda:v25.3.9
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda:33145
    ports:
      - 8082:8082
      - 9092:9092
      - 28082:28082
      - 29092:29092
```

The command has many parameters. Let's go through them.

Resource parameters:

| Parameter | What it does |
|---|---|
| `--smp 1` | Use 1 CPU core. Redpanda is built on [Seastar](http://seastar.io/), a framework that pins threads to cores for high performance. For development, 1 core is enough. |
| `--reserve-memory 0M` | Don't set aside any memory for the operating system. By default, Seastar leaves part of the machine's memory to the OS and other processes; in a development container we skip that. |
| `--overprovisioned` | Don't pin threads to specific CPU cores. On a shared development machine, this avoids contention with other processes. |
| `--node-id 1` | Unique identifier for this broker in the cluster. With a single broker it doesn't matter, but the parameter is required. |

Networking parameters:

Redpanda exposes two separate listeners for the Kafka protocol - one for
connections from inside Docker (other containers) and one for connections
from outside Docker (your laptop):

| Parameter | Internal (Docker) | External (your laptop) |
|---|---|---|
| `--kafka-addr` | `PLAINTEXT://0.0.0.0:29092` | `OUTSIDE://0.0.0.0:9092` |
| `--advertise-kafka-addr` | `PLAINTEXT://redpanda:29092` | `OUTSIDE://localhost:9092` |

Why two addresses? Kafka clients use a two-step connection process:

1. The client connects to a bootstrap server and asks for cluster metadata
2. The broker responds with advertised addresses - where the client should
   connect for actual data transfer

Inside Docker, containers find each other by service name, so the internal
advertised address is `redpanda:29092`. From your laptop, you connect via
the published port at `localhost:9092`. If we used only one address, either
Docker containers or your laptop wouldn't be able to connect.
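
The two-step connection can be sketched as a toy lookup. This is not a real
Kafka client; `LISTENERS` and `advertised_address` are hypothetical names
that only mirror the listener settings from the `docker-compose.yml` above:

```python
# Toy model of the two-step Kafka connection; not a real client.
# Ports and addresses mirror --kafka-addr / --advertise-kafka-addr above.
LISTENERS = {
    # bind port -> (listener name, advertised address)
    29092: ("PLAINTEXT", "redpanda:29092"),
    9092: ("OUTSIDE", "localhost:9092"),
}


def advertised_address(bootstrap: str) -> str:
    """Return the address the broker hands back for a bootstrap server."""
    port = int(bootstrap.rsplit(":", 1)[1])
    _listener, advertised = LISTENERS[port]
    return advertised


# From the laptop: bootstrap via the published port, then stay on localhost.
print(advertised_address("localhost:9092"))
# From another container: bootstrap via the service name, then stay on redpanda.
print(advertised_address("redpanda:29092"))
```

Each listener hands back an address that is reachable from where the client
sits, which is why both entries are needed.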

The `--pandaproxy-addr` / `--advertise-pandaproxy-addr` follow the same
pattern for Redpanda's HTTP REST API (not used in this workshop).
The `--rpc-addr` / `--advertise-rpc-addr` are for internal cluster
communication between Redpanda nodes (not relevant with a single node).

Published ports:

| Port | What it's for |
|---|---|
| `9092` | Kafka protocol (external) - your Python producer/consumer connects here |
| `29092` | Kafka protocol (internal) - Flink containers will connect here later |
| `8082` / `28082` | HTTP Proxy - REST API access (not used in this workshop) |

Start Redpanda:

```bash
docker compose up redpanda -d
```

Verify it's running:

```bash
docker compose ps
```

```
NAME                IMAGE                           SERVICE    STATUS
workshop-redpanda   redpandadata/redpanda:v25.3.9   redpanda   Up
```


## Produce messages to Kafka

Initialize a Python project and add the dependencies we need:

```bash
uv init -p 3.12
uv add kafka-python pandas pyarrow
```

> If you cloned the repository, `pyproject.toml` already exists.
> Run `uv sync` instead.

We'll send NYC yellow taxi trip data to Kafka. You can run the code below
either as a Python script or in a Jupyter notebook (`uv add jupyter`,
then `uv run jupyter lab`).

First, download the data. We read a parquet file of yellow taxi trips and
take the first 1000 rows:

```python
import pandas as pd

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet"
columns = ['PULocationID', 'DOLocationID', 'trip_distance', 'total_amount', 'tpep_pickup_datetime']
df = pd.read_parquet(url, columns=columns).head(1000)
df.head()
```

We only read 5 columns to keep things focused. The full dataset has many
more (fare breakdown, rate codes, payment type, etc.).

Define a dataclass for our message. This gives us a clear schema for each
taxi trip:

```python
from dataclasses import dataclass

@dataclass
class Ride:
    PULocationID: int
    DOLocationID: int
    trip_distance: float
    total_amount: float
    tpep_pickup_datetime: int  # epoch milliseconds
```

Write a function to convert a DataFrame row into a `Ride`. We convert the
pandas Timestamp to epoch milliseconds - that's the format Flink expects
later:

```python
def ride_from_row(row):
    return Ride(
        PULocationID=int(row['PULocationID']),
        DOLocationID=int(row['DOLocationID']),
        trip_distance=float(row['trip_distance']),
        total_amount=float(row['total_amount']),
        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),
    )
```

Test it:

```python
ride = ride_from_row(df.iloc[0])
ride
# Ride(PULocationID=186, DOLocationID=79, trip_distance=1.72,
#      total_amount=17.31, tpep_pickup_datetime=1730429702000)
```
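
The epoch-millisecond round trip can be sketched with the standard library
alone. This is an illustration, not part of the workshop code:
`to_epoch_ms` and `from_epoch_ms` are hypothetical helpers, and the sketch
assumes the naive pickup times should be read as UTC. Passing an explicit
timezone on the way back keeps the result independent of the machine's
local timezone:

```python
from datetime import datetime, timezone


def to_epoch_ms(dt: datetime) -> int:
    """Interpret a naive datetime as UTC and return epoch milliseconds."""
    return int(dt.replace(tzinfo=timezone.utc).timestamp() * 1000)


def from_epoch_ms(ms: int) -> datetime:
    """Convert epoch milliseconds back to a naive UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).replace(tzinfo=None)


pickup = datetime(2025, 11, 1, 0, 13, 25)  # made-up pickup time
ms = to_epoch_ms(pickup)
restored = from_epoch_ms(ms)
print(ms, restored)
```

If the conversion back used `datetime.fromtimestamp(ms / 1000)` without a
timezone, the printed pickup time would shift by the local UTC offset.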

Next, connect to Kafka. The `bootstrap_servers` is where the broker accepts
connections - `localhost:9092` because we're running this from our laptop
(outside Docker). In production with multiple brokers, you'd list several
for redundancy - if one is down, the client connects through another.

Kafka works with raw bytes, so we need a serializer that converts Python
dicts to JSON:

```python
import json
from kafka import KafkaProducer

def json_serializer(data):
    return json.dumps(data).encode('utf-8')

server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=json_serializer
)
```

Let's send a single ride to try it out. `dataclasses.asdict(ride)` converts
the dataclass to a plain dict, which the serializer turns into JSON bytes.
The broker auto-creates the `rides` topic on first use:

```python
import dataclasses

topic_name = 'rides'

producer.send(topic_name, value=dataclasses.asdict(ride))
producer.flush()
```

This works, but calling `dataclasses.asdict()` every time is tedious. We
can make a serializer that handles dataclasses directly:

```python
def ride_serializer(ride):
    ride_dict = dataclasses.asdict(ride)
    json_str = json.dumps(ride_dict)
    return json_str.encode('utf-8')
```

Now recreate the producer with the new serializer - we can pass `Ride`
objects directly without converting them to dicts first:

```python
producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=ride_serializer
)
```

Send one ride to verify:

```python
producer.send(topic_name, value=ride)
producer.flush()
```

That sent one record. Now let's send all 1000 rides in a loop:

```python
import time

t0 = time.time()

for _, row in df.iterrows():
    ride = ride_from_row(row)
    producer.send(topic_name, value=ride)
    print(f"Sent: {ride}")
    time.sleep(0.01)

producer.flush()

t1 = time.time()
print(f'took {(t1 - t0):.2f} seconds')
```

If you're building from scratch (not using the cloned repo files), create
the source directory structure and save the shared data model. The
producer and consumer scripts both import from this file:

```bash
mkdir -p src/producers src/consumers src/job
```

Create `src/models.py`:

```python
import json
from dataclasses import dataclass


@dataclass
class Ride:
    PULocationID: int
    DOLocationID: int
    trip_distance: float
    total_amount: float
    tpep_pickup_datetime: int  # epoch milliseconds


def ride_from_row(row):
    return Ride(
        PULocationID=int(row['PULocationID']),
        DOLocationID=int(row['DOLocationID']),
        trip_distance=float(row['trip_distance']),
        total_amount=float(row['total_amount']),
        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),
    )


def ride_deserializer(data):
    json_str = data.decode('utf-8')
    ride_dict = json.loads(json_str)
    return Ride(**ride_dict)
```

`ride_deserializer` is introduced in the next step - we include it here so
the file is complete.

> The complete script is in `src/producers/producer.py`.

Run it:

```bash
uv run python src/producers/producer.py
```

You'll see 1000 taxi trips sent over ~10 seconds:

```
Sent: Ride(PULocationID=..., DOLocationID=..., trip_distance=..., total_amount=..., tpep_pickup_datetime=...)
...
took 10.23 seconds
```


## Consume messages with Python

Now let's read back the messages. The consumer receives raw bytes from
Kafka. Instead of deserializing to a dict and then constructing a `Ride`
manually, let's write a function that does both in one step:

```python
import json

def ride_deserializer(data):
    json_str = data.decode('utf-8')
    ride_dict = json.loads(json_str)
    return Ride(**ride_dict)
```

Test it with sample JSON-encoded bytes (this is the form in which Kafka delivers a message value):

```python
test_bytes = json.dumps({
    'PULocationID': 186,
    'DOLocationID': 79,
    'trip_distance': 1.72,
    'total_amount': 17.31,
    'tpep_pickup_datetime': 1730429702000
}).encode('utf-8')

ride_deserializer(test_bytes)
# Ride(PULocationID=186, DOLocationID=79, trip_distance=1.72,
#      total_amount=17.31, tpep_pickup_datetime=1730429702000)
```

Now we can pass `ride_deserializer` directly as the `value_deserializer` -
Kafka calls it on every message, so `message.value` is already a `Ride`.
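
A quick way to convince yourself that the serializer and deserializer match
is a round trip without any broker involved. The sketch below repeats the
definitions from this README so it runs on its own; the extra `surge` field
is a made-up example of a message the `Ride` schema doesn't know about:

```python
import dataclasses
import json
from dataclasses import dataclass


@dataclass
class Ride:
    PULocationID: int
    DOLocationID: int
    trip_distance: float
    total_amount: float
    tpep_pickup_datetime: int  # epoch milliseconds


def ride_serializer(ride):
    return json.dumps(dataclasses.asdict(ride)).encode('utf-8')


def ride_deserializer(data):
    return Ride(**json.loads(data.decode('utf-8')))


original = Ride(186, 79, 1.72, 17.31, 1730429702000)
payload = ride_serializer(original)      # bytes, as sent to Kafka
restored = ride_deserializer(payload)    # back to a Ride

# A message with a field the dataclass doesn't declare is rejected.
bad_payload = json.dumps({**dataclasses.asdict(original), 'surge': 1.5}).encode('utf-8')
try:
    ride_deserializer(bad_payload)
    rejected = False
except TypeError:
    rejected = True

print(restored == original, rejected)
```

The `TypeError` case matters once producers evolve: `Ride(**ride_dict)`
only accepts exactly the fields the dataclass declares.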

Connect to Kafka as a consumer. `auto_offset_reset='earliest'` means we
start reading from the beginning of the topic (without this, new consumers
default to `latest` and only see new messages). `group_id` identifies this
consumer group - Kafka tracks how far each group has read, so restarting
with the same group ID continues where it left off:

```python
from kafka import KafkaConsumer

server = 'localhost:9092'
topic_name = 'rides'

consumer = KafkaConsumer(
    topic_name,
    bootstrap_servers=[server],
    auto_offset_reset='earliest',
    group_id='rides-console',
    value_deserializer=ride_deserializer
)
```

Read messages and print them. Since `value_deserializer` returns a `Ride`,
`message.value` is already a `Ride` object - no extra conversion needed:

```python
from datetime import datetime

print(f"Listening to {topic_name}...")

count = 0
for message in consumer:
    ride = message.value
    pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)
    print(f"Received: PU={ride.PULocationID}, DO={ride.DOLocationID}, "
          f"distance={ride.trip_distance}, amount=${ride.total_amount:.2f}, "
          f"pickup={pickup_dt}")
    count += 1
    if count >= 10:
        print(f"\n... received {count} messages so far (stopping after 10 for demo)")
        break

consumer.close()
```

> The complete script is in `src/consumers/consumer.py`.

Run it:

```bash
uv run python src/consumers/consumer.py
```

```
Listening to rides...
Received: PU=..., DO=..., distance=..., amount=$..., pickup=2025-...
...
... received 10 messages so far (stopping after 10 for demo)
```


## Save events to PostgreSQL

Printing to the screen is fine for debugging, but let's save events to a
database. Add the PostgreSQL service to `docker-compose.yml`:

```yaml
  postgres:
    image: postgres:18
    restart: on-failure
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - "5432:5432"
```

Start it:

```bash
docker compose up postgres -d
```

Connect to PostgreSQL. With `pgcli`:

```bash
uvx pgcli -h localhost -p 5432 -U postgres -d postgres
# password: postgres
```

Or via Docker:

```bash
docker compose exec postgres psql -U postgres -d postgres
```

Create a table for our events:

```sql
CREATE TABLE processed_events (
    PULocationID INTEGER,
    DOLocationID INTEGER,
    trip_distance DOUBLE PRECISION,
    total_amount DOUBLE PRECISION,
    pickup_datetime TIMESTAMP
);
```

Install the PostgreSQL client library:

```bash
uv add psycopg2-binary
```

Create `src/consumers/consumer_postgres.py`.

Set up the Kafka consumer. We reuse the same `ride_deserializer` from the
previous step. The `group_id` is different - each consumer group tracks its
offsets independently, so the console consumer and the PostgreSQL consumer
each read all messages:

```python
from kafka import KafkaConsumer

server = 'localhost:9092'
topic_name = 'rides'

consumer = KafkaConsumer(
    topic_name,
    bootstrap_servers=[server],
    auto_offset_reset='earliest',
    group_id='rides-to-postgres',
    value_deserializer=ride_deserializer
)
```
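
To see why two groups each read everything, here is a toy in-memory model
of a topic with per-group offsets. It is not the Kafka API; `topic`,
`committed`, and `poll` are hypothetical names used only for illustration:

```python
# Toy model: one topic (a list) and one committed offset per consumer group.
topic = [f"ride-{i}" for i in range(5)]
committed = {}  # group_id -> next offset to read


def poll(group_id, max_records):
    """Return the next records for a group and advance that group's offset."""
    start = committed.get(group_id, 0)  # a new group starts at 'earliest'
    batch = topic[start:start + max_records]
    committed[group_id] = start + len(batch)
    return batch


console_first = poll("rides-console", 3)
postgres_first = poll("rides-to-postgres", 2)  # unaffected by the other group
console_second = poll("rides-console", 3)      # continues where it left off

print(console_first, postgres_first, console_second)
```

Each group keeps its own position, so the PostgreSQL consumer starts from
the beginning even though the console consumer has already read the topic.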

Connect to PostgreSQL:

```python
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    port=5432,
    database='postgres',
    user='postgres',
    password='postgres'
)
conn.autocommit = True
cur = conn.cursor()
```

`autocommit = True` means each INSERT is committed immediately - no need
to call `conn.commit()` after every row.

Read messages and insert into PostgreSQL:

```python
from datetime import datetime

print(f"Listening to {topic_name} and writing to PostgreSQL...")

count = 0
for message in consumer:
    ride = message.value
    pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)
    cur.execute(
        """INSERT INTO processed_events
           (PULocationID, DOLocationID, trip_distance, total_amount, pickup_datetime)
           VALUES (%s, %s, %s, %s, %s)""",
        (ride.PULocationID, ride.DOLocationID,
         ride.trip_distance, ride.total_amount, pickup_dt)
    )
    count += 1
    if count % 100 == 0:
        print(f"Inserted {count} rows...")

consumer.close()
cur.close()
conn.close()
```

Run it. The loop never exits on its own, so press Ctrl+C once the inserts stop:

```bash
uv run python src/consumers/consumer_postgres.py
```

Check PostgreSQL:

```sql
SELECT count(*) FROM processed_events;
```

```
 count
-------
  1000
```

This works, but think about what's missing:

- What if we want to aggregate by time window? We'd need to implement windowing
  logic ourselves.
- What if the consumer crashes? Kafka commits offsets for the group, but not
  atomically with our INSERTs, so after a crash we could reprocess rows or
  miss them unless we coordinate the two ourselves.
- What about parallelism? We'd need to manage multiple consumer instances and
  partition assignment.
- What about writing to different sinks? We'd need to write connector code for
  each destination.
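
To get a feel for the first point, here is a minimal hand-rolled tumbling
window in plain Python. It is only a sketch: the one-hour window size and
the names `window_start` and `aggregate` are illustrative choices, and it
ignores late events, state that outlives the process, and parallelism:

```python
from collections import defaultdict

WINDOW_MS = 60 * 60 * 1000  # one-hour tumbling windows (illustrative choice)


def window_start(ts_ms: int) -> int:
    """Round an epoch-millisecond timestamp down to the start of its window."""
    return ts_ms - ts_ms % WINDOW_MS


def aggregate(events):
    """Count trips per (window start, PULocationID).

    events: iterable of (PULocationID, tpep_pickup_datetime in epoch ms).
    """
    counts = defaultdict(int)
    for pu_location, ts_ms in events:
        counts[(window_start(ts_ms), pu_location)] += 1
    return dict(counts)


minute = 60 * 1000
events = [
    (186, 0),            # first window
    (186, 30 * minute),  # first window
    (79, 59 * minute),   # first window, different location
    (186, 61 * minute),  # second window
]
result = aggregate(events)
print(result)
```

Even this small version keeps all counts in memory and never knows when a
window is complete, which is the kind of bookkeeping a stream processor
takes over.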

This is where Flink comes in. Clear the table before moving on:

```sql
TRUNCATE processed_events;
```


## Why Flink?

Flink is a stream processing framework that handles all the hard parts:

- Windowing - built-in tumbling, sliding, and session windows
- Checkpointing - automatic state recovery after failures (no manual offset tracking)
- Parallelism - distribute processing across multiple workers
- Connectors - built-in JDBC, Kafka, filesystem sinks (no psycopg2 code)
- SQL interface - express stream processing with SQL queries

Flink can also connect to sources beyond Kafka - REST APIs, websockets,
filesystems, and more. But Kafka is the most common source in stream processing.

The trade-off is infrastructure complexity - we need the JobManager and
TaskManager containers. A streaming job is more like owning a server than
running a batch pipeline - it runs 24/7 and needs monitoring. But for anything
beyond simple consume-and-write, Flink pays for itself.


## The Flink image and services

Flink doesn't come with Python support out of the box. We need a custom
Docker image with Python, PyFlink, and connector JARs.

Download the Flink build files:

```bash
PREFIX="https://raw.githubusercontent.com/DataTalksClub/data-engineering-zoomcamp/main/07-streaming/workshop"

wget ${PREFIX}/Dockerfile.flink
wget ${PREFIX}/pyproject.flink.toml
wget ${PREFIX}/flink-config.yaml
```

> If you cloned the repository, these files are already in the
> `07-streaming/workshop/` directory.

You can look at
[`Dockerfile.flink`](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/07-streaming/workshop/Dockerfile.flink)
to see what it does:

- Starts from the official Flink image (`flink:2.2.0-scala_2.12-java17`)
- Installs Python 3.12 and PyFlink via uv
- Downloads connector JARs (Kafka, JDBC, PostgreSQL driver)
- Applies a custom Flink config to increase JVM metaspace for PyFlink

Now add the Flink services to `docker-compose.yml`. A Flink cluster has
two types of processes - let's add them one at a time.

The JobManager is the coordinator. It accepts jobs, manages checkpoints,
and assigns work to task managers. The web UI and the REST API that
`flink run` submits jobs through share port `8081`; task managers
register with it over its RPC port (`6123`):

```yaml
  jobmanager:
    build:
      context: .
      dockerfile: ./Dockerfile.flink
    image: pyflink-workshop
    pull_policy: never
    expose:
      - "6123"
    ports:
      - "8081:8081"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        jobmanager.memory.process.size: 1600m
```

- `build` + `image: pyflink-workshop` - builds our custom Docker image and
  tags it as `pyflink-workshop`. The taskmanager will reuse this same image
  without rebuilding.
- `pull_policy: never` - don't try to pull `pyflink-workshop` from Docker Hub
  (it doesn't exist there - we built it locally).
- `volumes` - mount the source code into the container so we can submit jobs
  without rebuilding the image.
- `FLINK_PROPERTIES` - Flink configuration passed as an environment variable.
  `jobmanager.rpc.address: jobmanager` tells Flink where the coordinator
  lives (`jobmanager` is the Docker service name).
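
The `FLINK_PROPERTIES` value is just a block of `key: value` lines. A
minimal sketch of how such a block maps to settings, assuming one flat
option per line (`parse_properties` is a hypothetical helper, not something
Flink ships):

```python
# The same text the jobmanager service passes in FLINK_PROPERTIES.
flink_properties = """
jobmanager.rpc.address: jobmanager
jobmanager.memory.process.size: 1600m
"""


def parse_properties(text):
    """Parse flat 'key: value' lines into a dict, skipping blank lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        props[key.strip()] = value.strip()
    return props


props = parse_properties(flink_properties)
print(props)
```

Seeing the block as a flat mapping makes it easier to spot typos in option
names before starting the containers.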

The TaskManager is the worker. It executes the actual data processing:

```yaml
  taskmanager:
    image: pyflink-workshop
    pull_policy: never
    expose:
      - "6121"
      - "6122"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    depends_on:
      - jobmanager
    command: taskmanager --taskmanager.registration.timeout 5 min
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.memory.process.size: 1728m
        taskmanager.numberOfTaskSlots: 15
        parallelism.default: 3
```

- `image: pyflink-workshop` - reuses the image built by the jobmanager
  service, no `build` needed.
- `depends_on: jobmanager` - start after the jobmanager.
- `--taskmanager.registration.timeout 5 min` - give the task manager
  5 minutes to find the job manager on startup (useful when services start
  in parallel).
- `taskmanager.numberOfTaskSlots: 15` - this task manager has 15 slots.
- `parallelism.default: 3` - by default, each pipeline stage runs 3 copies
  processing data in parallel.

A task slot is a unit of resources (memory, CPU) that can run one parallel
instance of a pipeline stage. Think of slots like lanes on a highway - more
lanes means more data can flow through at once. If you submit a job with
parallelism 3, that job uses 3 slots. With 15 slots available, you can run
5 such jobs simultaneously on this single task manager. In production, you'd
have multiple task managers across different machines, each contributing
slots to the cluster. The job manager decides which slots run which parts
of which jobs.
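The slot arithmetic above can be sketched in a couple of lines. This is only a back-of-the-envelope model, assuming each job occupies exactly as many slots as its parallelism; the function name `jobs_that_fit` is made up for illustration.

```python
def jobs_that_fit(total_slots: int, parallelism: int) -> int:
    """How many jobs of the given parallelism fit into the available slots.

    Simplified model: every job needs exactly `parallelism` slots.
    """
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    return total_slots // parallelism


# Values taken from the taskmanager settings above
print(jobs_that_fit(total_slots=15, parallelism=3))
```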

Make sure `src/` exists before starting Docker - the volume mount
`./src/:/opt/src` will create it as root if it doesn't exist, causing
permission issues later when you try to create files inside it:

```bash
mkdir -p src/job
```

Build the Flink image and start all services:

```bash
docker compose up --build -d
```

The first build takes a few minutes - it installs Python, PyFlink, and downloads
the connector JARs.

Verify all four services are running:

```bash
docker compose ps
```

```
NAME                  IMAGE                           SERVICE        STATUS
workshop-jobmanager   pyflink-workshop                jobmanager     Up
workshop-taskmanager  pyflink-workshop                taskmanager    Up
workshop-postgres     postgres:18                     postgres       Up
workshop-redpanda     redpandadata/redpanda:v25.3.9   redpanda       Up
```

Check the Flink dashboard at [http://localhost:8081](http://localhost:8081) -
you should see 1 task manager with 15 available task slots.


## The pass-through Flink job

Now let's do the same thing our Python consumer did, but with Flink.

Unlike the producer and consumer scripts, Flink jobs can't run from a
Jupyter notebook. They are submitted to the Flink cluster as `.py` files
using `docker compose exec`. We cover how job submission works in
production in the Q&A section at the end.

Create `src/job/pass_through_job.py`.

The Kafka source table:

```python
def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            tpep_pickup_datetime BIGINT
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda:29092',
            'topic' = 'rides',
            'scan.startup.mode' = 'latest-offset',
            'properties.auto.offset.reset' = 'latest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name
```

This is a Flink SQL DDL statement. Breaking it down:

- `PULocationID`, `DOLocationID`, `trip_distance`, `total_amount`,
  `tpep_pickup_datetime` - the JSON fields from our producer
- `'properties.bootstrap.servers' = 'redpanda:29092'` - the internal Docker
  network address (not `localhost` - Flink runs inside Docker)
- `'scan.startup.mode' = 'latest-offset'` - only read new messages arriving
  after the job starts
- `'format' = 'json'` - Flink deserializes JSON automatically
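To make the `'format' = 'json'` step less magical, here is roughly what the deserialization amounts to for a single message, written in plain Python. The sample payload is an assumption based on the column list above, not captured producer output, and `parse_ride` is a hypothetical helper.

```python
import json

# Column name -> Python type, mirroring the Flink DDL above
SCHEMA = {
    "PULocationID": int,
    "DOLocationID": int,
    "trip_distance": float,
    "total_amount": float,
    "tpep_pickup_datetime": int,  # epoch milliseconds (BIGINT)
}


def parse_ride(raw: bytes) -> dict:
    """Decode one Kafka message value and coerce fields to the declared types.

    Fields missing from the message become None; extra fields are ignored.
    """
    record = json.loads(raw)
    return {
        name: cast(record[name]) if record.get(name) is not None else None
        for name, cast in SCHEMA.items()
    }


# Hypothetical message shaped like the schema
sample = (
    b'{"PULocationID": 79, "DOLocationID": 107, "trip_distance": 1.2, '
    b'"total_amount": 12.5, "tpep_pickup_datetime": 1761955200000}'
)
print(parse_ride(sample))
```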

The PostgreSQL sink table:

```python
def create_processed_events_sink_postgres(t_env):
    table_name = 'processed_events'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            pickup_datetime TIMESTAMP
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name
```

No psycopg2, no hand-written row-by-row INSERT code - just declare the table
and Flink handles the writes.

The execution:

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

def log_processing():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)  # checkpoint every 10 seconds

    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    source_table = create_events_source_kafka(t_env)
    postgres_sink = create_processed_events_sink_postgres(t_env)

    t_env.execute_sql(
        f"""
        INSERT INTO {postgres_sink}
        SELECT
            PULocationID,
            DOLocationID,
            trip_distance,
            total_amount,
            TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3) as pickup_datetime
        FROM {source_table}
        """
    ).wait()

if __name__ == '__main__':
    log_processing()
```

- Streaming mode - the job runs continuously, waiting for new data
- The `INSERT INTO ... SELECT` is the pipeline - read from Kafka, convert the
  timestamp, write to PostgreSQL
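For intuition, the `TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3)` conversion corresponds to the following arithmetic in plain Python. This is a sketch only: `epoch_millis_to_utc` is an illustrative name, and the result is shown in UTC, whereas Flink displays `TIMESTAMP_LTZ` values in the session time zone.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis_to_utc(ms: int) -> datetime:
    """Interpret an integer as milliseconds since the Unix epoch."""
    return EPOCH + timedelta(milliseconds=ms)


print(epoch_millis_to_utc(1761955200000).isoformat())
```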

`enable_checkpointing(10 * 1000)` tells Flink to take a snapshot of the
job's state every 10 seconds. A checkpoint captures the Kafka offsets (how
far Flink has read) and any in-flight data. If the job crashes, it resumes
from the last checkpoint instead of starting from the beginning.

Checkpointing gets especially important with windows. If you have a
5-minute window and the job fails 2 minutes in, Flink doesn't just track
the offset - it also serializes the open windows to disk. When it
restarts, it picks up right where it left off, with the partially-filled
window intact.

The trade-off is resilience versus efficiency. Checkpointing every 1 second
is expensive - Flink has to serialize and persist the entire state that
often. Checkpointing every 10 minutes means you could lose up to 10 minutes
of progress on failure. 10 seconds is a reasonable default for most jobs.
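The trade-off can be put into numbers with a rough sketch. It assumes the worst case is a failure just before the next checkpoint, so one full interval of progress has to be redone; `checkpoint_tradeoff` is a made-up helper, not a Flink API.

```python
def checkpoint_tradeoff(interval_seconds: int) -> dict:
    """Rough cost/benefit of a checkpoint interval."""
    return {
        # how often state has to be serialized and persisted
        "checkpoints_per_hour": 3600 // interval_seconds,
        # worst case: the job fails right before the next checkpoint
        "max_progress_lost_seconds": interval_seconds,
    }


for interval in (1, 10, 600):
    print(interval, checkpoint_tradeoff(interval))
```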

Submit the job:

```bash
docker compose exec jobmanager ./bin/flink run \
    -py /opt/src/job/pass_through_job.py \
    --pyFiles /opt/src -d
```

```
Job has been submitted with JobID 663cff6811b65e97fc1e068d641401f4
```

Check the Flink UI at [http://localhost:8081](http://localhost:8081) - you should
see a running job.

Since the job uses `latest-offset`, it's waiting for new messages. Send data:

```bash
uv run python src/producers/producer.py
```

Query PostgreSQL:

```sql
SELECT count(*) FROM processed_events;
```

Compare this to our Python consumer approach - same result, but Flink handles
checkpointing, offset management, and PostgreSQL writes automatically.


## Offsets - earliest vs latest

When Flink connects to Kafka, it needs to know where to start reading. This
is the `scan.startup.mode` setting:

| Mode | Behavior |
|---|---|
| `latest-offset` | Only read messages arriving after the job starts |
| `earliest-offset` | Read everything from the beginning of the topic |
| `timestamp` | Start from a specific point in time |

`earliest-offset` is typically used for backfilling or restating data -
you're using Flink to process data that's been sitting in Kafka for a while,
not real-time data. `latest-offset` is the more common production setting -
the job starts up and only processes new events, such as clicks on your
website or whatever event feed you're consuming.

Our pass-through job uses `latest-offset`. Let's see what happens with
`earliest-offset`:

1. Cancel the running job from the Flink UI (click on the job, then Cancel)
2. Clear the table:
   ```sql
   TRUNCATE processed_events;
   ```
3. Edit `src/job/pass_through_job.py` - change both offset settings:
   ```
   'scan.startup.mode' = 'earliest-offset',
   'properties.auto.offset.reset' = 'earliest',
   ```
4. Resubmit:
   ```bash
   docker compose exec jobmanager ./bin/flink run \
       -py /opt/src/job/pass_through_job.py \
       --pyFiles /opt/src -d
   ```
5. Wait 15 seconds, then check:
   ```sql
   SELECT count(*) FROM processed_events;
   ```

Flink reads all messages from the topic - including data from previous producer
runs. If you ran the producer twice before, you'll see ~2000 rows (duplicates
of everything already processed).

Why duplicates? Checkpoints are scoped to a specific job instance. When you
cancel and resubmit, it's a brand new job that knows nothing about previous
checkpoints. With `earliest-offset`, it starts from scratch. The offset
setting only matters at startup - once the job is running, checkpointing
takes over and tracks progress. But if you kill the job and create a new
one, those checkpoints are gone.

There is a third option - `timestamp` mode. If your job was running fine
until 2:00 PM and then crashed, you can restart it from exactly 2:00 PM.
This is useful for recovering from failures without reprocessing everything
from the beginning or missing the data that arrived while the job was down.
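As a sketch of how such a restart might be configured: the Flink Kafka connector takes the start point as epoch milliseconds in `scan.startup.timestamp-millis`. The helper below only computes that number and renders the two options to paste into the source DDL; the function name and the example date are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_startup_options(start: datetime) -> str:
    """Render the Kafka source options for starting at a point in time."""
    if start.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    millis = (start - EPOCH) // timedelta(milliseconds=1)
    return (
        "'scan.startup.mode' = 'timestamp',\n"
        f"'scan.startup.timestamp-millis' = '{millis}',"
    )


# Example: resume from 2:00 PM UTC on an arbitrary date
print(timestamp_startup_options(datetime(2025, 11, 1, 14, 0, tzinfo=timezone.utc)))
```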

A common production pattern (Lambda architecture): run your streaming job with
`latest-offset` for real-time results, and if it goes down, use a separate
batch job to backfill the gap. This way the streaming job stays fast and you
don't lose data.

> Change the offset back to `latest-offset` when you're done experimenting.


## Aggregation with tumbling windows

Now let's do something our plain Python consumer can't easily do - windowed
aggregation. We'll count taxi trips and sum revenue by pickup location per hour.

First, cancel any running jobs. Then create the aggregation table in PostgreSQL:

```sql
CREATE TABLE processed_events_aggregated (
    window_start TIMESTAMP,
    PULocationID INTEGER,
    num_trips BIGINT,
    total_revenue DOUBLE PRECISION,
    PRIMARY KEY (window_start, PULocationID)
);
```

Two important design choices:

1. `PULocationID` is included - we group by both time window and pickup
   location, so both appear in the output.
2. `PRIMARY KEY` - enables upsert behavior. When Flink sends updated counts
   for the same window, PostgreSQL updates the existing row instead of creating
   a duplicate. This matters because late-arriving events can cause Flink to
   re-evaluate a window it already emitted results for. With upsert, the
   corrected count replaces the old one automatically.
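To see what the upsert buys us, here is a small self-contained sketch. It uses Python's built-in `sqlite3` instead of PostgreSQL purely so the snippet runs anywhere (that substitution is an assumption of this example); the `ON CONFLICT ... DO UPDATE` clause works the same way in both for this case. Writing the same `(window_start, PULocationID)` key twice leaves a single row holding the latest values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processed_events_aggregated (
        window_start TEXT,
        PULocationID INTEGER,
        num_trips INTEGER,
        total_revenue REAL,
        PRIMARY KEY (window_start, PULocationID)
    )
""")

UPSERT = """
    INSERT INTO processed_events_aggregated
        (window_start, PULocationID, num_trips, total_revenue)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (window_start, PULocationID) DO UPDATE SET
        num_trips = excluded.num_trips,
        total_revenue = excluded.total_revenue
"""

# First result for the window, then an updated result for the same key
conn.execute(UPSERT, ("2025-11-01 00:00:00", 79, 1, 12.5))
conn.execute(UPSERT, ("2025-11-01 00:00:00", 79, 2, 30.0))

rows = conn.execute("SELECT * FROM processed_events_aggregated").fetchall()
print(rows)  # one row, not two
```

Without the primary key, the second write would simply add another row for the same window and location.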

Now create `src/job/aggregation_job.py`:

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment


def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            tpep_pickup_datetime BIGINT,
            event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),
            WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda:29092',
            'topic' = 'rides',
            'scan.startup.mode' = 'earliest-offset',
            'properties.auto.offset.reset' = 'earliest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name


def create_events_aggregated_sink(t_env):
    table_name = 'processed_events_aggregated'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            window_start TIMESTAMP(3),
            PULocationID INT,
            num_trips BIGINT,
            total_revenue DOUBLE,
            PRIMARY KEY (window_start, PULocationID) NOT ENFORCED
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name


def log_aggregation():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)
    env.set_parallelism(3)

    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    try:
        source_table = create_events_source_kafka(t_env)
        aggregated_table = create_events_aggregated_sink(t_env)

        t_env.execute_sql(f"""
        INSERT INTO {aggregated_table}
        SELECT
            window_start,
            PULocationID,
            COUNT(*) AS num_trips,
            SUM(total_amount) AS total_revenue
        FROM TABLE(
            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '1' HOUR)
        )
        GROUP BY window_start, PULocationID;

        """).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_aggregation()
```

The Kafka source table has two new lines compared to the pass-through job:

- `event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3)` - a computed
  column that converts epoch milliseconds to a timestamp. The `3` means
  milliseconds precision.
- `WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND` -
  tells Flink when to publish window results.

The window defines WHAT you're counting - a 1-hour bucket of taxi trips.
But in a stream, events keep arriving. How does Flink know when to stop
waiting and publish the count for the 2 PM - 3 PM hour? It can't just
look at the clock because some events arrive late. Without a trigger,
Flink would accumulate data forever and never write anything to PostgreSQL.

The watermark is that trigger. It tells Flink when to publish. In the SQL:

```
WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '5' SECOND
                                                   ^^^^^^^^^^^^^^^^^^^
                                                   patience = 5 seconds
```

The watermark is always 5 seconds behind the latest event timestamp Flink
has seen. When the watermark passes the end of a window, Flink publishes
that window's results. The 5 seconds is patience for stragglers - events
that happened before the window ended but arrived a few seconds late.
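The watermark rule is easy to replay by hand. The sketch below is a toy model in plain Python, not Flink code: timestamps are whole seconds, a window is published once the watermark reaches its end, and events whose window was already published are only collected in a separate list. The function name `simulate` and the sample timestamps are made up.

```python
def simulate(event_timestamps, window_size=10, delay=5):
    """Replay event timestamps (seconds, in arrival order).

    Returns (fired, late): `fired` lists (window_start, count) in the order
    windows were published; `late` holds timestamps that arrived after
    their window had already been published.
    """
    open_windows = {}  # window_start -> running count
    published = set()
    fired, late = [], []
    max_ts = None

    for ts in event_timestamps:
        start = ts - ts % window_size
        if start in published:
            late.append(ts)
            continue
        open_windows[start] = open_windows.get(start, 0) + 1

        # The watermark trails the largest timestamp seen so far
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - delay

        # Publish every open window whose end the watermark has reached
        for w in sorted(open_windows):
            if w + window_size <= watermark:
                fired.append((w, open_windows.pop(w)))
                published.add(w)
    return fired, late


# Out-of-order event (ts=4) still arrives before the window is published
print(simulate([7, 4, 12, 15]))
# Same event arriving after the watermark has passed the window end
print(simulate([7, 15, 4]))
```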

Three pieces working together:

- Window = what bucket to count into (1 hour)
- Watermark = when to publish the result (the trigger)
- Upsert (PRIMARY KEY) = safety net that corrects the result if something
  arrives after publishing

Here's a concrete example. Two taxi pickups in East Village (PU=79) with
a 10-second window and 5-second watermark. Event A is on time, Event B is
8 seconds late (the rider's phone lost signal in a tunnel).

Event B arrives late, but Flink hasn't published yet - both events counted:

```mermaid
sequenceDiagram
    participant P as Producer
    participant K as Kafka
    participant F as Flink
    participant PG as PostgreSQL

    P->>K: Event A (ts=14:00:07, on time)
    K->>F: Event A
    Note over F: watermark = 00:02<br/>window [00:00, 00:10) not published yet<br/>A added to window

    Note over P: 5 seconds pass, phone reconnects

    P->>K: Event B (ts=14:00:04, 8s late)
    K->>F: Event B
    Note over F: watermark = 00:07<br/>window [00:00, 00:10) still not published<br/>B added to window

    Note over F: more events arrive<br/>watermark reaches 00:10<br/>time to publish

    F->>PG: INSERT (window=00:00, PU=79, trips=2)
    Note over PG: both events counted
```

Event B arrived late, but within Flink's patience window. Flink hadn't
published the result yet, so B was included in the count.

Now what if Event B were 20 seconds late - arriving after Flink already
published?

```mermaid
sequenceDiagram
    participant P as Producer
    participant K as Kafka
    participant F as Flink
    participant PG as PostgreSQL

    P->>K: Event A (ts=14:00:07, on time)
    K->>F: Event A
    Note over F: A added to window [00:00, 00:10)

    Note over F: watermark reaches 00:10<br/>time to publish

    F->>PG: INSERT (window=00:00, PU=79, trips=1)
    Note over PG: published with trips=1

    Note over P: 20 seconds later, phone reconnects

    P->>K: Event B (ts=14:00:04, 20s late)
    K->>F: Event B
    Note over F: window [00:00, 00:10) already published<br/>but B still belongs to it

    F->>PG: UPDATE (window=00:00, PU=79, trips=2)
    Note over PG: upsert via PRIMARY KEY<br/>corrected from 1 to 2
```

Flink already published trips=1, but when Event B finally arrives, the
PRIMARY KEY lets Flink send a correction. PostgreSQL updates the row
from 1 to 2. Without the PRIMARY KEY (an append-only sink), Event B
would be lost - Flink can't re-open a published window in append mode.

The trade-off is latency vs completeness. A larger watermark means more
patience for late events, but you wait longer before seeing any results.
5 seconds is a reasonable default. In production, you'd tune this based
on how out-of-order your data actually is.

Other differences from the pass-through job:

- The sink has a `PRIMARY KEY` with `NOT ENFORCED` - this enables upsert
  behavior in the Flink JDBC connector.
- `earliest-offset` - reads all existing data from Kafka.
- `env.set_parallelism(3)` - runs 3 copies processing data in parallel.
- The `TUMBLE` function creates fixed-size, non-overlapping windows.
  `DESCRIPTOR(event_timestamp)` must reference the column with the `WATERMARK`
  defined on it, and `INTERVAL '1' HOUR` sets the window size.

Submit and test:

```bash
docker compose exec jobmanager ./bin/flink run \
    -py /opt/src/job/aggregation_job.py \
    --pyFiles /opt/src -d
```

Send data:

```bash
uv run python src/producers/producer.py
```

Wait ~15 seconds for the windows to close, then check:

```sql
SELECT window_start, count(*) as locations, sum(num_trips) as total_trips,
       round(sum(total_revenue)::numeric, 2) as revenue
FROM processed_events_aggregated
GROUP BY window_start
ORDER BY window_start;
```

```
     window_start     | locations | total_trips | revenue
----------------------+-----------+-------------+---------
 2025-11-01 00:00:00  |        ...
 2025-11-01 01:00:00  |        ...
 ...
```

The 1000 taxi trips were grouped into 1-hour tumbling windows by pickup
location. Each row of this summary shows how many pickup locations had trips
in that hour, the total number of trips, and the total revenue.

Try this with a plain Python consumer - you'd need to implement the windowing
logic, handle late events, manage state, and write the upsert SQL yourself.
With Flink, it's a SQL query.


## Late events and upserts

The CSV producer sends events in order, so the watermark never has to
handle late arrivals. Let's use a real-time producer that generates
synthetic events with occasional delays to see what happens.

Download and run the real-time producer:

```bash
PREFIX="https://raw.githubusercontent.com/DataTalksClub/data-engineering-zoomcamp/main/07-streaming/workshop"
wget ${PREFIX}/src/producers/producer_realtime.py -P src/producers/
```

```bash
uv run python src/producers/producer_realtime.py
```

It generates random taxi trips with current timestamps, but ~20% of events
are sent with a timestamp 3-10 seconds in the past (simulating network
delays). The output labels each event:

```
  on time   -> PU=79 ts=14:23:05
  on time   -> PU=107 ts=14:23:05
  LATE (8s) -> PU=234 ts=14:22:58
  on time   -> PU=48 ts=14:23:06
```

With our 5-second watermark and 1-hour windows, no events will be dropped -
even an event 10 seconds late lands well within the current hour window.
But the watermark + upsert behavior is still visible: Flink first emits
window results when the watermark passes the window end, then late events
update those results via the PRIMARY KEY.

To see this in action, open two terminals:

Terminal 1 - run the real-time producer:

```bash
uv run python src/producers/producer_realtime.py
```

Terminal 2 - watch aggregation counts change:

```bash
watch -n 1 'PGPASSWORD=postgres docker compose exec postgres psql -U postgres -d postgres -c "SELECT window_start, sum(num_trips) as trips, round(sum(total_revenue)::numeric, 2) as revenue FROM processed_events_aggregated GROUP BY window_start ORDER BY window_start;"'
```

You'll see the counts for older windows increase as late events arrive
and update the aggregation via upsert. This is why we set up the PRIMARY
KEY - without it, late events would either be dropped or create duplicates.


## Understanding window types

We used tumbling windows above. The three most common window types are:

### Tumbling windows

Fixed-size, non-overlapping. Every event belongs to exactly one window.
If you come from the batch world, tumbling windows are the most familiar -
they just cut up your data into fixed segments. It's essentially a way to
speed up batch processing.

```
|  Window 1  |  Window 2  |  Window 3  |
|  1 hour    |  1 hour    |  1 hour    |
```

Use case: Counting trips per hour, daily revenue summaries.
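Assigning an event to its tumbling window is just rounding the timestamp down to a multiple of the window size. A minimal sketch in plain Python, assuming windows are aligned to the Unix epoch (`tumbling_window` is an illustrative helper, not a Flink function):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(ts: datetime, size: timedelta):
    """Return the single [start, end) window that contains ts."""
    start = ts - (ts - EPOCH) % size
    return start, start + size


pickup = datetime(2025, 11, 1, 14, 23, 5, tzinfo=timezone.utc)
print(tumbling_window(pickup, timedelta(hours=1)))
```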

### Sliding windows

Fixed-size, overlapping. An event can belong to multiple windows. When
people think of a 1-hour window, most picture 00:00-01:00. But there's
also 00:15-01:15, 00:30-01:30 - those are also 1-hour windows, just
starting at different points. Sliding windows capture all of them.

```
|--- Window 1 (1 hour) ---|
      |--- Window 2 (1 hour) ---|
            |--- Window 3 (1 hour) ---|
      <- 15 min slide ->
```

```sql
HOP(TABLE events, DESCRIPTOR(event_timestamp), INTERVAL '15' MINUTE, INTERVAL '1' HOUR)
```

Use case: finding peaks and valleys - "what was our peak traffic in any
1-hour window?" These overlapping windows let you find the moment in time
where you have the highest or lowest values. Good for min-maxing, moving
averages, and surge detection (e.g., ride-share surge pricing).
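To illustrate how one event lands in several sliding windows, here is a plain-Python sketch. It assumes window starts are aligned to multiples of the slide counted from the Unix epoch; `hop_windows` is a made-up helper for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hop_windows(ts: datetime, size: timedelta, slide: timedelta):
    """Return every [start, end) window of length `size` that contains ts."""
    # Latest window start at or before ts
    start = ts - (ts - EPOCH) % slide
    windows = []
    while start + size > ts:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


event = datetime(2025, 11, 1, 0, 40, tzinfo=timezone.utc)
for w_start, w_end in hop_windows(event, timedelta(hours=1), timedelta(minutes=15)):
    print(w_start.time(), "-", w_end.time())
```

With a 1-hour size and a 15-minute slide, each event belongs to four windows (size divided by slide).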

### Session windows

Dynamic windows based on inactivity gaps. Unlike tumbling and sliding
windows, the window size isn't fixed - the window doesn't close at a
specified time, it closes after a specified amount of inactivity.

```
|--events--| gap |--events------| gap |--events--|
| Session 1|     |  Session 2   |     | Session 3|
```

Use case: grouping user behavior together. Imagine a user logs into an app,
clicks a bunch of buttons, leaves for 2 minutes, then comes back - that's
still technically the same session. You set a session gap (say, 30 minutes
of inactivity) and Flink groups all the events within that session together.
Sessionization is very powerful for behavioral analytics.
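The grouping rule can be sketched in plain Python. This toy version works on a finished list of timestamps (minutes, for readability) rather than on a stream, and `sessionize` is a hypothetical helper; an event joins the current session when it arrives less than `gap` after the previous event.

```python
def sessionize(timestamps, gap):
    """Split timestamps into sessions separated by at least `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # still the same session
        else:
            sessions.append([ts])  # inactivity gap exceeded: new session
    return sessions


# Minutes since the user first showed up; session gap of 30 minutes
print(sessionize([0, 5, 7, 50, 55, 120], gap=30))
```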


## Cleanup

Stop and remove all containers:

```bash
docker compose down
```

To also remove the PostgreSQL data volume:

```bash
docker compose down -v
```


## Q&A

Questions and answers from the
[2025 stream with Zach Wilson](https://www.youtube.com/watch?v=P2loELMUUeI).

### What happens when a Flink job dies and restarts? Does it reprocess everything?

The `earliest` offset setting is only for the initial startup. If the job
restarts (not re-submitted as a new job), it uses checkpointing to resume
from the last snapshot. Without checkpointing, you either reprocess
everything (with `earliest`) or skip data (with `latest`).

The catch: checkpoints are scoped to a specific job instance. If you
completely kill a job and submit a new one, the new job has no knowledge of
the previous checkpoints. To preserve state across redeployments, restart
the existing job rather than creating a new one, or stop it with a savepoint
and start the new submission from that savepoint.

### Why can't we just use Kafka consumers? What does Flink actually add?

For simple pass-through (read a message, write it somewhere), a Kafka
consumer is fine. For anything involving time windows, watermarks,
checkpointing, or parallel processing, Flink saves you from building all
that yourself.

You can do windowing, watermarking, late data handling, and job recovery
with a plain consumer - go ahead and manage it yourself. But as Zach puts
it: "good luck." With a plain consumer, you'd also need to track
checkpoints yourself - save the latest processed timestamp to a file or
database and manage it on every restart. Flink keeps the state for you.
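To give a feel for that bookkeeping, here is a minimal sketch of manual progress tracking. It stands in a Python list for the topic and a JSON file for the state store; all names (`load_offset`, `save_offset`, `consume`) are invented for this example, and it covers only the offset, not windows or any other state.

```python
import json
import os
import tempfile


def load_offset(path: str) -> int:
    """Return the next offset to process, or 0 on the very first run."""
    try:
        with open(path) as f:
            return json.load(f)["next_offset"]
    except FileNotFoundError:
        return 0


def save_offset(path: str, next_offset: int) -> None:
    """Write to a temp file and rename, so a crash never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_offset": next_offset}, f)
    os.replace(tmp_path, path)


def consume(messages: list, path: str) -> list:
    """Process messages not yet seen, recording progress after each one."""
    processed = []
    for offset in range(load_offset(path), len(messages)):
        processed.append(messages[offset])  # stand-in for the real work
        save_offset(path, offset + 1)
    return processed


with tempfile.TemporaryDirectory() as tmp:
    state_file = os.path.join(tmp, "offset.json")
    first_run = consume(["a", "b"], state_file)
    # "Restart" after one more message has arrived
    second_run = consume(["a", "b", "c"], state_file)
    print(first_run, second_run)
```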

It's like asking "why use Spark when you can use Pandas?" You can, but
Pandas won't work at higher scale in a distributed way.

### What happens with events delayed beyond the watermark (the "tunnel" scenario)?

There are two types of lateness. The watermark handles acceptable lateness -
small delays where events arrive a few seconds late. For events arriving
much later (like after a 5-minute tunnel), Flink has an allowed lateness
parameter.

By default, allowed lateness is zero - events arriving after the watermark
closes a window are discarded. If you set allowed lateness to 10 minutes,
Flink will go back, find the old closed window, create a new aggregation
with the late event, and send it to the sink as a brand new record. This
means you need deduplication logic on the sink side (a primary key with
upsert behavior - exactly what we set up in the aggregation section).

The trade-off: allowed lateness requires Flink to keep the state of all
those windows around for the duration of the tolerance.
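The allowed-lateness behavior described above can be replayed with a toy model. This is plain Python, not Flink code: timestamps are whole seconds, a window first fires when the watermark reaches its end, and its state is kept until the watermark passes the end plus the allowed lateness. The function name `run` and the sample events are assumptions for illustration.

```python
def run(events, window_size=10, delay=5, allowed_lateness=0):
    """Return the (window_start, count) records sent to the sink, in order."""
    counts = {}  # window_start -> running count
    fired = set()
    emitted = []
    max_ts = float("-inf")

    for ts in events:
        start = ts - ts % window_size
        end = start + window_size
        if max_ts - delay >= end + allowed_lateness:
            continue  # too late: window state is gone, event is discarded

        counts[start] = counts.get(start, 0) + 1
        max_ts = max(max_ts, ts)
        watermark = max_ts - delay

        if start in fired:
            # Late but tolerated: emit a fresh record for the old window
            emitted.append((start, counts[start]))
        for w in sorted(counts):
            if w not in fired and w + window_size <= watermark:
                fired.add(w)
                emitted.append((w, counts[w]))
    return emitted


events = [7, 15, 4]  # ts=4 arrives after window [0, 10) has fired
print(run(events, allowed_lateness=0))
print(run(events, allowed_lateness=60))
```

With allowed lateness, the sink receives two records for the same window, which is why the sink needs a key to deduplicate on.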

### When do we actually need streaming? For many things micro-batch is enough.

The key question: is something going to happen in real time on the other
side? If there is an automated process that will change something based on
the data, streaming is a great choice. If a human is just looking at data,
real-time is unnecessary and micro-batch is easier to maintain.

In 10 years as a data engineer, Zach had literally two use cases that
genuinely needed streaming - Netflix fraud/security detection (5 minutes of
delay means 5 more minutes of a hacked account) and Airbnb surge pricing
(supply and demand changes rapidly). Everything else was daily batch, or
hourly/every-15-minute micro-batch for lower latency needs.

Before committing to streaming, consider the operational cost. A streaming
job runs 24/7 - if it breaks at 3 AM, someone needs to fix it. If you're
the only person on the team who understands Flink, you'll be on-call for
it forever. Talk to your manager before implementing streaming - you'll
need to teach your entire team before you can share the on-call burden.

### Spark Streaming vs Flink Streaming?

They are fundamentally different today but will likely converge. The key
difference: Spark Streaming is micro-batch - it pulses every 15-30 seconds,
pulling data in small batches (pull architecture). Flink is genuine
continuous processing - events flow through as they arrive (push
architecture). For most use cases the difference is negligible, but Flink
has lower latency for truly real-time needs.

For micro-batch intervals, Zach finds every-5-minutes too frequent with
Spark because startup alone takes about a minute, making the
overhead-to-work ratio poor. His sweet spots are hourly and every 15
minutes.

### How does job submission work in production?

In this workshop we mount local files into Docker and submit jobs with
`docker compose exec` - that's a development convenience. In production,
job submission looks different depending on the deployment:

- Managed services (Amazon Managed Service for Apache Flink, formerly
  Kinesis Data Analytics; Confluent Cloud; or Google Cloud Dataflow for
  Apache Beam pipelines) - you submit the job (a JAR, a Python zip, or SQL)
  through a web console or CLI. The service handles the cluster.
- Self-hosted Flink on Kubernetes - you typically build a Docker image with
  your job code baked in, or use the Flink Kubernetes Operator which pulls
  job artifacts from S3/GCS at startup.
- Standalone Flink cluster - you use the `flink run` CLI pointing to a
  local file or an HTTP/S3 URL. CI/CD pipelines often upload the job
  artifact to S3 and then call `flink run` with that URL.

The common pattern: your code lives in git, CI builds an artifact (JAR,
Python zip, or Docker image), pushes it to a registry or object store, and
then triggers the Flink cluster to pick it up.


================================================
FILE: 07-streaming/workshop/docker-compose.yml
================================================
services:
  redpanda:
    image: redpandadata/redpanda:v25.3.9
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda:33145
    ports:
      - 8082:8082
      - 9092:9092
      - 28082:28082
      - 29092:29092

  jobmanager:
    build:
      context: .
      dockerfile: ./Dockerfile.flink
    image: pyflink-workshop
    pull_policy: never
    expose:
      - "6123"
    ports:
      - "8081:8081"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        jobmanager.memory.process.size: 1600m

  taskmanager:
    image: pyflink-workshop
    pull_policy: never
    expose:
      - "6121"
      - "6122"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    depends_on:
      - jobmanager
    command: taskmanager --taskmanager.registration.timeout 5 min
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.memory.process.size: 1728m
        taskmanager.numberOfTaskSlots: 15
        parallelism.default: 3

  postgres:
    image: postgres:18
    restart: on-failure
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - "5432:5432"


================================================
FILE: 07-streaming/workshop/flink-config.yaml
================================================
# Custom Flink config for PyFlink workshop.
# Original: https://github.com/apache/flink/blob/release-2.2/flink-dist/src/main/resources/config.yaml
# Changes from default:
#   1. Added taskmanager.memory.jvm-metaspace.size: 512m (PyFlink needs more metaspace)
#   2. Removed --add-exports=jdk.compiler/... from env.java.opts.all
#      (jdk.compiler module is not present in the JRE, causing warnings on every command)

blob:
  server:
    port: '6124'
taskmanager:
  memory:
    process:
      size: 1728m
    jvm-metaspace:
      size: 512m  # added for PyFlink
  bind-host: 0.0.0.0
  numberOfTaskSlots: 15
jobmanager:
  execution:
    failover-strategy: region
  rpc:
    address: jobmanager
    port: 6123
  memory:
    process:
      size: 1600m
  bind-host: 0.0.0.0
query:
  server:
    port: '6125'
parallelism:
  default: 1
rest:
  address: 0.0.0.0
env:
  java:
    opts:
      all: >-
        --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED
        --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
        --add-opens=java.base/java.lang=ALL-UNNAMED
        --add-opens=java.base/java.net=ALL-UNNAMED
        --add-opens=java.base/java.io=ALL-UNNAMED
        --add-opens=java.base/java.nio=ALL-UNNAMED
        --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
        --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
        --add-opens=java.base/java.text=ALL-UNNAMED
        --add-opens=java.base/java.time=ALL-UNNAMED
        --add-opens=java.base/java.util=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED


================================================
FILE: 07-streaming/workshop/live/.gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[codz]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py.cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#   For a library or package, you might want to ignore these files since the code is
#   intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# UV
#   Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
#   This is especially recommended for binary packages to ensure reproducibility, and is more
#   commonly ignored for libraries.
#uv.lock

# poetry
#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
#   This is especially recommended for binary packages to ensure reproducibility, and is more
#   commonly ignored for libraries.
#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
#poetry.toml

# pdm
#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#   pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
#   https://pdm-project.org/en/latest/usage/project/#working-with-version-control
#pdm.lock
#pdm.toml
.pdm-python
.pdm-build/

# pixi
#   Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
#pixi.lock
#   Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
#   in the .venv directory. It is recommended not to include this directory in version control.
.pixi

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.envrc
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
#  and can be added to the global gitignore or merged into this file.  For a more nuclear
#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# Abstra
# Abstra is an AI-powered process automation framework.
# Ignore directories containing user credentials, local state, and settings.
# Learn more at https://abstra.io/docs
.abstra/

# Visual Studio Code
#  Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore 
#  that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
#  and can be added to the global gitignore or merged into this file. However, if you prefer, 
#  you could uncomment the following to ignore the entire vscode folder
# .vscode/

# Ruff stuff:
.ruff_cache/

# PyPI configuration file
.pypirc

# Cursor
#  Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
#  exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
#  refer to https://docs.cursor.com/context/ignore-files
.cursorignore
.cursorindexingignore

# Marimo
marimo/_static/
marimo/_lsp/
__marimo__/


================================================
FILE: 07-streaming/workshop/live/.python-version
================================================
3.12


================================================
FILE: 07-streaming/workshop/live/Dockerfile.flink
================================================
FROM flink:2.2.0-scala_2.12-java17

COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

# ref: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker

WORKDIR /opt/pyflink
COPY pyproject.flink.toml pyproject.toml
RUN uv python install 3.12 && uv sync
ENV PATH="/opt/pyflink/.venv/bin:$PATH"

# Download connector libraries

WORKDIR /opt/flink/lib
# Chain with && so a failed download fails the build instead of being silently ignored
RUN wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/2.2.0/flink-json-2.2.0.jar && \
    wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/4.0.1-2.0/flink-sql-connector-kafka-4.0.1-2.0.jar && \
    wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-core/4.0.0-2.0/flink-connector-jdbc-core-4.0.0-2.0.jar && \
    wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-postgres/4.0.0-2.0/flink-connector-jdbc-postgres-4.0.0-2.0.jar && \
    wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.10/postgresql-42.7.10.jar

COPY flink-config.yaml /opt/flink/conf/config.yaml

WORKDIR /opt/flink


================================================
FILE: 07-streaming/workshop/live/README.md
================================================
# streaming-workshop
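
The notebooks exchange `Ride` records over the `rides` topic as UTF-8 JSON, with the pickup time stored as epoch milliseconds (see `notebooks/models.py`). The sketch below repeats that round trip without a broker, using values from the producer notebook output. Reading the timestamp as UTC is an assumption made here for illustration; the consumer notebook itself calls `datetime.fromtimestamp` without a time zone, so its result depends on the local time zone of the machine.

```python
import dataclasses
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Ride:
    PULocationID: int
    DOLocationID: int
    trip_distance: float
    total_amount: float
    tpep_pickup_datetime: int  # epoch milliseconds


def ride_serializer(ride: Ride) -> bytes:
    """Encode a Ride as UTF-8 JSON bytes (the Kafka message value)."""
    return json.dumps(dataclasses.asdict(ride)).encode("utf-8")


def ride_deserializer(data: bytes) -> Ride:
    """Decode UTF-8 JSON bytes back into a Ride."""
    return Ride(**json.loads(data.decode("utf-8")))


# Sample values copied from the producer notebook output.
ride = Ride(
    PULocationID=142,
    DOLocationID=237,
    trip_distance=2.28,
    total_amount=24.94,
    tpep_pickup_datetime=1761958147000,
)

payload = ride_serializer(ride)
restored = ride_deserializer(payload)
roundtrip_ok = restored == ride

# Assumption: interpret the epoch milliseconds as UTC so the result
# does not depend on the machine's local time zone.
pickup_utc = datetime.fromtimestamp(
    restored.tpep_pickup_datetime / 1000, tz=timezone.utc
)

print(roundtrip_ok)            # True
print(pickup_utc.isoformat())  # 2025-11-01T00:49:07+00:00
```

Because the payload is plain JSON, any consumer (the Python notebook, or a Flink job using the JSON format) can read the same messages as long as it agrees on the field names and on the millisecond unit of `tpep_pickup_datetime`.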

================================================
FILE: 07-streaming/workshop/live/docker-compose.yaml
================================================
services:
  redpanda:
    image: redpandadata/redpanda:v25.3.9
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda:33145
    ports:
      - 8082:8082
      - 9092:9092
      - 28082:28082
      - 29092:29092

  postgres:
    image: postgres:18
    restart: on-failure
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - "5432:5432"

  jobmanager:
    build:
      context: .
      dockerfile: ./Dockerfile.flink
    image: pyflink-workshop
    pull_policy: never
    expose:
      - "6123"
    ports:
      - "8081:8081"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        jobmanager.memory.process.size: 1600m

  taskmanager:
    image: pyflink-workshop
    pull_policy: never
    expose:
      - "6121"
      - "6122"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    depends_on:
      - jobmanager
    command: taskmanager --taskmanager.registration.timeout 5 min
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.memory.process.size: 1728m
        taskmanager.numberOfTaskSlots: 15
        parallelism.default: 3

================================================
FILE: 07-streaming/workshop/live/flink-config.yaml
================================================
# Custom Flink config for PyFlink workshop.
# Original: https://github.com/apache/flink/blob/release-2.2/flink-dist/src/main/resources/config.yaml
# Changes from default:
#   1. Added taskmanager.memory.jvm-metaspace.size: 512m (PyFlink needs more metaspace)
#   2. Removed --add-exports=jdk.compiler/... from env.java.opts.all
#      (jdk.compiler module is not present in the JRE, causing warnings on every command)

blob:
  server:
    port: '6124'
taskmanager:
  memory:
    process:
      size: 1728m
    jvm-metaspace:
      size: 512m  # added for PyFlink
  bind-host: 0.0.0.0
  numberOfTaskSlots: 15
jobmanager:
  execution:
    failover-strategy: region
  rpc:
    address: jobmanager
    port: 6123
  memory:
    process:
      size: 1600m
  bind-host: 0.0.0.0
query:
  server:
    port: '6125'
parallelism:
  default: 1
rest:
  address: 0.0.0.0
env:
  java:
    opts:
      all: >-
        --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED
        --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
        --add-opens=java.base/java.lang=ALL-UNNAMED
        --add-opens=java.base/java.net=ALL-UNNAMED
        --add-opens=java.base/java.io=ALL-UNNAMED
        --add-opens=java.base/java.nio=ALL-UNNAMED
        --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
        --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
        --add-opens=java.base/java.text=ALL-UNNAMED
        --add-opens=java.base/java.time=ALL-UNNAMED
        --add-opens=java.base/java.util=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED


================================================
FILE: 07-streaming/workshop/live/main.py
================================================
def main():
    print("Hello from streaming-workshop!")


if __name__ == "__main__":
    main()


================================================
FILE: 07-streaming/workshop/live/notebooks/consumer_db.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c77749d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from kafka import KafkaConsumer\n",
    "\n",
    "server = 'localhost:9092'\n",
    "topic_name = 'rides'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "74dcdffe",
   "metadata": {},
   "outputs": [],
   "source": [
    "from models import Ride, ride_deserializer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00726e41",
   "metadata": {},
   "outputs": [],
   "source": [
    "consumer = KafkaConsumer(\n",
    "    topic_name,\n",
    "    bootstrap_servers=[server],\n",
    "    auto_offset_reset='earliest',\n",
    "    group_id='rides-database',\n",
    "    value_deserializer=ride_deserializer\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a2cf7106",
   "metadata": {},
   "outputs": [],
   "source": [
    "import psycopg2\n",
    "\n",
    "conn = psycopg2.connect(\n",
    "    host='localhost',\n",
    "    port=5432,\n",
    "    database='postgres',\n",
    "    user='postgres',\n",
    "    password='postgres'\n",
    ")\n",
    "\n",
    "conn.autocommit = True\n",
    "cur = conn.cursor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "f0902406",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Listening to rides and writing to PostgreSQL...\n",
      "Inserted 100 rows...\n",
      "Inserted 200 rows...\n",
      "Inserted 300 rows...\n",
      "Inserted 400 rows...\n",
      "Inserted 500 rows...\n",
      "Inserted 600 rows...\n",
      "Inserted 700 rows...\n",
      "Inserted 800 rows...\n",
      "Inserted 900 rows...\n",
      "Inserted 1000 rows...\n"
     ]
    },
    {
     "ename": "KeyboardInterrupt",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "---------------------------------------------------------------------------",
      "KeyboardInterrupt                         Traceback (most recent call last)",
      "Cell In[5], line 6",
      "File /workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:1213, in KafkaConsumer.__next__(self)",
      "File /workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:1185, in KafkaConsumer._message_generator_v2(self)",
      "File /workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:708, in KafkaConsumer.poll(self, timeout_ms, max_records, update_offsets)",
      "File /workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:757, in KafkaConsumer._poll_once(self, timer, max_records, update_offsets)",
      "File /workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/client_async.py:685, in KafkaClient.poll(self, timeout_ms, future)",
      "File /workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/client_async.py:781, in KafkaClient._poll(self, timeout)",
      "File /workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/conn.py:1131, in BrokerConnection.recv(self)",
      "File /workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/conn.py:1202, in BrokerConnection._recv(self)",
      "File /workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/metrics/stats/sensor.py:77, in Sensor.record(self, value, time_ms)",
      "File /workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/metrics/stats/rate.py:49, in Rate.record(self, config, value, time_ms)",
      "KeyboardInterrupt: "
     ]
    }
   ],
   "source": [
    "from datetime import datetime\n",
    "\n",
    "print(f\"Listening to {topic_name} and writing to PostgreSQL...\")\n",
    "count = 0\n",
    "try:\n",
    "    for message in consumer:\n",
    "        ride = message.value\n",
    "        # tpep_pickup_datetime is in epoch milliseconds (see models.py)\n",
    "        pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)\n",
    "        cur.execute(\n",
    "            \"\"\"INSERT INTO processed_events\n",
    "               (PULocationID, DOLocationID, trip_distance, total_amount, pickup_datetime)\n",
    "               VALUES (%s, %s, %s, %s, %s)\"\"\",\n",
    "            (ride.PULocationID, ride.DOLocationID,\n",
    "             ride.trip_distance, ride.total_amount, pickup_dt)\n",
    "        )\n",
    "        count += 1\n",
    "        if count % 100 == 0:\n",
    "            print(f\"Inserted {count} rows...\")\n",
    "finally:\n",
    "    # The loop only ends when the cell is interrupted, so clean up here.\n",
    "    consumer.close()\n",
    "    cur.close()\n",
    "    conn.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66840c80",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2bec0472",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "streaming-workshop",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 07-streaming/workshop/live/notebooks/models.py
================================================
import json
import dataclasses

from dataclasses import dataclass


@dataclass
class Ride:
    PULocationID: int
    DOLocationID: int
    trip_distance: float
    total_amount: float
    tpep_pickup_datetime: int  # epoch milliseconds


def ride_from_row(row):
    """Build a Ride from a DataFrame row; the pickup time becomes epoch milliseconds."""
    return Ride(
        PULocationID=int(row['PULocationID']),
        DOLocationID=int(row['DOLocationID']),
        trip_distance=float(row['trip_distance']),
        total_amount=float(row['total_amount']),
        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),
    )


def ride_serializer(ride):
    """Encode a Ride as UTF-8 JSON bytes (used as the Kafka value_serializer)."""
    ride_dict = dataclasses.asdict(ride)
    ride_json = json.dumps(ride_dict).encode('utf-8')
    return ride_json


def ride_deserializer(data):
    """Decode UTF-8 JSON bytes back into a Ride (used as the Kafka value_deserializer)."""
    json_str = data.decode('utf-8')
    ride_dict = json.loads(json_str)
    return Ride(**ride_dict)


================================================
FILE: 07-streaming/workshop/live/notebooks/producer.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "eebfcff0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1e3c198b",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2113c0a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "columns = ['PULocationID', 'DOLocationID', 'trip_distance', 'total_amount', 'tpep_pickup_datetime']\n",
    "df = pd.read_parquet(url, columns=columns).head(1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "05ed66d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from models import Ride, ride_from_row, ride_serializer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "26950bac",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ride(PULocationID=142, DOLocationID=237, trip_distance=2.28, total_amount=24.94, tpep_pickup_datetime=1761958147000)"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ride = ride_from_row(df.iloc[1])\n",
    "ride"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "05cfce95",
   "metadata": {},
   "outputs": [],
   "source": [
    "from kafka import KafkaProducer\n",
    "\n",
    "server = 'localhost:9092'\n",
    "\n",
    "producer = KafkaProducer(\n",
    "    bootstrap_servers=[server],\n",
    "    value_serializer=ride_serializer\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "21f5fff3",
   "metadata": {},
   "outputs": [],
   "source": [
    "topic_name = 'rides'\n",
    "\n",
    "producer.send(topic_name, value=ride)\n",
    "producer.flush()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "b17a175a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sent: Ride(PULocationID=43, DOLocationID=186, trip_distance=1.68, total_amount=22.15, tpep_pickup_datetime=1761956005000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=237, trip_distance=2.28, total_amount=24.94, tpep_pickup_datetime=1761958147000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=238, trip_distance=2.7, total_amount=25.62, tpep_pickup_datetime=1761955639000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=261, trip_distance=12.87, total_amount=86.14, tpep_pickup_datetime=1761955200000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=37, trip_distance=8.4, total_amount=48.65, tpep_pickup_datetime=1761956330000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=100, trip_distance=0.85, total_amount=16.45, tpep_pickup_datetime=1761956471000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=170, trip_distance=3.01, total_amount=25.85, tpep_pickup_datetime=1761955651000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=144, trip_distance=3.82, total_amount=57.54, tpep_pickup_datetime=1761958012000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=161, trip_distance=0.89, total_amount=12.95, tpep_pickup_datetime=1761958619000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=162, trip_distance=2.28, total_amount=38.68, tpep_pickup_datetime=1761955843000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=88, trip_distance=3.3, total_amount=44.0, tpep_pickup_datetime=1761955203000)\n",
      "Sent: Ride(PULocationID=88, DOLocationID=148, trip_distance=1.5, total_amount=19.55, tpep_pickup_datetime=1761957833000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=236, trip_distance=4.7, total_amount=47.65, tpep_pickup_datetime=1761958682000)\n",
      "Sent: Ride(PULocationID=87, DOLocationID=255, trip_distance=5.61, total_amount=38.85, tpep_pickup_datetime=1761958368000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=43, trip_distance=3.9, total_amount=46.55, tpep_pickup_datetime=1761955553000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=262, trip_distance=1.14, total_amount=14.9, tpep_pickup_datetime=1761956024000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=24, trip_distance=0.6, total_amount=9.12, tpep_pickup_datetime=1761955398000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=147, trip_distance=4.3, total_amount=29.2, tpep_pickup_datetime=1761956395000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=137, trip_distance=3.0, total_amount=32.75, tpep_pickup_datetime=1761957955000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.69, total_amount=11.5, tpep_pickup_datetime=1761955872000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=265, trip_distance=15.47, total_amount=106.63, tpep_pickup_datetime=1761955521000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=125, trip_distance=1.29, total_amount=22.26, tpep_pickup_datetime=1761955760000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=79, trip_distance=1.66, total_amount=32.34, tpep_pickup_datetime=1761957539000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.25, total_amount=22.25, tpep_pickup_datetime=1761958533000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=249, trip_distance=2.68, total_amount=48.68, tpep_pickup_datetime=1761956184000)\n",
      "Sent: Ride(PULocationID=4, DOLocationID=48, trip_distance=3.16, total_amount=33.15, tpep_pickup_datetime=1761958409000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=24, trip_distance=2.8, total_amount=24.55, tpep_pickup_datetime=1761956650000)\n",
      "Sent: Ride(PULocationID=143, DOLocationID=169, trip_distance=7.45, total_amount=44.04, tpep_pickup_datetime=1761956178000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=142, trip_distance=2.02, total_amount=17.8, tpep_pickup_datetime=1761957454000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=90, trip_distance=3.46, total_amount=35.7, tpep_pickup_datetime=1761957360000)\n",
      "Sent: Ride(PULocationID=50, DOLocationID=263, trip_distance=2.89, total_amount=26.46, tpep_pickup_datetime=1761956052000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=68, trip_distance=1.2, total_amount=27.3, tpep_pickup_datetime=1761956041000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.2, total_amount=13.02, tpep_pickup_datetime=1761957285000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=233, trip_distance=2.57, total_amount=26.15, tpep_pickup_datetime=1761957826000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=233, trip_distance=1.4, total_amount=24.75, tpep_pickup_datetime=1761956818000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=170, trip_distance=0.3, total_amount=13.0, tpep_pickup_datetime=1761957934000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=137, trip_distance=0.7, total_amount=17.7, tpep_pickup_datetime=1761958462000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=236, trip_distance=0.92, total_amount=13.8, tpep_pickup_datetime=1761956649000)\n",
      "Sent: Ride(PULocationID=43, DOLocationID=151, trip_distance=2.21, total_amount=17.4, tpep_pickup_datetime=1761957030000)\n",
      "Sent: Ride(PULocationID=151, DOLocationID=116, trip_distance=2.62, total_amount=21.75, tpep_pickup_datetime=1761957624000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=114, trip_distance=1.58, total_amount=35.25, tpep_pickup_datetime=1761957104000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=74, trip_distance=6.51, total_amount=52.08, tpep_pickup_datetime=1761958025000)\n",
      "Sent: Ride(PULocationID=166, DOLocationID=262, trip_distance=3.19, total_amount=27.24, tpep_pickup_datetime=1761956016000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=238, trip_distance=0.46, total_amount=12.12, tpep_pickup_datetime=1761956726000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=166, trip_distance=1.2, total_amount=17.16, tpep_pickup_datetime=1761957605000)\n",
      "Sent: Ride(PULocationID=66, DOLocationID=246, trip_distance=4.4, total_amount=30.35, tpep_pickup_datetime=1761955924000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=239, trip_distance=3.5, total_amount=29.85, tpep_pickup_datetime=1761958430000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=161, trip_distance=0.66, total_amount=14.25, tpep_pickup_datetime=1761956248000)\n",
      "Sent: Ride(PULocationID=43, DOLocationID=239, trip_distance=1.13, total_amount=16.38, tpep_pickup_datetime=1761956861000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=166, trip_distance=1.22, total_amount=13.32, tpep_pickup_datetime=1761957749000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=48, trip_distance=11.29, total_amount=81.18, tpep_pickup_datetime=1761958324000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=262, trip_distance=0.86, total_amount=14.2, tpep_pickup_datetime=1761955542000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=249, trip_distance=1.47, total_amount=24.78, tpep_pickup_datetime=1761958550000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=164, trip_distance=0.52, total_amount=20.47, tpep_pickup_datetime=1761955095000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=142, trip_distance=3.99, total_amount=38.22, tpep_pickup_datetime=1761955769000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=164, trip_distance=1.03, total_amount=16.35, tpep_pickup_datetime=1761955355000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=141, trip_distance=2.47, total_amount=27.75, tpep_pickup_datetime=1761956835000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=237, trip_distance=1.6, total_amount=18.45, tpep_pickup_datetime=1761958690000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=75, trip_distance=2.11, total_amount=20.52, tpep_pickup_datetime=1761955763000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=216, trip_distance=4.7, total_amount=24.3, tpep_pickup_datetime=1761957345000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=170, trip_distance=2.08, total_amount=31.5, tpep_pickup_datetime=1761958185000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=4, trip_distance=0.9, total_amount=19.72, tpep_pickup_datetime=1761957574000)\n",
      "Sent: Ride(PULocationID=4, DOLocationID=233, trip_distance=2.2, total_amount=22.05, tpep_pickup_datetime=1761958544000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=209, trip_distance=1.1, total_amount=19.7, tpep_pickup_datetime=1761958275000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=238, trip_distance=4.98, total_amount=45.75, tpep_pickup_datetime=1761955208000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=164, trip_distance=9.43, total_amount=59.54, tpep_pickup_datetime=1761955156000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=141, trip_distance=1.14, total_amount=17.22, tpep_pickup_datetime=1761957361000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=68, trip_distance=1.43, total_amount=23.21, tpep_pickup_datetime=1761955824000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=0.36, total_amount=12.15, tpep_pickup_datetime=1761956775000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=186, trip_distance=0.01, total_amount=-10.85, tpep_pickup_datetime=1761957695000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=186, trip_distance=0.01, total_amount=10.85, tpep_pickup_datetime=1761957695000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=265, trip_distance=4.31, total_amount=90.81, tpep_pickup_datetime=1761958059000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=141, trip_distance=3.92, total_amount=42.42, tpep_pickup_datetime=1761956086000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=173, trip_distance=7.98, total_amount=50.25, tpep_pickup_datetime=1761958460000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=48, trip_distance=3.03, total_amount=35.44, tpep_pickup_datetime=1761956112000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=140, trip_distance=0.54, total_amount=13.86, tpep_pickup_datetime=1761956344000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=1.46, total_amount=20.5, tpep_pickup_datetime=1761957032000)\n",
      "Sent: Ride(PULocationID=229, DOLocationID=48, trip_distance=1.8, total_amount=26.45, tpep_pickup_datetime=1761957521000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=233, trip_distance=0.91, total_amount=17.85, tpep_pickup_datetime=1761956210000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=113, trip_distance=1.61, total_amount=27.3, tpep_pickup_datetime=1761957229000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.33, total_amount=19.25, tpep_pickup_datetime=1761957297000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=265, trip_distance=45.7, total_amount=284.39, tpep_pickup_datetime=1761956656000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=231, trip_distance=1.75, total_amount=20.55, tpep_pickup_datetime=1761956254000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=229, trip_distance=4.32, total_amount=45.78, tpep_pickup_datetime=1761957567000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=224, trip_distance=1.99, total_amount=22.26, tpep_pickup_datetime=1761955509000)\n",
      "Sent: Ride(PULocationID=224, DOLocationID=141, trip_distance=2.85, total_amount=29.82, tpep_pickup_datetime=1761956184000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=239, trip_distance=1.32, total_amount=18.84, tpep_pickup_datetime=1761957687000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=1.0, total_amount=17.75, tpep_pickup_datetime=1761956921000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.29, total_amount=23.75, tpep_pickup_datetime=1761957629000)\n",
      "Sent: Ride(PULocationID=125, DOLocationID=186, trip_distance=1.81, total_amount=31.15, tpep_pickup_datetime=1761956520000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=249, trip_distance=1.14, total_amount=21.25, tpep_pickup_datetime=1761958570000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=237, trip_distance=10.18, total_amount=69.71, tpep_pickup_datetime=1761957894000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=43, trip_distance=2.16, total_amount=17.1, tpep_pickup_datetime=1761956402000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=1.7, total_amount=18.0, tpep_pickup_datetime=1761955253000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=7, trip_distance=4.5, total_amount=32.34, tpep_pickup_datetime=1761955970000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.4, total_amount=27.55, tpep_pickup_datetime=1761955608000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=244, trip_distance=10.5, total_amount=57.05, tpep_pickup_datetime=1761957376000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=255, trip_distance=7.32, total_amount=56.35, tpep_pickup_datetime=1761955670000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=261, trip_distance=6.49, total_amount=43.85, tpep_pickup_datetime=1761958702000)\n",
      "Sent: Ride(PULocationID=45, DOLocationID=97, trip_distance=3.2, total_amount=23.45, tpep_pickup_datetime=1761956745000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=186, trip_distance=2.9, total_amount=24.85, tpep_pickup_datetime=1761958647000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=0.97, total_amount=23.94, tpep_pickup_datetime=1761956073000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=237, trip_distance=1.73, total_amount=22.26, tpep_pickup_datetime=1761957100000)\n",
      "Sent: Ride(PULocationID=45, DOLocationID=79, trip_distance=1.32, total_amount=22.26, tpep_pickup_datetime=1761957425000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.64, total_amount=24.72, tpep_pickup_datetime=1761958384000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.6, total_amount=23.1, tpep_pickup_datetime=1761956460000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=41, trip_distance=7.6, total_amount=49.4, tpep_pickup_datetime=1761957518000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=233, trip_distance=1.1, total_amount=22.3, tpep_pickup_datetime=1761956203000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=75, trip_distance=2.7, total_amount=23.2, tpep_pickup_datetime=1761957674000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=125, trip_distance=3.43, total_amount=49.14, tpep_pickup_datetime=1761957040000)\n",
      "Sent: Ride(PULocationID=209, DOLocationID=90, trip_distance=3.28, total_amount=31.15, tpep_pickup_datetime=1761956343000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=125, trip_distance=2.08, total_amount=34.86, tpep_pickup_datetime=1761958287000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=113, trip_distance=2.01, total_amount=41.58, tpep_pickup_datetime=1761956311000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.15, total_amount=10.44, tpep_pickup_datetime=1761955368000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=48, trip_distance=1.4, total_amount=22.75, tpep_pickup_datetime=1761957364000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=68, trip_distance=2.4, total_amount=24.8, tpep_pickup_datetime=1761956279000)\n",
      "Sent: Ride(PULocationID=262, DOLocationID=90, trip_distance=4.3, total_amount=52.5, tpep_pickup_datetime=1761956236000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=246, trip_distance=0.72, total_amount=13.86, tpep_pickup_datetime=1761956524000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.38, total_amount=12.55, tpep_pickup_datetime=1761957465000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.95, total_amount=17.22, tpep_pickup_datetime=1761958160000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=236, trip_distance=4.25, total_amount=46.62, tpep_pickup_datetime=1761955309000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=75, trip_distance=0.83, total_amount=11.5, tpep_pickup_datetime=1761955738000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=137, trip_distance=2.0, total_amount=20.75, tpep_pickup_datetime=1761957902000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=140, trip_distance=0.49, total_amount=10.1, tpep_pickup_datetime=1761956499000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.3, total_amount=17.16, tpep_pickup_datetime=1761957396000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=249, trip_distance=3.49, total_amount=30.05, tpep_pickup_datetime=1761958009000)\n",
      "Sent: Ride(PULocationID=87, DOLocationID=229, trip_distance=6.31, total_amount=40.74, tpep_pickup_datetime=1761958328000)\n",
      "Sent: Ride(PULocationID=50, DOLocationID=68, trip_distance=1.9, total_amount=20.55, tpep_pickup_datetime=1761955870000)\n",
      "Sent: Ride(PULocationID=87, DOLocationID=49, trip_distance=3.71, total_amount=25.55, tpep_pickup_datetime=1761957107000)\n",
      "Sent: Ride(PULocationID=97, DOLocationID=256, trip_distance=3.39, total_amount=27.6, tpep_pickup_datetime=1761958421000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=236, trip_distance=5.79, total_amount=40.95, tpep_pickup_datetime=1761955769000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=143, trip_distance=3.23, total_amount=23.7, tpep_pickup_datetime=1761957945000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=232, trip_distance=1.6, total_amount=19.25, tpep_pickup_datetime=1761956045000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=229, trip_distance=0.7, total_amount=13.85, tpep_pickup_datetime=1761956383000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.3, total_amount=13.4, tpep_pickup_datetime=1761957107000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=263, trip_distance=0.7, total_amount=12.8, tpep_pickup_datetime=1761957502000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=229, trip_distance=1.7, total_amount=18.9, tpep_pickup_datetime=1761958191000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=148, trip_distance=1.04, total_amount=24.78, tpep_pickup_datetime=1761958597000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=265, trip_distance=0.0, total_amount=124.25, tpep_pickup_datetime=1761956790000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=229, trip_distance=2.57, total_amount=28.14, tpep_pickup_datetime=1761956518000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=48, trip_distance=1.06, total_amount=19.57, tpep_pickup_datetime=1761956704000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.32, total_amount=13.02, tpep_pickup_datetime=1761955678000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=162, trip_distance=1.73, total_amount=26.46, tpep_pickup_datetime=1761956006000)\n",
      "Sent: Ride(PULocationID=74, DOLocationID=236, trip_distance=1.55, total_amount=17.16, tpep_pickup_datetime=1761958603000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=230, trip_distance=2.16, total_amount=33.69, tpep_pickup_datetime=1761955305000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=231, trip_distance=1.99, total_amount=31.5, tpep_pickup_datetime=1761955569000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.04, total_amount=19.95, tpep_pickup_datetime=1761956993000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=170, trip_distance=1.53, total_amount=19.95, tpep_pickup_datetime=1761958077000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=234, trip_distance=2.63, total_amount=25.62, tpep_pickup_datetime=1761958784000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=141, trip_distance=4.28, total_amount=38.22, tpep_pickup_datetime=1761958147000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=230, trip_distance=3.19, total_amount=42.42, tpep_pickup_datetime=1761955883000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.38, total_amount=18.9, tpep_pickup_datetime=1761958302000)\n",
      "Sent: Ride(PULocationID=261, DOLocationID=186, trip_distance=3.2, total_amount=34.0, tpep_pickup_datetime=1761957372000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=68, trip_distance=0.85, total_amount=26.81, tpep_pickup_datetime=1761955469000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=237, trip_distance=3.15, total_amount=33.8, tpep_pickup_datetime=1761956993000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=114, trip_distance=1.4, total_amount=31.5, tpep_pickup_datetime=1761955287000)\n",
      "Sent: Ride(PULocationID=114, DOLocationID=230, trip_distance=2.8, total_amount=31.05, tpep_pickup_datetime=1761956826000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=43, trip_distance=0.55, total_amount=13.02, tpep_pickup_datetime=1761955330000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=143, trip_distance=0.7, total_amount=13.8, tpep_pickup_datetime=1761955615000)\n",
      "Sent: Ride(PULocationID=87, DOLocationID=141, trip_distance=6.07, total_amount=34.55, tpep_pickup_datetime=1761956252000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=186, trip_distance=2.8, total_amount=33.18, tpep_pickup_datetime=1761956258000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=48, trip_distance=1.54, total_amount=21.35, tpep_pickup_datetime=1761958653000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=48, trip_distance=1.3, total_amount=17.35, tpep_pickup_datetime=1761957861000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=141, trip_distance=1.6, total_amount=23.95, tpep_pickup_datetime=1761958511000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=4.26, total_amount=39.9, tpep_pickup_datetime=1761956321000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=164, trip_distance=2.85, total_amount=28.14, tpep_pickup_datetime=1761958363000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=140, trip_distance=3.86, total_amount=42.42, tpep_pickup_datetime=1761956960000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=68, trip_distance=4.44, total_amount=34.02, tpep_pickup_datetime=1761958780000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=234, trip_distance=2.0, total_amount=28.45, tpep_pickup_datetime=1761956085000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=48, trip_distance=2.1, total_amount=37.2, tpep_pickup_datetime=1761957500000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=159, trip_distance=4.96, total_amount=30.4, tpep_pickup_datetime=1761956905000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=237, trip_distance=1.48, total_amount=21.42, tpep_pickup_datetime=1761955928000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.0, total_amount=16.45, tpep_pickup_datetime=1761956435000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=0.6, total_amount=13.85, tpep_pickup_datetime=1761957093000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=229, trip_distance=1.5, total_amount=24.75, tpep_pickup_datetime=1761957404000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=224, trip_distance=4.26, total_amount=50.85, tpep_pickup_datetime=1761955719000)\n",
      "Sent: Ride(PULocationID=224, DOLocationID=233, trip_distance=1.07, total_amount=17.22, tpep_pickup_datetime=1761958748000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=162, trip_distance=2.38, total_amount=34.86, tpep_pickup_datetime=1761956435000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=24, trip_distance=3.81, total_amount=28.55, tpep_pickup_datetime=1761958415000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=246, trip_distance=0.77, total_amount=15.05, tpep_pickup_datetime=1761956032000)\n",
      "Sent: Ride(PULocationID=262, DOLocationID=262, trip_distance=0.0, total_amount=8.0, tpep_pickup_datetime=1761956477000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=0.8, total_amount=13.5, tpep_pickup_datetime=1761957782000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=238, trip_distance=2.0, total_amount=18.84, tpep_pickup_datetime=1761958251000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=141, trip_distance=1.78, total_amount=18.09, tpep_pickup_datetime=1761955958000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=74, trip_distance=2.49, total_amount=21.6, tpep_pickup_datetime=1761956602000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=87, trip_distance=2.61, total_amount=24.78, tpep_pickup_datetime=1761958149000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=237, trip_distance=1.84, total_amount=17.15, tpep_pickup_datetime=1761955513000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=262, trip_distance=1.05, total_amount=15.48, tpep_pickup_datetime=1761956227000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=163, trip_distance=0.45, total_amount=15.54, tpep_pickup_datetime=1761954619000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=68, trip_distance=1.8, total_amount=49.98, tpep_pickup_datetime=1761955361000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=114, trip_distance=1.88, total_amount=31.94, tpep_pickup_datetime=1761958526000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=87, trip_distance=4.4, total_amount=57.3, tpep_pickup_datetime=1761956034000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=237, trip_distance=0.89, total_amount=15.39, tpep_pickup_datetime=1761958187000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=229, trip_distance=4.88, total_amount=43.73, tpep_pickup_datetime=1761958624000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=151, trip_distance=1.0, total_amount=14.6, tpep_pickup_datetime=1761956087000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=261, trip_distance=5.0, total_amount=50.15, tpep_pickup_datetime=1761957798000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=148, trip_distance=0.39, total_amount=23.8, tpep_pickup_datetime=1761955384000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=223, trip_distance=8.64, total_amount=60.06, tpep_pickup_datetime=1761956689000)\n",
      "Sent: Ride(PULocationID=256, DOLocationID=107, trip_distance=3.25, total_amount=40.35, tpep_pickup_datetime=1761955754000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=68, trip_distance=1.47, total_amount=18.55, tpep_pickup_datetime=1761957878000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=7, trip_distance=4.12, total_amount=31.8, tpep_pickup_datetime=1761955283000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=113, trip_distance=9.36, total_amount=75.5, tpep_pickup_datetime=1761957063000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=158, trip_distance=0.96, total_amount=23.94, tpep_pickup_datetime=1761957835000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=230, trip_distance=1.28, total_amount=20.56, tpep_pickup_datetime=1761955072000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=107, trip_distance=1.7, total_amount=24.95, tpep_pickup_datetime=1761956114000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=137, trip_distance=0.64, total_amount=13.63, tpep_pickup_datetime=1761957625000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=107, trip_distance=1.17, total_amount=27.3, tpep_pickup_datetime=1761958042000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=68, trip_distance=3.48, total_amount=37.38, tpep_pickup_datetime=1761956556000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=75, trip_distance=3.25, total_amount=25.62, tpep_pickup_datetime=1761955328000)\n",
      "Sent: Ride(PULocationID=43, DOLocationID=237, trip_distance=1.06, total_amount=18.35, tpep_pickup_datetime=1761956831000)\n",
      "Sent: Ride(PULocationID=43, DOLocationID=236, trip_distance=2.03, total_amount=22.26, tpep_pickup_datetime=1761957676000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=246, trip_distance=0.54, total_amount=16.55, tpep_pickup_datetime=1761955938000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=236, trip_distance=3.49, total_amount=35.7, tpep_pickup_datetime=1761956552000)\n",
      "Sent: Ride(PULocationID=75, DOLocationID=238, trip_distance=1.23, total_amount=12.9, tpep_pickup_datetime=1761958460000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=229, trip_distance=10.1, total_amount=72.24, tpep_pickup_datetime=1761956382000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=230, trip_distance=1.0, total_amount=17.15, tpep_pickup_datetime=1761958646000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=177, trip_distance=9.77, total_amount=54.55, tpep_pickup_datetime=1761957231000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=262, trip_distance=1.9, total_amount=21.35, tpep_pickup_datetime=1761958531000)\n",
      "Sent: Ride(PULocationID=125, DOLocationID=239, trip_distance=4.72, total_amount=47.46, tpep_pickup_datetime=1761955991000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=262, trip_distance=2.02, total_amount=20.5, tpep_pickup_datetime=1761958263000)\n",
      "Sent: Ride(PULocationID=224, DOLocationID=231, trip_distance=3.16, total_amount=27.65, tpep_pickup_datetime=1761957209000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=90, trip_distance=3.67, total_amount=34.02, tpep_pickup_datetime=1761956212000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=90, trip_distance=0.18, total_amount=19.74, tpep_pickup_datetime=1761957726000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=79, trip_distance=0.98, total_amount=27.3, tpep_pickup_datetime=1761958631000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=90, trip_distance=4.7, total_amount=40.74, tpep_pickup_datetime=1761956974000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=68, trip_distance=0.85, total_amount=19.55, tpep_pickup_datetime=1761955241000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=161, trip_distance=1.37, total_amount=24.94, tpep_pickup_datetime=1761956360000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=232, trip_distance=2.09, total_amount=22.26, tpep_pickup_datetime=1761956603000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=78, trip_distance=8.9, total_amount=35.0, tpep_pickup_datetime=1761957013000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=236, trip_distance=7.09, total_amount=43.65, tpep_pickup_datetime=1761955507000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=2.0, total_amount=18.81, tpep_pickup_datetime=1761957857000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=50, trip_distance=1.81, total_amount=20.58, tpep_pickup_datetime=1761958528000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=113, trip_distance=1.25, total_amount=24.78, tpep_pickup_datetime=1761956381000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=107, trip_distance=1.44, total_amount=24.78, tpep_pickup_datetime=1761957581000)\n",
      "Sent: Ride(PULocationID=50, DOLocationID=90, trip_distance=2.4, total_amount=33.2, tpep_pickup_datetime=1761955786000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=0.6, total_amount=17.05, tpep_pickup_datetime=1761957365000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=68, trip_distance=0.8, total_amount=16.75, tpep_pickup_datetime=1761958508000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=246, trip_distance=2.2, total_amount=27.3, tpep_pickup_datetime=1761955399000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=90, trip_distance=1.5, total_amount=18.45, tpep_pickup_datetime=1761957057000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=107, trip_distance=1.1, total_amount=24.8, tpep_pickup_datetime=1761957738000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=234, trip_distance=2.5, total_amount=34.0, tpep_pickup_datetime=1761956224000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=186, trip_distance=1.2, total_amount=18.9, tpep_pickup_datetime=1761955938000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=166, trip_distance=6.1, total_amount=43.3, tpep_pickup_datetime=1761957437000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=239, trip_distance=0.7, total_amount=15.5, tpep_pickup_datetime=1761956589000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=113, trip_distance=0.37, total_amount=17.15, tpep_pickup_datetime=1761957323000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.2, total_amount=24.8, tpep_pickup_datetime=1761956590000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=107, trip_distance=1.0, total_amount=20.65, tpep_pickup_datetime=1761957748000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.34, total_amount=24.15, tpep_pickup_datetime=1761958718000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=137, trip_distance=2.4, total_amount=27.78, tpep_pickup_datetime=1761956073000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=232, trip_distance=3.1, total_amount=34.85, tpep_pickup_datetime=1761955678000)\n",
      "Sent: Ride(PULocationID=232, DOLocationID=263, trip_distance=5.4, total_amount=36.55, tpep_pickup_datetime=1761957916000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=43, trip_distance=0.51, total_amount=12.96, tpep_pickup_datetime=1761957600000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=1.02, total_amount=17.88, tpep_pickup_datetime=1761958142000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=75, trip_distance=3.62, total_amount=27.3, tpep_pickup_datetime=1761957421000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.54, total_amount=11.4, tpep_pickup_datetime=1761957775000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=140, trip_distance=0.89, total_amount=13.02, tpep_pickup_datetime=1761958544000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=17, trip_distance=7.47, total_amount=65.94, tpep_pickup_datetime=1761956096000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.8, total_amount=23.94, tpep_pickup_datetime=1761956686000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=229, trip_distance=3.43, total_amount=34.02, tpep_pickup_datetime=1761955666000)\n",
      "Sent: Ride(PULocationID=229, DOLocationID=262, trip_distance=1.62, total_amount=17.05, tpep_pickup_datetime=1761957259000)\n",
      "Sent: Ride(PULocationID=262, DOLocationID=237, trip_distance=1.32, total_amount=18.0, tpep_pickup_datetime=1761957709000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=0.9, total_amount=16.32, tpep_pickup_datetime=1761958396000)\n",
      "Sent: Ride(PULocationID=4, DOLocationID=107, trip_distance=1.3, total_amount=19.7, tpep_pickup_datetime=1761957455000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=170, trip_distance=0.93, total_amount=16.38, tpep_pickup_datetime=1761955645000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=164, trip_distance=1.12, total_amount=24.15, tpep_pickup_datetime=1761955455000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=236, trip_distance=0.39, total_amount=11.11, tpep_pickup_datetime=1761957759000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=231, trip_distance=1.82, total_amount=24.95, tpep_pickup_datetime=1761955330000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=162, trip_distance=6.21, total_amount=43.05, tpep_pickup_datetime=1761956607000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.27, total_amount=11.75, tpep_pickup_datetime=1761956212000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=43, trip_distance=1.46, total_amount=15.0, tpep_pickup_datetime=1761958020000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=141, trip_distance=0.5, total_amount=14.95, tpep_pickup_datetime=1761956740000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=87, trip_distance=4.1, total_amount=51.35, tpep_pickup_datetime=1761958460000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=116, trip_distance=18.5, total_amount=96.24, tpep_pickup_datetime=1761957126000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=17, trip_distance=9.96, total_amount=61.39, tpep_pickup_datetime=1761957438000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=170, trip_distance=4.21, total_amount=34.55, tpep_pickup_datetime=1761957583000)\n",
      "Sent: Ride(PULocationID=4, DOLocationID=263, trip_distance=3.7, total_amount=31.5, tpep_pickup_datetime=1761956232000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=263, trip_distance=0.0, total_amount=8.0, tpep_pickup_datetime=1761957635000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=229, trip_distance=1.9, total_amount=27.3, tpep_pickup_datetime=1761958640000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=164, trip_distance=1.05, total_amount=15.75, tpep_pickup_datetime=1761956099000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=170, trip_distance=1.13, total_amount=15.75, tpep_pickup_datetime=1761956710000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=256, trip_distance=5.72, total_amount=60.9, tpep_pickup_datetime=1761957394000)\n",
      "Sent: Ride(PULocationID=255, DOLocationID=49, trip_distance=4.19, total_amount=46.08, tpep_pickup_datetime=1761956379000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=148, trip_distance=2.1, total_amount=29.76, tpep_pickup_datetime=1761955716000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=249, trip_distance=1.3, total_amount=18.9, tpep_pickup_datetime=1761957598000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=87, trip_distance=2.2, total_amount=26.45, tpep_pickup_datetime=1761958322000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.35, total_amount=15.54, tpep_pickup_datetime=1761956616000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=246, trip_distance=1.21, total_amount=15.75, tpep_pickup_datetime=1761955343000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=246, trip_distance=0.83, total_amount=19.95, tpep_pickup_datetime=1761955909000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=263, trip_distance=4.38, total_amount=34.02, tpep_pickup_datetime=1761958565000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=107, trip_distance=16.64, total_amount=93.44, tpep_pickup_datetime=1761956975000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=233, trip_distance=1.44, total_amount=18.45, tpep_pickup_datetime=1761955379000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=246, trip_distance=2.65, total_amount=35.55, tpep_pickup_datetime=1761956312000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=229, trip_distance=0.71, total_amount=13.86, tpep_pickup_datetime=1761958144000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=79, trip_distance=3.44, total_amount=40.74, tpep_pickup_datetime=1761957545000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=90, trip_distance=0.6, total_amount=17.22, tpep_pickup_datetime=1761955431000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=1.1, total_amount=19.85, tpep_pickup_datetime=1761956413000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=62, trip_distance=8.16, total_amount=53.55, tpep_pickup_datetime=1761956385000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=79, trip_distance=1.11, total_amount=26.46, tpep_pickup_datetime=1761955381000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=4, trip_distance=0.77, total_amount=21.42, tpep_pickup_datetime=1761956586000)\n",
      "Sent: Ride(PULocationID=4, DOLocationID=107, trip_distance=1.11, total_amount=18.06, tpep_pickup_datetime=1761957433000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=68, trip_distance=2.16, total_amount=28.95, tpep_pickup_datetime=1761956224000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=90, trip_distance=0.8, total_amount=19.74, tpep_pickup_datetime=1761957722000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=246, trip_distance=1.77, total_amount=21.95, tpep_pickup_datetime=1761958441000)\n",
      "Sent: Ride(PULocationID=87, DOLocationID=79, trip_distance=2.87, total_amount=34.02, tpep_pickup_datetime=1761957364000)\n",
      "Sent: Ride(PULocationID=43, DOLocationID=140, trip_distance=2.05, total_amount=17.15, tpep_pickup_datetime=1761956821000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=263, trip_distance=0.6, total_amount=13.8, tpep_pickup_datetime=1761957600000)\n",
      "Sent: Ride(PULocationID=43, DOLocationID=249, trip_distance=3.17, total_amount=35.82, tpep_pickup_datetime=1761958522000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=170, trip_distance=0.7, total_amount=17.75, tpep_pickup_datetime=1761955603000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=8.9, total_amount=64.74, tpep_pickup_datetime=1761958130000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=142, trip_distance=1.9, total_amount=23.9, tpep_pickup_datetime=1761956855000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=41, trip_distance=2.4, total_amount=22.2, tpep_pickup_datetime=1761958017000)\n",
      "Sent: Ride(PULocationID=166, DOLocationID=151, trip_distance=0.71, total_amount=11.7, tpep_pickup_datetime=1761955955000)\n",
      "Sent: Ride(PULocationID=166, DOLocationID=243, trip_distance=5.58, total_amount=34.32, tpep_pickup_datetime=1761956353000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=236, trip_distance=0.64, total_amount=12.12, tpep_pickup_datetime=1761955402000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=170, trip_distance=3.58, total_amount=34.02, tpep_pickup_datetime=1761955588000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=236, trip_distance=1.3, total_amount=17.2, tpep_pickup_datetime=1761956232000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=74, trip_distance=7.55, total_amount=64.26, tpep_pickup_datetime=1761958722000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=90, trip_distance=1.0, total_amount=24.78, tpep_pickup_datetime=1761955226000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=137, trip_distance=1.51, total_amount=26.95, tpep_pickup_datetime=1761956284000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=263, trip_distance=2.0, total_amount=19.15, tpep_pickup_datetime=1761958009000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=1.0, total_amount=19.74, tpep_pickup_datetime=1761955806000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=233, trip_distance=0.15, total_amount=19.75, tpep_pickup_datetime=1761956644000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=256, trip_distance=4.57, total_amount=45.05, tpep_pickup_datetime=1761955504000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=161, trip_distance=0.67, total_amount=13.02, tpep_pickup_datetime=1761956393000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=158, trip_distance=0.07, total_amount=10.15, tpep_pickup_datetime=1761956927000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=107, trip_distance=0.99, total_amount=33.69, tpep_pickup_datetime=1761957291000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=100, trip_distance=1.46, total_amount=20.95, tpep_pickup_datetime=1761956040000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=233, trip_distance=1.13, total_amount=25.05, tpep_pickup_datetime=1761956974000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=68, trip_distance=2.89, total_amount=24.13, tpep_pickup_datetime=1761956628000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=233, trip_distance=1.9, total_amount=23.1, tpep_pickup_datetime=1761957676000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=68, trip_distance=0.82, total_amount=13.95, tpep_pickup_datetime=1761956420000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=142, trip_distance=3.3, total_amount=36.54, tpep_pickup_datetime=1761957077000)\n",
      "Sent: Ride(PULocationID=143, DOLocationID=79, trip_distance=7.5, total_amount=59.05, tpep_pickup_datetime=1761956774000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.08, total_amount=16.38, tpep_pickup_datetime=1761957693000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=263, trip_distance=4.19, total_amount=35.44, tpep_pickup_datetime=1761958288000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=229, trip_distance=0.52, total_amount=15.54, tpep_pickup_datetime=1761955463000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=238, trip_distance=2.32, total_amount=23.04, tpep_pickup_datetime=1761956228000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.48, total_amount=12.12, tpep_pickup_datetime=1761956375000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=140, trip_distance=1.2, total_amount=16.32, tpep_pickup_datetime=1761956646000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=142, trip_distance=4.38, total_amount=38.41, tpep_pickup_datetime=1761955989000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.42, total_amount=15.05, tpep_pickup_datetime=1761958537000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.3, total_amount=11.55, tpep_pickup_datetime=1761957747000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=249, trip_distance=0.98, total_amount=19.74, tpep_pickup_datetime=1761958266000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.86, total_amount=20.58, tpep_pickup_datetime=1761955096000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=229, trip_distance=2.06, total_amount=23.94, tpep_pickup_datetime=1761956339000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=141, trip_distance=0.89, total_amount=15.54, tpep_pickup_datetime=1761955468000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=262, trip_distance=0.97, total_amount=11.5, tpep_pickup_datetime=1761956055000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=158, trip_distance=5.59, total_amount=45.78, tpep_pickup_datetime=1761956318000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=68, trip_distance=1.17, total_amount=19.57, tpep_pickup_datetime=1761958429000)\n",
      "Sent: Ride(PULocationID=24, DOLocationID=75, trip_distance=1.38, total_amount=15.0, tpep_pickup_datetime=1761956097000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=224, trip_distance=0.83, total_amount=19.25, tpep_pickup_datetime=1761958771000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=48, trip_distance=1.34, total_amount=24.78, tpep_pickup_datetime=1761955846000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=41, trip_distance=3.75, total_amount=31.4, tpep_pickup_datetime=1761957073000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=263, trip_distance=2.33, total_amount=23.04, tpep_pickup_datetime=1761958753000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=61, trip_distance=7.04, total_amount=58.36, tpep_pickup_datetime=1761957360000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=163, trip_distance=1.1, total_amount=21.4, tpep_pickup_datetime=1761955669000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=261, trip_distance=5.9, total_amount=52.5, tpep_pickup_datetime=1761956719000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=233, trip_distance=1.16, total_amount=16.45, tpep_pickup_datetime=1761956976000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=107, trip_distance=0.98, total_amount=18.06, tpep_pickup_datetime=1761957777000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=24, trip_distance=3.75, total_amount=26.25, tpep_pickup_datetime=1761957045000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=170, trip_distance=2.2, total_amount=21.17, tpep_pickup_datetime=1761957744000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=48, trip_distance=1.2, total_amount=18.55, tpep_pickup_datetime=1761955558000)\n",
      "Sent: Ride(PULocationID=74, DOLocationID=42, trip_distance=2.21, total_amount=15.3, tpep_pickup_datetime=1761955487000)\n",
      "Sent: Ride(PULocationID=151, DOLocationID=238, trip_distance=0.71, total_amount=11.7, tpep_pickup_datetime=1761957229000)\n",
      "Sent: Ride(PULocationID=143, DOLocationID=151, trip_distance=2.15, total_amount=19.58, tpep_pickup_datetime=1761958183000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=230, trip_distance=1.9, total_amount=21.4, tpep_pickup_datetime=1761957928000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=230, trip_distance=0.3, total_amount=11.55, tpep_pickup_datetime=1761958796000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=233, trip_distance=2.3, total_amount=31.05, tpep_pickup_datetime=1761957266000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=262, trip_distance=3.0, total_amount=19.25, tpep_pickup_datetime=1761958773000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=50, trip_distance=1.26, total_amount=23.05, tpep_pickup_datetime=1761958750000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=186, trip_distance=0.59, total_amount=15.54, tpep_pickup_datetime=1761956784000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=48, trip_distance=1.39, total_amount=16.45, tpep_pickup_datetime=1761957244000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=238, trip_distance=2.04, total_amount=25.62, tpep_pickup_datetime=1761958076000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=141, trip_distance=3.93, total_amount=40.74, tpep_pickup_datetime=1761957260000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=4, trip_distance=1.98, total_amount=36.54, tpep_pickup_datetime=1761957796000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=112, trip_distance=3.96, total_amount=49.91, tpep_pickup_datetime=1761956053000)\n",
      "Sent: Ride(PULocationID=211, DOLocationID=137, trip_distance=2.41, total_amount=29.25, tpep_pickup_datetime=1761955544000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=141, trip_distance=1.48, total_amount=18.06, tpep_pickup_datetime=1761958664000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=138, trip_distance=11.3, total_amount=71.29, tpep_pickup_datetime=1761956418000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=11.04, total_amount=74.46, tpep_pickup_datetime=1761958340000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=79, trip_distance=1.82, total_amount=44.1, tpep_pickup_datetime=1761957003000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=237, trip_distance=0.64, total_amount=14.2, tpep_pickup_datetime=1761955771000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=162, trip_distance=1.19, total_amount=19.74, tpep_pickup_datetime=1761956127000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=80, trip_distance=5.8, total_amount=52.43, tpep_pickup_datetime=1761956989000)\n",
      "Sent: Ride(PULocationID=262, DOLocationID=87, trip_distance=6.6, total_amount=66.8, tpep_pickup_datetime=1761955277000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=141, trip_distance=14.9, total_amount=81.0, tpep_pickup_datetime=1761956893000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=230, trip_distance=2.16, total_amount=34.02, tpep_pickup_datetime=1761955901000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=236, trip_distance=2.67, total_amount=23.65, tpep_pickup_datetime=1761957546000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=179, trip_distance=3.9, total_amount=29.0, tpep_pickup_datetime=1761955359000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=113, trip_distance=1.0, total_amount=17.85, tpep_pickup_datetime=1761955818000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=48, trip_distance=3.0, total_amount=34.85, tpep_pickup_datetime=1761956723000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=113, trip_distance=0.87, total_amount=29.95, tpep_pickup_datetime=1761955542000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=236, trip_distance=3.63, total_amount=41.56, tpep_pickup_datetime=1761957744000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=170, trip_distance=1.92, total_amount=25.75, tpep_pickup_datetime=1761956455000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=48, trip_distance=1.12, total_amount=21.42, tpep_pickup_datetime=1761957962000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.2, total_amount=15.05, tpep_pickup_datetime=1761957985000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=263, trip_distance=2.0, total_amount=16.45, tpep_pickup_datetime=1761958639000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=1.4, total_amount=16.35, tpep_pickup_datetime=1761956025000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=141, trip_distance=2.8, total_amount=27.3, tpep_pickup_datetime=1761958394000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=186, trip_distance=1.2, total_amount=24.05, tpep_pickup_datetime=1761956352000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=142, trip_distance=2.1, total_amount=22.25, tpep_pickup_datetime=1761957923000)\n",
      "Sent: Ride(PULocationID=50, DOLocationID=246, trip_distance=1.25, total_amount=16.38, tpep_pickup_datetime=1761958378000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=75, trip_distance=5.25, total_amount=37.35, tpep_pickup_datetime=1761958709000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=170, trip_distance=0.59, total_amount=14.65, tpep_pickup_datetime=1761958613000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=161, trip_distance=1.56, total_amount=18.9, tpep_pickup_datetime=1761956052000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=7, trip_distance=3.47, total_amount=31.5, tpep_pickup_datetime=1761956801000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=107, trip_distance=0.54, total_amount=14.15, tpep_pickup_datetime=1761955290000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=229, trip_distance=1.92, total_amount=30.65, tpep_pickup_datetime=1761955702000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=232, trip_distance=5.7, total_amount=56.7, tpep_pickup_datetime=1761955310000)\n",
      "Sent: Ride(PULocationID=34, DOLocationID=263, trip_distance=8.81, total_amount=54.9, tpep_pickup_datetime=1761956216000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=263, trip_distance=1.8, total_amount=17.15, tpep_pickup_datetime=1761956756000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=233, trip_distance=2.3, total_amount=23.2, tpep_pickup_datetime=1761957345000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=163, trip_distance=1.2, total_amount=17.2, tpep_pickup_datetime=1761958050000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=42, trip_distance=3.5, total_amount=24.8, tpep_pickup_datetime=1761958674000)\n",
      "Sent: Ride(PULocationID=143, DOLocationID=81, trip_distance=13.98, total_amount=62.65, tpep_pickup_datetime=1761957611000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=263, trip_distance=7.04, total_amount=47.46, tpep_pickup_datetime=1761958101000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=230, trip_distance=0.28, total_amount=11.55, tpep_pickup_datetime=1761955440000)\n",
      "Sent: Ride(PULocationID=265, DOLocationID=265, trip_distance=0.0, total_amount=71.0, tpep_pickup_datetime=1761957594000)\n",
      "Sent: Ride(PULocationID=265, DOLocationID=265, trip_distance=0.0, total_amount=85.2, tpep_pickup_datetime=1761957770000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=87, trip_distance=4.0, total_amount=47.35, tpep_pickup_datetime=1761958153000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=231, trip_distance=0.3, total_amount=12.15, tpep_pickup_datetime=1761955983000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=1.9, total_amount=21.35, tpep_pickup_datetime=1761958793000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=238, trip_distance=1.6, total_amount=17.0, tpep_pickup_datetime=1761956346000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=0.8, total_amount=13.8, tpep_pickup_datetime=1761956873000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=263, trip_distance=2.3, total_amount=22.25, tpep_pickup_datetime=1761957322000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=66, trip_distance=4.8, total_amount=31.85, tpep_pickup_datetime=1761954981000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=239, trip_distance=4.01, total_amount=37.38, tpep_pickup_datetime=1761955907000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=249, trip_distance=4.0, total_amount=52.5, tpep_pickup_datetime=1761955461000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=236, trip_distance=5.02, total_amount=42.95, tpep_pickup_datetime=1761958553000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=262, trip_distance=5.8, total_amount=40.75, tpep_pickup_datetime=1761957082000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=68, trip_distance=3.74, total_amount=34.56, tpep_pickup_datetime=1761956465000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=75, trip_distance=1.98, total_amount=15.7, tpep_pickup_datetime=1761955693000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=158, trip_distance=2.0, total_amount=24.8, tpep_pickup_datetime=1761957491000)\n",
      "Sent: Ride(PULocationID=43, DOLocationID=170, trip_distance=1.8, total_amount=23.9, tpep_pickup_datetime=1761955415000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=113, trip_distance=1.0, total_amount=24.75, tpep_pickup_datetime=1761956410000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=1.4, total_amount=25.45, tpep_pickup_datetime=1761957751000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=229, trip_distance=1.5, total_amount=16.75, tpep_pickup_datetime=1761956645000)\n",
      "Sent: Ride(PULocationID=229, DOLocationID=263, trip_distance=1.9, total_amount=19.7, tpep_pickup_datetime=1761957347000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=79, trip_distance=3.7, total_amount=36.5, tpep_pickup_datetime=1761957967000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=4, trip_distance=2.35, total_amount=45.04, tpep_pickup_datetime=1761955695000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=230, trip_distance=3.06, total_amount=34.02, tpep_pickup_datetime=1761958084000)\n",
      "Sent: Ride(PULocationID=151, DOLocationID=164, trip_distance=3.53, total_amount=37.38, tpep_pickup_datetime=1761956326000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=79, trip_distance=1.62, total_amount=24.05, tpep_pickup_datetime=1761958296000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=141, trip_distance=1.78, total_amount=22.35, tpep_pickup_datetime=1761956815000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=161, trip_distance=3.12, total_amount=37.45, tpep_pickup_datetime=1761956190000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=137, trip_distance=0.71, total_amount=14.7, tpep_pickup_datetime=1761958680000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=239, trip_distance=4.03, total_amount=36.15, tpep_pickup_datetime=1761957513000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=239, trip_distance=1.43, total_amount=17.7, tpep_pickup_datetime=1761955297000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=263, trip_distance=4.9, total_amount=33.85, tpep_pickup_datetime=1761955712000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=249, trip_distance=4.3, total_amount=55.0, tpep_pickup_datetime=1761957577000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=234, trip_distance=0.47, total_amount=15.7, tpep_pickup_datetime=1761955763000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=2.07, total_amount=29.82, tpep_pickup_datetime=1761956324000)\n",
      "Sent: Ride(PULocationID=232, DOLocationID=232, trip_distance=0.07, total_amount=8.75, tpep_pickup_datetime=1761957625000)\n",
      "Sent: Ride(PULocationID=232, DOLocationID=224, trip_distance=1.45, total_amount=19.74, tpep_pickup_datetime=1761957883000)\n",
      "Sent: Ride(PULocationID=224, DOLocationID=229, trip_distance=1.92, total_amount=20.58, tpep_pickup_datetime=1761958559000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=255, trip_distance=4.34, total_amount=54.18, tpep_pickup_datetime=1761955952000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=249, trip_distance=1.0, total_amount=40.74, tpep_pickup_datetime=1761955307000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=148, trip_distance=1.88, total_amount=22.97, tpep_pickup_datetime=1761957572000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=25, trip_distance=2.62, total_amount=24.78, tpep_pickup_datetime=1761958769000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=263, trip_distance=1.95, total_amount=15.75, tpep_pickup_datetime=1761955485000)\n",
      "Sent: Ride(PULocationID=261, DOLocationID=13, trip_distance=0.83, total_amount=16.83, tpep_pickup_datetime=1761958153000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=261, trip_distance=1.9, total_amount=20.75, tpep_pickup_datetime=1761955780000)\n",
      "Sent: Ride(PULocationID=45, DOLocationID=170, trip_distance=3.1, total_amount=39.9, tpep_pickup_datetime=1761957020000)\n",
      "Sent: Ride(PULocationID=243, DOLocationID=116, trip_distance=2.99, total_amount=21.72, tpep_pickup_datetime=1761955822000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=137, trip_distance=1.41, total_amount=27.55, tpep_pickup_datetime=1761956665000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=41, trip_distance=3.29, total_amount=25.45, tpep_pickup_datetime=1761955794000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=170, trip_distance=1.17, total_amount=24.78, tpep_pickup_datetime=1761955535000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=107, trip_distance=0.91, total_amount=23.94, tpep_pickup_datetime=1761957348000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=100, trip_distance=0.93, total_amount=18.06, tpep_pickup_datetime=1761958539000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=107, trip_distance=1.65, total_amount=28.85, tpep_pickup_datetime=1761956687000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=113, trip_distance=1.3, total_amount=24.78, tpep_pickup_datetime=1761955497000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=100, trip_distance=2.19, total_amount=30.45, tpep_pickup_datetime=1761957097000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=24, trip_distance=5.13, total_amount=44.1, tpep_pickup_datetime=1761957239000)\n",
      "Sent: Ride(PULocationID=264, DOLocationID=90, trip_distance=1.4, total_amount=41.55, tpep_pickup_datetime=1761957062000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=10.22, total_amount=65.24, tpep_pickup_datetime=1761955208000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=234, trip_distance=1.9, total_amount=23.1, tpep_pickup_datetime=1761956752000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.0, total_amount=-8.75, tpep_pickup_datetime=1761957929000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.0, total_amount=8.75, tpep_pickup_datetime=1761957929000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=75, trip_distance=5.69, total_amount=35.95, tpep_pickup_datetime=1761958337000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=148, trip_distance=0.8, total_amount=22.25, tpep_pickup_datetime=1761955358000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=170, trip_distance=2.0, total_amount=25.6, tpep_pickup_datetime=1761956269000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=262, trip_distance=3.6, total_amount=30.65, tpep_pickup_datetime=1761957679000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=0.7, total_amount=11.1, tpep_pickup_datetime=1761958752000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=238, trip_distance=2.0, total_amount=22.31, tpep_pickup_datetime=1761957155000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=229, trip_distance=1.5, total_amount=19.85, tpep_pickup_datetime=1761958389000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=144, trip_distance=1.58, total_amount=30.66, tpep_pickup_datetime=1761955512000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=143, trip_distance=1.2, total_amount=18.06, tpep_pickup_datetime=1761956341000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=141, trip_distance=2.19, total_amount=23.94, tpep_pickup_datetime=1761957714000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=233, trip_distance=1.26, total_amount=25.25, tpep_pickup_datetime=1761956020000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=79, trip_distance=1.64, total_amount=27.65, tpep_pickup_datetime=1761956631000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.91, total_amount=14.45, tpep_pickup_datetime=1761958456000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=68, trip_distance=2.5, total_amount=29.82, tpep_pickup_datetime=1761955599000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=87, trip_distance=4.67, total_amount=50.82, tpep_pickup_datetime=1761957483000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=164, trip_distance=1.85, total_amount=23.94, tpep_pickup_datetime=1761958438000)\n",
      "Sent: Ride(PULocationID=143, DOLocationID=162, trip_distance=1.3, total_amount=13.65, tpep_pickup_datetime=1761956802000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=158, trip_distance=3.0, total_amount=41.65, tpep_pickup_datetime=1761957357000)\n",
      "Sent: Ride(PULocationID=37, DOLocationID=143, trip_distance=8.61, total_amount=65.88, tpep_pickup_datetime=1761956548000)\n",
      "Sent: Ride(PULocationID=87, DOLocationID=148, trip_distance=1.67, total_amount=28.14, tpep_pickup_datetime=1761956271000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.93, total_amount=18.9, tpep_pickup_datetime=1761957806000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=236, trip_distance=3.45, total_amount=26.45, tpep_pickup_datetime=1761958735000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=87, trip_distance=3.04, total_amount=26.05, tpep_pickup_datetime=1761958064000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=236, trip_distance=1.77, total_amount=20.5, tpep_pickup_datetime=1761955394000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=265, trip_distance=6.26, total_amount=92.1, tpep_pickup_datetime=1761956388000)\n",
      "Sent: Ride(PULocationID=152, DOLocationID=82, trip_distance=7.39, total_amount=50.94, tpep_pickup_datetime=1761957298000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.38, total_amount=19.25, tpep_pickup_datetime=1761958530000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=90, trip_distance=0.83, total_amount=18.9, tpep_pickup_datetime=1761956852000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=237, trip_distance=1.47, total_amount=14.3, tpep_pickup_datetime=1761955429000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=233, trip_distance=0.55, total_amount=19.74, tpep_pickup_datetime=1761956098000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=75, trip_distance=6.2, total_amount=53.35, tpep_pickup_datetime=1761956596000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=231, trip_distance=2.93, total_amount=35.55, tpep_pickup_datetime=1761958267000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=246, trip_distance=0.73, total_amount=14.25, tpep_pickup_datetime=1761957445000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=48, trip_distance=1.48, total_amount=23.94, tpep_pickup_datetime=1761955900000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.6, total_amount=20.41, tpep_pickup_datetime=1761955372000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=137, trip_distance=1.6, total_amount=25.9, tpep_pickup_datetime=1761956526000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=233, trip_distance=1.4, total_amount=22.05, tpep_pickup_datetime=1761958784000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=234, trip_distance=0.4, total_amount=14.7, tpep_pickup_datetime=1761955781000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=2.1, total_amount=44.9, tpep_pickup_datetime=1761956134000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=87, trip_distance=5.1, total_amount=53.34, tpep_pickup_datetime=1761956209000)\n",
      "Sent: Ride(PULocationID=262, DOLocationID=145, trip_distance=1.73, total_amount=16.45, tpep_pickup_datetime=1761955736000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=158, trip_distance=0.68, total_amount=21.95, tpep_pickup_datetime=1761955728000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=249, trip_distance=0.22, total_amount=15.54, tpep_pickup_datetime=1761956324000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=229, trip_distance=2.49, total_amount=24.95, tpep_pickup_datetime=1761955362000)\n",
      "Sent: Ride(PULocationID=229, DOLocationID=140, trip_distance=1.08, total_amount=15.54, tpep_pickup_datetime=1761956228000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=163, trip_distance=4.6, total_amount=47.85, tpep_pickup_datetime=1761955422000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=237, trip_distance=1.41, total_amount=19.74, tpep_pickup_datetime=1761958654000)\n",
      "Sent: Ride(PULocationID=229, DOLocationID=68, trip_distance=3.5, total_amount=26.25, tpep_pickup_datetime=1761958702000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=48, trip_distance=1.1, total_amount=24.8, tpep_pickup_datetime=1761955292000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=90, trip_distance=1.3, total_amount=21.45, tpep_pickup_datetime=1761956729000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=249, trip_distance=0.5, total_amount=20.55, tpep_pickup_datetime=1761957601000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=137, trip_distance=2.7, total_amount=36.3, tpep_pickup_datetime=1761958362000)\n",
      "Sent: Ride(PULocationID=232, DOLocationID=87, trip_distance=1.54, total_amount=22.29, tpep_pickup_datetime=1761957878000)\n",
      "Sent: Ride(PULocationID=87, DOLocationID=148, trip_distance=1.66, total_amount=29.82, tpep_pickup_datetime=1761958604000)\n",
      "Sent: Ride(PULocationID=65, DOLocationID=49, trip_distance=0.9, total_amount=11.4, tpep_pickup_datetime=1761956206000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=249, trip_distance=1.43, total_amount=21.95, tpep_pickup_datetime=1761958184000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=186, trip_distance=1.66, total_amount=26.15, tpep_pickup_datetime=1761955322000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=68, trip_distance=0.75, total_amount=16.05, tpep_pickup_datetime=1761956736000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=164, trip_distance=1.11, total_amount=21.42, tpep_pickup_datetime=1761957396000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=2.22, total_amount=49.85, tpep_pickup_datetime=1761956131000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=68, trip_distance=11.1, total_amount=79.9, tpep_pickup_datetime=1761955488000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=246, trip_distance=0.9, total_amount=33.45, tpep_pickup_datetime=1761958334000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=74, trip_distance=2.9, total_amount=23.6, tpep_pickup_datetime=1761955491000)\n",
      "Sent: Ride(PULocationID=74, DOLocationID=244, trip_distance=3.7, total_amount=22.2, tpep_pickup_datetime=1761956515000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=209, trip_distance=7.04, total_amount=51.66, tpep_pickup_datetime=1761957054000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=256, trip_distance=7.1, total_amount=61.95, tpep_pickup_datetime=1761955765000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=1.18, total_amount=17.22, tpep_pickup_datetime=1761955439000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=161, trip_distance=2.44, total_amount=24.78, tpep_pickup_datetime=1761956004000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=236, trip_distance=1.22, total_amount=17.22, tpep_pickup_datetime=1761957596000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=142, trip_distance=1.74, total_amount=18.84, tpep_pickup_datetime=1761958175000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=229, trip_distance=3.03, total_amount=37.38, tpep_pickup_datetime=1761955292000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=75, trip_distance=2.3, total_amount=20.58, tpep_pickup_datetime=1761957307000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.27, total_amount=20.15, tpep_pickup_datetime=1761956056000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=243, trip_distance=9.04, total_amount=55.02, tpep_pickup_datetime=1761956831000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=142, trip_distance=0.69, total_amount=15.86, tpep_pickup_datetime=1761955394000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=100, trip_distance=1.52, total_amount=21.42, tpep_pickup_datetime=1761957177000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=163, trip_distance=0.76, total_amount=14.44, tpep_pickup_datetime=1761957724000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.84, total_amount=13.8, tpep_pickup_datetime=1761958711000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=113, trip_distance=1.8, total_amount=34.45, tpep_pickup_datetime=1761957122000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=233, trip_distance=0.38, total_amount=10.85, tpep_pickup_datetime=1761957862000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=233, trip_distance=1.63, total_amount=30.66, tpep_pickup_datetime=1761955249000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=79, trip_distance=3.63, total_amount=35.25, tpep_pickup_datetime=1761957099000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=87, trip_distance=1.2, total_amount=19.75, tpep_pickup_datetime=1761956635000)\n",
      "Sent: Ride(PULocationID=209, DOLocationID=170, trip_distance=4.5, total_amount=39.05, tpep_pickup_datetime=1761957901000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=249, trip_distance=0.27, total_amount=15.05, tpep_pickup_datetime=1761955403000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=239, trip_distance=3.5, total_amount=48.3, tpep_pickup_datetime=1761956136000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=170, trip_distance=2.16, total_amount=24.78, tpep_pickup_datetime=1761957922000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=166, trip_distance=1.0, total_amount=13.0, tpep_pickup_datetime=1761958113000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.56, total_amount=12.96, tpep_pickup_datetime=1761956773000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=74, trip_distance=1.91, total_amount=17.25, tpep_pickup_datetime=1761957198000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=161, trip_distance=9.31, total_amount=70.26, tpep_pickup_datetime=1761958431000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=243, trip_distance=10.4, total_amount=54.25, tpep_pickup_datetime=1761957210000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=231, trip_distance=1.17, total_amount=21.42, tpep_pickup_datetime=1761956227000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.57, total_amount=19.55, tpep_pickup_datetime=1761958599000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=145, trip_distance=1.97, total_amount=22.26, tpep_pickup_datetime=1761956297000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=209, trip_distance=1.4, total_amount=18.9, tpep_pickup_datetime=1761958782000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=163, trip_distance=0.96, total_amount=13.95, tpep_pickup_datetime=1761958014000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=234, trip_distance=1.35, total_amount=19.55, tpep_pickup_datetime=1761958413000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=145, trip_distance=4.79, total_amount=42.26, tpep_pickup_datetime=1761956569000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=152, trip_distance=7.57, total_amount=57.79, tpep_pickup_datetime=1761957412000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=163, trip_distance=0.64, total_amount=12.25, tpep_pickup_datetime=1761957651000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=1.16, total_amount=15.75, tpep_pickup_datetime=1761958651000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=80, trip_distance=3.27, total_amount=24.85, tpep_pickup_datetime=1761956017000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=70, trip_distance=1.16, total_amount=24.09, tpep_pickup_datetime=1761958172000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=252, trip_distance=6.74, total_amount=49.63, tpep_pickup_datetime=1761958767000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=3.51, total_amount=37.38, tpep_pickup_datetime=1761955439000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=75, trip_distance=2.72, total_amount=18.55, tpep_pickup_datetime=1761958052000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=151, trip_distance=0.53, total_amount=12.12, tpep_pickup_datetime=1761957724000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=48, trip_distance=2.82, total_amount=38.22, tpep_pickup_datetime=1761955250000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=74, trip_distance=5.21, total_amount=37.38, tpep_pickup_datetime=1761957341000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=144, trip_distance=4.22, total_amount=42.73, tpep_pickup_datetime=1761956330000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=142, trip_distance=21.73, total_amount=87.69, tpep_pickup_datetime=1761956883000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=141, trip_distance=4.5, total_amount=46.6, tpep_pickup_datetime=1761955852000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=262, trip_distance=3.44, total_amount=33.18, tpep_pickup_datetime=1761956238000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=229, trip_distance=1.63, total_amount=19.74, tpep_pickup_datetime=1761958132000)\n",
      "Sent: Ride(PULocationID=229, DOLocationID=236, trip_distance=1.45, total_amount=16.38, tpep_pickup_datetime=1761958766000)\n",
      "Sent: Ride(PULocationID=114, DOLocationID=107, trip_distance=1.3, total_amount=21.35, tpep_pickup_datetime=1761955718000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=79, trip_distance=1.0, total_amount=20.65, tpep_pickup_datetime=1761956865000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=48, trip_distance=3.6, total_amount=29.05, tpep_pickup_datetime=1761957987000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=238, trip_distance=2.1, total_amount=21.4, tpep_pickup_datetime=1761957886000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=141, trip_distance=3.71, total_amount=32.34, tpep_pickup_datetime=1761958179000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=170, trip_distance=1.92, total_amount=28.14, tpep_pickup_datetime=1761958413000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=9.95, total_amount=76.61, tpep_pickup_datetime=1761956513000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=107, trip_distance=2.2, total_amount=24.15, tpep_pickup_datetime=1761957310000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=17, trip_distance=8.9, total_amount=67.49, tpep_pickup_datetime=1761956182000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.33, total_amount=17.16, tpep_pickup_datetime=1761956654000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=230, trip_distance=1.8, total_amount=22.25, tpep_pickup_datetime=1761957156000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=48, trip_distance=2.3, total_amount=28.35, tpep_pickup_datetime=1761958637000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=79, trip_distance=4.6, total_amount=53.34, tpep_pickup_datetime=1761957453000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=249, trip_distance=1.3, total_amount=30.45, tpep_pickup_datetime=1761955450000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=238, trip_distance=4.5, total_amount=50.0, tpep_pickup_datetime=1761957468000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=246, trip_distance=1.83, total_amount=24.75, tpep_pickup_datetime=1761957761000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=148, trip_distance=2.5, total_amount=32.85, tpep_pickup_datetime=1761955826000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=249, trip_distance=1.9, total_amount=24.15, tpep_pickup_datetime=1761958126000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=143, trip_distance=1.98, total_amount=25.36, tpep_pickup_datetime=1761955308000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=246, trip_distance=2.17, total_amount=23.45, tpep_pickup_datetime=1761955684000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.79, total_amount=28.98, tpep_pickup_datetime=1761955920000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=237, trip_distance=2.55, total_amount=23.35, tpep_pickup_datetime=1761957446000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=1.66, total_amount=17.2, tpep_pickup_datetime=1761958370000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=255, trip_distance=5.69, total_amount=52.15, tpep_pickup_datetime=1761956236000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=249, trip_distance=1.0, total_amount=15.05, tpep_pickup_datetime=1761958649000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=43, trip_distance=4.24, total_amount=42.45, tpep_pickup_datetime=1761957905000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=256, trip_distance=16.31, total_amount=87.85, tpep_pickup_datetime=1761957025000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=231, trip_distance=11.65, total_amount=66.33, tpep_pickup_datetime=1761958443000)\n",
      "Sent: Ride(PULocationID=88, DOLocationID=261, trip_distance=0.43, total_amount=18.06, tpep_pickup_datetime=1761956169000)\n",
      "Sent: Ride(PULocationID=261, DOLocationID=186, trip_distance=5.42, total_amount=51.66, tpep_pickup_datetime=1761957049000)\n",
      "Sent: Ride(PULocationID=50, DOLocationID=68, trip_distance=0.93, total_amount=15.54, tpep_pickup_datetime=1761957235000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=13, trip_distance=3.6, total_amount=-36.05, tpep_pickup_datetime=1761958091000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=13, trip_distance=3.6, total_amount=36.05, tpep_pickup_datetime=1761958091000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=37, trip_distance=12.66, total_amount=83.09, tpep_pickup_datetime=1761956632000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=0.69, total_amount=14.65, tpep_pickup_datetime=1761955727000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=148, trip_distance=1.23, total_amount=25.62, tpep_pickup_datetime=1761956213000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=1.04, total_amount=16.75, tpep_pickup_datetime=1761957350000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=231, trip_distance=1.63, total_amount=19.15, tpep_pickup_datetime=1761958074000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=255, trip_distance=3.38, total_amount=49.14, tpep_pickup_datetime=1761956923000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=113, trip_distance=0.28, total_amount=16.38, tpep_pickup_datetime=1761956238000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=161, trip_distance=2.63, total_amount=29.75, tpep_pickup_datetime=1761956822000)\n",
      "Sent: Ride(PULocationID=43, DOLocationID=158, trip_distance=5.39, total_amount=36.25, tpep_pickup_datetime=1761956931000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=186, trip_distance=1.33, total_amount=17.75, tpep_pickup_datetime=1761958537000)\n",
      "Sent: Ride(PULocationID=50, DOLocationID=263, trip_distance=4.11, total_amount=26.25, tpep_pickup_datetime=1761955276000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=144, trip_distance=1.07, total_amount=24.06, tpep_pickup_datetime=1761955616000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=158, trip_distance=2.15, total_amount=27.3, tpep_pickup_datetime=1761956698000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=249, trip_distance=0.58, total_amount=25.03, tpep_pickup_datetime=1761957879000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=246, trip_distance=1.7, total_amount=25.6, tpep_pickup_datetime=1761958272000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=75, trip_distance=4.3, total_amount=36.5, tpep_pickup_datetime=1761955917000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.73, total_amount=17.22, tpep_pickup_datetime=1761955730000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=243, trip_distance=8.62, total_amount=52.5, tpep_pickup_datetime=1761957246000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=163, trip_distance=1.9, total_amount=20.6, tpep_pickup_datetime=1761956718000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=234, trip_distance=2.3, total_amount=29.8, tpep_pickup_datetime=1761957364000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=79, trip_distance=3.15, total_amount=48.56, tpep_pickup_datetime=1761955429000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.09, total_amount=13.02, tpep_pickup_datetime=1761957978000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=2.07, total_amount=23.94, tpep_pickup_datetime=1761955360000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.43, total_amount=12.12, tpep_pickup_datetime=1761956193000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=1.27, total_amount=15.48, tpep_pickup_datetime=1761956860000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=229, trip_distance=3.11, total_amount=28.98, tpep_pickup_datetime=1761957265000)\n",
      "Sent: Ride(PULocationID=229, DOLocationID=79, trip_distance=2.55, total_amount=32.45, tpep_pickup_datetime=1761958387000)\n",
      "Sent: Ride(PULocationID=226, DOLocationID=7, trip_distance=1.1, total_amount=14.75, tpep_pickup_datetime=1761958172000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=137, trip_distance=0.75, total_amount=16.35, tpep_pickup_datetime=1761955477000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=236, trip_distance=2.06, total_amount=19.85, tpep_pickup_datetime=1761957775000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=145, trip_distance=16.5, total_amount=90.45, tpep_pickup_datetime=1761958648000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=107, trip_distance=1.0, total_amount=21.42, tpep_pickup_datetime=1761955658000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=170, trip_distance=0.71, total_amount=12.95, tpep_pickup_datetime=1761956423000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=170, trip_distance=0.0, total_amount=-19.25, tpep_pickup_datetime=1761956816000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=170, trip_distance=0.0, total_amount=19.25, tpep_pickup_datetime=1761956816000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=224, trip_distance=1.0, total_amount=23.95, tpep_pickup_datetime=1761956447000)\n",
      "Sent: Ride(PULocationID=224, DOLocationID=79, trip_distance=0.9, total_amount=17.2, tpep_pickup_datetime=1761957800000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.7, total_amount=18.9, tpep_pickup_datetime=1761958452000)\n",
      "Sent: Ride(PULocationID=114, DOLocationID=113, trip_distance=0.8, total_amount=19.15, tpep_pickup_datetime=1761955618000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=7, trip_distance=5.5, total_amount=56.7, tpep_pickup_datetime=1761957926000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=48, trip_distance=1.8, total_amount=19.15, tpep_pickup_datetime=1761956880000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=164, trip_distance=2.4, total_amount=29.05, tpep_pickup_datetime=1761957816000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=237, trip_distance=2.28, total_amount=24.15, tpep_pickup_datetime=1761957069000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=164, trip_distance=0.62, total_amount=16.38, tpep_pickup_datetime=1761957817000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=238, trip_distance=4.1, total_amount=36.15, tpep_pickup_datetime=1761958381000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=125, trip_distance=0.99, total_amount=23.75, tpep_pickup_datetime=1761955374000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=158, trip_distance=2.28, total_amount=43.26, tpep_pickup_datetime=1761956336000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=237, trip_distance=0.88, total_amount=12.25, tpep_pickup_datetime=1761956351000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=263, trip_distance=0.58, total_amount=12.12, tpep_pickup_datetime=1761956881000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.34, total_amount=10.81, tpep_pickup_datetime=1761957100000)\n",
      "Sent: Ride(PULocationID=262, DOLocationID=140, trip_distance=1.3, total_amount=18.0, tpep_pickup_datetime=1761957520000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=236, trip_distance=0.51, total_amount=10.4, tpep_pickup_datetime=1761958424000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=79, trip_distance=1.71, total_amount=21.35, tpep_pickup_datetime=1761955817000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.77, total_amount=19.25, tpep_pickup_datetime=1761957630000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=230, trip_distance=2.52, total_amount=44.94, tpep_pickup_datetime=1761955596000)\n",
      "Sent: Ride(PULocationID=116, DOLocationID=166, trip_distance=1.05, total_amount=13.52, tpep_pickup_datetime=1761956317000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=230, trip_distance=11.01, total_amount=82.02, tpep_pickup_datetime=1761956511000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=45, trip_distance=4.35, total_amount=-38.85, tpep_pickup_datetime=1761958593000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=45, trip_distance=4.35, total_amount=38.85, tpep_pickup_datetime=1761958593000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=263, trip_distance=0.82, total_amount=13.7, tpep_pickup_datetime=1761956267000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=74, trip_distance=0.84, total_amount=13.8, tpep_pickup_datetime=1761956776000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=186, trip_distance=1.4, total_amount=18.9, tpep_pickup_datetime=1761958274000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=28, trip_distance=6.35, total_amount=50.47, tpep_pickup_datetime=1761955330000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=262, trip_distance=8.44, total_amount=63.48, tpep_pickup_datetime=1761957553000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=48, trip_distance=1.11, total_amount=22.26, tpep_pickup_datetime=1761956847000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=87, trip_distance=6.66, total_amount=50.15, tpep_pickup_datetime=1761957913000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=151, trip_distance=1.41, total_amount=16.32, tpep_pickup_datetime=1761956114000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=0.76, total_amount=10.8, tpep_pickup_datetime=1761957059000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=238, trip_distance=1.01, total_amount=13.8, tpep_pickup_datetime=1761957693000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=263, trip_distance=1.4, total_amount=18.0, tpep_pickup_datetime=1761958383000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=1.11, total_amount=14.64, tpep_pickup_datetime=1761958732000)\n",
      "Sent: Ride(PULocationID=261, DOLocationID=229, trip_distance=5.7, total_amount=39.55, tpep_pickup_datetime=1761957098000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=148, trip_distance=1.38, total_amount=29.05, tpep_pickup_datetime=1761956006000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=141, trip_distance=4.33, total_amount=44.1, tpep_pickup_datetime=1761957811000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=238, trip_distance=19.53, total_amount=98.88, tpep_pickup_datetime=1761956173000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=43, trip_distance=2.5, total_amount=37.2, tpep_pickup_datetime=1761957579000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=68, trip_distance=10.49, total_amount=73.3, tpep_pickup_datetime=1761955850000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=48, trip_distance=1.54, total_amount=33.25, tpep_pickup_datetime=1761955446000)\n",
      "Sent: Ride(PULocationID=50, DOLocationID=97, trip_distance=5.7, total_amount=56.65, tpep_pickup_datetime=1761958001000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=148, trip_distance=1.43, total_amount=38.22, tpep_pickup_datetime=1761956136000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=36, trip_distance=3.85, total_amount=37.45, tpep_pickup_datetime=1761958185000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.09, total_amount=15.15, tpep_pickup_datetime=1761955232000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=249, trip_distance=1.32, total_amount=23.45, tpep_pickup_datetime=1761955586000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=13, trip_distance=3.02, total_amount=39.06, tpep_pickup_datetime=1761956901000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.7, total_amount=12.95, tpep_pickup_datetime=1761956438000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=244, trip_distance=5.9, total_amount=38.9, tpep_pickup_datetime=1761956922000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=68, trip_distance=1.07, total_amount=19.74, tpep_pickup_datetime=1761954560000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=263, trip_distance=5.34, total_amount=49.14, tpep_pickup_datetime=1761955328000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=237, trip_distance=1.92, total_amount=19.25, tpep_pickup_datetime=1761958528000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=10, trip_distance=2.32, total_amount=26.76, tpep_pickup_datetime=1761958227000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=162, trip_distance=2.4, total_amount=26.45, tpep_pickup_datetime=1761956453000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=48, trip_distance=1.3, total_amount=23.2, tpep_pickup_datetime=1761957702000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.1, total_amount=17.05, tpep_pickup_datetime=1761958724000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=90, trip_distance=1.8, total_amount=27.3, tpep_pickup_datetime=1761955601000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=88, trip_distance=3.1, total_amount=41.6, tpep_pickup_datetime=1761956876000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=48, trip_distance=1.5, total_amount=26.45, tpep_pickup_datetime=1761955465000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=164, trip_distance=1.3, total_amount=20.85, tpep_pickup_datetime=1761956661000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=50, trip_distance=0.5, total_amount=15.5, tpep_pickup_datetime=1761957974000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=1.64, total_amount=18.06, tpep_pickup_datetime=1761957850000)\n",
      "Sent: Ride(PULocationID=229, DOLocationID=4, trip_distance=2.84, total_amount=39.06, tpep_pickup_datetime=1761958097000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=164, trip_distance=1.09, total_amount=17.15, tpep_pickup_datetime=1761958351000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=88, trip_distance=3.3, total_amount=35.7, tpep_pickup_datetime=1761957349000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=229, trip_distance=1.0, total_amount=17.0, tpep_pickup_datetime=1761955657000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=223, trip_distance=5.7, total_amount=55.0, tpep_pickup_datetime=1761957596000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=4, trip_distance=0.54, total_amount=16.05, tpep_pickup_datetime=1761956220000)\n",
      "Sent: Ride(PULocationID=4, DOLocationID=246, trip_distance=2.39, total_amount=35.7, tpep_pickup_datetime=1761956807000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=140, trip_distance=4.82, total_amount=42.42, tpep_pickup_datetime=1761957016000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=231, trip_distance=1.29, total_amount=18.06, tpep_pickup_datetime=1761958584000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=238, trip_distance=2.3, total_amount=20.65, tpep_pickup_datetime=1761955264000)\n",
      "Sent: Ride(PULocationID=143, DOLocationID=263, trip_distance=2.0, total_amount=18.85, tpep_pickup_datetime=1761956722000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=107, trip_distance=4.0, total_amount=34.0, tpep_pickup_datetime=1761957356000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=249, trip_distance=1.3, total_amount=24.95, tpep_pickup_datetime=1761958780000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=229, trip_distance=1.0, total_amount=18.05, tpep_pickup_datetime=1761955689000)\n",
      "Sent: Ride(PULocationID=229, DOLocationID=107, trip_distance=1.7, total_amount=32.15, tpep_pickup_datetime=1761956227000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=79, trip_distance=0.9, total_amount=27.55, tpep_pickup_datetime=1761958453000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.35, total_amount=23.94, tpep_pickup_datetime=1761955756000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=48, trip_distance=0.7, total_amount=16.38, tpep_pickup_datetime=1761956588000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=100, trip_distance=1.02, total_amount=15.75, tpep_pickup_datetime=1761957508000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=140, trip_distance=2.04, total_amount=22.75, tpep_pickup_datetime=1761958204000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=166, trip_distance=2.9, total_amount=23.35, tpep_pickup_datetime=1761956399000)\n",
      "Sent: Ride(PULocationID=151, DOLocationID=116, trip_distance=2.6, total_amount=22.55, tpep_pickup_datetime=1761957485000)\n",
      "Sent: Ride(PULocationID=209, DOLocationID=158, trip_distance=2.7, total_amount=26.95, tpep_pickup_datetime=1761955527000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=249, trip_distance=1.3, total_amount=32.8, tpep_pickup_datetime=1761957154000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=48, trip_distance=0.3, total_amount=10.15, tpep_pickup_datetime=1761955372000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=112, trip_distance=5.3, total_amount=39.9, tpep_pickup_datetime=1761955882000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=262, trip_distance=1.43, total_amount=-15.0, tpep_pickup_datetime=1761957456000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=1.43, total_amount=15.0, tpep_pickup_datetime=1761957456000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=48, trip_distance=0.7, total_amount=22.95, tpep_pickup_datetime=1761956063000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=161, trip_distance=10.23, total_amount=79.5, tpep_pickup_datetime=1761955406000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=87, trip_distance=4.43, total_amount=50.05, tpep_pickup_datetime=1761956344000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=113, trip_distance=0.5, total_amount=13.85, tpep_pickup_datetime=1761956358000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=144, trip_distance=0.9, total_amount=15.55, tpep_pickup_datetime=1761956781000)\n",
      "Sent: Ride(PULocationID=261, DOLocationID=236, trip_distance=9.4, total_amount=59.35, tpep_pickup_datetime=1761958586000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=140, trip_distance=0.87, total_amount=13.7, tpep_pickup_datetime=1761955339000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=7, trip_distance=3.94, total_amount=27.65, tpep_pickup_datetime=1761956046000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=107, trip_distance=2.58, total_amount=34.02, tpep_pickup_datetime=1761958777000)\n",
      "Sent: Ride(PULocationID=24, DOLocationID=74, trip_distance=1.82, total_amount=17.52, tpep_pickup_datetime=1761957474000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.77, total_amount=23.1, tpep_pickup_datetime=1761955737000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=87, trip_distance=3.99, total_amount=40.74, tpep_pickup_datetime=1761957025000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=113, trip_distance=0.66, total_amount=17.22, tpep_pickup_datetime=1761956685000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=170, trip_distance=1.08, total_amount=21.42, tpep_pickup_datetime=1761957289000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=48, trip_distance=1.59, total_amount=25.79, tpep_pickup_datetime=1761958164000)\n",
      "Sent: Ride(PULocationID=226, DOLocationID=181, trip_distance=9.59, total_amount=47.3, tpep_pickup_datetime=1761955971000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=246, trip_distance=0.55, total_amount=12.85, tpep_pickup_datetime=1761956644000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=265, trip_distance=2.75, total_amount=102.97, tpep_pickup_datetime=1761958631000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=13, trip_distance=1.22, total_amount=17.22, tpep_pickup_datetime=1761955724000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=140, trip_distance=3.69, total_amount=26.25, tpep_pickup_datetime=1761958367000)\n",
      "Sent: Ride(PULocationID=4, DOLocationID=144, trip_distance=0.95, total_amount=19.25, tpep_pickup_datetime=1761955694000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=79, trip_distance=0.82, total_amount=19.74, tpep_pickup_datetime=1761956737000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=49, trip_distance=6.77, total_amount=54.69, tpep_pickup_datetime=1761957574000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=41, trip_distance=2.43, total_amount=22.2, tpep_pickup_datetime=1761957454000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=148, trip_distance=9.41, total_amount=64.7, tpep_pickup_datetime=1761958000000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=211, trip_distance=3.51, total_amount=40.74, tpep_pickup_datetime=1761955201000)\n",
      "Sent: Ride(PULocationID=211, DOLocationID=158, trip_distance=1.21, total_amount=22.29, tpep_pickup_datetime=1761957218000)\n",
      "Sent: Ride(PULocationID=4, DOLocationID=239, trip_distance=5.75, total_amount=40.43, tpep_pickup_datetime=1761955800000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=262, trip_distance=2.3, total_amount=23.2, tpep_pickup_datetime=1761955464000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=79, trip_distance=4.2, total_amount=53.35, tpep_pickup_datetime=1761956651000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=233, trip_distance=1.18, total_amount=18.06, tpep_pickup_datetime=1761956725000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=79, trip_distance=1.34, total_amount=32.34, tpep_pickup_datetime=1761957567000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=263, trip_distance=1.73, total_amount=17.4, tpep_pickup_datetime=1761957998000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=163, trip_distance=2.2, total_amount=23.86, tpep_pickup_datetime=1761955626000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=141, trip_distance=1.11, total_amount=18.9, tpep_pickup_datetime=1761956732000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=68, trip_distance=0.4, total_amount=13.02, tpep_pickup_datetime=1761955846000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=144, trip_distance=4.01, total_amount=48.15, tpep_pickup_datetime=1761956664000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=107, trip_distance=5.4, total_amount=50.45, tpep_pickup_datetime=1761956704000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=33, trip_distance=4.16, total_amount=44.1, tpep_pickup_datetime=1761957441000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=196, trip_distance=8.15, total_amount=49.99, tpep_pickup_datetime=1761958446000)\n",
      "Sent: Ride(PULocationID=13, DOLocationID=107, trip_distance=4.97, total_amount=37.38, tpep_pickup_datetime=1761955283000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=140, trip_distance=2.6, total_amount=22.75, tpep_pickup_datetime=1761956605000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=143, trip_distance=1.47, total_amount=18.84, tpep_pickup_datetime=1761957687000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=79, trip_distance=1.59, total_amount=24.75, tpep_pickup_datetime=1761955644000)\n",
      "Sent: Ride(PULocationID=4, DOLocationID=113, trip_distance=0.99, total_amount=19.74, tpep_pickup_datetime=1761956917000)\n",
      "Sent: Ride(PULocationID=114, DOLocationID=4, trip_distance=0.88, total_amount=17.45, tpep_pickup_datetime=1761957745000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=263, trip_distance=3.05, total_amount=34.86, tpep_pickup_datetime=1761956135000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=237, trip_distance=0.79, total_amount=14.2, tpep_pickup_datetime=1761957780000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=90, trip_distance=3.1, total_amount=28.98, tpep_pickup_datetime=1761958374000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=262, trip_distance=2.46, total_amount=24.78, tpep_pickup_datetime=1761957461000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=79, trip_distance=6.25, total_amount=77.44, tpep_pickup_datetime=1761956039000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=137, trip_distance=1.4, total_amount=27.3, tpep_pickup_datetime=1761956403000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=170, trip_distance=0.8, total_amount=19.7, tpep_pickup_datetime=1761957702000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=229, trip_distance=0.8, total_amount=11.55, tpep_pickup_datetime=1761958542000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=42, trip_distance=6.31, total_amount=50.82, tpep_pickup_datetime=1761955373000)\n",
      "Sent: Ride(PULocationID=41, DOLocationID=263, trip_distance=1.92, total_amount=19.68, tpep_pickup_datetime=1761958166000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=209, trip_distance=6.1, total_amount=39.35, tpep_pickup_datetime=1761957889000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=164, trip_distance=0.97, total_amount=16.45, tpep_pickup_datetime=1761955699000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=231, trip_distance=2.31, total_amount=37.19, tpep_pickup_datetime=1761956454000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=236, trip_distance=1.53, total_amount=18.84, tpep_pickup_datetime=1761955789000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=148, trip_distance=0.0, total_amount=8.75, tpep_pickup_datetime=1761957945000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=68, trip_distance=2.8, total_amount=28.15, tpep_pickup_datetime=1761957966000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=87, trip_distance=0.7, total_amount=12.55, tpep_pickup_datetime=1761958490000)\n",
      "Sent: Ride(PULocationID=66, DOLocationID=37, trip_distance=3.77, total_amount=31.0, tpep_pickup_datetime=1761958073000)\n",
      "Sent: Ride(PULocationID=24, DOLocationID=262, trip_distance=2.63, total_amount=24.72, tpep_pickup_datetime=1761958756000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=186, trip_distance=3.0, total_amount=42.45, tpep_pickup_datetime=1761956480000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=231, trip_distance=1.12, total_amount=14.89, tpep_pickup_datetime=1761955444000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.18, total_amount=29.75, tpep_pickup_datetime=1761955863000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=234, trip_distance=1.1, total_amount=20.25, tpep_pickup_datetime=1761957725000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=246, trip_distance=1.9, total_amount=26.7, tpep_pickup_datetime=1761956765000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=142, trip_distance=1.04, total_amount=17.31, tpep_pickup_datetime=1761957754000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.75, total_amount=14.25, tpep_pickup_datetime=1761955857000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=163, trip_distance=2.13, total_amount=19.95, tpep_pickup_datetime=1761956490000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=142, trip_distance=0.4, total_amount=13.02, tpep_pickup_datetime=1761957827000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.38, total_amount=12.95, tpep_pickup_datetime=1761957395000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=151, trip_distance=7.92, total_amount=69.3, tpep_pickup_datetime=1761957823000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=163, trip_distance=1.44, total_amount=19.25, tpep_pickup_datetime=1761956112000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=256, trip_distance=5.77, total_amount=51.45, tpep_pickup_datetime=1761955515000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=236, trip_distance=0.33, total_amount=12.12, tpep_pickup_datetime=1761957509000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=4, trip_distance=1.05, total_amount=-19.25, tpep_pickup_datetime=1761957971000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=4, trip_distance=1.05, total_amount=19.25, tpep_pickup_datetime=1761957971000)\n",
      "Sent: Ride(PULocationID=88, DOLocationID=33, trip_distance=3.68, total_amount=28.14, tpep_pickup_datetime=1761956756000)\n",
      "Sent: Ride(PULocationID=33, DOLocationID=25, trip_distance=0.98, total_amount=13.0, tpep_pickup_datetime=1761957514000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=211, trip_distance=0.85, total_amount=18.65, tpep_pickup_datetime=1761955144000)\n",
      "Sent: Ride(PULocationID=211, DOLocationID=97, trip_distance=3.25, total_amount=31.15, tpep_pickup_datetime=1761956474000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.26, total_amount=18.06, tpep_pickup_datetime=1761958254000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=1.43, total_amount=15.3, tpep_pickup_datetime=1761955827000)\n",
      "Sent: Ride(PULocationID=50, DOLocationID=90, trip_distance=2.31, total_amount=27.3, tpep_pickup_datetime=1761956610000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=3.37, total_amount=31.5, tpep_pickup_datetime=1761957803000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=113, trip_distance=1.03, total_amount=20.58, tpep_pickup_datetime=1761955629000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=170, trip_distance=0.11, total_amount=9.45, tpep_pickup_datetime=1761957550000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=68, trip_distance=1.7, total_amount=21.33, tpep_pickup_datetime=1761957416000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=231, trip_distance=2.27, total_amount=33.45, tpep_pickup_datetime=1761958648000)\n",
      "Sent: Ride(PULocationID=90, DOLocationID=233, trip_distance=1.76, total_amount=27.33, tpep_pickup_datetime=1761956324000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=186, trip_distance=2.52, total_amount=35.7, tpep_pickup_datetime=1761955460000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=236, trip_distance=4.01, total_amount=28.95, tpep_pickup_datetime=1761955447000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=127, trip_distance=15.53, total_amount=81.65, tpep_pickup_datetime=1761955352000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=246, trip_distance=2.33, total_amount=28.98, tpep_pickup_datetime=1761955667000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=100, trip_distance=0.23, total_amount=13.1, tpep_pickup_datetime=1761955396000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=151, trip_distance=2.59, total_amount=22.65, tpep_pickup_datetime=1761956438000)\n",
      "Sent: Ride(PULocationID=143, DOLocationID=262, trip_distance=2.21, total_amount=25.4, tpep_pickup_datetime=1761958058000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=234, trip_distance=3.02, total_amount=52.32, tpep_pickup_datetime=1761955791000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=4, trip_distance=1.64, total_amount=34.86, tpep_pickup_datetime=1761958476000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=163, trip_distance=2.58, total_amount=25.62, tpep_pickup_datetime=1761956710000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=0.79, total_amount=20.25, tpep_pickup_datetime=1761956022000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=114, trip_distance=0.61, total_amount=17.45, tpep_pickup_datetime=1761957040000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=249, trip_distance=2.04, total_amount=32.34, tpep_pickup_datetime=1761957906000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=260, trip_distance=6.9, total_amount=50.09, tpep_pickup_datetime=1761956327000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=48, trip_distance=0.94, total_amount=15.54, tpep_pickup_datetime=1761957874000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=143, trip_distance=1.88, total_amount=17.85, tpep_pickup_datetime=1761953770000)\n",
      "Sent: Ride(PULocationID=143, DOLocationID=236, trip_distance=2.75, total_amount=26.4, tpep_pickup_datetime=1761954614000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=48, trip_distance=2.65, total_amount=34.02, tpep_pickup_datetime=1761956414000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=229, trip_distance=1.5, total_amount=22.25, tpep_pickup_datetime=1761955922000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.2, total_amount=15.69, tpep_pickup_datetime=1761958070000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=152, trip_distance=8.0, total_amount=58.38, tpep_pickup_datetime=1761956935000)\n",
      "Sent: Ride(PULocationID=166, DOLocationID=116, trip_distance=1.21, total_amount=11.96, tpep_pickup_datetime=1761956038000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.75, total_amount=18.8, tpep_pickup_datetime=1761957587000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=90, trip_distance=1.78, total_amount=26.46, tpep_pickup_datetime=1761955155000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=107, trip_distance=2.28, total_amount=28.44, tpep_pickup_datetime=1761956921000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=162, trip_distance=1.15, total_amount=16.38, tpep_pickup_datetime=1761958446000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=75, trip_distance=1.63, total_amount=18.0, tpep_pickup_datetime=1761955216000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=186, trip_distance=1.14, total_amount=17.31, tpep_pickup_datetime=1761956032000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=90, trip_distance=0.76, total_amount=26.25, tpep_pickup_datetime=1761958542000)\n",
      "Sent: Ride(PULocationID=151, DOLocationID=116, trip_distance=2.17, total_amount=20.04, tpep_pickup_datetime=1761957776000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=79, trip_distance=4.19, total_amount=47.35, tpep_pickup_datetime=1761957071000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=164, trip_distance=11.3, total_amount=69.85, tpep_pickup_datetime=1761958772000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=158, trip_distance=1.9, total_amount=26.85, tpep_pickup_datetime=1761958643000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=263, trip_distance=1.1, total_amount=15.45, tpep_pickup_datetime=1761956554000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=234, trip_distance=3.1, total_amount=33.15, tpep_pickup_datetime=1761957289000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=134, trip_distance=6.05, total_amount=41.65, tpep_pickup_datetime=1761956033000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=239, trip_distance=11.5, total_amount=82.86, tpep_pickup_datetime=1761958551000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.72, total_amount=26.25, tpep_pickup_datetime=1761956681000)\n",
      "Sent: Ride(PULocationID=87, DOLocationID=61, trip_distance=6.25, total_amount=41.45, tpep_pickup_datetime=1761957608000)\n",
      "Sent: Ride(PULocationID=4, DOLocationID=80, trip_distance=3.17, total_amount=25.55, tpep_pickup_datetime=1761958324000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=137, trip_distance=3.33, total_amount=31.45, tpep_pickup_datetime=1761955496000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=13, trip_distance=6.92, total_amount=64.26, tpep_pickup_datetime=1761957268000)\n",
      "Sent: Ride(PULocationID=249, DOLocationID=170, trip_distance=2.41, total_amount=41.58, tpep_pickup_datetime=1761955218000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=113, trip_distance=0.62, total_amount=23.94, tpep_pickup_datetime=1761955586000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=263, trip_distance=4.62, total_amount=49.14, tpep_pickup_datetime=1761956998000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=246, trip_distance=2.12, total_amount=25.45, tpep_pickup_datetime=1761956111000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=50, trip_distance=0.39, total_amount=22.14, tpep_pickup_datetime=1761957432000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=142, trip_distance=0.9, total_amount=14.7, tpep_pickup_datetime=1761958636000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=0.65, total_amount=13.8, tpep_pickup_datetime=1761957376000)\n",
      "Sent: Ride(PULocationID=141, DOLocationID=263, trip_distance=0.67, total_amount=11.5, tpep_pickup_datetime=1761958327000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=68, trip_distance=0.4, total_amount=13.02, tpep_pickup_datetime=1761956732000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=41, trip_distance=4.87, total_amount=-36.05, tpep_pickup_datetime=1761957722000)\n",
      "Sent: Ride(PULocationID=186, DOLocationID=74, trip_distance=4.87, total_amount=36.05, tpep_pickup_datetime=1761957722000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=239, trip_distance=5.79, total_amount=53.34, tpep_pickup_datetime=1761957109000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=79, trip_distance=0.99, total_amount=19.95, tpep_pickup_datetime=1761956287000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.73, total_amount=23.35, tpep_pickup_datetime=1761957380000)\n",
      "Sent: Ride(PULocationID=152, DOLocationID=151, trip_distance=1.39, total_amount=14.16, tpep_pickup_datetime=1761955467000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=107, trip_distance=3.63, total_amount=32.34, tpep_pickup_datetime=1761956364000)\n",
      "Sent: Ride(PULocationID=231, DOLocationID=263, trip_distance=6.34, total_amount=52.5, tpep_pickup_datetime=1761955711000)\n",
      "Sent: Ride(PULocationID=87, DOLocationID=74, trip_distance=8.5, total_amount=54.15, tpep_pickup_datetime=1761955795000)\n",
      "Sent: Ride(PULocationID=74, DOLocationID=42, trip_distance=2.2, total_amount=17.5, tpep_pickup_datetime=1761957673000)\n",
      "Sent: Ride(PULocationID=80, DOLocationID=34, trip_distance=4.3, total_amount=26.75, tpep_pickup_datetime=1761957260000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=112, trip_distance=4.4, total_amount=44.95, tpep_pickup_datetime=1761956555000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=68, trip_distance=2.0, total_amount=47.45, tpep_pickup_datetime=1761955568000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=114, trip_distance=1.7, total_amount=27.25, tpep_pickup_datetime=1761958278000)\n",
      "Sent: Ride(PULocationID=162, DOLocationID=179, trip_distance=3.86, total_amount=33.69, tpep_pickup_datetime=1761956794000)\n",
      "Sent: Ride(PULocationID=264, DOLocationID=229, trip_distance=1.38, total_amount=18.9, tpep_pickup_datetime=1761958759000)\n",
      "Sent: Ride(PULocationID=79, DOLocationID=233, trip_distance=2.18, total_amount=31.94, tpep_pickup_datetime=1761956161000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=262, trip_distance=1.85, total_amount=19.74, tpep_pickup_datetime=1761957605000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=1.47, total_amount=21.25, tpep_pickup_datetime=1761956512000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=140, trip_distance=2.02, total_amount=20.58, tpep_pickup_datetime=1761957479000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=164, trip_distance=1.9, total_amount=24.63, tpep_pickup_datetime=1761958326000)\n",
      "Sent: Ride(PULocationID=50, DOLocationID=252, trip_distance=15.65, total_amount=106.85, tpep_pickup_datetime=1761956198000)\n",
      "Sent: Ride(PULocationID=170, DOLocationID=141, trip_distance=2.03, total_amount=18.9, tpep_pickup_datetime=1761958441000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=243, trip_distance=7.66, total_amount=65.94, tpep_pickup_datetime=1761956564000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=262, trip_distance=2.73, total_amount=19.9, tpep_pickup_datetime=1761955822000)\n",
      "Sent: Ride(PULocationID=262, DOLocationID=162, trip_distance=2.05, total_amount=-16.45, tpep_pickup_datetime=1761956792000)\n",
      "Sent: Ride(PULocationID=262, DOLocationID=162, trip_distance=2.05, total_amount=16.45, tpep_pickup_datetime=1761956792000)\n",
      "Sent: Ride(PULocationID=229, DOLocationID=229, trip_distance=0.41, total_amount=12.48, tpep_pickup_datetime=1761957355000)\n",
      "Sent: Ride(PULocationID=68, DOLocationID=233, trip_distance=2.05, total_amount=24.35, tpep_pickup_datetime=1761956966000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=232, trip_distance=2.58, total_amount=34.86, tpep_pickup_datetime=1761955867000)\n",
      "Sent: Ride(PULocationID=232, DOLocationID=238, trip_distance=7.27, total_amount=45.21, tpep_pickup_datetime=1761957725000)\n",
      "Sent: Ride(PULocationID=132, DOLocationID=10, trip_distance=3.38, total_amount=24.31, tpep_pickup_datetime=1761957541000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=234, trip_distance=0.5, total_amount=20.58, tpep_pickup_datetime=1761955758000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=4.53, total_amount=48.3, tpep_pickup_datetime=1761957112000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=142, trip_distance=1.09, total_amount=17.22, tpep_pickup_datetime=1761955648000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=100, trip_distance=2.5, total_amount=27.3, tpep_pickup_datetime=1761956205000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=68, trip_distance=1.06, total_amount=20.58, tpep_pickup_datetime=1761957629000)\n",
      "Sent: Ride(PULocationID=125, DOLocationID=151, trip_distance=6.01, total_amount=38.85, tpep_pickup_datetime=1761956924000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=107, trip_distance=0.77, total_amount=16.38, tpep_pickup_datetime=1761955684000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=164, trip_distance=1.0, total_amount=18.8, tpep_pickup_datetime=1761955373000)\n",
      "Sent: Ride(PULocationID=164, DOLocationID=137, trip_distance=1.0, total_amount=23.2, tpep_pickup_datetime=1761956086000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=141, trip_distance=1.2, total_amount=12.95, tpep_pickup_datetime=1761956985000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=141, trip_distance=1.8, total_amount=19.75, tpep_pickup_datetime=1761958331000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=230, trip_distance=0.7, total_amount=16.05, tpep_pickup_datetime=1761956008000)\n",
      "Sent: Ride(PULocationID=161, DOLocationID=263, trip_distance=1.8, total_amount=21.4, tpep_pickup_datetime=1761956913000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=142, trip_distance=2.2, total_amount=19.8, tpep_pickup_datetime=1761957773000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=263, trip_distance=2.68, total_amount=24.78, tpep_pickup_datetime=1761955217000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.56, total_amount=11.62, tpep_pickup_datetime=1761955998000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=83, trip_distance=5.16, total_amount=31.85, tpep_pickup_datetime=1761957529000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=246, trip_distance=2.6, total_amount=39.9, tpep_pickup_datetime=1761957214000)\n",
      "Sent: Ride(PULocationID=211, DOLocationID=114, trip_distance=0.8, total_amount=15.65, tpep_pickup_datetime=1761956007000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=158, trip_distance=1.9, total_amount=24.78, tpep_pickup_datetime=1761956443000)\n",
      "Sent: Ride(PULocationID=158, DOLocationID=87, trip_distance=3.82, total_amount=28.45, tpep_pickup_datetime=1761957437000)\n",
      "Sent: Ride(PULocationID=87, DOLocationID=107, trip_distance=4.34, total_amount=40.69, tpep_pickup_datetime=1761958760000)\n",
      "Sent: Ride(PULocationID=114, DOLocationID=137, trip_distance=2.0, total_amount=25.6, tpep_pickup_datetime=1761957565000)\n",
      "Sent: Ride(PULocationID=265, DOLocationID=265, trip_distance=0.0, total_amount=96.0, tpep_pickup_datetime=1761958241000)\n",
      "Sent: Ride(PULocationID=230, DOLocationID=144, trip_distance=3.5, total_amount=32.55, tpep_pickup_datetime=1761956658000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=160, trip_distance=7.45, total_amount=40.45, tpep_pickup_datetime=1761958726000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=256, trip_distance=6.03, total_amount=68.46, tpep_pickup_datetime=1761955214000)\n",
      "Sent: Ride(PULocationID=24, DOLocationID=152, trip_distance=1.3, total_amount=10.4, tpep_pickup_datetime=1761956874000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=236, trip_distance=0.43, total_amount=11.75, tpep_pickup_datetime=1761956620000)\n",
      "Sent: Ride(PULocationID=236, DOLocationID=263, trip_distance=1.15, total_amount=16.32, tpep_pickup_datetime=1761956897000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.99, total_amount=19.1, tpep_pickup_datetime=1761957478000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=24, trip_distance=2.3, total_amount=19.7, tpep_pickup_datetime=1761955265000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=239, trip_distance=0.8, total_amount=13.8, tpep_pickup_datetime=1761956041000)\n",
      "Sent: Ride(PULocationID=142, DOLocationID=262, trip_distance=2.1, total_amount=20.5, tpep_pickup_datetime=1761956804000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=226, trip_distance=4.0, total_amount=23.45, tpep_pickup_datetime=1761957750000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=1.53, total_amount=17.95, tpep_pickup_datetime=1761958179000)\n",
      "Sent: Ride(PULocationID=144, DOLocationID=231, trip_distance=1.2, total_amount=18.9, tpep_pickup_datetime=1761955678000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=229, trip_distance=2.66, total_amount=30.66, tpep_pickup_datetime=1761958375000)\n",
      "Sent: Ride(PULocationID=113, DOLocationID=107, trip_distance=0.75, total_amount=21.42, tpep_pickup_datetime=1761955353000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=88, trip_distance=4.28, total_amount=36.05, tpep_pickup_datetime=1761957278000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.67, total_amount=13.5, tpep_pickup_datetime=1761956709000)\n",
      "Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=1.5, total_amount=23.1, tpep_pickup_datetime=1761955608000)\n",
      "Sent: Ride(PULocationID=41, DOLocationID=238, trip_distance=1.73, total_amount=19.68, tpep_pickup_datetime=1761955822000)\n",
      "Sent: Ride(PULocationID=163, DOLocationID=140, trip_distance=1.44, total_amount=18.11, tpep_pickup_datetime=1761958756000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=137, trip_distance=0.49, total_amount=12.55, tpep_pickup_datetime=1761957049000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=68, trip_distance=1.48, total_amount=23.05, tpep_pickup_datetime=1761957555000)\n",
      "Sent: Ride(PULocationID=166, DOLocationID=41, trip_distance=1.1, total_amount=11.1, tpep_pickup_datetime=1761955442000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=140, trip_distance=1.8, total_amount=18.85, tpep_pickup_datetime=1761956742000)\n",
      "Sent: Ride(PULocationID=140, DOLocationID=237, trip_distance=0.7, total_amount=14.31, tpep_pickup_datetime=1761957607000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=107, trip_distance=2.8, total_amount=30.65, tpep_pickup_datetime=1761958306000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=239, trip_distance=2.94, total_amount=38.22, tpep_pickup_datetime=1761958211000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=42, trip_distance=4.7, total_amount=34.85, tpep_pickup_datetime=1761955550000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=236, trip_distance=0.9, total_amount=13.8, tpep_pickup_datetime=1761957769000)\n",
      "Sent: Ride(PULocationID=238, DOLocationID=114, trip_distance=6.12, total_amount=56.35, tpep_pickup_datetime=1761955775000)\n",
      "Sent: Ride(PULocationID=138, DOLocationID=232, trip_distance=9.37, total_amount=55.4, tpep_pickup_datetime=1761955488000)\n",
      "Sent: Ride(PULocationID=232, DOLocationID=137, trip_distance=1.93, total_amount=22.05, tpep_pickup_datetime=1761957174000)\n",
      "Sent: Ride(PULocationID=137, DOLocationID=263, trip_distance=2.72, total_amount=20.65, tpep_pickup_datetime=1761958356000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=263, trip_distance=4.6, total_amount=39.9, tpep_pickup_datetime=1761956440000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.6, total_amount=11.92, tpep_pickup_datetime=1761958168000)\n",
      "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=1.04, total_amount=14.64, tpep_pickup_datetime=1761955790000)\n",
      "Sent: Ride(PULocationID=237, DOLocationID=75, trip_distance=1.93, total_amount=18.7, tpep_pickup_datetime=1761956503000)\n",
      "Sent: Ride(PULocationID=233, DOLocationID=107, trip_distance=0.76, total_amount=21.42, tpep_pickup_datetime=1761956475000)\n",
      "Sent: Ride(PULocationID=262, DOLocationID=48, trip_distance=3.12, total_amount=28.14, tpep_pickup_datetime=1761955204000)\n",
      "Sent: Ride(PULocationID=48, DOLocationID=68, trip_distance=0.45, total_amount=13.02, tpep_pickup_datetime=1761956902000)\n",
      "Sent: Ride(PULocationID=246, DOLocationID=186, trip_distance=0.7, total_amount=13.95, tpep_pickup_datetime=1761957726000)\n",
      "Sent: Ride(PULocationID=100, DOLocationID=164, trip_distance=0.15, total_amount=9.45, tpep_pickup_datetime=1761958138000)\n",
      "Sent: Ride(PULocationID=234, DOLocationID=113, trip_distance=0.43, total_amount=18.06, tpep_pickup_datetime=1761958439000)\n",
      "Sent: Ride(PULocationID=239, DOLocationID=116, trip_distance=3.3, total_amount=28.35, tpep_pickup_datetime=1761955931000)\n",
      "Sent: Ride(PULocationID=107, DOLocationID=79, trip_distance=0.9, total_amount=18.9, tpep_pickup_datetime=1761958296000)\n",
      "took 11.49 seconds\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "t0 = time.time()\n",
    "\n",
    "for _, row in df.iterrows():\n",
    "    ride = ride_from_row(row)\n",
    "    producer.send(topic_name, value=ride)\n",
    "    print(f\"Sent: {ride}\")\n",
    "    time.sleep(0.01)\n",
    "\n",
    "producer.flush()\n",
    "\n",
    "t1 = time.time()\n",
    "print(f'took {(t1 - t0):.2f} seconds')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1ca66fe",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "streaming-workshop",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: 07-streaming/workshop/live/pyproject.flink.toml
================================================
[project]
name = "pyflink-workshop"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "apache-flink==2.2.0",
]


================================================
FILE: 07-streaming/workshop/live/pyproject.toml
================================================
[project]
name = "streaming-workshop"
version = "0.1.0"
description = "PyFlink Stream Processing Workshop"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "kafka-python>=2.3.0",
    "pandas>=3.0.1",
    "psycopg2-binary>=2.9.11",
    "pyarrow>=23.0.1",
]

[dependency-groups]
dev = [
    "jupyter>=1.1.1",
]


================================================
FILE: 07-streaming/workshop/live/src/job/aggregation_job.py
================================================
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment


def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            tpep_pickup_datetime BIGINT,
            event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),
            WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda:29092',
            'topic' = 'rides',
            'scan.startup.mode' = 'earliest-offset',
            'properties.auto.offset.reset' = 'earliest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name


def create_events_aggregated_sink(t_env):
    table_name = 'processed_events_aggregated'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            window_start TIMESTAMP(3),
            PULocationID INT,
            num_trips BIGINT,
            total_revenue DOUBLE,
            PRIMARY KEY (window_start, PULocationID) NOT ENFORCED
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name


def log_aggregation():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)
    env.set_parallelism(3)

    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    try:
        source_table = create_events_source_kafka(t_env)
        aggregated_table = create_events_aggregated_sink(t_env)

        t_env.execute_sql(f"""
        INSERT INTO {aggregated_table}
        SELECT
            window_start,
            PULocationID,
            COUNT(*) AS num_trips,
            SUM(total_amount) AS total_revenue
        FROM TABLE(
            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '1' HOUR)
        )
        GROUP BY window_start, PULocationID;

        """).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_aggregation()

================================================
FILE: 07-streaming/workshop/live/src/job/pass_through_job.py
================================================

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment


def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            tpep_pickup_datetime BIGINT
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda:29092',
            'topic' = 'rides',
            'scan.startup.mode' = 'latest-offset',
            'properties.auto.offset.reset' = 'latest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name


def create_processed_events_sink_postgres(t_env):
    table_name = 'processed_events'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            pickup_datetime TIMESTAMP
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name


def log_processing():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)  # checkpoint every 10 seconds

    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    source_table = create_events_source_kafka(t_env)
    postgres_sink = create_processed_events_sink_postgres(t_env)

    t_env.execute_sql(
        f"""
        INSERT INTO {postgres_sink}
        SELECT
            PULocationID,
            DOLocationID,
            trip_distance,
            total_amount,
            TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3) as pickup_datetime
        FROM {source_table}
        """
    ).wait()

if __name__ == '__main__':
    log_processing()

================================================
FILE: 07-streaming/workshop/live/src/producers/models.py
================================================
import json
import dataclasses

from dataclasses import dataclass


@dataclass
class Ride:
    PULocationID: int
    DOLocationID: int
    trip_distance: float
    total_amount: float
    tpep_pickup_datetime: int  # epoch milliseconds


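def ride_from_row(row):
    """Build a Ride from a DataFrame row, storing the pickup time as epoch milliseconds."""
    # pandas treats a tz-naive Timestamp as UTC when computing .timestamp().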
    return Ride(
        PULocationID=int(row['PULocationID']),
        DOLocationID=int(row['DOLocationID']),
        trip_distance=float(row['trip_distance']),
        total_amount=float(row['total_amount']),
        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),
    )


def ride_serializer(ride):
    ride_dict = dataclasses.asdict(ride)
    ride_json = json.dumps(ride_dict).encode('utf-8')
    return ride_json


def ride_deserializer(data):
    json_str = data.decode('utf-8')
    ride_dict = json.loads(json_str)
    return Ride(**ride_dict)


================================================
FILE: 07-streaming/workshop/live/src/producers/producer_realtime.py
================================================
import dataclasses
import json
import random
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent))

from kafka import KafkaProducer
from models import Ride

# Top pickup locations from the actual NYC yellow taxi data.
# PULocationID is a taxi zone ID defined by the NYC TLC: 1-263 are named zones,
# while 264 and 265 mark unknown or out-of-city locations.
# See https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
PICKUP_LOCATIONS = [
    79,   # East Village, Manhattan
    107,  # Gramercy, Manhattan
    48,   # Clinton East (Hell's Kitchen), Manhattan
    132,  # JFK Airport
    234,  # Union Sq, Manhattan
    148,  # Lower East Side, Manhattan
    249,  # West Village, Manhattan
    68,   # East Chelsea, Manhattan
    90,   # Flatiron, Manhattan
    263,  # Yorkville West, Manhattan
    138,  # LaGuardia Airport
    230,  # Times Sq/Theatre District, Manhattan
    161,  # Midtown Center, Manhattan
    162,  # Midtown East, Manhattan
    170,  # Murray Hill, Manhattan
    237,  # Upper East Side South, Manhattan
    239,  # Upper West Side South, Manhattan
    186,  # Penn Station/Madison Sq West, Manhattan
    164,  # Midtown South, Manhattan
    236,  # Upper East Side North, Manhattan
]

DROPOFF_LOCATIONS = PICKUP_LOCATIONS  # same pool for simplicity


def make_ride(delay_seconds=0):
    now_ms = int(time.time() * 1000) - delay_seconds * 1000
    return Ride(
        PULocationID=random.choice(PICKUP_LOCATIONS),
        DOLocationID=random.choice(DROPOFF_LOCATIONS),
        trip_distance=round(random.uniform(0.5, 20.0), 2),
        total_amount=round(random.uniform(5.0, 100.0), 2),
        tpep_pickup_datetime=now_ms,
    )


def ride_serializer(ride):
    return json.dumps(dataclasses.asdict(ride)).encode('utf-8')


server = 'localhost:9092'
producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=ride_serializer,
)

topic_name = 'rides'
count = 0

print("Sending events (Ctrl+C to stop)...")
print()

try:
    while True:
        # ~20% chance of a late event (3-10 seconds old)
        if random.random() < 0.2:
            delay = random.randint(3, 10)
            ride = make_ride(delay_seconds=delay)
            ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc)
            print(f"  LATE ({delay}s) -> PU={ride.PULocationID} ts={ts:%H:%M:%S}")
        else:
            ride = make_ride()
            ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc)
            print(f"  on time   -> PU={ride.PULocationID} ts={ts:%H:%M:%S}")

        producer.send(topic_name, value=ride)
        count += 1
        time.sleep(0.5)

except KeyboardInterrupt:
    producer.flush()
    print(f"\nSent {count} events")


================================================
FILE: 07-streaming/workshop/pyproject.flink.toml
================================================
[project]
name = "pyflink-workshop"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "apache-flink==2.2.0",
]


================================================
FILE: 07-streaming/workshop/pyproject.toml
================================================
[project]
name = "workshop"
version = "0.1.0"
description = "PyFlink Stream Processing Workshop"
requires-python = ">=3.12"
dependencies = [
    "kafka-python>=2.3.0",
    "pandas>=2.2.0",
    "psycopg2-binary>=2.9.11",
    "pyarrow>=19.0.0",
]


================================================
FILE: 07-streaming/workshop/src/consumers/consumer.py
================================================
import sys
from datetime import datetime, timezone
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent))

from kafka import KafkaConsumer
from models import ride_deserializer

server = 'localhost:9092'
topic_name = 'rides'

consumer = KafkaConsumer(
    topic_name,
    bootstrap_servers=[server],
    auto_offset_reset='earliest',
    group_id='rides-console',
    value_deserializer=ride_deserializer
)

print(f"Listening to {topic_name}...")

count = 0
for message in consumer:
    ride = message.value
    # tpep_pickup_datetime is epoch milliseconds; interpret it as UTC,
    # matching how the producers encode it.
    pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc)
    print(f"Received: PU={ride.PULocationID}, DO={ride.DOLocationID}, "
          f"distance={ride.trip_distance}, amount=${ride.total_amount:.2f}, "
          f"pickup={pickup_dt}")
    count += 1
    if count >= 10:
        print(f"\n... received {count} messages (stopping after 10 for the demo)")
        break

consumer.close()


================================================
FILE: 07-streaming/workshop/src/consumers/consumer_postgres.py
================================================
import sys
from datetime import datetime, timezone
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent))

import psycopg2
from kafka import KafkaConsumer
from models import ride_deserializer

server = 'localhost:9092'
topic_name = 'rides'

# Connect to PostgreSQL
conn = psycopg2.connect(
    host='localhost',
    port=5432,
    database='postgres',
    user='postgres',
    password='postgres'
)
conn.autocommit = True
cur = conn.cursor()

consumer = KafkaConsumer(
    topic_name,
    bootstrap_servers=[server],
    auto_offset_reset='earliest',
    group_id='rides-to-postgres',
    value_deserializer=ride_deserializer
)

print(f"Listening to {topic_name} and writing to PostgreSQL...")

count = 0
try:
    # Iterating over the consumer blocks indefinitely; stop with Ctrl+C.
    for message in consumer:
        ride = message.value
        # tpep_pickup_datetime is epoch milliseconds; convert to a naive UTC
        # datetime so the stored value does not depend on the local timezone.
        pickup_dt = datetime.fromtimestamp(
            ride.tpep_pickup_datetime / 1000, tz=timezone.utc
        ).replace(tzinfo=None)
        cur.execute(
            """INSERT INTO processed_events
               (PULocationID, DOLocationID, trip_distance, total_amount, pickup_datetime)
               VALUES (%s, %s, %s, %s, %s)""",
            (ride.PULocationID, ride.DOLocationID,
             ride.trip_distance, ride.total_amount, pickup_dt)
        )
        count += 1
        if count % 100 == 0:
            print(f"Inserted {count} rows...")
except KeyboardInterrupt:
    print(f"\nStopped after inserting {count} rows")
finally:
    # Release the Kafka and PostgreSQL connections even when interrupted.
    consumer.close()
    cur.close()
    conn.close()


================================================
FILE: 07-streaming/workshop/src/job/aggregation_job.py
================================================
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment


def create_events_aggregated_sink(t_env):
    table_name = 'processed_events_aggregated'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            window_start TIMESTAMP(3),
            PULocationID INT,
            num_trips BIGINT,
            total_revenue DOUBLE,
            PRIMARY KEY (window_start, PULocationID) NOT ENFORCED
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name

def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            tpep_pickup_datetime BIGINT,
            event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),
            WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda:29092',
            'topic' = 'rides',
            'scan.startup.mode' = 'earliest-offset',
            'properties.auto.offset.reset' = 'earliest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name


def log_aggregation():
    # Set up the execution environment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)
    env.set_parallelism(3)

    # Set up the table environment
    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    try:
        # Create the Kafka source table and the JDBC sink table
        source_table = create_events_source_kafka(t_env)
        aggregated_table = create_events_aggregated_sink(t_env)

        t_env.execute_sql(f"""
        INSERT INTO {aggregated_table}
        SELECT
            window_start,
            PULocationID,
            COUNT(*) AS num_trips,
            SUM(total_amount) AS total_revenue
        FROM TABLE(
            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '1' HOUR)
        )
        GROUP BY window_start, PULocationID;

        """).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_aggregation()


================================================
FILE: 07-streaming/workshop/src/job/aggregation_job_demo.py
================================================
"""
Demo aggregation job with 10-second tumbling windows.

Use with producer_realtime.py to observe watermark behavior:
- Watermark = latest event_timestamp seen so far - 5 seconds
- A window closes once the watermark passes the end of that window
- Events up to 5s late arrive before the watermark closes their window -> included
- Events more than 5s late may arrive after their window has closed -> dropped
"""

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment


def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            tpep_pickup_datetime BIGINT,
            event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),
            WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda:29092',
            'topic' = 'rides',
            'scan.startup.mode' = 'latest-offset',
            'properties.auto.offset.reset' = 'latest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name


def create_events_aggregated_sink(t_env):
    table_name = 'processed_events_aggregated'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            window_start TIMESTAMP(3),
            PULocationID INT,
            num_trips BIGINT,
            total_revenue DOUBLE,
            PRIMARY KEY (window_start, PULocationID) NOT ENFORCED
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name


def log_aggregation():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)
    env.set_parallelism(1)

    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    try:
        source_table = create_events_source_kafka(t_env)
        aggregated_table = create_events_aggregated_sink(t_env)

        # 10-second tumbling windows (instead of 1 hour) so we can
        # observe windows closing and late events being dropped
        t_env.execute_sql(f"""
        INSERT INTO {aggregated_table}
        SELECT
            window_start,
            PULocationID,
            COUNT(*) AS num_trips,
            SUM(total_amount) AS total_revenue
        FROM TABLE(
            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '10' SECOND)
        )
        GROUP BY window_start, PULocationID;

        """).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_aggregation()


================================================
FILE: 07-streaming/workshop/src/job/pass_through_job.py
================================================
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment


def create_processed_events_sink_postgres(t_env):
    table_name = 'processed_events'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            pickup_datetime TIMESTAMP
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name


def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            tpep_pickup_datetime BIGINT
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda:29092',
            'topic' = 'rides',
            'scan.startup.mode' = 'latest-offset',
            'properties.auto.offset.reset' = 'latest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name

def log_processing():
    # Set up the execution environment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)

    # Set up the table environment
    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)
    try:
        # Create the Kafka source table and the PostgreSQL sink table
        source_table = create_events_source_kafka(t_env)
        postgres_sink = create_processed_events_sink_postgres(t_env)
        # Write records to PostgreSQL; wait() blocks while the streaming job runs
        t_env.execute_sql(
            f"""
            INSERT INTO {postgres_sink}
            SELECT
                PULocationID,
                DOLocationID,
                trip_distance,
                total_amount,
                TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3) as pickup_datetime
            FROM {source_table}
            """
        ).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_processing()


================================================
FILE: 07-streaming/workshop/src/models.py
================================================
import json
from dataclasses import dataclass


@dataclass
class Ride:
    PULocationID: int
    DOLocationID: int
    trip_distance: float
    total_amount: float
    tpep_pickup_datetime: int  # epoch milliseconds


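def ride_from_row(row):
    """Build a Ride from a DataFrame row, storing the pickup time as epoch milliseconds."""
    # pandas treats a tz-naive Timestamp as UTC when computing .timestamp().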
    return Ride(
        PULocationID=int(row['PULocationID']),
        DOLocationID=int(row['DOLocationID']),
        trip_distance=float(row['trip_distance']),
        total_amount=float(row['total_amount']),
        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),
    )


def ride_deserializer(data):
    json_str = data.decode('utf-8')
    ride_dict = json.loads(json_str)
    return Ride(**ride_dict)


================================================
FILE: 07-streaming/workshop/src/producers/producer.py
================================================
import dataclasses
import json
import sys
import time
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent))

import pandas as pd
from kafka import KafkaProducer
from models import Ride, ride_from_row

# Download the November 2025 NYC yellow taxi trip data and keep the first 1000 rows
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet"
columns = ['PULocationID', 'DOLocationID', 'trip_distance', 'total_amount', 'tpep_pickup_datetime']
df = pd.read_parquet(url, columns=columns).head(1000)

def ride_serializer(ride):
    ride_dict = dataclasses.asdict(ride)
    json_str = json.dumps(ride_dict)
    return json_str.encode('utf-8')

server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=ride_serializer
)
t0 = time.time()

topic_name = 'rides'

for _, row in df.iterrows():
    ride = ride_from_row(row)
    producer.send(topic_name, value=ride)
    print(f"Sent: {ride}")
    time.sleep(0.01)

producer.flush()

t1 = time.time()
print(f'took {(t1 - t0):.2f} seconds')


================================================
FILE: 07-streaming/workshop/src/producers/producer_realtime.py
================================================
import dataclasses
import json
import random
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent))

from kafka import KafkaProducer
from models import Ride

# Top pickup locations from the actual NYC yellow taxi data.
# PULocationID is a taxi zone ID defined by the NYC TLC: 1-263 are named zones,
# while 264 and 265 mark unknown or out-of-city locations.
# See https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
PICKUP_LOCATIONS = [
    79,   # East Village, Manhattan
    107,  # Gramercy, Manhattan
    48,   # Clinton East (Hell's Kitchen), Manhattan
    132,  # JFK Airport
    234,  # Union Sq, Manhattan
    148,  # Lower East Side, Manhattan
    249,  # West Village, Manhattan
    68,   # East Chelsea, Manhattan
    90,   # Flatiron, Manhattan
    263,  # Yorkville West, Manhattan
    138,  # LaGuardia Airport
    230,  # Times Sq/Theatre District, Manhattan
    161,  # Midtown Center, Manhattan
    162,  # Midtown East, Manhattan
    170,  # Murray Hill, Manhattan
    237,  # Upper East Side South, Manhattan
    239,  # Upper West Side South, Manhattan
    186,  # Penn Station/Madison Sq West, Manhattan
    164,  # Midtown South, Manhattan
    236,  # Upper East Side North, Manhattan
]

DROPOFF_LOCATIONS = PICKUP_LOCATIONS  # same pool for simplicity


def make_ride(delay_seconds=0):
    now_ms = int(time.time() * 1000) - delay_seconds * 1000
    return Ride(
        PULocationID=random.choice(PICKUP_LOCATIONS),
        DOLocationID=random.choice(DROPOFF_LOCATIONS),
        trip_distance=round(random.uniform(0.5, 20.0), 2),
        total_amount=round(random.uniform(5.0, 100.0), 2),
        tpep_pickup_datetime=now_ms,
    )


def ride_serializer(ride):
    return json.dumps(dataclasses.asdict(ride)).encode('utf-8')


server = 'localhost:9092'
producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=ride_serializer,
)

topic_name = 'rides'
count = 0

print("Sending events (Ctrl+C to stop)...")
print()

try:
    while True:
        # ~20% chance of a late event (3-10 seconds old)
        if random.random() < 0.2:
            delay = random.randint(3, 10)
            ride = make_ride(delay_seconds=delay)
            ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc)
            print(f"  LATE ({delay}s) -> PU={ride.PULocationID} ts={ts:%H:%M:%S}")
        else:
            ride = make_ride()
            ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc)
            print(f"  on time   -> PU={ride.PULocationID} ts={ts:%H:%M:%S}")

        producer.send(topic_name, value=ride)
        count += 1
        time.sleep(0.5)

except KeyboardInterrupt:
    producer.flush()
    print(f"\nSent {count} events")


================================================
FILE: README.md
================================================
<p align="center">
  <img width="100%" src="/images/architecture/arch_v5_workshops.png" alt="Data Engineering Zoomcamp Overview">
</p>

<h1 align="center">
    <strong>Data Engineering Zoomcamp: A Free 9-Week Course on Data Engineering Fundamentals</strong>
</h1>

<p align="center">
Master the fundamentals of data engineering by building an end-to-end data pipeline from scratch. Gain hands-on experience with industry-standard tools and best practices.
</p>

<p align="center">
<a href="https://airtable.com/shr6oVXeQvSI5HuWD"><img src="https://user-images.githubusercontent.com/875246/185755203-17945fd1-6b64-46f2-8377-1011dcb1a444.png" height="50" /></a>
</p>

<p align="center">
<a href="https://datatalks.club/slack.html">Join Slack</a> •
<a href="https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG">#course-data-engineering Channel</a> •
<a href="https://t.me/dezoomcamp">Telegram Announcements</a> •
<a href="https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb">Course Playlist</a> •
<a href="https://datatalks.club/faq/data-engineering-zoomcamp.html">FAQ</a>
</p>

## How to Enroll

### 2026 Cohort
- **Start Date**: 12 January 2026
- **Register Here**: [Sign up](https://airtable.com/shr6oVXeQvSI5HuWD)

### Self-Paced Learning
All course materials are freely available for independent study. Follow these steps:
1. Watch the course videos.
2. Join the [Slack community](https://datatalks.club/slack.html).
3. Refer to the [FAQ document](https://datatalks.club/faq/data-engineering-zoomcamp.html) for guidance.

## Syllabus Overview
The course consists of structured modules, hands-on workshops, and a final project to reinforce your learning.

### **Prerequisites**
To get the most out of this course, you should have:
- Basic coding experience
- Familiarity with SQL
- Experience with Python (helpful but not required)

No prior data engineering experience is necessary.

### **Modules**

#### [Module 1: Containerization and Infrastructure as Code](01-docker-terraform/)
- Introduction to GCP
- Docker and Docker Compose
- Running PostgreSQL with Docker
- Infrastructure setup with Terraform
- Homework

#### [Module 2: Workflow Orchestration](02-workflow-orchestration/)
- Data Lakes and Workflow Orchestration
- Workflow orchestration with Kestra
- Homework

#### [Workshop 1: Data Ingestion](cohorts/2026/workshops/dlt.md)
- API reading and pipeline scalability
- Data normalization and incremental loading
- Homework

#### [Module 3: Data Warehousing](03-data-warehouse/)
- Introduction to BigQuery
- Partitioning, clustering, and best practices
- Machine learning in BigQuery

#### [Module 4: Analytics Engineering](04-analytics-engineering/)
- Analytics Engineering and Data Modeling
- dbt (data build tool) with DuckDB & BigQuery
- Testing, documentation, and deployment

#### [Module 5: Data Platforms](05-data-platforms/)
- Building end-to-end data pipelines with Bruin
- Data ingestion, transformation, and quality
- Deployment to cloud (BigQuery)

#### [Module 6: Batch Processing](06-batch/)
- Introduction to Apache Spark
- DataFrames and SQL
- Internals of GroupBy and Joins

#### [Module 7: Streaming](07-streaming/)
- Introduction to Kafka
- Kafka Streams and KSQL
- Schema management with Avro

#### [Final Project](projects/)
- Apply all concepts learned in a real-world scenario
- Peer review and feedback process

## Testimonials
> Thank you for what you do! The Data Engineering Zoomcamp gave me skills that helped me land my first tech job.
> 
> — [Tim Claytor](https://www.linkedin.com/in/claytor/) ([Source](https://www.linkedin.com/feed/update/urn:li:activity:7396882073308938240?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7396882073308938240%2C7396889959711793152%29&dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287396889959711793152%2Curn%3Ali%3Aactivity%3A7396882073308938240%29))

> Three months might seem like a long time, but the growth and learning during this period are truly remarkable. It was a great experience with a lot of learning, connecting with like-minded people from all around the world, and having fun. I must admit, this was really hard. But the feeling of accomplishment and learning made it all worthwhile. And I would do it again!
>
> — [Nevenka Lukic](https://www.linkedin.com/in/nevenka-lukic/) ([Source](https://www.linkedin.com/posts/nevenka-lukic_data-engineering-zoomcamp-final-project-activity-7181985646033461248-Lc1O?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4))

> One of the significant things I inferred from the Zoomcamp is to prioritize fundamentals and principles over ever-evolving tools and tech stacks. Hugely grateful to Alexey Grigorev for putting together this incredible course and offering it for free.
>
> — [Siddhartha Gogoi](https://www.linkedin.com/in/siddhartha-gogoi/) ([Source](https://www.linkedin.com/posts/activity-7325692407675604992-XSKI?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4))

> Such a fun deep dive into data engineering, cloud automation, and orchestration. I learned so much along the way. Big shoutout to Alexey Grigorev and the DataTalksClub team for the opportunity and guidance throughout the 3 months of the free course.
>
> — [Assitan NIARE](https://www.linkedin.com/in/assitan-niar%C3%A9-data/) ([Source](https://www.linkedin.com/posts/activity-7317441554023874561-E3wm?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4))

> If you’re serious about breaking into data engineering, start here. The repo’s structure, community, and hands-on focus make it unparalleled.
> 
> — [Wady Osama](https://www.linkedin.com/in/wadyosama/) ([Source](https://www.linkedin.com/posts/wadyosama_dataengineering-zoomcamp-dezoomcamp-activity-7292126824711520258-puJm?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4))

## Community & Support

### **Getting Help on Slack**
Join the [`#course-data-engineering`](https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG) channel on [DataTalks.Club Slack](https://datatalks.club/slack.html) for discussions, troubleshooting, and networking.

To keep discussions organized:
- Follow [our guidelines](asking-questions.md) when posting questions.
- Review the [community guidelines](https://datatalks.club/slack/guidelines.html).

## Meet the Instructors

- [Alexey Grigorev](https://linkedin.com/in/agrigorev)
- [Michael Shoemaker](https://www.linkedin.com/in/michaelshoemaker1/)
- [Will Russell](https://www.linkedin.com/in/wrussell1999/)
- [Anna Geller](https://www.linkedin.com/in/anna-geller-12a86811a/)
- [Juan Manuel Perafan](https://www.linkedin.com/in/jmperafan/)
- [Arsalan Noorafkan](https://www.linkedin.com/in/arsalan0/)

Past instructors:

- [Victoria Perez Mola](https://www.linkedin.com/in/victoriaperezmola/)
- [Ankush Khanna](https://linkedin.com/in/ankushkhanna2)
- [Sejal Vaidya](https://www.linkedin.com/in/vaidyasejal/)
- [Irem Erturk](https://www.linkedin.com/in/iremerturk/)
- [Luis Oliveira](https://www.linkedin.com/in/lgsoliveira/)
- [Zach Wilson](https://www.linkedin.com/in/eczachly)

## Sponsors & Supporters
A special thanks to our course sponsors for making this initiative possible!

<p align="center">
  <a href="https://kestra.io/">
    <img height="120" src="images/kestra.svg">
  </a>
</p>

<p align="center">
  <a href="https://getbruin.com/">
    <img height="110" src="images/bruin.svg">
  </a>
</p>


<p align="center">
  <a href="https://dlthub.com/">
    <img height="90" src="images/dlthub.png">
  </a>
</p>

Interested in supporting our community? Reach out to [alexey@datatalks.club](mailto:alexey@datatalks.club).

## About DataTalks.Club

<p align="center">
  <img width="40%" src="https://github.com/user-attachments/assets/1243a44a-84c8-458d-9439-aaf6f3a32d89" alt="DataTalks.Club">
</p>

<p align="center">
<a href="https://datatalks.club/">DataTalks.Club</a> is a global online community of data enthusiasts. It's a place to discuss data, learn, share knowledge, ask and answer questions, and support each other.
</p>

<p align="center">
<a href="https://datatalks.club/">Website</a> •
<a href="https://datatalks.club/slack.html">Join Slack Community</a> •
<a href="https://us19.campaign-archive.com/home/?u=0d7822ab98152f5afc118c176&id=97178021aa">Newsletter</a> •
<a href="http://lu.ma/dtc-events">Upcoming Events</a> •
<a href="https://www.youtube.com/@DataTalksClub/featured">YouTube</a> •
<a href="https://github.com/DataTalksClub">GitHub</a> •
<a href="https://www.linkedin.com/company/datatalks-club/">LinkedIn</a> •
<a href="https://twitter.com/DataTalksClub">Twitter</a>
</p>

All the activity at DataTalks.Club mainly happens on [Slack](https://datatalks.club/slack.html). We post updates there and discuss different aspects of data, career questions, and more.

At DataTalksClub, we organize online events, community activities, and free courses. You can learn more about what we do at [DataTalksClub Community Navigation](https://www.notion.so/DataTalksClub-Community-Navigation-bf070ad27ba44bf6bbc9222082f0e5a8?pvs=21).


================================================
FILE: after-sign-up.md
================================================
## Thank you!

Thanks for signing up for the course.

The process of adding you to the mailing list is not automated yet, 
but you will hear from us closer to the course start. 

To make sure you don't miss any announcements:

- Register in [DataTalks.Club's Slack](https://datatalks.club/slack.html) and
  join the [`#course-data-engineering`](https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG) channel
- Join the [course Telegram channel with announcements](https://t.me/dezoomcamp)
- Subscribe to [DataTalks.Club's YouTube channel](https://www.youtube.com/c/DataTalksClub) and check 
  [the course playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)

See you in January!


================================================
FILE: asking-questions.md
================================================
## Asking questions

If you have any questions, ask them 
in the [`#course-data-engineering`](https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG) channel in the [DataTalks.Club](https://datatalks.club) Slack.

To keep our discussion in Slack more organized, we ask you to follow these suggestions:

* First, review [How to troubleshoot issues](#how-to-troubleshoot-issues) below.
* Before asking a question, check the [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html).
* Before asking a question, review the [Slack guidelines](#asking-in-slack).
* If somebody helped you with your problem and it's not in [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html), please add it there.
  It'll help other students.
* Zed Shaw (of the Learn the Hard Way series) has [a great post on how to help others help you](https://learncodethehardway.com/blog/03-how-to-ask-for-help/).
* Check the [Stack Overflow guide on asking](https://stackoverflow.com/help/how-to-ask).
  
### How to troubleshoot issues

The first step is to try to solve the issue on your own; get used to solving problems. This is a real-life skill you will need when employed.

1. What does the error say? There will often be a description of the error or instructions on what is needed. I have even seen a link to the solution. Does it reference a specific line of your code?
2. Restart the application or server/pc. 
3. Google it. You will rarely be the first to have the problem; someone out there has posted the issue and likely the solution. Search using: **technology** **problem statement**. Example: `pgcli error column c.relhasoids does not exist`. 
    * There are often different solutions for the same problem due to variation in environments. 
4. Check the tech’s documentation. Use its search if available or use the browser's search function. 
5. Try uninstalling (this may remove the bad actor) and reinstalling the application, or redoing the action. Don’t forget to restart the server/pc after reinstalling.
    * Sometimes reinstalling fails to resolve the issue but works if you uninstall first.
6. Ask in Slack
7. Take a break and come back to it later. You will be amazed at how often you figure out the solution after letting your brain rest. Get some fresh air, workout, play a video game, watch a tv show, whatever allows your brain to not think about it for a little while or even until the next day. 
8. Remember that technology issues in real life sometimes take days or even weeks to resolve.

### Asking in Slack

* Before asking a question, check the [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html).
* DO NOT use screenshots, especially don’t take pictures from a phone.
* DO NOT tag instructors; it may discourage others from helping you.
* Copy and paste errors; if it’s long, just post it in a reply to your thread. 
* Use ``` for formatting your code.
* Use the same thread for the conversation (that means replying to your own thread). 
* DO NOT create multiple posts to discuss the issue.
* You may create a new post if the issue reemerges down the road. Be sure to describe what has changed in the environment.
* Provide additional information in the same thread of the steps you have taken for resolution.
  

================================================
FILE: awesome-data-engineering.md
================================================
Have you found any cool resources about data engineering? Put them here.

## Learning Data Engineering

### Courses

* [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp) by DataTalks.Club (free)
* [Big Data Platforms, Autumn 2022: Introduction to Big Data Processing Frameworks](https://big-data-platforms-22.mooc.fi/) by the University of Helsinki (free)   
* [Awesome Data Engineering Learning Path](https://awesomedataengineering.com/)


### Books

* [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)
* [Big Data: Principles and Best Practices of Scalable Realtime Data Systems by Nathan Marz, James Warren](https://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343)
* [Practical DataOps: Delivering Agile Data Science at Scale by Harvinder Atwal](https://www.amazon.com/Practical-DataOps-Delivering-Agile-Science/dp/1484251032)
* [Data Pipelines Pocket Reference: Moving and Processing Data for Analytics by James Densmore](https://www.amazon.com/Data-Pipelines-Pocket-Reference-Processing/dp/1492087831)
* [Best books for data engineering](https://awesomedataengineering.com/data_engineering_best_books)
* [Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis, Matt Housley](https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302)


### Introduction to Data Engineering Terms

* [https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html](https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html) 


### Data engineering in practice

Conference talks from companies, blog posts, etc

* [Uber Data Archives](https://eng.uber.com/category/articles/uberdata/) (Uber engineering blog)
* [Data Engineering Weekly (DE-focused substack)](https://www.dataengineeringweekly.com/)
* [Seattle Data Guy (DE-focused substack)](https://seattledataguy.substack.com/) 


## Doing Data Engineering

### Coding & Python

* [CS50's Introduction to Computer Science | edX](https://www.edx.org/course/introduction-computer-science-harvardx-cs50x) (course)
* [Python for Everybody Specialization](https://www.coursera.org/specializations/python) (course)
* [Practical Python programming](https://github.com/dabeaz-course/practical-python/blob/master/Notes/Contents.md)


### SQL

* [Intro to SQL: Querying and managing data | Khan Academy](https://www.khanacademy.org/computing/computer-programming/sql) 
* [Mode SQL Tutorial](https://mode.com/sql-tutorial/)
* [Use The Index, Luke](https://use-the-index-luke.com/) (SQL Indexing and Tuning e-Book)
* [SQL Performance Explained](https://sql-performance-explained.com/) (book)


### Workflow orchestration

* [What is DAG?](https://youtu.be/1Yh5S-S6wsI) (video) 
* [Airflow, Prefect, and Dagster: An Inside Look](https://towardsdatascience.com/airflow-prefect-and-dagster-an-inside-look-6074781c9b77) (blog post) 
* [Open-Source Spotlight - Prefect - Kevin Kho](https://www.youtube.com/watch?v=ISLV9JyqF1w) (video) 
* [Prefect as a Data Engineering Project Workflow Tool, with Mary Clair Thompson (Duke) - 11/6/2020](https://youtu.be/HuwA4wLQtCM) (video) 


### ETL and ELT

* [ETL vs. ELT: What’s the Difference?](https://rivery.io/blog/etl-vs-elt/) (blog post) (print version)

### Data lakes

* [An Introduction to Modern Data Lake Storage Layers (Hudi, Iceberg, Delta Lake)](https://dacort.dev/posts/modern-data-lake-storage-layers/) (blog post) 
* [Lake House Architecture @ Halodoc: Data Platform 2.0](https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/amp/) (blog post) 


### Data warehousing


* [Guide to Data Warehousing. Short and comprehensive information… | by Tomas Peluritis](https://medium.com/towards-data-science/guide-to-data-warehousing-6fdcf30b6fbe) (blog post) 
* [Snowflake, Redshift, BigQuery, and Others: Cloud Data Warehouse Tools Compared](https://www.altexsoft.com/blog/snowflake-redshift-bigquery-data-warehouse-tools/) (blog post)


### Streaming


*   Building Streaming Analytics: The Journey and Learnings - Maxim Lukichev

### DataOps

* [DataOps 101 with Lars Albertsson – DataTalks.Club](https://datatalks.club/podcast/s02e11-dataops.html) (podcast)


### Monitoring and observability 

* [Data Observability: The Next Frontier of Data Engineering with Barr Moses](https://datatalks.club/podcast/s03e03-data-observability.html) (podcast)


### Analytics engineering

* [Analytics Engineer: New Role in a Data Team with Victoria Perez Mola](https://datatalks.club/podcast/s03e11-analytics-engineer.html) (podcast)
* [Modern Data Stack for Analytics Engineering - Kyle Shannon](https://www.youtube.com/watch?v=UmIZIkeOfi0) (video) 
* [Analytics Engineering vs Data Engineering | RudderStack Blog](https://www.rudderstack.com/blog/analytics-engineering-vs-data-engineering) (blog post)
* [Learn the Fundamentals of Analytics Engineering with dbt](https://courses.getdbt.com/courses/fundamentals) (course)


### Data mesh

* [Data Mesh in Practice - Max Schultze](https://www.youtube.com/watch?v=ekEc8D_D3zY) (video)

### Cloud

* [https://acceldataio.medium.com/data-engineering-best-practices-how-netflix-keeps-its-data-infrastructure-cost-effective-dee310bcc910](https://acceldataio.medium.com/data-engineering-best-practices-how-netflix-keeps-its-data-infrastructure-cost-effective-dee310bcc910) 


### Reverse ETL

* TODO: What is reverse ETL?
* [https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html](https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html) 
* [Open-Source Spotlight - Grouparoo - Brian Leonard](https://www.youtube.com/watch?v=hswlcgQZYuw) (video) 
* [Open-Source Spotlight - Castled.io (Reverse ETL) - Arun Thulasidharan](https://www.youtube.com/watch?v=iW0XhltAUJ8) (video) 

## Career in Data Engineering

* [From Data Science to Data Engineering with Ellen König – DataTalks.Club](https://datatalks.club/podcast/s07e08-from-data-science-to-data-engineering.html) (podcast)
* [Big Data Engineer vs Data Scientist with Roksolana Diachuk – DataTalks.Club](https://datatalks.club/podcast/s04e03-big-data-engineer-vs-data-scientist.html) (podcast)
* [What Skills Do You Need to Become a Data Engineer](https://www.linkedin.com/pulse/what-skills-do-you-need-become-data-engineer-peng-wang/) (blog post) 
* [The future history of Data Engineering](https://groupby1.substack.com/p/data-engineering?s=r) (blog post) 
* [What Skills Do Data Engineers Need](https://www.theseattledataguy.com/what-skills-do-data-engineers-need/) (blog post)

### Data Engineering Management 

* [Becoming a Data Engineering Manager with Rahul Jain – DataTalks.Club](https://datatalks.club/podcast/s07e07-becoming-a-data-engineering-manager.html) (podcast)

## Data engineering projects

* [How To Start A Data Engineering Project - With Data Engineering Project Ideas](https://www.youtube.com/watch?v=WpN47Jddo7I) (video)
* [Data Engineering Project for Beginners - Batch edition](https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/) (blog post)
* [Building a Data Engineering Project in 20 Minutes](https://www.sspaeti.com/blog/data-engineering-project-in-twenty-minutes/) (blog post)
* [Automating Nike Run Club Data Analysis with Python, Airflow and Google Data Studio | by Rich Martin | Medium](https://medium.com/@rich_23525/automating-nike-run-club-data-analysis-with-python-airflow-and-google-data-studio-3c9556478926) (blog post)


## Data Engineering Resources 

### Blogs

* [Start Data Engineering](https://www.startdataengineering.com/)

### Podcasts

* [The Data Engineering Podcast](https://www.dataengineeringpodcast.com/)
* [DataTalks.Club Podcast](https://datatalks.club/podcast.html) (only some episodes are about data engineering) 

### Communities

* [DataTalks.Club](https://datatalks.club/)
* [/r/dataengineering](https://www.reddit.com/r/dataengineering) 


### Meetups

* [Sydney Data Engineers](https://sydneydataengineers.github.io/) 

### People to follow on Twitter and LinkedIn

* TODO

### YouTube channels

* [Karolina Sowinska - YouTube](https://www.youtube.com/channel/UCAxnMry1lETl47xQWABvH7g)
* [Seattle Data Guy - YouTube](https://www.youtube.com/c/SeattleDataGuy) 
* [Andreas Kretz - YouTube](https://www.youtube.com/c/andreaskayy) 
* [DataTalksClub - YouTube](https://youtube.com/c/datatalksclub) (only some videos are about data engineering) 

### Resource aggregators

* [Reading List](https://www.scling.com/reading-list/) by Lars Albertsson
* [GitHub - igorbarinov/awesome-data-engineering](https://github.com/igorbarinov/awesome-data-engineering) (focus is more on tools)
* [GitHub - DataExpert-io/data-engineer-handbook](https://github.com/DataExpert-io/data-engineer-handbook) (contains tools, blogs, and more)


## License

This work is licensed under a Creative Commons Attribution 4.0 International License.

CC BY 4.0


================================================
FILE: certificates.md
================================================
## Getting your certificate

Congratulations on finishing the course!

You can find your certificate in your enrollment profile (you need to be logged in):

* For the 2025 edition, it's https://courses.datatalks.club/de-zoomcamp-2025/enrollment

If you can't find a certificate in your profile, it means you didn't pass the project.
If you believe it's a mistake, write in the course channel in Slack.


## Adding to LinkedIn

You can add your certificate to LinkedIn:

* Log in to your LinkedIn account, then go to your profile.
* On the right, in the "Add profile" section dropdown, choose "Background" and then select the drop-down triangle next to "Licenses & Certifications".
* In "Name", enter "Data Engineering Zoomcamp".
* In "Issuing Organization", enter "DataTalksClub".
* (Optional) In "Issue Date", enter the time when the certificate was created.
* (Optional) Select the checkbox "This certification does not expire".
* Put your certificate ID.
* In "Certification URL", enter the URL for your certificate.

[Adapted from here](https://support.edx.org/hc/en-us/articles/206501938-How-can-I-add-my-certificate-to-my-LinkedIn-profile-)


================================================
FILE: cohorts/2022/README.md
================================================

### 2022 Cohort

* **Start**: 17 January 2022
* **Registration link**: https://airtable.com/shr6oVXeQvSI5HuWD
* [Leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vR9oQiYnAVvzL4dagnhvp0sngqagF0AceD0FGjhS-dnzMTBzNQIal3-hOgkTibVQvfuqbQ69b0fvRnf/pubhtml)


================================================
FILE: cohorts/2022/project.md
================================================
## Course Project

The goal of this project is to apply everything we learned
in this course and build an end-to-end data pipeline.

Remember that to pass the project, you must evaluate 3 peers. If you don't do that, your project can't be considered complete.


### Submitting 

#### Project Cohort #2

Project:

* Form: https://forms.gle/JECXB9jYQ1vBXbsw6
* Deadline: 2 May, 22:00 CET

Peer reviewing:

* Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vShnv8T4iY_5NA8h0nySIS8Wzr-DZGGigEikIW4ZMSi9HlvhaEB4RhwmepVIuIUGaQHS90r5iHR2YXV/pubhtml?gid=964123374&single=true)
* Form: https://forms.gle/Pb2fBwYLQ3GGFsaK6
* Deadline: 9 May, 22:00 CET


#### Project Cohort #1

Project:

* Form: https://forms.gle/6aeVcEVJipqR2BqC8
* Deadline: 4 April, 22:00 CET

Peer reviewing:

* Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vShnv8T4iY_5NA8h0nySIS8Wzr-DZGGigEikIW4ZMSi9HlvhaEB4RhwmepVIuIUGaQHS90r5iHR2YXV/pubhtml)
* Form: https://forms.gle/AZ62bXMp4SGcVUmK7
* Deadline: 11 April, 22:00 CET

Project feedback: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vRcVCkO-jes5mbPAcikn9X_s2laJ1KhsO8aibHYQxxKqdCUYMVTEJLJQdM8C5aAUWKFl_0SJW4rme7H/pubhtml)


================================================
FILE: cohorts/2022/week_1_basics_n_setup/homework.md
================================================
## Week 1 Homework

In this homework, we'll prepare the environment 
and practice with Terraform and SQL.


## Question 1. Google Cloud SDK

Install Google Cloud SDK. What's the version you have? 

To get the version, run `gcloud --version`

## Google Cloud account 

Create an account in Google Cloud and create a project.


## Question 2. Terraform 

Now install terraform and go to the terraform directory (`week_1_basics_n_setup/1_terraform_gcp/terraform`)

After that, run

* `terraform init`
* `terraform plan`
* `terraform apply` 

Apply the plan and copy the output (after running `apply`) to the form.

It should be the entire output - from the moment you typed `terraform init` to the very end.

## Prepare Postgres 

Run Postgres and load data as shown in the videos

We'll use the yellow taxi trips from January 2021:

```bash
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv
```

You will also need the dataset with zones:

```bash 
wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
```

Download this data and load it into Postgres.
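
The loading step can be sketched roughly as follows. This is a minimal sketch, not the script from the videos: it uses only the Python standard library, and `sqlite3` stands in for Postgres so that it runs without a server. The function name `load_csv`, the table name `yellow_taxi_trips`, the batch size, and the three sample rows are all illustrative assumptions.

```python
import csv
import io
import sqlite3
from itertools import islice


def load_csv(conn, table, lines, batch_size=100_000):
    """Insert CSV rows into `table` in batches; return the number of rows loaded.

    `lines` is any iterable of text lines (an open file works).
    Every column is created as TEXT to keep the sketch short.
    """
    reader = csv.reader(lines)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    # sqlite3 uses "?" placeholders; a Postgres driver such as psycopg2 uses "%s"
    placeholders = ", ".join("?" for _ in header)

    cur = conn.cursor()
    cur.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')

    total = 0
    while True:
        # Read at most `batch_size` rows so the whole file never sits in memory
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        cur.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', batch)
        conn.commit()
        total += len(batch)
    return total


# Tiny made-up sample in place of the real file
sample = io.StringIO(
    "tpep_pickup_datetime,PULocationID,tip_amount\n"
    "2021-01-15 08:00:00,43,2.5\n"
    "2021-01-15 09:30:00,161,0.0\n"
    "2021-01-16 10:00:00,43,4.0\n"
)
conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, "yellow_taxi_trips", sample, batch_size=2)
```

For the real dataset you would pass an open file instead of `sample`, for example `open("yellow_tripdata_2021-01.csv", newline="")`, and a Postgres connection instead of the in-memory SQLite one. Loading in batches keeps memory use flat even though the file has many rows.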

## Question 3. Count records 

How many taxi trips were there on January 15?

Consider only trips that started on January 15.


## Question 4. Largest tip for each day

Find the largest tip for each day. 
On which day in January was the largest tip given?

Use the pick up time for your calculations.

(note: it's not a typo, it's "tip", not "trip")


## Question 5. Most popular destination

What was the most popular destination for passengers picked up 
in Central Park on January 14?

Use the pick up time for your calculations.

Enter the zone name (not id). If the zone name is unknown (missing), write "Unknown" 


## Question 6. Most expensive locations

What's the pickup-dropoff pair with the largest 
average price for a ride (calculated based on `total_amount`)?

Enter two zone names separated by a slash

For example:

"Jamaica Bay / Clinton East"

If any of the zone names are unknown (missing), write "Unknown". For example, "Unknown / Clinton East". 


## Submitting the solutions

* Form for submitting: https://forms.gle/yGQrkgRdVbiFs8Vd7
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 26 January (Wednesday), 22:00 CET


## Solution

Here is the solution to questions 3-6: [video](https://www.youtube.com/watch?v=HxHqH2ARfxM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)


================================================
FILE: cohorts/2022/week_2_data_ingestion/README.md
================================================
## Week 2: Data Ingestion

### Data Lake (GCS)

* What is a Data Lake
* ELT vs. ETL
* Alternatives to components (S3/HDFS, Redshift, Snowflake etc.)

:movie_camera: [Video](https://www.youtube.com/watch?v=W3Zm6rjOq70&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)

[Slides](https://docs.google.com/presentation/d/1RkH-YhBz2apIjYZAxUz2Uks4Pt51-fVWVN9CcH9ckyY/edit?usp=sharing)


### Introduction to Workflow orchestration

* What is an Orchestration Pipeline?
* What is a DAG?
* [Video](https://www.youtube.com/watch?v=0yK7LXwYeD0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)


### Setting up Airflow locally

* Setting up Airflow with Docker-Compose
* [Video](https://www.youtube.com/watch?v=lqDMzReAtrw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* More information in the [airflow folder](airflow)

If you want to run a lighter version of Airflow with fewer services, check this [video](https://www.youtube.com/watch?v=A1p5LQ0zzaQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb). It's optional.


### Ingesting data to GCP with Airflow

* Extraction: Download and unpack the data
* Pre-processing: Convert this raw data to parquet
* Upload the parquet files to GCS
* Create an external table in BigQuery
* [Video](https://www.youtube.com/watch?v=9ksX9REfL8w&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=19)
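
The steps above form a small DAG: each task can start only after the one it depends on has finished. The sketch below illustrates that idea with the standard library's `graphlib` (Python 3.9+) rather than Airflow itself. The task names and the helper `run_order` are illustrative assumptions, not names from the course DAGs.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# Task names are made up and mirror the steps listed above.
dependencies = {
    "download_and_unpack": set(),
    "convert_to_parquet": {"download_and_unpack"},
    "upload_to_gcs": {"convert_to_parquet"},
    "create_bigquery_external_table": {"upload_to_gcs"},
}


def run_order(deps):
    """Return one valid execution order.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is exactly what makes a graph not a DAG.
    """
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(" -> ".join(order))
```

An orchestrator does the same ordering for you, and adds scheduling, retries, and logging on top of it.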

### Ingesting data to Local Postgres with Airflow

* Converting the ingestion script for loading data to Postgres to Airflow DAG
* [Video](https://www.youtube.com/watch?v=s2U8MWJH5xA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)


### Transfer service (AWS -> GCP)

Moving files from AWS to GCP.

You will need an AWS account for this. This section is optional

* [Video 1](https://www.youtube.com/watch?v=rFOFTfD1uGk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* [Video 2](https://www.youtube.com/watch?v=VhmmbqpIzeI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)


### Homework 

In the homework, you'll create a few DAGs for processing the NY Taxi data for 2019-2021

More information [here](homework.md)


## Community notes

Did you take notes? You can share them here.

* [Notes from Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/2_data_ingestion.md)
* [Notes from Aaron Wright](https://github.com/ABZ-Aaron/DataEngineerZoomCamp/blob/master/week_2_data_ingestion/README.md)
* [Notes from Abd](https://itnadigital.notion.site/Week-2-Data-Ingestion-ec2d0d36c0664bc4b8be6a554b2765fd)
* [Blog post by Isaac Kargar](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/25/data-engineering-w2.html)
* [Blog, notes, walkthroughs by Sandy Behrens](https://learningdataengineering540969211.wordpress.com/2022/01/30/week-2-de-zoomcamp-2-3-2-ingesting-data-to-gcp-with-airflow/)
* [Notes from Apurva Hegde](https://github.com/apuhegde/Airflow-LocalExecutor-In-Docker#readme)
* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)
* Add your notes here (above this line)


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/.env_example
================================================
# Custom
COMPOSE_PROJECT_NAME=dtc-de
GOOGLE_APPLICATION_CREDENTIALS=/.google/credentials/google_credentials.json
AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json
# AIRFLOW_UID=
GCP_PROJECT_ID=
GCP_GCS_BUCKET=

# Postgres
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# Airflow
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=10

AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}
AIRFLOW_CONN_METADATA_DB=postgres+psycopg2://airflow:airflow@postgres:5432/airflow
AIRFLOW_VAR__METADATA_DB_SCHEMA=airflow

_AIRFLOW_WWW_USER_CREATE=True
_AIRFLOW_WWW_USER_USERNAME=${_AIRFLOW_WWW_USER_USERNAME:-airflow}
_AIRFLOW_WWW_USER_PASSWORD=${_AIRFLOW_WWW_USER_PASSWORD:-airflow}

AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True
AIRFLOW__CORE__LOAD_EXAMPLES=False


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/1_setup_official.md
================================================
## Setup (Official)

### Pre-Reqs

1. For the sake of standardization across this workshop's config,
    rename your GCP service-account credentials file to `google_credentials.json` & store it under `$HOME/.google/credentials/`:
    ``` bash
        cd ~ && mkdir -p ~/.google/credentials/
        mv <path/to/your/service-account-authkeys>.json ~/.google/credentials/google_credentials.json
    ```

2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to a minimum of 5GB
(ideally 8GB). If not enough memory is allocated, the airflow-webserver may keep restarting.

3. Python version: 3.7+
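
Before moving on, it can help to confirm that the key file from step 1 is where the rest of the setup expects it. The following is a minimal sketch, not part of the course material: the helper name `check_credentials` and the short list of checked fields are assumptions chosen for illustration.

```python
import json
from pathlib import Path

# A few fields that a service-account key file contains; checking them is
# enough to catch a wrong, renamed, or truncated file. (Illustrative subset.)
REQUIRED_KEYS = ("type", "project_id", "private_key", "client_email")


def check_credentials(path="~/.google/credentials/google_credentials.json"):
    """Return a list of problems with the key file (an empty list means OK)."""
    path = Path(path).expanduser()
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return [f"{path} does not contain a JSON object"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]
    if data.get("type") != "service_account":
        problems.append('"type" is not "service_account"')
    return problems
```

Calling `check_credentials()` with no arguments inspects the default location used throughout this workshop and returns an empty list when nothing looks wrong.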


### Airflow Setup

1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in)

2. **Set the Airflow user**:

    On Linux, the quick-start needs to know your host user ID and needs the group ID set to 0. 
    Otherwise the files created in `dags`, `logs` and `plugins` will be owned by the root user. 
    Configure them for docker-compose as follows:

    ```bash
    mkdir -p ./dags ./logs ./plugins
    echo -e "AIRFLOW_UID=$(id -u)" > .env
    ```

    On Windows you will probably need this too. If you use MINGW/Git Bash, run the same commands. 

    To get rid of the warning ("AIRFLOW_UID is not set"), you can create `.env` file with
    this content:

    ```
    AIRFLOW_UID=50000
    ```

   
3. **Import the official docker setup file** from the latest Airflow version:
   ```shell
   curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'
   ```
   
4. The number of services in this file can be overwhelming at first. 
   But this is only a quick-start template, and as you proceed you'll figure out which unused services can be removed.
   E.g., [here's](docker-compose-nofrills.yml) a no-frills version of that template.

5. **Docker Build**:

    When you want to run Airflow locally, you might want to use an extended image, 
    containing some additional dependencies - for example you might add new python packages, 
    or upgrade airflow providers to a later version.
    
    Create a `Dockerfile` that uses the Airflow version you've just downloaded, 
    such as `apache/airflow:2.2.3`, as the base image.
       
    Then customize this `Dockerfile` by:
    * Adding your custom packages to be installed. The one we'll need the most is `gcloud` to connect with the GCS bucket/Data Lake.
    * Also, integrating `requirements.txt` to install libraries via  `pip install`

6. **Docker Compose**:

    Back in your `docker-compose.yaml`:
   * In `x-airflow-common`: 
     * Remove the `image` key and replace it with a `build` section that points to your Dockerfile
     * Mount your `google_credentials` in `volumes` section as read-only
     * Set environment variables: `GCP_PROJECT_ID`, `GCP_GCS_BUCKET`, `GOOGLE_APPLICATION_CREDENTIALS` & `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`, as per your config.
   * Change `AIRFLOW__CORE__LOAD_EXAMPLES` to `false` (optional)

7. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose.yaml](./docker-compose.yaml) should look.
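
Step 6 relies on four environment variables being visible inside the Airflow containers. As a rough sketch (the helper name `missing_env_vars` is hypothetical), the check below could be run with `python` inside a container to confirm they were picked up:

```python
import os

# The variables that step 6 asks you to set in `x-airflow-common`.
REQUIRED_VARS = (
    "GCP_PROJECT_ID",
    "GCP_GCS_BUCKET",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT",
)


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty result means every variable is set. It does not verify that the values themselves are correct for your project.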


## Problems

### `File /.google/credentials/google_credentials.json was not found`

First, make sure you have your credentials in your `$HOME/.google/credentials`.
Maybe you missed the step and didn't copy your JSON file with the credentials there?
Also, make sure the file-name is `google_credentials.json`.

Second, check that docker-compose can correctly map this directory to airflow worker.

Execute `docker ps` to see the list of docker containers running on your host machine and find the ID of the airflow worker.

Then execute `bash` on this container:

```bash
docker exec -it <container-ID> bash
```

Now check if the file with credentials is actually there:

```bash
ls -lh /.google/credentials/
```

If it's empty, docker-compose couldn't map the folder with credentials. 
In this case, try changing it to the absolute path to this folder:

```yaml
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    # here: ----------------------------
    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro
    # -----------------------------------
```
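
If you are unsure what the absolute path should look like on your machine, a small sketch like the one below prints a volume entry in the same shape as the example above. The function name `credentials_volume` is an assumption made for illustration; the container path and the `:ro` suffix are taken from the compose file in this repo.

```python
from pathlib import Path

# Path inside the container, as used by the compose files in this repo.
CONTAINER_DIR = "/.google/credentials"


def credentials_volume(host_dir=None):
    """Build a read-only volume entry that uses an absolute host path."""
    if host_dir is None:
        host_dir = Path.home() / ".google" / "credentials"
    # as_posix() yields forward slashes on Windows too (e.g. C:/Users/...).
    host = Path(host_dir).expanduser().resolve().as_posix()
    return f"{host}/:{CONTAINER_DIR}:ro"
```

Running `print(credentials_volume())` on the host gives a line you can paste into the `volumes` section in place of the `~`-based entry.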


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/2_setup_nofrills.md
================================================
## Setup (No-frills)

### Pre-Reqs

1. For the sake of standardization across this workshop's config,
    rename your GCP service-account credentials file to `google_credentials.json` & store it under `$HOME/.google/credentials/`:
    ``` bash
        cd ~ && mkdir -p ~/.google/credentials/
        mv <path/to/your/service-account-authkeys>.json ~/.google/credentials/google_credentials.json
    ```

2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to a minimum of 4GB
(ideally 8GB). If not enough memory is allocated, the airflow-webserver may keep restarting.

3. Python version: 3.7+


### Airflow Setup

1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in)
   
2. **Set the Airflow user**:

    On Linux, the quick-start needs to know your host user ID and needs the group ID set to 0. 
    Otherwise the files created in `dags`, `logs` and `plugins` will be owned by the root user. 
    Configure them for docker-compose as follows:

    ```bash
    mkdir -p ./dags ./logs ./plugins
    echo -e "AIRFLOW_UID=$(id -u)" >> .env
    ```

    On Windows you will probably need this too. If you use MINGW/Git Bash, run the same commands. 

    To get rid of the warning ("AIRFLOW_UID is not set"), you can create `.env` file with
    this content:

    ```
    AIRFLOW_UID=50000
    ```

3. **Docker Build**:

    When you want to run Airflow locally, you might want to use an extended image, 
    containing some additional dependencies - for example you might add new python packages, 
    or upgrade airflow providers to a later version.
    
    Create a `Dockerfile` that uses an Airflow release, such as `apache/airflow:2.2.3`, as the base image.
       
    Then customize this `Dockerfile` by:
    * Adding your custom packages to be installed. The one we'll need the most is `gcloud` to connect with the GCS bucket (Data Lake).
    * Also, integrating `requirements.txt` to install libraries via  `pip install`

4. Copy [docker-compose-nofrills.yml](docker-compose-nofrills.yml), [.env_example](.env_example) & [entrypoint.sh](scripts/entrypoint.sh) from this repo.
    The changes from the official setup are:
    * Removal of `redis` queue, `worker`, `triggerer`, `flower` & `airflow-init` services, 
    and changing from `CeleryExecutor` (multi-node) mode to `LocalExecutor` (single-node) mode 
    * Inclusion of `.env` for better parametrization & flexibility
    * Inclusion of a simple `entrypoint.sh` in the `webserver` container, which initializes the database and creates the login user (admin).
    * Updated `Dockerfile` to grant permissions on executing `scripts/entrypoint.sh`
        
5. `.env`:
    * Rebuild your `.env` file by making a copy of `.env_example`. This overwrites the existing `.env`, so note your `AIRFLOW_UID` first and set it again afterwards:
        ```shell
        cp .env_example .env
        ```
    * Set environment variables `AIRFLOW_UID`, `GCP_PROJECT_ID` & `GCP_GCS_BUCKET`, as per your config.
    * Optionally, if your `google_credentials.json` is stored somewhere else, such as a path like `$HOME/.gc`, 
    modify the env-vars (`GOOGLE_APPLICATION_CREDENTIALS`, `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`) and the `volumes` path in `docker-compose-nofrills.yml`

6. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose-nofrills](./docker-compose-nofrills.yml) should look.
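
Step 5 asks you to rebuild `.env` from `.env_example` while keeping your `AIRFLOW_UID`. One way to do that without retyping the value is sketched below. The function `merge_env` is hypothetical and only handles simple `NAME=value` lines, which is all `.env_example` contains.

```python
def merge_env(example_text, current_text, keep=("AIRFLOW_UID",)):
    """Return the example file's contents with the `keep` variables carried over."""
    # Collect the values to preserve from the current .env.
    kept = {}
    for line in current_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() in keep:
            kept[name.strip()] = value.strip()

    out = []
    for line in example_text.splitlines():
        # A commented-out placeholder such as "# AIRFLOW_UID=" is replaced
        # by the preserved value; every other line is copied unchanged.
        name = line.lstrip("# ").partition("=")[0].strip()
        if name in kept:
            out.append(f"{name}={kept.pop(name)}")
        else:
            out.append(line)

    # Preserved variables that the example never mentions go at the end.
    out.extend(f"{name}={value}" for name, value in kept.items())
    return "\n".join(out) + "\n"
```

To apply it, read both files with `pathlib.Path.read_text()`, pass the two strings to `merge_env`, and write the result back to `.env`. You still need to fill in `GCP_PROJECT_ID` and `GCP_GCS_BUCKET` by hand.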


## Problems

### `no-frills setup does not work for me - WSL/Windows user`

If you are running Docker in Windows/WSL/WSL2 and you have encountered some `ModuleNotFoundError` or low performance issues, take a look at this [Airflow & WSL2 gist](https://gist.github.com/nervuzz/d1afe81116cbfa3c834634ebce7f11c5) focused entirely on troubleshooting possible problems.

### `File /.google/credentials/google_credentials.json was not found`

First, make sure you have your credentials in your `$HOME/.google/credentials`.
Maybe you missed the step and didn't copy your JSON file with the credentials there?
Also, make sure the file-name is `google_credentials.json`.

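Second, check that docker-compose can correctly map this directory into the Airflow containers.

Execute `docker ps` to see the list of docker containers running on your host machine and find the ID of the Airflow scheduler (the no-frills setup has no separate worker; with `LocalExecutor`, tasks run in the scheduler container).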

Then execute `bash` on this container:

```bash
docker exec -it <container-ID> bash
```

Now check if the file with credentials is actually there:

```bash
ls -lh /.google/credentials/
```

If it's empty, docker-compose couldn't map the folder with credentials. 
In this case, try changing it to the absolute path to this folder:

```yaml
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    # here: ----------------------------
    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro
    # -----------------------------------
```
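
The manual checks above can also be scripted. The sketch below is meant to be run with `python` inside the container; the function name and the four status strings are assumptions made for illustration. It tells apart a folder that is missing, a folder that is mounted but empty, and a key file with the wrong name.

```python
import os
from pathlib import Path

# Default location used by the compose files in this repo.
DEFAULT_KEY_PATH = "/.google/credentials/google_credentials.json"


def diagnose_credentials_mount(key_path=None):
    """Classify the state of the credentials mount as a short status string."""
    key_path = Path(
        key_path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", DEFAULT_KEY_PATH)
    )
    folder = key_path.parent
    if key_path.is_file():
        return "ok"
    if not folder.is_dir():
        return "folder-missing"  # the volume is probably not declared at all
    if not any(folder.iterdir()):
        return "folder-empty"  # the host folder could not be mapped
    return "file-missing"  # folder is mounted, but the file name differs
```

A `folder-empty` result corresponds to the case described above, where switching to an absolute host path in `volumes` is worth trying.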


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/Dockerfile
================================================
# First-time build can take up to 10 mins.

FROM apache/airflow:2.2.3

ENV AIRFLOW_HOME=/opt/airflow

USER root
RUN apt-get update -qq && apt-get install vim -qqq
# git gcc g++ -qqq

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Ref: https://airflow.apache.org/docs/docker-stack/recipes.html

SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

ARG CLOUD_SDK_VERSION=322.0.0
ENV GCLOUD_HOME=/home/google-cloud-sdk

ENV PATH="${GCLOUD_HOME}/bin/:${PATH}"

RUN DOWNLOAD_URL="https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz" \
    && TMP_DIR="$(mktemp -d)" \
    && curl -fL "${DOWNLOAD_URL}" --output "${TMP_DIR}/google-cloud-sdk.tar.gz" \
    && mkdir -p "${GCLOUD_HOME}" \
    && tar xzf "${TMP_DIR}/google-cloud-sdk.tar.gz" -C "${GCLOUD_HOME}" --strip-components=1 \
    && "${GCLOUD_HOME}/install.sh" \
       --bash-completion=false \
       --path-update=false \
       --usage-reporting=false \
       --quiet \
    && rm -rf "${TMP_DIR}" \
    && gcloud --version

WORKDIR $AIRFLOW_HOME

COPY scripts scripts
RUN chmod -R +x scripts

USER $AIRFLOW_UID


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/README.md
================================================
### Concepts

 [Airflow Concepts and Architecture](docs/1_concepts.md)

### Workflow

 ![](docs/gcs_ingestion_dag.png)
 
### Setup - Official Version
 (For the section on the Custom/Lightweight setup, scroll down)

 #### Setup
  [Airflow Setup with Docker, through official guidelines](1_setup_official.md)

 #### Execution
 
  1. Build the image (only needed the first time, or whenever the `Dockerfile` changes; the first build takes ~15 mins):
     ```shell
     docker-compose build
     ```
   
     or (for legacy versions)
   
     ```shell
     docker build .
     ```

 2. Initialize the Airflow scheduler, DB, and other config
    ```shell
    docker-compose up airflow-init
    ```

 3. Start all the services:
    ```shell
    docker-compose up
    ```

 4. In another terminal, run `docker-compose ps` to see which containers are up & running (there should be 7, matching with the services in your docker-compose file).

 5. Log in to the Airflow web UI at `localhost:8080` with the default creds: `airflow/airflow`

 6. Run your DAG on the Web Console.

 7. On finishing your run or to shut down the container/s:
    ```shell
    docker-compose down
    ```

    To stop and delete the containers, and also delete the volumes with database data and the downloaded images, run:
    ```
    docker-compose down --volumes --rmi all
    ```

    or
    ```
    docker-compose down --volumes --remove-orphans
    ```
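
The official `docker-compose.yaml` health-checks the webserver with `curl --fail http://localhost:8080/health`. If you would rather wait for the stack programmatically than watch `docker-compose ps`, the following sketch polls that same endpoint. The function names, retry count, and delay are illustrative assumptions, and the check relies on the health report exposing a `status` field for `metadatabase` and `scheduler`.

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"


def fetch_health(url=HEALTH_URL):
    """Fetch and decode the JSON health report from the webserver."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def is_healthy(report):
    """True when both the metadata database and the scheduler report 'healthy'."""
    return all(
        report.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


def wait_until_healthy(fetch=fetch_health, attempts=30, delay=10.0, sleep=time.sleep):
    """Poll the health endpoint; return True once healthy, False after `attempts` tries."""
    for attempt in range(attempts):
        try:
            if is_healthy(fetch()):
                return True
        except (OSError, ValueError):
            pass  # webserver not accepting connections yet, or a non-JSON reply
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_healthy()` after `docker-compose up` returns `True` once both components are up, at which point logging in to the web UI should work.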
       
### Setup - Custom No-Frills Version (Lightweight)
This is a quick, simple & less memory-intensive setup of Airflow that runs on the `LocalExecutor`.

  #### Setup
  [Airflow Setup with Docker, customized](2_setup_nofrills.md)

  #### Execution
  
  1. Stop and delete containers, delete volumes with database data, & downloaded images (from the previous setup):
    ```
    docker-compose down --volumes --rmi all
    ```

   or
    ```
    docker-compose down --volumes --remove-orphans
    ```
    
   Or, if you need to clear your system of any pre-cached Docker issues:
    ```
    docker system prune
    ```
    
   Also, empty the airflow `logs` directory.
    
  2. Build the image (only needed the first time, or whenever the `Dockerfile` changes).
  The first build takes ~5-10 mins:
    ```shell
    docker-compose build
    ```
    or (for legacy versions)
    ```shell
    docker build .
    ```

  3. Start all the services (no need for a separate initialization step):
    ```shell
    docker-compose -f docker-compose-nofrills.yml up
    ```

  4. In another terminal, run `docker ps` to see which containers are up & running (there should be 3, matching with the services in your docker-compose file).

  5. Log in to the Airflow web UI at `localhost:8080` with creds: `admin/admin` (explicit creation of the admin user was required)

  6. Run your DAG on the Web Console.

  7. On finishing your run or to shut down the container/s:
    ```shell
    docker-compose -f docker-compose-nofrills.yml down
    ```
    
### Setup - Taken from DE Zoomcamp 2.3.4 - Optional: Lightweight Local Setup for Airflow

Use the `docker-compose_2.3.4.yaml` file (and rename it to `docker-compose.yaml`). Don't forget to replace the variables `GCP_PROJECT_ID` and `GCP_GCS_BUCKET`.

### Future Enhancements
* Deploy self-hosted Airflow setup on Kubernetes cluster, or use a Managed Airflow (Cloud Composer) service by GCP

### References
For more info, check out these official docs:
   * https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html
   * https://airflow.apache.org/docs/docker-stack/build.html
   * https://airflow.apache.org/docs/docker-stack/recipes.html


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/dags/data_ingestion_gcs_dag.py
================================================
import os
import logging

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from google.cloud import storage
from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator
import pyarrow.csv as pv
import pyarrow.parquet as pq

PROJECT_ID = os.environ.get("GCP_PROJECT_ID")
BUCKET = os.environ.get("GCP_GCS_BUCKET")

dataset_file = "yellow_tripdata_2021-01.csv"
dataset_url = f"https://s3.amazonaws.com/nyc-tlc/trip+data/{dataset_file}"
path_to_local_home = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")
parquet_file = dataset_file.replace('.csv', '.parquet')
BIGQUERY_DATASET = os.environ.get("BIGQUERY_DATASET", 'trips_data_all')


def format_to_parquet(src_file):
    """Convert a local CSV file to Parquet, writing the result next to the source file."""
    if not src_file.endswith('.csv'):
        logging.error("Can only accept source files in CSV format, for the moment")
        return
    table = pv.read_csv(src_file)
    pq.write_table(table, src_file.replace('.csv', '.parquet'))


# NOTE: takes 20 mins, at an upload speed of 800kbps. Faster if your internet has a better upload speed
def upload_to_gcs(bucket, object_name, local_file):
    """
    Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
    :param bucket: GCS bucket name
    :param object_name: target path & file-name
    :param local_file: source path & file-name
    :return:
    """
    # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.
    # (Ref: https://github.com/googleapis/python-storage/issues/74)
    storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
    storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB
    # End of Workaround

    client = storage.Client()
    bucket = client.bucket(bucket)

    blob = bucket.blob(object_name)
    blob.upload_from_filename(local_file)


default_args = {
    "owner": "airflow",
    "start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}

# NOTE: DAG declaration - using a Context Manager (an implicit way)
with DAG(
    dag_id="data_ingestion_gcs_dag",
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
    max_active_runs=1,
    tags=['dtc-de'],
) as dag:

    download_dataset_task = BashOperator(
        task_id="download_dataset_task",
        bash_command=f"curl -sSL {dataset_url} > {path_to_local_home}/{dataset_file}"
    )

    format_to_parquet_task = PythonOperator(
        task_id="format_to_parquet_task",
        python_callable=format_to_parquet,
        op_kwargs={
            "src_file": f"{path_to_local_home}/{dataset_file}",
        },
    )

    # TODO: Homework - research and try XCOM to communicate output values between 2 tasks/operators
    local_to_gcs_task = PythonOperator(
        task_id="local_to_gcs_task",
        python_callable=upload_to_gcs,
        op_kwargs={
            "bucket": BUCKET,
            "object_name": f"raw/{parquet_file}",
            "local_file": f"{path_to_local_home}/{parquet_file}",
        },
    )

    bigquery_external_table_task = BigQueryCreateExternalTableOperator(
        task_id="bigquery_external_table_task",
        table_resource={
            "tableReference": {
                "projectId": PROJECT_ID,
                "datasetId": BIGQUERY_DATASET,
                "tableId": "external_table",
            },
            "externalDataConfiguration": {
                "sourceFormat": "PARQUET",
                "sourceUris": [f"gs://{BUCKET}/raw/{parquet_file}"],
            },
        },
    )

    download_dataset_task >> format_to_parquet_task >> local_to_gcs_task >> bigquery_external_table_task


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/dags_local/data_ingestion_local.py
================================================
import os

from datetime import datetime

from airflow import DAG

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from ingest_script import ingest_callable


AIRFLOW_HOME = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")


PG_HOST = os.getenv('PG_HOST')
PG_USER = os.getenv('PG_USER')
PG_PASSWORD = os.getenv('PG_PASSWORD')
PG_PORT = os.getenv('PG_PORT')
PG_DATABASE = os.getenv('PG_DATABASE')


local_workflow = DAG(
    "LocalIngestionDag",
    schedule_interval="0 6 2 * *",
    start_date=datetime(2021, 1, 1)
)


URL_PREFIX = 'https://s3.amazonaws.com/nyc-tlc/trip+data' 
URL_TEMPLATE = URL_PREFIX + '/yellow_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
OUTPUT_FILE_TEMPLATE = AIRFLOW_HOME + '/output_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
TABLE_NAME_TEMPLATE = 'yellow_taxi_{{ execution_date.strftime(\'%Y_%m\') }}'

with local_workflow:
    wget_task = BashOperator(
        task_id='wget',
        bash_command=f'curl -sSL {URL_TEMPLATE} > {OUTPUT_FILE_TEMPLATE}'
    )

    ingest_task = PythonOperator(
        task_id="ingest",
        python_callable=ingest_callable,
        op_kwargs=dict(
            user=PG_USER,
            password=PG_PASSWORD,
            host=PG_HOST,
            port=PG_PORT,
            db=PG_DATABASE,
            table_name=TABLE_NAME_TEMPLATE,
            csv_file=OUTPUT_FILE_TEMPLATE
        ),
    )

    wget_task >> ingest_task

================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/dags_local/ingest_script.py
================================================
import os

from time import time

import pandas as pd
from sqlalchemy import create_engine


def ingest_callable(user, password, host, port, db, table_name, csv_file, execution_date):
    print(table_name, csv_file, execution_date)

    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')
    engine.connect()

    print('connection established successfully, inserting data...')

    t_start = time()
    df_iter = pd.read_csv(csv_file, iterator=True, chunksize=100000)

    df = next(df_iter)

    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

    df.head(n=0).to_sql(name=table_name, con=engine, if_exists='replace')

    df.to_sql(name=table_name, con=engine, if_exists='append')

    t_end = time()
    print('inserted the first chunk, took %.3f second' % (t_end - t_start))

    while True: 
        t_start = time()

        try:
            df = next(df_iter)
        except StopIteration:
            print("completed")
            break

        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

        df.to_sql(name=table_name, con=engine, if_exists='append')

        t_end = time()

        print('inserted another chunk, took %.3f second' % (t_end - t_start))


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/docker-compose-nofrills.yml
================================================
version: '3'
services:
    postgres:
        image: postgres:13
        env_file:
            - .env
        volumes:
            - postgres-db-volume:/var/lib/postgresql/data
        healthcheck:
            test: ["CMD", "pg_isready", "-U", "airflow"]
            interval: 5s
            retries: 5
        restart: always

    scheduler:
        build: .
        command: scheduler
        restart: on-failure
        depends_on:
            - postgres
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
            - ./plugins:/opt/airflow/plugins
            - ./scripts:/opt/airflow/scripts
            - ~/.google/credentials/:/.google/credentials:ro


    webserver:
        build: .
        entrypoint: ./scripts/entrypoint.sh
        restart: on-failure
        depends_on:
            - postgres
            - scheduler
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
            - ./plugins:/opt/airflow/plugins
            - ~/.google/credentials/:/.google/credentials:ro
            - ./scripts:/opt/airflow/scripts

        user: "${AIRFLOW_UID:-50000}:0"
        ports:
            - "8080:8080"
        healthcheck:
            test: [ "CMD-SHELL", "[ -f /home/airflow/airflow-webserver.pid ]" ]
            interval: 30s
            timeout: 30s
            retries: 3

volumes:
  postgres-db-volume:

================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/docker-compose.yaml
================================================
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.
#
# WARNING: This configuration is for local development. Do not use it in a production deployment.
#
# This configuration supports basic configuration using environment variables or an .env file
# The following variables are supported:
#
# AIRFLOW_IMAGE_NAME           - Docker image name used to run Airflow.
#                                Default: apache/airflow:2.2.3
# AIRFLOW_UID                  - User ID in Airflow containers
#                                Default: 50000
# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode
#
# _AIRFLOW_WWW_USER_USERNAME   - Username for the administrator account (if requested).
#                                Default: airflow
# _AIRFLOW_WWW_USER_PASSWORD   - Password for the administrator account (if requested).
#                                Default: airflow
# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.
#                                Default: ''
#
# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  build:
    context: .
    dockerfile: ./Dockerfile
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json
    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'

    # TODO: Please change GCP_PROJECT_ID & GCP_GCS_BUCKET, as per your config
    GCP_PROJECT_ID: 'pivotal-surfer-336713'
    GCP_GCS_BUCKET: 'dtc_data_lake_pivotal-surfer-336713'

  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - ~/.google/credentials/:/.google/credentials:ro

  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  redis:
    image: redis:latest
    expose:
      - 6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
    environment:
      <<: *airflow-common-env
      # Required to handle warm shutdown of the celery workers properly
      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
      DUMB_INIT_SETSID: "0"
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        function ver() {
          printf "%04d%04d%04d%04d" $${1//./ }
        }
        airflow_version=$$(gosu airflow airflow version)
        airflow_version_comparable=$$(ver $${airflow_version})
        min_airflow_version=2.2.0
        min_airflow_version_comparable=$$(ver $${min_airflow_version})
        if (( airflow_version_comparable < min_airflow_version_comparable )); then
          echo
          echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m"
          echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!"
          echo
          exit 1
        fi
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
    user: "0:0"
    volumes:
      - .:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

  flower:
    <<: *airflow-common
    command: celery flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

volumes:
  postgres-db-volume:


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/docker-compose_2.3.4.yaml
================================================
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.
#
# WARNING: This configuration is for local development. Do not use it in a production deployment.
#
# This configuration supports basic configuration using environment variables or an .env file
# The following variables are supported:
#
# AIRFLOW_IMAGE_NAME           - Docker image name used to run Airflow.
#                                Default: apache/airflow:2.2.3
# AIRFLOW_UID                  - User ID in Airflow containers
#                                Default: 50000
# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode
#
# _AIRFLOW_WWW_USER_USERNAME   - Username for the administrator account (if requested).
#                                Default: airflow
# _AIRFLOW_WWW_USER_PASSWORD   - Password for the administrator account (if requested).
#                                Default: airflow
# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.
#                                Default: ''
#
# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below. Then run `docker-compose build` to build the images.
  build:
    context: .
    dockerfile: ./Dockerfile
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json
    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'
    GCP_PROJECT_ID: 'abc'
    GCP_GCS_BUCKET: "abc"

  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - ~/.google/credentials/:/.google/credentials:ro

  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        function ver() {
          printf "%04d%04d%04d%04d" $${1//./ }
        }
        airflow_version=$$(gosu airflow airflow version)
        airflow_version_comparable=$$(ver $${airflow_version})
        min_airflow_version=2.2.0
        min_airflow_version_comparable=$$(ver $${min_airflow_version})
        if (( airflow_version_comparable < min_airflow_version_comparable )); then
          echo
          echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m"
          echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!"
          echo
          exit 1
        fi
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
    user: "0:0"
    volumes:
      - .:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

volumes:
  postgres-db-volume:


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/docs/1_concepts.md
================================================
## Airflow concepts


### Airflow architecture
![](arch-diag-airflow.png)

Ref: https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html

* **Web server**:
GUI to inspect, trigger and debug the behaviour of DAGs and tasks. 
Available at http://localhost:8080.

* **Scheduler**:
Responsible for scheduling jobs. Handles both triggered and scheduled workflows, submits tasks to the executor to run, monitors all tasks and DAGs, and
triggers task instances once their dependencies are complete.

* **Worker**:
This component executes the tasks given by the scheduler.

* **Metadata database (postgres)**:
Backend to the Airflow environment. Used by the scheduler, executor and webserver to store state.

* **Other components** (seen in docker-compose services):
    * `redis`: Message broker that forwards messages from scheduler to worker.
    * `flower`: The flower app for monitoring the environment. It is available at http://localhost:5555.
    * `airflow-init`: initialization service (customized as per this design)

All these services allow you to run Airflow with CeleryExecutor. 
For more information, see [Architecture Overview](https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html).


### Project Structure

* `./dags` - `DAG_FOLDER` for DAG files (use `./dags_local` for the local ingestion DAG)
* `./logs` - contains logs from task execution and scheduler.
* `./plugins` - for custom plugins


### Workflow components

* `DAG`: Directed acyclic graph. It specifies the dependencies between a set of tasks with an explicit execution order. Because the graph is acyclic, every run has a beginning as well as an end.
    * `DAG Structure`: DAG Definition, Tasks (eg. Operators), Task Dependencies (control flow: `>>` or `<<` )
    
* `Task`: a defined unit of work (implemented by operators in Airflow). The tasks themselves describe what to do, be it fetching data, running analysis, triggering other systems, or more.
    * Common Types: Operators (used in this workshop), Sensors, TaskFlow decorators
    * Sub-classes of Airflow's BaseOperator

* `DAG Run`: individual execution/run of a DAG
    * scheduled or triggered

* `Task Instance`: an individual run of a single task. Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc.
    * Ideally, a task should flow from `none`, to `scheduled`, to `queued`, to `running`, and finally to `success`.
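
To make the dependency idea concrete, here is a minimal sketch that does not use Airflow at all. It models three tasks as a dependency graph with Python's standard `graphlib` module (Python 3.9+) and derives a valid execution order. The task names are borrowed from the ingestion DAG used in this course; the `run_order` helper is hypothetical and only illustrates the concept.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; the set holds its upstream tasks (what must finish first).
# In Airflow the same chain would be declared with the >> operator.
dependencies = {
    "download_dataset_task": set(),
    "format_to_parquet_task": {"download_dataset_task"},
    "local_to_gcs_task": {"format_to_parquet_task"},
}


def run_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


print(run_order(dependencies))

# A cycle makes ordering impossible, which is why the graph must be acyclic.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    run_order(cyclic)
except CycleError:
    print("cycle detected: not a valid DAG")
```

The scheduler does something similar for every DAG run: a task instance is only queued once all of its upstream tasks have finished.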


### References

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/extras/data_ingestion_gcs_dag_ex2.py
================================================
import os
from datetime import datetime

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from google.cloud import storage

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "pivotal-surfer-336713")
BUCKET = os.environ.get("GCP_GCS_BUCKET", "dtc_data_lake_pivotal-surfer-336713")

dataset_file = "yellow_tripdata_2021-01.csv"
dataset_url = f"https://s3.amazonaws.com/nyc-tlc/trip+data/{dataset_file}"
path_to_local_home = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")
path_to_creds = f"{path_to_local_home}/google_credentials.json"

default_args = {
    "owner": "airflow",
    "start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}


# # Takes 15-20 mins to run. Good case for using Spark (distributed processing, in place of chunks)
# def upload_to_gcs(bucket, object_name, local_file):
#     """
#     Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
#     :param bucket: GCS bucket name
#     :param object_name: target path & file-name
#     :param local_file: source path & file-name
#     :return:
#     """
#     # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload link.
#     # (Ref: https://github.com/googleapis/python-storage/issues/74)
#     storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
#     storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB
#
#     client = storage.Client()
#     bucket = client.bucket(bucket)
#
#     blob = bucket.blob(object_name)
#     # blob.chunk_size = 5 * 1024 * 1024
#     blob.upload_from_filename(local_file)


with DAG(
    dag_id="data_ingestion_gcs_dag",
    schedule_interval="@daily",
    default_args=default_args,
    catchup=True,
    max_active_runs=1,
) as dag:

    # Takes ~2 mins, depending upon your internet's download speed
    download_dataset_task = BashOperator(
        task_id="download_dataset_task",
        bash_command=f"curl -sS {dataset_url} > {path_to_local_home}/{dataset_file}"    # "&& unzip {zip_file} && rm {zip_file}"
    )

    # # APPROACH 1: (takes 20 mins, at an upload speed of 800Kbps. Faster if your internet has a better upload speed)
    # upload_to_gcs_task = PythonOperator(
    #     task_id="upload_to_gcs_task",
    #     python_callable=upload_to_gcs,
    #     op_kwargs={
    #         "bucket": BUCKET,
    #         "object_name": f"raw/{dataset_file}",
    #         "local_file": f"{path_to_local_home}/{dataset_file}",
    #
    #     },
    # )

    # OR APPROACH 2: (takes 20 mins, at an upload speed of 800Kbps. Faster if your internet has a better upload speed)
    # Ref: https://cloud.google.com/blog/products/gcp/optimizing-your-cloud-storage-performance-google-cloud-performance-atlas
    upload_to_gcs_task = BashOperator(
        task_id="upload_to_gcs_task",
        bash_command=f"gcloud auth activate-service-account --key-file={path_to_creds} && \
        gsutil -m cp {path_to_local_home}/{dataset_file} gs://{BUCKET}",

    )

    download_dataset_task >> upload_to_gcs_task

================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/extras/web_to_gcs.sh
================================================
dataset_url=${dataset_url}
dataset_file=${dataset_file}
path_to_local_file=${path_to_local_file}
path_to_creds=${path_to_creds}

curl -sS "$dataset_url" > "$path_to_local_file/$dataset_file"
gcloud auth activate-service-account --key-file="$path_to_creds"
gsutil -m cp "$path_to_local_file/$dataset_file" "gs://$BUCKET"


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/requirements.txt
================================================
apache-airflow-providers-google
pyarrow


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/scripts/entrypoint.sh
================================================
#!/usr/bin/env bash
export GOOGLE_APPLICATION_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS}
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=${AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT}

airflow db upgrade

airflow users create -r Admin -u admin -p admin -e admin@example.com -f admin -l airflow
# "$_AIRFLOW_WWW_USER_USERNAME" -p "$_AIRFLOW_WWW_USER_PASSWORD"

airflow webserver


================================================
FILE: cohorts/2022/week_2_data_ingestion/homework/homework.md
================================================
## Week 2 Homework

In this homework, we'll prepare data for the next week. We'll need
to put these datasets into our data lake:

* For the lessons, we'll need the Yellow taxi dataset (years 2019 and 2020)
* For the homework, we'll need FHV Data (for-hire vehicles, for 2019 only)

You can find all the URLs on [the dataset page](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)


In this homework, we will:

* Modify the DAG we created during the lessons for transferring the yellow taxi data
* Create a new DAG for transferring the FHV data
* Create another DAG for the Zones data


If you don't have access to GCP, you can do this locally and ingest the data into Postgres
instead. If you have access to GCP, ingesting into local Postgres is optional.

Also note that for this homework we don't need the last step - creating a table in GCP.
After putting all the files into the data lake, we'll create the tables in Week 3.


## Question 1: Start date for the Yellow taxi data (1 point)

You'll need to parametrize the DAG for processing the yellow taxi data that
we created in the videos. 

What should be the start date for this DAG?

* 2019-01-01
* 2020-01-01
* 2021-01-01
* days_ago(1)


## Question 2: Frequency for the Yellow taxi data (1 point)

How often do we need to run this DAG?

* Daily
* Monthly
* Yearly
* Once


## Re-running the DAGs for past dates

To execute your DAG for past dates, try this:

* First, delete your DAG from the web interface (the bin icon)
* Set the `catchup` parameter to `True`
* Be careful with running a lot of jobs in parallel - your system may not like it. Don't set `max_active_runs` higher than 3: `max_active_runs=3`
* Rename the DAG to something like `data_ingestion_gcs_dag_v02` 
* Execute it from the Airflow GUI (the play button)


Also, there's no data for the most recent months, but by default `curl` still exits successfully when the server returns a 404.
To make it fail on HTTP errors, add the `-f` flag:

```bash
curl -sSLf { URL } > { LOCAL_PATH }
```
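
If you would rather download from a `PythonOperator` than from `curl`, the same fail-fast behaviour can be sketched with the standard library: `urllib.request.urlopen` raises an exception for a 404 response instead of saving the error page. The `download` helper below is hypothetical and not part of the course code.

```python
import shutil
import urllib.request


def download(url, local_path):
    """Save `url` to `local_path`; raise if the server reports an error.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    missing month fails the task instead of producing a bogus file.
    The output file is only opened after the request has succeeded.
    """
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return local_path
```

Because the exception propagates, Airflow would mark the task as failed, just as it does when `curl -f` exits with a non-zero code.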

When you run this for all the data, the temporary files will be saved inside the Docker container and will consume your
disk space. If that causes problems for you, add another step to your DAG that cleans everything up.
It could be a `BashOperator` that runs this command:

```bash
rm name-of-csv-file.csv name-of-parquet-file.parquet
```
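
The same cleanup can also be written as a Python callable. Unlike a plain `rm`, this version does not fail when one of the files was never created, for example after a retried run. This is only a sketch, and the `remove_files` name is hypothetical.

```python
from pathlib import Path


def remove_files(*paths):
    """Delete each path if it exists and return the ones actually removed."""
    removed = []
    for path in map(Path, paths):
        if path.is_file():
            path.unlink()
            removed.append(str(path))
    return removed
```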


## Question 3: DAG for FHV Data (2 points)

Now create another DAG - for uploading the FHV data. 

We will need three steps: 

* Download the data
* Parquetize it 
* Upload to GCS
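
Each run works on a different file, so all three steps need names derived from the run's date. Below is a plain-Python sketch of that naming scheme, using the URL prefix and path layout from the course's solution DAG. In Airflow itself the date would come from a Jinja template such as `{{ execution_date.strftime('%Y-%m') }}`. The `paths_for` helper and its `local_dir` default are assumptions made for illustration.

```python
from datetime import date

URL_PREFIX = "https://s3.amazonaws.com/nyc-tlc/trip+data"


def paths_for(dataset, run_date, local_dir="/opt/airflow"):
    """Build the source URL, local files and GCS object name for one month."""
    month = run_date.strftime("%Y-%m")
    stem = f"{dataset}_tripdata_{month}"
    return {
        "url": f"{URL_PREFIX}/{stem}.csv",
        "local_csv": f"{local_dir}/{stem}.csv",
        "local_parquet": f"{local_dir}/{stem}.parquet",
        "gcs_object": f"raw/{dataset}_tripdata/{run_date:%Y}/{stem}.parquet",
    }


for name, value in paths_for("fhv", date(2021, 1, 1)).items():
    print(f"{name}: {value}")
```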

If you don't have a GCP account, for local ingestion you'll need two steps:

* Download the data
* Ingest to Postgres

Use the same frequency and start date as for the yellow taxi dataset.

Question: how many DAG runs are green for data in 2019 after finishing everything? 

Note: when processing the data for 2020-01 you probably will get an error. It's up 
to you to decide what to do with it - for Week 3 homework we won't need 2020 data.


## Question 4: DAG for Zones (2 points)


Create the final DAG - for Zones:

* Download it
* Parquetize 
* Upload to GCS

(Or two steps for local ingestion: download -> ingest to Postgres)

How often does it need to run?

* Daily
* Monthly
* Yearly
* Once


## Submitting the solutions

* Form for submitting: https://forms.gle/ViWS8pDf2tZD4zSu5
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: February 7, 17:00 CET 


================================================
FILE: cohorts/2022/week_2_data_ingestion/homework/solution.py
================================================
import os
import logging

from datetime import datetime

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from google.cloud import storage

import pyarrow.csv as pv
import pyarrow.parquet as pq


PROJECT_ID = os.environ.get("GCP_PROJECT_ID")
BUCKET = os.environ.get("GCP_GCS_BUCKET")
AIRFLOW_HOME = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")


def format_to_parquet(src_file, dest_file):
    """Convert the CSV file at `src_file` to a Parquet file at `dest_file`."""
    if not src_file.endswith('.csv'):
        logging.error("Can only accept source files in CSV format, for the moment")
        return
    table = pv.read_csv(src_file)
    pq.write_table(table, dest_file)


def upload_to_gcs(bucket, object_name, local_file):
    """Upload `local_file` to `object_name` in the GCS bucket named `bucket`."""
    client = storage.Client()
    bucket = client.bucket(bucket)
    blob = bucket.blob(object_name)
    blob.upload_from_filename(local_file)


default_args = {
    "owner": "airflow",
    #"start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}


def donwload_parquetize_upload_dag(
    dag,
    url_template,
    local_csv_path_template,
    local_parquet_path_template,
    gcs_path_template
):
    with dag:
        download_dataset_task = BashOperator(
            task_id="download_dataset_task",
            bash_command=f"curl -sSLf {url_template} > {local_csv_path_template}"
        )

        format_to_parquet_task = PythonOperator(
            task_id="format_to_parquet_task",
            python_callable=format_to_parquet,
            op_kwargs={
                "src_file": local_csv_path_template,
                "dest_file": local_parquet_path_template
            },
        )

        local_to_gcs_task = PythonOperator(
            task_id="local_to_gcs_task",
            python_callable=upload_to_gcs,
            op_kwargs={
                "bucket": BUCKET,
                "object_name": gcs_path_template,
                "local_file": local_parquet_path_template,
            },
        )

        rm_task = BashOperator(
            task_id="rm_task",
            bash_command=f"rm {local_csv_path_template} {local_parquet_path_template}"
        )

        download_dataset_task >> format_to_parquet_task >> local_to_gcs_task >> rm_task


URL_PREFIX = 'https://s3.amazonaws.com/nyc-tlc/trip+data'

YELLOW_TAXI_URL_TEMPLATE = URL_PREFIX + '/yellow_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
YELLOW_TAXI_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/yellow_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
YELLOW_TAXI_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/yellow_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet'
YELLOW_TAXI_GCS_PATH_TEMPLATE = "raw/yellow_tripdata/{{ execution_date.strftime(\'%Y\') }}/yellow_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet"


yellow_taxi_data_dag = DAG(
    dag_id="yellow_taxi_data_v2",
    schedule_interval="0 6 2 * *",
    start_date=datetime(2019, 1, 1),
    default_args=default_args,
    catchup=True,
    max_active_runs=3,
    tags=['dtc-de'],
)

donwload_parquetize_upload_dag(
    dag=yellow_taxi_data_dag,
    url_template=YELLOW_TAXI_URL_TEMPLATE,
    local_csv_path_template=YELLOW_TAXI_CSV_FILE_TEMPLATE,
    local_parquet_path_template=YELLOW_TAXI_PARQUET_FILE_TEMPLATE,
    gcs_path_template=YELLOW_TAXI_GCS_PATH_TEMPLATE
)

# https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2021-01.csv

GREEN_TAXI_URL_TEMPLATE = URL_PREFIX + '/green_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
GREEN_TAXI_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/green_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
GREEN_TAXI_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/green_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet'
GREEN_TAXI_GCS_PATH_TEMPLATE = "raw/green_tripdata/{{ execution_date.strftime(\'%Y\') }}/green_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet"

green_taxi_data_dag = DAG(
    dag_id="green_taxi_data_v1",
    schedule_interval="0 7 2 * *",
    start_date=datetime(2019, 1, 1),
    default_args=default_args,
    catchup=True,
    max_active_runs=3,
    tags=['dtc-de'],
)

donwload_parquetize_upload_dag(
    dag=green_taxi_data_dag,
    url_template=GREEN_TAXI_URL_TEMPLATE,
    local_csv_path_template=GREEN_TAXI_CSV_FILE_TEMPLATE,
    local_parquet_path_template=GREEN_TAXI_PARQUET_FILE_TEMPLATE,
    gcs_path_template=GREEN_TAXI_GCS_PATH_TEMPLATE
)


# https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-01.csv

FHV_TAXI_URL_TEMPLATE = URL_PREFIX + '/fhv_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
FHV_TAXI_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/fhv_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
FHV_TAXI_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/fhv_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet'
FHV_TAXI_GCS_PATH_TEMPLATE = "raw/fhv_tripdata/{{ execution_date.strftime(\'%Y\') }}/fhv_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet"

fhv_taxi_data_dag = DAG(
    dag_id="fhv_taxi_data_v1",
    schedule_interval="0 8 2 * *",
    start_date=datetime(2019, 1, 1),
    end_date=datetime(2020, 1, 1),
    default_args=default_args,
    catchup=True,
    max_active_runs=3,
    tags=['dtc-de'],
)

donwload_parquetize_upload_dag(
    dag=fhv_taxi_data_dag,
    url_template=FHV_TAXI_URL_TEMPLATE,
    local_csv_path_template=FHV_TAXI_CSV_FILE_TEMPLATE,
    local_parquet_path_template=FHV_TAXI_PARQUET_FILE_TEMPLATE,
    gcs_path_template=FHV_TAXI_GCS_PATH_TEMPLATE
)


# https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv

ZONES_URL_TEMPLATE = 'https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv'
ZONES_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/taxi_zone_lookup.csv'
ZONES_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/taxi_zone_lookup.parquet'
ZONES_GCS_PATH_TEMPLATE = "raw/taxi_zone/taxi_zone_lookup.parquet"

zones_data_dag = DAG(
    dag_id="zones_data_v1",
    schedule_interval="@once",
    start_date=days_ago(1),
    default_args=default_args,
    catchup=True,
    max_active_runs=3,
    tags=['dtc-de'],
)

donwload_parquetize_upload_dag(
    dag=zones_data_dag,
    url_template=ZONES_URL_TEMPLATE,
    local_csv_path_template=ZONES_CSV_FILE_TEMPLATE,
    local_parquet_path_template=ZONES_PARQUET_FILE_TEMPLATE,
    gcs_path_template=ZONES_GCS_PATH_TEMPLATE
)

================================================
FILE: cohorts/2022/week_2_data_ingestion/transfer_service/README.md
================================================
## Generate AWS Access key
- Log in to your AWS account
- Search for IAM
  ![aws iam](../../images/aws/iam.png)
- Click on `Manage access key`
- Click on `Create New Access Key`
- Download the CSV file; it contains your access key and secret. (Note that a lost secret cannot be recovered.)

## Transfer service
https://console.cloud.google.com/transfer/cloud/jobs


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/.env_example
================================================
# Custom
COMPOSE_PROJECT_NAME=dtc-de
GOOGLE_APPLICATION_CREDENTIALS=/.google/credentials/google_credentials.json
AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json
# AIRFLOW_UID=
GCP_PROJECT_ID=
GCP_GCS_BUCKET=

# Postgres
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# Airflow
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=10

AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}
AIRFLOW_CONN_METADATA_DB=postgres+psycopg2://airflow:airflow@postgres:5432/airflow
AIRFLOW_VAR__METADATA_DB_SCHEMA=airflow

_AIRFLOW_WWW_USER_CREATE=True
_AIRFLOW_WWW_USER_USERNAME=${_AIRFLOW_WWW_USER_USERNAME:-airflow}
_AIRFLOW_WWW_USER_PASSWORD=${_AIRFLOW_WWW_USER_PASSWORD:-airflow}

AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True
AIRFLOW__CORE__LOAD_EXAMPLES=False


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/1_setup_official.md
================================================
## Setup (Official)

### Pre-Reqs

1. For the sake of standardization across this workshop's config,
    rename your GCP service account credentials file to `google_credentials.json` and store it in `$HOME/.google/credentials/`:
    ```bash
    cd ~ && mkdir -p ~/.google/credentials/
    mv <path/to/your/service-account-authkeys>.json ~/.google/credentials/google_credentials.json
    ```

2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to a minimum of 5GB
(ideally 8GB). If not enough memory is allocated, airflow-webserver may restart continuously.

3. Python version: 3.7+


### Airflow Setup

1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in)

2. **Set the Airflow user**:

    On Linux, the quick-start needs to know your host user ID and needs the group ID set to 0.
    Otherwise, the files created in `dags`, `logs` and `plugins` will be owned by the root user.
    Make sure to configure this for docker-compose:

    ```bash
    mkdir -p ./dags ./logs ./plugins
    echo -e "AIRFLOW_UID=$(id -u)" > .env
    ```

    On Windows, you will probably need this as well. If you use MINGW/Git Bash, run the same commands.

    To get rid of the warning ("AIRFLOW_UID is not set"), you can create a `.env` file with
    this content:

    ```
    AIRFLOW_UID=50000
    ```

3. **Import the official docker setup file** from the latest Airflow version:
   ```shell
   curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'
   ```
   
4. Seeing so many services in this file can be overwhelming,
   but it is only a quick-start template, and as you proceed you'll figure out which unused services can be removed.
   For example, [here's](docker-compose-nofrills.yml) a no-frills version of that template.


5. **Docker Build**:

    When you want to run Airflow locally, you might want to use an extended image, 
    containing some additional dependencies - for example you might add new python packages, 
    or upgrade airflow providers to a later version.
    
    Create a `Dockerfile` that uses the Airflow version you've just downloaded,
    such as `apache/airflow:2.2.3`, as the base image.

    Then customize this `Dockerfile` by:
    * Adding the custom packages to be installed. The one we'll need the most is `gcloud`, to connect to the GCS bucket/data lake.
    * Integrating `requirements.txt` to install libraries via `pip install`.


6. **Docker Compose**:

    Back in your `docker-compose.yaml`:
   * In `x-airflow-common`: 
     * Remove the `image` tag and replace it with a `build` section that points to your Dockerfile
     * Mount your `google_credentials` in `volumes` section as read-only
     * Set environment variables: `GCP_PROJECT_ID`, `GCP_GCS_BUCKET`, `GOOGLE_APPLICATION_CREDENTIALS` & `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`, as per your config.

   * Change `AIRFLOW__CORE__LOAD_EXAMPLES` to `false` (optional)

7. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose.yaml](./docker-compose.yaml) should look.
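
The `.env` step above (step 2) can also be done from Python, which may help in shells where `id -u` is not available. This is a minimal sketch, not part of the official quick-start: the helper names `airflow_uid_line` and `write_env` are hypothetical, and the fallback of 50000 is the default value mentioned in step 2.

```python
import os
from pathlib import Path


def airflow_uid_line(default_uid: int = 50000) -> str:
    """Return the AIRFLOW_UID line for .env (hypothetical helper)."""
    # os.getuid() only exists on Unix-like systems; elsewhere fall back
    # to the default UID (an assumption taken from step 2 above).
    uid = os.getuid() if hasattr(os, "getuid") else default_uid
    return f"AIRFLOW_UID={uid}"


def write_env(directory: Path) -> None:
    """Create dags/logs/plugins and write .env inside `directory`."""
    # Same effect as: mkdir -p ./dags ./logs ./plugins
    for name in ("dags", "logs", "plugins"):
        (directory / name).mkdir(parents=True, exist_ok=True)
    # Overwrites .env, like `>` in the shell command.
    (directory / ".env").write_text(airflow_uid_line() + "\n")
```

Calling `write_env(Path("."))` from your `airflow` directory should leave you with the same files as the shell commands.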


## Problems

### `File /.google/credentials/google_credentials.json was not found`

First, make sure you have your credentials in your `$HOME/.google/credentials`.
Maybe you missed that step and didn't copy your JSON file with the credentials there?
Also, make sure the file name is `google_credentials.json`.

Second, check that docker-compose correctly maps this directory to the Airflow worker.

Execute `docker ps` to see the list of Docker containers running on your host machine and find the ID of the Airflow worker.

Then execute `bash` on this container:

```bash
docker exec -it <container-ID> bash
```

Now check if the file with credentials is actually there:

```bash
ls -lh /.google/credentials/
```

If it's empty, docker-compose couldn't map the folder with credentials.
In this case, try changing the volume mapping to the absolute path of this folder:

```yaml
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    # here: ----------------------------
    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro
    # -----------------------------------
```
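The checks above can also be scripted. The sketch below is one possible way to do it, not an official tool: `check_credentials` and `REQUIRED_KEYS` are hypothetical names, and the list of fields is an assumption based on what a Google service-account key file normally contains.

```python
import json
from pathlib import Path
from typing import List

# Fields a service-account key file normally contains (assumption).
REQUIRED_KEYS = ("type", "project_id", "private_key", "client_email")


def check_credentials(path: Path) -> List[str]:
    """Return a list of problems found with the key file (empty if none)."""
    if not path.is_file():
        return [f"{path} was not found"]
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return [f"{path} does not contain a JSON object"]
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in data]
    if "type" in data and data["type"] != "service_account":
        problems.append("'type' is not 'service_account'")
    return problems
```

On the host you could call it with `Path.home() / ".google/credentials/google_credentials.json"`, and inside the container with `Path("/.google/credentials/google_credentials.json")`.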


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/2_setup_nofrills.md
================================================
## Setup (No-frills)

### Pre-Reqs

1. For the sake of standardization across this workshop's config,
    rename your GCP service account credentials file to `google_credentials.json` & store it in `$HOME/.google/credentials/`
    ``` bash
        cd ~ && mkdir -p ~/.google/credentials/
        mv <path/to/your/service-account-authkeys>.json ~/.google/credentials/google_credentials.json
    ```

2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to a minimum of 4GB
(ideally 8GB). If not enough memory is allocated, the airflow-webserver may restart continuously.

3. Python version: 3.7+


### Airflow Setup

1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in)
   
2. **Set the Airflow user**:

    On Linux, the quick-start needs to know your host user ID and needs the group ID set to 0.
    Otherwise, the files created in `dags`, `logs` and `plugins` will be owned by the root user.
    Configure them for docker-compose as follows:

    ```bash
    mkdir -p ./dags ./logs ./plugins
    echo -e "AIRFLOW_UID=$(id -u)" >> .env
    ```

    On Windows you will probably also need it. If you use MINGW/GitBash, execute the same command. 

    To get rid of the warning ("AIRFLOW_UID is not set"), you can create `.env` file with
    this content:

    ```
    AIRFLOW_UID=50000
    ```

3. **Docker Build**:

    When you want to run Airflow locally, you might want to use an extended image, 
    containing some additional dependencies - for example you might add new python packages, 
    or upgrade airflow providers to a later version.
    
    Create a `Dockerfile` that uses the latest Airflow version, such as `apache/airflow:2.2.3`, as the base image.
       
    And customize this `Dockerfile` by:
    * Adding your custom packages to be installed. The one we'll need the most is `gcloud` to connect with the GCS bucket (Data Lake).
    * Also, integrating `requirements.txt` to install libraries via  `pip install`

4. Copy [docker-compose-nofrills.yml](docker-compose-nofrills.yml), [.env_example](.env_example) & [entrypoint.sh](scripts/entrypoint.sh) from this repo.
    The changes from the official setup are:
    * Removal of `redis` queue, `worker`, `triggerer`, `flower` & `airflow-init` services, 
    and changing from `CeleryExecutor` (multi-node) mode to `LocalExecutor` (single-node) mode 
    * Inclusion of `.env` for better parametrization & flexibility
    * Inclusion of a simple `entrypoint.sh` in the `webserver` container, responsible for initializing the database and creating the login user (admin).
    * Updated `Dockerfile` to grant permissions on executing `scripts/entrypoint.sh`
        
5. `.env`:
    * Rebuild your `.env` file by making a copy of `.env_example` (but make sure your `AIRFLOW_UID` remains):
        ```shell
        cp .env_example .env
        ```
    * Set environment variables `AIRFLOW_UID`, `GCP_PROJECT_ID` & `GCP_GCS_BUCKET`, as per your config.
    * Optionally, if your `google_credentials.json` is stored somewhere else, such as a path like `$HOME/.gc`, 
    modify the env-vars (`GOOGLE_APPLICATION_CREDENTIALS`, `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`) and `volumes` path in `docker-compose-nofrills.yml`

6. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose-nofrills](./docker-compose-nofrills.yml) should look.
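
Before starting the containers, it may help to confirm that the `.env` file from step 5 actually defines the variables listed there. The sketch below is a rough check under stated assumptions: it treats `.env` as simple `KEY=VALUE` lines, and the names `parse_env`, `missing_vars` and `REQUIRED_VARS` are hypothetical.

```python
from typing import Dict, List

# Variables that step 5 asks you to set.
REQUIRED_VARS = ("AIRFLOW_UID", "GCP_PROJECT_ID", "GCP_GCS_BUCKET")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding whitespace and optional quotes around the value.
        env[key.strip()] = value.strip().strip("'\"")
    return env


def missing_vars(env: Dict[str, str]) -> List[str]:
    """Return required variables that are absent or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

For example, `missing_vars(parse_env(open(".env").read()))` returns an empty list when all three variables are set.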


## Problems

### `no-frills setup does not work for me - WSL/Windows user `

If you are running Docker in Windows/WSL/WSL2 and you have encountered some `ModuleNotFoundError` or low performance issues,
take a look at this [Airflow & WSL2 gist](https://gist.github.com/nervuzz/d1afe81116cbfa3c834634ebce7f11c5) focused entirely on troubleshooting possible problems.

### `File /.google/credentials/google_credentials.json was not found`

First, make sure you have your credentials in your `$HOME/.google/credentials`.
Maybe you missed that step and didn't copy your JSON file with the credentials there?
Also, make sure the file name is `google_credentials.json`.

Second, check that docker-compose correctly maps this directory to the Airflow worker.

Execute `docker ps` to see the list of Docker containers running on your host machine and find the ID of the Airflow worker.

Then execute `bash` on this container:

```bash
docker exec -it <container-ID> bash
```

Now check if the file with credentials is actually there:

```bash
ls -lh /.google/credentials/
```

If it's empty, docker-compose couldn't map the folder with credentials.
In this case, try changing the volume mapping to the absolute path of this folder:

```yaml
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    # here: ----------------------------
    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro
    # -----------------------------------
```


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/README.md
================================================
### Concepts

 [Airflow Concepts and Architecture](../week_2_data_ingestion/airflow/docs/1_concepts.md)

### Workflow

 ![](docs/gcs_2_bq_dag_graph_view.png)
 
 ![](docs/gcs_2_bq_dag_tree_view.png)
 
### Setup - Official Version
 (For the section on the Custom/Lightweight setup, scroll down)

 #### Setup
  [Airflow Setup with Docker, through official guidelines](1_setup_official.md)

 #### Execution
 
  1. Build the image (only the first time, or whenever the `Dockerfile` changes; the first build takes ~15 mins):
     ```shell
     docker-compose build
     ```
   
     or (for legacy versions)
   
     ```shell
     docker build .
     ```

 2. Initialize the Airflow scheduler, DB, and other config
    ```shell
    docker-compose up airflow-init
    ```

 3. Start all the services:
    ```shell
    docker-compose up
    ```

 4. In another terminal, run `docker-compose ps` to see which containers are up & running (there should be 7, matching with the services in your docker-compose file).

 5. Log in to the Airflow web UI on `localhost:8080` with the default creds: `airflow/airflow`

 6. Run your DAG on the Web Console.

 7. On finishing your run or to shut down the container/s:
    ```shell
    docker-compose down
    ```

    To stop and delete containers, delete volumes with database data, and delete downloaded images, run:
    ```
    docker-compose down --volumes --rmi all
    ```

    or
    ```
    docker-compose down --volumes --remove-orphans
    ```
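If you would rather not watch the logs to find out when the web UI is ready, you can poll the webserver's `/health` endpoint (the same URL that the healthcheck in `docker-compose.yaml` uses). This is a sketch under assumptions: the response is assumed to be a JSON object with `metadatabase` and `scheduler` entries that each carry a `status` field, and the helper names and default timings are hypothetical.

```python
import json
import time
import urllib.request
from typing import Any, Dict

HEALTH_URL = "http://localhost:8080/health"


def is_healthy(payload: Dict[str, Any]) -> bool:
    """True if both the metadatabase and the scheduler report 'healthy'."""
    return all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


def wait_for_airflow(url: str = HEALTH_URL, attempts: int = 30, delay: float = 10.0) -> bool:
    """Poll the health endpoint until it reports healthy or attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if is_healthy(json.load(response)):
                    return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: not ready yet.
            pass
        time.sleep(delay)
    return False
```

Run `wait_for_airflow()` in another terminal after `docker-compose up`; it returns `True` once both components report healthy.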
       
### Setup - Custom No-Frills Version (Lightweight)
This is a quick, simple & less memory-intensive setup of Airflow that works on a LocalExecutor.

  #### Setup
  [Airflow Setup with Docker, customized](2_setup_nofrills.md)

  #### Execution
  
  1. Stop and delete containers, delete volumes with database data, & downloaded images (from the previous setup):
    ```
    docker-compose down --volumes --rmi all
    ```

   or
    ```
    docker-compose down --volumes --remove-orphans
    ```
    
   Or, if you need to clear your system of any pre-cached Docker issues:
    ```
    docker system prune
    ```
    
   Also, empty the airflow `logs` directory.
    
  2. Build the image (only the first time, or whenever the `Dockerfile` changes).
  The first build takes ~5-10 mins:
    ```shell
    docker-compose build
    ```
    or (for legacy versions)
    ```shell
    docker build .
    ```

  3. Start all the services (no need to initialize them separately):
    ```shell
    docker-compose -f docker-compose-nofrills.yml up
    ```

  4. In another terminal, run `docker ps` to see which containers are up & running (there should be 3, matching with the services in your docker-compose file).

  5. Log in to the Airflow web UI on `localhost:8080` with creds: `admin/admin` (explicit creation of the admin user was required)

  6. Run your DAG on the Web Console.

  7. On finishing your run or to shut down the container/s:
    ```shell
    docker-compose down
    ```
    
   
### Future Enhancements
* Deploy self-hosted Airflow setup on Kubernetes cluster, or use a Managed Airflow (Cloud Composer) service by GCP

### References
For more info, check out these official docs:
   * https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html
   * https://airflow.apache.org/docs/docker-stack/build.html
   * https://airflow.apache.org/docs/docker-stack/recipes.html


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/dags/gcs_to_bq_dag.py
================================================
import os
import logging

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator, BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

PROJECT_ID = os.environ.get("GCP_PROJECT_ID")
BUCKET = os.environ.get("GCP_GCS_BUCKET")

path_to_local_home = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")
BIGQUERY_DATASET = os.environ.get("BIGQUERY_DATASET", 'trips_data_all')

DATASET = "tripdata"
COLOUR_RANGE = {'yellow': 'tpep_pickup_datetime', 'green': 'lpep_pickup_datetime'}
INPUT_PART = "raw"
INPUT_FILETYPE = "parquet"

default_args = {
    "owner": "airflow",
    "start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}

# NOTE: DAG declaration - using a Context Manager (an implicit way)
with DAG(
    dag_id="gcs_2_bq_dag",
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
    max_active_runs=1,
    tags=['dtc-de'],
) as dag:

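    # Build one task chain per taxi colour; ds_col is the pickup-timestamp column used for partitioning
    for colour, ds_col in COLOUR_RANGE.items():
        # Move the raw files (e.g. raw/yellow_tripdata*.parquet) into a per-colour folder in the same bucket
        move_files_gcs_task = GCSToGCSOperator(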
            task_id=f'move_{colour}_{DATASET}_files_task',
            source_bucket=BUCKET,
            source_object=f'{INPUT_PART}/{colour}_{DATASET}*.{INPUT_FILETYPE}',
            destination_bucket=BUCKET,
            destination_object=f'{colour}/{colour}_{DATASET}',
            move_object=True
        )

        # Expose the moved files as a BigQuery external table
        bigquery_external_table_task = BigQueryCreateExternalTableOperator(
            task_id=f"bq_{colour}_{DATASET}_external_table_task",
            table_resource={
                "tableReference": {
                    "projectId": PROJECT_ID,
                    "datasetId": BIGQUERY_DATASET,
                    "tableId": f"{colour}_{DATASET}_external_table",
                },
                "externalDataConfiguration": {
                    "autodetect": True,
                    "sourceFormat": INPUT_FILETYPE.upper(),
                    "sourceUris": [f"gs://{BUCKET}/{colour}/*"],
                },
            },
        )

        CREATE_BQ_TBL_QUERY = (
            f"CREATE OR REPLACE TABLE {BIGQUERY_DATASET}.{colour}_{DATASET} \
            PARTITION BY DATE({ds_col}) \
            AS \
            SELECT * FROM {BIGQUERY_DATASET}.{colour}_{DATASET}_external_table;"
        )

        # Create a partitioned table from external table
        bq_create_partitioned_table_job = BigQueryInsertJobOperator(
            task_id=f"bq_create_{colour}_{DATASET}_partitioned_table_task",
            configuration={
                "query": {
                    "query": CREATE_BQ_TBL_QUERY,
                    "useLegacySql": False,
                }
            }
        )

        move_files_gcs_task >> bigquery_external_table_task >> bq_create_partitioned_table_job


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/docker-compose-nofrills.yml
================================================
version: '3'
services:
    postgres:
        image: postgres:13
        env_file:
            - .env
        volumes:
            - postgres-db-volume:/var/lib/postgresql/data
        healthcheck:
            test: ["CMD", "pg_isready", "-U", "airflow"]
            interval: 5s
            retries: 5
        restart: always

    scheduler:
        build: .
        command: scheduler
        restart: on-failure
        depends_on:
            - postgres
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
            - ./plugins:/opt/airflow/plugins
            - ./scripts:/opt/airflow/scripts
            - ~/.google/credentials/:/.google/credentials:ro


    webserver:
        build: .
        entrypoint: ./scripts/entrypoint.sh
        restart: on-failure
        depends_on:
            - postgres
            - scheduler
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
            - ./plugins:/opt/airflow/plugins
            - ~/.google/credentials/:/.google/credentials:ro
            - ./scripts:/opt/airflow/scripts

        user: "${AIRFLOW_UID:-50000}:0"
        ports:
            - "8080:8080"
        healthcheck:
            test: [ "CMD-SHELL", "[ -f /home/airflow/airflow-webserver.pid ]" ]
            interval: 30s
            timeout: 30s
            retries: 3

volumes:
  postgres-db-volume:

================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/docker-compose.yaml
================================================
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#

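# Basic Airflow configuration for LocalExecutor with PostgreSQL, adapted from the official
# CeleryExecutor template (the Redis, worker, triggerer and flower services are commented out).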
#
# WARNING: This configuration is for local development. Do not use it in a production deployment.
#
# This configuration supports basic configuration using environment variables or an .env file
# The following variables are supported:
#
# AIRFLOW_IMAGE_NAME           - Docker image name used to run Airflow.
#                                Default: apache/airflow:2.2.3
# AIRFLOW_UID                  - User ID in Airflow containers
#                                Default: 50000
# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode
#
# _AIRFLOW_WWW_USER_USERNAME   - Username for the administrator account (if requested).
#                                Default: airflow
# _AIRFLOW_WWW_USER_PASSWORD   - Password for the administrator account (if requested).
#                                Default: airflow
# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.
#                                Default: ''
#
# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  build:
    context: .
    dockerfile: ./Dockerfile
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
#    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
#    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json
    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'

    # TODO: Please change GCP_PROJECT_ID & GCP_GCS_BUCKET, as per your config
    GCP_PROJECT_ID: 'pivotal-surfer-336713'
    GCP_GCS_BUCKET: 'dtc_data_lake_pivotal-surfer-336713'

  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - ~/.google/credentials/:/.google/credentials:ro

  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
#    redis:
#      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

#  redis:
#    image: redis:latest
#    expose:
#      - 6379
#    healthcheck:
#      test: ["CMD", "redis-cli", "ping"]
#      interval: 5s
#      timeout: 30s
#      retries: 50
#    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

#  airflow-worker:
#    <<: *airflow-common
#    command: celery worker
#    healthcheck:
#      test:
#        - "CMD-SHELL"
#        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
#      interval: 10s
#      timeout: 10s
#      retries: 5
#    environment:
#      <<: *airflow-common-env
#      # Required to handle warm shutdown of the celery workers properly
#      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
#      DUMB_INIT_SETSID: "0"
#    restart: always
#    depends_on:
#      <<: *airflow-common-depends-on
#      airflow-init:
#        condition: service_completed_successfully
#
#  airflow-triggerer:
#    <<: *airflow-common
#    command: triggerer
#    healthcheck:
#      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
#      interval: 10s
#      timeout: 10s
#      retries: 5
#    restart: always
#    depends_on:
#      <<: *airflow-common-depends-on
#      airflow-init:
#        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        function ver() {
          printf "%04d%04d%04d%04d" $${1//./ }
        }
        airflow_version=$$(gosu airflow airflow version)
        airflow_version_comparable=$$(ver $${airflow_version})
        min_airflow_version=2.2.0
        min_airflow_version_comparable=$$(ver $${min_airflow_version})
        if (( airflow_version_comparable < min_airflow_version_comparable )); then
          echo
          echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m"
          echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!"
          echo
          exit 1
        fi
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
    user: "0:0"
    volumes:
      - .:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

#  flower:
#    <<: *airflow-common
#    command: celery flower
#    ports:
#      - 5555:5555
#    healthcheck:
#      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
#      interval: 10s
#      timeout: 10s
#      retries: 5
#    restart: always
#    depends_on:
#      <<: *airflow-common-depends-on
#      airflow-init:
#        condition: service_completed_successfully

volumes:
  postgres-db-volume:


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/scripts/entrypoint.sh
================================================
#!/usr/bin/env bash
export GOOGLE_APPLICATION_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS}
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=${AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT}

airflow db upgrade

airflow users create -r Admin -u admin -p admin -e admin@example.com -f admin -l airflow
# "$_AIRFLOW_WWW_USER_USERNAME" -p "$_AIRFLOW_WWW_USER_PASSWORD"

airflow webserver


================================================
FILE: cohorts/2022/week_5_batch_processing/homework.md
================================================
## Week 5 Homework

In this homework we'll put what we learned about Spark
into practice.

We'll use the High Volume For-Hire Vehicles (HVFHV) dataset for that.

## Question 1. Install Spark and PySpark

* Install Spark
* Run PySpark
* Create a local spark session 
* Execute `spark.version`

What's the output?


## Question 2. HVFHV February 2021

Download the HVFHV data for February 2021:

```bash
wget https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-02.csv
```

Read it with Spark using the same schema as we did 
in the lessons. We will use this dataset for all
the remaining questions.

Repartition it to 24 partitions and save it to
parquet.

What's the size of the folder with results (in MB)?
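
If `du` is not available on your system, the folder size can be measured with a few lines of Python. This is only a sketch: `folder_size_mb` is a hypothetical helper, and it assumes that 1 MB means 1024 * 1024 bytes, so the number may differ slightly from tools that use 1000 * 1000.

```python
from pathlib import Path


def folder_size_mb(folder: Path) -> float:
    """Sum the sizes of all files under `folder`, in MB (1 MB = 1024 * 1024 bytes)."""
    total_bytes = sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())
    return total_bytes / (1024 * 1024)
```

Call it with the path of the folder you saved the parquet files to, for example `folder_size_mb(Path("path/to/output"))`, where the path is a placeholder.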


## Question 3. Count records 

How many taxi trips were there on February 15?

Consider only trips that started on February 15.


## Question 4. Longest trip for each day

Now calculate the duration for each trip.

On which day did the longest trip start?


## Question 5. Most frequent `dispatching_base_num`

Now find the most frequently occurring `dispatching_base_num` 
in this dataset.

How many stages does this Spark job have?

> Note: the answer may depend on how you write the query,
> so there are multiple correct answers. 
> Select the one you have.


## Question 6. Most common locations pair

Find the most common pickup-dropoff pair. 

For example:

"Jamaica Bay / Clinton East"

Enter two zone names separated by a slash

If any of the zone names are unknown (missing), use "Unknown". For example, "Unknown / Clinton East". 


## Bonus question. Join type

(not graded) 

For finding the answer to Q6, you'll need to perform a join.

What type of join is it?

And how many stages does your Spark job have?


## Submitting the solutions

* Form for submitting: https://forms.gle/dBkVK9yT8cSMDwuw7
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 07 March (Monday), 22:00 CET


================================================
FILE: cohorts/2022/week_6_stream_processing/homework.md
================================================
## Week 6 Homework
[Form](https://forms.gle/mSzfpPCXskWCabeu5)

The homework is mostly theoretical. In the last question you have to provide a link to working code; please keep in mind that this
question is not scored.

Deadline: 14 March, 22:00 CET

================================================
FILE: cohorts/2023/README.md
================================================
## Data Engineering Zoomcamp 2023 Cohort

* [Launch stream with course overview](https://www.youtube.com/watch?v=-zpVha7bw5A)
* [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html)
* [Public Leaderboard](leaderboard.md) and [Private Leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vTbL00GcdQp0bJt9wf1ROltMq7s3qyxl-NYF7Pvk79Jfxgwfn9dNWmPD_yJHTDq_Wzvps8EIr6cOKWm/pubhtml)
* [Course Playlist: Only 2023 Live videos & homeworks](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW)

[**Week 1: Introduction & Prerequisites**](week_1_docker_sql/)

* [Homework SQL](week_1_docker_sql/homework.md) and [solution](https://www.youtube.com/watch?v=KIh_9tZiroA)
* [Homework Terraform](week_1_terraform/homework.md)
* [Office hours](https://www.youtube.com/watch?v=RVTryVvSyw4&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW)

[**Week 2: Workflow Orchestration**](week_2_workflow_orchestration)

* [Homework](week_2_workflow_orchestration/homework.md)
* [Office hours part 1](https://www.youtube.com/watch?v=a_nmLHb8hzw&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW) and [part 2](https://www.youtube.com/watch?v=PK8yyMY54Vk&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW&index=7) 

[**Week 3: Data Warehouse**](week_3_data_warehouse)

* [Homework](week_3_data_warehouse/homework.md)
* [Office hours](https://www.youtube.com/watch?v=QXfmtJp3bXE&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW)

[**Week 4: Analytics Engineering**](week_4_analytics_engineering/)

* [Homework](week_4_analytics_engineering/homework.md)
* [PipeRider + dbt Workshop](workshops/piperider.md)
* [Office hours](https://www.youtube.com/watch?v=ODYg_r72qaE&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW)

[**Week 5: Batch processing**](week_5_batch_processing/)

* [Homework](week_5_batch_processing/homework.md)
* [Office hours](https://www.youtube.com/watch?v=5_69yL2PPYI&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW)

[**Week 6: Stream Processing**](week_6_stream_processing)

* [Homework](week_6_stream_processing/homework.md)


[**Week 7, 8 & 9: Project**](project.md)

More information [here](project.md)


================================================
FILE: cohorts/2023/leaderboard.md
================================================
## Leaderboard 

This is the [top 100 leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vTbL00GcdQp0bJt9wf1ROltMq7s3qyxl-NYF7Pvk79Jfxgwfn9dNWmPD_yJHTDq_Wzvps8EIr6cOKWm/pubhtml)
of participants of the Data Engineering Zoomcamp 2023 edition!

<table>
<tr>
  <th>Name</th>
  <th>Project</th>
  <th>Social</th>
  <th>Links and comments</th>
</tr>
<tr>
<td>Katharina Eichinger</td>
<td><a href="https://github.com/PandaKata/dezoomcamp-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/katharina-eichinger/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/PandaKata"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Alia Hamwi</td>
<td><a href="https://github.com/AliaHa3/data-engineering-zoomcamp-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/alia-hamwi/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/AliaHa3"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Emmanuel Ikpesu</td>
<td><a href="https://github.com/uchiharon/DataTalksClub_de-zoomcamp_CapStone_Project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/emmanuel-ikpesu-393708132/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/uchiharon"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://medium.com/@emmanarutops2/automating-data-pipelines-using-prefect-block-98d9b16f16bc">Automating Data Pipelines Using Prefect Block</a></li>
</ul></details></td>
</tr>
<tr>
<td>Sanya Syed</td>
<td><a href="https://github.com/sanyassyed/sf_eviction">Project</a></td>
<td> <a href="http://linkedin.com/in/sanyasy"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/sanyassyed"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://resume.creddle.io/resume/1so01cu6gx7">My Resume</a></li>
</ul>

> I am excited about the prospect of securing a challenging role as a Data Engineer, where I can utilise my skills and expertise to contribute meaningfully to an organisation's data-driven initiatives. </details></td>
</tr>
<tr>
<td>Aminu Lawal</td>
<td><a href="https://github.com/zabull1/cycling_DE_project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/aminu-lawal-600920100/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/zabull1"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Lisa Reiber</td>
<td><a href="https://github.com/lisallreiber/biketheft_berlin">Project</a></td>
<td> <a href="https://www.linkedin.com/in/lisareiber/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/lisallreiber"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://lookerstudio.google.com/u/2/reporting/8a06d083-e46f-403a-bcb0-d3ff23434e24/page/p_nmv21l7w4c">Project Dashboard</a></li>
</ul>

> always happy to connect with other data enthusiasts over topics like low-budget data engineering solutions for non-profits or AI solutions for non-profits</details></td>
</tr>
<tr>
<td>Vincenzo Galante</td>
<td><a href="https://lookerstudio.google.com/u/0/reporting/ebdf68e1-27f7-435b-8add-a4018681f801/page/BkBJD">Project</a></td>
<td> <a href="https://www.linkedin.com/in/galantevincenzo/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/VincenzoGalante"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


> Thank you for having this course!</details></td>
</tr>
<tr>
<td>Grzegorz Gątkowski </td>
<td><a href="https://github.com/GrzegorzGatkowski/Air_Pollution_Pipeline">Project</a></td>
<td> <a href="https://www.linkedin.com/in/grzegorz-g%C4%85tkowski-811727125/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/GrzegorzGatkowski"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Matt Young</td>
<td><a href="https://github.com/directdetour/BeerReviewsDataPipeline">Project</a></td>
<td> <a href="https://www.linkedin.com/in/matt-young-11377720/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/directdetour"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://twitter.com/ymatty">Twitter</a></li>
</ul>

> Experienced Developer | Cloud & Data Enthusiast | Open to Cloud & Data Engineering Roles 🌩️
> ➜ C#, SQL, JavaScript, Python | BI, Data Analytics | AWS, Azure, GCP
>
> Passionate about data pipelines, storage, and processing. Excited to implement advanced cloud solutions and enable data-driven insights. Seeking Data Engineering opportunities to leverage my extensive SQL/Data Analytics experience and to transition into the world of cloud-based data solutions. Let's connect and collaborate on innovative data projects! #DataEngineering #CloudTechnology</details></td>
</tr>
<tr>
<td>Sam Hatley</td>
<td><a href="https://github.com/sam-hatley/real-estate-data">Project</a></td>
<td> <a href="https://www.linkedin.com/in/samhatley/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/sam-hatley"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Evan Hofmeister</td>
<td><a href="https://github.com/EvanHofmeister/Housing-Wealth-Pipeline">Project</a></td>
<td> <a href="https://www.linkedin.com/in/evanhofmeister/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/EvanHofmeister"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Barys Kazarkin</td>
<td><a href="https://github.com/KazarkinBarys/Data_Engineering_Zoomcamp_Project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/barys-kazarkin-b9904b203/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/KazarkinBarys"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Joshua Ati</td>
<td><a href="https://github.com/joshuaati/DE_airline_pipeline">Project</a></td>
<td> <a href="https://www.linkedin.com/in/joshua-ati-460750110/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/joshuaati"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Oleg Agapov</td>
<td><a href="https://github.com/oleg-agapov/de-zoomcamp-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/oagapov/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/oleg-agapov/"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://twitter.com/oleg_agapov_">Twitter</a></li>
<li><a href="https://olegagapov.com/">Website</a></li>
</ul></details></td>
</tr>
<tr>
<td>Mikhail Kuklin</td>
<td><a href="https://github.com/MikhailKuklin/data-pipeline-COVID19-monitoring">Project</a></td>
<td> <a href="https://www.linkedin.com/in/mikhail-kuklin-194a9544/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/MikhailKuklin"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://mikhailkuklin.wordpress.com">Personal webpage</a></li>
</ul></details></td>
</tr>
<tr>
<td>Emmanuel Letremble</td>
<td><a href="https://github.com/Valkea/DE_bootcamp_project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/letremble"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Valkea"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://valkea.github.io">Portfolio</a></li>
</ul>

> Thanks to the DataTalks.Club for completing my Full Stack & Machine Learning skill sets with some extra DE knowledge.</details></td>
</tr>
<tr>
<td>Victor Kuang</td>
<td><a href="https://github.com/vykuang/toronto-service-calls-2023">Project</a></td>
<td> <a href="https://www.linkedin.com/in/vykuang/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/vykuang"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Antonis Angelakis</td>
<td><a href="https://github.com/angeanto/dezoomcamp-project-youtube">Project</a></td>
<td> <a href="https://www.linkedin.com/in/antonios-angelakis-249899101"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/angeanto"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Christian Ruiz</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Alex Pilugin</td>
<td><a href="https://github.com/skipper-com/dtc_de_course_project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/alexander-pilugin/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/skipper-com?tab=repositories"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Ahmad Rizky</td>
<td><a href="https://linktr.ee/ahmdxrzky">Project</a></td>
<td> <a href="https://linkedin.com/in/ahmdxrzky"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/ahmdxrzky"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Juan Francisco Hernandez Hernandez </td>
<td><a href="https://github.com/JuanPacoHernandez/TelecommDescriptive-Analysis">Project</a></td>
<td> <a href="https://www.linkedin.com/in/juan-paco-hernandez/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/JuanPacoHernandez"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


> Thanks to Data Talks Club, it was amazing learning for me as a Career changer.</details></td>
</tr>
<tr>
<td>Iurii Chernigin</td>
<td><a href="https://github.com/iurii-chernigin/audio-streaming-data-platform">Project</a></td>
<td> <a href="https://www.linkedin.com/in/iurii-chernigin/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/iurii-chernigin"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Franklyne Kibet</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Federico Zambelli</td>
<td><a href="https://github.com/wtfzambo/subreddit-analytics">Project</a></td>
<td> <a href="https://www.linkedin.com/in/fzambo/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/wtfzambo"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Marilina Orihuela</td>
<td><a href="https://github.com/mary435/MLA_Dashboard">Project</a></td>
<td> <a href="https://www.linkedin.com/in/marilina-orihuela/?locale=en_US"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/mary435"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Alejandro R. Mármol Ruiz</td>
<td><a href="https://github.com/marmola90/dezoomcampam">Project</a></td>
<td> <a href="https://www.linkedin.com/in/alejandro-marmol-81a998167/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/marmola90"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Daniel Takeshi</td>
<td><a href="https://github.com/danietakeshi/de-zoomcamp-2023/tree/main/project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/daniel-takeshi"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/danietakeshi"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Xia He-Bleinagel</td>
<td><a href="https://github.com/Data-Think-2021/DE-Final-Project-CO2">Project</a></td>
<td> <a href="https://www.linkedin.com/in/xia-he-bleinagel-51773585/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Data-Think-2021"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://xiahe-bleinagel.com/">Personal website</a></li>
</ul></details></td>
</tr>
<tr>
<td>Thorsten Foltz</td>
<td></td>
<td> <a href="https://www.linkedin.com/in/thorsten-foltz-a91481127/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Danh Vo</td>
<td><a href="https://github.com/datavadoz/eu-airbnb">Project</a></td>
<td> <a href="https://www.linkedin.com/in/0798a811b"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/datavadoz"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Joseph Ologunja</td>
<td><a href="https://github.com/Joseun/data-engineering-zoomcamp/tree/main/cohorts/2023/week_7_project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/josephologunja/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Joseun"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Roman Zabolotin</td>
<td><a href="https://github.com/rzabolotin/de_zoomcamp_2023_project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/rzabolotin/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/rzabolotin"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Aditya Gupta </td>
<td><a href="https://github.com/itsadityagupta/yelposphere">Project</a></td>
<td> <a href="https://www.linkedin.com/in/itsadityagupta"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/itsadityagupta"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://peerlist.io/itsadityagupta">Portfolio</a></li>
</ul></details></td>
</tr>
<tr>
<td>Vladimir Bugaevskii</td>
<td><a href="https://github.com/vbugaevskii/de-zoomcamp-cycling-2023">Project</a></td>
<td> <a href="https://www.linkedin.com/in/vbugaevskii/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/vbugaevskii"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Fozan Talat</td>
<td><a href="https://github.com/Fozan-Talat/divvy-bikeshare-de-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/fozan-talat/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Fozan-Talat"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Alain Boisvert</td>
<td><a href="https://github.com/boisalai/twitter-dashboard">Project</a></td>
<td> <a href="https://www.linkedin.com/in/alain-boisvert-98b058156/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/boisalai"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>reneboy garcia</td>
<td><a href="https://github.com/reneboygarcia/capstone_project_mongodb.git">Project</a></td>
<td> <a href="http://www.linkedin.com/in/eboygarcia"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/reneboygarcia"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


> "Success is not always about the grand achievements; it's about the small victories that accumulate over time." - Unknown</details></td>
</tr>
<tr>
<td>Svetlana Kononova</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Dmitrii Nikolaev</td>
<td></td>
<td> <a href="https://www.linkedin.com/in/dnnikolaev/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/melvinru"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://t.me/melvinru">DN Telegram</a></li>
</ul></details></td>
</tr>
<tr>
<td>Francis Romio</td>
<td><a href="https://github.com/romiof/brazil-weather">Project</a></td>
<td> <a href="https://br.linkedin.com/in/francisromio"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/romiof"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Saul Acevedo</td>
<td><a href="https://github.com/seacevedo/Solana-Pipeline">Project</a></td>
<td> <a href="https://www.linkedin.com/in/saul-acevedo-739b17122"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/seacevedo"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Alina Li</td>
<td><a href="https://github.com/alinali87/de-zoomcamp-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/alinali87/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Alexander Eryuzhev</td>
<td><a href="https://github.com/aeryuzhev/de-zoomcamp-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/alexander-eryuzhev/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Paul Nwosu</td>
<td><a href="https://github.com/paulonye/Cloudrunjobs">Project</a></td>
<td> <a href="https://www.linkedin.com/in/nwosu-paul-1b7b2218b/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/paulonye"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://medium.com/@nwosupaul141/serverless-deployment-of-a-prefect-data-pipeline-on-google-cloud-run-8c48765f2480">Serverless deployment of a Prefect data pipeline on Google Cloud Run</a></li>
</ul></details></td>
</tr>
<tr>
<td>Param mirani </td>
<td><a href="https://github.com/Param-29/stock-data-pipeline">Project</a></td>
<td> <a href="https://in.linkedin.com/in/param-mirani"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Param-29"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Oscar Garcia - ozkary</td>
<td><a href="https://github.com/ozkary/data-engineering-mta-turnstile/">Project</a></td>
<td> <a href="https://github.com/ozkary"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://twitter.com/ozkary">Twitter</a></li>
<li><a href="https://www.youtube.com/channel/UCpaqmBQr8YE6ikLXXyt8D7g">YouTube</a></li>
<li><a href="https://www.ozkary.com">Blog</a></li>
</ul></details></td>
</tr>
<tr>
<td>Hector Torres</td>
<td><a href="https://github.com/hdt94/dtc-de-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/hdt94/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/hdt94/"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://twitter.com/hdt94">Twitter @hdt94</a></li>
</ul>

> Currently looking for a position as data engineer</details></td>
</tr>
<tr>
<td>Dewi Nurfitri Oktaviani</td>
<td><a href="https://github.com/oktavianidewi/github-data-pipeline">Project</a></td>
<td> <a href="https://www.linkedin.com/in/dewi-nurfitri-oktaviani-6b450b22/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/oktavianidewi"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://medium.com/@oktavianidewi">medium</a></li>
</ul></details></td>
</tr>
<tr>
<td>Ryno Marx</td>
<td></td>
<td> <a href="https://www.linkedin.com/in/ryno-m-402a58120"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Hidir Cem Altun</td>
<td></td>
<td> <a href="https://www.linkedin.com/in/hidir-cem-altun-914aaa65/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/HCA97"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Francis Mark Cayco</td>
<td><a href="https://github.com/PeteCastle/League-of-Legends-Analytics">Project</a></td>
<td> <a href="https://www.linkedin.com/in/francis-mark-cayco-33511a190/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/PeteCastle"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Adrian Baumann</td>
<td><a href="https://github.com/adrian-baumann/dwd-temp-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/adrianbaumann/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/adrian-baumann"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Vladislav Garist</td>
<td><a href="https://github.com/garistvlad/data-engineering-zoomcamp/tree/main/week-7">Project</a></td>
<td> <a href="https://www.linkedin.com/in/vgarist/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/garistvlad"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Gerald Ooi</td>
<td></td>
<td> <a href="https://www.linkedin.com/in/geraldooi/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Roman</td>
<td></td>
<td> <a href="https://www.linkedin.com/in/roman-yakovlev-86b2b4130"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/romanyakovlev"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Aleksandr Krasnov</td>
<td></td>
<td> <a href="https://www.linkedin.com/in/aleksandr-krasnov/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://www.linkedin.com/in/aleksandr-krasnov/">Open to work</a></li>
</ul></details></td>
</tr>
<tr>
<td>Jaesung Ryu</td>
<td><a href="https://github.com/Haebuk/GHArchive-Data-Pipeline-Project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/jaesungryu"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Haebuk"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>António Damião Rodrigues</td>
<td><a href="https://github.com/adamiaonr/de-zoomcamp-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/adamiaonrod/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/adamiaonr"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Alicia Escontrela</td>
<td><a href="https://github.com/aliescont/dezoomcamp-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/alicia-escontrela/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/aliescont"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Chalermdej Lematavekul</td>
<td><a href="https://github.com/Chalermdej-l/Final_Project_FredETE">Project</a></td>
<td> <a href="https://www.linkedin.com/in/chalermdej-l/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Chalermdej-l?tab=repositories"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


> Thank you so much for the course. Learned so many things from here.</details></td>
</tr>
<tr>
<td>Muhammed Jimoh</td>
<td><a href="https://github.com/Manny-97/DE-ZOOMCAMP-PROJECT">Project</a></td>
<td> <a href="https://www.linkedin.com/in/%F0%9F%91%A8%F0%9F%8F%BE%E2%80%8D%F0%9F%92%BB-muhammed-jimoh-45120a14a/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Manny-97"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Bartosz Skłodowski</td>
<td><a href="https://github.com/bartoszsklodowski/de_zoomcamp_project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/bartosz-sk%C5%82odowski/?locale=en_US"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/bartoszsklodowski"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Daniel Rigney</td>
<td><a href="https://github.com/danielyrigney/USDA-Data-Pipeline">Project</a></td>
<td> <a href="https://www.linkedin.com/in/daniel-rigney-data/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/danielyrigney"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Daniel Gheorghita</td>
<td><a href="https://github.com/daniel-gheorghita/dezoomcamp/tree/main/7_project_Belgium_housing_market">Project</a></td>
<td> <a href="https://www.linkedin.com/in/daniel-gheorghita-4a59903a/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/daniel-gheorghita"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Daniel Gheorghita</td>
<td><a href="https://github.com/daniel-gheorghita/belgian_housing_buy_vs_rent">Project</a></td>
<td> <a href="https://www.linkedin.com/in/daniel-gheorghita-4a59903a/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/daniel-gheorghita"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Niel Kemp</td>
<td></td>
<td> <a href="https://www.linkedin.com/in/nielkemp/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Shahmir</td>
<td><a href="https://github.com/Light2Dark/quality-of-life">Project</a></td>
<td> <a href="https://www.linkedin.com/in/shahmir-varqha"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Light2Dark"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


Links:

<ul>
<li><a href="https://smolwaffle.com">Portfolio</a></li>
</ul>

> I've added a bunch of new features since the reviews! Check it out</details></td>
</tr>
<tr>
<td>Matt Bertrand</td>
<td><a href="https://github.com/mbertrand/eo-climate-pipeline">Project</a></td>
<td> <a href="https://www.linkedin.com/in/bertrandmatt/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/mbertrand"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Nikolay Galkov</td>
<td><a href="https://github.com/ngalkov/DEZoomcamp_project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/nikolay-galkov/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/ngalkov"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Hiroko Sakai</td>
<td><a href="https://github.com/hirobo/world-earthquake">Project</a></td>
<td> <a href="https://www.linkedin.com/in/hirokos/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/hirobo"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Rohit Joshi</td>
<td><a href="https://github.com/Rohitjoshi07/FHVDataAnalysis">Project</a></td>
<td> <a href="https://www.linkedin.com/in/rohit-joshi09"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/RohitJoshi07"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Valerii Bazyrov</td>
<td></td>
<td> <a href="https://www.linkedin.com/in/lantenak/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/lantenak"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Juan Pablo Ricapito</td>
<td><a href="https://github.com/EzicStar/BA-turnstiles-pipeline">Project</a></td>
<td> <a href="https://www.linkedin.com/in/juan-pablo-ricapito-112332186/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/EzicStar"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Ashraf Omara</td>
<td><a href="https://github.com/AshrafOmara12/Ukraine-Conflict-Twitter-Data-Pipeline">Project</a></td>
<td> <a href="https://www.linkedin.com/in/ashraf-omara-48294a106/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/AshrafOmara12"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


> I need to thank all of the data club community for this amazing contribution. </details></td>
</tr>
<tr>
<td>Wasawat Boonyarittikit</td>
<td><a href="https://github.com/ChungWasawat/dtc_de_project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/wasawat-boonyarittikit-b1698b179/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/ChungWasawat"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td></td>
</tr>
<tr>
<td>Fedor Faizov</td>
<td><a href="https://github.com/Fedrpi/de-zoomcamp-bandcamp-project">Project</a></td>
<td> <a href="https://www.linkedin.com/in/fedor-faizov-a75b32245/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Fedrpi"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
<td><details>
<summary>More info</summary>


> Absolutly amazing course <3 </details></td>

</tr>
</table>


================================================
FILE: cohorts/2023/project.md
================================================
## Course Project

The goal of this project is to apply everything we learned
in this course and build an end-to-end data pipeline.

You will have two attempts to submit your project. If you don't have 
time to submit your project by the end of attempt #1 (you started the 
course late, you have vacation plans, life/work got in the way, etc.)
or you fail your first attempt, 
then you will have a second chance to submit your project as attempt
#2. 

There are only two attempts.

Remember that to pass the project, you must evaluate 3 peers. If you don't do that,
your project can't be considered complete.

To find the projects assigned to you, use the peer review assignments link 
and find your hash in the first column. You will see three rows: you need to evaluate 
each of these projects. For each project, you need to submit the form once,
so in total, you will make three submissions. 


### Submitting

#### Project Attempt #1

Project:

* Form: https://forms.gle/zTJiVYSmCgsENj6y8
* Deadline: 10 April, 22:00 CET

Peer reviewing:

* Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vRYQ0A9C7AkRK-YPSFhqaRMmuPR97QPfl2PjI8n11l5jntc6YMHIJXVVS0GQNqAYIGwzyevyManDB08/pubhtml?gid=0&single=true) ("project-01" sheet)
* Form: https://forms.gle/1bxmgR8yPwV359zb7
* Deadline: 17 April, 22:00 CET

Project feedback: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vQuMt9m1XlPrCACqnsFTXTV_KGiSnsl9UjL7kdTMsLJ8DLu3jNJlPzoUKG6baxc8APeEQ8RaSP1U2VX/pubhtml?gid=27207346&single=true) ("project-01" sheet)

#### Project Attempt #2

Project:

* Form: https://forms.gle/gCXUSYBm1KgMKXVm8
* Deadline: 4 May, 22:00 CET

Peer reviewing:

* Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vRYQ0A9C7AkRK-YPSFhqaRMmuPR97QPfl2PjI8n11l5jntc6YMHIJXVVS0GQNqAYIGwzyevyManDB08/pubhtml?gid=303437788&single=true) ("project-02" sheet)
* Form: https://forms.gle/2x5MT4xxczR8isy37
* Deadline: 11 May, 22:00 CET

Project feedback: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vQuMt9m1XlPrCACqnsFTXTV_KGiSnsl9UjL7kdTMsLJ8DLu3jNJlPzoUKG6baxc8APeEQ8RaSP1U2VX/pubhtml?gid=246029638&single=true)

### Evaluation criteria

See [here](../../week_7_project/README.md)


### Misc

To get the hash for your project, use this function to hash your email:

```python
from hashlib import sha1

def compute_hash(email):
    return sha1(email.lower().encode('utf-8')).hexdigest()
```

Or use [this website](http://www.sha1-online.com/). 
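
As a usage sketch of the function above, the snippet below hashes a made-up address (`Student@Example.com` is hypothetical, not a real submission) and shows two properties worth knowing when looking up your row: the result is always 40 hexadecimal characters, and capitalisation does not matter while stray whitespace does.

```python
from hashlib import sha1


def compute_hash(email):
    # Same function as above: lowercase, UTF-8 encode, SHA-1 hex digest.
    return sha1(email.lower().encode('utf-8')).hexdigest()


# Hypothetical address, used only for illustration.
digest = compute_hash("Student@Example.com")

# A SHA-1 hex digest is always 40 hexadecimal characters.
print(len(digest))

# Lowercasing means capitalisation does not change the hash...
print(digest == compute_hash("student@example.com"))

# ...but a leading space does, so strip whitespace before hashing.
print(digest == compute_hash(" student@example.com"))
```

If your hash does not appear in the sheet, checking for extra whitespace or a different address than the one used at registration is a reasonable first step.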


================================================
FILE: cohorts/2023/week_1_docker_sql/homework.md
================================================
## Week 1 Homework

In this homework we'll prepare the environment
and practice with Docker and SQL.


## Question 1. Knowing docker tags

Run the command to get information on Docker 

```docker --help```

Now run the command to get help on the "docker build" command

Which tag has the following text? - *Write the image ID to the file* 

- `--imageid string`
- `--iidfile string`
- `--idimage string`
- `--idfile string`


## Question 2. Understanding docker first run 

Run Docker with the `python:3.9` image in interactive mode with `bash` as the entrypoint.
Now check which Python modules are installed (use `pip list`).
How many Python packages/modules are installed?

- 1
- 6
- 3
- 7

# Prepare Postgres

Run Postgres and load data as shown in the videos.
We'll use the green taxi trips from January 2019:

```wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz```

You will also need the dataset with zones:

```wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv```

Download this data and put it into Postgres (with jupyter notebooks or with a pipeline)


## Question 3. Count records 

How many taxi trips were made in total on January 15?

Tip: count only trips that both started and finished on 2019-01-15.

Remember that the `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns are timestamps (date plus hour, minute, and second), not dates.

- 20689
- 20530
- 17630
- 21090
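
The note about timestamps versus dates can be illustrated with a small sketch. It uses an in-memory SQLite table with made-up rows as a stand-in for Postgres (an assumption made only so the example runs without a database server); in Postgres the same idea is usually written with a cast such as `CAST(lpep_pickup_datetime AS DATE)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (lpep_pickup_datetime TEXT, lpep_dropoff_datetime TEXT)"
)

# Toy rows, not the real dataset.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        ("2019-01-15 08:00:00", "2019-01-15 08:20:00"),  # starts and ends on the 15th
        ("2019-01-15 23:50:00", "2019-01-16 00:10:00"),  # ends the next day
        ("2019-01-14 23:55:00", "2019-01-15 00:05:00"),  # starts the previous day
    ],
)

# A timestamp only equals a date after it has been reduced to a date.
query = """
    SELECT COUNT(*)
    FROM trips
    WHERE DATE(lpep_pickup_datetime) = '2019-01-15'
      AND DATE(lpep_dropoff_datetime) = '2019-01-15'
"""
same_day_trips = conn.execute(query).fetchone()[0]
print(same_day_trips)
```

Only the first toy row satisfies both conditions, which is why filtering on both columns matters.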

## Question 4. Largest trip for each day

Which day had the trip with the largest distance?
Use the pickup time for your calculations.

- 2019-01-18
- 2019-01-28
- 2019-01-15
- 2019-01-10

## Question 5. The number of passengers

On 2019-01-01, how many trips had 2 passengers and how many had 3?
 
- 2: 1282 ; 3: 266
- 2: 1532 ; 3: 126
- 2: 1282 ; 3: 254
- 2: 1282 ; 3: 274


## Question 6. Largest tip

For passengers picked up in the Astoria zone, which drop-off zone had the largest tip?
We want the name of the zone, not the id.

Note: it's not a typo, it's `tip` , not `trip`

- Central Park
- Jamaica
- South Ozone Park
- Long Island City/Queens Plaza


## Submitting the solutions

* Form for submitting: [form](https://forms.gle/EjphSkR1b3nsdojv7)
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 30 January (Monday), 22:00 CET


## Solution

See here: https://www.youtube.com/watch?v=KIh_9tZiroA


================================================
FILE: cohorts/2023/week_1_terraform/homework.md
================================================
## Week 1 Homework

In this homework we'll prepare the environment by creating resources in GCP with Terraform.

In your VM on GCP install Terraform. Copy the files from the course repo
[here](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup/1_terraform_gcp/terraform) to your VM.

Modify the files as necessary to create a GCP Bucket and Big Query Dataset.


## Question 1. Creating Resources

After updating the `main.tf` and `variables.tf` files, run:

```
terraform apply
```

Paste the output of this command into the homework submission form.


## Submitting the solutions

* Form for submitting: [form](https://forms.gle/S57Xs3HL9nB3YTzj9)
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 30 January (Monday), 22:00 CET


================================================
FILE: cohorts/2023/week_2_workflow_orchestration/README.md
================================================
## Week 2: Workflow Orchestration

Python code from videos is linked [below](#code-repository).

Also, if you find the commands too small to view in Kalise's videos, here's the [transcript with code for the second Prefect video](https://github.com/discdiver/prefect-zoomcamp/tree/main/flows/01_start) and the [fifth Prefect video](https://github.com/discdiver/prefect-zoomcamp/tree/main/flows/03_deployments).

### Data Lake (GCS)

* What is a Data Lake
* ELT vs. ETL
* Alternatives to components (S3/HDFS, Redshift, Snowflake etc.)
* [Video](https://www.youtube.com/watch?v=W3Zm6rjOq70&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* [Slides](https://docs.google.com/presentation/d/1RkH-YhBz2apIjYZAxUz2Uks4Pt51-fVWVN9CcH9ckyY/edit?usp=sharing)


### 1. Introduction to Workflow orchestration

* What is orchestration?
* Workflow orchestrators vs. other types of orchestrators
* Core features of a workflow orchestration tool
* Different types of workflow orchestration tools that currently exist 

:movie_camera: [Video](https://www.youtube.com/watch?v=8oLs6pzHp68&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)


### 2. Introduction to Prefect concepts

* What is Prefect?
* Installing Prefect
* Prefect flow
* Creating an ETL
* Prefect task
* Blocks and collections
* Orion UI

:movie_camera: [Video](https://www.youtube.com/watch?v=cdtN6dhp708&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)

### 3. ETL with GCP & Prefect

* Flow 1: Putting data to Google Cloud Storage 

:movie_camera: [Video](https://www.youtube.com/watch?v=W-rMz_2GwqQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)


### 4. From Google Cloud Storage to Big Query

* Flow 2: From GCS to BigQuery

:movie_camera: [Video](https://www.youtube.com/watch?v=Cx5jt-V5sgE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)

### 5. Parametrizing Flow & Deployments 

* Parametrizing the script from your flow
* Parameter validation with Pydantic
* Creating a deployment locally
* Setting up Prefect Agent
* Running the flow
* Notifications

:movie_camera: [Video](https://www.youtube.com/watch?v=QrDxPjX10iw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)

### 6. Schedules & Docker Storage with Infrastructure

* Scheduling a deployment
* Flow code storage
* Running tasks in Docker

:movie_camera: [Video](https://www.youtube.com/watch?v=psNSzqTsi-s&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)

### 7. Prefect Cloud and Additional Resources 


* Using Prefect Cloud instead of local Prefect
* Workspaces
* Running flows on GCP

:movie_camera: [Video](https://www.youtube.com/watch?v=gGC23ZK7lr8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)

* [Prefect docs](https://docs.prefect.io/)
* [Prefect Discourse](https://discourse.prefect.io/)
* [Prefect Cloud](https://app.prefect.cloud/)
* [Prefect Slack](https://prefect-community.slack.com)

### Code repository

[Code from videos](https://github.com/discdiver/prefect-zoomcamp) (with a few minor enhancements)

### Homework 
Homework can be found [here](./homework.md).

## Community notes

Did you take notes? You can share them here.

* [Blog by Marcos Torregrosa (Prefect)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-2/)
* [Notes from Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week2)
* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week2.md)
* [Notes by Candace Williams](https://github.com/teacherc/de_zoomcamp_candace2023/blob/main/week_2/week2_notes.md)
* [Notes from Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/02/week-2-data-engineering-zoomcamp-notes-prefect/)
* [Notes from froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_2_workflow_orchestration/notes/notes_week_02.md)
* [Notes from Balaji](https://github.com/Balajirvp/DE-Zoomcamp/blob/main/Week%202/Detailed%20Week%202%20Notes.ipynb)
* More on [Pandas vs SQL, Prefect capabilities, and testing your data](https://medium.com/@verazabeida/zoomcamp-2023-week-3-7f27bb8c483f), by Vera
* Add your notes here (above this line)


================================================
FILE: cohorts/2023/week_2_workflow_orchestration/homework.md
================================================
## Week 2 Homework

The goal of this homework is to familiarise users with workflow orchestration and observation. 


## Question 1. Load January 2020 data

Using the `etl_web_to_gcs.py` flow that loads taxi data into GCS as a guide, create a flow that loads the green taxi CSV dataset for January 2020 into GCS and run it. Look at the logs to find out how many rows the dataset has.

How many rows does that dataset have?

* 447,770
* 766,792
* 299,234
* 822,132


## Question 2. Scheduling with Cron

Cron is a common scheduling specification for workflows. 

Using the flow in `etl_web_to_gcs.py`, create a deployment to run on the first of every month at 5am UTC. What’s the cron schedule for that?

- `0 5 1 * *`
- `0 0 5 1 *`
- `5 * 1 0 *`
- `* * 5 1 0`
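
As a reminder of how a standard five-field cron expression is laid out, here is a minimal sketch that labels each field. The helper `describe_cron` is hypothetical and written only for this illustration, and the sample expression is deliberately unrelated to the question.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Map a five-field cron expression to its field names."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# Example unrelated to the question: 02:30 on the 15th of every month.
schedule = describe_cron("30 2 15 * *")
print(schedule)
```

Reading the fields left to right (minute, hour, day of month, month, day of week) is usually enough to rule out expressions with the wrong number of fields or values in the wrong position.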


## Question 3. Loading data to BigQuery 

Using `etl_gcs_to_bq.py` as a starting point, modify the script for extracting data from GCS and loading it into BigQuery. This new script should not fill or remove rows with missing values. (The script is really just doing the E and L parts of ETL).

The main flow should print the total number of rows processed by the script. Set the flow decorator to log the print statement.

Parametrize the entrypoint flow to accept a list of months, a year, and a taxi color. 

Make any other necessary changes to the code for it to function as required.

Create a deployment for this flow to run in a local subprocess with local flow code storage (the defaults).

Make sure you have the parquet data files for Yellow taxi data for Feb. 2019 and March 2019 loaded in GCS. Run your deployment to append this data to your BigQuery table. How many rows did your flow code process?

- 14,851,920
- 12,282,990
- 27,235,753
- 11,338,483
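
The shape of a parametrized entrypoint can be sketched in plain Python, without Prefect, under a few assumptions: `load_month` and `etl_parent` are hypothetical names, the row counts are made up, and the real load step is replaced by a dictionary lookup. In an actual flow these functions would carry Prefect's decorators.

```python
def load_month(color, year, month, row_counts):
    """Hypothetical stand-in for the GCS-to-BigQuery step; returns rows loaded."""
    # File naming follows the pattern of the course data, e.g. green_tripdata_2019-01.
    dataset_file = f"{color}_tripdata_{year}-{month:02}"
    return row_counts[dataset_file]


def etl_parent(months, year, color, row_counts):
    """Loop over the requested months and report the total rows processed."""
    total = sum(load_month(color, year, m, row_counts) for m in months)
    print(f"total rows processed: {total}")
    return total


# Made-up row counts, for illustration only.
fake_counts = {"yellow_tripdata_2021-01": 10, "yellow_tripdata_2021-02": 32}
total_rows = etl_parent(months=[1, 2], year=2021, color="yellow", row_counts=fake_counts)
```

Keeping the per-month step as a function that returns its row count makes the total easy to compute and print in the parent.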


## Question 4. Github Storage Block

Using the `web_to_gcs` script from the videos as a guide, you want to store your flow code in a GitHub repository for collaboration with your team. Prefect can look in the GitHub repo to find your flow code and read it. Create a GitHub storage block from the UI or in Python code and use that in your Deployment instead of storing your flow code locally or baking your flow code into a Docker image. 

Note that you will have to push your code to GitHub, Prefect will not push it for you.

Run your deployment in a local subprocess (the default if you don’t specify an infrastructure). Use the Green taxi data for the month of November 2020.

How many rows were processed by the script?

- 88,019
- 192,297
- 88,605
- 190,225


## Question 5. Email or Slack notifications

It’s often helpful to be notified when something with your dataflow doesn’t work as planned. Choose one of the options below for creating email or Slack notifications.

The hosted Prefect Cloud lets you avoid running your own server and has Automations that allow you to get notifications when certain events occur or don’t occur. 

Create a free forever Prefect Cloud account at app.prefect.cloud and connect your workspace to it following the steps in the UI when you sign up. 

Set up an Automation that will send yourself an email when a flow run completes. Run the deployment used in Q4 for the Green taxi data for April 2019. Check your email to see the notification.

Alternatively, use a Prefect Cloud Automation or a self-hosted Orion server Notification to get notifications in a Slack workspace via an incoming webhook. 

Join my temporary Slack workspace with [this link](https://join.slack.com/t/temp-notify/shared_invite/zt-1odklt4wh-hH~b89HN8MjMrPGEaOlxIw). 400 people can use this link and it expires in 90 days. 

In the Prefect Cloud UI create an [Automation](https://docs.prefect.io/ui/automations) or in the Prefect Orion UI create a [Notification](https://docs.prefect.io/ui/notifications/) to send a Slack message when a flow run enters a Completed state. Here is the Webhook URL to use: https://hooks.slack.com/services/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

Test the functionality.

Alternatively, you can grab the webhook URL from your own Slack workspace and Slack App that you create. 


How many rows were processed by the script?

- `125,268`
- `377,922`
- `728,390`
- `514,392`


## Question 6. Secrets

Prefect Secret blocks provide secure, encrypted storage in the database and obfuscation in the UI. Create a secret block in the UI that stores a fake 10-digit password to connect to a third-party service. Once you’ve created your block in the UI, how many characters are shown as asterisks (*) on the next page of the UI?

- 5
- 6
- 8
- 10


## Submitting the solutions

* Form for submitting: https://forms.gle/PY8mBEGXJ1RvmTM97
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 8 February (Wednesday), 22:00 CET


## Solution

* Video: https://youtu.be/L04lvYqNlc0
* Code: https://github.com/discdiver/prefect-zoomcamp/tree/main/flows/04_homework


================================================
FILE: cohorts/2023/week_3_data_warehouse/homework.md
================================================
## Week 3 Homework
<b><u>Important Note:</u></b> <p>You can load the data however you would like, but keep the files in .GZ Format. 
If you are using orchestration such as Airflow or Prefect do not load the data into Big Query using the orchestrator.</br> 
Stop with loading the files into a bucket. </br></br>
<u>NOTE:</u> You can use the CSV option for the GZ files when creating an External Table</br>

<b>SETUP:</b></br>
Create an external table using the fhv 2019 data. </br>
Create a table in BQ using the fhv 2019 data (do not partition or cluster this table). </br>
Data can be found here: https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv </p>

## Question 1:
What is the count for fhv vehicle records for year 2019?
- 65,623,481
- 43,244,696
- 22,978,333
- 13,942,414

## Question 2:
Write a query to count the number of distinct affiliated_base_number values for the entire dataset on both tables.</br> 
What is the estimated amount of data that will be read when this query is executed on the External Table and the Table?

- 25.2 MB for the External Table and 100.87MB for the BQ Table
- 225.82 MB for the External Table and 47.60MB for the BQ Table
- 0 MB for the External Table and 0MB for the BQ Table
- 0 MB for the External Table and 317.94MB for the BQ Table 


## Question 3:
How many records have both a blank (null) PUlocationID and DOlocationID in the entire dataset?
- 717,748
- 1,215,687
- 5
- 20,332

## Question 4:
What is the best strategy to optimize the table if queries always filter by pickup_datetime and order by affiliated_base_number?
- Cluster on pickup_datetime Cluster on affiliated_base_number
- Partition by pickup_datetime Cluster on affiliated_base_number
- Partition by pickup_datetime Partition by affiliated_base_number
- Partition by affiliated_base_number Cluster on pickup_datetime

## Question 5:
Implement the optimized solution you chose for question 4. Write a query to retrieve the distinct affiliated_base_number between pickup_datetime 2019/03/01 and 2019/03/31 (inclusive).</br> 
Use the BQ table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 4 and note the estimated bytes processed. What are these values? Choose the answer which most closely matches.
- 12.82 MB for non-partitioned table and 647.87 MB for the partitioned table
- 647.87 MB for non-partitioned table and 23.06 MB for the partitioned table
- 582.63 MB for non-partitioned table and 0 MB for the partitioned table
- 646.25 MB for non-partitioned table and 646.25 MB for the partitioned table


## Question 6: 
Where is the data stored in the External Table you created?

- Big Query
- GCP Bucket
- Container Registry
- Big Table


## Question 7:
It is best practice in Big Query to always cluster your data:
- True
- False


## (Not required) Question 8:
A better format to store these files may be parquet. Create a data pipeline to download the gzip files and convert them into parquet. Upload the files to your GCP Bucket and create an External and BQ Table. 


Note: Each column must have the same data type across all files used in an External Table. An External Table may be created and shown in the side panel in Big Query even when the types do not match, so validate it by running a count query on the External Table and checking whether any errors occur. 
 
## Submitting the solutions

* Form for submitting: https://forms.gle/rLdvQW2igsAT73HTA
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 13 February (Monday), 22:00 CET


## Solution

Solution: https://www.youtube.com/watch?v=j8r2OigKBWE


================================================
FILE: cohorts/2023/week_4_analytics_engineering/homework.md
================================================
## Week 4 Homework 

In this homework, we'll use the models developed in the week 4 videos and extend the dbt project presented there with the fhv vehicle data for 2019 that is already loaded in our DWH.

This means that in this homework we use the following data [Datasets list](https://github.com/DataTalksClub/nyc-tlc-data/)
* Yellow taxi data - Years 2019 and 2020
* Green taxi data - Years 2019 and 2020 
* fhv data - Year 2019. 

We will use the data loaded for:

* Building a source table: `stg_fhv_tripdata`
* Building a fact table: `fact_fhv_trips`
* Create a dashboard 

If you don't have access to GCP, you can do this locally using the ingested data from your Postgres database
instead. If you have access to GCP, you don't need to do it for local Postgres -
only if you want to.

> **Note**: if your answer doesn't match exactly, select the closest option 

### Question 1: 

**What is the count of records in the model fact_trips after running all models with the test run variable disabled and filtering for 2019 and 2020 data only (pickup datetime)?** 

You'll need to have completed the ["Build the first dbt models"](https://www.youtube.com/watch?v=UVI30Vxzd6c) video and have been able to run the models via the CLI. 
You should find the views and models for querying in your DWH.

- 41648442
- 51648442
- 61648442
- 71648442


### Question 2: 

**What is the distribution between service type filtering by years 2019 and 2020 data as done in the videos?**

You will need to complete "Visualising the data" videos, either using [google data studio](https://www.youtube.com/watch?v=39nLTs74A3E) or [metabase](https://www.youtube.com/watch?v=BnLkrA7a6gM). 

- 89.9/10.1
- 94/6
- 76.3/23.7
- 99.1/0.9


### Question 3: 

**What is the count of records in the model stg_fhv_tripdata after running all models with the test run variable disabled (:false)?**  

Create a staging model for the fhv data for 2019 and do not add a deduplication step. Run it via the CLI without limits (is_test_run: false).
Filter records with pickup time in year 2019.

- 33244696
- 43244696
- 53244696
- 63244696


### Question 4: 

**What is the count of records in the model fact_fhv_trips after running all dependencies with the test run variable disabled (:false)?**  

Create a core model for the stg_fhv_tripdata joining with dim_zones.
Similar to what we've done in fact_trips, keep only records with known pickup and dropoff locations. 
Run it via the CLI without limits (is_test_run: false) and filter records with pickup time in year 2019.

- 12998722
- 22998722
- 32998722
- 42998722

### Question 5: 

**What is the month with the largest number of rides after building a tile for the fact_fhv_trips table?**

Create a dashboard with some tiles that you find interesting to explore the data. One tile should show the number of trips per month, as done in the videos for fact_trips, based on the fact_fhv_trips table.

- March
- April
- January
- December


## Submitting the solutions

* Form for submitting: https://forms.gle/6A94GPutZJTuT5Y16
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 25 February (Saturday), 22:00 CET


## Solution

* Video: https://www.youtube.com/watch?v=I_K0lNu9WQw&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW
* Answers:
  * Question 1: 61648442
  * Question 2: 89.9/10.1
  * Question 3: 43244696
  * Question 4: 22998722
  * Question 5: January


================================================
FILE: cohorts/2023/week_5_batch_processing/homework.md
================================================
## Week 5 Homework 

In this homework we'll put what we learned about Spark in practice.

For this homework we will use the FHVHV 2021-06 data, found here: [FHVHV Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-06.csv.gz)


### Question 1: 

**Install Spark and PySpark** 

- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.

What's the output?
- 3.3.2
- 2.1.4
- 1.2.3
- 5.4
</br></br>


### Question 2: 

**FHVHV June 2021**

Read it with Spark using the same schema as we did in the lessons.</br> 
We will use this dataset for all the remaining questions.</br>
Repartition it to 12 partitions and save it to parquet.</br>
What is the average size (in MB) of the Parquet files (those ending with the .parquet extension) that were created? Select the answer which most closely matches.</br>


- 2MB
- 24MB
- 100MB
- 250MB
</br></br>
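
One way to measure the average file size after Spark has written its output is sketched below with the standard library only. The helper `average_parquet_size_mb` is hypothetical, and the demo uses dummy files in a temporary directory instead of real Spark output.

```python
import tempfile
from pathlib import Path


def average_parquet_size_mb(directory):
    """Average size in MB of the *.parquet files directly inside `directory`."""
    sizes = [p.stat().st_size for p in Path(directory).glob("*.parquet")]
    if not sizes:
        raise ValueError("no .parquet files found")
    return sum(sizes) / len(sizes) / (1024 * 1024)


# Demo with dummy files standing in for Spark output.
with tempfile.TemporaryDirectory() as tmp:
    for i, size in enumerate([1024 * 1024, 3 * 1024 * 1024]):
        (Path(tmp) / f"part-{i:05}.parquet").write_bytes(b"\0" * size)
    # Marker files such as _SUCCESS do not end in .parquet, so the glob skips them.
    (Path(tmp) / "_SUCCESS").write_bytes(b"")
    demo_average = average_parquet_size_mb(tmp)

print(demo_average)
```

Pointing the function at the real output directory gives the figure to compare against the options; `ls -lh` on that directory is an equally valid check.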


### Question 3: 

**Count records**  

How many taxi trips were there on June 15?</br></br>
Consider only trips that started on June 15.</br>

- 308,164
- 12,856
- 452,470
- 50,982
</br></br>


### Question 4: 

**Longest trip for each day**  

Now calculate the duration for each trip.</br>
How long was the longest trip in Hours?</br>

- 66.87 Hours
- 243.44 Hours
- 7.68 Hours
- 3.32 Hours
</br></br>
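
The duration calculation itself is simple arithmetic on two timestamps. The sketch below shows it in plain Python with made-up trips; the helper `duration_hours` and the timestamp format are assumptions for illustration, and in Spark the same subtraction would be applied to the pickup and dropoff columns.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def duration_hours(pickup, dropoff):
    """Trip duration in hours from two timestamp strings."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    return delta.total_seconds() / 3600


# Made-up trips, not taken from the dataset.
trips = [
    ("2021-06-01 10:00:00", "2021-06-01 10:30:00"),
    ("2021-06-01 23:00:00", "2021-06-02 01:15:00"),
]
longest = max(duration_hours(p, d) for p, d in trips)
print(longest)
```

Working in seconds first and dividing by 3600 at the end avoids losing the minutes and seconds when a trip crosses midnight.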

### Question 5: 

**User Interface**

On which local port does Spark’s user interface, which shows the application's dashboard, run?</br>

- 80
- 443
- 4040
- 8080
</br></br>


### Question 6: 

**Most frequent pickup location zone**

Load the zone lookup data into a temp view in Spark</br>
[Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv)</br>

Using the zone lookup data and the fhvhv June 2021 data, what is the name of the most frequent pickup location zone?</br>

- East Chelsea
- Astoria
- Union Sq
- Crown Heights North
</br></br>


## Submitting the solutions

* Form for submitting: https://forms.gle/EcSvDs6vp64gcGuD8
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 06 March (Monday), 22:00 CET


## Solution

* Video: https://www.youtube.com/watch?v=ldoDIT32pJs
* Answers:
  * Question 1: 3.3.2
  * Question 2: 24MB
  * Question 3: 452,470
  * Question 4: 66.87 Hours
  * Question 5: 4040
  * Question 6: Crown Heights North


================================================
FILE: cohorts/2023/week_6_stream_processing/client.properties
================================================
# Required connection configs for Kafka producer, consumer, and admin
bootstrap.servers=<CONFLUENT CLOUD KAFKA BROKER>:9092
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username=<CONFLUENT CLOUD API USER NAME>
sasl.password=<CONFLUENT CLOUD API PASSWORD>

# Best practice for higher availability in librdkafka clients prior to 1.7
session.timeout.ms=45000

================================================
FILE: cohorts/2023/week_6_stream_processing/homework.md
================================================
## Week 6 Homework 

In this homework, there are two sections. The first section focuses on theoretical questions related to Kafka 
and streaming concepts, and the second section asks you to create a small streaming application using your preferred 
programming language (Python or Java).

### Question 1: 

**Please select the statements that are correct**

- A Kafka node is responsible for storing topics [x]
- Zookeeper is removed from the Kafka cluster starting from version 4.0 [x]
- Retention configuration ensures that messages are not lost over a specific period of time. [x]
- Group-Id ensures that messages are distributed to the associated consumers [x]


### Question 2: 

**Please select the Kafka concepts that support reliability and availability**

- Topic Replication [x]
- Topic Partitioning
- Consumer Group Id
- Ack All [x]


### Question 3: 

**Please select the Kafka concepts that support scaling**  

- Topic Replication
- Topic Partitioning [x]
- Consumer Group Id [x]
- Ack All


### Question 4: 

**Please select the attributes that are good candidates for partitioning key. 
Consider cardinality of the field you have selected and scaling aspects of your application**  

- payment_type [x]
- vendor_id [x]
- passenger_count
- total_amount
- tpep_pickup_datetime
- tpep_dropoff_datetime


### Question 5: 

**Which configurations below should be provided for Kafka Consumer but not needed for Kafka Producer**

- Deserializer Configuration [x]
- Topics Subscription [x]
- Bootstrap Server 
- Group-Id [x]
- Offset [x]
- Cluster Key and Cluster-Secret


### Question 6:

Please implement a streaming application that finds the popularity of PUlocationID across the green and fhv trip datasets.
Please use the datasets [fhv_tripdata_2019-01.csv.gz](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv) 
and [green_tripdata_2019-01.csv.gz](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green)

PS: If you encounter memory-related issues, you can use a smaller portion of these two datasets; 
it is not necessary to find the exact number in the question.

Your code should include the following:
1. A producer that reads the CSV files and publishes rides to the corresponding Kafka topics (such as rides_green, rides_fhv)
2. A PySpark streaming application that reads the two Kafka topics,
   writes both of them to the topic rides_all, and applies aggregations to find the most popular pickup location.

   
## Submitting the solutions

* Form for submitting: https://forms.gle/rK7268U92mHJBpmW7
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 13 March (Monday), 22:00 CET


## Solution

We will publish the solution here after the deadline.

For Question 6, make sure to:

1) Download fhv_tripdata_2019-01.csv and green_tripdata_2019-01.csv into resources/fhv_tripdata 
and resources/green_tripdata respectively. PS: You need to unzip the compressed files.

2) Update the client.properties settings using your Confluent Cloud API keys and cluster. 
3) Create the topics (all_rides, fhv_taxi_rides, green_taxi_rides) in the Confluent Cloud UI.

4) Run the producers for the two datasets
```
python3 producer_confluent.py --type green
python3 producer_confluent.py --type fhv
```

5) Run pyspark streaming
```
./spark-submit.sh streaming_confluent.py
```


================================================
FILE: cohorts/2023/week_6_stream_processing/producer_confluent.py
================================================
from confluent_kafka import Producer

import argparse
import csv
from typing import Dict
from time import sleep

from settings import CONFLUENT_CLOUD_CONFIG, \
    GREEN_TAXI_TOPIC, FHV_TAXI_TOPIC, \
    GREEN_TRIP_DATA_PATH, FHV_TRIP_DATA_PATH


class RideCSVProducer:
    def __init__(self, probs: Dict, ride_type: str):

        self.producer = Producer(**probs)
        self.ride_type = ride_type

    def parse_row(self, row):
        if self.ride_type == 'green':
            record = f'{row[5]}, {row[6]}'  # PULocationID, DOLocationID
            key = str(row[0])  # vendor_id
        elif self.ride_type == 'fhv':
            record = f'{row[3]}, {row[4]}'  # PULocationID, DOLocationID,
            key = str(row[0])  # dispatching_base_num
        return key, record

    def read_records(self, resource_path: str):
        records, ride_keys = [], []
        with open(resource_path, 'r') as f:
            reader = csv.reader(f)
            header = next(reader)  # skip the header
            for row in reader:
                key, record = self.parse_row(row)
                ride_keys.append(key)
                records.append(record)
        return zip(ride_keys, records)

    def publish(self, records: [str, str], topic: str):
        for key_value in records:
            key, value = key_value
            try:
                self.producer.poll(0)
                self.producer.produce(topic=topic, key=key, value=value)
                print(f"Producing record for <key: {key}, value:{value}>")
            except KeyboardInterrupt:
                break
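            except BufferError:
                # Local producer queue is full: wait briefly so pending deliveries can drain
                # (note: the current record is not retried)
                self.producer.poll(0.1)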
            except Exception as e:
                print(f"Exception while producing record - {value}: {e}")

        self.producer.flush()
        sleep(10)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Kafka Producer')
    parser.add_argument('--type', type=str, default='green', choices=['green', 'fhv'])
    args = parser.parse_args()

    if args.type == 'green':
        kafka_topic = GREEN_TAXI_TOPIC
        data_path = GREEN_TRIP_DATA_PATH
    elif args.type == 'fhv':
        kafka_topic = FHV_TAXI_TOPIC
        data_path = FHV_TRIP_DATA_PATH

    producer = RideCSVProducer(ride_type=args.type, probs=CONFLUENT_CLOUD_CONFIG)
    ride_records = producer.read_records(resource_path=data_path)
    producer.publish(records=ride_records, topic=kafka_topic)


================================================
FILE: cohorts/2023/week_6_stream_processing/settings.py
================================================
import pyspark.sql.types as T

GREEN_TRIP_DATA_PATH = './resources/green_tripdata/green_tripdata_2019-01.csv'
FHV_TRIP_DATA_PATH = './resources/fhv_tripdata/fhv_tripdata_2019-01.csv'
BOOTSTRAP_SERVERS = 'localhost:9092'

RIDES_TOPIC = 'all_rides'
FHV_TAXI_TOPIC = 'fhv_taxi_rides'
GREEN_TAXI_TOPIC = 'green_taxi_rides'

ALL_RIDE_SCHEMA = T.StructType(
    [T.StructField("PUlocationID", T.StringType()),
     T.StructField("DOlocationID", T.StringType()),
     ])


def read_ccloud_config(config_file):
    conf = {}
    with open(config_file) as fh:
        for line in fh:
            line = line.strip()
            if len(line) != 0 and line[0] != "#":
                parameter, value = line.strip().split('=', 1)
                conf[parameter] = value.strip()
    return conf


CONFLUENT_CLOUD_CONFIG = read_ccloud_config('client.properties')


================================================
FILE: cohorts/2023/week_6_stream_processing/spark-submit.sh
================================================
# Submit Python code to SparkMaster

if [ $# -lt 1 ]
then
	echo "Usage: $0 <pyspark-job.py> [ executor-memory ]"
	echo "(specify memory in string format such as \"512M\" or \"2G\")"
	exit 1
fi
PYTHON_JOB=$1

if [ -z "$2" ]
then
	EXEC_MEM="1G"
else
	EXEC_MEM=$2
fi
spark-submit --master spark://localhost:7077 --num-executors 2 \
             --executor-memory "$EXEC_MEM" --executor-cores 1 \
             --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.3.1 \
             "$PYTHON_JOB"

================================================
FILE: cohorts/2023/week_6_stream_processing/streaming_confluent.py
================================================
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

from settings import CONFLUENT_CLOUD_CONFIG, GREEN_TAXI_TOPIC, FHV_TAXI_TOPIC, RIDES_TOPIC, ALL_RIDE_SCHEMA


def read_from_kafka(consume_topic: str):
    # Spark Streaming DataFrame, connect to Kafka topic served at host in bootstrap.servers option

    df_stream = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", CONFLUENT_CLOUD_CONFIG['bootstrap.servers']) \
        .option("subscribe", consume_topic) \
        .option("startingOffsets", "earliest") \
        .option("checkpointLocation", "checkpoint") \
        .option("kafka.security.protocol", "SASL_SSL") \
        .option("kafka.sasl.mechanism", "PLAIN") \
        .option("kafka.sasl.jaas.config",
                f"""org.apache.kafka.common.security.plain.PlainLoginModule required username="{CONFLUENT_CLOUD_CONFIG['sasl.username']}" password="{CONFLUENT_CLOUD_CONFIG['sasl.password']}";""") \
        .option("failOnDataLoss", False) \
        .load()

    return df_stream


def parse_rides(df, schema):
    """ take a Spark Streaming df and parse value col based on <schema>, return streaming df cols in schema """
    assert df.isStreaming is True, "DataFrame doesn't receive streaming data"

    df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    # split attributes to nested array in one Column
    col = F.split(df['value'], ', ')

    # expand col to multiple top-level columns
    for idx, field in enumerate(schema):
        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))

    df = df.na.drop()

    df.printSchema()

    return df.select([field.name for field in schema])


def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):
    query = df.writeStream \
        .outputMode(output_mode) \
        .trigger(processingTime=processing_time) \
        .format("console") \
        .option("truncate", False) \
        .start()
    # Block until the query stops; awaitTermination() returns None, so call it separately
    query.awaitTermination()
    return query  # pyspark.sql.streaming.StreamingQuery


def sink_kafka(df, topic, output_mode: str = 'complete'):
    query = df.writeStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", CONFLUENT_CLOUD_CONFIG['bootstrap.servers']) \
        .outputMode(output_mode) \
        .option("topic", topic) \
        .option("checkpointLocation", "checkpoint") \
        .option("kafka.security.protocol", "SASL_SSL") \
        .option("kafka.sasl.mechanism", "PLAIN") \
        .option("kafka.sasl.jaas.config",
                f"""org.apache.kafka.common.security.plain.PlainLoginModule required username="{CONFLUENT_CLOUD_CONFIG['sasl.username']}" password="{CONFLUENT_CLOUD_CONFIG['sasl.password']}";""") \
        .option("failOnDataLoss", False) \
        .start()
    return query


def op_groupby(df, column_names):
    df_aggregation = df.groupBy(column_names).count()
    return df_aggregation


if __name__ == "__main__":
    spark = SparkSession.builder.appName('streaming-homework').getOrCreate()
    spark.sparkContext.setLogLevel('WARN')

    # Step 1: Consume GREEN_TAXI_TOPIC and FHV_TAXI_TOPIC
    df_green_rides = read_from_kafka(consume_topic=GREEN_TAXI_TOPIC)
    df_fhv_rides = read_from_kafka(consume_topic=FHV_TAXI_TOPIC)

    # Step 2: Publish green and fhv rides to RIDES_TOPIC
    kafka_sink_green_query = sink_kafka(df=df_green_rides, topic=RIDES_TOPIC, output_mode='append')
    kafka_sink_fhv_query = sink_kafka(df=df_fhv_rides, topic=RIDES_TOPIC, output_mode='append')

    # Step 3: Read RIDES_TOPIC and parse it in ALL_RIDE_SCHEMA
    df_all_rides = read_from_kafka(consume_topic=RIDES_TOPIC)
    df_all_rides = parse_rides(df_all_rides, ALL_RIDE_SCHEMA)

    # Step 4: Apply Aggregation on the all_rides
    df_pu_location_count = op_groupby(df_all_rides, ['PULocationID'])
    df_pu_location_count = df_pu_location_count.sort(F.col('count').desc())

    # Step 5: Sink Aggregation Streams to Console
    console_sink_pu_location = sink_console(df_pu_location_count, output_mode='complete')


================================================
FILE: cohorts/2023/workshops/piperider.md
================================================

## Workshop: Maximizing Confidence in Your Data Model Changes with dbt and PipeRider

To learn how to use PipeRider together with dbt for detecting changes in model and data, sign up for a workshop

- Video: https://www.youtube.com/watch?v=O-tyUOQccSs
- Repository: https://github.com/InfuseAI/taxi_rides_ny_duckdb


## Homework

The following questions follow on from the original Week 4 homework, and so use the same data as required by those questions:

https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2023/week_4_analytics_engineering/homework.md

- Yellow taxi data - Years 2019 and 2020
- Green taxi data - Years 2019 and 2020
- fhv data - Year 2019

### Question 1:

What is the distribution of vendor id values when filtering the data by years 2019 and 2020?

You will need to run PipeRider and check the report

* 70.1/29.6/0.5
* 60.1/39.5/0.4
* 90.2/9.5/0.3
* 80.1/19.7/0.2

### Question 2:

What is the composition of total amount (positive/zero/negative) when filtering the data by years 2019 and 2020?

You will need to run PipeRider and check the report


* 51.4M/15K/48.6K
* 21.4M/5K/248.6K
* 61.4M/25K/148.6K
* 81.4M/35K/14.6K

### Question 3:

What are the numeric statistics (average/standard deviation/min/max/sum) of trip distances when filtering the data by years 2019 and 2020?

You will need to run PipeRider and check the report


* 1.95/35.43/0/16.3K/151.5M
* 3.95/25.43/23.88/267.3K/281.5M
* 5.95/75.43/-63.88/67.3K/81.5M
* 2.95/35.43/-23.88/167.3K/181.5M


## Submitting the solutions

* Form for submitting: https://forms.gle/WyLQHBu1DNwNTfqe8
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 20 March, 22:00 CET


## Solution

Video: https://www.youtube.com/watch?v=inNrUys7W8U&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW


================================================
FILE: cohorts/2024/01-docker-terraform/homework.md
================================================
## Module 1 Homework

ATTENTION: At the very end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or shell commands), please include these directly in the README file of your repository.

## Docker & SQL

In this homework we'll prepare the environment 
and practice with Docker and SQL


## Question 1. Knowing docker tags

Run the command to get information on Docker 

```docker --help```

Now run the command to get help on the "docker build" command:

```docker build --help```

Do the same for "docker run".

Which tag has the following text? - *Automatically remove the container when it exits* 

- `--delete`
- `--rc`
- `--rmc`
- `--rm`


## Question 2. Understanding docker first run 

Run docker with the python:3.9 image in an interactive mode and the entrypoint of bash.
Now check the python modules that are installed ( use ```pip list``` ). 

What is the version of the package *wheel*?

- 0.42.0
- 1.0.0
- 23.0.1
- 58.1.0


# Prepare Postgres

Run Postgres and load data as shown in the videos.
We'll use the green taxi trips from September 2019:

```wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-09.csv.gz```

You will also need the dataset with zones:

```wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv```

Download this data and put it into Postgres (with jupyter notebooks or with a pipeline)


## Question 3. Count records 

How many taxi trips were made in total on September 18th 2019?

Tip: started and finished on 2019-09-18. 

Remember that `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns are in the format timestamp (date and hour+min+sec) and not in date.

- 15767
- 15612
- 15859
- 89009

## Question 4. Longest trip for each day

Which was the pick up day with the longest trip distance?
Use the pick up time for your calculations.

Tip: For every trip on a single day, we only care about the trip with the longest distance. 

- 2019-09-18
- 2019-09-16
- 2019-09-26
- 2019-09-21


## Question 5. Three biggest pick up Boroughs

Consider only trips with lpep_pickup_datetime on '2019-09-18', and ignore trips where the Borough is Unknown.

Which were the 3 pick up Boroughs that had the maximum total_amount?
 
- "Brooklyn" "Manhattan" "Queens"
- "Bronx" "Brooklyn" "Manhattan"
- "Bronx" "Manhattan" "Queens" 
- "Brooklyn" "Queens" "Staten Island"


## Question 6. Largest tip

For the passengers picked up in September 2019 in the zone name Astoria which was the drop off zone that had the largest tip?
We want the name of the zone, not the id.

Note: it's not a typo, it's `tip` , not `trip`

- Central Park
- Jamaica
- JFK Airport
- Long Island City/Queens Plaza


## Terraform

In this section of the homework, we'll prepare the environment by creating resources in GCP with Terraform.

In your VM on GCP/Laptop/GitHub Codespace install Terraform. 
Copy the files from the course repo
[here](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/01-docker-terraform/1_terraform_gcp/terraform) to your VM/Laptop/GitHub Codespace.

Modify the files as necessary to create a GCP Bucket and Big Query Dataset.


## Question 7. Creating Resources

After updating the main.tf and variables.tf files, run:

```
terraform apply
```

Paste the output of this command into the homework submission form.


## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw01
* You can submit your homework multiple times. In this case, only the last submission will be used. 

Deadline: 29 January, 23:00 CET


================================================
FILE: cohorts/2024/01-docker-terraform/solutions.md
================================================
## Question 1. Knowing docker tags
```
❯ docker run --help | grep "Automatically remove"
--rm                               Automatically remove
```

- `|` pipe operator redirects the previous command output as an input to the command after the operator
- `docker run --help` -----> outputs `|` ---------> inputs to `grep "Automatically remove"`
- `grep` allows you to search through text
  
Answer: `--rm`


## Question 2. Understanding docker first run

- Run python:3.9 image with `docker run -it python:3.9 bash`
- Since you passed the `-it` flags, the container will be interactive
- Since the docker command ends with `bash`, the entrypoint into the container will be `bash`

```shell
root@root: docker run -it python:3.9 bash
root@b67c6949422a:/# pip list
Package    Version
---------- -------
pip        23.0.1
setuptools 58.1.0
wheel      0.45.1
```

Since it's been a while since the 2024 cohort, your wheel version might differ and might not be among the options provided.

Answer: For me it was `0.45.1`


## Question 3. Count records

- Trips that started and finished on 2019-09-18
- Convert the timestamps (date and hour+min+sec) to dates.

```sql
SELECT COUNT(*) FROM "csv_green_tripdata_2019_09"
WHERE DATE("lpep_pickup_datetime") = '2019-09-18' AND
      DATE("lpep_dropoff_datetime") = '2019-09-18';
```
```
+-------+
| count |
|-------|
| 15612 |
+-------+
```

Answer: `15612`
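
The same filter can be cross-checked outside the database. The sketch below is a plain-Python illustration of the logic only: the function name `count_same_day_trips` and the three sample rows are invented, and timestamps are assumed to look like `YYYY-MM-DD HH:MM:SS`.

```python
from datetime import date, datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def count_same_day_trips(trips, day: date) -> int:
    """Count trips whose pickup and dropoff both fall on the given calendar day."""
    return sum(
        1
        for pickup, dropoff in trips
        if datetime.strptime(pickup, TIMESTAMP_FORMAT).date() == day
        and datetime.strptime(dropoff, TIMESTAMP_FORMAT).date() == day
    )


# Invented sample rows: (lpep_pickup_datetime, lpep_dropoff_datetime)
sample_trips = [
    ("2019-09-18 08:00:00", "2019-09-18 08:20:00"),  # starts and ends on the 18th
    ("2019-09-18 23:50:00", "2019-09-19 00:10:00"),  # ends the next day
    ("2019-09-17 23:55:00", "2019-09-18 00:05:00"),  # starts the previous day
]
same_day = count_same_day_trips(sample_trips, date(2019, 9, 18))
```

Only the first sample row satisfies both conditions, which mirrors the `AND` in the SQL `WHERE` clause.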


## Question 4. Longest trip for each day
```sql
SELECT
    DATE("lpep_pickup_datetime") AS "pickup_date",
    MAX("trip_distance") AS "longest_trip"
FROM
    "csv_green_tripdata_2019_09"
GROUP BY
    DATE("lpep_pickup_datetime")
ORDER BY
    "longest_trip" DESC
LIMIT 1;
```
```
+-------------+--------------+
| pickup_date | longest_trip |
|-------------+--------------|
| 2019-09-26  | 341.64       |
+-------------+--------------+
```

Answer: `2019-09-26`


## Question 5. Three biggest pickup zones
```sql
SELECT
    "zone"."Zone",
    ROUND(SUM(("total_amount")::NUMERIC), 3) AS "total_amount"
FROM
    "csv_green_tripdata_2019_09"
INNER JOIN
    "zone" ON "csv_green_tripdata_2019_09"."PULocationID" = "zone"."LocationID"
WHERE
    DATE("lpep_pickup_datetime") = '2019-09-18'
GROUP BY
    "zone"."Zone"
ORDER BY
    "total_amount" DESC
LIMIT 3;
```
```
+---------------------+--------------+
| Zone                | total_amount |
|---------------------+--------------|
| East Harlem North   | 17893.060    |
| East Harlem South   | 17152.160    |
| Morningside Heights | 11259.680    |
+---------------------+--------------+
```

Answer: `East Harlem North, East Harlem South, Morningside Heights`


## Question 6. Largest tip
```sql
SELECT
    puz."Zone" AS pickup_zone,
    doz."Zone" AS dropoff_zone,
    g."tip_amount"
FROM
    "csv_green_tripdata_2019_09" g
INNER JOIN
    "zone" puz ON g."PULocationID" = puz."LocationID"
INNER JOIN
    "zone" doz ON g."DOLocationID" = doz."LocationID"
WHERE
    puz."Zone" = 'Astoria'
ORDER BY
    g."tip_amount" DESC
LIMIT 1;
```

```
+-------------+--------------+------------+
| pickup_zone | dropoff_zone | tip_amount |
|-------------+--------------+------------|
| Astoria     | JFK Airport  | 62.31      |
+-------------+--------------+------------+
```

Answer: `JFK Airport`


## Question 7. Terraform Workflow

> self-explanatory


================================================
FILE: cohorts/2024/02-workflow-orchestration/README.md
================================================
> [!NOTE]  
>If you're looking for Airflow videos from the 2022 edition, check the [2022 cohort folder](../cohorts/2022/week_2_data_ingestion/). 
>
>If you're looking for Prefect videos from the 2023 edition, check the [2023 cohort folder](../cohorts/2023/week_2_data_ingestion/).

# Week 2: Workflow Orchestration

Welcome to Week 2 of the Data Engineering Zoomcamp! 🚀😤 This week, we'll be covering workflow orchestration with Mage.

Mage is an open-source, hybrid framework for transforming and integrating data. ✨

This week, you'll learn how to use the Mage platform to author and share _magical_ data pipelines. This will all be covered in the course, but if you'd like to learn a bit more about Mage, check out our docs [here](https://docs.mage.ai/introduction/overview). 

* [2.2.1 - 📯 Intro to Orchestration](#221----intro-to-orchestration)
* [2.2.2 - 🧙‍♂️ Intro to Mage](#222---%EF%B8%8F-intro-to-mage)
* [2.2.3 - 🐘 ETL: API to Postgres](#223----etl-api-to-postgres)
* [2.2.4 - 🤓 ETL: API to GCS](#224----etl-api-to-gcs)
* [2.2.5 - 🔍 ETL: GCS to BigQuery](#225----etl-gcs-to-bigquery)
* [2.2.6 - 👨‍💻 Parameterized Execution](#226----parameterized-execution)
* [2.2.7 - 🤖 Deployment (Optional)](#227----deployment-optional)
* [2.2.8 - 🗒️ Homework](#228---️-homework)
* [2.2.9 - 👣 Next Steps](#229----next-steps)

## 📕 Course Resources

### 2.2.1 - 📯 Intro to Orchestration

In this section, we'll cover the basics of workflow orchestration. We'll discuss what it is, why it's important, and how it can be used to build data pipelines.

Videos
- 2.2.1a - What is Orchestration?

[![](https://markdown-videos-api.jorgenkh.no/youtube/Li8-MWHhTbo)](https://youtu.be/Li8-MWHhTbo&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=17)

Resources
- [Slides](https://docs.google.com/presentation/d/17zSxG5Z-tidmgY-9l7Al1cPmz4Slh4VPK6o2sryFYvw/)

### 2.2.2 - 🧙‍♂️ Intro to Mage

In this section, we'll introduce the Mage platform. We'll cover what makes Mage different from other orchestrators, the fundamental concepts behind Mage, and how to get started. To cap it off, we'll spin Mage up via Docker 🐳 and run a simple pipeline.

Videos
- 2.2.2a - What is Mage?

[![](https://markdown-videos-api.jorgenkh.no/youtube/AicKRcK3pa4)](https://youtu.be/AicKRcK3pa4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=18)

- 2.2.2b - Configuring Mage

[![](https://markdown-videos-api.jorgenkh.no/youtube/tNiV7Wp08XE)](https://youtu.be/tNiV7Wp08XE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=19)

- 2.2.2c - A Simple Pipeline

[![](https://markdown-videos-api.jorgenkh.no/youtube/stI-gg4QBnI)](https://youtu.be/stI-gg4QBnI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=20)

Resources
- [Getting Started Repo](https://github.com/mage-ai/mage-zoomcamp)
- [Slides](https://docs.google.com/presentation/d/1y_5p3sxr6Xh1RqE6N8o2280gUzAdiic2hPhYUUD6l88/)

### 2.2.3 - 🐘 ETL: API to Postgres

Hooray! Mage is up and running. Now, let's build a _real_ pipeline. In this section, we'll build a simple ETL pipeline that loads data from an API into a Postgres database. Our database will be built using Docker— it will be running locally, but it's the same as if it were running in the cloud.

Videos
- 2.2.3a - Configuring Postgres

[![](https://markdown-videos-api.jorgenkh.no/youtube/pmhI-ezd3BE)](https://youtu.be/pmhI-ezd3BE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=21)

- 2.2.3b - Writing an ETL Pipeline : API to postgres

[![](https://markdown-videos-api.jorgenkh.no/youtube/Maidfe7oKLs)](https://youtu.be/Maidfe7oKLs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=22)


### 2.2.4 - 🤓 ETL: API to GCS

Ok, so we've written data _locally_ to a database, but what about the cloud? In this tutorial, we'll walk through the process of using Mage to extract, transform, and load data from an API to Google Cloud Storage (GCS). 

We'll cover both writing _partitioned_ and _unpartitioned_ data to GCS and discuss _why_ you might want to do one over the other. Many data teams start with extracting data from a source and writing it to a data lake _before_ loading it to a structured data source, like a database.

Videos
- 2.2.4a - Configuring GCP

[![](https://markdown-videos-api.jorgenkh.no/youtube/00LP360iYvE)](https://youtu.be/00LP360iYvE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=23)

- 2.2.4b - Writing an ETL Pipeline : API to GCS

[![](https://markdown-videos-api.jorgenkh.no/youtube/w0XmcASRUnc)](https://youtu.be/w0XmcASRUnc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=24)

Resources
- [DTC Zoomcamp GCP Setup](../01-docker-terraform/1_terraform_gcp/2_gcp_overview.md)

### 2.2.5 - 🔍 ETL: GCS to BigQuery

Now that we've written data to GCS, let's load it into BigQuery. In this section, we'll walk through the process of using Mage to load our data from GCS to BigQuery. This closely mirrors a very common data engineering workflow: loading data from a data lake into a data warehouse.

Videos
- 2.2.5a - Writing an ETL Pipeline : GCS to BigQuery

[![](https://markdown-videos-api.jorgenkh.no/youtube/JKp_uzM-XsM)](https://youtu.be/JKp_uzM-XsM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=25)

### 2.2.6 - 👨‍💻 Parameterized Execution

By now you're familiar with building pipelines, but what about adding parameters? In this video, we'll discuss some built-in runtime variables that exist in Mage and show you how to define your own! We'll also cover how to use these variables to parameterize your pipelines. Finally, we'll talk about what it means to *backfill* a pipeline and how to do it in Mage.

Videos
- 2.2.6a - Parameterized Execution

[![](https://markdown-videos-api.jorgenkh.no/youtube/H0hWjWxB-rg)](https://youtu.be/H0hWjWxB-rg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=26)


- 2.2.6b - Backfills

[![](https://markdown-videos-api.jorgenkh.no/youtube/ZoeC6Ag5gQc)](https://youtu.be/ZoeC6Ag5gQc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=27)

Resources
- [Mage Variables Overview](https://docs.mage.ai/development/variables/overview)
- [Mage Runtime Variables](https://docs.mage.ai/getting-started/runtime-variable)

### 2.2.7 - 🤖 Deployment (Optional)

In this section, we'll cover deploying Mage using Terraform and Google Cloud. This section is optional— it's not *necessary* to learn Mage, but it might be helpful if you're interested in creating a fully deployed project. If you're using Mage in your final project, you'll need to deploy it to the cloud.

Videos
- 2.2.7a - Deployment Prerequisites

[![](https://markdown-videos-api.jorgenkh.no/youtube/zAwAX5sxqsg)](https://youtu.be/zAwAX5sxqsg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=28)

- 2.2.7b - Google Cloud Permissions

[![](https://markdown-videos-api.jorgenkh.no/youtube/O_H7DCmq2rA)](https://youtu.be/O_H7DCmq2rA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=29)

- 2.2.7c - Deploying to Google Cloud - Part 1

[![](https://markdown-videos-api.jorgenkh.no/youtube/9A872B5hb_0)](https://youtu.be/9A872B5hb_0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=30)

- 2.2.7d - Deploying to Google Cloud - Part 2

[![](https://markdown-videos-api.jorgenkh.no/youtube/0YExsb2HgLI)](https://youtu.be/0YExsb2HgLI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=31)

Resources
- [Installing Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli)
- [Installing `gcloud` CLI](https://cloud.google.com/sdk/docs/install)
- [Mage Terraform Templates](https://github.com/mage-ai/mage-ai-terraform-templates)

Additional Mage Guides
- [Terraform](https://docs.mage.ai/production/deploying-to-cloud/using-terraform)
- [Deploying to GCP with Terraform](https://docs.mage.ai/production/deploying-to-cloud/gcp/setup)

### 2.2.8 - 🗒️ Homework 

We've prepared a short exercise to test you on what you've learned this week. You can find the homework [here](../cohorts/2024/02-workflow-orchestration/homework.md). This follows closely from the contents of the course and shouldn't take more than an hour or two to complete. 😄

### 2.2.9 - 👣 Next Steps

Congratulations! You've completed Week 2 of the Data Engineering Zoomcamp. We hope you've enjoyed learning about Mage and that you're excited to use it in your final project. If you have any questions, feel free to reach out to us on Slack. Be sure to check out our "Next Steps" video for some inspiration for the rest of your journey 😄.

Videos
- 2.2.9 - Next Steps

[![](https://markdown-videos-api.jorgenkh.no/youtube/uUtj7N0TleQ)](https://youtu.be/uUtj7N0TleQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=32)

Resources
- [Slides](https://docs.google.com/presentation/d/1yN-e22VNwezmPfKrZkgXQVrX5owDb285I2HxHWgmAEQ/edit#slide=id.g262fb0d2905_0_12)

### 📑 Additional Resources

- [Mage Docs](https://docs.mage.ai/)
- [Mage Guides](https://docs.mage.ai/guides)
- [Mage Slack](https://www.mage.ai/chat)


# Community notes

Did you take notes? You can share them here:

## 2024 notes

* [2024 Videos transcripts week 2](https://drive.google.com/drive/folders/1yxT0uMMYKa6YOxanh91wGqmQUMS7yYW7?usp=sharing) by Maria Fisher
* [Notes from Jonah Oliver](https://www.jonahboliver.com/blog/de-zc-w2)
* [Notes from Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/2-workflow-orchestration/readme.md)
* [Notes from Kirill](https://github.com/kirill505/data-engineering-zoomcamp/blob/main/02-workflow-orchestration/README.md)
* [Notes from Zharko](https://www.zharconsulting.com/contents/data/data-engineering-bootcamp-2024/week-2-ingesting-data-with-mage/)
* Add your notes above this line

## 2023 notes

See [here](../cohorts/2023/week_2_workflow_orchestration#community-notes)


## 2022 notes

See [here](../cohorts/2022/week_2_data_ingestion#community-notes)


================================================
FILE: cohorts/2024/02-workflow-orchestration/homework.md
================================================
## Module 2 Homework

ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include these directly in the README file of your repository.

> In case you don't get one option exactly, select the closest one 

For the homework, we'll be working with the _green_ taxi dataset located here:

`https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download`

To get a `wget`-able link, use this prefix (note that the link itself gives 404):

`https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/`
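
As a rough sketch, the download links for the final quarter of 2020 can be built from this prefix. The file name pattern `green_tripdata_<year>-<month>.csv.gz` is an assumption based on how the files in this release are named elsewhere in the course, and `green_taxi_urls` is a hypothetical helper, not part of the course code.

```python
# Prefix taken from the homework text above
BASE_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/"


def green_taxi_urls(year, months):
    """Build one download URL per month (file name pattern is an assumption)."""
    return [
        f"{BASE_URL}green_tripdata_{year}-{month:02d}.csv.gz"
        for month in months
    ]


# Final quarter of 2020, as required by the assignment
urls = green_taxi_urls(2020, [10, 11, 12])
for url in urls:
    print(url)
```

Each of these URLs could then be passed to `pd.read_csv` inside the data loader block.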

### Assignment

The goal will be to construct an ETL pipeline that loads the data, performs some transformations, and writes the data to a database (and Google Cloud!).

- Create a new pipeline, call it `green_taxi_etl`
- Add a data loader block and use Pandas to read data for the final quarter of 2020 (months `10`, `11`, `12`).
  - You can use the same datatypes and date parsing methods shown in the course.
  - `BONUS`: load the final three months using a for loop and `pd.concat`
- Add a transformer block and perform the following:
  - Remove rows where the passenger count is equal to 0 _or_ the trip distance is equal to zero.
  - Create a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date.
  - Rename columns in Camel Case to Snake Case, e.g. `VendorID` to `vendor_id`.
  - Add three assertions:
    - `vendor_id` is one of the existing values in the column (currently)
    - `passenger_count` is greater than 0
    - `trip_distance` is greater than 0
- Using a Postgres data exporter (SQL or Python), write the dataset to a table called `green_taxi` in a schema `mage`. Replace the table if it already exists.
- Write your data as Parquet files to a bucket in GCP, partitioned by `lpep_pickup_date`. Use the `pyarrow` library!
- Schedule your pipeline to run daily at 5AM UTC.
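
The Camel Case to Snake Case step in the transformer can be sketched with a small regular expression. This is one possible approach, not the course's reference solution; `camel_to_snake` is a hypothetical helper name, and the column list is only a sample.

```python
import re


def camel_to_snake(name):
    """Convert a Camel Case column name to Snake Case.

    An underscore is inserted wherever a lowercase letter or digit is
    directly followed by an uppercase letter, then everything is lowercased.
    """
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


# Sample column names; names already in Snake Case are left unchanged
columns = ["VendorID", "lpep_pickup_datetime", "passenger_count"]
renamed = [camel_to_snake(c) for c in columns]
print(renamed)
```

In a transformer block the same helper could be applied with `data.columns = [camel_to_snake(c) for c in data.columns]`, assuming `data` is the Pandas DataFrame passed to the block. Names that start with a run of capitals are only split where a lowercase letter or digit precedes a capital, so check the result against the naming you expect.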

### Questions

## Question 1. Data Loading

Once the dataset is loaded, what's the shape of the data?

* 266,855 rows x 20 columns
* 544,898 rows x 18 columns
* 544,898 rows x 20 columns
* 133,744 rows x 20 columns

## Question 2. Data Transformation

Upon filtering the dataset where the passenger count is greater than 0 _and_ the trip distance is greater than zero, how many rows are left?

* 544,897 rows
* 266,855 rows
* 139,370 rows
* 266,856 rows

## Question 3. Data Transformation

Which of the following creates a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date?

* `data = data['lpep_pickup_datetime'].date`
* `data('lpep_pickup_date') = data['lpep_pickup_datetime'].date`
* `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt.date`
* `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt().date()`

## Question 4. Data Transformation

What are the existing values of `VendorID` in the dataset?

* 1, 2, or 3
* 1 or 2
* 1, 2, 3, 4
* 1

## Question 5. Data Transformation

How many columns need to be renamed to snake case?

* 3
* 6
* 2
* 4

## Question 6. Data Exporting

Once exported, how many partitions (folders) are present in Google Cloud?

* 96
* 56
* 67
* 108

## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw2
* Check the link above to see the due date
  
## Solution

Will be added after the due date


================================================
FILE: cohorts/2024/03-data-warehouse/homework.md
================================================
## Module 3 Homework

Solution: https://www.youtube.com/watch?v=8g_lRKaC9ro

ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or shell commands), please include these directly in the README file of your repository.

<b><u>Important Note:</u></b> <p> For this homework we will be using the 2022 Green Taxi Trip Record Parquet Files from the New York
City Taxi Data found here: </br> https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page </br>
If you are using orchestration such as Mage, Airflow or Prefect, do not load the data into BigQuery using the orchestrator.</br> 
Stop with loading the files into a bucket. </br></br>
<u>NOTE:</u> You will need to use the PARQUET option when creating an External Table.</br>

<b>SETUP:</b></br>
Create an external table using the Green Taxi Trip Records Data for 2022. </br>
Create a table in BQ using the Green Taxi Trip Records for 2022 (do not partition or cluster this table). </br>
</p>

## Question 1:
What is the count of records for the 2022 Green Taxi Data?
- 65,623,481
- 840,402
- 1,936,423
- 253,647

## Question 2:
Write a query to count the distinct number of PULocationIDs for the entire dataset on both the tables.</br> 
What is the estimated amount of data that will be read when this query is executed on the External Table and the Table?

- 0 MB for the External Table and 6.41MB for the Materialized Table
- 18.82 MB for the External Table and 47.60 MB for the Materialized Table
- 0 MB for the External Table and 0MB for the Materialized Table
- 2.14 MB for the External Table and 0MB for the Materialized Table


## Question 3:
How many records have a fare_amount of 0?
- 12,488
- 128,219
- 112
- 1,622

## Question 4:
What is the best strategy to make an optimized table in BigQuery if your query will always order the results by PULocationID and filter based on lpep_pickup_datetime? (Create a new table with this strategy)
- Cluster on lpep_pickup_datetime Partition by PULocationID
- Partition by lpep_pickup_datetime Cluster on PULocationID
- Partition by lpep_pickup_datetime and Partition by PULocationID
- Cluster on lpep_pickup_datetime and Cluster on PULocationID

## Question 5:
Write a query to retrieve the distinct PULocationIDs with lpep_pickup_datetime between
06/01/2022 and 06/30/2022 (inclusive).</br>

Use the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 4 and note the estimated bytes processed. What are these values? </br>

Choose the answer which most closely matches.</br> 

- 22.82 MB for non-partitioned table and 647.87 MB for the partitioned table
- 12.82 MB for non-partitioned table and 1.12 MB for the partitioned table
- 5.63 MB for non-partitioned table and 0 MB for the partitioned table
- 10.31 MB for non-partitioned table and 10.31 MB for the partitioned table


## Question 6: 
Where is the data stored in the External Table you created?

- Big Query
- GCP Bucket
- Big Table
- Container Registry


## Question 7:
It is best practice in BigQuery to always cluster your data:
- True
- False


## (Bonus: Not worth points) Question 8:
No Points: Write a `SELECT count(*)` query FROM the materialized table you created. How many bytes does it estimate will be read? Why?

 
## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw3


================================================
FILE: cohorts/2024/04-analytics-engineering/homework.md
================================================
## Module 4 Homework 

In this homework, we'll use the models developed during the week 4 videos and enhance the dbt project presented there, using the FHV (for-hire vehicle) data for 2019 that is already loaded in our DWH.

This means that in this homework we use the following data [Datasets list](https://github.com/DataTalksClub/nyc-tlc-data/)
* Yellow taxi data - Years 2019 and 2020
* Green taxi data - Years 2019 and 2020 
* fhv data - Year 2019. 

We will use the data loaded for:

* Building a source table: `stg_fhv_tripdata`
* Building a fact table: `fact_fhv_trips`
* Create a dashboard 

If you don't have access to GCP, you can do this locally using the ingested data from your Postgres database
instead. If you have access to GCP, you don't need to do it for local Postgres - only if you want to.

> **Note**: if your answer doesn't match exactly, select the closest option 

### Question 1: 

**What happens when we execute dbt build --vars '{'is_test_run':'true'}'**
You'll need to have completed the ["Build the first dbt models"](https://www.youtube.com/watch?v=UVI30Vxzd6c) video. 
- It's the same as running *dbt build*
- It applies a _limit 100_ to all of our models
- It applies a _limit 100_ only to our staging models
- Nothing

### Question 2: 

**What is the code that our CI job will run? Where is this code coming from?**  

- The code that has been merged into the main branch
- The code that is behind the creation object on the dbt_cloud_pr_ schema
- The code from any development branch that has been opened based on main
- The code from the development branch we are requesting to merge to main


### Question 3 (2 points)

**What is the count of records in the model fact_fhv_trips after running all dependencies with the test run variable disabled (:false)?**  
Create a staging model for the fhv data, similar to the ones made for yellow and green data. Add an additional filter for keeping only records with pickup time in year 2019.
Do not add a deduplication step. Run this model without limits (is_test_run: false).

Create a core model similar to fact trips, but selecting from stg_fhv_tripdata and joining with dim_zones.
Similar to what we've done in fact_trips, keep only records with known pickup and dropoff locations. 
Run the dbt model without limits (is_test_run: false).

- 12998722
- 22998722
- 32998722
- 42998722

### Question 4 (2 points)

**Which service had the most rides in July 2019, after building a tile for the fact_fhv_trips table and the fact_trips tile as seen in the videos?**

Create a dashboard with some tiles that you find interesting to explore the data. One tile should show the amount of trips per month, as done in the videos for fact_trips, including the fact_fhv_trips data.

- FHV
- Green
- Yellow
- FHV and Green


## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw4

Deadline: 22 February (Thursday), 22:00 CET


## Solution (To be published after deadline)

* Video: https://youtu.be/3OPggh5Rca8
* Answers:
  * Question 1: It applies a _limit 100_ only to our staging models
  * Question 2: The code from the development branch we are requesting to merge to main
  * Question 3: 22998722
  * Question 4: Yellow


================================================
FILE: cohorts/2024/05-batch/homework.md
================================================
## Module 5 Homework 

Solution: https://www.youtube.com/watch?v=YtddC7vJOgQ

In this homework we'll put what we learned about Spark in practice.

For this homework we will be using the FHV 2019-10 data found here. [FHV Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz)

### Question 1: 

**Install Spark and PySpark** 

- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.

What's the output?

> [!NOTE]
> To install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/setup/pyspark.md)

### Question 2: 

**FHV October 2019**

Read the October 2019 FHV into a Spark Dataframe with a schema as we did in the lessons.

Repartition the Dataframe to 6 partitions and save it to parquet.

What is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? Select the answer which most closely matches.

- 1MB
- 6MB
- 25MB
- 87MB
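
To compare against the options above, the average file size can be measured with a few lines of plain Python once Spark has written its output. This is a minimal sketch: `average_parquet_size_mb` is a hypothetical helper, the output directory is whatever path you passed to Spark, and 1 MB is taken as 1024 * 1024 bytes here (an assumption; other tools may use 1,000,000).

```python
from pathlib import Path


def average_parquet_size_mb(directory):
    """Average size, in MB, of the files ending in .parquet in a directory.

    Other files Spark writes next to them (for example checksum or marker
    files) do not end in .parquet and are therefore ignored.
    """
    sizes = [p.stat().st_size for p in Path(directory).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {directory}")
    # Assumption: 1 MB = 1024 * 1024 bytes
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

For example, `average_parquet_size_mb("data/pq/fhv/2019/10")` would report the average for that folder, where the path is only a placeholder for your own output location.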


### Question 3: 

**Count records** 

How many taxi trips were there on the 15th of October?

Consider only trips that started on the 15th of October.

- 108,164
- 12,856
- 452,470
- 62,610

> [!IMPORTANT]
> Be aware of columns order when defining schema

### Question 4: 

**Longest trip for each day** 

What is the length of the longest trip in the dataset in hours?

- 631,152.50 Hours
- 243.44 Hours
- 7.68 Hours
- 3.32 Hours
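
The arithmetic behind a trip length in hours is just the difference between drop-off and pickup time divided by 3600 seconds. The sketch below shows it in plain Python on made-up timestamps. The timestamp format and the sample trips are assumptions for illustration, and in Spark you would express the same calculation with column functions instead.

```python
from datetime import datetime

# Assumed timestamp format, e.g. "2019-10-01 00:26:02"
FMT = "%Y-%m-%d %H:%M:%S"


def trip_hours(pickup, dropoff, fmt=FMT):
    """Return the trip duration in hours for two timestamp strings."""
    delta = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    return delta.total_seconds() / 3600


# Made-up trips, only to show the calculation
trips = [
    ("2019-10-01 00:00:00", "2019-10-01 07:30:00"),
    ("2019-10-02 23:00:00", "2019-10-03 01:00:00"),
]
longest = max(trip_hours(pickup, dropoff) for pickup, dropoff in trips)
print(f"{longest:.2f} hours")
```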


### Question 5: 

**User Interface**

Spark’s User Interface which shows the application's dashboard runs on which local port?

- 80
- 443
- 4040
- 8080


### Question 6: 

**Least frequent pickup location zone**

Load the zone lookup data into a temp view in Spark</br>
[Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv)

Using the zone lookup data and the FHV October 2019 data, what is the name of the LEAST frequent pickup location Zone?</br>

- East Chelsea
- Jamaica Bay
- Union Sq
- Crown Heights North


## Submitting the solutions

- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw5
- Deadline: See the website


================================================
FILE: cohorts/2024/06-streaming/docker-compose.yml
================================================
version: '3.7'
services:
  # Redpanda cluster
  redpanda-1:
    image: docker.redpanda.com/vectorized/redpanda:v22.3.5
    container_name: redpanda-1
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda-1:33145
    ports:
      # - 8081:8081
      - 8082:8082
      - 9092:9092
      - 28082:28082
      - 29092:29092

================================================
FILE: cohorts/2024/06-streaming/homework.md
================================================
## Module 6 Homework 

In this homework, we're going to extend Module 5 Homework and learn about streaming with PySpark.

Instead of Kafka, we will use Redpanda, which is a drop-in
replacement for Kafka. 

Ensure you have the following set up (you already do if you completed the previous homework and the module):

- Docker (see [module 1](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/01-docker-terraform))
- PySpark (see [module 5](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/05-batch/setup))

For this homework we will be using the files from Module 5 homework:

- Green 2019-10 data from [here](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz)


## Start Redpanda

Let's start Redpanda in a Docker container. 

There's a `docker-compose.yml` file in the homework folder (taken from [here](https://github.com/redpanda-data-blog/2023-python-gsg/blob/main/docker-compose.yml))

Copy this file to your homework directory and run

```bash
docker-compose up
```

(Add `-d` if you want to run in detached mode)


## Question 1: Redpanda version

Now let's find out the version of Redpanda. 

For that, check the output of the command `rpk help` _inside the container_. The name of the container is `redpanda-1`.

Find out what you need to execute based on the `help` output.

What's the version, based on the output of the command you executed? (copy the entire version)


## Question 2. Creating a topic

Before we can send data to the Redpanda server, we
need to create a topic. We can do that with the same `rpk`
command we used previously to find the Redpanda
version.

Read the output of `help` and, based on it, create a topic named `test-topic`.

What's the output of the command for creating a topic? Include the entire output in your answer.


## Question 3. Connecting to the Kafka server

We need to make sure we can connect to the server, so
that later we can send some data to its topics.

First, let's install the Kafka connector (it's up to you whether
to use a separate virtual environment for that):

```bash
pip install kafka-python
```

You can start a Jupyter notebook in your solution folder or
create a script.

Let's try to connect to our server:

```python
import json
import time 

from kafka import KafkaProducer

def json_serializer(data):
    return json.dumps(data).encode('utf-8')

server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=json_serializer
)

producer.bootstrap_connected()
```

Provided that you can connect to the server, what's the output
of the last command?


## Question 4. Sending data to the stream

Now we're ready to send some test data:

```python
t0 = time.time()

topic_name = 'test-topic'

for i in range(10):
    message = {'number': i}
    producer.send(topic_name, value=message)
    print(f"Sent: {message}")
    time.sleep(0.05)

producer.flush()

t1 = time.time()
print(f'took {(t1 - t0):.2f} seconds')
```

How much time did it take? Where did it spend most of the time?

* Sending the messages
* Flushing
* Both took approximately the same amount of time

(Don't remove `time.sleep` when answering this question)


## Reading data with `rpk`

You can see the messages that you send to the topic
with `rpk`:

```bash
rpk topic consume test-topic
```

Run the command above and send the messages one more time to 
see them


## Sending the taxi data

Now let's send our actual data:

* Read the green csv.gz file
* We will only need these columns:
  * `'lpep_pickup_datetime',`
  * `'lpep_dropoff_datetime',`
  * `'PULocationID',`
  * `'DOLocationID',`
  * `'passenger_count',`
  * `'trip_distance',`
  * `'tip_amount'`

Iterate over the records in the dataframe

```python
for row in df_green.itertuples(index=False):
    row_dict = {col: getattr(row, col) for col in row._fields}
    print(row_dict)
    break

    # TODO implement sending the data here
```

Note: this way of iterating over the records is more efficient compared
to `iterrows`


## Question 5: Sending the Trip Data

* Create a topic `green-trips` and send the data there
* How much time in seconds did it take? (You can round it to a whole number)
* Make sure you don't include sleeps in your code
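
One way to structure the sending loop is to keep it independent of the concrete producer, so it can be tried without a running broker. The sketch below is an illustration under assumptions, not the official solution: `send_rows` and `RecordingProducer` are hypothetical names, and the only things expected from the producer are the `send(topic, value=...)` and `flush()` methods used in the earlier snippets.

```python
import time
from collections import namedtuple


def send_rows(producer, topic, rows):
    """Send every row as a dict to the topic; return (count, seconds elapsed)."""
    t0 = time.time()
    count = 0
    for row in rows:
        # Same conversion as in the homework snippet: namedtuple -> dict
        message = {col: getattr(row, col) for col in row._fields}
        producer.send(topic, value=message)
        count += 1
    # Make sure buffered messages are delivered before stopping the clock
    producer.flush()
    return count, time.time() - t0


class RecordingProducer:
    """Hypothetical stand-in for KafkaProducer, used to try the loop offline."""

    def __init__(self):
        self.sent = []
        self.flushed = False

    def send(self, topic, value):
        self.sent.append((topic, value))

    def flush(self):
        self.flushed = True


# Two made-up rows shaped like the output of df_green.itertuples(index=False)
Row = namedtuple("Row", ["PULocationID", "DOLocationID"])
fake = RecordingProducer()
sent, elapsed = send_rows(fake, "green-trips", [Row(112, 196), Row(43, 74)])
print(f"sent {sent} messages in {elapsed:.2f} seconds")
```

With a real broker, you would pass the `KafkaProducer` created earlier and `df_green.itertuples(index=False)` instead of the stand-ins.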


## Creating the PySpark consumer

Now let's read the data with PySpark. 

Spark needs a library (jar) to be able to connect to Kafka, 
so we need to tell PySpark that it needs to use it:

```python
import pyspark
from pyspark.sql import SparkSession

pyspark_version = pyspark.__version__
kafka_jar_package = f"org.apache.spark:spark-sql-kafka-0-10_2.12:{pyspark_version}"

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("GreenTripsConsumer") \
    .config("spark.jars.packages", kafka_jar_package) \
    .getOrCreate()
```

Now we can connect to the stream:

```python
green_stream = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "green-trips") \
    .option("startingOffsets", "earliest") \
    .load()
```

In order to test that we can consume from the stream, 
let's see what will be the first record there. 

In Spark streaming, the stream is represented as a sequence of 
small batches, each batch being a small RDD (or a small dataframe).

So we can execute a function over each mini-batch.
Let's run `take(1)` there to see what we have in the stream:

```python
def peek(mini_batch, batch_id):
    first_row = mini_batch.take(1)

    if first_row:
        print(first_row[0])

query = green_stream.writeStream.foreachBatch(peek).start()
```

You should see a record like this:

```
Row(key=None, value=bytearray(b'{"lpep_pickup_datetime": "2019-10-01 00:26:02", "lpep_dropoff_datetime": "2019-10-01 00:39:58", "PULocationID": 112, "DOLocationID": 196, "passenger_count": 1.0, "trip_distance": 5.88, "tip_amount": 0.0}'), topic='green-trips', partition=0, offset=0, timestamp=datetime.datetime(2024, 3, 12, 22, 42, 9, 411000), timestampType=0)
```

Now let's stop the query, so it doesn't keep consuming messages
from the stream

```python
query.stop()
```

## Question 6. Parsing the data

The data is JSON, but currently it's in binary format. We need
to parse it and turn it into a streaming dataframe with proper
columns.
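
To see what "JSON in binary format" means outside of Spark, the `value` field of the sample record shown earlier can be decoded with plain Python. This is only an illustration of the idea and `parse_value` is a hypothetical helper. In the streaming job itself, the cast and `from_json` calls that follow do the same work on every row.

```python
import json

# The value field of the sample record shown earlier, as raw bytes
raw_value = bytearray(
    b'{"lpep_pickup_datetime": "2019-10-01 00:26:02", '
    b'"lpep_dropoff_datetime": "2019-10-01 00:39:58", '
    b'"PULocationID": 112, "DOLocationID": 196, '
    b'"passenger_count": 1.0, "trip_distance": 5.88, "tip_amount": 0.0}'
)


def parse_value(value):
    """Decode the bytes as UTF-8 text, then parse the text as JSON."""
    return json.loads(bytes(value).decode("utf-8"))


record = parse_value(raw_value)
print(record["DOLocationID"], record["trip_distance"])
```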

As in batch PySpark, we define the schema:

```python
from pyspark.sql import types

schema = types.StructType() \
    .add("lpep_pickup_datetime", types.StringType()) \
    .add("lpep_dropoff_datetime", types.StringType()) \
    .add("PULocationID", types.IntegerType()) \
    .add("DOLocationID", types.IntegerType()) \
    .add("passenger_count", types.DoubleType()) \
    .add("trip_distance", types.DoubleType()) \
    .add("tip_amount", types.DoubleType())
```

And apply this schema:

```python
from pyspark.sql import functions as F

green_stream = green_stream \
  .select(F.from_json(F.col("value").cast('STRING'), schema).alias("data")) \
  .select("data.*")
```

How does the record look after parsing? Copy the output. 


### Question 7: Most popular destination

Now let's finally do some streaming analytics. We will
find the currently most popular destination
based on our stream of data (which ideally we should 
have sent with delays like we did in workshop 2).


This is how you can do it:

* Add a column "timestamp" using the `current_timestamp` function
* Group by:
  * 5 minutes window based on the timestamp column (`F.window(col("timestamp"), "5 minutes")`)
  * `"DOLocationID"`
* Order by count
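
The grouping logic in the steps above can be illustrated without Spark. The sketch below is plain Python, not PySpark code: it assigns each event to a 5-minute tumbling window aligned to the Unix epoch, counts events per (window, destination) pair, and orders by count. The function names and the sample events are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)


def window_start(ts, size=WINDOW):
    """Floor a timestamp to the start of its tumbling window."""
    return EPOCH + ((ts - EPOCH) // size) * size


def popular_destinations(events):
    """Count events per (window start, destination), most frequent first."""
    counts = Counter((window_start(ts), dest) for ts, dest in events)
    return counts.most_common()


# Made-up (timestamp, DOLocationID) pairs
events = [
    (datetime(2024, 3, 12, 12, 1), 74),
    (datetime(2024, 3, 12, 12, 3), 74),
    (datetime(2024, 3, 12, 12, 4), 42),
    (datetime(2024, 3, 12, 12, 6), 42),
]
for (start, dest), n in popular_destinations(events):
    print(start, dest, n)
```

In the streaming job, `F.window`, `groupBy` and `count` play the roles of `window_start` and `Counter`.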

You can print the output to the console using this 
code

```python
query = popular_destinations \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", "false") \
    .start()

query.awaitTermination()
```

Write the most popular destination. Your answer should be *either* the zone ID or the zone name of this destination. (You will need to re-send the data for this to work.)


## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw6


## Solution

We will publish the solution here after deadline.


================================================
FILE: cohorts/2024/README.md
================================================
## Data Engineering Zoomcamp 2024 Cohort

* [Pre-launch Q&A stream](https://www.youtube.com/watch?v=91b8u9GmqB4)
* [Launch stream with course overview](https://www.youtube.com/live/AtRhA-NfS24?si=5JzA_E8BmJjiLi8l)
* [Deadline calendar](https://docs.google.com/spreadsheets/d/e/2PACX-1vQACMLuutV5rvXg5qICuJGL-yZqIV0FBD84CxPdC5eZHf8TfzB-CJT_3Mo7U7oGVTXmSihPgQxuuoku/pubhtml)
* [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html)
* Course Playlist: Only 2024 Live videos & homeworks (TODO)
* [Public Leaderboard of Top-100 Participants](leaderboard.md)


[**Module 1: Introduction & Prerequisites**](01-docker-terraform/)

* [Homework](01-docker-terraform/homework.md)


[**Module 2: Workflow Orchestration**](02-workflow-orchestration)

* [Homework](02-workflow-orchestration/homework.md)
* Office hours

[**Workshop 1: Data Ingestion**](workshops/dlt.md)

* Workshop with dlt
* [Homework](workshops/dlt.md)


[**Module 3: Data Warehouse**](03-data-warehouse)

* [Homework](03-data-warehouse/homework.md)


[**Module 4: Analytics Engineering**](04-analytics-engineering/)

* [Homework](04-analytics-engineering/homework.md)


[**Module 5: Batch processing**](05-batch/)

* [Homework](05-batch/homework.md)


[**Module 6: Stream Processing**](06-streaming)

* [Homework](06-streaming/homework.md)


[**Project**](project.md)

More information [here](project.md)


================================================
FILE: cohorts/2024/leaderboard.md
================================================
## Leaderboard 

This is the [top-100 leaderboard](https://courses.datatalks.club/de-zoomcamp-2024/leaderboard)
of participants in the 2024 edition of the Data Engineering Zoomcamp!

<table>
<tr>
  <th>Name</th>
  <th>Projects</th>
  <th>Social</th>
  <th>Comments</th>
</tr>
<tr>
  <td>Ashraf Mohammad</td>
  <td><a href="https://github.com/Ashraf1395/customer_retention_analytics"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a><a href="https://github.com/Ashraf1395/supply_chain_finance.git"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/ashraf1395"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://www.github.com/Ashraf1395"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Really Recommend this bootcamp , if you want to get hands on data engineering experience.     My two Capstone project: www.github.com/Ashraf1395/supply_chain_finance, www.github.com/Ashraf1395/customer_retention_analytics
</details></td>
</tr>
<tr>
  <td>Jorge Vladimir Abrego Arevalo</td>
  <td><a href="https://github.com/JorgeAbrego/weather_stream_project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a><a href="https://github.com/JorgeAbrego/capital_bikeshare_project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/jorge-abrego/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/JorgeAbrego"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Purnendu Shekhar Shukla</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Krishna Anand</td>
  <td><a href="https://github.com/anandaiml19/DE_Zoomcamp_Project2/tree/main"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a><a href="https://github.com/anandaiml19/Data-Engineering-Zoomcamp-Project1"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/krishna-anand-v-g-70bba623/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/anandaiml19"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Abhijit Chakraborty</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Hekmatullah Sajid</td>
  <td><a href="https://github.com/hekmatullah-sajid/EcoEnergy-Germany"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/hekmatullah-sajid/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/hekmatullah-sajid"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Lottie Jane Pollard</td>
  <td><a href="https://github.com/LottieJaneDev/usgs_earthquake_data_pipeline"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/lottiejanedev/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/LottieJaneDev/usgs_earthquake_data_pipeline"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>AviAnna</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Ketut Garjita</td>
  <td><a href="https://github.com/garjita63/dezoomcamp2024-project1"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/ketutgarjitadba/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/garjita63"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
I would like to express my thanks and appreciation to the Data Talks Club for organizing this excellent Data Engineering Zoomcamp training. This made me valuable experience in deepening new knowledge for me even though previously I had mostly worked as a Database Administrator for various platform databases. Thank you also to the community (datatalks-club.slack.com), especially slack course-data-engineering, as well as other slack communities such as mageai.slack.com.
</details></td>
</tr>
<tr>
  <td>Diogo Costa</td>
  <td><a href="https://github.com/techwithcosta/youtube-ai-analytics"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/costadms/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/techwithcosta"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Great course! Check out my YouTube channel: https://www.youtube.com/@TechWithCosta
</details></td>
</tr>
<tr>
  <td>Francisco Ortiz Tena</td>
  <td><a href="https://github.com/FranciscoOrtizTena/de_zoomcamp_project_01/"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/francisco-ortiz-tena/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/FranciscoOrtizTena"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
It is an awesome course!
</details></td>
</tr>
<tr>
  <td>Nevenka Lukic</td>
  <td><a href="https://github.com/nenalukic/air-quality-project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/nevenka-lukic/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/nenalukic"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
This DE Zoomcamp was fantastic learning and networking experiences. Many thanks to organizers and big recommendations to anyone!
</details></td>
</tr>
<tr>
  <td>Mukhammad Sofyan Rizka Akbar</td>
  <td><a href="https://github.com/SofyanAkbar94/Project-DE-Zoomcamp-2024"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://id.linkedin.com/in/m-sofyan-r-a-aa00a4118"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/SofyanAkbar94/"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Thanks for providing this course, especially for Alexey and other Datatalk hosts and I hope I can join ML, ML Ops, and LLM Zoomcamp. See you soon :)
</details></td>
</tr>
<tr>
  <td>Mahmoud Mahdy Zaky</td>
  <td><a href="https://github.com/MahmoudMahdy448/Football-Data-Analytics/tree/main"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/mahmoud-mahdy-zaky"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/MahmoudMahdy448"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Brilliant Pancake</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Jobert M. Gutierrez</td>
  <td><a href="https://github.com/bizzaccelerator/Footballers-transfers-Insights.git"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/jobertgutierrez"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/bizzaccelerator"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Olusegun Samson Ayeni</td>
  <td><a href="https://github.com/iamraphson/IMDB-pipeline-project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a><a href="https://github.com/iamraphson/DE-2024-project-book-recommendation"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/iamraphson/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/iamraphson"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Lily Chau</td>
  <td><a href="https://github.com/lilychau1/uk-power-analytics"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a><a href="https://github.com/lilychau1/uk-power-analytics/tree/main"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/lilychau1"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/lilychau1"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Big thank you to Alexey and all other speakers. This is one of the best online learning platforms I have ever come across.
</details></td>
</tr>
<tr>
  <td>Aleksandr Kolmakov</td>
  <td><a href="https://github.com/Feanaur/marine-species-analytics"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a><a href="https://github.com/Feanaur/marine-species-analytics"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/aleksandr-kolmakov/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/alex-kolmakov"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Kang Zhi Yong</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Eduardo Muñoz Sala</td>
  <td><a href="https://github.com/edumunozsala/GDELT-Events-Data-Eng-Project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/edumunozsala/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/edumunozsala"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Kirill Bazarov</td>
  <td><a href="https://github.com/kirill505/de-zoomcamp-project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/kirill-bazarov-66ba3152"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/kirill505"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Shayan Shafiee Moghadam</td>
  <td><a href="https://github.com/shayansm2/DE-zoomcamp-playground/tree/de-zoomcamp-2nd-project/github-events-analyzer"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a><a href="https://github.com/shayansm2/tech-career-explorer/tree/de-zoomcamp-project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/shayan-shafiee-moghadam/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/shayansm2"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Landry N.</td>
  <td><a href="https://github.com/drux31/capstone-dezoomcamp"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://github.com/drux31"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Thanks for the awesome course.
</details></td>
</tr>
<tr>
  <td>Condescending Austin</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Lee Durbin</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Loving Einstein</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Carlos Vecina Tebar</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Abiodun Oki</td>
  <td></td>
  <td> <a href="https://www.linkedin.com/in/okibaba/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Okibaba"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
thoroughly enjoyed the course, great work Alexey & course team!
</details></td>
</tr>
<tr>
  <td>Jimoh</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Sleepy Villani</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Ella Cinders</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Max Lutz</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Jessica De Silva</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Daniel Okello</td>
  <td></td>
  <td> <a href="https://www.linkedin.com/in/okellodaniel/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/okellodaniel"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Kirill Sitnikov</td>
  <td><a href="https://github.com/Siddha911/Citibike-data-project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://github.com/Siddha911"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Thank you Alexey and all DTC team! I’m so glad that I knew about your courses and projects!
</details></td>
</tr>
<tr>
  <td>edumad</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Duy Quoc Vo</td>
  <td><a href="https://github.com/voduyquoc/air_pollution_tracking"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/voduyquoc/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/voduyquoc"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
NA
</details></td>
</tr>
<tr>
  <td>Xiang Li</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Sugeng Wahyudi</td>
  <td><a href="https://github.com/Gengsu07/DEGengsuProject"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/sugeng-wahyudi-8a3939132/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Gengsu07"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Thanks a lot, this was amazing. Can't miss another course and zoomcamp from datatalks.club
</details></td>
</tr>
<tr>
  <td>Anatolii Kryvko</td>
  <td><a href="https://github.com/Nogromi/ukraine-vaccinations/tree/master"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/anatolii-kryvko-69b538107/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Nogromi"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>David Vanegas</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Honey Badger</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Abdelrahman Kamal</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Jean Paul Rodriguez</td>
  <td><a href="https://github.com/jeanpaulrd1/de-zc-final-project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/jean-paul-rodriguez"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/jeanpaulrd1/de-zc-final-project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Eager Pasteur</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Damian Pszczoła</td>
  <td><a href="https://github.com/d4mp3/GLDAS-Data-Pipeline"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/damian-pszczo%C5%82a-7aba54241/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/d4mp3"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>ManPrat</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>forrest_parnassus</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Ramazan Abylkassov</td>
  <td><a href="https://github.com/ramazanabylkassov/aviation_stack_project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/ramazan-abylkassov-23965097/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/ramazanabylkassov"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Look mom, I am on leaderboard!
</details></td>
</tr>
<tr>
  <td>Digamber Deshmukh</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Andrew Lee</td>
  <td><a href="https://github.com/wndrlxx/ca-trademarks-data-pipeline"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Matt R</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Raul Antonio Catacora Grundy</td>
  <td><a href="https://github.com/Cerpint4xt/data-engineering-all-news-project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/raul-catacora-grundy-208315236/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Cerpint4xt"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
I just want to thank everyone, all the instructors, collaborators for creating this amazing set of resources and such a solid community based on sharing and caring. Many many thanks and shout out to you guys
</details></td>
</tr>
<tr>
  <td>Ranga H.</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Salma Gouda</td>
  <td><a href="https://github.com/salmagouda/data-engineering-capstone/tree/main"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://linkedin.com/in/salmagouda"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/salmagouda"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Artsiom Turevich</td>
  <td><a href="https://github.com/aturevich/zoomcamp_de_project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/artsiom-turevich/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="a.turevich"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
A long time ago in a galaxy far, far away...
</details></td>
</tr>
<tr>
  <td>Abhirup Ghosh</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Sonny Pham</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Peter Tran</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Ritika Tilwalia</td>
  <td><a href="https://github.com/rtilwalia/Fashion-Campus-Orders"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/ritika-tilwalia/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/rtilwalia"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Eager Yalow</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Dave Samaniego</td>
  <td><a href="https://github.com/nishiikata/de-zoomcamp-2024-mage-capstone"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/dave-s-32545014a"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/nishiikata"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Thank you DataTalksClub for the course. It was challenging learning many new things, but I had fun along the way too!
</details></td>
</tr>
<tr>
  <td>Lucid Keldysh</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Isaac Ndirangu Muturi</td>
  <td><a href="https://github.com/Isaac-Ndirangu-Muturi-749/End_to_end_data_pipeline--Optimizing_Online_Retail_Analytics_with_Data_and_Analytics_Engineering"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/isaac-muturi-3b6b2b237"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Isaac-Ndirangu-Muturi-749"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Amazing learning experience
</details></td>
</tr>
<tr>
  <td>Agitated Wing</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Hanaa HAMMAD</td>
  <td></td>
  <td> <a href="https://www.linkedin.com/in/hanaahammad/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/hanaahammad"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Grateful to this great course
</details></td>
</tr>
<tr>
  <td>Jonah Oliver</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Paul Emilio Arizpe Colorado</td>
  <td><a href="https://github.com/kiramishima/crimes_in_mexico_city_analysis"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/parizpe/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/kiramishima"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
DataTalksClub brought me the opportunity to learn data engineering. Thanks for all :D
</details></td>
</tr>
<tr>
  <td>Asma-Chloë FARAH</td>
  <td><a href="https://github.com/AsmaChloe/traffic_counting_paris"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/asma-chloefarah/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/AsmaChloe"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Thank you for this amazing zoomcamp ! It was really fun !
</details></td>
</tr>
<tr>
  <td>Happy Feistel</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Luca Pugliese</td>
  <td><a href="https://github.com/lucapug/nyc-bike-analytics"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/lucapug/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/lucapug"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
it has been a crowdlearning experience! starting in thousands of us. 359 graduated in the end. Proud to have classified 59th. Thanks to all.
</details></td>
</tr>
<tr>
  <td>Jake Maund</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Aditya Phulallwar</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Dave Wilson</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Haitham Hussein Hamad</td>
  <td><a href="https://github.com/haithamhamad2/kaggle-survey"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/haitham-hamad-8926b415/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/haithamhamad2"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Alexandre Bergere aka Rocket</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>TOGBAN COKOUVI Joyce Elvis Mahoutondji</td>
  <td><a href="https://github.com/lvsuno/Github_data_analysis"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/elvistogban/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/lvsuno/Github_data_analysis"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Sad Robinson</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Tetiana Omelchenko</td>
  <td><a href="https://github.com/TOmelchenko/LifeExpectancyProject"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/tetiana-omelchenko-35177379"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/TOmelchenko/LifeExpectancyProject"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Amanda Kershaw</td>
  <td><a href="https://github.com/ANKershaw/youtube_video_ranks"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/amandalnkershaw"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/ANKershaw/youtube_video_ranks"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
This course was incredibly rewarding and absolutely worth the effort.
</details></td>
</tr>
<tr>
  <td>Kristjan Sert</td>
  <td><a href="https://github.com/KrisSert/cadaster-ee"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/kristjan-sert-043396131/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/KrisSert"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Murad Arfanyan</td>
  <td><a href="https://github.com/murkenson/movies_tv_shows_data_pipeline"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/murad-arfanyan-846786176/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/murkenson"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Ecstatic Hofstadter</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Chung Huu Tin</td>
  <td><a href="https://github.com/TinChung41/US-Accidents-Analysis-zoomcamp-project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://linkedin.com/in/huu-tin-chung"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/TinChung41"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Zen Mayer</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Zhastay Yeltay</td>
  <td><a href="https://github.com/yelzha/tengrinews-open-project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/yelzha/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/yelzha"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
;)
</details></td>
</tr>
<tr>
  <td>AV3NII</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Sebastian Alejandro Peralta Casafranca</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Relaxed Williams</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>George Mouratos</td>
  <td><a href="https://github.com/Gimour/Datatalks_final_project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/gmouratos/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/Gimour/DataTalks"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a> <a href="https://github.com/Gimour/Datatalks_final_project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
-
</details></td>
</tr>
<tr>
  <td>mhmed ahmed rjb</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Frosty Jackson</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>WANJOHI</td>
  <td><a href="https://github.com/DE-ZoomCamp/Flood-Monitoring"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://github.com/DE-ZoomCamp/Flood-Monitoring"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Ighorr Holstrom</td>
  <td><a href="https://github.com/askeladden31/air_raids_data/"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/ighorr-holstrom/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/askeladden31"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Jesse Delzio</td>
  <td></td>
  <td> <a href="https://www.linkedin.com/in/delzioj"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/delzio"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
<tr>
  <td>Khalil El Daou</td>
  <td><a href="https://github.com/khalileldoau/global-news-engagement-on-social-media"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/khalil-el-daou-177a8b114"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/khalileldoau"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td><details>
<summary>comment</summary>
Already made a post about the zoomcamp
</details></td>
</tr>
<tr>
  <td>Juan Rojas</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Gonçalo</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Muhamad Farikhin</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Bold Lederberg</td>
  <td></td>
  <td></td>
  <td></td>
</tr>
<tr>
  <td>Taras Shalaiko</td>
  <td><a href="https://github.com/tarasenya/dezoomcamp_final_project"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td> <a href="https://www.linkedin.com/in/taras-shalaiko-30114a107/"><img src="https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png" height="16em" /></a> <a href="https://github.com/tarasenya"><img src="https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png" height="16em" /></a></td>
  <td></td>
</tr>
</table>


================================================
FILE: cohorts/2024/project.md
================================================
## Course Project

The goal of this project is to apply everything we learned
in this course and build an end-to-end data pipeline.

You will have two attempts to submit your project. If you don't have 
time to submit your project by the end of attempt #1 (you started the 
course late, you have vacation plans, life/work got in the way, etc.)
or you fail your first attempt, 
then you will have a second chance to submit your project as attempt
#2. 

There are only two attempts.

Remember that to pass the project, you must evaluate 3 peers. If you don't do that,
your project can't be considered complete.

To find the projects assigned to you, use the peer review assignments link 
and find your hash in the first column. You will see three rows: you need to evaluate 
each of these projects. For each project, you need to submit the form once,
so in total, you will make three submissions. 


### Submitting

#### Project Attempt #1

* Project: https://courses.datatalks.club/de-zoomcamp-2024/project/project1
* Review: https://courses.datatalks.club/de-zoomcamp-2024/project/project1/eval

#### Project Attempt #2

* Project: https://courses.datatalks.club/de-zoomcamp-2024/project/project2
* Review: https://courses.datatalks.club/de-zoomcamp-2024/project/project2/eval

> **Important**: update your "Certificate name" here: https://courses.datatalks.club/de-zoomcamp-2024/enrollment -
this is what we will use when generating certificates for you.

### Evaluation criteria

See [here](../../week_7_project/README.md)


================================================
FILE: cohorts/2024/workshops/dlt.md
================================================
# Data ingestion with dlt

In this hands-on workshop, we’ll learn how to build data ingestion pipelines.

We’ll cover the following steps:

* Extracting data from APIs or files
* Normalizing and loading data
* Incremental loading

By the end of this workshop, you’ll be able to write data pipelines like a senior data engineer: quickly, concisely, and in a way that is scalable and self-maintaining.

Video: https://www.youtube.com/live/oLXhBM7nf2Q

--- 

# Navigation

* [Workshop content](dlt_resources/data_ingestion_workshop.md)
* [Workshop notebook](dlt_resources/workshop.ipynb)
* [Homework starter notebook](dlt_resources/homework_starter.ipynb)

# Resources

- Website and community: Visit our [docs](https://dlthub.com/docs/intro), discuss on our slack (Link at top of docs).
- Course colab: [Notebook](https://colab.research.google.com/drive/1kLyD3AL-tYf_HqCXYnA3ZLwHGpzbLmoj#scrollTo=5aPjk0O3S_Ag&forceEdit=true&sandboxMode=true).
- dlthub [community Slack](https://dlthub.com/community).

---

# Teacher

Welcome to the DataTalks.Club Data Engineering Zoomcamp data ingestion workshop.

- My name is [Adrian](https://www.linkedin.com/in/data-team/), and I have worked in the data field since 2012
    - I built many data warehouses, some lakes, and a few data teams
    - 10 years into my career I started working on dlt “data load tool”, which is an open source library to enable data engineers to build faster and better.
    - I started working on dlt because data engineering is one of the few areas of software engineering where we do not have developer tools to do our work.
    - Building better pipelines would require more code re-use - we cannot all just build perfect pipelines from scratch every time.
    - And so dlt was born, a library that automates the tedious part of data ingestion: Loading, schema management, data type detection, scalability, self healing, scalable extraction… you get the idea - essentially a data engineer’s “one stop shop” for best practice data pipelining.
    - Due to its **simplicity** of use, dlt enables **laymen** to
        - Build pipelines 5-10x faster than without it
        - Build self healing, self maintaining pipelines with all the best practices of data engineers. Automating schema changes removes the bulk of maintenance efforts.
        - Govern your pipelines with schema evolution alerts and data contracts.
        - and generally develop pipelines like a senior, commercial data engineer.

--- 

# Course
You can find the course file [here](./dlt_resources/data_ingestion_workshop.md).

The course has 3 parts:
- [Extraction Section](./dlt_resources/data_ingestion_workshop.md#extracting-data): In this section we will learn about scalable extraction
- [Normalisation Section](./dlt_resources/data_ingestion_workshop.md#normalisation): In this section we will learn to prepare data for loading
- [Loading Section](./dlt_resources/data_ingestion_workshop.md#incremental-loading): Here we will learn about incremental loading modes

---

# Homework

The [linked colab notebook](https://colab.research.google.com/drive/1Te-AT0lfh0GpChg1Rbd0ByEKOHYtWXfm#scrollTo=wLF4iXf-NR7t&forceEdit=true&sandboxMode=true) offers a few exercises to practice what you learned today.


#### Question 1: What is the sum of the outputs of the generator for limit = 5?
- **A**: 10.23433234744176
- **B**: 7.892332347441762
- **C**: 8.382332347441762
- **D**: 9.123332347441762

#### Question 2: What is the 13th number yielded by the generator?
- **A**: 4.236551275463989
- **B**: 3.605551275463989
- **C**: 2.345551275463989
- **D**: 5.678551275463989

#### Question 3: Append the 2 generators. After correctly appending the data, calculate the sum of all ages of people.
- **A**: 353
- **B**: 365
- **C**: 378
- **D**: 390

#### Question 4: Merge the 2 generators using the ID column. Calculate the sum of ages of all the people loaded as described above.
- **A**: 215
- **B**: 266
- **C**: 241
- **D**: 258

Submit the solution here: https://courses.datatalks.club/de-zoomcamp-2024/homework/workshop1

--- 
# Next steps

As you are learning the various concepts of data engineering, 
consider creating a portfolio project that will further your own knowledge.

By demonstrating the ability to deliver end to end, you will have an easier time finding your first role. 
This will help regardless of whether your hiring manager reviews your project, largely because you will have a better 
understanding and will be able to talk the talk.

Here are some example projects that others did with dlt:
- Serverless dlt-dbt on cloud functions: [Article](https://docs.getdbt.com/blog/serverless-dlt-dbt-stack)
- Bird finder: [Part 1](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-i), [Part 2](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-ii)
- Event ingestion on GCP: [Article and repo](https://dlthub.com/docs/blog/streaming-pubsub-json-gcp)
- Event ingestion on AWS: [Article and repo](https://dlthub.com/docs/blog/dlt-aws-taktile-blog)
- Or see one of the many demos created by our working students: [Hacker news](https://dlthub.com/docs/blog/hacker-news-gpt-4-dashboard-demo), 
[GA4 events](https://dlthub.com/docs/blog/ga4-internal-dashboard-demo), 
[an E-Commerce](https://dlthub.com/docs/blog/postgresql-bigquery-metabase-demo), 
[google sheets](https://dlthub.com/docs/blog/google-sheets-to-data-warehouse-pipeline), 
[Motherduck](https://dlthub.com/docs/blog/dlt-motherduck-demo), 
[MongoDB + Holistics](https://dlthub.com/docs/blog/MongoDB-dlt-Holistics), 
[Deepnote](https://dlthub.com/docs/blog/deepnote-women-wellness-violence-tends), 
[Prefect](https://dlthub.com/docs/blog/dlt-prefect),
[PowerBI vs GoodData vs Metabase](https://dlthub.com/docs/blog/semantic-modeling-tools-comparison),
[Dagster](https://dlthub.com/docs/blog/dlt-dagster),
[Ingesting events via gcp webhooks](https://dlthub.com/docs/blog/dlt-webhooks-on-cloud-functions-for-event-capture),
[SAP to snowflake replication](https://dlthub.com/docs/blog/sap-hana-to-snowflake-demo-blog),
[Read emails and send summary to slack with AI and Kestra](https://dlthub.com/docs/blog/dlt-kestra-demo-blog),
[Mode +dlt capabilities](https://dlthub.com/docs/blog/dlt-mode-blog),
[dbt on cloud functions](https://dlthub.com/docs/blog/dlt-dbt-runner-on-cloud-functions)
- If you want to use dlt in your project, [check this list of public APIs](https://dlthub.com/docs/blog/practice-api-sources)


If you create a personal project, consider submitting it to our blog - we will be happy to showcase it. Just drop us a line in the dlt slack.


**And don't forget, if you like dlt**
- **Give us a [GitHub Star!](https://github.com/dlt-hub/dlt)**
- **Join our [Slack community](https://dlthub.com/community)**


# Notes

* Add your notes here


================================================
FILE: cohorts/2024/workshops/dlt_resources/data_ingestion_workshop.md
================================================
# Intro

What is data loading, or data ingestion?

Data ingestion is the process of extracting data from a producer, transporting it to a convenient environment, and preparing it for usage by normalising it, sometimes cleaning, and adding metadata.

### “A wild dataset magically appears!”

In many data science teams, data magically appears - because the engineer loads it.

- Sometimes the format in which it appears is structured, and with explicit schema
    - In that case, they can go straight to using it; Examples: Parquet, Avro, or table in a db,
- Sometimes the format is weakly typed and without explicit schema, such as csv, json
    - in which case some extra normalisation or cleaning might be needed before usage

> 💡 **What is a schema?** The schema specifies the expected format and structure of data within a document or data store, defining the allowed keys, their data types, and any constraints or relationships.


### Be the magician! 😎

Since you are here to learn about data engineering, you will be the one making datasets magically appear. 

Here’s what you need to learn to build pipelines

- Extracting data
- Normalising, cleaning, adding metadata such as schema and types
- and Incremental loading, which is vital for fast, cost effective data refreshes.

### What else does a data engineer do? What are we not learning, and what are we learning?

- It might seem simplistic, but in fact a data engineer’s main goal is to ensure data flows from source systems to analytical destinations.
- So besides building pipelines, running pipelines and fixing pipelines, a data engineer may also focus on optimising data storage, ensuring data quality and integrity, implementing effective data governance practices, and continuously refining data architecture to meet the evolving needs of the organisation.
- Ultimately, a data engineer's role extends beyond the mechanical aspects of pipeline development, encompassing the strategic management and enhancement of the entire data lifecycle.
- This workshop focuses on building robust, scalable, self maintaining pipelines, with built in governance - in other words, best practices applied.

# Extracting data

### The considerations of extracting data

In this section we will learn about extracting data from source systems, and what to care about when doing so.

Most data is stored behind an API 

- Sometimes that’s a RESTful api for some business application, returning records of data.
- Sometimes the API returns a secure file path to something like a json or parquet file in a bucket that enables you to grab the data in bulk,
- Sometimes the API is something else (mongo, sql, other databases or applications) and will generally return records as JSON - the most common interchange format.

As an engineer, you will need to build pipelines that “just work”. 

So here’s what you need to consider on extraction, to prevent the pipelines from breaking, and to keep them running smoothly.

- Hardware limits: During this course we will cover how to navigate the challenges of managing memory.
- Network limits: Sometimes networks can fail. We can’t fix what could go wrong but we can retry network jobs until they succeed. For example, the dlt library offers a requests “replacement” that has built in retries. [Docs](https://dlthub.com/docs/reference/performance#using-the-built-in-requests-client). We won’t focus on this during the course but you can read the docs on your own.
- Source api limits: Each source might have some limits, such as how many requests you can make per second. We call these “rate limits”. Read each source’s docs carefully to understand how to navigate these obstacles. You can find some examples of how to wait for rate limits in our verified sources repositories.
    - examples: [Zendesk](https://developer.zendesk.com/api-reference/introduction/rate-limits/), [Shopify](https://shopify.dev/docs/api/usage/rate-limits)
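
As a rough illustration of the retry idea above, here is a minimal sketch of retrying a failing call with exponential backoff. It is not dlt's built-in client: `retry_with_backoff`, `flaky_request`, and the delay values are hypothetical names and numbers chosen for this example, and a real pipeline would wrap an actual network call instead.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    Only the exception types listed in retry_on trigger a retry.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: re-raise the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical stand-in for a network call that fails twice, then succeeds
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network failure")
    return {"status": "ok"}


# Record the waits instead of actually sleeping, so the example runs instantly
waits = []
result = retry_with_backoff(flaky_request, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

The same wrapper could be used to wait out a rate limit, although many APIs tell you how long to wait, in which case that value is a better choice than a fixed backoff.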

### Extracting data without hitting hardware limits

What kind of limits could you hit on your machine? In the case of data extraction, the only limits are memory and storage. This refers to the RAM or virtual memory, and the disk, or physical storage.

### **Managing memory.**

- Many data pipelines run on serverless functions or on orchestrators that delegate the workloads to clusters of small workers.
- These systems have a small memory or share it between multiple workers - so filling the memory is BAAAD: It might lead to not only your pipeline crashing, but crashing the entire container or machine that might be shared with other worker processes, taking them down too.
- The same can be said about disk - in most cases your disk is sufficient, but in some cases it’s not. For those cases, mounting an external drive mapped to a storage bucket is the way to go. Airflow for example supports a “data” folder that is used just like a local folder but can be mapped to a bucket for unlimited capacity.

### So how do we avoid filling the memory?

- We often do not know the volume of data upfront
- And we cannot scale dynamically or infinitely on hardware during runtime
- So the answer is: Control the max memory you use

### Control the max memory used by streaming the data

Streaming here refers to processing the data event by event or chunk by chunk instead of doing bulk operations. 

Let’s look at some classic examples of streaming where data is transferred chunk by chunk or event by event

- Between an audio broadcaster and an in-browser audio player
- Between a server and a local video player
- Between a smart home device or IoT device and your phone
- Between Google Maps and your navigation app
- Between Instagram Live and your followers

What do data engineers do? We usually stream the data between buffers, such as 

- from API to local file
- from webhooks to event queues
- from event queue (Kafka, SQS) to Bucket

### Streaming in python via generators

Let’s focus on how we build most data pipelines:

- To process data in a stream in python, we use generators, which are functions that can return multiple times - by allowing multiple returns, the data can be released as it’s produced, as stream, instead of returning it all at once as a batch.

Take the following theoretical example: 

- We search twitter for “cat pictures”. We do not know how many pictures will be returned - maybe 10, maybe 10,000,000. Will they fit in memory? Who knows.
- So to grab this data without running out of memory, we would use a python generator.
- What’s a generator? In simple words, it’s a function that can return multiple times. Here’s an example of a regular function, and how that function looks if written as a generator.

### Generator examples:

Let’s look at a regular returning function, and how we can re-write it as a generator.

**Regular function collects data in memory.** Here you can see how data is collected row by row in a list called `data` before it is returned. This will break if we have more data than memory.

```python
# `paginated_get` is a placeholder for an API client that returns rows page by page
def search_twitter(query):
    data = []
    for row in paginated_get(query):
        data.append(row)
    return data

# Collect all the cat picture data
for row in search_twitter("cat pictures"):
    # Once everything is collected,
    # print row by row
    print(row)
```

When calling `for row in search_twitter("cat pictures"):` all the data must first be downloaded before the first record is returned

Let’s see how we could rewrite this as a generator.

**Generator for streaming the data.** The memory usage here is minimal.

As you can see, in the modified function, we yield each row as we get the data, without collecting it into memory. We can then run this generator and handle the data item by item.

```python
# `paginated_get` is a placeholder for an API client that returns rows page by page
def search_twitter(query):
    for row in paginated_get(query):
        yield row

# Get one row at a time
for row in search_twitter("cat pictures"):
    # print the row
    print(row)
    # do something with the row, such as cleaning it and writing it to a buffer
    # then continue requesting and printing data
```

When calling `for row in search_twitter("cat pictures"):` the function only runs until the first data item is yielded, before printing - so we do not need to wait long for the first value. It will then continue until there is no more data to get.

If we wanted to get all the values at once from a generator instead of one by one, we would need to first “run” the generator and collect the data. For example, if we wanted to get all the data in memory we could do `data = list(search_twitter("cat pictures"))` which would run the generator and collect all the data in a list before continuing.
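
To make “writing it to a buffer” concrete, here is a small sketch that streams rows from a generator into a local JSONL file, one row at a time. The names `fake_paginated_get` and `write_jsonl`, the row shape, and the file name are all made up for this example; the fake source simply stands in for a real API.

```python
import json
import os
import tempfile


def fake_paginated_get(query, total_rows=5):
    # Hypothetical stand-in for an API client: yields rows one by one
    for i in range(total_rows):
        yield {"query": query, "id": i}


def write_jsonl(rows, path):
    """Write rows to a JSONL file as they arrive; return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # Only the current row is held in memory at this point
            f.write(json.dumps(row) + "\n")
            count += 1
    return count


path = os.path.join(tempfile.mkdtemp(), "cat_pictures.jsonl")
rows_written = write_jsonl(fake_paginated_get("cat pictures"), path)
print(f"wrote {rows_written} rows to {path}")
```

Because `write_jsonl` consumes the generator row by row, memory usage stays roughly constant no matter how many rows the source returns.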

## 3 Extraction examples:

### Example 1: Grabbing data from an api

> 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab or in your local setup.


For these purposes we created an api that can serve the data you are already familiar with, the NYC taxi dataset.

The api documentation is as follows:

- There is a limited number of records behind the api
- The data can be requested page by page, each page containing 1000 records
- If we request a page with no data, we will get a successful response with no data
- so this means that when we get an empty page, we know there is no more data and we can stop requesting pages - this is a common way to paginate but not the only one - each api may be different.
- details:
    - method: get
    - url: `https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api`
    - parameters: `page`  integer. Represents the page number you are requesting. Defaults to 1.
    

So how do we design our requester? 

- We need to request page by page until we get no more data. At this point, we do not know how much data is behind the api.
- It could be 1000 records or it could be 10GB of records. So let’s grab the data with a generator to avoid having to fit an undetermined amount of data into ram.

In this approach to grabbing data from apis, we have pros and cons:

- Pros: **Easy memory management** thanks to api returning events/pages
- Cons: **Low throughput**, due to the data transfer being constrained via an API.

```python
import requests

BASE_API_URL = "https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api"

# I call this a paginated getter
# as it's a function that gets data
# and also paginates until there is no more data
# by yielding pages, we "microbatch", which speeds up downstream processing

def paginated_getter():
    page_number = 1

    while True:
        # Set the query parameters
        params = {'page': page_number}

        # Make the GET request to the API
        response = requests.get(BASE_API_URL, params=params)
        response.raise_for_status()  # Raise an HTTPError for bad responses
        page_json = response.json()
        print(f'got page number {page_number} with {len(page_json)} records')

        # if the page has records, yield it and move on to the next page
        if page_json:
            yield page_json
            page_number += 1
        else:
            # No more data, break the loop
            break

if __name__ == '__main__':
    # Use the generator to iterate over pages
    for page_data in paginated_getter():
        # Process each page as needed
        print(page_data)
```

### Example 2: Grabbing the same data from file - simple download


> 💡 This part is demonstrative, so you do not need to follow along; just pay attention.


- Why am I showing you this? so when you do this in the future, you will remember there is a best practice you can apply for scalability.

Some apis respond with files instead of pages of data. The reason for this is simple: Throughput and cost. A restful api that returns data has to read the data from storage and process and return it to you by some logic - If this data is large, this costs time, money and creates a bottleneck. 

A better way is to offer the data as files that someone can download from storage directly, without going through the restful api layer. This is common for apis that offer large volumes of data, such as ad impressions data.

In this example, we grab exactly the same data as we did in the API example above, but now we get it from the underlying file instead of going through the API.

- Pros: **High throughput**
- Cons: **Memory** is used to hold all the data

This is how the code could look. As you can see in this case our `data` and `parsed_data` variables hold the entire file data in memory before returning it. Not great.

```python
import requests
import json

url = "https://storage.googleapis.com/dtc_zoomcamp_api/yellow_tripdata_2009-06.jsonl"

def download_and_read_jsonl(url):
    response = requests.get(url)
    response.raise_for_status()  # Raise an HTTPError for bad responses
    data = response.text.splitlines()
    parsed_data = [json.loads(line) for line in data]
    return parsed_data
   

downloaded_data = download_and_read_jsonl(url)

if downloaded_data:
    # Process or print the downloaded data as needed
    print(downloaded_data[:5])  # Print the first 5 entries as an example
```

### Example 3: Same file, streaming download


> 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab

Ok, downloading files is simple, but what if we want to do a stream download?

That’s possible too - in effect giving us the best of both worlds. In this case we prepared a jsonl file which is already split into lines making our code simple. But json (not jsonl) files could also be downloaded in this fashion, for example using the `ijson` library.

What are the pros and cons of this method of grabbing data?

Pros: **High throughput, easy memory management,** because we are downloading a file

Cons: **Difficult to do for columnar file formats**, as entire blocks need to be downloaded before they can be deserialised to rows. Sometimes, the code is complex too.

Here’s what the code looks like - in a jsonl file each line is a json document, or a “row” of data, so we yield them as they get downloaded. This allows us to download one row and process it before getting the next row.

```python
import requests
import json

def download_and_yield_rows(url):
    response = requests.get(url, stream=True)
    response.raise_for_status()  # Raise an HTTPError for bad responses

    for line in response.iter_lines():
        if line:
            yield json.loads(line)

# Replace the URL with your actual URL
url = "https://storage.googleapis.com/dtc_zoomcamp_api/yellow_tripdata_2009-06.jsonl"

# Use the generator to iterate over rows with minimal memory usage
for row in download_and_yield_rows(url):
    # Process each row as needed
    print(row)
```

In the colab notebook you can also find a code snippet to load the data - but we will load some data later in the course and you can explore the colab on your own after the course. 

What is worth keeping in mind at this point is that our loader library that we will use later, `dlt` or data load tool, will respect the streaming concept of the generator and will process it in an efficient way keeping memory usage low and using parallelism where possible.

Let’s move over to the Colab notebook and run examples 2 and 3, compare them, and finally load examples 1 and 3 to DuckDB

# Normalising data

You often hear that data people spend most of their time “cleaning” data. What does this mean? 

Let’s look granularly into what people consider data cleaning. 

Usually we have 2 parts: 

- Normalising data without changing its meaning,
- and filtering data for a use case, which changes its meaning.

### Part of what we often call data cleaning is just metadata work:

- Add types (string to number, string to timestamp, etc)
- Rename columns: Ensure column names follow a supported standard downstream - such as no strange characters in the names.
- Flatten nested dictionaries: Bring nested dictionary values into the top dictionary row
- Unnest lists or arrays into child tables: Arrays or lists cannot be flattened into their parent record, so if we want flat data we need to break them out into separate tables.
- We will look at a practical example next, as these concepts can be difficult to visualise from text.
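
Before the practical example, here is a rough pure-Python sketch of the renaming, flattening, and unnesting steps listed above. It is only an illustration of the idea, not how any particular library implements it: the `__` separator, the `parent_id` column, the function names, and the sample record are all assumptions made for this example.

```python
import re


def normalise_name(name):
    # Lower-case and replace runs of unsupported characters with "_"
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def flatten(record, prefix="", sep="__"):
    """Return (flat_row, child_tables) for one nested record.

    Nested dicts are merged into the top-level row; lists are set aside
    so they can be loaded as separate child tables.
    """
    flat, children = {}, {}
    for key, value in record.items():
        name = normalise_name(key)
        if prefix:
            name = prefix + sep + name
        if isinstance(value, dict):
            sub_flat, sub_children = flatten(value, name, sep)
            flat.update(sub_flat)
            children.update(sub_children)
        elif isinstance(value, list):
            children[name] = value
        else:
            flat[name] = value
    return flat, children


# Hypothetical record, made up for this sketch
record = {
    "ID": 1,
    "Full Name": "Alice",
    "address": {"city": "Berlin", "geo": {"lat": 52.5}},
    "pets": [{"kind": "cat"}, {"kind": "dog"}],
}

row, children = flatten(record)
# Link each child row back to its parent so the tables can be joined later
pet_rows = [{"parent_id": row["id"], **pet} for pet in children["pets"]]

print(row)
print(pet_rows)
```

The parent row ends up with flat columns such as `address__geo__lat`, while the list becomes its own table with a reference back to the parent.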

### **Why prepare data? why not use json as is?**

- We do not easily know what is inside a json document due to lack of schema
- Types are not enforced between rows of json - we could have one record where age is `25` and another where age is `twenty five`, and another where it’s `25.00`. Or in some systems, you might have a dictionary for a single record, but a list of dicts for multiple records. This could easily lead to applications downstream breaking.
- We cannot just use json data easily, for example we would need to convert strings to time if we want to do a daily aggregation.
- Reading json loads more data into memory, as the whole document is scanned - while in parquet or databases we can scan a single column of a document. This causes costs and slowness.
- Json is not fast to aggregate - columnar formats are.
- Json is not fast to search.
- Basically json is designed as a "lowest common denominator format" for "interchange" / data transfer and is unsuitable for direct analytical usage.
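
As a small illustration of the typing point above, the sketch below converts string values to proper types and only then aggregates by day. The sample rides and the field names `pickup` and `amt` are made up for this example.

```python
from collections import defaultdict
from datetime import datetime

# Untyped rows, as they might arrive from a json source (made-up sample data)
rides = [
    {"pickup": "2009-06-14 23:23:00", "amt": "20.5"},
    {"pickup": "2009-06-14 08:10:00", "amt": "7"},
    {"pickup": "2009-06-15 00:05:00", "amt": "12.25"},
]


def add_types(ride):
    # Strings cannot be grouped by day or summed, so convert them first
    return {
        "pickup": datetime.strptime(ride["pickup"], "%Y-%m-%d %H:%M:%S"),
        "amt": float(ride["amt"]),
    }


# Daily aggregation is only possible once the values are typed
daily_totals = defaultdict(float)
for ride in map(add_types, rides):
    daily_totals[ride["pickup"].date()] += ride["amt"]

for day, total in sorted(daily_totals.items()):
    print(day, total)
```

A value like `twenty five` would make `float()` raise an error here, which is exactly the kind of problem that is better caught at load time than in a downstream report.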

### Practical example


> 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab notebook.

In the case of the NY taxi rides data, the dataset is quite clean - so let’s instead use a small example of more complex data. Let’s assume we know some information about passengers and stops.

For this example we modified the dataset as follows

- We added nested dictionaries
    
    ```json
    "coordinates": {
                "start": {
                    "lon": -73.787442,
                    "lat": 40.641525
                    },
    ```
    
- We added nested lists
    
    ```json
    "passengers": [
                {"name": "John", "rating": 4.9},
                {"name": "Jack", "rating": 3.9}
                  ],
    ```
    
- We added a record hash that gives us a unique id for the record, for easy identification
    
    ```json
    "record_hash": "b00361a396177a9cb410ff61f20015ad",
    ```
    

We want to load this data to a database. How do we want to clean the data?

- We want to flatten dictionaries into the base row
- We want to flatten lists into a separate table
- We want to convert time strings into time type

```python
data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "Trip_Distance": 17.52,
        "coordinates": {
            "start": {
                "lon": -73.787442,
                "lat": 40.641525
            },
            "end": {
                "lon": -73.980072,
                "lat": 40.742963
            }
        },
        "Rate_Code": None,
        "store_and_forward": None,
        "Payment": {
            "type": "Credit",
            "amt": 20.5,
            "surcharge": 0,
            "mta_tax": None,
            "tip": 9,
            "tolls": 4.15,
            "status": "booked"
        },
        "Passenger_Count": 2,
        "passengers": [
            {"name": "John", "rating": 4.9},
            {"name": "Jack", "rating": 3.9}
        ],
        "Stops": [
            {"lon": -73.6, "lat": 40.6},
            {"lon": -73.5, "lat": 40.5}
        ]
    },
]
```

Now let’s normalise this data.

## Introducing dlt

dlt is a python library created for the purpose of assisting data engineers to build simpler, faster and more robust pipelines with minimal effort. 

You can think of dlt as a loading tool that implements the best practices of data pipelines enabling you to just “use” those best practices in your own pipelines, in a declarative way. 

This enables you to stop reinventing the flat tyre, and leverage dlt to build pipelines much faster than if you did everything from scratch.

dlt automates much of the tedious work a data engineer would do, and does it in a way that is robust. dlt can handle things like:

- Schema: Inferring and evolving schema, alerting changes, using schemas as data contracts.
- Typing data, flattening structures, renaming columns to fit database standards.  In our example we will pass the “data” you can see above and see it normalised.
- Processing a stream of events/rows without filling memory. This includes extraction from generators.
- Loading to a variety of dbs or file formats.

Let’s use it to load our nested json to duckdb:

Here’s how you would do that on your local machine. I will walk you through before showing you in colab as well.

First, install dlt

```bash
# Make sure you are using Python 3.8-3.11 and have pip installed
# spin up a venv
python -m venv ./env
source ./env/bin/activate
# pip install
# the quotes stop shells such as zsh from interpreting the square brackets
pip install "dlt[duckdb]"
```

Next, grab your data from above and run this snippet

- here we define a pipeline, which is a connection to a destination
- and we run the pipeline, printing the outcome

```python
import dlt

# define the connection to load to.
# We now use duckdb, but you can switch to BigQuery later
pipeline = dlt.pipeline(pipeline_name="taxi_data",
                        destination='duckdb',
                        dataset_name='taxi_rides')

# run the pipeline with default settings, and capture the outcome
info = pipeline.run(data,
                    table_name="users",
                    write_disposition="replace")

# show the outcome
print(info)
```

If you are running dlt locally you can use the built in streamlit app by running the cli command with the pipeline name we chose above.

```bash
dlt pipeline taxi_data show
```

Or explore the data in the linked colab notebook. I’ll switch to it now to show you the data.

# Incremental loading

Incremental loading means that as we update our datasets with the new data, we would only load the new data, as opposed to making a full copy of a source’s data all over again and replacing the old version.

By loading incrementally, our pipelines run faster and cheaper.

- Incremental loading goes hand in hand with incremental extraction and state, two concepts which we will not delve into during this workshop
    - `State` is information that keeps track of what was loaded, to know what else remains to be loaded. dlt stores the state at the destination in a separate table.
    - Incremental extraction refers to only requesting the increment of data that we need, and not more. This is tightly connected to the state to determine the exact chunk that needs to be extracted and loaded.
- You can learn more about incremental extraction and state by reading the dlt docs on how to do it.
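
To make the idea of state concrete, here is a minimal sketch in plain Python. It is not how dlt implements state; the `extract_increment` helper, the `last_pickup` cursor and the sample rows are all made up for illustration. The point is only that a stored cursor lets the second run skip what the first run already extracted.

```python
# Hypothetical source rows; in practice these would come from an API or a database.
SOURCE_ROWS = [
    {"ride_id": 1, "pickup": "2009-06-14 23:23:00"},
    {"ride_id": 2, "pickup": "2009-06-15 08:10:00"},
    {"ride_id": 3, "pickup": "2009-06-16 12:45:00"},
]


def extract_increment(rows, state):
    """Yield only rows newer than the cursor stored in `state`, advancing the cursor."""
    last_seen = state.get("last_pickup", "")
    for row in rows:
        # Timestamps in this format compare correctly as strings.
        if row["pickup"] > last_seen:
            yield row
            state["last_pickup"] = max(state.get("last_pickup", ""), row["pickup"])


state = {}  # dlt would persist this at the destination; here it is a plain dict

first_run = list(extract_increment(SOURCE_ROWS, state))

# A new ride appears at the source before the second run.
SOURCE_ROWS.append({"ride_id": 4, "pickup": "2009-06-17 09:00:00"})
second_run = list(extract_increment(SOURCE_ROWS, state))

print(len(first_run), len(second_run))
print(state)
```

The first run extracts all three rides, while the second run extracts only the ride that was added afterwards.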

### dlt currently supports 2 ways of loading incrementally:

1. Append:
    - We can use this for immutable or stateless events (data that doesn’t change), such as taxi rides. For example, every day there are new rides, and we could load only the new ones instead of the entire history.
    - We could also use this to load different versions of stateful data, for example to create a “slowly changing dimension” table for auditing changes. If we load a list of cars and their colors every day, and one day one car changes color, we need both sets of data to be able to discern that a change happened.
2. Merge:
    - We can use this to update data that changes.
    - For example, a taxi ride could have a payment status, which is originally “booked” but could later be changed into “paid”, “rejected” or “cancelled”.
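
The difference between the two can be sketched on in-memory lists. This is only an illustration of the semantics, not dlt code; the `load_append` and `load_merge` helpers and the `ride_id` key are assumptions made for the example.

```python
def load_append(table, new_rows):
    """Append: keep everything already loaded and add the new rows."""
    return table + list(new_rows)


def load_merge(table, new_rows, key):
    """Merge: drop existing rows whose key appears in the new data, then add the new rows."""
    new_rows = list(new_rows)
    new_keys = {row[key] for row in new_rows}
    kept = [row for row in table if row[key] not in new_keys]
    return kept + new_rows


day_1 = [
    {"ride_id": 1, "status": "booked"},
    {"ride_id": 2, "status": "booked"},
]
day_2 = [
    {"ride_id": 2, "status": "paid"},
    {"ride_id": 3, "status": "booked"},
]

appended = load_append(day_1, day_2)
merged = load_merge(day_1, day_2, key="ride_id")

print(len(appended))  # ride 2 appears twice, once per version
print(sorted((r["ride_id"], r["status"]) for r in merged))  # ride 2 appears once, as "paid"
```

Appending keeps both versions of ride 2, which is what you want for history. Merging keeps only the latest version, which is what you want for current state.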

Here is how you can think about which method to use:

![Incremental Loading](./incremental_loading.png)

* If you want to keep track of when changes occur in stateful data (slowly changing dimension) then you will need to append the data
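
As a rough sketch of why appending helps here, the snippet below scans appended daily snapshots of the car example and reports when a color changed. The `detect_changes` helper, the column names and the dates are hypothetical and chosen only for illustration.

```python
def detect_changes(snapshots, key, attribute):
    """Return (key, load_date, old_value, new_value) for each change between loads."""
    last_seen = {}
    changes = []
    for row in sorted(snapshots, key=lambda r: r["load_date"]):
        previous = last_seen.get(row[key])
        if previous is not None and previous != row[attribute]:
            changes.append((row[key], row["load_date"], previous, row[attribute]))
        last_seen[row[key]] = row[attribute]
    return changes


# Every daily load was appended, so all versions are still available.
car_snapshots = [
    {"load_date": "2024-01-01", "car_id": 1, "color": "red"},
    {"load_date": "2024-01-01", "car_id": 2, "color": "green"},
    {"load_date": "2024-01-02", "car_id": 1, "color": "red"},
    {"load_date": "2024-01-02", "car_id": 2, "color": "green"},
    {"load_date": "2024-01-03", "car_id": 1, "color": "blue"},
    {"load_date": "2024-01-03", "car_id": 2, "color": "green"},
]

changes = detect_changes(car_snapshots, key="car_id", attribute="color")
print(changes)
```

Had the table been merged instead, only the latest color would remain and the change could not be detected.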

### Let’s do a merge example together:


> 💡 This is the bread and butter of data engineers pulling data, so follow along.


- In our previous example, the payment status changed from “booked” to “cancelled”. Perhaps Jack likes to defraud taxis, and that explains his low rating. Besides the ride status change, he also got his rating lowered further.
- The merge operation replaces an old record with a new one based on a key. The key could consist of multiple fields or a single unique id. For simplicity, we will use the `record_hash` that we created. If you do not have a unique key, you could create one deterministically out of several fields, such as by concatenating the data and hashing it.
- A merge operation replaces rows; it does not update them. If you want to update only parts of a row, you would have to load the new data by appending it and then do a custom transformation to combine the old and new data.
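
If you need to build such a key yourself, one possible approach is sketched below. It is not necessarily how the `record_hash` in this workshop was produced; the `make_record_hash` helper, the choice of SHA-256 and the selected fields are assumptions for the example.

```python
import hashlib
import json


def make_record_hash(record, fields):
    """Build a deterministic key by hashing the chosen fields in a fixed order."""
    payload = json.dumps([record.get(field) for field in fields], default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


ride = {
    "vendor_name": "VTS",
    "pickup": "2009-06-14 23:23:00",
    "dropoff": "2009-06-14 23:48:00",
    "status": "booked",
}
# Only fields that never change should go into the key.
key_fields = ["vendor_name", "pickup", "dropoff"]

original_key = make_record_hash(ride, key_fields)
updated_key = make_record_hash(dict(ride, status="cancelled"), key_fields)

print(original_key == updated_key)  # the status is not part of the key
```

Because the mutable `status` field is left out of the key, the updated record hashes to the same value and can replace the old one in a merge.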

In this example, the passengers’ ratings were lowered and we need to update the values. We do this with the merge write disposition, replacing the records identified by the `record_hash` present in the new data.

```python
import dlt

data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "Trip_Distance": 17.52,
        "coordinates": {
            "start": {
                "lon": -73.787442,
                "lat": 40.641525
            },
            "end": {
                "lon": -73.980072,
                "lat": 40.742963
            }
        },
        "Rate_Code": None,
        "store_and_forward": None,
        "Payment": {
            "type": "Credit",
            "amt": 20.5,
            "surcharge": 0,
            "mta_tax": None,
            "tip": 9,
            "tolls": 4.15,
            "status": "cancelled"
        },
        "Passenger_Count": 2,
        "passengers": [
            {"name": "John", "rating": 4.4},
            {"name": "Jack", "rating": 3.6}
        ],
        "Stops": [
            {"lon": -73.6, "lat": 40.6},
            {"lon": -73.5, "lat": 40.5}
        ]
    },
]

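# Define the connection to load to.
# We use DuckDB for now, but you can switch to BigQuery later.
# Reuse the pipeline name from the first load so both runs write to the same database.
pipeline = dlt.pipeline(pipeline_name="taxi_data",
                        destination='duckdb',
                        dataset_name='taxi_rides')

# Run the pipeline with the merge write disposition and capture the outcome.
# `primary_key` names the column that identifies the record to replace.
info = pipeline.run(data,
                    table_name="users",
                    write_disposition="merge",
                    primary_key="record_hash")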

# show the outcome
print(info)
```

As you can see in your notebook, the payment status and Jack’s rating were updated after running the code.
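
Keep in mind that the whole row was replaced, not patched. The sketch below contrasts the two behaviours on plain dictionaries. It does not use dlt, and the helper names, the `"abc"` key and the sample rows are hypothetical.

```python
def merge_replace(existing, incoming):
    """Mimic merge: the incoming row replaces the existing one entirely."""
    return dict(incoming)


def merge_combine(existing, incoming):
    """Custom transformation: overlay incoming fields on top of the existing row."""
    return {**existing, **incoming}


# Two rows sharing the same key; the incoming row carries no "tip".
existing_row = {"record_hash": "abc", "status": "booked", "tip": 9}
incoming_row = {"record_hash": "abc", "status": "cancelled"}

replaced = merge_replace(existing_row, incoming_row)
combined = merge_combine(existing_row, incoming_row)

print(replaced.get("tip"))  # the tip is gone after a whole-row replacement
print(combined.get("tip"))  # the tip survives when old and new data are combined
```

So if the new data only contains some of the fields, send complete records to the merge, or append and combine the versions yourself.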

### What’s next?

- You could change the destination to parquet + local file system or storage bucket. See the colab bonus section.
- You could change the destination to BigQuery. See the destination and credential setup docs (https://dlthub.com/docs/dlt-ecosystem/destinations/ and https://dlthub.com/docs/walkthroughs/add_credentials), or the Colab bonus section.
- You could use a decorator to convert the generator into a customised dlt resource: https://dlthub.com/docs/general-usage/resource
- You can deep dive into building more complex pipelines by following the guides:
    - https://dlthub.com/docs/walkthroughs
    - https://dlthub.com/docs/build-a-pipeline-tutorial
- You can join our [Slack community](https://dlthub.com/community) and engage with us there.

================================================
FILE: cohorts/2024/workshops/dlt_resources/homework_solution.ipynb
================================================
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# **Homework**: DataTalksClub Data Engineering Zoomcamp data loading workshop\n",
        "\n",
        "Hello folks, let's practice what we learned - Loading data with the best practices of data engineering.\n",
        "\n",
        "Here are the exercises we will do\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "mrTFv5nPClXh"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 1. Use a generator\n",
        "\n",
        "Remember the concept of a generator? Let's practice using generators to further our understanding of how they work.\n",
        "\n",
        "Let's define a generator and then run it as practice.\n",
        "\n",
        "**Answer the following questions:**\n",
        "\n",
        "- **Question 1: What is the sum of the outputs of the generator for limit = 5?**\n",
        "- **Question 2: What is the 13th number yielded?**\n",
        "\n",
        "I suggest practicing these questions without GPT as the purpose is to further your learning."
      ],
      "metadata": {
        "id": "wLF4iXf-NR7t"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def square_root_generator(limit):\n",
        "    n = 1\n",
        "    while n <= limit:\n",
        "        yield n ** 0.5\n",
        "        n += 1\n",
        "\n",
        "# Example usage:\n",
        "limit = 5\n",
        "generator = square_root_generator(limit)\n",
        "\n",
        "for sqrt_value in generator:\n",
        "    print(sqrt_value)\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wLng-bDJN4jf",
        "outputId": "547683cb-5f56-4815-a903-d0d9578eb1f9"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "1.0\n",
            "1.4142135623730951\n",
            "1.7320508075688772\n",
            "2.0\n",
            "2.23606797749979\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [],
      "metadata": {
        "id": "xbe3q55zN43j"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 2. Append a generator to a table with existing data\n",
        "\n",
        "\n",
        "Below you have 2 generators. You will be tasked to load them to duckdb and answer some questions from the data\n",
        "\n",
        "1. Load the first generator and calculate the sum of ages of all people. Make sure to only load it once.\n",
        "2. Append the second generator to the same table as the first.\n",
        "3. **After correctly appending the data, calculate the sum of all ages of people.**\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "vjWhILzGJMpK"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2MoaQcdLBEk6",
        "outputId": "d2b93dc1-d83f-44ea-aeff-fdf51d75f7aa"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{'ID': 1, 'Name': 'Person_1', 'Age': 26, 'City': 'City_A'}\n",
            "{'ID': 2, 'Name': 'Person_2', 'Age': 27, 'City': 'City_A'}\n",
            "{'ID': 3, 'Name': 'Person_3', 'Age': 28, 'City': 'City_A'}\n",
            "{'ID': 4, 'Name': 'Person_4', 'Age': 29, 'City': 'City_A'}\n",
            "{'ID': 5, 'Name': 'Person_5', 'Age': 30, 'City': 'City_A'}\n",
            "{'ID': 3, 'Name': 'Person_3', 'Age': 33, 'City': 'City_B', 'Occupation': 'Job_3'}\n",
            "{'ID': 4, 'Name': 'Person_4', 'Age': 34, 'City': 'City_B', 'Occupation': 'Job_4'}\n",
            "{'ID': 5, 'Name': 'Person_5', 'Age': 35, 'City': 'City_B', 'Occupation': 'Job_5'}\n",
            "{'ID': 6, 'Name': 'Person_6', 'Age': 36, 'City': 'City_B', 'Occupation': 'Job_6'}\n",
            "{'ID': 7, 'Name': 'Person_7', 'Age': 37, 'City': 'City_B', 'Occupation': 'Job_7'}\n",
            "{'ID': 8, 'Name': 'Person_8', 'Age': 38, 'City': 'City_B', 'Occupation': 'Job_8'}\n"
          ]
        }
      ],
      "source": [
        "def people_1():\n",
        "    for i in range(1, 6):\n",
        "        yield {\"ID\": i, \"Name\": f\"Person_{i}\", \"Age\": 25 + i, \"City\": \"City_A\"}\n",
        "\n",
        "for person in people_1():\n",
        "    print(person)\n",
        "\n",
        "\n",
        "def people_2():\n",
        "    for i in range(3, 9):\n",
        "        yield {\"ID\": i, \"Name\": f\"Person_{i}\", \"Age\": 30 + i, \"City\": \"City_B\", \"Occupation\": f\"Job_{i}\"}\n",
        "\n",
        "\n",
        "for person in people_2():\n",
        "    print(person)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [],
      "metadata": {
        "id": "vtdTIm4fvQCN"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 3. Merge a generator\n",
        "\n",
        "Re-use the generators from Exercise 2.\n",
        "\n",
        "A table's primary key needs to be created from the start, so load your data to a new table with primary key ID.\n",
        "\n",
        "Load your first generator first, and then load the second one with merge. Since they have overlapping IDs, some of the records from the first load should be replaced by the ones from the second load.\n",
        "\n",
        "After loading, you should have a total of 8 records, and ID 3 should have age 33.\n",
        "\n",
        "Question: **Calculate the sum of ages of all the people loaded as described above.**\n"
      ],
      "metadata": {
        "id": "pY4cFAWOSwN1"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Solution: First make sure that the following modules are installed:"
      ],
      "metadata": {
        "id": "kKB2GTB9oVjr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%%capture\n",
        "# Install the dependencies (the %%capture cell magic must be the first line of the cell)\n",
        "!pip install dlt[duckdb]"
      ],
      "metadata": {
        "id": "xTVvtyqrfVNq"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Solutions\n",
        "\n",
        "You can use these solutions to self check your results, or to check how the answer can be obtained if you get stuck."
      ],
      "metadata": {
        "id": "kUG4DNYGb5dF"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "ks6Sh_jBJWdh"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Solution 1"
      ],
      "metadata": {
        "id": "U61tgQaYb8Yt"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def sum_of_generator_outputs(generator, limit):\n",
        "    return sum(next(generator) for _ in range(limit))\n",
        "\n",
        "# Example usage:\n",
        "limit_1 = 5\n",
        "generator_1 = square_root_generator(limit_1)\n",
        "result_1 = sum_of_generator_outputs(generator_1, limit_1)\n",
        "print(f\"The sum of the outputs for limit={limit_1} is: {result_1}\")\n",
        "\n",
        "\n",
        "def nth_yielded_number(generator, n):\n",
        "    for _ in range(n - 1):\n",
        "        next(generator)\n",
        "    return next(generator)\n",
        "\n",
        "# Example usage:\n",
        "n = 13\n",
        "generator_2 = square_root_generator(n)\n",
        "result_2 = nth_yielded_number(generator_2, n)\n",
        "print(f\"The {n}th number yielded is: {result_2}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Roc3y_lSTSfn",
        "outputId": "f03d348e-cdfa-44d0-e5f2-276db6af1cf5"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "The sum of the outputs for limit=5 is: 8.382332347441762\n",
            "The 13th number yielded is: 3.605551275463989\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Solution 2: Append a generator\n",
        "\n",
        "Load your first generator first, and then load the second one using the \"append\" operation. Since they have overlapping IDs, some records will appear multiple times.\n",
        "\n",
        "After loading, you should have a total of 11 records.\n",
        "\n",
        "Question: Calculate the sum of ages of all the people loaded as described above"
      ],
      "metadata": {
        "id": "M3PJYca2TIw8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Importing the DLT library\n",
        "import dlt\n",
        "\n",
        "# Create a DLT pipeline for the first generator `people_1`\n",
        "# The pipeline is set to load data into a DuckDB database with the dataset named 'people'\n",
        "people_1_pipeline = dlt.pipeline(destination='duckdb', dataset_name='people')\n",
        "\n",
        "# Run the pipeline for the first generator, creating or replacing the table 'people'\n",
        "info = people_1_pipeline.run(people_1(),\n",
        "                             table_name=\"people\",\n",
        "                             write_disposition=\"replace\")\n",
        "\n",
        "print(f\"{info}\\n\\n\")\n",
        "\n",
        "\n",
        "# Create a second DLT pipeline for the generator `people_2`, targeting the same DuckDB database and dataset\n",
        "people_2_pipeline = dlt.pipeline(destination='duckdb', dataset_name='people')\n",
        "\n",
        "# Run the second pipeline, appending data from `people_2` to the existing 'people' table\n",
        "info = people_2_pipeline.run(people_2(),\n",
        "                             table_name=\"people\",\n",
        "                             write_disposition=\"append\")\n",
        "\n",
        "print(f\"{info}\\n\\n\")\n",
        "\n",
        "\n",
        "# Importing the DuckDB library\n",
        "import duckdb\n",
        "\n",
        "# Connect to the DuckDB database created by the first generator\n",
        "conn = duckdb.connect(f\"{people_1_pipeline.pipeline_name}.duckdb\")\n",
        "\n",
        "# Setting the search path to the dataset 'people' and displaying available tables\n",
        "conn.sql(f\"SET search_path = '{people_1_pipeline.dataset_name}'\")\n",
        "print('Loaded tables: ')\n",
        "display(conn.sql(\"show tables\"))\n",
        "\n",
        "\n",
        "# Fetching the appended data from the 'people' table and displaying it\n",
        "data = conn.sql(\"SELECT * FROM people\").df()\n",
        "display(data)\n",
        "\n",
        "# Calculate the sum of ages from the combined data of `people_1` and `people_2` in the 'people' table\n",
        "sum_of_ages_p1_p2 = conn.execute(\"SELECT SUM(age) FROM people\").fetchone()[0]\n",
        "print(f\"\\n\\nSum of ages from generators `people_1()` and `people_2()` combined: {sum_of_ages_p1_p2}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 841
        },
        "id": "0u2mtndkTLpk",
        "outputId": "d5d253de-4502-42bf-ac89-08e0a7065d85"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Pipeline dlt_colab_kernel_launcher load step completed in 0.59 seconds\n",
            "1 load package(s) were loaded to destination duckdb and into dataset people\n",
            "The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\n",
            "Load package 1706029306.7456656 is LOADED and contains no failed jobs\n",
            "\n",
            "\n",
            "Pipeline dlt_colab_kernel_launcher load step completed in 0.43 seconds\n",
            "1 load package(s) were loaded to destination duckdb and into dataset people\n",
            "The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\n",
            "Load package 1706029307.9851513 is LOADED and contains no failed jobs\n",
            "\n",
            "\n",
            "Loaded tables: \n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "┌─────────────────────┐\n",
              "│        name         │\n",
              "│       varchar       │\n",
              "├─────────────────────┤\n",
              "│ _dlt_loads          │\n",
              "│ _dlt_pipeline_state │\n",
              "│ _dlt_version        │\n",
              "│ people              │\n",
              "└─────────────────────┘"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "    id      name  age    city        _dlt_load_id         _dlt_id occupation\n",
              "0    1  Person_1   26  City_A  1706029306.7456656  An8WyXL43/J1GQ       None\n",
              "1    2  Person_2   27  City_A  1706029306.7456656  ZGI1S72CddPbJQ       None\n",
              "2    3  Person_3   28  City_A  1706029306.7456656  +z4Pm5oCykL2Vg       None\n",
              "3    4  Person_4   29  City_A  1706029306.7456656  0Vfr36JHZ34OJA       None\n",
              "4    5  Person_5   30  City_A  1706029306.7456656  aA+9WOclw3YWpg       None\n",
              "5    3  Person_3   33  City_B  1706029307.9851513  mEegoM7n4XujYw      Job_3\n",
              "6    4  Person_4   34  City_B  1706029307.9851513  FPrsrzXgz+E9Fw      Job_4\n",
              "7    5  Person_5   35  City_B  1706029307.9851513  ZaAOBa5EEqXU1Q      Job_5\n",
              "8    6  Person_6   36  City_B  1706029307.9851513  gmcktDnX6y4Fmg      Job_6\n",
              "9    7  Person_7   37  City_B  1706029307.9851513  960gdVKySsa4JA      Job_7\n",
              "10   8  Person_8   38  City_B  1706029307.9851513  +su5IfZQyFEsEw      Job_8"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-164dc4c0-056c-460d-b99f-0582206da3c6\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>id</th>\n",
              "      <th>name</th>\n",
              "      <th>age</th>\n",
              "      <th>city</th>\n",
              "      <th>_dlt_load_id</th>\n",
              "      <th>_dlt_id</th>\n",
              "      <th>occupation</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>1</td>\n",
              "      <td>Person_1</td>\n",
              "      <td>26</td>\n",
              "      <td>City_A</td>\n",
              "      <td>1706029306.7456656</td>\n",
              "      <td>An8WyXL43/J1GQ</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>2</td>\n",
              "      <td>Person_2</td>\n",
              "      <td>27</td>\n",
              "      <td>City_A</td>\n",
              "      <td>1706029306.7456656</td>\n",
              "      <td>ZGI1S72CddPbJQ</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>3</td>\n",
              "      <td>Person_3</td>\n",
              "      <td>28</td>\n",
              "      <td>City_A</td>\n",
              "      <td>1706029306.7456656</td>\n",
              "      <td>+z4Pm5oCykL2Vg</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>4</td>\n",
              "      <td>Person_4</td>\n",
              "      <td>29</td>\n",
              "      <td>City_A</td>\n",
              "      <td>1706029306.7456656</td>\n",
              "      <td>0Vfr36JHZ34OJA</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>5</td>\n",
              "      <td>Person_5</td>\n",
              "      <td>30</td>\n",
              "      <td>City_A</td>\n",
              "      <td>1706029306.7456656</td>\n",
              "      <td>aA+9WOclw3YWpg</td>\n",
              "      <td>None</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>3</td>\n",
              "      <td>Person_3</td>\n",
              "      <td>33</td>\n",
              "      <td>City_B</td>\n",
              "      <td>1706029307.9851513</td>\n",
              "      <td>mEegoM7n4XujYw</td>\n",
              "      <td>Job_3</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>4</td>\n",
              "      <td>Person_4</td>\n",
              "      <td>34</td>\n",
              "      <td>City_B</td>\n",
              "      <td>1706029307.9851513</td>\n",
              "      <td>FPrsrzXgz+E9Fw</td>\n",
              "      <td>Job_4</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>5</td>\n",
              "      <td>Person_5</td>\n",
              "      <td>35</td>\n",
              "      <td>City_B</td>\n",
              "      <td>1706029307.9851513</td>\n",
              "      <td>ZaAOBa5EEqXU1Q</td>\n",
              "      <td>Job_5</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>6</td>\n",
              "      <td>Person_6</td>\n",
              "      <td>36</td>\n",
              "      <td>City_B</td>\n",
              "      <td>1706029307.9851513</td>\n",
              "      <td>gmcktDnX6y4Fmg</td>\n",
              "      <td>Job_6</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>7</td>\n",
              "      <td>Person_7</td>\n",
              "      <td>37</td>\n",
              "      <td>City_B</td>\n",
              "      <td>1706029307.9851513</td>\n",
              "      <td>960gdVKySsa4JA</td>\n",
              "      <td>Job_7</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>10</th>\n",
              "      <td>8</td>\n",
              "      <td>Person_8</td>\n",
              "      <td>38</td>\n",
              "      <td>City_B</td>\n",
              "      <td>1706029307.9851513</td>\n",
              "      <td>+su5IfZQyFEsEw</td>\n",
              "      <td>Job_8</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-164dc4c0-056c-460d-b99f-0582206da3c6')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-164dc4c0-056c-460d-b99f-0582206da3c6 button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-164dc4c0-056c-460d-b99f-0582206da3c6');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-d353cda7-9937-430a-a4e2-605b8f9fa6ab\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-d353cda7-9937-430a-a4e2-605b8f9fa6ab')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-d353cda7-9937-430a-a4e2-605b8f9fa6ab button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\n",
            "\n",
            "Sum of ages from generators `people_1()` and `people_2()` combined: 353\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Solution 3: Merge a generator\n",
        "\n",
        "A table's primary key needs to be created from the start, so load your data to a new table with primary key ID.\n",
        "\n",
        "Load your first generator first, and then load the second one with merge. Since they have overlapping IDs, some of the records from the first load should be replaced by the ones from the second load.\n",
        "\n",
        "After loading, you should have a total of 8 records, and ID 3 should have age 33."
      ],
      "metadata": {
        "id": "G-T-jR9qlzdB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import dlt\n",
        "\n",
        "# Set up a DLT pipeline.\n",
        "# Currently using DuckDB for local testing, but it can be switched to BigQuery for production.\n",
        "generators_pipeline = dlt.pipeline(destination='duckdb', dataset_name='people_merge')\n",
        "\n",
        "# Load data from the first generator `people_1` into 'people_merge' table.\n",
        "# This operation will replace any existing data in the table.\n",
        "# A primary key 'ID' is specified for potential future merge operations.\n",
        "info = generators_pipeline.run(people_1(),\n",
        "                               table_name=\"people_v2\",\n",
        "                               write_disposition=\"replace\",\n",
        "                               primary_key=\"ID\")\n",
        "\n",
        "# Print metadata of the loading process for the first generator.\n",
        "print(f\"{info}\\n\\n\")\n",
        "\n",
        "# Load data from the second generator `people_2` into the same 'people_merge' table.\n",
        "# This operation will merge the new data with existing data based on the primary key 'ID'.\n",
        "info = generators_pipeline.run(people_2(),\n",
        "                               table_name=\"people_merged\",\n",
        "                               write_disposition=\"merge\",\n",
        "                               primary_key=\"ID\")\n",
        "\n",
        "# Print metadata of the loading process for the second generator.\n",
        "print(f\"{info}\\n\\n\")\n",
        "\n",
        "import duckdb\n",
        "\n",
        "# Establish a connection to the DuckDB database created by the pipeline.\n",
        "conn = duckdb.connect(f\"{generators_pipeline.pipeline_name}.duckdb\")\n",
        "\n",
        "# Set the search path to the dataset 'people_merge' and display the available tables.\n",
        "conn.sql(f\"SET search_path = '{generators_pipeline.dataset_name}'\")\n",
        "print('Loaded tables: ')\n",
        "display(conn.sql(\"show tables\"))\n",
        "\n",
        "# Display the merged data from the 'people_merged' table.\n",
        "print(\"\\n\\n\\nData from the 'people_merged' table:\")\n",
        "data = conn.sql(\"SELECT * FROM people_merged\").df()\n",
        "display(data)\n",
        "\n",
        "# Calculate and display the sum of ages from the merged data in 'people_merged' table.\n",
        "sum_of_ages_p1_p2 = conn.execute(\"SELECT SUM(age) FROM people_merged\").fetchone()[0]\n",
        "print(f\"\\n\\nSum of ages of people in generator `people_1()` merged with generator `people_2()` is: {sum_of_ages_p1_p2}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 773
        },
        "id": "rXR-IN85kBtq",
        "outputId": "c74a7ab7-aa77-4445-c2bc-e782054a7201"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Pipeline dlt_colab_kernel_launcher load step completed in 0.24 seconds\n",
            "1 load package(s) were loaded to destination duckdb and into dataset people_merge\n",
            "The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\n",
            "Load package 1706030294.0522 is LOADED and contains no failed jobs\n",
            "\n",
            "\n",
            "Pipeline dlt_colab_kernel_launcher load step completed in 0.42 seconds\n",
            "1 load package(s) were loaded to destination duckdb and into dataset people_merge\n",
            "The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\n",
            "Load package 1706030294.7037766 is LOADED and contains no failed jobs\n",
            "\n",
            "\n",
            "Loaded tables: \n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "┌─────────────────────┐\n",
              "│        name         │\n",
              "│       varchar       │\n",
              "├─────────────────────┤\n",
              "│ _dlt_loads          │\n",
              "│ _dlt_pipeline_state │\n",
              "│ _dlt_version        │\n",
              "│ people_merged       │\n",
              "│ people_v2           │\n",
              "└─────────────────────┘"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\n",
            "\n",
            "\n",
            "Data from the 'people_merged' table:\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "   id      name  age    city occupation        _dlt_load_id         _dlt_id\n",
              "0   8  Person_8   38  City_B      Job_8  1706030294.7037766  Q1k+DIAjXLL7cg\n",
              "1   4  Person_4   34  City_B      Job_4  1706030294.7037766  ewlZ3LjULEchiQ\n",
              "2   5  Person_5   35  City_B      Job_5  1706030294.7037766  X+LfQEa/X8GU9w\n",
              "3   7  Person_7   37  City_B      Job_7  1706030294.7037766  lQT0h7IL7E/wxg\n",
              "4   3  Person_3   33  City_B      Job_3  1706030294.7037766  gRBswCo8B/DJmw\n",
              "5   6  Person_6   36  City_B      Job_6  1706030294.7037766  M3IbNKfZZCtbcQ"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-2f5274be-509c-41be-924d-49590376474d\" class=\"colab-df-container\">\n",
              "    <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>id</th>\n",
              "      <th>name</th>\n",
              "      <th>age</th>\n",
              "      <th>city</th>\n",
              "      <th>occupation</th>\n",
              "      <th>_dlt_load_id</th>\n",
              "      <th>_dlt_id</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>8</td>\n",
              "      <td>Person_8</td>\n",
              "      <td>38</td>\n",
              "      <td>City_B</td>\n",
              "      <td>Job_8</td>\n",
              "      <td>1706030294.7037766</td>\n",
              "      <td>Q1k+DIAjXLL7cg</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>4</td>\n",
              "      <td>Person_4</td>\n",
              "      <td>34</td>\n",
              "      <td>City_B</td>\n",
              "      <td>Job_4</td>\n",
              "      <td>1706030294.7037766</td>\n",
              "      <td>ewlZ3LjULEchiQ</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>5</td>\n",
              "      <td>Person_5</td>\n",
              "      <td>35</td>\n",
              "      <td>City_B</td>\n",
              "      <td>Job_5</td>\n",
              "      <td>1706030294.7037766</td>\n",
              "      <td>X+LfQEa/X8GU9w</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>7</td>\n",
              "      <td>Person_7</td>\n",
              "      <td>37</td>\n",
              "      <td>City_B</td>\n",
              "      <td>Job_7</td>\n",
              "      <td>1706030294.7037766</td>\n",
              "      <td>lQT0h7IL7E/wxg</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>3</td>\n",
              "      <td>Person_3</td>\n",
              "      <td>33</td>\n",
              "      <td>City_B</td>\n",
              "      <td>Job_3</td>\n",
              "      <td>1706030294.7037766</td>\n",
              "      <td>gRBswCo8B/DJmw</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>6</td>\n",
              "      <td>Person_6</td>\n",
              "      <td>36</td>\n",
              "      <td>City_B</td>\n",
              "      <td>Job_6</td>\n",
              "      <td>1706030294.7037766</td>\n",
              "      <td>M3IbNKfZZCtbcQ</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    <div class=\"colab-df-buttons\">\n",
              "\n",
              "  <div class=\"colab-df-container\">\n",
              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2f5274be-509c-41be-924d-49590376474d')\"\n",
              "            title=\"Convert this dataframe to an interactive table.\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
              "  </svg>\n",
              "    </button>\n",
              "\n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    .colab-df-buttons div {\n",
              "      margin-bottom: 4px;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "    <script>\n",
              "      const buttonEl =\n",
              "        document.querySelector('#df-2f5274be-509c-41be-924d-49590376474d button.colab-df-convert');\n",
              "      buttonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "      async function convertToInteractive(key) {\n",
              "        const element = document.querySelector('#df-2f5274be-509c-41be-924d-49590376474d');\n",
              "        const dataTable =\n",
              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                    [key], {});\n",
              "        if (!dataTable) return;\n",
              "\n",
              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "          + ' to learn more about interactive tables.';\n",
              "        element.innerHTML = '';\n",
              "        dataTable['output_type'] = 'display_data';\n",
              "        await google.colab.output.renderOutput(dataTable, element);\n",
              "        const docLink = document.createElement('div');\n",
              "        docLink.innerHTML = docLinkHtml;\n",
              "        element.appendChild(docLink);\n",
              "      }\n",
              "    </script>\n",
              "  </div>\n",
              "\n",
              "\n",
              "<div id=\"df-59a3fb69-8001-41be-ac63-c616dc356aab\">\n",
              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-59a3fb69-8001-41be-ac63-c616dc356aab')\"\n",
              "            title=\"Suggest charts\"\n",
              "            style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "  </button>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "      --bg-color: #E8F0FE;\n",
              "      --fill-color: #1967D2;\n",
              "      --hover-bg-color: #E2EBFA;\n",
              "      --hover-fill-color: #174EA6;\n",
              "      --disabled-fill-color: #AAA;\n",
              "      --disabled-bg-color: #DDD;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "      --bg-color: #3B4455;\n",
              "      --fill-color: #D2E3FC;\n",
              "      --hover-bg-color: #434B5C;\n",
              "      --hover-fill-color: #FFFFFF;\n",
              "      --disabled-bg-color: #3B4455;\n",
              "      --disabled-fill-color: #666;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart {\n",
              "    background-color: var(--bg-color);\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: var(--fill-color);\n",
              "    height: 32px;\n",
              "    padding: 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: var(--hover-bg-color);\n",
              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: var(--button-hover-fill-color);\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart-complete:disabled,\n",
              "  .colab-df-quickchart-complete:disabled:hover {\n",
              "    background-color: var(--disabled-bg-color);\n",
              "    fill: var(--disabled-fill-color);\n",
              "    box-shadow: none;\n",
              "  }\n",
              "\n",
              "  .colab-df-spinner {\n",
              "    border: 2px solid var(--fill-color);\n",
              "    border-color: transparent;\n",
              "    border-bottom-color: var(--fill-color);\n",
              "    animation:\n",
              "      spin 1s steps(1) infinite;\n",
              "  }\n",
              "\n",
              "  @keyframes spin {\n",
              "    0% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "      border-left-color: var(--fill-color);\n",
              "    }\n",
              "    20% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    30% {\n",
              "      border-color: transparent;\n",
              "      border-left-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    40% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-top-color: var(--fill-color);\n",
              "    }\n",
              "    60% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "    }\n",
              "    80% {\n",
              "      border-color: transparent;\n",
              "      border-right-color: var(--fill-color);\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "    90% {\n",
              "      border-color: transparent;\n",
              "      border-bottom-color: var(--fill-color);\n",
              "    }\n",
              "  }\n",
              "</style>\n",
              "\n",
              "  <script>\n",
              "    async function quickchart(key) {\n",
              "      const quickchartButtonEl =\n",
              "        document.querySelector('#' + key + ' button');\n",
              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
              "      try {\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      } catch (error) {\n",
              "        console.error('Error during call to suggestCharts:', error);\n",
              "      }\n",
              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
              "    }\n",
              "    (() => {\n",
              "      let quickchartButtonEl =\n",
              "        document.querySelector('#df-59a3fb69-8001-41be-ac63-c616dc356aab button');\n",
              "      quickchartButtonEl.style.display =\n",
              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "    })();\n",
              "  </script>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n"
            ]
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\n",
            "\n",
            "Sum of ages of people in generator `people_1()` merged with generator `people_2()` is: 213\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "TApfkuNKtlt3"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}

================================================
FILE: cohorts/2024/workshops/dlt_resources/homework_starter.ipynb
================================================
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# **Homework**: Data talks club data engineering zoomcamp Data loading workshop\n",
        "\n",
        "Hello folks, let's practice what we learned - Loading data with the best practices of data engineering.\n",
        "\n",
        "Here are the exercises we will do\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "mrTFv5nPClXh"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 1. Use a generator\n",
        "\n",
        "Remember the concept of generator? Let's practice using them to futher our understanding of how they work.\n",
        "\n",
        "Let's define a generator and then run it as practice.\n",
        "\n",
        "**Answer the following questions:**\n",
        "\n",
        "- **Question 1: What is the sum of the outputs of the generator for limit = 5?**\n",
        "- **Question 2: What is the 13th number yielded**\n",
        "\n",
        "I suggest practicing these questions without GPT as the purpose is to further your learning."
      ],
      "metadata": {
        "id": "wLF4iXf-NR7t"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def square_root_generator(limit):\n",
        "    n = 1\n",
        "    while n <= limit:\n",
        "        yield n ** 0.5\n",
        "        n += 1\n",
        "\n",
        "# Example usage:\n",
        "limit = 5\n",
        "generator = square_root_generator(limit)\n",
        "\n",
        "for sqrt_value in generator:\n",
        "    print(sqrt_value)\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wLng-bDJN4jf",
        "outputId": "547683cb-5f56-4815-a903-d0d9578eb1f9"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "1.0\n",
            "1.4142135623730951\n",
            "1.7320508075688772\n",
            "2.0\n",
            "2.23606797749979\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [],
      "metadata": {
        "id": "xbe3q55zN43j"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 2. Append a generator to a table with existing data\n",
        "\n",
        "\n",
        "Below you have 2 generators. You will be tasked to load them to duckdb and answer some questions from the data\n",
        "\n",
        "1. Load the first generator and calculate the sum of ages of all people. Make sure to only load it once.\n",
        "2. Append the second generator to the same table as the first.\n",
        "3. **After correctly appending the data, calculate the sum of all ages of people.**\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "vjWhILzGJMpK"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2MoaQcdLBEk6",
        "outputId": "d2b93dc1-d83f-44ea-aeff-fdf51d75f7aa"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{'ID': 1, 'Name': 'Person_1', 'Age': 26, 'City': 'City_A'}\n",
            "{'ID': 2, 'Name': 'Person_2', 'Age': 27, 'City': 'City_A'}\n",
            "{'ID': 3, 'Name': 'Person_3', 'Age': 28, 'City': 'City_A'}\n",
            "{'ID': 4, 'Name': 'Person_4', 'Age': 29, 'City': 'City_A'}\n",
            "{'ID': 5, 'Name': 'Person_5', 'Age': 30, 'City': 'City_A'}\n",
            "{'ID': 3, 'Name': 'Person_3', 'Age': 33, 'City': 'City_B', 'Occupation': 'Job_3'}\n",
            "{'ID': 4, 'Name': 'Person_4', 'Age': 34, 'City': 'City_B', 'Occupation': 'Job_4'}\n",
            "{'ID': 5, 'Name': 'Person_5', 'Age': 35, 'City': 'City_B', 'Occupation': 'Job_5'}\n",
            "{'ID': 6, 'Name': 'Person_6', 'Age': 36, 'City': 'City_B', 'Occupation': 'Job_6'}\n",
            "{'ID': 7, 'Name': 'Person_7', 'Age': 37, 'City': 'City_B', 'Occupation': 'Job_7'}\n",
            "{'ID': 8, 'Name': 'Person_8', 'Age': 38, 'City': 'City_B', 'Occupation': 'Job_8'}\n"
          ]
        }
      ],
      "source": [
        "def people_1():\n",
        "    for i in range(1, 6):\n",
        "        yield {\"ID\": i, \"Name\": f\"Person_{i}\", \"Age\": 25 + i, \"City\": \"City_A\"}\n",
        "\n",
        "for person in people_1():\n",
        "    print(person)\n",
        "\n",
        "\n",
        "def people_2():\n",
        "    for i in range(3, 9):\n",
        "        yield {\"ID\": i, \"Name\": f\"Person_{i}\", \"Age\": 30 + i, \"City\": \"City_B\", \"Occupation\": f\"Job_{i}\"}\n",
        "\n",
        "\n",
        "for person in people_2():\n",
        "    print(person)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [],
      "metadata": {
        "id": "vtdTIm4fvQCN"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 3. Merge a generator\n",
        "\n",
        "Re-use the generators from Exercise 2.\n",
        "\n",
        "A table's primary key needs to be created from the start, so load your data to a new table with primary key ID.\n",
        "\n",
        "Load your first generator first, and then load the second one with merge. Since they have overlapping IDs, some of the records from the first load should be replaced by the ones from the second load.\n",
        "\n",
        "After loading, you should have a total of 8 records, and ID 3 should have age 33.\n",
        "\n",
        "Question: **Calculate the sum of ages of all the people loaded as described above.**\n"
      ],
      "metadata": {
        "id": "pY4cFAWOSwN1"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Solution: First make sure that the following modules are installed:"
      ],
      "metadata": {
        "id": "kKB2GTB9oVjr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "#Install the dependencies\n",
        "%%capture\n",
        "!pip install dlt[duckdb]"
      ],
      "metadata": {
        "id": "xTVvtyqrfVNq"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# to do: homework :)"
      ],
      "metadata": {
        "id": "a2-PRBAkGC2K"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Questions? difficulties? We are here to help.\n",
        "- DTC data engineering course channel: https://datatalks-club.slack.com/archives/C01FABYF2RG\n",
        "- dlt's DTC cohort channel: https://dlthub-community.slack.com/archives/C06GAEX2VNX"
      ],
      "metadata": {
        "id": "PoTJu4kbGG0z"
      }
    }
  ]
}

================================================
FILE: cohorts/2024/workshops/dlt_resources/workshop.ipynb
================================================
[File too large to display: 10.7 MB]

================================================
FILE: cohorts/2024/workshops/rising-wave.md
================================================
<p align="center">
  <picture>
    <source srcset="https://github.com/risingwavelabs/risingwave/blob/main/.github/RisingWave-logo-dark.svg" width="500px" media="(prefers-color-scheme: dark)">
    <img src="https://github.com/risingwavelabs/risingwave/blob/main/.github/RisingWave-logo-light.svg" width="500px">
  </picture>
</p>


</div>

<p align="center">
  <a
    href="https://docs.risingwave.com/"
    target="_blank"
  ><b>Documentation</b></a>&nbsp;&nbsp;&nbsp;📑&nbsp;&nbsp;&nbsp;
  <a
    href="https://tutorials.risingwave.com/"
    target="_blank"
  ><b>Hands-on Tutorials</b></a>&nbsp;&nbsp;&nbsp;🎯&nbsp;&nbsp;&nbsp;
  <a
    href="https://cloud.risingwave.com/"
    target="_blank"
  ><b>RisingWave Cloud</b></a>&nbsp;&nbsp;&nbsp;🚀&nbsp;&nbsp;&nbsp;
  <a
    href="https://risingwave.com/slack"
    target="_blank"
  >
    <b>Get Instant Help</b>
  </a>
</p>
<div align="center">
  <a
    href="https://risingwave.com/slack"
    target="_blank"
  >
    <img alt="Slack" src="https://badgen.net/badge/Slack/Join%20RisingWave/0abd59?icon=slack" />
  </a>
  <a
    href="https://twitter.com/risingwavelabs"
    target="_blank"
  >
    <img alt="X" src="https://img.shields.io/twitter/follow/risingwavelabs" />
  </a>
  <a
    href="https://www.youtube.com/@risingwave-labs"
    target="_blank"
  >
    <img alt="YouTube" src="https://img.shields.io/youtube/channel/views/UCsHwdyBRxBpmkA5RRd0YNEA" />
  </a>
</div>

## Stream processing with RisingWave

In this hands-on workshop, we’ll learn how to process real-time streaming data using SQL. The system we’ll use is [RisingWave](https://github.com/risingwavelabs/risingwave), an open-source SQL database for processing and managing streaming data. RisingWave’s user experience should feel familiar, as it’s fully wire-compatible with PostgreSQL.

![RisingWave](https://raw.githubusercontent.com/risingwavelabs/risingwave-docs/main/docs/images/new_archi_grey.png)


We’ll cover the following topics in this workshop:

- Why Stream Processing?
- Stateless computation (Filters, Projections)
- Stateful Computation (Aggregations, Joins)
- Data Ingestion and Delivery

RisingWave in 10 Minutes:
https://tutorials.risingwave.com/docs/intro

Workshop video:

<a href="https://youtube.com/live/L2BHFnZ6XjE">
  <img src="https://markdown-videos-api.jorgenkh.no/youtube/L2BHFnZ6XjE" />
</a>

[Project Repository](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04)

## Homework

**Please set up the environment in [Getting Started](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04?tab=readme-ov-file#getting-started) and for the [Homework](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04/blob/main/homework.md#setting-up) first.**

### Question 0

_This question is just a warm-up to introduce the dynamic filter pattern. Please attempt it before viewing its solution._

What are the dropoff taxi zones at the latest dropoff times?

For this part, we will use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/).

<details>
<summary>Solution</summary>

```sql
CREATE MATERIALIZED VIEW latest_dropoff_time AS
    WITH t AS (
        SELECT MAX(tpep_dropoff_datetime) AS latest_dropoff_time
        FROM trip_data
    )
    SELECT taxi_zone.Zone as taxi_zone, latest_dropoff_time
    FROM t,
            trip_data
    JOIN taxi_zone
        ON trip_data.DOLocationID = taxi_zone.location_id
    WHERE trip_data.tpep_dropoff_datetime = t.latest_dropoff_time;

--    taxi_zone    | latest_dropoff_time
-- ----------------+---------------------
--  Midtown Center | 2022-01-03 17:24:54
-- (1 row)
```

</details>

### Question 1

Create a materialized view to compute the average, min and max trip time **between each taxi zone**.

Note that we do not consider `a->b` and `b->a` as the same trip pair.
So as an example, you would consider the following trip pairs as different pairs:
```plaintext
Yorkville East -> Steinway
Steinway -> Yorkville East
```

From this MV, find the pair of taxi zones with the highest average trip time.
You may need to use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/) for this.

Bonus (no marks): Create an MV which can identify anomalies in the data. For example, flag a pair of zones whose average trip time is 1 minute
but whose max trip time is 10 or 20 minutes.

Options:
1. Yorkville East, Steinway
2. Murray Hill, Midwood
3. East Flatbush/Farragut, East Harlem North
4. Midtown Center, University Heights/Morris Heights

P.S. The trip time between taxi zones does not take symmetry into account, i.e. `A -> B` and `B -> A` are considered different trips. This applies to subsequent questions as well.
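
To make the directional grouping concrete, here is a minimal Python sketch (not the RisingWave solution). The trips and their durations are made-up toy values, and `trip_time_stats` is a hypothetical helper; the point is only that an ordered `(pickup, dropoff)` key keeps `A -> B` and `B -> A` apart.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy trips: (pickup_zone, dropoff_zone, trip_time_minutes).
trips = [
    ("Yorkville East", "Steinway", 20),
    ("Yorkville East", "Steinway", 30),
    ("Steinway", "Yorkville East", 45),
]


def trip_time_stats(trips):
    """Group trip times by the ordered (pickup, dropoff) pair."""
    by_pair = defaultdict(list)
    for pickup, dropoff, minutes in trips:
        # The tuple is ordered, so A -> B and B -> A stay separate.
        by_pair[(pickup, dropoff)].append(minutes)
    return {
        pair: {
            "avg": mean(times),
            "min": min(times),
            "max": max(times),
            "trips": len(times),
        }
        for pair, times in by_pair.items()
    }


stats = trip_time_stats(trips)
for pair, values in stats.items():
    print(pair, values)
```

In SQL, the same effect comes from grouping by both the pickup zone and the dropoff zone rather than by an unordered pair.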

### Question 2

Recreate the MV(s) in question 1, to also find the **number of trips** for the pair of taxi zones with the highest average trip time.

Options:
1. 5
2. 3
3. 10
4. 1

### Question 3

From the latest pickup time to 17 hours before, what are the top 3 busiest zones in terms of number of pickups?
For example if the latest pickup time is 2020-01-01 17:00:00,
then the query should return the top 3 busiest zones from 2020-01-01 00:00:00 to 2020-01-01 17:00:00.

HINT: You can use [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/)
to create a filter condition based on the latest pickup time.

NOTE: For this question `17 hours` was picked to ensure we have enough data to work with.
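
The window arithmetic from the example above can be checked with a few lines of Python. This is only an illustration of the time range; `pickup_window` is a hypothetical helper, not part of the workshop code.

```python
from datetime import datetime, timedelta


def pickup_window(latest_pickup, hours=17):
    """Return (start, end) of the window that ends at the latest pickup time."""
    return latest_pickup - timedelta(hours=hours), latest_pickup


start, end = pickup_window(datetime(2020, 1, 1, 17, 0, 0))
print(start, "->", end)  # 2020-01-01 00:00:00 -> 2020-01-01 17:00:00
```

In the materialized view, the latest pickup time is not a constant, which is why the dynamic filter pattern is suggested.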

Options:
1. Clinton East, Upper East Side North, Penn Station
2. LaGuardia Airport, Lincoln Square East, JFK Airport
3. Midtown Center, Upper East Side South, Upper East Side North
4. LaGuardia Airport, Midtown Center, Upper East Side North


## Submitting the solutions

- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/workshop2
- Deadline: 11 March (Monday), 23:00 CET 

## Rewards 🥳

Everyone who completes the homework will get a pen and a sticker, and 5 lucky winners will receive a T-shirt and other secret surprises!
We encourage you to share your achievements with this workshop on your socials and look forward to your submissions 😁

- Follow us on **LinkedIn**: https://www.linkedin.com/company/risingwave
- Follow us on **GitHub**: https://github.com/risingwavelabs/risingwave
- Join us on **Slack**: https://risingwave-labs.com/slack

See you around!


## Solution


================================================
FILE: cohorts/2025/01-docker-terraform/homework.md
================================================
# Module 1 Homework: Docker & SQL

In this homework we'll prepare the environment and practice
Docker and SQL.

When submitting your homework, you will also need to include
a link to your GitHub repository or other public code-hosting
site.

This repository should contain the code for solving the homework. 

When your solution consists of SQL or shell commands rather than
code files (e.g. Python files), include them directly in
the README file of your repository.


## Question 1. Understanding Docker first run

Run Docker with the `python:3.12.8` image in interactive mode, using the entrypoint `bash`.

What's the version of `pip` in the image?

- 24.3.1
- 24.2.1
- 23.3.1
- 23.2.1


## Question 2. Understanding Docker networking and docker-compose

Given the following `docker-compose.yaml`, what is the `hostname` and `port` that **pgadmin** should use to connect to the postgres database?

```yaml
services:
  db:
    container_name: postgres
    image: postgres:17-alpine
    environment:
      POSTGRES_USER: 'postgres'
      POSTGRES_PASSWORD: 'postgres'
      POSTGRES_DB: 'ny_taxi'
    ports:
      - '5433:5432'
    volumes:
      - vol-pgdata:/var/lib/postgresql/data

  pgadmin:
    container_name: pgadmin
    image: dpage/pgadmin4:latest
    environment:
      PGADMIN_DEFAULT_EMAIL: "pgadmin@pgadmin.com"
      PGADMIN_DEFAULT_PASSWORD: "pgadmin"
    ports:
      - "8080:80"
    volumes:
      - vol-pgadmin_data:/var/lib/pgadmin  

volumes:
  vol-pgdata:
    name: vol-pgdata
  vol-pgadmin_data:
    name: vol-pgadmin_data
```

- postgres:5433
- localhost:5432
- db:5433
- postgres:5432
- db:5432

If there is more than one correct answer, select only one of them.

##  Prepare Postgres

Run Postgres and load data as shown in the videos.
We'll use the green taxi trips from October 2019:

```bash
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz
```

You will also need the dataset with zones:

```bash
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv
```

Download this data and put it into Postgres.

You can use the code from the course. It's up to you whether
you want to use Jupyter or a python script.

## Question 3. Trip Segmentation Count

During the period between October 1st 2019 (inclusive) and November 1st 2019 (exclusive), how many trips, **respectively**, happened:
1. Up to 1 mile
2. In between 1 (exclusive) and 3 miles (inclusive),
3. In between 3 (exclusive) and 7 miles (inclusive),
4. In between 7 (exclusive) and 10 miles (inclusive),
5. Over 10 miles 

Answers:

- 104,802;  197,670;  110,612;  27,831;  35,281
- 104,802;  198,924;  109,603;  27,678;  35,189
- 104,793;  201,407;  110,612;  27,831;  35,281
- 104,793;  202,661;  109,603;  27,678;  35,189
- 104,838;  199,013;  109,645;  27,688;  35,202
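
The inclusive and exclusive bounds are easy to get wrong, so here is a small Python sketch of the five segments exactly as worded in the question. `distance_segment` is a hypothetical helper for checking the boundaries; it does not compute the answer, and the same logic would typically be a `CASE` expression in SQL.

```python
def distance_segment(trip_distance):
    """Map a trip distance in miles to one of the five segments in the question."""
    if trip_distance <= 1:
        return "up to 1 mile"
    if trip_distance <= 3:  # 1 (exclusive) to 3 (inclusive)
        return "1 to 3 miles"
    if trip_distance <= 7:  # 3 (exclusive) to 7 (inclusive)
        return "3 to 7 miles"
    if trip_distance <= 10:  # 7 (exclusive) to 10 (inclusive)
        return "7 to 10 miles"
    return "over 10 miles"


for miles in (1, 1.01, 3, 7, 10, 10.5):
    print(miles, "->", distance_segment(miles))
```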


## Question 4. Longest trip for each day

Which was the pick up day with the longest trip distance?
Use the pick up time for your calculations.

Tip: For every day, we only care about one single trip with the longest distance. 

- 2019-10-11
- 2019-10-24
- 2019-10-26
- 2019-10-31


## Question 5. Three biggest pickup zones

Which were the top pickup locations with over 13,000 in
`total_amount` (across all trips) for 2019-10-18?

Consider only `lpep_pickup_datetime` when filtering by date.
 
- East Harlem North, East Harlem South, Morningside Heights
- East Harlem North, Morningside Heights
- Morningside Heights, Astoria Park, East Harlem South
- Bedford, East Harlem North, Astoria Park


## Question 6. Largest tip

For the passengers picked up in October 2019 in the zone
named "East Harlem North" which was the drop off zone that had
the largest tip?

Note: it's `tip` , not `trip`

We need the name of the zone, not the ID.

- Yorkville West
- JFK Airport
- East Harlem North
- East Harlem South


## Terraform

In this section of the homework we'll prepare the environment by creating resources in GCP with Terraform.

In your VM on GCP/Laptop/GitHub Codespace install Terraform. 
Copy the files from the course repo
[here](../../../01-docker-terraform/terraform/terraform) to your VM/Laptop/GitHub Codespace.

Modify the files as necessary to create a GCP Bucket and Big Query Dataset.


## Question 7. Terraform Workflow

Which of the following sequences, **respectively**, describes the workflow for: 
1. Downloading the provider plugins and setting up backend,
2. Generating proposed changes and auto-executing the plan
3. Removing all resources managed by Terraform

Answers:
- terraform import, terraform apply -y, terraform destroy
- terraform init, terraform plan -auto-apply, terraform rm
- terraform init, terraform run -auto-approve, terraform destroy
- terraform init, terraform apply -auto-approve, terraform destroy
- terraform import, terraform apply -y, terraform rm


## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw1


================================================
FILE: cohorts/2025/02-workflow-orchestration/README.md
================================================
# Workflow Orchestration

Welcome to Module 2 of the Data Engineering Zoomcamp! This week, we’ll dive into workflow orchestration using [Kestra](https://go.kestra.io/de-zoomcamp/github). 

Kestra is an open-source, event-driven orchestration platform that simplifies building both scheduled and event-driven workflows. By adopting Infrastructure as Code practices for data and process orchestration, Kestra enables you to build reliable workflows with just a few lines of YAML.

> [!NOTE]  
> You can find all videos for this week in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist).

---

# Course Structure

## 1. Conceptual Material: Introduction to Orchestration and Kestra

In this section, you’ll learn the foundations of workflow orchestration, its importance, and how Kestra fits into the orchestration landscape.

### Videos
- **2.2.1 - Introduction to Workflow Orchestration**  
  [![2.2.1 - Workflow Orchestration Introduction](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FNp6QmmcgLCs)](https://youtu.be/Np6QmmcgLCs)

- **2.2.2 - Learn the Concepts of Kestra**  
  [![Learn Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fo79n-EVpics)](https://youtu.be/o79n-EVpics)

### Resources
- [Quickstart Guide](https://go.kestra.io/de-zoomcamp/quickstart)
- [Install Kestra with Docker Compose](https://go.kestra.io/de-zoomcamp/docker-compose)
- [Tutorial](https://go.kestra.io/de-zoomcamp/tutorial)
- [What is an Orchestrator?](https://go.kestra.io/de-zoomcamp/what-is-an-orchestrator)

---

## 2. Hands-On Coding Project: Build Data Pipelines with Kestra

This week, we'll build ETL pipelines for Yellow and Green Taxi data from NYC’s Taxi and Limousine Commission (TLC). You will:
1. Extract data from [CSV files](https://github.com/DataTalksClub/nyc-tlc-data/releases).
2. Load it into Postgres or Google Cloud (GCS + BigQuery).
3. Explore scheduling and backfilling workflows.

> [!NOTE]  
> If you’re using the PostgreSQL and pgAdmin Docker setup from Module 1 for this week’s Kestra Workflow Orchestration exercise, ensure your PostgreSQL image version is 15 or later (preferably the latest). The MERGE statement, introduced in PostgreSQL 15, won’t work on earlier versions and will likely cause syntax errors in your Kestra flows.

### File Structure

The project is organized as follows:
```
.
├── flows/
│   ├── 01_getting_started_data_pipeline.yaml
│   ├── 02_postgres_taxi.yaml
│   ├── 02_postgres_taxi_scheduled.yaml
│   ├── 03_postgres_dbt.yaml
│   ├── 04_gcp_kv.yaml
│   ├── 05_gcp_setup.yaml
│   ├── 06_gcp_taxi.yaml
│   ├── 06_gcp_taxi_scheduled.yaml
│   └── 07_gcp_dbt.yaml
```

### Setup Kestra

We'll set up Kestra using a Docker Compose file that contains one container for the Kestra server and another for the Postgres database:

```bash
cd 02-workflow-orchestration/docker/combined
docker compose up -d
```

Once the container starts, you can access the Kestra UI at [http://localhost:8080](http://localhost:8080).

If you prefer to add flows programmatically using Kestra's API, run the following commands:

```bash
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/01_getting_started_data_pipeline.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/02_postgres_taxi.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/02_postgres_taxi_scheduled.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/03_postgres_dbt.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/04_gcp_kv.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/05_gcp_setup.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/06_gcp_taxi.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/06_gcp_taxi_scheduled.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/07_gcp_dbt.yaml
```
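
If you'd rather not repeat the command for every file, the same calls can be looped from Python. This is a minimal sketch, assuming `curl` is installed and that you run it from the directory containing `flows/`; `build_import_command` and `import_flows` are hypothetical helper names, while the endpoint and the `fileUpload` form field are the ones used in the commands above.

```python
import subprocess
from pathlib import Path

IMPORT_URL = "http://localhost:8080/api/v1/flows/import"


def build_import_command(flow_file):
    """Build the same curl call as above for a single flow file."""
    return ["curl", "-X", "POST", IMPORT_URL, "-F", f"fileUpload=@{flow_file}"]


def import_flows(flows_dir="flows"):
    """Import every YAML flow in flows_dir, in file-name order."""
    for flow_file in sorted(Path(flows_dir).glob("*.yaml")):
        # check=True raises if curl itself fails (e.g. Kestra is not reachable).
        subprocess.run(build_import_command(flow_file), check=True)


print(build_import_command("flows/02_postgres_taxi.yaml"))
```

Calling `import_flows()` once Kestra is up imports all flows in one go. Note that `curl` exits successfully on HTTP error responses unless you add a flag such as `--fail`, so check the output.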

---

## 3. ETL Pipelines in Kestra: Detailed Walkthrough

### Getting Started Pipeline

This introductory flow is added just to demonstrate a simple data pipeline which extracts data via HTTP REST API, transforms that data in Python and then queries it using DuckDB. For this stage, a new separate Postgres database is created for the exercises. 

**Note:** Check that `pgAdmin` isn't running on the same ports as Kestra. If so, check out the [FAQ](#troubleshooting-tips) at the bottom of the README.

### Videos

- **2.2.3 - Create an ETL Pipeline with Postgres in Kestra**   
  [![Create an ETL Pipeline with Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FOkfLX28Ecjg%3Fsi%3DvKbIyWo1TtjpNnvt)](https://youtu.be/OkfLX28Ecjg?si=vKbIyWo1TtjpNnvt)
- **2.2.4 - Manage Scheduling and Backfills using Postgres in Kestra**  
  [![Manage Scheduling and Backfills using Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F_-li_z97zog%3Fsi%3DG6jZbkfJb3GAyqrd)](https://youtu.be/_-li_z97zog?si=G6jZbkfJb3GAyqrd)
- **2.2.5 - Transform Data with dbt and Postgres in Kestra**  
  [![Transform Data with dbt and Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FZLp2N6p2JjE%3Fsi%3DtWhcvq5w4lO8v1_p)](https://youtu.be/ZLp2N6p2JjE?si=tWhcvq5w4lO8v1_p)


```mermaid
graph LR
  Extract[Extract Data via HTTP REST API] --> Transform[Transform Data in Python]
  Transform --> Query[Query Data with DuckDB]
```

Add the flow [`01_getting_started_data_pipeline.yaml`](flows/01_getting_started_data_pipeline.yaml) from the UI if you haven't already and execute it to see the results. Inspect the Gantt and Logs tabs to understand the flow execution.

### Local DB: Load Taxi Data to Postgres

Before we start loading data to GCP, we'll first play with the Yellow and Green Taxi data using a local Postgres database running in a Docker container. We'll create a new Postgres database for these examples using this [Docker Compose file](docker/postgres/docker-compose.yml). Download it into a new directory, navigate to it and run the following command to start it:

```bash
docker compose up -d
```

The flow will extract CSV data partitioned by year and month, create tables, load data to the monthly table, and finally merge the data to the final destination table.

```mermaid
graph LR
  Start[Select Year & Month] --> SetLabel[Set Labels]
  SetLabel --> Extract[Extract CSV Data]
  Extract -->|Taxi=Yellow| YellowFinalTable[Create Yellow Final Table]:::yellow
  Extract -->|Taxi=Green| GreenFinalTable[Create Green Final Table]:::green
  YellowFinalTable --> YellowMonthlyTable[Create Yellow Monthly Table]:::yellow
  GreenFinalTable --> GreenMonthlyTable[Create Green Monthly Table]:::green
  YellowMonthlyTable --> YellowCopyIn[Load Data to Monthly Table]:::yellow
  GreenMonthlyTable --> GreenCopyIn[Load Data to Monthly Table]:::green
  YellowCopyIn --> YellowMerge[Merge Yellow Data]:::yellow
  GreenCopyIn --> GreenMerge[Merge Green Data]:::green

  classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px;
  classDef green fill:#32CD32,stroke:#000,stroke-width:1px;
```

The flow code: [`02_postgres_taxi.yaml`](flows/02_postgres_taxi.yaml).
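
To see why the load goes through a monthly table before the final table, here is a minimal, self-contained sketch of the "load, then merge" idea. It uses SQLite from the Python standard library as a stand-in for Postgres and `INSERT ... WHERE NOT EXISTS` in place of `MERGE`; the table names, the columns and the `row_id` key are all hypothetical, so check the flow code for the real schema and matching key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE final_trips (row_id TEXT PRIMARY KEY, total_amount REAL);
    CREATE TABLE monthly_trips (row_id TEXT, total_amount REAL);
""")


def merge_month(conn, rows):
    """Load rows into the monthly table, then merge only unseen rows into the final table."""
    conn.execute("DELETE FROM monthly_trips")  # start from an empty staging table
    conn.executemany("INSERT INTO monthly_trips VALUES (?, ?)", rows)
    # Rows whose row_id already exists in final_trips are skipped.
    conn.execute("""
        INSERT INTO final_trips (row_id, total_amount)
        SELECT m.row_id, m.total_amount
        FROM monthly_trips AS m
        WHERE NOT EXISTS (SELECT 1 FROM final_trips AS f WHERE f.row_id = m.row_id)
    """)
    conn.commit()


january = [("a1", 12.5), ("b2", 7.0)]
merge_month(conn, january)
merge_month(conn, january)  # re-running the same month adds nothing
count = conn.execute("SELECT COUNT(*) FROM final_trips").fetchone()[0]
print(count)  # 2
```

Because already-merged rows are skipped, running the same month twice leaves the final table unchanged, which is what makes re-runs and backfills safe.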


> [!NOTE]  
> The NYC Taxi and Limousine Commission (TLC) Trip Record Data provided on the [nyc.gov](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) website is currently available only in Parquet format, but this is NOT the dataset we're going to use in this course. For the purpose of this course, we'll use the **CSV files** available [here on GitHub](https://github.com/DataTalksClub/nyc-tlc-data/releases). This is because the Parquet format can be challenging for newcomers to understand, and we want to make the course as accessible as possible; the CSV format can be easily inspected using tools like Excel or Google Sheets, or even a simple text editor.

### Local DB: Learn Scheduling and Backfills

We can now schedule the same pipeline shown above to run daily at 9 AM UTC. We'll also demonstrate how to backfill the data pipeline to run on historical data.

Note: given the large dataset, we'll backfill only data for the green taxi dataset for the year 2019.

The flow code: [`02_postgres_taxi_scheduled.yaml`](flows/02_postgres_taxi_scheduled.yaml).
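
As a rough picture of what a 2019 backfill covers, the sketch below lists the monthly files involved. It assumes each execution processes one monthly file and that files follow the `<taxi>_tripdata_<year>-<month>.csv` naming seen in the dataset releases; `monthly_files` is a hypothetical helper, not part of the flow.

```python
def monthly_files(taxi, year):
    """List the monthly CSV file names for one taxi type and year."""
    return [f"{taxi}_tripdata_{year}-{month:02d}.csv" for month in range(1, 13)]


files = monthly_files("green", 2019)
print(files[0], files[-1], len(files))
# green_tripdata_2019-01.csv green_tripdata_2019-12.csv 12
```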

### Local DB: Orchestrate dbt Models (Optional)

Now that we have raw data ingested into a local Postgres database, we can use dbt to transform the data into meaningful insights. The flow will sync the dbt models from Git to Kestra and run the `dbt build` command to build the models.

```mermaid
graph LR
  Start[Select dbt command] --> Sync[Sync Namespace Files]
  Sync --> DbtBuild[Run dbt CLI]
```

This is only a quick showcase of dbt inside Kestra, so the homework tasks do not depend on it. The course covers dbt in more detail in [Week 4](../04-analytics-engineering).

The flow code: [`03_postgres_dbt.yaml`](flows/03_postgres_dbt.yaml).

### Resources
- [pgAdmin Download](https://www.pgadmin.org/download/)
- [Postgres DB Docker Compose](docker/postgres/docker-compose.yml)

---

## 4. ETL Pipelines in Kestra: Google Cloud Platform

Now that you've learned how to build ETL pipelines locally using Postgres, we are ready to move to the cloud. In this section, we'll load the same Yellow and Green Taxi data to Google Cloud Platform (GCP) using: 
1. Google Cloud Storage (GCS) as a data lake  
2. BigQuery as a data warehouse.

### Videos

- **2.2.6 - Create an ETL Pipeline with GCS and BigQuery in Kestra**  
  [![Create an ETL Pipeline with BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FnKqjjLJ7YXs)](https://youtu.be/nKqjjLJ7YXs)
- **2.2.7 - Manage Scheduling and Backfills using BigQuery in Kestra**   
  [![Manage Scheduling and Backfills using BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FDoaZ5JWEkH0)](https://youtu.be/DoaZ5JWEkH0)
- **2.2.8 - Transform Data with dbt and BigQuery in Kestra**   
  [![Transform Data with dbt and BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FeF_EdV4A1Wk)](https://youtu.be/eF_EdV4A1Wk)

### Setup Google Cloud Platform (GCP)

Before we start loading data to GCP, we need to set up the Google Cloud Platform. 

First, adjust the following flow [`04_gcp_kv.yaml`](flows/04_gcp_kv.yaml) to include your service account, GCP project ID, BigQuery dataset and GCS bucket name (_along with their location_) as KV Store values:
- GCP_CREDS
- GCP_PROJECT_ID
- GCP_LOCATION
- GCP_BUCKET_NAME
- GCP_DATASET


> [!WARNING]  
> The `GCP_CREDS` service account contains sensitive information. Ensure you keep it secure and do not commit it to Git. Keep it as secure as your passwords.

### Create GCP Resources

If you haven't already created the GCS bucket and BigQuery dataset in the first week of the course, you can use this flow to create them: [`05_gcp_setup.yaml`](flows/05_gcp_setup.yaml).


### GCP Workflow: Load Taxi Data to BigQuery

```mermaid
graph LR
  SetLabel[Set Labels] --> Extract[Extract CSV Data]
  Extract --> UploadToGCS[Upload Data to GCS]
  UploadToGCS -->|Taxi=Yellow| BQYellowTripdata[Main Yellow Tripdata Table]:::yellow
  UploadToGCS -->|Taxi=Green| BQGreenTripdata[Main Green Tripdata Table]:::green
  BQYellowTripdata --> BQYellowTableExt[External Table]:::yellow
  BQGreenTripdata --> BQGreenTableExt[External Table]:::green
  BQYellowTableExt --> BQYellowTableTmp[Monthly Table]:::yellow
  BQGreenTableExt --> BQGreenTableTmp[Monthly Table]:::green
  BQYellowTableTmp --> BQYellowMerge[Merge to Main Table]:::yellow
  BQGreenTableTmp --> BQGreenMerge[Merge to Main Table]:::green
  BQYellowMerge --> PurgeFiles[Purge Files]
  BQGreenMerge --> PurgeFiles[Purge Files]

  classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px;
  classDef green fill:#32CD32,stroke:#000,stroke-width:1px;
```

The flow code: [`06_gcp_taxi.yaml`](flows/06_gcp_taxi.yaml).

### GCP Workflow: Schedule and Backfill Full Dataset

We can now schedule the same pipeline shown above to run daily at 9 AM UTC for the green dataset and at 10 AM UTC for the yellow dataset. You can backfill historical data directly from the Kestra UI.

Since we now process data in a cloud environment with storage and compute that scale far beyond a local machine, we can backfill the entire dataset for both the yellow and green taxi data without the risk of running out of local resources.

The flow code: [`06_gcp_taxi_scheduled.yaml`](flows/06_gcp_taxi_scheduled.yaml).

### GCP Workflow: Orchestrate dbt Models (Optional)

Now that we have raw data ingested into BigQuery, we can use dbt to transform that data. The flow will sync the dbt models from Git to Kestra and run the `dbt build` command to build the models:

```mermaid
graph LR
  Start[Select dbt command] --> Sync[Sync Namespace Files]
  Sync --> Build[Run dbt Build Command]
```

This is only a quick showcase of dbt inside Kestra, so the homework tasks do not depend on it. The course covers dbt in more detail in [Week 4](../04-analytics-engineering).

The flow code: [`07_gcp_dbt.yaml`](flows/07_gcp_dbt.yaml).

---

## 5. Bonus: Deploy to the Cloud (Optional)

Now that we've got our ETL pipeline working both locally and in the cloud, we can deploy Kestra to the cloud so it can continue to orchestrate our ETL pipelines monthly with our configured schedules. We'll cover how you can install Kestra on Google Cloud in production, and automatically sync and deploy your workflows from a Git repository.

Note: When committing your workflows to Kestra, make sure your workflow doesn't contain any sensitive information. You can use [Secrets](https://go.kestra.io/de-zoomcamp/secret) and the [KV Store](https://go.kestra.io/de-zoomcamp/kv-store) to keep sensitive data out of your workflow logic.

### Videos

- **2.2.9 - Deploy Workflows to the Cloud with Git**   
  [![Deploy Workflows to the Cloud with Git](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fl-wC71tI3co)](https://youtu.be/l-wC71tI3co)

### Resources

- [Install Kestra on Google Cloud](https://go.kestra.io/de-zoomcamp/gcp-install)
- [Moving from Development to Production](https://go.kestra.io/de-zoomcamp/dev-to-prod)
- [Using Git in Kestra](https://go.kestra.io/de-zoomcamp/git)
- [Deploy Flows with GitHub Actions](https://go.kestra.io/de-zoomcamp/deploy-github-actions)

## 6. Additional Resources 📚

- Check [Kestra Docs](https://go.kestra.io/de-zoomcamp/docs)
- Explore our [Blueprints](https://go.kestra.io/de-zoomcamp/blueprints) library
- Browse over 600 [plugins](https://go.kestra.io/de-zoomcamp/plugins) available in Kestra
- Give us a star on [GitHub](https://go.kestra.io/de-zoomcamp/github)
- Join our [Slack community](https://go.kestra.io/de-zoomcamp/slack) if you have any questions
- Find all the videos in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist)


### Troubleshooting tips

If you face any issues with Kestra flows in Module 2, make sure to use the following Docker images/ports:
- `kestra/kestra:latest` is correct (the latest stable release), while `kestra/kestra:develop` is incorrect, as it is a bleeding-edge development version that might contain bugs
- `postgres:latest`: make sure to use a Postgres image that runs **PostgreSQL 15** or higher
- If you run `pgAdmin` or something else on port 8080, you can adjust Kestra docker-compose to use a different port, e.g. change port mapping to 18080 instead of 8080, and then access Kestra UI in your browser from http://localhost:18080/ instead of from http://localhost:8080/

If you're using Linux, you might encounter `Connection Refused` errors when connecting to the Postgres DB from within Kestra. This is because `host.docker.internal` works differently on Linux. Using the modified Docker Compose file below, you can run both Kestra and its dedicated Postgres DB, as well as the Postgres DB for the exercises, all together. You can access it within Kestra by referring to the container name `postgres_zoomcamp` instead of `host.docker.internal` in `pluginDefaults`. This applies to pgAdmin as well. If you'd prefer to keep them in separate Docker Compose files, you'll need to set up a Docker network so that they can communicate with each other.

<details>
<summary>Docker Compose Example</summary>

This Docker Compose has the Zoomcamp DB container and pgAdmin container added to it, so it's all in one file.

Changes include:
- New `volume` for the Zoomcamp DB container
- Zoomcamp DB container is added and renamed to prevent clashes with the Kestra DB container
- A `depends_on` condition is added to make sure Kestra is running before it starts
- pgAdmin is added and runs on port 8085 so it doesn't clash with Kestra, which uses 8080 and 8081

```yaml
volumes:
  postgres-data:
    driver: local
  kestra-data:
    driver: local
  zoomcamp-data:
    driver: local

services:
  postgres:
    image: postgres
    volumes:
      - postgres-data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: kestra
      POSTGRES_USER: kestra
      POSTGRES_PASSWORD: k3str4
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -d $${POSTGRES_DB} -U $${POSTGRES_USER}"]
      interval: 30s
      timeout: 10s
      retries: 10

  kestra:
    image: kestra/kestra:latest
    pull_policy: always
    # Note that this setup with a root user is intended for development purpose.
    # Our base image runs without root, but the Docker Compose implementation needs root to access the Docker socket
    # To run Kestra in a rootless mode in production, see: https://kestra.io/docs/installation/podman-compose
    user: "root"
    command: server standalone
    volumes:
      - kestra-data:/app/storage
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp/kestra-wd:/tmp/kestra-wd
    environment:
      KESTRA_CONFIGURATION: |
        datasources:
          postgres:
            url: jdbc:postgresql://postgres:5432/kestra
            driverClassName: org.postgresql.Driver
            username: kestra
            password: k3str4
        kestra:
          server:
            basicAuth:
              enabled: false
              username: "admin@kestra.io" # it must be a valid email address
              password: kestra
          repository:
            type: postgres
          storage:
            type: local
            local:
              basePath: "/app/storage"
          queue:
            type: postgres
          tasks:
            tmpDir:
              path: /tmp/kestra-wd/tmp
          url: http://localhost:8080/
    ports:
      - "8080:8080"
      - "8081:8081"
    depends_on:
      postgres:
        condition: service_started
    
  postgres_zoomcamp:
    image: postgres
    environment:
      POSTGRES_USER: kestra
      POSTGRES_PASSWORD: k3str4
      POSTGRES_DB: postgres-zoomcamp
    ports:
      - "5432:5432"
    volumes:
      - zoomcamp-data:/var/lib/postgresql/data
    depends_on:
      kestra:
        condition: service_started

  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8085:80"
    depends_on:
      postgres_zoomcamp:
        condition: service_started
```

</details>
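
With the combined Compose file, the exercise database is reached by its service name rather than `host.docker.internal`. The sketch below only shows how the connection URL changes; it assumes `pluginDefaults` takes a JDBC URL of the same shape as the Kestra datasource URL in the Compose file, and `jdbc_url` is a hypothetical helper.

```python
def jdbc_url(host, port=5432, database="postgres-zoomcamp"):
    """Build a Postgres JDBC URL for the exercise database."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


# On Linux with the combined Compose file, use the service name as the host.
print(jdbc_url("postgres_zoomcamp"))
# jdbc:postgresql://postgres_zoomcamp:5432/postgres-zoomcamp

# The host that may be refused on Linux:
print(jdbc_url("host.docker.internal"))
# jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp
```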

If you are still facing any issues, stop and remove your existing Kestra + Postgres containers and start them again using `docker-compose up -d`. If this doesn't help, post your question on the DataTalksClub Slack or on Kestra's Slack http://kestra.io/slack.

- **DE Zoomcamp FAQ - PostgresDB Setup and Installing pgAdmin**   
  [![DE Zoomcamp FAQ - PostgresDB Setup and Installing pgAdmin](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FywAPYNYFaB4%3Fsi%3D5X9AD0nFAT2WLWgS)](https://youtu.be/ywAPYNYFaB4?si=5X9AD0nFAT2WLWgS)
- **DE Zoomcamp FAQ - Port and Images**  
  [![DE Zoomcamp FAQ - Ports and Images](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fl2M2mW76RIU%3Fsi%3DoqyZ7KUaI27vi90V)](https://youtu.be/l2M2mW76RIU?si=oqyZ7KUaI27vi90V)
- **DE Zoomcamp FAQ - Docker Setup**  
  [![DE Zoomcamp FAQ - Docker Setup](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F73g6qJN0HcM)](https://youtu.be/73g6qJN0HcM)


If you encounter an error similar to:
```
BigQueryError{reason=invalid, location=null, 
message=Error while reading table: kestra-sandbox.zooomcamp.yellow_tripdata_2020_01, 
error message: CSV table references column position 17, but line contains only 14 columns.; 
line_number: 2103925 byte_offset_to_start_of_line: 194863028 
column_index: 17 column_name: "congestion_surcharge" column_type: NUMERIC 
File: gs://anna-geller/yellow_tripdata_2020-01.csv}
```

It means that the CSV file you're trying to load into BigQuery has a mismatch in the number of columns between the external source table (i.e. the file in GCS) and the destination table in BigQuery: at least one row contains fewer columns than the table expects. This can happen when, due to network or transfer issues, the file is not fully downloaded from GitHub or not correctly uploaded to GCS. The error suggests a schema issue, but that's not the case. Simply rerun the entire execution, including redownloading the CSV file and reuploading it to GCS. This should resolve the issue.
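
If you want to confirm that a local copy of the file is truncated before rerunning, a quick check along the following lines may help. This is only a minimal sketch and not part of the course flows: the helper name `find_short_rows` is hypothetical, and it assumes the first line of the CSV is a header that defines the expected column count.

```python
import csv
import io


def find_short_rows(lines):
    """Return (line_number, column_count) for every row shorter than the header.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    reader = csv.reader(lines)
    header = next(reader)      # assumption: the first row is the header
    expected = len(header)
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) < expected
    ]


# Small in-memory sample standing in for a partially downloaded file:
# the last row was cut off and has only two of the three columns.
sample = io.StringIO("a,b,c\n1,2,3\n4,5\n")
print(find_short_rows(sample))  # → [(3, 2)]
```

To check a real file, pass an open file handle instead of the sample, for example `open("yellow_tripdata_2020-01.csv", newline="")` (the path here is just an illustration). An empty result suggests every row has at least as many columns as the header.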

---

## Homework 

See the [2025 cohort folder](../cohorts/2025/02-workflow-orchestration/homework.md)


---

# Community notes

Did you take notes? You can share them by creating a PR to this file! 

* [Notes from Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/2_Workflow-Orchestration-(Kestra)/README.md)
* [Notes from Horeb Seidou](https://spotted-hardhat-eea.notion.site/Week-2-Workflow-Orchestration-17129780dc4a80148debf61e6453fffe)
* [Notes from Livia](https://docs.google.com/document/d/1Y_QMonvEtFPbXIzmdpCSVsKNC1BWAHFBA1mpK9qaZko/edit?usp=sharing)
* [2025 Gitbook Notes from Tinker0425](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/module-2/introduction-to-module-2)
* [Notes from Mercy Markus: Linux/Fedora Tweaks and Tips](https://mercymarkus.com/posts/2025/series/dtc-dez-jan-2025/dtc-dez-2025-module-2/)
* Add your notes above this line

---

# Previous Cohorts

* 2022: [notes](../cohorts/2022/week_2_data_ingestion#community-notes) and [videos](../cohorts/2022/week_2_data_ingestion)
* 2023: [notes](../cohorts/2023/week_2_workflow_orchestration#community-notes) and [videos](../cohorts/2023/week_2_workflow_orchestration)
* 2024: [notes](../cohorts/2024/02-workflow-orchestration#community-notes) and [videos](../cohorts/2024/02-workflow-orchestration)


================================================
FILE: cohorts/2025/02-workflow-orchestration/flows/01_getting_started_data_pipeline.yaml
================================================
id: 01_getting_started_data_pipeline
namespace: zoomcamp

inputs:
  - id: columns_to_keep
    type: ARRAY
    itemType: STRING
    defaults:
      - brand
      - price

tasks:
  - id: extract
    type: io.kestra.plugin.core.http.Download
    uri: https://dummyjson.com/products

  - id: transform
    type: io.kestra.plugin.scripts.python.Script
    containerImage: python:3.11-alpine
    inputFiles:
      data.json: "{{outputs.extract.uri}}"
    outputFiles:
      - "*.json"
    env:
      COLUMNS_TO_KEEP: "{{inputs.columns_to_keep}}"
    script: |
      import json
      import os

      columns_to_keep_str = os.getenv("COLUMNS_TO_KEEP")
      columns_to_keep = json.loads(columns_to_keep_str)

      with open("data.json", "r") as file:
          data = json.load(file)

      filtered_data = [
          {column: product.get(column, "N/A") for column in columns_to_keep}
          for product in data["products"]
      ]

      with open("products.json", "w") as file:
          json.dump(filtered_data, file, indent=4)

  - id: query
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      products.json: "{{outputs.transform.outputFiles['products.json']}}"
    sql: |
      INSTALL json;
      LOAD json;
      SELECT brand, round(avg(price), 2) as avg_price
      FROM read_json_auto('{{workingDir}}/products.json')
      GROUP BY brand
      ORDER BY avg_price DESC;
    fetchType: STORE


================================================
FILE: cohorts/2025/02-workflow-orchestration/flows/02_postgres_taxi.yaml
================================================
id: 02_postgres_taxi
namespace: zoomcamp
description: |
  The CSV Data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases

inputs:
  - id: taxi
    type: SELECT
    displayName: Select taxi type
    values: [yellow, green]
    defaults: yellow

  - id: year
    type: SELECT
    displayName: Select year
    values: ["2019", "2020"]
    defaults: "2019"

  - id: month
    type: SELECT
    displayName: Select month
    values: ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
    defaults: "01"

variables:
  file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv"
  staging_table: "public.{{inputs.taxi}}_tripdata_staging"
  table: "public.{{inputs.taxi}}_tripdata"
  data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}"

tasks:
  - id: set_label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      file: "{{render(vars.file)}}"
      taxi: "{{inputs.taxi}}"

  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}

  - id: if_yellow_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'yellow'}}"
    then:
      - id: yellow_create_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              tpep_pickup_datetime   timestamp,
              tpep_dropoff_datetime  timestamp,
              passenger_count        integer,
              trip_distance          double precision,
              RatecodeID             text,
              store_and_fwd_flag     text,
              PULocationID           text,
              DOLocationID           text,
              payment_type           integer,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              congestion_surcharge   double precision
          );

      - id: yellow_create_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              tpep_pickup_datetime   timestamp,
              tpep_dropoff_datetime  timestamp,
              passenger_count        integer,
              trip_distance          double precision,
              RatecodeID             text,
              store_and_fwd_flag     text,
              PULocationID           text,
              DOLocationID           text,
              payment_type           integer,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              congestion_surcharge   double precision
          );

      - id: yellow_truncate_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          TRUNCATE TABLE {{render(vars.staging_table)}};

      - id: yellow_copy_in_to_staging_table
        type: io.kestra.plugin.jdbc.postgresql.CopyIn
        format: CSV
        from: "{{render(vars.data)}}"
        table: "{{render(vars.staging_table)}}"
        header: true
        columns: [VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge]

      - id: yellow_add_unique_id_and_filename
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          UPDATE {{render(vars.staging_table)}}
          SET 
            unique_row_id = md5(
              COALESCE(CAST(VendorID AS text), '') ||
              COALESCE(CAST(tpep_pickup_datetime AS text), '') || 
              COALESCE(CAST(tpep_dropoff_datetime AS text), '') || 
              COALESCE(PULocationID, '') || 
              COALESCE(DOLocationID, '') || 
              COALESCE(CAST(fare_amount AS text), '') || 
              COALESCE(CAST(trip_distance AS text), '')      
            ),
            filename = '{{render(vars.file)}}';

      - id: yellow_merge_data
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          MERGE INTO {{render(vars.table)}} AS T
          USING {{render(vars.staging_table)}} AS S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (
              unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,
              passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID,
              DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,
              improvement_surcharge, total_amount, congestion_surcharge
            )
            VALUES (
              S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime,
              S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID,
              S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount,
              S.improvement_surcharge, S.total_amount, S.congestion_surcharge
            );

  - id: if_green_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'green'}}"
    then:
      - id: green_create_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              lpep_pickup_datetime   timestamp,
              lpep_dropoff_datetime  timestamp,
              store_and_fwd_flag     text,
              RatecodeID             text,
              PULocationID           text,
              DOLocationID           text,
              passenger_count        integer,
              trip_distance          double precision,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              ehail_fee              double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              payment_type           integer,
              trip_type              integer,
              congestion_surcharge   double precision
          );

      - id: green_create_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              lpep_pickup_datetime   timestamp,
              lpep_dropoff_datetime  timestamp,
              store_and_fwd_flag     text,
              RatecodeID             text,
              PULocationID           text,
              DOLocationID           text,
              passenger_count        integer,
              trip_distance          double precision,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              ehail_fee              double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              payment_type           integer,
              trip_type              integer,
              congestion_surcharge   double precision
          );

      - id: green_truncate_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          TRUNCATE TABLE {{render(vars.staging_table)}};

      - id: green_copy_in_to_staging_table
        type: io.kestra.plugin.jdbc.postgresql.CopyIn
        format: CSV
        from: "{{render(vars.data)}}"
        table: "{{render(vars.staging_table)}}"
        header: true
        columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge]

      - id: green_add_unique_id_and_filename
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          UPDATE {{render(vars.staging_table)}}
          SET 
            unique_row_id = md5(
              COALESCE(CAST(VendorID AS text), '') ||
              COALESCE(CAST(lpep_pickup_datetime AS text), '') || 
              COALESCE(CAST(lpep_dropoff_datetime AS text), '') || 
              COALESCE(PULocationID, '') || 
              COALESCE(DOLocationID, '') || 
              COALESCE(CAST(fare_amount AS text), '') || 
              COALESCE(CAST(trip_distance AS text), '')      
            ),
            filename = '{{render(vars.file)}}';

      - id: green_merge_data
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          MERGE INTO {{render(vars.table)}} AS T
          USING {{render(vars.staging_table)}} AS S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (
              unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime,
              store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count,
              trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee,
              improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge
            )
            VALUES (
              S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime,
              S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count,
              S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee,
              S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge
            );
  
  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    description: This will remove output files. If you'd like to explore Kestra outputs, disable it.

pluginDefaults:
  - type: io.kestra.plugin.jdbc.postgresql
    values:
      url: jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp
      username: kestra
      password: k3str4


================================================
FILE: cohorts/2025/02-workflow-orchestration/flows/02_postgres_taxi_scheduled.yaml
================================================
id: 02_postgres_taxi_scheduled
namespace: zoomcamp
description: |
  Best to add a label `backfill:true` from the UI to track executions created via a backfill.
  CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases

concurrency:
  limit: 1

inputs:
  - id: taxi
    type: SELECT
    displayName: Select taxi type
    values: [yellow, green]
    defaults: yellow

variables:
  file: "{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv"
  staging_table: "public.{{inputs.taxi}}_tripdata_staging"
  table: "public.{{inputs.taxi}}_tripdata"
  data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}"

tasks:
  - id: set_label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      file: "{{render(vars.file)}}"
      taxi: "{{inputs.taxi}}"

  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}

  - id: if_yellow_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'yellow'}}"
    then:
      - id: yellow_create_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              tpep_pickup_datetime   timestamp,
              tpep_dropoff_datetime  timestamp,
              passenger_count        integer,
              trip_distance          double precision,
              RatecodeID             text,
              store_and_fwd_flag     text,
              PULocationID           text,
              DOLocationID           text,
              payment_type           integer,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              congestion_surcharge   double precision
          );

      - id: yellow_create_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              tpep_pickup_datetime   timestamp,
              tpep_dropoff_datetime  timestamp,
              passenger_count        integer,
              trip_distance          double precision,
              RatecodeID             text,
              store_and_fwd_flag     text,
              PULocationID           text,
              DOLocationID           text,
              payment_type           integer,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              congestion_surcharge   double precision
          );

      - id: yellow_truncate_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          TRUNCATE TABLE {{render(vars.staging_table)}};

      - id: yellow_copy_in_to_staging_table
        type: io.kestra.plugin.jdbc.postgresql.CopyIn
        format: CSV
        from: "{{render(vars.data)}}"
        table: "{{render(vars.staging_table)}}"
        header: true
        columns: [VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge]

      - id: yellow_add_unique_id_and_filename
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          UPDATE {{render(vars.staging_table)}}
          SET 
            unique_row_id = md5(
              COALESCE(CAST(VendorID AS text), '') ||
              COALESCE(CAST(tpep_pickup_datetime AS text), '') || 
              COALESCE(CAST(tpep_dropoff_datetime AS text), '') || 
              COALESCE(PULocationID, '') || 
              COALESCE(DOLocationID, '') || 
              COALESCE(CAST(fare_amount AS text), '') || 
              COALESCE(CAST(trip_distance AS text), '')      
            ),
            filename = '{{render(vars.file)}}';

      - id: yellow_merge_data
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          MERGE INTO {{render(vars.table)}} AS T
          USING {{render(vars.staging_table)}} AS S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (
              unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,
              passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID,
              DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,
              improvement_surcharge, total_amount, congestion_surcharge
            )
            VALUES (
              S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime,
              S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID,
              S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount,
              S.improvement_surcharge, S.total_amount, S.congestion_surcharge
            );

  - id: if_green_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'green'}}"
    then:
      - id: green_create_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              lpep_pickup_datetime   timestamp,
              lpep_dropoff_datetime  timestamp,
              store_and_fwd_flag     text,
              RatecodeID             text,
              PULocationID           text,
              DOLocationID           text,
              passenger_count        integer,
              trip_distance          double precision,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              ehail_fee              double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              payment_type           integer,
              trip_type              integer,
              congestion_surcharge   double precision
          );

      - id: green_create_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (
              unique_row_id          text,
              filename               text,
              VendorID               text,
              lpep_pickup_datetime   timestamp,
              lpep_dropoff_datetime  timestamp,
              store_and_fwd_flag     text,
              RatecodeID             text,
              PULocationID           text,
              DOLocationID           text,
              passenger_count        integer,
              trip_distance          double precision,
              fare_amount            double precision,
              extra                  double precision,
              mta_tax                double precision,
              tip_amount             double precision,
              tolls_amount           double precision,
              ehail_fee              double precision,
              improvement_surcharge  double precision,
              total_amount           double precision,
              payment_type           integer,
              trip_type              integer,
              congestion_surcharge   double precision
          );

      - id: green_truncate_staging_table
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          TRUNCATE TABLE {{render(vars.staging_table)}};

      - id: green_copy_in_to_staging_table
        type: io.kestra.plugin.jdbc.postgresql.CopyIn
        format: CSV
        from: "{{render(vars.data)}}"
        table: "{{render(vars.staging_table)}}"
        header: true
        columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge]

      - id: green_add_unique_id_and_filename
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          UPDATE {{render(vars.staging_table)}}
          SET 
            unique_row_id = md5(
              COALESCE(CAST(VendorID AS text), '') ||
              COALESCE(CAST(lpep_pickup_datetime AS text), '') || 
              COALESCE(CAST(lpep_dropoff_datetime AS text), '') || 
              COALESCE(PULocationID, '') || 
              COALESCE(DOLocationID, '') || 
              COALESCE(CAST(fare_amount AS text), '') || 
              COALESCE(CAST(trip_distance AS text), '')      
            ),
            filename = '{{render(vars.file)}}';

      - id: green_merge_data
        type: io.kestra.plugin.jdbc.postgresql.Queries
        sql: |
          MERGE INTO {{render(vars.table)}} AS T
          USING {{render(vars.staging_table)}} AS S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (
              unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime,
              store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count,
              trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee,
              improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge
            )
            VALUES (
              S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime,
              S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count,
              S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee,
              S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge
            );
  
  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    description: To avoid cluttering your storage, we will remove the downloaded files

pluginDefaults:
  - type: io.kestra.plugin.jdbc.postgresql
    values:
      url: jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp
      username: kestra
      password: k3str4

triggers:
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"
    inputs:
      taxi: green

  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"
    inputs:
      taxi: yellow


================================================
FILE: cohorts/2025/02-workflow-orchestration/flows/03_postgres_dbt.yaml
================================================
id: 03_postgres_dbt
namespace: zoomcamp
inputs:
  - id: dbt_command
    type: SELECT
    allowCustomValue: true
    defaults: dbt build
    values:
      - dbt build
      - dbt debug # use when running the first time to validate DB connection
tasks:
  - id: sync
    type: io.kestra.plugin.git.SyncNamespaceFiles
    url: https://github.com/DataTalksClub/data-engineering-zoomcamp
    branch: main
    namespace: "{{ flow.namespace }}"
    gitDirectory: 04-analytics-engineering/taxi_rides_ny
    dryRun: false
    # disabled: true # this Git Sync is needed only when running it the first time, afterwards the task can be disabled

  - id: dbt-build
    type: io.kestra.plugin.dbt.cli.DbtCLI
    env:
      DBT_DATABASE: postgres-zoomcamp
      DBT_SCHEMA: public
    namespaceFiles:
      enabled: true
    containerImage: ghcr.io/kestra-io/dbt-postgres:latest
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
      networkMode: host
    commands:
      - dbt deps
      - "{{ inputs.dbt_command }}"
    storeManifest:
      key: manifest.json
      namespace: "{{ flow.namespace }}"
    profiles: |
      default:
        outputs:
          dev:
            type: postgres
            host: host.docker.internal
            user: kestra
            password: k3str4
            port: 5432
            dbname: postgres-zoomcamp
            schema: public
            threads: 8
            connect_timeout: 10
            priority: interactive
        target: dev
description: |
  Note that you need to adjust the models/staging/schema.yml file to match your database and schema. Select and edit that Namespace File from the UI. Save and run this flow. Once https://github.com/DataTalksClub/data-engineering-zoomcamp/pull/565/files is merged, you can ignore this note as it will be dynamically adjusted based on env variables.
  ```yaml
  sources:
    - name: staging
      database: postgres-zoomcamp
      schema: public
  ```


================================================
FILE: cohorts/2025/02-workflow-orchestration/flows/04_gcp_kv.yaml
================================================
id: 04_gcp_kv
namespace: zoomcamp

tasks:
  - id: gcp_project_id
    type: io.kestra.plugin.core.kv.Set
    key: GCP_PROJECT_ID
    kvType: STRING
    value: kestra-sandbox # TODO replace with your project id

  - id: gcp_location
    type: io.kestra.plugin.core.kv.Set
    key: GCP_LOCATION
    kvType: STRING
    value: europe-west2

  - id: gcp_bucket_name
    type: io.kestra.plugin.core.kv.Set
    key: GCP_BUCKET_NAME
    kvType: STRING
    value: your-name-kestra # TODO make sure it's globally unique!

  - id: gcp_dataset
    type: io.kestra.plugin.core.kv.Set
    key: GCP_DATASET
    kvType: STRING
    value: zoomcamp


================================================
FILE: cohorts/2025/02-workflow-orchestration/flows/05_gcp_setup.yaml
================================================
id: 05_gcp_setup
namespace: zoomcamp

tasks:
  - id: create_gcs_bucket
    type: io.kestra.plugin.gcp.gcs.CreateBucket
    ifExists: SKIP
    storageClass: REGIONAL
    name: "{{kv('GCP_BUCKET_NAME')}}" # make sure it's globally unique!

  - id: create_bq_dataset
    type: io.kestra.plugin.gcp.bigquery.CreateDataset
    name: "{{kv('GCP_DATASET')}}"
    ifExists: SKIP

pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{kv('GCP_CREDS')}}"
      projectId: "{{kv('GCP_PROJECT_ID')}}"
      location: "{{kv('GCP_LOCATION')}}"
      bucket: "{{kv('GCP_BUCKET_NAME')}}"


================================================
FILE: cohorts/2025/02-workflow-orchestration/flows/06_gcp_taxi.yaml
================================================
id: 06_gcp_taxi
namespace: zoomcamp
description: |
  The CSV Data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases

inputs:
  - id: taxi
    type: SELECT
    displayName: Select taxi type
    values: [yellow, green]
    defaults: green

  - id: year
    type: SELECT
    displayName: Select year
    values: ["2019", "2020"]
    defaults: "2019"
    allowCustomValue: true # allows you to type 2021 from the UI for the homework 🤗

  - id: month
    type: SELECT
    displayName: Select month
    values: ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"]
    defaults: "01"

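variables:
  # Variables are not rendered recursively, so tasks wrap them in render(), e.g. {{render(vars.file)}}.
  file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv"
  gcs_file: "gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}"
  table: "{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{inputs.year}}_{{inputs.month}}"
  # outputFiles is a map keyed by file name, so the key is rebuilt here with the ~ concatenation operator.
  data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}"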

tasks:
  - id: set_label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      file: "{{render(vars.file)}}"
      taxi: "{{inputs.taxi}}"

  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}

  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{render(vars.data)}}"
    to: "{{render(vars.gcs_file)}}"

  - id: if_yellow_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'yellow'}}"
    then:
      - id: bq_yellow_tripdata
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata`
          (
              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),
              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),
              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          PARTITION BY DATE(tpep_pickup_datetime);

      - id: bq_yellow_table_ext
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`
          (
              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          OPTIONS (
              format = 'CSV',
              uris = ['{{render(vars.gcs_file)}}'],
              skip_leading_rows = 1,
              ignore_unknown_values = TRUE
          );

      - id: bq_yellow_table_tmp
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`
          AS
          SELECT
            MD5(CONCAT(
              COALESCE(CAST(VendorID AS STRING), ""),
              COALESCE(CAST(tpep_pickup_datetime AS STRING), ""),
              COALESCE(CAST(tpep_dropoff_datetime AS STRING), ""),
              COALESCE(CAST(PULocationID AS STRING), ""),
              COALESCE(CAST(DOLocationID AS STRING), "")
            )) AS unique_row_id,
            "{{render(vars.file)}}" AS filename,
            *
          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;

      - id: bq_yellow_merge
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T
          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge)
            VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge);

  - id: if_green_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'green'}}"
    then:
      - id: bq_green_tripdata
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata`
          (
              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),
              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),
              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              ehail_fee NUMERIC,
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          PARTITION BY DATE(lpep_pickup_datetime);

      - id: bq_green_table_ext
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`
          (
              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              ehail_fee NUMERIC,
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          OPTIONS (
              format = 'CSV',
              uris = ['{{render(vars.gcs_file)}}'],
              skip_leading_rows = 1,
              ignore_unknown_values = TRUE
          );

      - id: bq_green_table_tmp
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`
          AS
          SELECT
            MD5(CONCAT(
              COALESCE(CAST(VendorID AS STRING), ""),
              COALESCE(CAST(lpep_pickup_datetime AS STRING), ""),
              COALESCE(CAST(lpep_dropoff_datetime AS STRING), ""),
              COALESCE(CAST(PULocationID AS STRING), ""),
              COALESCE(CAST(DOLocationID AS STRING), "")
            )) AS unique_row_id,
            "{{render(vars.file)}}" AS filename,
            *
          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;

      - id: bq_green_merge
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T
          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge)
            VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge);

  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    description: Disable this task if you'd like to keep the execution files and explore the outputs in Kestra.
    disabled: false

pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{kv('GCP_CREDS')}}"
      projectId: "{{kv('GCP_PROJECT_ID')}}"
      location: "{{kv('GCP_LOCATION')}}"
      bucket: "{{kv('GCP_BUCKET_NAME')}}"


================================================
FILE: cohorts/2025/02-workflow-orchestration/flows/06_gcp_taxi_scheduled.yaml
================================================

id: 06_gcp_taxi_scheduled
namespace: zoomcamp
description: |
  Best to add a label `backfill:true` from the UI to track executions created via a backfill.
  CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases

inputs:
  - id: taxi
    type: SELECT
    displayName: Select taxi type
    values: [yellow, green]
    defaults: green

variables:
  file: "{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv"
  gcs_file: "gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}"
  table: "{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy_MM')}}"
  data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}"
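  # Note: `trigger.date` is only populated when a Schedule trigger (or a backfill) starts
  # the execution, so the expressions above cannot be rendered on a manual run. A hedged
  # sketch of a fallback using Pebble's null-coalescing operator follows. It is an
  # assumption for illustration, not part of the course flow: the same substitution would
  # be needed in `table` and `data`, and the file for the resulting month must exist in
  # the release for the download to succeed.
  #
  # file: "{{inputs.taxi}}_tripdata_{{trigger.date ?? execution.startDate | date('yyyy-MM')}}.csv"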

tasks:
  - id: set_label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      file: "{{render(vars.file)}}"
      taxi: "{{inputs.taxi}}"

  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}

  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{render(vars.data)}}"
    to: "{{render(vars.gcs_file)}}"

  - id: if_yellow_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'yellow'}}"
    then:
      - id: bq_yellow_tripdata
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata`
          (
              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),
              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),
              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          PARTITION BY DATE(tpep_pickup_datetime);

      - id: bq_yellow_table_ext
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`
          (
              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          OPTIONS (
              format = 'CSV',
              uris = ['{{render(vars.gcs_file)}}'],
              skip_leading_rows = 1,
              ignore_unknown_values = TRUE
          );

      - id: bq_yellow_table_tmp
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`
          AS
          SELECT
            MD5(CONCAT(
              COALESCE(CAST(VendorID AS STRING), ""),
              COALESCE(CAST(tpep_pickup_datetime AS STRING), ""),
              COALESCE(CAST(tpep_dropoff_datetime AS STRING), ""),
              COALESCE(CAST(PULocationID AS STRING), ""),
              COALESCE(CAST(DOLocationID AS STRING), "")
            )) AS unique_row_id,
            "{{render(vars.file)}}" AS filename,
            *
          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;

      - id: bq_yellow_merge
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T
          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge)
            VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge);

  - id: if_green_taxi
    type: io.kestra.plugin.core.flow.If
    condition: "{{inputs.taxi == 'green'}}"
    then:
      - id: bq_green_tripdata
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata`
          (
              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),
              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),
              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              ehail_fee NUMERIC,
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          PARTITION BY DATE(lpep_pickup_datetime);

      - id: bq_green_table_ext
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`
          (
              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),
              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),
              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),
              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),
              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),
              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),
              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),
              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),
              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),
              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),
              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),
              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),
              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),
              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),
              ehail_fee NUMERIC,
              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),
              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),
              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),
              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),
              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')
          )
          OPTIONS (
              format = 'CSV',
              uris = ['{{render(vars.gcs_file)}}'],
              skip_leading_rows = 1,
              ignore_unknown_values = TRUE
          );

      - id: bq_green_table_tmp
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`
          AS
          SELECT
            MD5(CONCAT(
              COALESCE(CAST(VendorID AS STRING), ""),
              COALESCE(CAST(lpep_pickup_datetime AS STRING), ""),
              COALESCE(CAST(lpep_dropoff_datetime AS STRING), ""),
              COALESCE(CAST(PULocationID AS STRING), ""),
              COALESCE(CAST(DOLocationID AS STRING), "")
            )) AS unique_row_id,
            "{{render(vars.file)}}" AS filename,
            *
          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;

      - id: bq_green_merge
        type: io.kestra.plugin.gcp.bigquery.Query
        sql: |
          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T
          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S
          ON T.unique_row_id = S.unique_row_id
          WHEN NOT MATCHED THEN
            INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge)
            VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge);

  - id: purge_files
    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles
    description: To avoid cluttering your storage, we will remove the downloaded files

pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{kv('GCP_CREDS')}}"
      projectId: "{{kv('GCP_PROJECT_ID')}}"
      location: "{{kv('GCP_LOCATION')}}"
      bucket: "{{kv('GCP_BUCKET_NAME')}}"

triggers:
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"
    inputs:
      taxi: green

  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"
    inputs:
      taxi: yellow


================================================
FILE: cohorts/2025/02-workflow-orchestration/flows/07_gcp_dbt.yaml
================================================
id: 07_gcp_dbt
namespace: zoomcamp
inputs:
  - id: dbt_command
    type: SELECT
    allowCustomValue: true
    defaults: dbt build
    values:
      - dbt build
      - dbt debug # use when running the first time to validate DB connection

tasks:
  - id: sync
    type: io.kestra.plugin.git.SyncNamespaceFiles
    url: https://github.com/DataTalksClub/data-engineering-zoomcamp
    branch: main
    namespace: "{{flow.namespace}}"
    gitDirectory: 04-analytics-engineering/taxi_rides_ny
    dryRun: false
    # disabled: true # this Git Sync is needed only when running it the first time, afterwards the task can be disabled

  - id: dbt-build
    type: io.kestra.plugin.dbt.cli.DbtCLI
    env:
      DBT_DATABASE: "{{kv('GCP_PROJECT_ID')}}"
      DBT_SCHEMA: "{{kv('GCP_DATASET')}}"
    namespaceFiles:
      enabled: true
    containerImage: ghcr.io/kestra-io/dbt-bigquery:latest
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    inputFiles:
      sa.json: "{{kv('GCP_CREDS')}}"
    commands:
      - dbt deps
      - "{{ inputs.dbt_command }}"
    storeManifest:
      key: manifest.json
      namespace: "{{ flow.namespace }}"
    profiles: |
      default:
        outputs:
          dev:
            type: bigquery
            dataset: "{{kv('GCP_DATASET')}}"
            project: "{{kv('GCP_PROJECT_ID')}}"
            location: "{{kv('GCP_LOCATION')}}"
            keyfile: sa.json
            method: service-account
            priority: interactive
            threads: 16
            timeout_seconds: 300
            fixed_retries: 1
        target: dev
description: |
  Note that you need to adjust the models/staging/schema.yml file to match your database and schema. Select and edit that Namespace File from the UI. Save and run this flow. Once https://github.com/DataTalksClub/data-engineering-zoomcamp/pull/565/files is merged, you can ignore this note as it will be dynamically adjusted based on env variables.
  ```yaml
  sources:
    - name: staging
      database: kestra-sandbox 
      schema: zoomcamp
  ```


================================================
FILE: cohorts/2025/02-workflow-orchestration/homework.md
================================================
## Module 2 Homework

ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include these directly in the README file of your repository.

> In case you don't get one option exactly, select the closest one 

For the homework, we'll be working with the _green_ taxi dataset located here:

`https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download`

To get a `wget`-able link, use this prefix (note that the link itself gives 404):

`https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/`
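
As a minimal sketch of how a full download URL could be assembled from that prefix: the file-name pattern follows the `green_tripdata_2020-04.csv` naming that appears in the quiz below, while the `.csv.gz` suffix is an assumption about how the release assets are compressed, so check the release page before relying on it. The helper name `release_url` is hypothetical.

```python
# Prefix quoted above; the file-name pattern and suffix are assumptions for illustration.
PREFIX = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"


def release_url(taxi: str, year: int, month: int, suffix: str = ".csv.gz") -> str:
    """Build the download URL for one monthly file (hypothetical helper)."""
    file_name = f"{taxi}_tripdata_{year}-{month:02d}{suffix}"
    return f"{PREFIX}/{taxi}/{file_name}"


print(release_url("green", 2021, 1))
```

The printed URL can then be passed to `wget` or any other download tool.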

### Assignment

So far in the course, we processed data for the years 2019 and 2020. Your task is to extend the existing flows to include data for the year 2021.

![homework datasets](../../../02-workflow-orchestration/images/homework.png)

As a hint, Kestra makes that process really easy:
1. You can leverage the backfill functionality in the [scheduled flow](../../../02-workflow-orchestration/flows/06_gcp_taxi_scheduled.yaml) to backfill the data for the year 2021. Just make sure to select the time period for which data exists i.e. from `2021-01-01` to `2021-07-31`. Also, make sure to do the same for both `yellow` and `green` taxi data (select the right service in the `taxi` input).
2. Alternatively, run the flow manually for each of the seven months of 2021 for both `yellow` and `green` taxi data. Challenge for you: find out how to loop over the combination of Year-Month and `taxi`-type using a `ForEach` task which triggers the flow for each combination using a `Subflow` task.
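
To make the scope of the backfill concrete, the sketch below enumerates the runs that either approach has to cover: two taxi types times seven months of 2021. It is plain Python, not Kestra syntax, and the function name `backfill_inputs` is hypothetical; the keys `taxi`, `year` and `month` mirror the flow inputs mentioned in the quiz.

```python
from itertools import product

TAXIS = ["yellow", "green"]
YEAR = "2021"
MONTHS = [f"{m:02d}" for m in range(1, 8)]  # January through July


def backfill_inputs():
    """Return one input set per (taxi, month) combination for 2021."""
    return [
        {"taxi": taxi, "year": YEAR, "month": month}
        for taxi, month in product(TAXIS, MONTHS)
    ]


runs = backfill_inputs()
print(len(runs), "executions needed")
for run in runs:
    print(run)
```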

### Quiz Questions

Complete the Quiz shown below. It’s a set of 6 multiple-choice questions to test your understanding of workflow orchestration, Kestra and ETL pipelines for data lakes and warehouses.

1) Within the execution for `Yellow` Taxi data for the year `2020` and month `12`: what is the uncompressed file size (i.e. the output file `yellow_tripdata_2020-12.csv` of the `extract` task)?
- 128.3 MiB
- 134.5 MiB
- 364.7 MiB
- 692.6 MiB

2) What is the rendered value of the variable `file` when the input `taxi` is set to `green`, `year` is set to `2020`, and `month` is set to `04` during execution?
- `{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv` 
- `green_tripdata_2020-04.csv`
- `green_tripdata_04_2020.csv`
- `green_tripdata_2020.csv`

3) How many rows are there for the `Yellow` Taxi data for all CSV files in the year 2020?
- 13,537,299
- 24,648,499
- 18,324,219
- 29,430,127

4) How many rows are there for the `Green` Taxi data for all CSV files in the year 2020?
- 5,327,301
- 936,199
- 1,734,051
- 1,342,034

5) How many rows are there for the `Yellow` Taxi data for the March 2021 CSV file?
- 1,428,092
- 706,911
- 1,925,152
- 2,561,031

6) How would you configure the timezone to New York in a Schedule trigger?
- Add a `timezone` property set to `EST` in the `Schedule` trigger configuration  
- Add a `timezone` property set to `America/New_York` in the `Schedule` trigger configuration
- Add a `timezone` property set to `UTC-5` in the `Schedule` trigger configuration
- Add a `location` property set to `New_York` in the `Schedule` trigger configuration  


## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw2
* Check the link above to see the due date

## Solution

Will be added after the due date


================================================
FILE: cohorts/2025/03-data-warehouse/DLT_upload_to_GCP.ipynb
================================================
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aC2QnhmKxpq1"
      },
      "source": [
        "**Please set up your credentials JSON as GCP_CREDENTIALS secrets**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "UsUZobVduL7l"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from google.colab import userdata\n",
        "\n",
        "os.environ[\"DESTINATION__CREDENTIALS\"] = userdata.get('GCP_CREDENTIALS')\n",
        "os.environ[\"BUCKET_URL\"] = \"gs://your_bucket_url\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "mPBzsEgyjsBo"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "# Install for production (a cell magic such as %%capture must be the first line of the cell)\n",
        "!pip install \"dlt[bigquery,gs]\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "evdUsDNbkCTk"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "# Install for testing (a cell magic such as %%capture must be the first line of the cell)\n",
        "!pip install \"dlt[duckdb]\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "lYh7r1mTf4uo"
      },
      "outputs": [],
      "source": [
        "import dlt\n",
        "import requests\n",
        "import pandas as pd\n",
        "from dlt.destinations import filesystem\n",
        "from io import BytesIO"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "76zT1PzAgs7A"
      },
      "source": [
        "Ingesting parquet files to GCS."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xya0215jsnsb"
      },
      "outputs": [],
      "source": [
        "# Define a dlt source to download and process Parquet files as resources\n",
        "@dlt.source(name=\"rides\")\n",
        "def download_parquet():\n",
        "    for month in range(1, 7):\n",
        "        file_name = f\"yellow_tripdata_2024-0{month}.parquet\"\n",
        "\n",
        "        url = f\"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-0{month}.parquet\"\n",
        "        response = requests.get(url)\n",
        "\n",
        "        df = pd.read_parquet(BytesIO(response.content))\n",
        "\n",
        "        # Yield the dataframe as a dlt resource named after the file\n",
        "        yield dlt.resource(df, name=file_name)\n",
        "\n",
        "# Initialize the pipeline\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"rides_pipeline\",\n",
        "    destination=filesystem(\n",
        "      layout=\"{schema_name}/{table_name}.{ext}\"\n",
        "    ),\n",
        "    dataset_name=\"rides_dataset\"\n",
        ")\n",
        "\n",
        "# Run the pipeline to write the Parquet files to the filesystem destination (the GCS bucket)\n",
        "load_info = pipeline.run(\n",
        "    download_parquet(),\n",
        "    loader_file_format=\"parquet\"\n",
        "    )\n",
        "\n",
        "# Print the results\n",
        "print(load_info)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S0310FT-gy_P"
      },
      "source": [
        "Ingesting data to Database"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1_3K97w1c2v2",
        "outputId": "4b2d26bf-2814-46fa-f80d-7a2e17417a95"
      },
      "outputs": [],
      "source": [
        "# Define a dlt resource to download and process Parquet files as single table\n",
        "@dlt.resource(name=\"rides\", write_disposition=\"replace\")\n",
        "def download_parquet():\n",
        "     for month in range(1,7):\n",
        "      url = f\"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-0{month}.parquet\"\n",
        "      response = requests.get(url)\n",
        "\n",
        "      df = pd.read_parquet(BytesIO(response.content))\n",
        "\n",
        "      # Return the dataframe as a dlt resource for ingestion\n",
        "      yield df\n",
        "\n",
        "# Initialize the pipeline\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"rides_pipeline\",\n",
        "    destination=\"duckdb\",  # Use DuckDB for testing\n",
        "    # destination=\"bigquery\",  # Use BigQuery for production\n",
        "    dataset_name=\"rides_dataset\"\n",
        ")\n",
        "\n",
        "# Run the pipeline to load Parquet data into DuckDB\n",
        "info = pipeline.run(download_parquet)\n",
        "\n",
        "# Print the results\n",
        "print(info)\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "gDcLjzLtooBV",
        "outputId": "74ff2de7-2f2e-41b9-a681-3dc5887f6eed"
      },
      "outputs": [],
      "source": [
        "import duckdb\n",
        "conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n",
        "\n",
        "# Set search path to the dataset\n",
        "conn.sql(f\"SET search_path = '{pipeline.dataset_name}'\")\n",
        "\n",
        "# Describe the dataset to see loaded tables\n",
        "res = conn.sql(\"DESCRIBE\").df()\n",
        "print(res)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "VVJy8JoerI2P",
        "outputId": "3f8c7fee-a9ee-4fd4-ec75-153ca60bd36f"
      },
      "outputs": [],
      "source": [
        "# provide a resource name to query a table of that name\n",
        "with pipeline.sql_client() as client:\n",
        "    with client.execute_query(f\"SELECT count(1) FROM rides\") as cursor:\n",
        "        data = cursor.df()\n",
        "print(data)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}


================================================
FILE: cohorts/2025/03-data-warehouse/homework.md
================================================
## Module 3 Homework

ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. 
This repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or 
shell commands), please include these directly in the README file of your repository.

<b><u>Important Note:</u></b> <p> For this homework we will be using the Yellow Taxi Trip Records for **January 2024 - June 2024 NOT the entire year of data** 
Parquet Files from the New York
City Taxi Data found here: </br> https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page </br>
If you are using an orchestrator such as Kestra, Mage, Airflow or Prefect, do not load the data into BigQuery using the orchestrator.</br> 
Stop after loading the files into a bucket. </br></br>

**Load Script:** You can manually download the parquet files and upload them to your GCS Bucket, or you can use the linked script [here](./load_yellow_taxi_data.py).<br>
You will simply need to generate a Service Account with GCS Admin Privileges or be authenticated with the Google SDK, and update the bucket name in the script to the name of your bucket.<br>
Nothing is foolproof, so make sure that all 6 files show in your GCS Bucket before beginning.</br><br>
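
One way to run that check is to compare a listing of the bucket against the six expected names. The sketch below is only an illustration: it assumes the objects keep the `yellow_tripdata_2024-MM.parquet` names that the linked load script uploads, and `missing_files` is a hypothetical helper that takes whatever listing you have (for example, names copied from the Cloud Console).

```python
# Expected object names, assuming the naming used by the linked load script.
EXPECTED_FILES = [f"yellow_tripdata_2024-{month:02d}.parquet" for month in range(1, 7)]


def missing_files(bucket_listing):
    """Return the expected file names that do not appear in the listing."""
    present = set(bucket_listing)
    return [name for name in EXPECTED_FILES if name not in present]


# Example: the June file has not been uploaded yet.
listing = EXPECTED_FILES[:5]
print(missing_files(listing))
```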

<u>NOTE:</u> You will need to use the PARQUET format option when creating an External Table</br>

<b>BIG QUERY SETUP:</b></br>
Create an external table using the Yellow Taxi Trip Records. </br>
Create a (regular/materialized) table in BQ using the Yellow Taxi Trip Records (do not partition or cluster this table). </br>
</p>
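
The two setup steps can be sketched as SQL statements assembled in Python. This is an illustration under assumptions, not the official solution: the project, dataset, table and bucket names are placeholders, the wildcard URI assumes the files keep their `yellow_tripdata_2024-MM.parquet` names, and both helper functions are hypothetical.

```python
def external_table_ddl(project, dataset, table, bucket):
    """Build the DDL for an external table over the Parquet files in a bucket."""
    uri = f"gs://{bucket}/yellow_tripdata_2024-*.parquet"  # assumed file naming
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{uri}']\n"
        ");"
    )


def regular_table_ddl(project, dataset, table, source_table):
    """Build the DDL for a regular table (no partitioning or clustering)."""
    return (
        f"CREATE OR REPLACE TABLE `{project}.{dataset}.{table}` AS\n"
        f"SELECT * FROM `{project}.{dataset}.{source_table}`;"
    )


# Placeholder names; replace them with your own project, dataset and bucket.
print(external_table_ddl("my-project", "zoomcamp", "yellow_tripdata_ext", "my-bucket"))
print(regular_table_ddl("my-project", "zoomcamp", "yellow_tripdata", "yellow_tripdata_ext"))
```

The printed statements can be pasted into the BigQuery console and adjusted as needed.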

## Question 1:
What is the count of records for the 2024 Yellow Taxi Data?
- 65,623
- 840,402
- 20,332,093
- 85,431,289


## Question 2:
Write a query to count the distinct number of PULocationIDs for the entire dataset on both the tables.</br> 
What is the **estimated amount** of data that will be read when this query is executed on the External Table and the Table?

- 18.82 MB for the External Table and 47.60 MB for the Materialized Table
- 0 MB for the External Table and 155.12 MB for the Materialized Table
- 2.14 GB for the External Table and 0MB for the Materialized Table
- 0 MB for the External Table and 0MB for the Materialized Table

## Question 3:
Write a query to retrieve the PULocationID from the table (not the external table) in BigQuery. Now write a query to retrieve the PULocationID and DOLocationID on the same table. Why is the estimated number of bytes different?
- BigQuery is a columnar database, and it only scans the specific columns requested in the query. Querying two columns (PULocationID, DOLocationID) requires 
reading more data than querying one column (PULocationID), leading to a higher estimated number of bytes processed.
- BigQuery duplicates data across multiple storage partitions, so selecting two columns instead of one requires scanning the table twice, 
doubling the estimated bytes processed.
- BigQuery automatically caches the first queried column, so adding a second column increases processing time but does not affect the estimated bytes scanned.
- When selecting multiple columns, BigQuery performs an implicit join operation between them, increasing the estimated bytes processed

## Question 4:
How many records have a fare_amount of 0?
- 128,210
- 546,578
- 20,188,016
- 8,333

## Question 5:
What is the best strategy to make an optimized table in BigQuery if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID? (Create a new table with this strategy.)
- Partition by tpep_dropoff_datetime and Cluster on VendorID
- Cluster on tpep_dropoff_datetime and Cluster on VendorID
- Cluster on tpep_dropoff_datetime Partition by VendorID
- Partition by tpep_dropoff_datetime and Partition by VendorID


## Question 6:
Write a query to retrieve the distinct VendorIDs between tpep_dropoff_datetime
2024-03-01 and 2024-03-15 (inclusive)</br>

Use the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 5 and note the estimated bytes processed. What are these values? </br>

Choose the answer which most closely matches.</br> 

- 12.47 MB for non-partitioned table and 326.42 MB for the partitioned table
- 310.24 MB for non-partitioned table and 26.84 MB for the partitioned table
- 5.87 MB for non-partitioned table and 0 MB for the partitioned table
- 310.31 MB for non-partitioned table and 285.64 MB for the partitioned table


## Question 7: 
Where is the data stored in the External Table you created?

- Big Query
- Container Registry
- GCP Bucket
- Big Table

## Question 8:
It is best practice in Big Query to always cluster your data:
- True
- False


## (Bonus: Not worth points) Question 9:
No Points: Write a `SELECT count(*)` query FROM the materialized table you created. How many bytes does it estimate will be read? Why?


## Submitting the solutions

Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw3

## Solution

Solution: https://www.youtube.com/watch?v=wpLmImIUlPg


================================================
FILE: cohorts/2025/03-data-warehouse/load_yellow_taxi_data.py
================================================
import os
import sys
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage
from google.api_core.exceptions import NotFound, Forbidden
import time


# Change this to your bucket name
BUCKET_NAME = "dezoomcamp_hw3_2025"

# If you authenticated through the GCP SDK you can comment out these two lines
CREDENTIALS_FILE = "gcs.json"
client = storage.Client.from_service_account_json(CREDENTIALS_FILE)
# If commented initialize client with the following
# client = storage.Client(project='zoomcamp-mod3-datawarehouse')


BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-"
MONTHS = [f"{i:02d}" for i in range(1, 7)]
DOWNLOAD_DIR = "."

CHUNK_SIZE = 8 * 1024 * 1024

os.makedirs(DOWNLOAD_DIR, exist_ok=True)

bucket = client.bucket(BUCKET_NAME)


def download_file(month):
    url = f"{BASE_URL}{month}.parquet"
    file_path = os.path.join(DOWNLOAD_DIR, f"yellow_tripdata_2024-{month}.parquet")

    try:
        print(f"Downloading {url}...")
        urllib.request.urlretrieve(url, file_path)
        print(f"Downloaded: {file_path}")
        return file_path
    except Exception as e:
        print(f"Failed to download {url}: {e}")
        return None


def create_bucket(bucket_name):
    try:
        # Get bucket details
        bucket = client.get_bucket(bucket_name)

        # Check if the bucket belongs to the current project
        project_bucket_ids = [bckt.id for bckt in client.list_buckets()]
        if bucket_name in project_bucket_ids:
            print(
                f"Bucket '{bucket_name}' exists and belongs to your project. Proceeding..."
            )
        else:
            print(
                f"A bucket with the name '{bucket_name}' already exists, but it does not belong to your project."
            )
            sys.exit(1)

    except NotFound:
        # If the bucket doesn't exist, create it
        bucket = client.create_bucket(bucket_name)
        print(f"Created bucket '{bucket_name}'")
    except Forbidden:
        # If the request is forbidden, it means the bucket exists but you don't have access to see details
        print(
            f"A bucket with the name '{bucket_name}' exists, but it is not accessible. Bucket name is taken. Please try a different bucket name."
        )
        sys.exit(1)


def verify_gcs_upload(blob_name):
    return storage.Blob(bucket=bucket, name=blob_name).exists(client)


def upload_to_gcs(file_path, max_retries=3):
    """Upload one file to the bucket; return True once the upload is verified."""
    blob_name = os.path.basename(file_path)
    blob = bucket.blob(blob_name)
    blob.chunk_size = CHUNK_SIZE

    # The bucket is created (or checked) once in the main block, not per upload.
    for attempt in range(max_retries):
        try:
            print(f"Uploading {file_path} to {BUCKET_NAME} (Attempt {attempt + 1})...")
            blob.upload_from_filename(file_path)
            print(f"Uploaded: gs://{BUCKET_NAME}/{blob_name}")

            if verify_gcs_upload(blob_name):
                print(f"Verification successful for {blob_name}")
                return True
            print(f"Verification failed for {blob_name}, retrying...")
        except Exception as e:
            print(f"Failed to upload {file_path} to GCS: {e}")

        time.sleep(5)

    print(f"Giving up on {file_path} after {max_retries} attempts.")
    return False


if __name__ == "__main__":
    create_bucket(BUCKET_NAME)

    with ThreadPoolExecutor(max_workers=4) as executor:
        file_paths = list(executor.map(download_file, MONTHS))

    downloaded = [path for path in file_paths if path]  # Drop failed downloads

    with ThreadPoolExecutor(max_workers=4) as executor:
        # Consume the results so that failures are not silently ignored
        results = list(executor.map(upload_to_gcs, downloaded))

    if len(downloaded) == len(MONTHS) and all(results):
        print("All files processed and verified.")
    else:
        print("Some files failed to download or upload. Check the log above.")
        sys.exit(1)


================================================
FILE: cohorts/2025/04-analytics-engineering/homework.md
================================================
## Module 4 Homework

For this homework, you will need the following datasets:
* [Green Taxi dataset (2019 and 2020)](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green)
* [Yellow Taxi dataset (2019 and 2020)](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/yellow)
* [For Hire Vehicle dataset (2019)](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv)

### Before you start

1. Make sure you, **at least**, have them in GCS with an External Table **OR** a Native Table - use whichever method you prefer to accomplish that (Workflow Orchestration with [pandas-gbq](https://cloud.google.com/bigquery/docs/samples/bigquery-pandas-gbq-to-gbq-simple), [dlt for gcs](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem), [dlt for BigQuery](https://dlthub.com/docs/dlt-ecosystem/destinations/bigquery), [gsutil](https://cloud.google.com/storage/docs/gsutil), etc.)
2. You should have exactly `7,778,101` records in your Green Taxi table
3. You should have exactly `109,047,518` records in your Yellow Taxi table
4. You should have exactly `43,244,696` records in your FHV table
5. Build the staging models for green/yellow as shown [here](../../../04-analytics-engineering/taxi_rides_ny/models/staging/)
6. Build the dimension/fact for taxi_trips joining with `dim_zones` as shown [here](../../../04-analytics-engineering/taxi_rides_ny/models/core/fact_trips.sql)

**Note**: If you don't have access to GCP, you can spin up a local Postgres instance and ingest the datasets above
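
A small sketch for comparing your own row counts against the expected totals listed above. How you obtain the actual counts (for example, a `SELECT COUNT(*)` per table) is up to you; the dictionary keys and the helper `count_mismatches` are hypothetical names chosen for illustration.

```python
# Expected totals taken from the checklist above.
EXPECTED_COUNTS = {
    "green": 7_778_101,
    "yellow": 109_047_518,
    "fhv": 43_244_696,
}


def count_mismatches(actual_counts):
    """Return {table: (expected, actual)} for every table whose count differs."""
    return {
        name: (expected, actual_counts.get(name))
        for name, expected in EXPECTED_COUNTS.items()
        if actual_counts.get(name) != expected
    }


# Example with a deliberately wrong Green Taxi count.
print(count_mismatches({"green": 7_778_100, "yellow": 109_047_518, "fhv": 43_244_696}))
```

An empty result means all three tables match the expected totals.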


### Question 1: Understanding dbt model resolution

Provided you've got the following sources.yaml
```yaml
version: 2

sources:
  - name: raw_nyc_tripdata
    database: "{{ env_var('DBT_BIGQUERY_PROJECT', 'dtc_zoomcamp_2025') }}"
    schema:   "{{ env_var('DBT_BIGQUERY_SOURCE_DATASET', 'raw_nyc_tripdata') }}"
    tables:
      - name: ext_green_taxi
      - name: ext_yellow_taxi
```

with the following env variables setup where `dbt` runs:
```shell
export DBT_BIGQUERY_PROJECT=myproject
export DBT_BIGQUERY_DATASET=my_nyc_tripdata
```

What does this .sql model compile to?
```sql
select * 
from {{ source('raw_nyc_tripdata', 'ext_green_taxi' ) }}
```

- `select * from dtc_zoomcamp_2025.raw_nyc_tripdata.ext_green_taxi`
- `select * from dtc_zoomcamp_2025.my_nyc_tripdata.ext_green_taxi`
- `select * from myproject.raw_nyc_tripdata.ext_green_taxi`
- `select * from myproject.my_nyc_tripdata.ext_green_taxi`
- `select * from dtc_zoomcamp_2025.raw_nyc_tripdata.green_taxi`


### Question 2: dbt Variables & Dynamic Models

Say you have to modify the following dbt_model (`fct_recent_taxi_trips.sql`) to enable Analytics Engineers to dynamically control the date range. 

- In development, you want to process only **the last 7 days of trips**
- In production, you need to process **the last 30 days** for analytics

```sql
select *
from {{ ref('fact_taxi_trips') }}
where pickup_datetime >= CURRENT_DATE - INTERVAL '30' DAY
```

What would you change to accomplish that in such a way that command line arguments take precedence over ENV_VARs, which take precedence over the DEFAULT value?

- Add `ORDER BY pickup_datetime DESC` and `LIMIT {{ var("days_back", 30) }}`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ var("days_back", 30) }}' DAY`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ env_var("DAYS_BACK", "30") }}' DAY`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ var("days_back", env_var("DAYS_BACK", "30")) }}' DAY`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ env_var("DAYS_BACK", var("days_back", "30")) }}' DAY`


### Question 3: dbt Data Lineage and Execution

Considering the data lineage below **and** that taxi_zone_lookup is the **only** materialization built (from a .csv seed file):

![image](./homework_q2.png)

Select the option that does **NOT** apply for materializing `fct_taxi_monthly_zone_revenue`:

- `dbt run`
- `dbt run --select +models/core/dim_taxi_trips.sql+ --target prod`
- `dbt run --select +models/core/fct_taxi_monthly_zone_revenue.sql`
- `dbt run --select +models/core/`
- `dbt run --select models/staging/+`


### Question 4: dbt Macros and Jinja

Consider you're dealing with sensitive data (e.g.: [PII](https://en.wikipedia.org/wiki/Personal_data)) that is **only available to your team and a select few individuals**, in the `raw layer` of your DWH (e.g.: a specific BigQuery dataset or PostgreSQL schema).

- Among other things, you decide to obfuscate/mask that data through your staging models, and make it available in a different schema (a `staging layer`) for other Data/Analytics Engineers to explore

- And **optionally**, yet another layer (`service layer`), where you'll build your dimension (`dim_`) and fact (`fct_`) tables (assuming the [Star Schema dimensional modeling](https://www.databricks.com/glossary/star-schema)) for Dashboarding and for Tech Product Owners/Managers

You decide to write a macro to wrap the logic around it:

```sql
{% macro resolve_schema_for(model_type) -%}

    {%- set target_env_var = 'DBT_BIGQUERY_TARGET_DATASET'  -%}
    {%- set stging_env_var = 'DBT_BIGQUERY_STAGING_DATASET' -%}

    {%- if model_type == 'core' -%} {{- env_var(target_env_var) -}}
    {%- else -%}                    {{- env_var(stging_env_var, env_var(target_env_var)) -}}
    {%- endif -%}

{%- endmacro %}
```

And use it in your staging, dim_ and fact_ models as:
```sql
{{ config(
    schema=resolve_schema_for('core'), 
) }}
```

With all that said, regarding the macro above, **select all statements that are true for the models using it**:
- Setting a value for `DBT_BIGQUERY_TARGET_DATASET` env var is mandatory, or it'll fail to compile
- Setting a value for `DBT_BIGQUERY_STAGING_DATASET` env var is mandatory, or it'll fail to compile
- When using `core`, it materializes in the dataset defined in `DBT_BIGQUERY_TARGET_DATASET`
- When using `stg`, it materializes in the dataset defined in `DBT_BIGQUERY_STAGING_DATASET`, or defaults to `DBT_BIGQUERY_TARGET_DATASET`
- When using `staging`, it materializes in the dataset defined in `DBT_BIGQUERY_STAGING_DATASET`, or defaults to `DBT_BIGQUERY_TARGET_DATASET`


## Serious SQL

Alright, in module 1, you had a SQL refresher, so now let's build on top of that with some serious SQL.

These are not meant to be easy - but they'll boost your SQL and Analytics skills to the next level.  
So, without further ado, let's get started...

You might want to add some new dimensions `year` (e.g.: 2019, 2020), `quarter` (1, 2, 3, 4), `year_quarter` (e.g.: `2019/Q1`, `2019/Q2`), and `month` (e.g.: 1, 2, ..., 12), **extracted from pickup_datetime**, to your `fct_taxi_trips` OR `dim_taxi_trips.sql` models to make filtering your queries easier.
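
As a rough illustration of how these dimensions relate to a timestamp, here is a minimal Python sketch. The function name `date_dimensions` is hypothetical and not part of the course material; in the homework itself you would derive the same values in SQL inside your dbt model.

```python
from datetime import datetime


def date_dimensions(pickup_datetime: datetime) -> dict:
    """Derive year, quarter, year_quarter and month from a pickup timestamp."""
    # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
    quarter = (pickup_datetime.month - 1) // 3 + 1
    return {
        "year": pickup_datetime.year,
        "quarter": quarter,
        "year_quarter": f"{pickup_datetime.year}/Q{quarter}",
        "month": pickup_datetime.month,
    }


# Illustrative timestamp only.
print(date_dimensions(datetime(2019, 10, 1, 0, 26, 2)))
```

The `year_quarter` format used here (`2019/Q4`) is one of the two styles mentioned above; pick one and use it consistently.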


### Question 5: Taxi Quarterly Revenue Growth

1. Create a new model `fct_taxi_trips_quarterly_revenue.sql`
2. Compute the Quarterly Revenues for each year based on `total_amount`
3. Compute the Quarterly YoY (Year-over-Year) revenue growth 
  * e.g.: In 2020/Q1, Green Taxi had -12.34% revenue growth compared to 2019/Q1
  * e.g.: In 2020/Q4, Yellow Taxi had +34.56% revenue growth compared to 2019/Q4

***Important Note: The Year-over-Year (YoY) growth percentages provided in the examples are purely illustrative. You will not be able to reproduce these exact values using the datasets provided for this homework.***
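
The YoY calculation in step 3 can be sketched in Python as follows. This is only an illustration of the formula, assuming growth is defined as `(current - previous) / previous`; the revenue figures and the `yoy_growth_pct` helper are made up and do not come from the homework datasets.

```python
def yoy_growth_pct(current: float, previous: float) -> float:
    """Year-over-year growth in percent, relative to the same quarter last year."""
    return (current - previous) / previous * 100.0


# Made-up quarterly revenues keyed by (year, quarter); NOT real homework data.
revenue = {
    (2019, 1): 200.0,
    (2019, 2): 400.0,
    (2020, 1): 150.0,
    (2020, 2): 100.0,
}

for (year, quarter), amount in sorted(revenue.items()):
    previous = revenue.get((year - 1, quarter))
    if previous is not None:
        growth = yoy_growth_pct(amount, previous)
        print(f"{year}/Q{quarter}: {growth:+.2f}%")
```

In SQL, the "same quarter of the previous year" lookup is typically done with a self-join or a window function over the quarterly totals.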

Considering the YoY Growth in 2020, which quarters had the best (or least bad) and worst results for green and yellow?

- green: {best: 2020/Q2, worst: 2020/Q1}, yellow: {best: 2020/Q2, worst: 2020/Q1}
- green: {best: 2020/Q2, worst: 2020/Q1}, yellow: {best: 2020/Q3, worst: 2020/Q4}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q2, worst: 2020/Q1}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q1, worst: 2020/Q2}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q3, worst: 2020/Q4}


### Question 6: P97/P95/P90 Taxi Monthly Fare

1. Create a new model `fct_taxi_trips_monthly_fare_p95.sql`
2. Filter out invalid entries (`fare_amount > 0`, `trip_distance > 0`, and `payment_type_description in ('Cash', 'Credit card')`)
3. Compute the **continuous percentile** of `fare_amount` partitioning by service_type, year and month
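
To make "continuous percentile" concrete, here is a small Python sketch of the usual linear-interpolation definition: the value at fractional rank `fraction * (n - 1)` in the sorted data. The function name `percentile_cont` is chosen here only for illustration, and the fares are made up; check your warehouse's documentation for the exact semantics of its own percentile function.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile via linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # Fractional position of the requested percentile in the sorted list.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    # Interpolate between the two surrounding values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# Made-up fares, for illustration only.
fares = [10.0, 20.0, 30.0, 40.0]
print(percentile_cont(fares, 0.5))
```

Unlike a discrete percentile, the result does not have to be a value that actually occurs in the data.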

Now, what are the values of `p97`, `p95`, `p90` for Green Taxi and Yellow Taxi, in April 2020?

- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 52.0, p95: 37.0, p90: 25.5}
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 31.5, p95: 25.5, p90: 19.0}
- green: {p97: 40.0, p95: 33.0, p90: 24.5}, yellow: {p97: 52.0, p95: 37.0, p90: 25.5}
- green: {p97: 40.0, p95: 33.0, p90: 24.5}, yellow: {p97: 31.5, p95: 25.5, p90: 19.0}
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 52.0, p95: 25.5, p90: 19.0}


### Question 7: Top #Nth longest P90 travel time Location for FHV

Prerequisites:
* Create a staging model for FHV Data (2019), and **DO NOT** add a deduplication step, just filter out the entries with `where dispatching_base_num is not null`
* Create a core model for FHV Data (`dim_fhv_trips.sql`) joining with `dim_zones`. Similar to what has been done [here](../../../04-analytics-engineering/taxi_rides_ny/models/core/fact_trips.sql)
* Add some new dimensions `year` (e.g.: 2019) and `month` (e.g.: 1, 2, ..., 12), based on `pickup_datetime`, to the core model to facilitate filtering for your queries

Now...
1. Create a new model `fct_fhv_monthly_zone_traveltime_p90.sql`
2. For each record in `dim_fhv_trips.sql`, compute the [timestamp_diff](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timestamp_diff) in seconds between dropoff_datetime and pickup_datetime - we'll call it `trip_duration` for this exercise
3. Compute the **continuous** `p90` of `trip_duration` partitioning by year, month, pickup_location_id, and dropoff_location_id
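
The `trip_duration` from step 2 is simply the dropoff time minus the pickup time, expressed in whole seconds. A minimal Python sketch of that arithmetic is shown below; the helper name and the timestamps are illustrative assumptions, and in the model itself you would use `timestamp_diff` in SQL.

```python
from datetime import datetime


def trip_duration_seconds(pickup: datetime, dropoff: datetime) -> int:
    """Seconds elapsed between pickup and dropoff (dropoff minus pickup)."""
    return int((dropoff - pickup).total_seconds())


# Illustrative timestamps only.
pickup = datetime(2019, 11, 1, 0, 26, 2)
dropoff = datetime(2019, 11, 1, 0, 39, 58)
print(trip_duration_seconds(pickup, dropoff))  # 13 min 56 s -> 836
```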

For the trips that started from `Newark Airport`, `SoHo`, and `Yorkville East`, **respectively**, in November 2019, what are the **dropoff_zones** with the 2nd longest p90 `trip_duration`?

- LaGuardia Airport, Chinatown, Garment District
- LaGuardia Airport, Park Slope, Clinton East
- LaGuardia Airport, Saint Albans, Howard Beach
- LaGuardia Airport, Rosedale, Bath Beach
- LaGuardia Airport, Yorkville East, Greenpoint


## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw4


## Solution 

* To be published after deadline


================================================
FILE: cohorts/2025/05-batch/homework.md
================================================
# Module 5 Homework

In this homework we'll put what we learned about Spark into practice.

For this homework we will be using the Yellow 2024-10 data from the official website: 

```bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet
```


## Question 1: Install Spark and PySpark

- Install Spark
- Run PySpark
- Create a local Spark session
- Execute `spark.version`

What's the output?

> [!NOTE]
> To install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/setup/pyspark.md)


## Question 2: Yellow October 2024

Read the October 2024 Yellow data into a Spark DataFrame.

Repartition the DataFrame to 4 partitions and save it to Parquet.

What is the average size (in MB) of the Parquet files (those ending with the `.parquet` extension) that were created? Select the answer which most closely matches.

- 6MB
- 25MB
- 75MB
- 100MB
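
One way to measure the average file size after Spark has written its output is a few lines of plain Python. This is a sketch under assumptions: the helper `average_parquet_size_mb` is hypothetical, 1 MB is taken as 1024 * 1024 bytes, and the output directory is whatever path you passed to Spark when saving.

```python
from pathlib import Path


def average_parquet_size_mb(output_dir: str) -> float:
    """Average size, in MB, of the files ending in .parquet inside output_dir."""
    # Only files with the .parquet suffix count; marker and checksum
    # files such as _SUCCESS or *.crc are ignored.
    files = [p for p in Path(output_dir).iterdir() if p.suffix == ".parquet"]
    if not files:
        raise ValueError(f"no .parquet files found in {output_dir}")
    total_bytes = sum(p.stat().st_size for p in files)
    return total_bytes / len(files) / (1024 * 1024)
```

Call it with the directory you wrote to, for example `average_parquet_size_mb("yellow/2024/10")` (a hypothetical path). Listing the directory with `ls -lh` gives the same information.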


## Question 3: Count records 

How many taxi trips were there on the 15th of October?

Consider only trips that started on the 15th of October.

- 85,567
- 105,567
- 125,567
- 145,567


## Question 4: Longest trip

What is the length of the longest trip in the dataset in hours?

- 122
- 142
- 162
- 182


## Question 5: User Interface

On which local port does Spark's User Interface, which shows the application's dashboard, run?

- 80
- 443
- 4040
- 8080


## Question 6: Least frequent pickup location zone

Load the zone lookup data into a temp view in Spark:

```bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
```

Using the zone lookup data and the Yellow October 2024 data, what is the name of the LEAST frequent pickup location Zone?

- Governor's Island/Ellis Island/Liberty Island
- Arden Heights
- Rikers Island
- Jamaica Bay


## Submitting the solutions

- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw5
- Deadline: See the website


================================================
FILE: cohorts/2025/06-streaming/homework/homework.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a63a4585-8a6b-4446-9b63-8c5d5d0b80fc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "from kafka import KafkaProducer\n",
    "\n",
    "def json_serializer(data):\n",
    "    return json.dumps(data).encode('utf-8')\n",
    "\n",
    "server = 'localhost:9092'\n",
    "\n",
    "producer = KafkaProducer(\n",
    "    bootstrap_servers=[server],\n",
    "    value_serializer=json_serializer\n",
    ")\n",
    "\n",
    "producer.bootstrap_connected()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "78bd28f9-66cb-4532-bf03-bb3fe90655b5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "--2025-03-07 19:27:06--  https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz\n",
      "Resolving github.com (github.com)... 140.82.121.3\n",
      "Connecting to github.com (github.com)|140.82.121.3|:443... connected.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/ea580e9e-555c-4bd0-ae73-43051d8e7c0b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250307%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250307T182706Z&X-Amz-Expires=300&X-Amz-Signature=6b8f2f603fe86515be24510f3f30bcf93c932b551769e5121fb0cbdf58e9b767&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dgreen_tripdata_2019-10.csv.gz&response-content-type=application%2Foctet-stream [following]\n",
      "--2025-03-07 19:27:07--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/ea580e9e-555c-4bd0-ae73-43051d8e7c0b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250307%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250307T182706Z&X-Amz-Expires=300&X-Amz-Signature=6b8f2f603fe86515be24510f3f30bcf93c932b551769e5121fb0cbdf58e9b767&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dgreen_tripdata_2019-10.csv.gz&response-content-type=application%2Foctet-stream\n",
      "Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...\n",
      "Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 8262584 (7.9M) [application/octet-stream]\n",
      "Saving to: 'green_tripdata_2019-10.csv.gz'\n",
      "\n",
      "     0K .......... .......... .......... .......... ..........  0% 1.08M 7s\n",
      "    50K .......... .......... .......... .......... ..........  1% 2.93M 5s\n",
      "   100K .......... .......... .......... .......... ..........  1% 3.15M 4s\n",
      "   150K .......... .......... .......... .......... ..........  2% 6.40M 3s\n",
      "   200K .......... .......... .......... .......... ..........  3% 5.41M 3s\n",
      "   250K .......... .......... .......... .......... ..........  3% 7.09M 3s\n",
      "   300K .......... .......... .......... .......... ..........  4% 4.84M 2s\n",
      "   350K .......... .......... .......... .......... ..........  4% 7.74M 2s\n",
      "   400K .......... .......... .......... .......... ..........  5% 20.4M 2s\n",
      "   450K .......... .......... .......... .......... ..........  6% 10.9M 2s\n",
      "   500K .......... .......... .......... .......... ..........  6% 5.03M 2s\n",
      "   550K .......... .......... .......... .......... ..........  7%  139M 2s\n",
      "   600K .......... .......... .......... .......... ..........  8% 11.8M 2s\n",
      "   650K .......... .......... .......... .......... ..........  8%  333M 1s\n",
      "   700K .......... .......... .......... .......... ..........  9% 6.83M 1s\n",
      "   750K .......... .......... .......... .......... ..........  9% 14.7M 1s\n",
      "   800K .......... .......... .......... .......... .......... 10% 4.41M 1s\n",
      "   850K .......... .......... .......... .......... .......... 11% 6.43M 1s\n",
      "   900K .......... .......... .......... .......... .......... 11%  292M 1s\n",
      "   950K .......... .......... .......... .......... .......... 12% 2.94M 1s\n",
      "  1000K .......... .......... .......... .......... .......... 13%  372M 1s\n",
      "  1050K .......... .......... .......... .......... .......... 13%  166M 1s\n",
      "  1100K .......... .......... .......... .......... .......... 14% 8.69M 1s\n",
      "  1150K .......... .......... .......... .......... .......... 14%  269M 1s\n",
      "  1200K .......... .......... .......... .......... .......... 15% 22.0M 1s\n",
      "  1250K .......... .......... .......... .......... .......... 16% 2.57M 1s\n",
      "  1300K .......... .......... .......... .......... .......... 16% 69.2M 1s\n",
      "  1350K .......... .......... .......... .......... .......... 17% 4.57M 1s\n",
      "  1400K .......... .......... .......... .......... .......... 17% 65.4M 1s\n",
      "  1450K .......... .......... .......... .......... .......... 18%  180M 1s\n",
      "  1500K .......... .......... .......... .......... .......... 19% 5.49M 1s\n",
      "  1550K .......... .......... .......... .......... .......... 19%  114M 1s\n",
      "  1600K .......... .......... .......... .......... .......... 20% 7.88M 1s\n",
      "  1650K .......... .......... .......... .......... .......... 21% 6.59M 1s\n",
      "  1700K .......... .......... .......... .......... .......... 21% 73.7M 1s\n",
      "  1750K .......... .......... .......... .......... .......... 22% 14.9M 1s\n",
      "  1800K .......... .......... .......... .......... .......... 22% 4.31M 1s\n",
      "  1850K .......... .......... .......... .......... .......... 23% 1.87M 1s\n",
      "  1900K .......... .......... .......... .......... .......... 24% 92.4M 1s\n",
      "  1950K .......... .......... .......... .......... .......... 24% 49.0M 1s\n",
      "  2000K .......... .......... .......... .......... .......... 25% 13.5M 1s\n",
      "  2050K .......... .......... .......... .......... .......... 26% 6.24M 1s\n",
      "  2100K .......... .......... .......... .......... .......... 26% 67.6M 1s\n",
      "  2150K .......... .......... .......... .......... .......... 27% 79.1M 1s\n",
      "  2200K .......... .......... .......... .......... .......... 27% 4.86M 1s\n",
      "  2250K .......... .......... .......... .......... .......... 28% 94.8M 1s\n",
      "  2300K .......... .......... .......... .......... .......... 29% 4.48M 1s\n",
      "  2350K .......... .......... .......... .......... .......... 29% 7.86M 1s\n",
      "  2400K .......... .......... .......... .......... .......... 30% 27.3M 1s\n",
      "  2450K .......... .......... .......... .......... .......... 30% 3.10M 1s\n",
      "  2500K .......... .......... .......... .......... .......... 31% 64.7M 1s\n",
      "  2550K .......... .......... .......... .......... .......... 32% 82.8M 1s\n",
      "  2600K .......... .......... .......... .......... .......... 32% 10.8M 1s\n",
      "  2650K .......... .......... .......... .......... .......... 33% 90.0M 1s\n",
      "  2700K .......... .......... .......... .......... .......... 34% 5.29M 1s\n",
      "  2750K .......... .......... .......... .......... .......... 34% 56.3M 1s\n",
      "  2800K .......... .......... .......... .......... .......... 35% 5.53M 1s\n",
      "  2850K .......... .......... .......... .......... .......... 35%  135M 1s\n",
      "  2900K .......... .......... .......... .......... .......... 36% 3.52M 1s\n",
      "  2950K .......... .......... .......... .......... .......... 37% 34.8M 1s\n",
      "  3000K .......... .......... .......... .......... .......... 37% 9.28M 1s\n",
      "  3050K .......... .......... .......... .......... .......... 38%  155M 1s\n",
      "  3100K .......... .......... .......... .......... .......... 39% 4.57M 1s\n",
      "  3150K .......... .......... .......... .......... .......... 39% 57.5M 1s\n",
      "  3200K .......... .......... .......... .......... .......... 40%  182M 1s\n",
      "  3250K .......... .......... .......... .......... .......... 40% 3.73M 1s\n",
      "  3300K .......... .......... .......... .......... .......... 41% 83.8M 1s\n",
      "  3350K .......... .......... .......... .......... .......... 42%  191M 1s\n",
      "  3400K .......... .......... .......... .......... .......... 42% 3.88M 1s\n",
      "  3450K .......... .......... .......... .......... .......... 43% 40.2M 1s\n",
      "  3500K .......... .......... .......... .......... .......... 43% 5.15M 1s\n",
      "  3550K .......... .......... .......... .......... .......... 44% 48.2M 1s\n",
      "  3600K .......... .......... .......... .......... .......... 45%  146M 1s\n",
      "  3650K .......... .......... .......... .......... .......... 45% 3.83M 1s\n",
      "  3700K .......... .......... .......... .......... .......... 46%  103M 1s\n",
      "  3750K .......... .......... .......... .......... .......... 47%  152M 1s\n",
      "  3800K .......... .......... .......... .......... .......... 47%  544M 1s\n",
      "  3850K .......... .......... .......... .......... .......... 48% 5.68M 0s\n",
      "  3900K .......... .......... .......... .......... .......... 48%  232M 0s\n",
      "  3950K .......... .......... .......... .......... .......... 49% 2.19M 0s\n",
      "  4000K .......... .......... .......... .......... .......... 50% 8.45M 0s\n",
      "  4050K .......... .......... .......... .......... .......... 50% 45.0M 0s\n",
      "  4100K .......... .......... .......... .......... .......... 51% 4.58M 0s\n",
      "  4150K .......... .......... .......... .......... .......... 52%  117M 0s\n",
      "  4200K .......... .......... .......... .......... .......... 52% 19.5M 0s\n",
      "  4250K .......... .......... .......... .......... .......... 53%  102M 0s\n",
      "  4300K .......... .......... .......... .......... .......... 53% 2.69M 0s\n",
      "  4350K .......... .......... .......... .......... .......... 54% 83.6M 0s\n",
      "  4400K .......... .......... .......... .......... .......... 55%  121M 0s\n",
      "  4450K .......... .......... .......... .......... .......... 55% 9.85M 0s\n",
      "  4500K .......... .......... .......... .......... .......... 56%  102M 0s\n",
      "  4550K .......... .......... .......... .......... .......... 57%  261M 0s\n",
      "  4600K .......... .......... .......... .......... .......... 57% 1.84M 0s\n",
      "  4650K .......... .......... .......... .......... .......... 58% 6.32M 0s\n",
      "  4700K .......... .......... .......... .......... .......... 58% 49.2M 0s\n",
      "  4750K .......... .......... .......... .......... .......... 59% 10.8M 0s\n",
      "  4800K .......... .......... .......... .......... .......... 60% 5.01M 0s\n",
      "  4850K .......... .......... .......... .......... .......... 60%  271M 0s\n",
      "  4900K .......... .......... .......... .......... .......... 61%  115M 0s\n",
      "  4950K .......... .......... .......... .......... .......... 61% 5.14M 0s\n",
      "  5000K .......... .......... .......... .......... .......... 62% 50.3M 0s\n",
      "  5050K .......... .......... .......... .......... .......... 63% 3.50M 0s\n",
      "  5100K .......... .......... .......... .......... .......... 63%  160M 0s\n",
      "  5150K .......... .......... .......... .......... .......... 64% 15.1M 0s\n",
      "  5200K .......... .......... .......... .......... .......... 65%  306M 0s\n",
      "  5250K .......... .......... .......... .......... .......... 65%  202M 0s\n",
      "  5300K .......... .......... .......... .......... .......... 66%  164M 0s\n",
      "  5350K .......... .......... .......... .......... .......... 66% 7.69M 0s\n",
      "  5400K .......... .......... .......... .......... .......... 67% 8.07M 0s\n",
      "  5450K .......... .......... .......... .......... .......... 68% 75.0M 0s\n",
      "  5500K .......... .......... .......... .......... .......... 68% 5.82M 0s\n",
      "  5550K .......... .......... .......... .......... .......... 69% 4.58M 0s\n",
      "  5600K .......... .......... .......... .......... .......... 70% 6.70M 0s\n",
      "  5650K .......... .......... .......... .......... .......... 70% 34.4M 0s\n",
      "  5700K .......... .......... .......... .......... .......... 71%  281M 0s\n",
      "  5750K .......... .......... .......... .......... .......... 71% 11.8M 0s\n",
      "  5800K .......... .......... .......... .......... .......... 72% 65.4M 0s\n",
      "  5850K .......... .......... .......... .......... .......... 73% 54.6M 0s\n",
      "  5900K .......... .......... .......... .......... .......... 73% 2.49M 0s\n",
      "  5950K .......... .......... .......... .......... .......... 74% 94.0M 0s\n",
      "  6000K .......... .......... .......... .......... .......... 74%  307M 0s\n",
      "  6050K .......... .......... .......... .......... .......... 75%  263M 0s\n",
      "  6100K .......... .......... .......... .......... .......... 76%  288M 0s\n",
      "  6150K .......... .......... .......... .......... .......... 76% 8.37M 0s\n",
      "  6200K .......... .......... .......... .......... .......... 77% 3.78M 0s\n",
      "  6250K .......... .......... .......... .......... .......... 78% 98.7M 0s\n",
      "  6300K .......... .......... .......... .......... .......... 78% 2.62M 0s\n",
      "  6350K .......... .......... .......... .......... .......... 79%  157M 0s\n",
      "  6400K .......... .......... .......... .......... .......... 79%  424M 0s\n",
      "  6450K .......... .......... .......... .......... .......... 80% 3.23M 0s\n",
      "  6500K .......... .......... .......... .......... .......... 81% 30.9M 0s\n",
      "  6550K .......... .......... .......... .......... .......... 81%  452M 0s\n",
      "  6600K .......... .......... .......... .......... .......... 82% 8.21M 0s\n",
      "  6650K .......... .......... .......... .......... .......... 83% 5.23M 0s\n",
      "  6700K .......... .......... .......... .......... .......... 83% 9.57M 0s\n",
      "  6750K .......... .......... .......... .......... .......... 84% 3.61M 0s\n",
      "  6800K .......... .......... .......... .......... .......... 84% 93.1M 0s\n",
      "  6850K .......... .......... .......... .......... .......... 85% 4.97M 0s\n",
      "  6900K .......... .......... .......... .......... .......... 86% 41.2M 0s\n",
      "  6950K .......... .......... .......... .......... .......... 86%  494M 0s\n",
      "  7000K .......... .......... .......... .......... .......... 87% 5.51M 0s\n",
      "  7050K .......... .......... .......... .......... .......... 87%  158M 0s\n",
      "  7100K .......... .......... .......... .......... .......... 88% 5.97M 0s\n",
      "  7150K .......... .......... .......... .......... .......... 89% 79.3M 0s\n",
      "  7200K .......... .......... .......... .......... .......... 89% 65.0M 0s\n",
      "  7250K .......... .......... .......... .......... .......... 90% 4.07M 0s\n",
      "  7300K .......... .......... .......... .......... .......... 91% 89.6M 0s\n",
      "  7350K .......... .......... .......... .......... .......... 91%  149M 0s\n",
      "  7400K .......... .......... .......... .......... .......... 92% 10.1M 0s\n",
      "  7450K .......... .......... .......... .......... .......... 92% 73.1M 0s\n",
      "  7500K .......... .......... .......... .......... .......... 93% 51.8M 0s\n",
      "  7550K .......... .......... .......... .......... .......... 94% 15.4M 0s\n",
      "  7600K .......... .......... .......... .......... .......... 94% 2.93M 0s\n",
      "  7650K .......... .......... .......... .......... .......... 95%  101M 0s\n",
      "  7700K .......... .......... .......... .......... .......... 96%  120M 0s\n",
      "  7750K .......... .......... .......... .......... .......... 96%  133M 0s\n",
      "  7800K .......... .......... .......... .......... .......... 97% 49.0M 0s\n",
      "  7850K .......... .......... .......... .......... .......... 97%  314M 0s\n",
      "  7900K .......... .......... .......... .......... .......... 98%  117M 0s\n",
      "  7950K .......... .......... .......... .......... .......... 99% 9.48M 0s\n",
      "  8000K .......... .......... .......... .......... .......... 99% 2.76M 0s\n",
      "  8050K .......... ........                                   100%  223M=0.9s\n",
      "\n",
      "2025-03-07 19:27:08 (9.10 MB/s) - 'green_tripdata_2019-10.csv.gz' saved [8262584/8262584]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "57fb14bf-f7f2-45a9-b918-d64203e5d802",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2b8b3ac1-e3fb-4713-9ccb-7c0fbfe4c017",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\alexe\\AppData\\Local\\Temp\\ipykernel_3424\\2667354967.py:1: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.\n",
      "  df = pd.read_csv('green_tripdata_2019-10.csv.gz')\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_csv('green_tripdata_2019-10.csv.gz')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "a0e8ab41-1520-46b1-b8fa-a3fedf170896",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>VendorID</th>\n",
       "      <th>lpep_pickup_datetime</th>\n",
       "      <th>lpep_dropoff_datetime</th>\n",
       "      <th>store_and_fwd_flag</th>\n",
       "      <th>RatecodeID</th>\n",
       "      <th>PULocationID</th>\n",
       "      <th>DOLocationID</th>\n",
       "      <th>passenger_count</th>\n",
       "      <th>trip_distance</th>\n",
       "      <th>fare_amount</th>\n",
       "      <th>extra</th>\n",
       "      <th>mta_tax</th>\n",
       "      <th>tip_amount</th>\n",
       "      <th>tolls_amount</th>\n",
       "      <th>ehail_fee</th>\n",
       "      <th>improvement_surcharge</th>\n",
       "      <th>total_amount</th>\n",
       "      <th>payment_type</th>\n",
       "      <th>trip_type</th>\n",
       "      <th>congestion_surcharge</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.0</td>\n",
       "      <td>2019-10-01 00:26:02</td>\n",
       "      <td>2019-10-01 00:39:58</td>\n",
       "      <td>N</td>\n",
       "      <td>1.0</td>\n",
       "      <td>112</td>\n",
       "      <td>196</td>\n",
       "      <td>1.0</td>\n",
       "      <td>5.88</td>\n",
       "      <td>18.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.3</td>\n",
       "      <td>19.30</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2019-10-01 00:18:11</td>\n",
       "      <td>2019-10-01 00:22:38</td>\n",
       "      <td>N</td>\n",
       "      <td>1.0</td>\n",
       "      <td>43</td>\n",
       "      <td>263</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.80</td>\n",
       "      <td>5.0</td>\n",
       "      <td>3.25</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.3</td>\n",
       "      <td>9.05</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2019-10-01 00:09:31</td>\n",
       "      <td>2019-10-01 00:24:47</td>\n",
       "      <td>N</td>\n",
       "      <td>1.0</td>\n",
       "      <td>255</td>\n",
       "      <td>228</td>\n",
       "      <td>2.0</td>\n",
       "      <td>7.50</td>\n",
       "      <td>21.5</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.3</td>\n",
       "      <td>22.80</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>2019-10-01 00:37:40</td>\n",
       "      <td>2019-10-01 00:41:49</td>\n",
       "      <td>N</td>\n",
       "      <td>1.0</td>\n",
       "      <td>181</td>\n",
       "      <td>181</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.90</td>\n",
       "      <td>5.5</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.3</td>\n",
       "      <td>6.80</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.0</td>\n",
       "      <td>2019-10-01 00:08:13</td>\n",
       "      <td>2019-10-01 00:17:56</td>\n",
       "      <td>N</td>\n",
       "      <td>1.0</td>\n",
       "      <td>97</td>\n",
       "      <td>188</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.52</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.50</td>\n",
       "      <td>0.5</td>\n",
       "      <td>2.26</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.3</td>\n",
       "      <td>13.56</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag  \\\n",
       "0       2.0  2019-10-01 00:26:02   2019-10-01 00:39:58                  N   \n",
       "1       1.0  2019-10-01 00:18:11   2019-10-01 00:22:38                  N   \n",
       "2       1.0  2019-10-01 00:09:31   2019-10-01 00:24:47                  N   \n",
       "3       1.0  2019-10-01 00:37:40   2019-10-01 00:41:49                  N   \n",
       "4       2.0  2019-10-01 00:08:13   2019-10-01 00:17:56                  N   \n",
       "\n",
       "   RatecodeID  PULocationID  DOLocationID  passenger_count  trip_distance  \\\n",
       "0         1.0           112           196              1.0           5.88   \n",
       "1         1.0            43           263              1.0           0.80   \n",
       "2         1.0           255           228              2.0           7.50   \n",
       "3         1.0           181           181              1.0           0.90   \n",
       "4         1.0            97           188              1.0           2.52   \n",
       "\n",
       "   fare_amount  extra  mta_tax  tip_amount  tolls_amount  ehail_fee  \\\n",
       "0         18.0   0.50      0.5        0.00           0.0        NaN   \n",
       "1          5.0   3.25      0.5        0.00           0.0        NaN   \n",
       "2         21.5   0.50      0.5        0.00           0.0        NaN   \n",
       "3          5.5   0.50      0.5        0.00           0.0        NaN   \n",
       "4         10.0   0.50      0.5        2.26           0.0        NaN   \n",
       "\n",
       "   improvement_surcharge  total_amount  payment_type  trip_type  \\\n",
       "0                    0.3         19.30           2.0        1.0   \n",
       "1                    0.3          9.05           2.0        1.0   \n",
       "2                    0.3         22.80           2.0        1.0   \n",
       "3                    0.3          6.80           2.0        1.0   \n",
       "4                    0.3         13.56           1.0        1.0   \n",
       "\n",
       "   congestion_surcharge  \n",
       "0                   0.0  \n",
       "1                   0.0  \n",
       "2                   0.0  \n",
       "3                   0.0  \n",
       "4                   0.0  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "d085b583-1609-41a9-a222-ff6ca495ee27",
   "metadata": {},
   "outputs": [],
   "source": [
    "columns = [\n",
    "    'lpep_pickup_datetime',\n",
    "    'lpep_dropoff_datetime',\n",
    "    'PULocationID',\n",
    "    'DOLocationID',\n",
    "    'passenger_count',\n",
    "    'trip_distance',\n",
    "    'tip_amount'\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "66e9f47c-9284-4760-8011-3a8f48aaa49f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df[columns]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7ae3f843-d428-43d2-9e47-7f9fb43acbad",
   "metadata": {},
   "outputs": [],
   "source": [
    "from time import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "f1ca1ac1-176c-4ccc-aa11-7e1cb5659d39",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm.auto import tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "0b3da4e1-2f1c-400f-bb67-82734c1193f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "messages = df.to_dict(orient='records')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "3bdc95d8-64e1-4819-a885-996813b4bf94",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "476386"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "d6f15929-e928-464d-afc1-690343f4f780",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4dffdeb2a0064e1d9bd02dff9f9c49f0",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/476386 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "topic_name = 'green-trips'\n",
    "\n",
    "for message in tqdm(messages):\n",
    "    producer.send(topic_name, value=message)\n",
    "\n",
    "producer.flush()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2409953c-e9dd-403d-a0d1-d8b883c23ef5",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}


================================================
FILE: cohorts/2025/06-streaming/homework.md
================================================
# Homework

In this homework, we're going to learn about streaming with PyFlink.

Instead of Kafka, we will use Redpanda, which is a drop-in
replacement for Kafka. It implements the same interface,
so we can use the Kafka library for Python to communicate
with it, as well as the Kafka connector in PyFlink.

For this homework we will be using the Taxi data:
- Green 2019-10 data from [here](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz)


## Setup

We need:

- Redpanda
- Flink Job Manager
- Flink Task Manager
- Postgres

It's the same setup as in the [pyflink module](../../../06-streaming/pyflink/), so go there and start docker-compose:

```bash
cd ../../../06-streaming/pyflink/
docker-compose up
```

(Add `-d` if you want to run in detached mode)

Visit http://localhost:8081 to see the Flink Job Manager.

Connect to Postgres with pgcli, pgAdmin, [DBeaver](https://dbeaver.io/) or any other tool.

The connection credentials are:

- Username `postgres`
- Password `postgres`
- Database `postgres`
- Host `localhost`
- Port `5432`

With pgcli, you'll need to run this to connect:

```bash
pgcli -h localhost -p 5432 -u postgres -d postgres
```

Run these queries to create the Postgres landing zone for the first events and windows:

```sql 
CREATE TABLE processed_events (
    test_data INTEGER,
    event_timestamp TIMESTAMP
);

CREATE TABLE processed_events_aggregated (
    event_hour TIMESTAMP,
    test_data INTEGER,
    num_hits INTEGER 
);
```

## Question 1: Redpanda version

Now let's find out the version of Redpanda.

For that, check the output of the command `rpk help` _inside the container_. The name of the container is `redpanda-1`.

Find out what you need to execute based on the `help` output.

What's the version, based on the output of the command you executed? (copy the entire version)


## Question 2. Creating a topic

Before we can send data to the Redpanda server, we
need to create a topic. We also do this with the `rpk`
command that we used previously to find the version of
Redpanda.

Read the output of `help` and, based on it, create a topic named `green-trips`.

What's the output of the command for creating a topic? Include the entire output in your answer.


## Question 3. Connecting to the Kafka server

We need to make sure we can connect to the server, so
that later we can send some data to its topics.

First, let's install the Kafka client library for Python (it's up to you
whether to use a separate virtual environment for that):

```bash
pip install kafka-python
```

You can start a Jupyter notebook in your solution folder or
create a script.

Let's try to connect to our server:

```python
import json

from kafka import KafkaProducer

def json_serializer(data):
    return json.dumps(data).encode('utf-8')

server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=json_serializer
)

producer.bootstrap_connected()
```

Provided that you can connect to the server, what's the output
of the last command?

## Question 4: Sending the Trip Data

Now we need to send the data to the `green-trips` topic.

Read the data, and keep only these columns:

* `lpep_pickup_datetime`
* `lpep_dropoff_datetime`
* `PULocationID`
* `DOLocationID`
* `passenger_count`
* `trip_distance`
* `tip_amount`

Now send all the data using this code:

```python
producer.send(topic_name, value=message)
```

Do this for each row (`message`) in the dataset. In this case, `message`
is a dictionary.

After sending all the messages, flush the data:

```python
producer.flush()
```

Use `from time import time` to measure the total time:

```python
from time import time

t0 = time()

# ... your code

t1 = time()
took = t1 - t0
```
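
The pieces above can be combined into one timed helper. This is only a sketch: `send_all` is a hypothetical name, and it assumes `producer` behaves like the `KafkaProducer` from Question 3 (it has `send` and `flush`) and that `messages` is a list of dictionaries.

```python
from time import time


def send_all(producer, topic_name, messages):
    """Send every message, flush, and return the elapsed seconds."""
    t0 = time()
    for message in messages:
        producer.send(topic_name, value=message)
    # flush() blocks until all buffered messages have been handed to the broker,
    # so it belongs inside the timed section.
    producer.flush()
    return time() - t0
```

For example, `took = send_all(producer, 'green-trips', messages)` would give the total time in seconds.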

How much time did it take to send the entire dataset and flush? 


## Question 5: Build a Sessionization Window (2 points)

Now we have the data in the Kafka stream. It's time to process it.

* Copy `aggregation_job.py` and rename it to `session_job.py`
* Have it read from `green-trips`, adjusting the schema to match the trip data
* Use a [session window](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/) with a gap of 5 minutes
* Use `lpep_dropoff_datetime` as your watermark with a 5-second tolerance
* Which pickup and drop-off locations have the longest unbroken streak of taxi trips?


## Submitting the solutions

- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw6
- Deadline: See the website


================================================
FILE: cohorts/2025/README.md
================================================
## Data Engineering Zoomcamp 2025 Cohort

* [Pre-launch Q&A stream](https://www.youtube.com/watch?v=DPnAOu2csYA)
* [Launch stream with course overview](https://www.youtube.com/watch?v=X8cEEwi8DTM)
* [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html)
* [Course Playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)
* [Cohort-specific playlist: only 2025 Live videos](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJZdpLpRHp7dg6EOx828q6y)


[**Module 1: Introduction & Prerequisites**](01-docker-terraform/)

* [Homework](01-docker-terraform/homework.md)


[**Module 2: Workflow Orchestration**](02-workflow-orchestration)

* [Homework](02-workflow-orchestration/homework.md)
* Office hours

[**Workshop 1: Data Ingestion**](workshops/dlt/README.md)

* Workshop with dlt
* [Homework](workshops/dlt/README.md)


[**Module 3: Data Warehouse**](03-data-warehouse)

* [Homework](03-data-warehouse/homework.md)


[**Module 4: Analytics Engineering**](04-analytics-engineering/)

* [Homework](04-analytics-engineering/homework.md)


[**Module 5: Batch processing**](05-batch/)

* [Homework](05-batch/homework.md)


[**Module 6: Stream Processing**](06-streaming)

* [Homework](06-streaming/homework.md)


[**Project**](project.md)

More information [here](project.md)


================================================
FILE: cohorts/2025/project.md
================================================
## Course Project

The goal of this project is to apply everything we learned
in this course and build an end-to-end data pipeline.

You will have two attempts to submit your project. If you don't have 
time to submit your project by the end of attempt #1 (you started the 
course late, you have vacation plans, life/work got in the way, etc.)
or you fail your first attempt, 
then you will have a second chance to submit your project as attempt
#2. 

There are only two attempts.

Remember that to pass the project, you must evaluate 3 peers. If you don't do that,
your project can't be considered complete.

To find the projects assigned to you, use the peer review assignments link 
and find your hash in the first column. You will see three rows: you need to evaluate 
each of these projects. For each project, you need to submit the form once,
so in total, you will make three submissions. 


### Submitting

#### Project Attempt #1

* Project: https://courses.datatalks.club/de-zoomcamp-2025/project/project1
* Review: https://courses.datatalks.club/de-zoomcamp-2025/project/project1/eval

#### Project Attempt #2

* Project: https://courses.datatalks.club/de-zoomcamp-2025/project/project2
* Review: https://courses.datatalks.club/de-zoomcamp-2025/project/project2/eval

> **Important**: update your "Certificate name" here: https://courses.datatalks.club/de-zoomcamp-2025/enrollment -
this is what we will use when generating certificates for you.

### Evaluation criteria

See [here](../../projects/README.md)


================================================
FILE: cohorts/2025/workshops/dlt/README.md
================================================
# Data ingestion with dlt

Homework: [dlt_homework.md](dlt_homework.md)

🎥 **Watch the workshop video**

[![Watch the workshop video](https://markdown-videos-api.jorgenkh.no/youtube/pgJWP_xqO1g)](https://www.youtube.com/watch?v=pgJWP_xqO1g "Watch the workshop video")

Welcome to this hands-on workshop, where you'll learn to build efficient and scalable data ingestion pipelines.

### **What will you learn in this workshop?**  

In this workshop, you’ll learn the core skills required to build and manage data pipelines:  
- **How to build robust, scalable, and self-maintaining pipelines**.  
- **Best practices**, like built-in data governance, for ensuring clean and reliable data flows.  
- **Incremental loading techniques** to refresh data quickly and cost-effectively.  
- **How to build a Data Lake** with dlt.

By the end of this workshop, you'll be able to build data pipelines like a senior data engineer — quickly, concisely, and with best practices baked in.


--- 

## 📂 Navigation & Resources

- Workshop:
  - [Workshop content](data_ingestion_workshop.md).
  - [Workshop Colab Notebook](https://colab.research.google.com/drive/1FiAHNFenM8RyptyTPtDTfqPCi5W6KX_V?usp=sharing).
- Homework:
  - [Homework Markdown](dlt_homework.md).
  - [Homework Colab Notebook](https://colab.research.google.com/drive/1plqdl33K_HkVx0E0nGJrrkEUssStQsW7).
- 🌐 [Official dlt Documentation](https://dlthub.com/docs/intro).
- 💬 Join our [Slack Community](https://dlthub.com/community).

---

## 📖 Course overview
This workshop is structured into three key parts:

1️⃣ **[Extracting Data](data_ingestion_workshop.md#extracting-data)** – Learn scalable data extraction techniques.  
2️⃣ **[Normalizing Data](data_ingestion_workshop.md#normalizing-data)** – Clean and structure data before loading.  
3️⃣ **[Loading & Incremental Updates](data_ingestion_workshop.md#loading-data)** – Efficiently load and update data.  

📌 **Find the full course file here**: [Course File](data_ingestion_workshop.md)  

---

## 👩‍🏫 Teacher

Welcome to the data ingestion workshop of the DataTalks.Club Data Engineering Zoomcamp!

I'm Violetta Mishechkina, Solutions Engineer at dltHub. 👋
- I’ve been working in the data field since 2018, with a background in machine learning.
- I started as a Data Scientist, training ML models and neural networks.
- Over time, I realized that in production, hitting the lowest RMSE isn’t as important as model size, infrastructure, and data quality - so I transitioned into MLOps.
- A year ago, I joined dltHub’s Customer Success team and discovered dlt, a Python library that automates 90% of tedious data engineering tasks.
- Now, I work closely with customers and partners to help them integrate and optimize dlt in production.
- I also collaborate with our development team as the voice of the customer, ensuring our product meets real-world data engineering needs.
- My experience across ML, MLOps, and data engineering gives me a practical, hands-on perspective on solving data challenges.

---

## Homework

- [Homework Markdown](dlt_homework.md).
- [Homework Colab Notebook](https://colab.research.google.com/drive/1plqdl33K_HkVx0E0nGJrrkEUssStQsW7).

--- 
## Next steps

As you are learning the various concepts of data engineering, 
consider creating a portfolio project that will further your own knowledge.

By demonstrating the ability to deliver end to end, you will have an easier time finding your first role. 
This will help regardless of whether your hiring manager reviews your project, largely because you will have a better 
understanding and will be able to talk the talk.

Here are some example projects that others did with dlt:
- Serverless dlt-dbt on cloud functions: [Article](https://docs.getdbt.com/blog/serverless-dlt-dbt-stack)
- Bird finder: [Part 1](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-i), [Part 2](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-ii)
- Event ingestion on GCP: [Article and repo](https://dlthub.com/docs/blog/streaming-pubsub-json-gcp)
- Event ingestion on AWS: [Article and repo](https://dlthub.com/docs/blog/dlt-aws-taktile-blog)
- Or see one of the many demos created by our working students: [Hacker news](https://dlthub.com/docs/blog/hacker-news-gpt-4-dashboard-demo), 
[GA4 events](https://dlthub.com/docs/blog/ga4-internal-dashboard-demo), 
[an E-Commerce](https://dlthub.com/docs/blog/postgresql-bigquery-metabase-demo), 
[Google Sheets](https://dlthub.com/docs/blog/google-sheets-to-data-warehouse-pipeline), 
[Motherduck](https://dlthub.com/docs/blog/dlt-motherduck-demo), 
[MongoDB + Holistics](https://dlthub.com/docs/blog/MongoDB-dlt-Holistics), 
[Deepnote](https://dlthub.com/docs/blog/deepnote-women-wellness-violence-tends), 
[Prefect](https://dlthub.com/docs/blog/dlt-prefect),
[PowerBI vs GoodData vs Metabase](https://dlthub.com/docs/blog/semantic-modeling-tools-comparison),
[Dagster](https://dlthub.com/docs/blog/dlt-dagster),
[Ingesting events via gcp webhooks](https://dlthub.com/docs/blog/dlt-webhooks-on-cloud-functions-for-event-capture),
[SAP to snowflake replication](https://dlthub.com/docs/blog/sap-hana-to-snowflake-demo-blog),
[Read emails and send summary to Slack with AI and Kestra](https://dlthub.com/docs/blog/dlt-kestra-demo-blog),
[Mode +dlt capabilities](https://dlthub.com/docs/blog/dlt-mode-blog),
[dbt on cloud functions](https://dlthub.com/docs/blog/dlt-dbt-runner-on-cloud-functions)
- If you want to use dlt in your project, [check this list of public APIs](https://dlthub.com/docs/blog/practice-api-sources)


If you create a personal project, consider submitting it to our blog - we will be happy to showcase it. Just drop us a line in the dlt Slack.


## **💛 If you enjoy dlt, support us!**  

* ⭐ **Give us a [GitHub Star](https://github.com/dlt-hub/dlt)!**  
* 💬 **Join our [Slack Community](https://dlthub.com/community)!**  
* 🚀 **Let’s build great data pipelines together!**  

---

# Community notes

Did you take notes? You can share them by creating a PR to this file!

* [Ingest Data to GCS by dlt from peatwan](https://github.com/peatwan/de-zoomcamp/tree/main/workshop/dlt/homework/load_to_gcs)
* Add your notes above this line

================================================
FILE: cohorts/2025/workshops/dlt/data_ingestion_workshop.md
================================================
# Data ingestion with dlt

* Sign up: https://lu.ma/quyfn4q8 (optional) 
* Homework: [dlt_homework.md](dlt_homework.md)

## **What is data ingestion?**  
Data ingestion is the process of **extracting** data from a source, transporting it to a suitable environment, and preparing it for use. This often includes **normalizing**, **cleaning**, and **adding metadata**.

---

### **“A wild dataset Magically appears!”**  

In many data science teams, data seems to appear out of nowhere — because an engineer loads it.  

For example, the well-known **NYC Taxi dataset** looks well-structured and ready to use, making it easy to query and analyze. However, not all datasets arrive in such a clean format.

- **Well-structured data** (with an explicit schema) can be used immediately.  
  - Examples: Parquet, Avro, or database tables where data types and structures are predefined.  
- **Unstructured or weakly typed data** (without a defined schema) often needs cleaning and formatting first.  
  - Examples: CSV, JSON, where fields might be inconsistent, nested or missing key details.  

💡 **What is a schema?**  
A schema defines the expected format and structure of data, including field names, data types, and relationships.  
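
To make this concrete, here is a minimal sketch of a schema written as a plain Python dictionary, with a helper that reports where a record deviates from it. The field names, `TRIP_SCHEMA`, and `schema_violations` are hypothetical and chosen only for illustration; real tools express schemas in richer ways.

```python
from datetime import datetime

# Hypothetical schema: field name -> expected Python type.
TRIP_SCHEMA = {
    "pickup_time": datetime,
    "passenger_count": int,
    "trip_distance": float,
}


def schema_violations(record, schema):
    """Return a list of human-readable problems; an empty list means the record fits."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems
```

A record read from a CSV file, where every value arrives as a string, would typically fail such a check until it is parsed and typed.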

---

### **Be the Magician! 😎**  

Since you're here to learn data engineering, **you** will be the one making datasets magically appear!  

To build effective pipelines, you need to master:  

✅ **Extracting** data from various sources (APIs, databases, files).  
✅ **Normalizing** data by transforming, cleaning, and defining schemas.  
✅ **Loading** data where it can be used (data warehouse, lake, or database).

---

### **Why are data pipelines so amazing?**  

Data pipelines are the backbone of modern data-driven organizations, transforming raw, scattered data into actionable insights. 
They ensure data flows seamlessly from its source to its final destination, where it can drive decision-making, analytics, and innovation. 
But pipelines don’t just move data — they enable an entire ecosystem of functionality that makes them indispensable.  

![pipes](img/pipes.jpg)

### **What makes data pipelines so essential?**  

1. **Collect**:  
   Data pipelines gather information from a variety of sources, such as databases, data streams, and applications. This ensures no data is overlooked.  
   - Example: Retrieving sales data from an online store or capturing user activity logs from an app.  

2. **Ingest**:  
   The collected data flows into an event queue, where it’s organized and prepared for the next steps.  
   - **Structured data** (like Parquet files or database tables) can be processed immediately.  
   - **Unstructured data** (like CSV or JSON files) often needs cleaning and normalization.  
   - Example: Cleaning a JSON response by standardizing its fields or formatting dates in a CSV file.  

3. **Store**:  
   Pipelines send the processed data to **data lakes**, **data warehouses**, or **data lakehouses** for efficient storage and easy access.  
   - Example: Storing marketing campaign data in a data warehouse to analyze its performance.  

4. **Compute**:  
   Data is processed either in **batches** (large chunks) or as **streams** (real-time updates) to make it ready for analysis.  
   - Example: Calculating monthly revenue or processing live stock market data.  

5. **Consume**:  
   Finally, the prepared data is delivered to users in forms they can act on:  
   - **Dashboards** for executives and analysts.  
   - **Self-service analytics tools** for teams exploring trends.  
   - **Machine learning models** for predictions and automation.  

---

### **Why are data engineers so important in this process?**  

Data engineers are the architects behind these pipelines. They don’t just build pipelines—they make sure they’re reliable, efficient, and scalable. Beyond pipeline development, data engineers:  
- **Optimize data storage** to keep costs low and performance high.  
- **Ensure data quality and integrity**, addressing duplicates, inconsistencies, and missing values.  
- **Implement governance** for secure, compliant, and well-managed data.  
- **Adapt data architectures** to meet the changing needs of the organization.  

Ultimately, their role is to strategically manage the entire **data lifecycle**, from collection to consumption.

---

### **What will you learn in this workshop?**  

In this workshop, you’ll learn the core skills required to build and manage data pipelines:  
- **How to build robust, scalable, and self-maintaining pipelines**.  
- **Best practices**, like built-in data governance, for ensuring clean and reliable data flows.  
- **Incremental loading techniques** to refresh data quickly and cost-effectively.  
- **How to build a Data Lake** with dlt.

By the end, you’ll not only understand why data pipelines are amazing, but you’ll also know how to create them with best practices to power your organization’s data-driven success.🚀

---
## **Extracting data**

Most of the data you’ll work with is stored behind an **API**, which is like a doorway to the data. Here are the most common types:  

- **RESTful APIs**: Provide records of data from business applications.  
  - Example: Getting a list of customers from a CRM system.  
- **File-based APIs**: Return secure file paths to bulk data like JSON or Parquet files stored in buckets.  
  - Example: Downloading monthly sales reports.  
- **Database APIs**: Connect to databases such as MongoDB or SQL databases, often returning data as JSON, the most common interchange format.  

As an engineer, you will need to build pipelines that “just work”.

Here’s what you need to consider during extraction to keep your pipelines from breaking and to keep them running smoothly:  

1. **Hardware limits**: Be mindful of memory (RAM) and storage (disk space). Overloading these can crash your system.  
2. **Network reliability**: Networks can fail! Always account for retries to make your pipelines more robust.  
   - Tip: Use libraries like `dlt` that have built-in retry mechanisms.  
3. **API rate limits**: APIs often restrict the number of requests you can make in a given time.  
   - Tip: Check the API documentation to understand its limits (e.g., [Zendesk](https://developer.zendesk.com/api-reference/introduction/rate-limits/), [Shopify](https://shopify.dev/docs/api/usage/rate-limits)).  

There are even more challenges to consider when working with APIs — such as **pagination and authentication**. Let’s explore how to handle these effectively when working with **REST APIs**.

### **Working with REST APIs**

REST APIs (Representational State Transfer APIs) are one of the most common ways to extract data. They allow you to retrieve structured data using simple HTTP requests. However, working with APIs comes with its own challenges.

#### **Common Challenges**

![rest_api](img/Rest_API.png)

#### **1. Rate limits**  
Many APIs **limit the number of requests** you can make within a certain time frame to prevent overloading their servers. If you exceed this limit, the API may **reject your requests** temporarily or even block you for a period.  

To avoid hitting these limits, we can:  
- **Monitor API rate limits** – Some APIs provide headers that tell you how many requests you have left.  
- **Pause requests when needed** – If we're close to the limit, we wait before making more requests.  
- **Implement automatic retries** – If a request fails due to rate limiting, we can wait and retry after some time.  

💡Some APIs provide a **retry-after** header, which tells you how long to wait before making another request. Always check the API documentation for best practices!
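
The "wait and retry" idea can be sketched as follows. This is a minimal illustration, not a production client: `get_with_retry` is a hypothetical helper, it assumes the API signals rate limiting with HTTP status 429, and it only understands a `Retry-After` value given in seconds (otherwise it falls back to exponential backoff).

```python
import time


def get_with_retry(fetch, max_retries=5, sleep=time.sleep):
    """Call `fetch()` until it returns a non-429 response or retries run out.

    `fetch` is any zero-argument callable that returns an object with
    `.status_code` and `.headers` (for example, a `requests.Response`).
    """
    for attempt in range(max_retries + 1):
        response = fetch()
        if response.status_code != 429:  # 429 = Too Many Requests
            return response
        if attempt == max_retries:
            break
        retry_after = response.headers.get("Retry-After")
        try:
            # Assumption: the header holds a number of seconds.
            delay = float(retry_after)
        except (TypeError, ValueError):
            # Header missing or not numeric: fall back to exponential backoff.
            delay = 2 ** attempt
        sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_retries} retries")
```

With `requests`, a call could look like `get_with_retry(lambda: requests.get(url, params={'page': 1}))`, where `url` is whatever endpoint you are querying.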

---

#### **2. Authentication**  
Many APIs require an **API key or token** to access data securely. Without authentication, requests may be limited or denied.  

🔐 **Types of Authentication in APIs:**  
- **API Keys** – A simple token included in the request header or URL.  
- **OAuth Tokens** – A more secure authentication method requiring user authorization.  
- **Basic Authentication** – Using a username and password (less common today).  

💡 Never share your API token publicly! Store it in environment variables or use a secure secrets manager.
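
As a small sketch of the environment-variable approach: the variable name `API_TOKEN`, the helper `build_auth_headers`, and the `Bearer` scheme are assumptions for illustration; check your API's documentation for the header it actually expects.

```python
import os


def build_auth_headers(env_var="API_TOKEN"):
    """Read a token from the environment and build request headers from it."""
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"Authorization": f"Bearer {token}"}
```

The returned dictionary can be passed as the `headers` argument of an HTTP client call, so the token never appears in the source code.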

----
#### **3. Pagination**

Many APIs return data in **chunks (or pages)** rather than sending everything at once. This prevents **overloading the server** and improves performance, especially for large datasets. To retrieve **all the data**, we need to make multiple requests and keep track of pages until we reach the last one.

📌 Example:

>In this example, we’ll request data from an API that serves the **NYC taxi dataset**.

For these purposes we created an API that can serve the data you are already familiar with. The API returns **1,000 records per page**, and we must request multiple pages to retrieve the full dataset.

```py
import requests

BASE_API_URL = "https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api"

page_number = 1
while True:
    params = {'page': page_number}
    response = requests.get(BASE_API_URL, params=params)
    page_data = response.json()

    if not page_data:
        break

    print(page_data)
    page_number += 1

    # limit the number of pages for testing
    if page_number > 2:
        break
```
What happens here:
- Starts at page 1 and makes a GET request to the API.
- Retrieves JSON data and checks if the page contains records.
- If data exists, prints it and moves to the next page.
- If the page is empty, stops requesting more data.

💡 Different APIs handle pagination differently (some use offsets, cursors, or tokens instead of page numbers). Always check the API documentation for the correct method!
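
For comparison, here is a hedged sketch of cursor-based pagination. The response shape (`items` and `next_cursor`) and the `fetch_page` callable are hypothetical; real APIs name these fields differently.

```python
def cursor_paginated_getter(fetch_page):
    """Yield pages from a cursor-based API.

    `fetch_page(cursor)` is a hypothetical callable that returns a dict
    shaped like {"items": [...], "next_cursor": "<token>" or None}.
    """
    cursor = None  # None means "start from the beginning"
    while True:
        payload = fetch_page(cursor)
        if payload["items"]:
            yield payload["items"]
        # The server tells us where to continue; no page arithmetic is needed.
        cursor = payload.get("next_cursor")
        if not cursor:
            break
```

The loop ends when the server stops returning a cursor, rather than when an empty page arrives.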

---

#### **4. Avoiding memory issues during extraction**  

To prevent your pipeline from crashing, you need to control memory usage.  

#### **Challenges with memory**  
- Many pipelines run on systems with limited memory, like serverless functions or shared clusters.  
- If you try to load all the data into memory at once, it can crash the entire system.  
- Even disk space can become an issue if you’re storing large amounts of data.  


#### **The solution: streaming data**  

**Streaming** means processing data in small chunks or events, rather than loading everything at once. This keeps memory usage low and ensures your pipeline remains efficient.

As a data engineer, you’ll use streaming to transfer data between buffers, such as:  
- from APIs to local files;  
- from Webhooks to event queues;  
- from Event queues (like Kafka) to storage buckets.
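
As an illustration of the first case (API to local file), the sketch below writes records to a JSON Lines file one page at a time. `stream_to_jsonl` is a hypothetical helper, and `pages` is assumed to be any iterable of lists of dictionaries, such as a generator that fetches one API page per step.

```python
import json


def stream_to_jsonl(pages, path):
    """Write records from an iterable of pages to a JSON Lines file.

    Only one page is held in memory at a time.
    Returns the number of records written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for page in pages:
            for record in page:
                f.write(json.dumps(record) + "\n")
                count += 1
    return count
```

Because `pages` is consumed lazily, memory usage depends on the page size rather than on the size of the whole dataset.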

---

### **Example of extracting data: Grabbing data from an API**

In this example, we’ll request data from an API that serves the **NYC taxi dataset**. For these purposes we created an API that can serve the data you are already familiar with.

#### **API documentation**:  
- **Data**: Comes in pages of 1,000 records.  
- **Pagination**: When there’s no more data, the API returns an empty page.  
- **Details**:  
  - **Method**: GET  
  - **URL**: `https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api`  
  - **Parameters**:  
    - `page`: Integer (page number), defaults to 1.  

Here’s how we design our requester:  
1. **Request page by page** until we hit an empty page. Since we don’t know how much data is behind the API, we must assume it could be as little as 1,000 records or as much as 10GB.
2. **Use a generator** to handle this efficiently and avoid loading all data into memory.  


```py
import requests

BASE_API_URL = "https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api"

def paginated_getter():
    page_number = 1
    while True:
        params = {'page': page_number}
        response = requests.get(BASE_API_URL, params=params)
        response.raise_for_status()
        page_json = response.json()
        print(f'Got page {page_number} with {len(page_json)} records')

        if page_json:
            yield page_json
            page_number += 1
        else:
            break


for page_data in paginated_getter():
    print(page_data)
```

In this approach to grabbing data from APIs, there are both pros and cons:  

✅ Pros: **Easy memory management** since the API returns data in small pages or events.  
❌ Cons: **Low throughput** because data transfer is limited by API constraints (rate limits, response time).


To simplify data extraction, use specialized tools that follow best practices like streaming — for example, [dlt (data load tool)](https://dlthub.com). It efficiently processes data while **keeping memory usage low** and **leveraging parallelism** for better performance.

### **Extracting data with dlt**

Extracting data from APIs manually requires handling
- **pagination**,
- **rate limits**,
- **authentication**,
- **errors**.
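
To give a sense of the manual work involved, here is a minimal sketch of just one of these concerns: retrying a failed request with exponential backoff. The helper name `call_with_retries` and its parameters are hypothetical, and this is not how dlt implements retries internally.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A real requester would wrap each page request in a helper like this, and would still need separate logic for pagination, rate limits, and authentication.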

Instead of writing custom scripts, **[dlt](https://dlthub.com/)** simplifies the process with a built-in **[REST API Client](https://dlthub.com/docs/general-usage/http/rest-client)**, making extraction **efficient, scalable, and reliable**.  

---

### **Why use dlt for extraction?**  

✅ **Built-in REST API support** – Extract data from APIs with minimal code.  
✅ **Automatic pagination handling** – No need to loop through pages manually.  
✅ **Manages Rate Limits & Retries** – Prevents exceeding API limits and handles failures.  
✅ **Streaming support** – Extracts and processes data without loading everything into memory.  
✅ **Seamless integration** – Works with **normalization and loading** in a single pipeline.  

![dlt](img/dlt.png)

### **Install dlt**

[Install](https://dlthub.com/docs/reference/installation) dlt with DuckDB as destination:

```shell
pip install "dlt[duckdb]"
```

### **Example of extracting data with dlt**  

Instead of manually writing pagination logic, let’s use **dlt’s [`RESTClient` helper](https://dlthub.com/docs/general-usage/http/rest-client)** to extract NYC taxi ride data:  
```py
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator


def paginated_getter():
    client = RESTClient(
        base_url="https://us-central1-dlthub-analytics.cloudfunctions.net",
        # Define pagination strategy - page-based pagination
        paginator=PageNumberPaginator(   # <--- Pages are numbered (1, 2, 3, ...)
            base_page=1,   # <--- Start from page 1
            total_path=None    # <--- No total count of pages provided by API, pagination should stop when a page contains no result items
        )
    )

    for page in client.paginate("data_engineering_zoomcamp_api"):    # <--- API endpoint for retrieving taxi ride data
        yield page   # <--- yield one page at a time to keep memory usage low

for page_data in paginated_getter():
    print(page_data)
```

**How dlt simplifies API extraction:**  

🔹 **No manual pagination** – dlt **automatically** fetches **all pages** of data.  
🔹 **Low memory usage** – Streams data **chunk by chunk**, avoiding RAM overflows.  
🔹 **Handles rate limits & retries** – Ensures requests are sent efficiently **without failures**.  
🔹 **Flexible destination support** – Load extracted data into **databases, warehouses, or data lakes**.

---

Well, you’ve successfully **extracted** the data — great! 🎉 But raw data isn’t always ready to use. Now, you need to **process**, **clean**, and **structure** it before it can be loaded into a data lake or data warehouse.


## **Normalizing data**

You often hear that data professionals spend most of their time **“cleaning” data** — but what does that actually mean?  

Data cleaning typically involves two key steps:  

1. **Normalizing data** – Structuring and standardizing data **without changing its meaning**.  
2. **Filtering data for a specific use case** – Selecting or modifying data **in a way that changes its meaning** to fit the analysis.

### **Data cleaning: more than just fixing errors**  

A big part of **data cleaning** is actually **metadata work** — ensuring data is structured and standardized so it can be used effectively.  

#### **Metadata tasks in data cleaning:**  

✅ **Add types** – Convert strings to numbers, timestamps, etc.  
✅ **Rename columns** – Ensure names follow a standard format (e.g., no special characters).  
✅ **Flatten nested dictionaries** – Bring values from nested dictionaries into the top-level row.  
✅ **Unnest lists/arrays** – Convert lists into **child tables** since they can’t be stored directly in a flat format.  
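
The first two tasks (adding types and renaming columns) can be sketched in plain Python. This is an illustrative assumption about what such a step might look like, not dlt's actual algorithm; the helper names and the order in which types are tried are hypothetical.

```python
import re
from datetime import datetime


def to_snake_case(name: str) -> str:
    """Normalize a column name, e.g. 'Trip_Pickup_DateTime' -> 'trip_pickup_date_time'."""
    # Insert an underscore between a lowercase letter/digit and an uppercase letter
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Replace any run of non-alphanumeric characters with a single underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()


def parse_value(text: str):
    """Try timestamp, then int, then float; otherwise keep the string."""
    try:
        return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        pass
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def add_types(record: dict) -> dict:
    """Rename every column and convert string values to richer types."""
    return {
        to_snake_case(key): parse_value(value) if isinstance(value, str) else value
        for key, value in record.items()
    }
```

For instance, `add_types({"Fare_Amt": "45.0"})` would produce a `fare_amt` column holding a float instead of a string.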

👉 **We’ll look at a practical example next, as these concepts are easier to understand with real data.**

---

### **Why prepare data? Why not use JSON directly?**  

While JSON is a great format for **data transfer**, it’s not ideal for analysis. Here’s why:  

❌ **No enforced schema** – We don’t always know what fields exist in a JSON document.  
❌ **Inconsistent data types** – A field like `age` might appear as `25`, `"twenty five"`, or `25.00`, which can break downstream applications.  
❌ **Hard to process** – If we need to group data by day, we must manually convert date strings to timestamps.  
❌ **Memory-heavy** – JSON requires reading the entire file into memory, unlike databases or columnar formats that allow scanning just the necessary fields.  
❌ **Slow for aggregation and search** – JSON is not optimized for quick lookups or aggregations like columnar formats (e.g., Parquet).  


JSON is great for **data exchange** but **not for direct analytical use**. To make data useful, we need to **normalize it** — flattening, typing, and structuring it for efficiency.

---

### **Normalization example**  

To understand what we’re working with, let’s look at a sample record from our API:

```py
item = page_data[0]
item
```
Output:
```py
{'End_Lat': 40.742963,
 'End_Lon': -73.980072,
 'Fare_Amt': 45.0,
 'Passenger_Count': 1,
 'Payment_Type': 'Credit',
 'Rate_Code': None,
 'Start_Lat': 40.641525,
 'Start_Lon': -73.787442,
 'Tip_Amt': 9.0,
 'Tolls_Amt': 4.15,
 'Total_Amt': 58.15,
 'Trip_Distance': 17.52,
 'Trip_Dropoff_DateTime': '2009-06-14 23:48:00',
 'Trip_Pickup_DateTime': '2009-06-14 23:23:00',
 'mta_tax': None,
 'store_and_forward': None,
 'surcharge': 0.0,
 'vendor_name': 'VTS'}
```

The data we retrieved from the API has **already been processed and unnested**, meaning that any **nested structures** (like dictionaries and lists) have been flattened, making it easier to store and query in a database or a dataframe. However, let’s imagine we originally received the **raw data** in a more complex format.

---

### **How was this data processed?**  

Before reaching this format, the raw data likely contained **nested structures** that had to be **flattened and transformed**.  

1️⃣ **Flattened nested coordinates:**  
   - Originally, the latitude and longitude values might have been nested like this:  
     ```json
     "coordinates": {
         "start": {"lat": 40.641525, "lon": -73.787442},
         "end": {"lat": 40.742963, "lon": -73.980072}
     }
     ```
   - These were **flattened** into `Start_Lat`, `Start_Lon`, `End_Lat`, and `End_Lon`.  

2️⃣ **Converted timestamps:**  
   - Originally, timestamps might have been stored as Unix timestamps or separate date/time fields:  
     ```json
     "Trip_Pickup": {"date": "2009-06-14", "time": "23:23:00"}
     ```
   - Now, they are **formatted as ISO datetime strings**:  
     ```json
     "Trip_Pickup_DateTime": "2009-06-14 23:23:00"
     ```

3️⃣ **Unnested passenger information:**  
   - The original structure might have included a nested list for passengers:  
     ```json
     "passengers": [
         {"name": "John", "rating": 4.9},
         {"name": "Jack", "rating": 3.9}
     ]
     ```
   - Since lists **cannot be stored directly in a database table**, they were likely **moved to a separate table**.
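
Done by hand, the flattening and unnesting steps above might look like the following sketch. The function names, the `__` separator, and the `parent_id` / `list_idx` column names are assumptions made for illustration only.

```python
def flatten(record: dict, parent_key: str = "", sep: str = "__") -> dict:
    """Flatten nested dicts into one level; lists are skipped (see `unnest`)."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        elif not isinstance(value, list):
            flat[column] = value
    return flat


def unnest(record: dict, list_key: str, parent_id: str) -> list[dict]:
    """Turn a list of dicts into child rows that point back to the parent row."""
    return [
        {**flatten(item), "parent_id": parent_id, "list_idx": idx}
        for idx, item in enumerate(record.get(list_key, []))
    ]
```

Each parent record then yields one flat row for the main table plus zero or more rows for a child table, linked through the parent identifier.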

💡 **However, real-world data is rarely this clean!** We often receive raw, nested, and inconsistent data. This is why the **normalization process** is so important—it **prepares** the data for efficient storage and analysis.  
**[dlt (data load tool)](https://dlthub.com/docs/intro)** simplifies the **normalization process**, automatically transforming raw data into a **structured, clean format** that is ready for storage and analysis.

---

### **Normalizing data with dlt**  

**Why use dlt for normalization?**  

✅ **Automatically detects schema** – No need to define column types manually.  
✅ **Flattens nested JSON** – Converts complex structures into table-ready formats.  
✅ **Handles data type conversion** – Converts dates, numbers, and booleans correctly.  
✅ **Splits lists into child tables** – Ensures relational integrity for better analysis.  
✅ **Schema evolution support** – Adapts to changes in data structure over time.  

---

### **Example**  

Let's assume we extracted the following raw NYC taxi ride data, which contains **nested dictionaries** and **lists**:

```py
data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "coordinates": {
            "start": {"lon": -73.787442, "lat": 40.641525},
            "end": {"lon": -73.980072, "lat": 40.742963}
        },
        "passengers": [
            {"name": "John", "rating": 4.9},
            {"name": "Jack", "rating": 3.9}
        ]
    }
]
```

### **How dlt normalizes this data automatically**  

Instead of manually flattening fields and extracting nested lists, we can **load it directly into dlt**:

```py
import dlt

# Define a dlt pipeline with automatic normalization
pipeline = dlt.pipeline(
    pipeline_name="ny_taxi_data",
    destination="duckdb",
    dataset_name="taxi_rides",
)

# Run the pipeline with raw nested data
info = pipeline.run(data, table_name="rides", write_disposition="replace")

# Print the load summary
print(info)

print(pipeline.last_trace)
```

---

### **What happens behind the scenes?**  

After running this pipeline, dlt automatically **transforms the data** into the following **normalized structure**:  

**Main table: `rides`**  

```py
pipeline.dataset(dataset_type="default").rides.df()
```

| vendor_name | record_hash                         | time__pickup              | time__dropoff             | coordinates__start__lon | coordinates__start__lat | coordinates__end__lon | coordinates__end__lat | _dlt_load_id      | _dlt_id        |
|-------------|------------------------------------|---------------------------|---------------------------|-------------------------|-------------------------|-----------------------|-----------------------|-------------------|---------------|
| VTS         | b00361a396177a9cb410ff61f20015ad  | 2009-06-14 23:23:00+00:00 | 2009-06-14 23:48:00+00:00 | -73.787442              | 40.641525               | -73.980072            | 40.742963            | 1738604244.2625916 | k+bnoLuti245ag |
  

This table **displays structured taxi ride data**, including **vendor details, timestamps, coordinates, and dlt metadata**. 

**Child table: `rides__passengers`**

```py
pipeline.dataset(dataset_type="default").rides__passengers.df()
```

| name  | rating | _dlt_parent_id    | _dlt_list_idx | _dlt_id        |
|-------|--------|------------------|--------------|---------------|
| John  | 4.9    | k+bnoLuti245ag    | 0            | 8ppDh+8gQ7SSHg |
| Jack  | 3.9    | k+bnoLuti245ag    | 1            | oQnWuvkgHhxlaA |


✅ **Nested structures were flattened** into separate columns.  
✅ **Lists were extracted into child tables**, preserving relationships.  
✅ **Timestamps were converted to the correct format.**  
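
The relationship between the two tables is carried by `_dlt_parent_id` in the child table, which matches `_dlt_id` in the parent. The sketch below reproduces a reduced version of the two tables in an in-memory SQLite database (a standard-library stand-in chosen so the snippet runs without DuckDB or dlt) and joins them back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced copies of the two tables shown above (only a few columns kept)
conn.execute("CREATE TABLE rides (vendor_name TEXT, _dlt_id TEXT)")
conn.execute(
    "CREATE TABLE rides__passengers ("
    "name TEXT, rating REAL, _dlt_parent_id TEXT, _dlt_list_idx INTEGER, _dlt_id TEXT)"
)
conn.execute("INSERT INTO rides VALUES ('VTS', 'k+bnoLuti245ag')")
conn.executemany(
    "INSERT INTO rides__passengers VALUES (?, ?, ?, ?, ?)",
    [
        ("John", 4.9, "k+bnoLuti245ag", 0, "8ppDh+8gQ7SSHg"),
        ("Jack", 3.9, "k+bnoLuti245ag", 1, "oQnWuvkgHhxlaA"),
    ],
)

# Re-attach each passenger to its ride through the parent/child keys
rows = conn.execute(
    """
    SELECT r.vendor_name, p.name, p.rating
    FROM rides AS r
    JOIN rides__passengers AS p ON p._dlt_parent_id = r._dlt_id
    ORDER BY p._dlt_list_idx
    """
).fetchall()
print(rows)
```

The same JOIN condition should apply when querying the tables that dlt created in DuckDB.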

---

### **Why dlt makes normalization easy**  

🔹  **No manual transformations needed** – Just load the raw data, and dlt does the rest!  
🔹 **Database-ready format** – Ensures clean, structured tables for easy querying.  
🔹 **Handles schema evolution** – Adapts to new fields automatically.  
🔹 **Scales effortlessly** – Works for small datasets and enterprise-scale pipelines.  

💡 With dlt, normalization happens automatically, so you can focus on insights instead of data wrangling.

---

## **Loading data**

Now that we’ve covered **extracting** and **normalizing** data, the final step is **loading** the data **into a destination**. This is where the processed data is stored, making it ready for querying, analysis, or further transformations.


### **How data loading happens without dlt**  

Before dlt, data engineers had to manually handle **schema validation, batch processing, error handling, and retries** for every destination. This process becomes especially complex when loading data into **data warehouses and data lakes**, where performance optimization, partitioning, and incremental updates are critical.

### **Example: Loading data into database without dlt**  
A basic pipeline requires:  
1. Setting up a database connection.  
2. Creating tables and defining schemas.  
3. Handling schema changes manually.  
4. Writing queries to insert/update data.

```py
import duckdb

# 1. Create a connection to a file-backed DuckDB database
conn = duckdb.connect("ny_taxi_manual.db")

# 2. Create the rides Table
# Since our dataset has nested structures, we must manually flatten it before inserting data.
conn.execute("""
CREATE TABLE IF NOT EXISTS rides (
    record_hash TEXT PRIMARY KEY,
    vendor_name TEXT,
    pickup_time TIMESTAMP,
    dropoff_time TIMESTAMP,
    start_lon DOUBLE,
    start_lat DOUBLE,
    end_lon DOUBLE,
    end_lat DOUBLE
);
""")

# 3. Insert Data Manually
# Since JSON data has nested fields, we need to extract and transform them before inserting them into DuckDB.
data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "coordinates": {
            "start": {"lon": -73.787442, "lat": 40.641525},
            "end": {"lon": -73.980072, "lat": 40.742963}
        }
    }
]

# Prepare data for insertion
flattened_data = [
    (
        ride["record_hash"],
        ride["vendor_name"],
        ride["time"]["pickup"],
        ride["time"]["dropoff"],
        ride["coordinates"]["start"]["lon"],
        ride["coordinates"]["start"]["lat"],
        ride["coordinates"]["end"]["lon"],
        ride["coordinates"]["end"]["lat"]
    )
    for ride in data
]

# Insert into DuckDB
conn.executemany("""
INSERT INTO rides (record_hash, vendor_name, pickup_time, dropoff_time, start_lon, start_lat, end_lon, end_lat)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", flattened_data)

print("Data successfully loaded into DuckDB!")


# 4. Query Data in DuckDB
# Now that the data is loaded, we can query it using DuckDB’s SQL engine.
df = conn.execute("SELECT * FROM rides").df()

conn.close()
```

Problems without dlt:

❌ **Schema management is manual** – If the schema changes, you need to update table structures manually.  
❌ **No automatic retries** – If the network fails, data may be lost.  
❌ **No incremental loading** – Every run reloads everything, making it slow and expensive.  
❌ **More code to maintain** – A simple pipeline quickly becomes complex.
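
To illustrate the first problem, here is a rough sketch of what manual schema evolution can involve: before inserting a record, check which of its fields the table lacks and add them. It uses `sqlite3` from the standard library as a stand-in for DuckDB, and the helper `add_missing_columns` is hypothetical. Every new column is typed as `TEXT` for simplicity, which a real pipeline would need to refine.

```python
import sqlite3


def add_missing_columns(conn: sqlite3.Connection, table: str, record: dict) -> list[str]:
    """Add a TEXT column for every key in `record` that `table` does not have yet.

    Table and column names are interpolated into SQL, so they must come
    from a trusted source.
    """
    # PRAGMA table_info returns one row per column; index 1 is the column name
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for column in record:
        if column not in existing:
            conn.execute(f'ALTER TABLE {table} ADD COLUMN "{column}" TEXT')
            added.append(column)
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (record_hash TEXT PRIMARY KEY, vendor_name TEXT)")

# A record arrives with a field the table has never seen
new_record = {"record_hash": "b0", "vendor_name": "VTS", "payment_type": "Credit"}
added_columns = add_missing_columns(conn, "rides", new_record)
print(added_columns)
```

Even this small helper ignores type inference, nested fields, and renamed columns, which is part of why hand-written loaders grow quickly.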

---

### **How dlt handles the load step automatically**  

With dlt, loading data **requires just a few lines of code** — schema inference, error handling, and incremental updates are all handled automatically!

### **Why use dlt for loading?**  

✅ **Supports multiple destinations** – Load data into **BigQuery, Redshift, Snowflake, Postgres, DuckDB, Parquet (S3, GCS)** and more.  
✅ **Optimized for performance** – Uses **batch loading, parallelism, and streaming** for fast and scalable data transfer.  
✅ **Schema-aware** – Ensures that **column names, data types, and structures match** the destination’s requirements.  
✅ **Incremental loading** – Avoids unnecessary reloading by **only inserting new or updated records**.  
✅ **Resilience & retries** – Automatically handles failures, ensuring data is loaded **without missing records**.

![dlt](img/dlt.png)

### **Example: Loading data into database with dlt**


To use the full power of dlt, it is better to wrap our API client in the `@dlt.resource` decorator. A resource denotes a logical grouping of data within a data source, typically holding data of similar structure and origin:

```py
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator


# Define the API resource for NYC taxi data
@dlt.resource(name="rides")   # <--- The name of the resource (will be used as the table name)
def ny_taxi():
    client = RESTClient(
        base_url="https://us-central1-dlthub-analytics.cloudfunctions.net",
        paginator=PageNumberPaginator(
            base_page=1,
            total_path=None
        )
    )

    for page in client.paginate("data_engineering_zoomcamp_api"):    # <--- API endpoint for retrieving taxi ride data
        yield page   # <--- yield data to manage memory


# define new dlt pipeline
pipeline = dlt.pipeline(destination="duckdb")

# run the pipeline with the new resource
load_info = pipeline.run(ny_taxi, write_disposition="replace")
print(load_info)

# explore loaded data
pipeline.dataset(dataset_type="default").rides.df()
```

**Done!** The data is now stored in **DuckDB**, with schema managed automatically!

---
### **Incremental Loading**  

Incremental loading allows us to update datasets by **loading only new or changed data**, instead of replacing the entire dataset. This makes pipelines **faster and more cost-effective** by reducing redundant data processing.  


### **How does incremental loading work?**  

Incremental loading works alongside two key concepts:  

- **Incremental extraction** – Only extracts the new or modified data rather than retrieving everything again.  
- **State tracking** – Keeps track of what has already been loaded, ensuring that only new data is processed.  

In dlt, **state** is stored in a **separate table** at the destination, allowing pipelines to track what has been processed.

🔹 **Want to learn more?** You can read about incremental extraction and state management in the [dlt documentation](https://dlthub.com/docs).  

---

### **Incremental loading methods in dlt**  

dlt provides two ways to load data incrementally:  

#### **1. Append (adding new records)**  

- Best for **immutable or stateless data**, such as taxi ride records.  
- Each run **adds new records** without modifying previous data.  
- Can also be used to create a **history of changes** (slowly changing dimensions).  

**Example:**  
- If taxi ride data is loaded daily, only **new rides** are added, rather than reloading the full history.  
- If tracking changes in a list of vehicles, **each version** is stored as a new row for auditing.  

---

#### **2. Merge (updating existing records)**  

- Best for **updating existing records** (stateful data).  
- Replaces old records with updated ones based on a **unique key**.  
- Useful for tracking **status changes**, such as payment updates.  

**Example:**  
- A taxi ride's **payment status** could change from `"booked"` to `"cancelled"`, requiring an update.  
- A **customer profile** might be updated with a new email or phone number.  
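
The difference between the two methods can be illustrated with plain Python lists standing in for a destination table. This is a conceptual sketch only: the functions and the `ride_id` key are hypothetical and do not reflect how dlt performs these operations at the destination.

```python
def append(table: list[dict], new_rows: list[dict]) -> list[dict]:
    """Append: keep every existing row and add the new ones."""
    return table + new_rows


def merge(table: list[dict], new_rows: list[dict], key: str) -> list[dict]:
    """Merge: a new row replaces the existing row with the same key."""
    by_key = {row[key]: row for row in table}
    for row in new_rows:
        by_key[row[key]] = row
    return list(by_key.values())


existing = [
    {"ride_id": 1, "payment_status": "booked"},
    {"ride_id": 2, "payment_status": "booked"},
]
updates = [
    {"ride_id": 1, "payment_status": "cancelled"},  # changed record
    {"ride_id": 3, "payment_status": "booked"},     # brand-new record
]

appended = append(existing, updates)           # 4 rows: ride 1 appears twice
merged = merge(existing, updates, "ride_id")   # 3 rows: ride 1 is updated
```

With append, both versions of ride 1 are kept, which is what makes it suitable for change history. With merge, only the latest version survives.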

---

### **Choosing between Append and Merge**  

| **Scenario**                      | **Use Append** | **Use Merge** |
|-----------------------------------|--------------|--------------|
| Immutable records (e.g., ride history) | ✅ Yes         | ❌ No        |
| Tracking historical changes (slowly changing dimensions) | ✅ Yes         | ❌ No        |
| Updating existing records (e.g., payment status) | ❌ No         | ✅ Yes        |
| Keeping full change history       | ✅ Yes         | ❌ No        |


### **Example: Incremental loading with dlt**

**The goal**: download only trips made after June 15, 2009, skipping the old ones.

Using `dlt`, we set up an [incremental filter](https://dlthub.com/docs/general-usage/incremental-loading#incremental-loading-with-a-cursor-field) to only fetch trips made after a certain date:

```python
cursor_date = dlt.sources.incremental("Trip_Dropoff_DateTime", initial_value="2009-06-15")
```

This tells `dlt`:
- **Start date**: June 15, 2009 (`initial_value`).
- **Field to track**: `Trip_Dropoff_DateTime` (our timestamp).

As you run the pipeline repeatedly, `dlt` will keep track of the latest `Trip_Dropoff_DateTime` value processed. It will skip records older than this date in future runs.
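
Conceptually, a cursor-based filter with state tracking behaves like the simplified sketch below. The function `load_incrementally` and the `state` dictionary are hypothetical; dlt's real implementation stores state differently and handles more cases, such as records that share the boundary value.

```python
def load_incrementally(
    records: list[dict], cursor_field: str, state: dict, initial_value: str
) -> list[dict]:
    """Return only records newer than the stored cursor, then advance the cursor."""
    last_value = state.get("last_value", initial_value)
    # 'YYYY-MM-DD HH:MM:SS' strings sort chronologically, so plain
    # string comparison is enough for this illustration
    new_records = [r for r in records if r[cursor_field] > last_value]
    if new_records:
        state["last_value"] = max(r[cursor_field] for r in new_records)
    return new_records


state = {}
batch = [
    {"Trip_Dropoff_DateTime": "2009-06-14 23:48:00"},
    {"Trip_Dropoff_DateTime": "2009-06-15 10:00:00"},
    {"Trip_Dropoff_DateTime": "2009-06-16 08:30:00"},
]

first_run = load_incrementally(batch, "Trip_Dropoff_DateTime", state, "2009-06-15")
second_run = load_incrementally(batch, "Trip_Dropoff_DateTime", state, "2009-06-15")
```

The first run keeps the two trips after June 15 and records the latest timestamp. The second run sees the same data and returns nothing.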

Let's make the data resource incremental using `dlt.sources.incremental`:

```py
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator


@dlt.resource(name="rides", write_disposition="append")
def ny_taxi(
    cursor_date=dlt.sources.incremental(
        "Trip_Dropoff_DateTime",   # <--- field to track, our timestamp
        initial_value="2009-06-15",   # <--- start date June 15, 2009
        )
    ):
    client = RESTClient(
        base_url="https://us-central1-dlthub-analytics.cloudfunctions.net",
        paginator=PageNumberPaginator(
            base_page=1,
            total_path=None
        )
    )

    for page in client.paginate("data_engineering_zoomcamp_api"):
        yield page
```

Finally, we run our pipeline and load the fresh taxi rides data:

```py
# define new dlt pipeline
pipeline = dlt.pipeline(pipeline_name="ny_taxi", destination="duckdb", dataset_name="ny_taxi_data")

# run the pipeline with the new resource
load_info = pipeline.run(ny_taxi)
print(pipeline.last_trace)
```


Only 5325 rows passed the incremental filter and were loaded into the `duckdb` destination. Let's take a look at the earliest date in the loaded data:

```py
with pipeline.sql_client() as client:
    res = client.execute_sql(
            """
            SELECT
            MIN(trip_dropoff_date_time)
            FROM rides;
            """
        )
    print(res)
```

Run the same pipeline again.

```py
# define new dlt pipeline
pipeline = dlt.pipeline(pipeline_name="ny_taxi", destination="duckdb", dataset_name="ny_taxi_data")


# run the pipeline with the new resource
load_info = pipeline.run(ny_taxi)
print(pipeline.last_trace)
```

The pipeline will detect that there are **no new records** based on the `Trip_Dropoff_DateTime` field and the incremental cursor. As a result, **no new data will be loaded** into the destination:
>0 load package(s) were loaded


💡 **With dlt, incremental loading is simple, scalable, and automatic!**

---

### **Example: Loading data into a Data Warehouse (BigQuery)**  
First, install the dependencies, define the source, then change the destination name and run the pipeline.

```shell
pip install "dlt[bigquery]"
```

Let's use our NY Taxi API and load data from the source into the destination.

```py
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator


@dlt.resource(name="rides", write_disposition="replace")
def ny_taxi():
    client = RESTClient(
        base_url="https://us-central1-dlthub-analytics.cloudfunctions.net",
        paginator=PageNumberPaginator(
            base_page=1,
            total_path=None
        )
    )

    for page in client.paginate("data_engineering_zoomcamp_api"):
        yield page
```


**Choosing a destination**

Switching between **data warehouses (BigQuery, Snowflake, Redshift)** and **data lakes (S3, Google Cloud Storage, Parquet files)** in dlt is straightforward: simply modify the `destination` parameter in your pipeline configuration.

For example:

```py
pipeline = dlt.pipeline(
    pipeline_name='taxi_data',
    destination='duckdb', # <--- to test pipeline locally
    dataset_name='taxi_rides',
)

pipeline = dlt.pipeline(
    pipeline_name='taxi_data',
    destination='bigquery', # <--- to run pipeline in production
    dataset_name='taxi_rides',
)
```

This flexibility allows you to easily transition from local development to production-grade environments.

> 💡 No need to rewrite your pipeline — dlt adapts automatically!

**Set Credentials**  

The next logical step is to [set credentials](https://dlthub.com/docs/general-usage/credentials/) using **dlt's TOML providers** or **environment variables (ENVs)**.

```py
import os
from google.colab import userdata

os.environ["DESTINATION__BIGQUERY__CREDENTIALS"] = userdata.get('BIGQUERY_CREDENTIALS')
```

Run the pipeline:
```py
pipeline = dlt.pipeline(
    pipeline_name="taxi_data",
    destination="bigquery",
    dataset_name="taxi_rides",
    dev_mode=True,
)

info = pipeline.run(ny_taxi)
print(info)
```

💡 **What’s different?**  
- **dlt automatically adapts the schema** to fit BigQuery.  
- **Partitioning & clustering** can be applied for performance optimization.  
- **Efficient batch loading** ensures scalability.

---

### **Example: Loading data into a Data Lake (Parquet on Local FS or S3)**  

**Why use a Data Lake?**  
- **Cost-effective storage** – Cheaper than traditional databases.   
- **Optimized for big data processing** – Works seamlessly with Spark, Databricks, and Presto.  
- **Easy scalability** – Store petabytes of data efficiently.  


The `filesystem` destination enables you to load data into **files stored locally** or in **cloud storage** solutions, making it an excellent choice for lightweight testing, prototyping, or file-based workflows.

Below is an **example** demonstrating how to use the `filesystem` destination to load data in **Parquet** format:

* Step 1: Set up a local bucket or cloud directory for storing files

```py
import os

os.environ["BUCKET_URL"] = "/content"
```

* Step 2: Define the data source (above)
* Step 3: Run the pipeline

```py
import dlt


pipeline = dlt.pipeline(
    pipeline_name='fs_pipeline',
    destination='filesystem', # <--- change destination to 'filesystem'
    dataset_name='fs_data',
)

load_info = pipeline.run(ny_taxi, loader_file_format="parquet") # <--- choose a file format: parquet, csv or jsonl
print(load_info)
```

Look at the files:

```shell
! ls fs_data/rides
```

Look at the loaded data:

```py
# explore loaded data
pipeline.dataset(dataset_type="default").rides.df()
```

#### **Table formats: [Delta tables & Iceberg](https://dlthub.com/docs/dlt-ecosystem/destinations/delta-iceberg)**

dlt supports writing **Delta** and **Iceberg** tables when using the `filesystem` destination.

**How it works:**

dlt uses the `deltalake` and `pyiceberg` libraries to write Delta and Iceberg tables, respectively. One or multiple Parquet files are prepared during the extract and normalize steps. In the load step, these Parquet files are exposed as an Arrow data structure and fed into `deltalake` or `pyiceberg`.

```shell
!pip install "dlt[pyiceberg]"
```

```py
pipeline = dlt.pipeline(
    pipeline_name='fs_pipeline',
    destination='filesystem', # <--- change destination to 'filesystem'
    dataset_name='fs_iceberg_data',
)

load_info = pipeline.run(
    ny_taxi,
    loader_file_format="parquet",
    table_format="iceberg",  # <--- choose a table format: delta or iceberg
)
print(load_info)
```

💡**Note:**

The open-source version of dlt supports basic **Iceberg** functionality, and the dltHub team is currently working on an **extended** and **more powerful** Iceberg integration.

[Join the waiting list to learn more about dlt+ and Iceberg.](https://info.dlthub.com/waiting-list)


---

## **What’s Next?**  

- **Try loading data into different [destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/)** – Test Postgres, Snowflake, or Parquet.  
- **Experiment with [incremental loading](https://dlthub.com/docs/general-usage/incremental-loading)** – Load only new records for better efficiency.  
- **Explore dlt’s [schema evolution](https://dlthub.com/docs/general-usage/schema-evolution)** – Automatically adjust to data structure changes.  
- **Join our [Slack community](https://dlthub.com/community)** to share your progress!  


With **dlt’s automated load step**, you get **effortless, scalable, and resilient data loading**—so you can focus on insights instead of pipeline maintenance. 🚀

---

### Extra homework 💻
* [Data ingestion with DLT to Bigquery from Sara Sabater](https://github.com/saraisab/Data_Engineer/blob/main/courses/DE_zoomcamp/Homework/DLT-Workshop/extra_homework/Data_ingestion_with_DLT_to_bigquery.ipynb).


================================================
FILE: cohorts/2025/workshops/dlt/dlt_homework.md
================================================
Original file is located at
    https://colab.research.google.com/drive/1plqdl33K_HkVx0E0nGJrrkEUssStQsW7

# **Workshop "Data Ingestion with dlt": Homework**

---

## **Dataset & API**

We’ll use **NYC Taxi data** via the same custom API from the workshop:

🔹 **Base API URL:**  
```
https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api
```
🔹 **Data format:** Paginated JSON (1,000 records per page).  
🔹 **API Pagination:** Stop when an empty page is returned.

## **Question 1: dlt Version**

1. **Install dlt**:

```
!pip install dlt[duckdb]
```

> Or choose a different bracket—`bigquery`, `redshift`, etc.—if you prefer another primary destination. For this assignment, we’ll still do a quick test with DuckDB.

2. **Check** the version:

```
!dlt --version
```

or:

```py
import dlt
print("dlt version:", dlt.__version__)
```

Provide the **version** you see in the output.

## **Question 2: Define & Run the Pipeline (NYC Taxi API)**

Use dlt to extract all pages of data from the API.

Steps:

1️⃣ Use the `@dlt.resource` decorator to define the API source.

2️⃣ Implement automatic pagination using dlt's built-in REST client.

3️⃣ Load the extracted data into DuckDB for querying.

```py
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator


# your code is here


pipeline = dlt.pipeline(
    pipeline_name="ny_taxi_pipeline",
    destination="duckdb",
    dataset_name="ny_taxi_data"
)
```

Load the data into DuckDB to test:
```py
load_info = pipeline.run(ny_taxi)
print(load_info)
```
Start a connection to your database using a native `duckdb` connection and look at which tables were generated:

```py
import duckdb
from google.colab import data_table
data_table.enable_dataframe_formatter()

# A database '<pipeline_name>.duckdb' was created in working directory so just connect to it

# Connect to the DuckDB database
conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")

# Set search path to the dataset
conn.sql(f"SET search_path = '{pipeline.dataset_name}'")

# Describe the dataset
conn.sql("DESCRIBE").df()

```

How many tables were created?

* 2
* 4
* 6
* 8

## **Question 3: Explore the loaded data**

Inspect the table `rides`:

```py
df = pipeline.dataset(dataset_type="default").rides.df()
df
```

What is the total number of records extracted?

* 2500
* 5000
* 7500
* 10000

## **Question 4: Trip Duration Analysis**

Run the SQL query below to:

* Calculate the average trip duration in minutes.

```py
with pipeline.sql_client() as client:
    res = client.execute_sql(
            """
            SELECT
            AVG(date_diff('minute', trip_pickup_date_time, trip_dropoff_date_time))
            FROM rides;
            """
        )
    # Prints the result rows; here, a single row holding the average
    print(res)
```

What is the average trip duration?

* 12.3049
* 22.3049
* 32.3049
* 42.3049

## **Submitting the solutions**

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/workshop1

## **Solution**

We will publish the solution here after the deadline.


================================================
FILE: cohorts/2025/workshops/dynamic_load_dlt.py
================================================
import json
import os
import toml
import requests
import dlt
from dlt.sources.filesystem import filesystem, read_parquet
from google.cloud import storage
import io
import pyarrow.parquet as pq

# Load the TOML file
# the TOML file should follow below format:
#[credentials]
#project_id = "your project id"
#private_key = "your service account key"
#client_email = "email"
config = toml.load("./.dlt/secrets.toml")

# Set environment variables
os.environ["CREDENTIALS__PROJECT_ID"] = config["credentials"]["project_id"]
os.environ["CREDENTIALS__PRIVATE_KEY"] = config["credentials"]["private_key"]
os.environ["CREDENTIALS__CLIENT_EMAIL"] = config["credentials"]["client_email"]

# Function to generate URLs based on user input for the date range and trip color
def generate_urls(color, start_year, end_year, start_month, end_month):
    base_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    urls = []

    # Generate the list of URLs based on the specified date range and color

    for year in range(start_year, end_year + 1):
        for month in range(start_month, end_month + 1):
            # Format the month to ensure two digits
            month_str = f"{month:02d}"
            url = f"{base_url}{color}_tripdata_{year}-{month_str}.parquet"
            #https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2020-01.parquet
            urls.append(url)

    return urls

# User input for time range and trip color
color = input("Enter color (green, yellow): ").lower()  
start_year = int(input("Enter the start year (e.g., 2019): "))
end_year = int(input("Enter the end year (e.g., 2022): "))
start_month = int(input("Enter the start month (1-12): "))
end_month = int(input("Enter the end month (1-12): "))

# Generate URLs based on user input
urls = generate_urls(color, start_year, end_year, start_month, end_month)


# Debug: Print generated URLs
print("Generated URLs:")
for url in urls:
    print(url)


dlt_method = input("Choose loading method: 1 for GCS -> Bigquery, 2 for Direct Web -> Bigquery: ")

if dlt_method == "1":

    # Initialize GCS client
    storage_client = storage.Client.from_service_account_json("gcs.json")
    bucket_name = input("Enter the GCS bucket name: ")  # Replace with your GCS bucket name
    bucket = storage_client.bucket(bucket_name)

    # Download files and upload them to GCS
    gcs_files = []
    for url in urls:
        file_name = url.split("/")[-1]  # Extract the file name from the URL
        gcs_blob = bucket.blob(file_name)

        print(f"Downloading {url} and uploading to GCS as {file_name}")
        response = requests.get(url)
        response.raise_for_status()  # Do not upload an HTTP error page as a Parquet file
        gcs_blob.upload_from_string(response.content)
        gcs_files.append(f"gs://{bucket_name}/{file_name}")

    @dlt.resource(name="rides", write_disposition="replace")
    def parquet_source():
        # Use filesystem to load files from GCS and apply read_parquet transformation
        files = filesystem(bucket_url=f"gs://{bucket_name}/", file_glob="*.parquet")
        reader = (files | read_parquet()).with_name("tripdata")

        # Iterate through the rows from the reader and yield them
        row_count = 0
        for row in reader:
            row_count += 1
            yield row
        print(f"Total rows yielded: {row_count}")

elif dlt_method == "2":
    # Alternative method: Streaming Parquet files directly from the web
    @dlt.resource(name="ny_taxi_dlt", write_disposition="replace")
    def paginated_getter():
        for url in urls:
            try:
                with requests.get(url, stream=True) as response:
                    response.raise_for_status()
                    buffer = io.BytesIO()
                    for chunk in response.iter_content(chunk_size=1024 * 1024):  # 1MB chunks
                        buffer.write(chunk)
                    buffer.seek(0)
                    table = pq.read_table(buffer)
                    print(f'Got data from {url} with {table.num_rows} records')
                    if table.num_rows > 0:
                        yield table
            except Exception as e:
                print(f"Failed to fetch data from {url}: {e}")

# Create the pipeline
pipeline = dlt.pipeline(
    pipeline_name="test_taxi",
    dataset_name=input("Enter the dataset name: "),
    destination="bigquery"
   # dev_mode=True
)

# Run the pipeline with either method
if dlt_method == "1":
    info = pipeline.run(parquet_source())
elif dlt_method == "2":
    info = pipeline.run(paginated_getter())
else:
    print("Invalid selection")
    exit()

print(info)

================================================
FILE: cohorts/2026/01-docker-terraform/homework.md
================================================
# Module 1 Homework: Docker & SQL

In this homework we'll prepare the environment and practice
Docker and SQL.

When submitting your homework, you will also need to include
a link to your GitHub repository or other public code-hosting
site.

This repository should contain the code for solving the homework.

If your solution consists of SQL or shell commands rather than
code files (e.g. Python files), include them directly in
the README file of your repository.


## Question 1. Understanding Docker images

Run docker with the `python:3.13` image. Use an entrypoint `bash` to interact with the container.

What's the version of `pip` in the image?

- 25.3
- 24.3.1
- 24.2.1
- 23.3.1


## Question 2. Understanding Docker networking and docker-compose

Given the following `docker-compose.yaml`, what is the `hostname` and `port` that pgadmin should use to connect to the postgres database?

```yaml
services:
  db:
    container_name: postgres
    image: postgres:17-alpine
    environment:
      POSTGRES_USER: 'postgres'
      POSTGRES_PASSWORD: 'postgres'
      POSTGRES_DB: 'ny_taxi'
    ports:
      - '5433:5432'
    volumes:
      - vol-pgdata:/var/lib/postgresql/data

  pgadmin:
    container_name: pgadmin
    image: dpage/pgadmin4:latest
    environment:
      PGADMIN_DEFAULT_EMAIL: "pgadmin@pgadmin.com"
      PGADMIN_DEFAULT_PASSWORD: "pgadmin"
    ports:
      - "8080:80"
    volumes:
      - vol-pgadmin_data:/var/lib/pgadmin

volumes:
  vol-pgdata:
    name: vol-pgdata
  vol-pgadmin_data:
    name: vol-pgadmin_data
```

- postgres:5433
- localhost:5432
- db:5433
- postgres:5432
- db:5432

If multiple answers are correct, select any of them.


## Prepare the Data

Download the green taxi trips data for November 2025:

```bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet
```

You will also need the dataset with zones:

```bash
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv
```

## Question 3. Counting short trips

For the trips in November 2025 (lpep_pickup_datetime between '2025-11-01' and '2025-12-01', exclusive of the upper bound), how many trips had a `trip_distance` of less than or equal to 1 mile?

- 7,853
- 8,007
- 8,254
- 8,421
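
The date condition in this question is a half-open interval: the lower bound is included and the upper bound is excluded. A minimal sketch of that check, using only the Python standard library and a few made-up timestamps (the sample values are invented for illustration and are not taken from the dataset):

```python
from datetime import datetime

# Half-open interval [START, END): START is included, END is excluded
START = datetime(2025, 11, 1)
END = datetime(2025, 12, 1)


def in_november_2025(pickup: datetime) -> bool:
    """Return True when the pickup time falls inside [START, END)."""
    return START <= pickup < END


# Invented boundary cases: one second before, the first instant,
# the last second, and the excluded upper bound itself
sample_pickups = [
    datetime(2025, 10, 31, 23, 59, 59),
    datetime(2025, 11, 1, 0, 0, 0),
    datetime(2025, 11, 30, 23, 59, 59),
    datetime(2025, 12, 1, 0, 0, 0),
]

kept = [ts for ts in sample_pickups if in_november_2025(ts)]
print(kept)
```

The same idea carries over to SQL by combining `>=` on the lower bound with `<` on the upper bound. Note that SQL's `BETWEEN` includes both bounds, so it does not express this interval directly.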


## Question 4. Longest trip for each day

Which was the pick up day with the longest trip distance? Only consider trips with `trip_distance` less than 100 miles (to exclude data errors).

Use the pick up time for your calculations.

- 2025-11-14
- 2025-11-20
- 2025-11-23
- 2025-11-25


## Question 5. Biggest pickup zone

Which was the pickup zone with the largest `total_amount` (sum of all trips) on November 18th, 2025?

- East Harlem North
- East Harlem South
- Morningside Heights
- Forest Hills


## Question 6. Largest tip

For the passengers picked up in the zone named "East Harlem North" in November 2025, which was the drop off zone that had the largest tip?

Note: it's `tip` , not `trip`. We need the name of the zone, not the ID.

- JFK Airport
- Yorkville West
- East Harlem North
- LaGuardia Airport


## Terraform

In this section of the homework, we'll prepare the environment by creating resources in GCP with Terraform.

Install Terraform on your GCP VM, laptop, or GitHub Codespace.
Copy the files from the course repo
[here](../../../01-docker-terraform/terraform/terraform) to your VM/Laptop/GitHub Codespace.

Modify the files as necessary to create a GCP Bucket and Big Query Dataset.


## Question 7. Terraform Workflow

Which of the following sequences, respectively, describes the workflow for:
1. Downloading the provider plugins and setting up the backend
2. Generating proposed changes and auto-executing the plan
3. Removing all resources managed by Terraform

Answers:
- terraform import, terraform apply -y, terraform destroy
- teraform init, terraform plan -auto-apply, terraform rm
- terraform init, terraform run -auto-approve, terraform destroy
- terraform init, terraform apply -auto-approve, terraform destroy
- terraform import, terraform apply -y, terraform rm


## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw1


## Learning in Public

We encourage everyone to share what they learned. This is called "learning in public".

### Why learn in public?

- Accountability: Sharing your progress creates commitment and motivation to continue
- Feedback: The community can provide valuable suggestions and corrections
- Networking: You'll connect with like-minded people and potential collaborators
- Documentation: Your posts become a learning journal you can reference later
- Opportunities: Employers and clients often discover talent through public learning

You can read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

Don't worry about being perfect. Everyone starts somewhere, and people love following genuine learning journeys!

### Example post for LinkedIn

```
🚀 Week 1 of Data Engineering Zoomcamp by @DataTalksClub complete!

Just finished Module 1 - Docker & Terraform. Learned how to:

✅ Containerize applications with Docker and Docker Compose
✅ Set up PostgreSQL databases and write SQL queries
✅ Build data pipelines to ingest NYC taxi data
✅ Provision cloud infrastructure with Terraform

Here's my homework solution: <LINK>

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

### Example post for Twitter/X


```
🐳 Module 1 of Data Engineering Zoomcamp done!

- Docker containers
- Postgres & SQL
- Terraform & GCP
- NYC taxi data pipeline

My solution: <LINK>

Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```


================================================
FILE: cohorts/2026/02-workflow-orchestration/homework.md
================================================
## Module 2 Homework

ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include it directly in the README file of your repository.

> In case you don't get one option exactly, select the closest one 

For the homework, we'll be working with the _green_ taxi dataset located here:

`https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download`

To get a `wget`-able link, use this prefix (note that the link itself gives 404):

`https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/`
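
As an illustration of how the prefix can be combined with a file name, here is a minimal sketch. The `green_tripdata_<year>-<month>` naming follows the file names used elsewhere in this homework; the `.csv.gz` extension and the helper name `green_file_url` are assumptions, so check the release page for the actual asset names before relying on them.

```python
PREFIX = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/"


def green_file_url(year: int, month: int, extension: str = ".csv.gz") -> str:
    """Build a download URL for one month of green taxi data.

    The extension default is an assumption about how the release
    assets are named; pass a different value if needed.
    """
    return f"{PREFIX}green_tripdata_{year}-{month:02d}{extension}"


print(green_file_url(2021, 3))
```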

### Assignment

So far in the course, we processed data for the years 2019 and 2020. Your task is to extend the existing flows to include data for the year 2021.

![homework datasets](../../../02-workflow-orchestration/images/homework.png)

As a hint, Kestra makes that process really easy:
1. You can leverage the backfill functionality in the [scheduled flow](../../../02-workflow-orchestration/flows/09_gcp_taxi_scheduled.yaml) to backfill the data for the year 2021. Just make sure to select the time period for which data exists i.e. from `2021-01-01` to `2021-07-31`. Also, make sure to do the same for both `yellow` and `green` taxi data (select the right service in the `taxi` input).
2. Alternatively, run the flow manually for each of the seven months of 2021 for both `yellow` and `green` taxi data. Challenge for you: find out how to loop over the combination of Year-Month and `taxi`-type using `ForEach` task which triggers the flow for each combination using a `Subflow` task.
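
To make the scope of the second option concrete, the sketch below enumerates every combination of `taxi` type and month that needs an execution. It is plain Python rather than a Kestra flow and does not trigger anything; the function name `flow_inputs` is hypothetical, while the input names `taxi`, `year`, and `month` mirror the flow inputs mentioned in this homework.

```python
from itertools import product

TAXIS = ["yellow", "green"]
# Data for 2021 exists for January through July
MONTHS_2021 = [f"{m:02d}" for m in range(1, 8)]


def flow_inputs(year: str = "2021") -> list[dict[str, str]]:
    """Return one dict of inputs per (taxi, month) combination."""
    return [
        {"taxi": taxi, "year": year, "month": month}
        for taxi, month in product(TAXIS, MONTHS_2021)
    ]


combinations = flow_inputs()
print(len(combinations))  # 2 taxi types x 7 months
for inputs in combinations:
    print(inputs)
```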

### Quiz Questions

Complete the quiz shown below. It's a set of 6 multiple-choice questions to test your understanding of workflow orchestration, Kestra, and ETL pipelines.

1) Within the execution for `Yellow` Taxi data for the year `2020` and month `12`: what is the uncompressed file size (i.e. the output file `yellow_tripdata_2020-12.csv` of the `extract` task)?
- 128.3 MiB
- 134.5 MiB
- 364.7 MiB
- 692.6 MiB

2) What is the rendered value of the variable `file` when the inputs `taxi` is set to `green`, `year` is set to `2020`, and `month` is set to `04` during execution?
- `{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv` 
- `green_tripdata_2020-04.csv`
- `green_tripdata_04_2020.csv`
- `green_tripdata_2020.csv`

3) How many rows are there for the `Yellow` Taxi data for all CSV files in the year 2020?
- 13,537,299
- 24,648,499
- 18,324,219
- 29,430,127

4) How many rows are there for the `Green` Taxi data for all CSV files in the year 2020?
- 5,327,301
- 936,199
- 1,734,051
- 1,342,034

5) How many rows are there for the `Yellow` Taxi data for the March 2021 CSV file?
- 1,428,092
- 706,911
- 1,925,152
- 2,561,031

6) How would you configure the timezone to New York in a Schedule trigger?
- Add a `timezone` property set to `EST` in the `Schedule` trigger configuration  
- Add a `timezone` property set to `America/New_York` in the `Schedule` trigger configuration
- Add a `timezone` property set to `UTC-5` in the `Schedule` trigger configuration
- Add a `location` property set to `New_York` in the `Schedule` trigger configuration  

## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw2
* Check the link above to see the due date

## Solution

Will be added after the due date


## Learning in Public

We encourage everyone to share what they learned. This is called "learning in public".

Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

### Example post for LinkedIn

```
🚀 Week 2 of Data Engineering Zoomcamp by @DataTalksClub and @Will Russell complete!

Just finished Module 2 - Workflow Orchestration with @Kestra. Learned how to:

✅ Orchestrate data pipelines with Kestra flows
✅ Use variables and expressions for dynamic workflows
✅ Implement backfill for historical data
✅ Schedule workflows with timezone support
✅ Process NYC taxi data (Yellow & Green) for 2019-2021

Built ETL pipelines that extract, transform, and load taxi trip data automatically!

Thanks to the @Kestra team for the great orchestration tool!

Here's my homework solution: <LINK>

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

### Example post for Twitter/X

```
Module 2 of DE Zoomcamp by @DataTalksClub @wrussell1999 done!

- @kestra_io workflow orchestration
- ETL pipelines for taxi data
- Backfill & scheduling
- Variables & dynamic flows

My solution: <LINK>

Join me here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```


================================================
FILE: cohorts/2026/03-data-warehouse/DLT_upload_to_GCP.ipynb
================================================
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aC2QnhmKxpq1"
      },
      "source": [
        "**Please set up your credentials JSON as GCP_CREDENTIALS secrets**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UsUZobVduL7l"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from google.colab import userdata\n",
        "\n",
        "os.environ[\"DESTINATION__CREDENTIALS\"] = userdata.get(\"GCP_CREDENTIALS\")\n",
        "os.environ[\"BUCKET_URL\"] = \"gs://your_bucket_url\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "mPBzsEgyjsBo"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "# Install for production\n",
        "!pip install \"dlt[bigquery,gs]\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "evdUsDNbkCTk"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "# Install for testing\n",
        "!pip install \"dlt[duckdb]\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "lYh7r1mTf4uo"
      },
      "outputs": [],
      "source": [
        "import dlt\n",
        "import requests\n",
        "import pandas as pd\n",
        "from dlt.destinations import filesystem\n",
        "from io import BytesIO"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "76zT1PzAgs7A"
      },
      "source": [
        "Ingesting parquet files to GCS."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xya0215jsnsb"
      },
      "outputs": [],
      "source": [
        "# Define a dlt source to download and process Parquet files as resources\n",
        "@dlt.source(name=\"rides\")\n",
        "def download_parquet():\n",
        "    prefix = \"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata\"\n",
        "    for month in range(1, 7):\n",
        "        file_name = f\"yellow_tripdata_2024-0{month}.parquet\"\n",
        "        url = f\"{prefix}_2024-0{month}.parquet\"\n",
        "        response = requests.get(url)\n",
        "\n",
        "        df = pd.read_parquet(BytesIO(response.content))\n",
        "\n",
        "        # Return the dataframe as a dlt resource for ingestion\n",
        "        yield dlt.resource(df, name=file_name)\n",
        "\n",
        "\n",
        "# Initialize the pipeline\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"rides_pipeline\",\n",
        "    destination=filesystem(layout=\"{schema_name}/{table_name}.{ext}\"),\n",
        "    dataset_name=\"rides_dataset\",\n",
        ")\n",
        "\n",
        "# Run the pipeline to load Parquet data into the GCS bucket\n",
        "load_info = pipeline.run(download_parquet(), loader_file_format=\"parquet\")\n",
        "\n",
        "# Print the results\n",
        "print(load_info)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S0310FT-gy_P"
      },
      "source": [
        "Ingesting data to Database"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1_3K97w1c2v2",
        "outputId": "4b2d26bf-2814-46fa-f80d-7a2e17417a95"
      },
      "outputs": [],
      "source": [
        "# Define a dlt resource to download and process Parquet files as single table\n",
        "@dlt.resource(name=\"rides\", write_disposition=\"replace\")\n",
        "def download_parquet():\n",
        "    prefix = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata'\n",
        "\n",
        "    for month in range(1, 7):\n",
        "        url = f\"{prefix}_2024-0{month}.parquet\"\n",
        "        response = requests.get(url)\n",
        "\n",
        "        df = pd.read_parquet(BytesIO(response.content))\n",
        "\n",
        "        yield df\n",
        "\n",
        "\n",
        "# Initialize the pipeline\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"rides_pipeline\",\n",
        "    destination=\"duckdb\",  # Use DuckDB for testing\n",
        "    # destination=\"bigquery\",  # Use BigQuery for production\n",
        "    dataset_name=\"rides_dataset\",\n",
        ")\n",
        "\n",
        "# Run the pipeline to load Parquet data into DuckDB\n",
        "info = pipeline.run(download_parquet)\n",
        "\n",
        "# Print the results\n",
        "print(info)\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "gDcLjzLtooBV",
        "outputId": "74ff2de7-2f2e-41b9-a681-3dc5887f6eed"
      },
      "outputs": [],
      "source": [
        "import duckdb\n",
        "\n",
        "conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n",
        "\n",
        "# Set search path to the dataset\n",
        "conn.sql(f\"SET search_path = '{pipeline.dataset_name}'\")\n",
        "\n",
        "# Describe the dataset to see loaded tables\n",
        "res = conn.sql(\"DESCRIBE\").df()\n",
        "print(res)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "VVJy8JoerI2P",
        "outputId": "3f8c7fee-a9ee-4fd4-ec75-153ca60bd36f"
      },
      "outputs": [],
      "source": [
        "# provide a resource name to query a table of that name\n",
        "with pipeline.sql_client() as client:\n",
        "    with client.execute_query(f\"SELECT count(1) FROM rides\") as cursor:\n",
        "        data = cursor.df()\n",
        "print(data)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}


================================================
FILE: cohorts/2026/03-data-warehouse/homework.md
================================================
# Module 3 Homework: Data Warehousing & BigQuery

In this homework we'll practice working with BigQuery and Google Cloud Storage.

When submitting your homework, you will also need to include
a link to your GitHub repository or other public code-hosting
site.

This repository should contain the code for solving the homework.

If your solution consists of SQL or shell commands rather than
code files (e.g. Python files), include them directly in
the README file of your repository.

## Data

For this homework we will be using the Yellow Taxi Trip Records for January 2024 - June 2024 (not the entire year of data).

Parquet Files are available from the New York City Taxi Data found here:

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

## Loading the data

You can use the following scripts to load the data into your GCS bucket:

- Python script: [load_yellow_taxi_data.py](./load_yellow_taxi_data.py)
- Jupyter notebook with DLT: [DLT_upload_to_GCP.ipynb](./DLT_upload_to_GCP.ipynb)

You will need to generate a Service Account with GCS Admin privileges or be authenticated with the Google SDK, and update the bucket name in the script.

If you are using orchestration tools such as Kestra, Mage, Airflow, or Prefect, do not load the data into BigQuery using the orchestrator.

Make sure that all 6 files show in your GCS bucket before beginning.
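
One way to check this is to compare the object names in the bucket with the six expected file names. The sketch below assumes the files were uploaded with the names used by `load_yellow_taxi_data.py` (`yellow_tripdata_2024-01.parquet` through `yellow_tripdata_2024-06.parquet`); the DLT notebook may produce a different layout. The helper name `missing_files` is hypothetical.

```python
# Expected object names, assuming the naming used by load_yellow_taxi_data.py
EXPECTED = {f"yellow_tripdata_2024-{m:02d}.parquet" for m in range(1, 7)}


def missing_files(found_names) -> list[str]:
    """Return the expected file names that are absent from found_names."""
    return sorted(EXPECTED - set(found_names))


# Example with an invented, incomplete listing
print(missing_files(["yellow_tripdata_2024-01.parquet",
                     "yellow_tripdata_2024-02.parquet"]))
```

With the `google-cloud-storage` client from the loading script, the listing could be obtained with something like `[blob.name for blob in client.list_blobs(BUCKET_NAME)]` and passed to `missing_files`.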

Note: You will need to use the PARQUET option when creating an external table.


## BigQuery Setup

Create an external table using the Yellow Taxi Trip Records. 

Create a (regular/materialized) table in BQ using the Yellow Taxi Trip Records (do not partition or cluster this table). 
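
As a rough sketch of what these two statements can look like, the Python helpers below only build the SQL text; they do not run it. The table IDs, the bucket name, and the helper names are placeholders invented for this example, and the wildcard URI assumes the six files sit at the top level of the bucket under the `yellow_tripdata_2024-*.parquet` naming.

```python
def external_table_ddl(table_id: str, bucket: str) -> str:
    """Build DDL for an external table over the Parquet files in GCS."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table_id}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['gs://{bucket}/yellow_tripdata_2024-*.parquet']\n"
        ");"
    )


def regular_table_ddl(table_id: str, external_table_id: str) -> str:
    """Build DDL for a regular table (no partitioning or clustering)."""
    return (
        f"CREATE OR REPLACE TABLE `{table_id}` AS\n"
        f"SELECT * FROM `{external_table_id}`;"
    )


# Placeholder identifiers; replace them with your own project, dataset, and bucket
print(external_table_ddl("my-project.my_dataset.yellow_external", "my-bucket"))
print(regular_table_ddl("my-project.my_dataset.yellow_regular",
                        "my-project.my_dataset.yellow_external"))
```

The printed statements can be pasted into the BigQuery console and adjusted as needed.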


## Question 1. Counting records

What is the count of records for the 2024 Yellow Taxi Data?
- 65,623
- 840,402
- 20,332,093
- 85,431,289


## Question 2. Data read estimation

Write a query to count the number of distinct PULocationIDs for the entire dataset on both tables.
 
What is the **estimated amount** of data that will be read when this query is executed on the External Table and the Table?

- 18.82 MB for the External Table and 47.60 MB for the Materialized Table
- 0 MB for the External Table and 155.12 MB for the Materialized Table
- 2.14 GB for the External Table and 0MB for the Materialized Table
- 0 MB for the External Table and 0MB for the Materialized Table
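
The estimate is shown in the BigQuery console before a query runs. It can also be requested programmatically with a dry run, which validates the query and reports the bytes it would process without executing it. The sketch below assumes the `google-cloud-bigquery` package is installed and that `client` is an authenticated `bigquery.Client`; the helper names are hypothetical, and the console may round or label units slightly differently from `format_mb`.

```python
def format_mb(num_bytes: int) -> str:
    """Format a byte count as megabytes (1 MB taken as 1024 * 1024 bytes)."""
    return f"{num_bytes / 1024 ** 2:.2f} MB"


def estimate_bytes(client, sql: str) -> int:
    """Return the bytes BigQuery estimates a query would process.

    Uses a dry run, so the query itself is not executed.
    """
    # Imported here so the formatting helper works without the package installed
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


print(format_mb(5 * 1024 ** 2))
```

A call such as `format_mb(estimate_bytes(client, "SELECT ..."))` can then be run once against each table to compare the two estimates.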

## Question 3. Understanding columnar storage

Write a query to retrieve the PULocationID from the table (not the external table) in BigQuery. Now write a query to retrieve the PULocationID and DOLocationID on the same table.

Why is the estimated number of bytes different?
- BigQuery is a columnar database, and it only scans the specific columns requested in the query. Querying two columns (PULocationID, DOLocationID) requires reading more data than querying one column (PULocationID), leading to a higher estimated number of bytes processed.
- BigQuery duplicates data across multiple storage partitions, so selecting two columns instead of one requires scanning the table twice, doubling the estimated bytes processed.
- BigQuery automatically caches the first queried column, so adding a second column increases processing time but does not affect the estimated bytes scanned.
- When selecting multiple columns, BigQuery performs an implicit join operation between them, increasing the estimated bytes processed

## Question 4. Counting zero fare trips

How many records have a fare_amount of 0?
- 128,210
- 546,578
- 20,188,016
- 8,333

## Question 5. Partitioning and clustering

What is the best strategy to make an optimized table in Big Query if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID? (Create a new table with this strategy.)

- Partition by tpep_dropoff_datetime and Cluster on VendorID
- Cluster on tpep_dropoff_datetime and Cluster on VendorID
- Cluster on tpep_dropoff_datetime Partition by VendorID
- Partition by tpep_dropoff_datetime and Partition by VendorID


## Question 6. Partition benefits

Write a query to retrieve the distinct VendorIDs between tpep_dropoff_datetime
2024-03-01 and 2024-03-15 (inclusive)


Use the materialized table you created earlier in your `FROM` clause and note the estimated bytes. Now change the table in the `FROM` clause to the partitioned table you created for Question 5 and note the estimated bytes processed. What are these values?


Choose the answer which most closely matches.
 

- 12.47 MB for non-partitioned table and 326.42 MB for the partitioned table
- 310.24 MB for non-partitioned table and 26.84 MB for the partitioned table
- 5.87 MB for non-partitioned table and 0 MB for the partitioned table
- 310.31 MB for non-partitioned table and 285.64 MB for the partitioned table


## Question 7. External table storage

Where is the data stored in the External Table you created?

- Big Query
- Container Registry
- GCP Bucket
- Big Table

## Question 8. Clustering best practices

It is best practice in Big Query to always cluster your data:
- True
- False


## Question 9. Understanding table scans

No Points: Write a `SELECT count(*)` query FROM the materialized table you created. How many bytes does it estimate will be read? Why?


## Submitting the solutions

Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw3


## Learning in Public

We encourage everyone to share what they learned. This is called "learning in public".

Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

### Example post for LinkedIn

```
🚀 Week 3 of Data Engineering Zoomcamp by @DataTalksClub complete!

Just finished Module 3 - Data Warehousing with BigQuery. Learned how to:

✅ Create external tables from GCS bucket data
✅ Build materialized tables in BigQuery
✅ Partition and cluster tables for performance
✅ Understand columnar storage and query optimization
✅ Analyze NYC taxi data at scale

Working with 20M+ records and learning how partitioning reduces query costs!

Here's my homework solution: <LINK>

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

### Example post for Twitter/X

```
📊 Module 3 of Data Engineering Zoomcamp done!

- BigQuery & GCS
- External vs materialized tables
- Partitioning & clustering
- Query optimization

My solution: <LINK>

Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```


================================================
FILE: cohorts/2026/03-data-warehouse/load_yellow_taxi_data.py
================================================
import os
import sys
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage
from google.api_core.exceptions import NotFound, Forbidden
import time


# Change this to your bucket name
BUCKET_NAME = "dezoomcamp_hw3_2025"

# If you authenticated through the GCP SDK you can comment out these two lines
CREDENTIALS_FILE = "gcs.json"
client = storage.Client.from_service_account_json(CREDENTIALS_FILE)
# If commented initialize client with the following
# client = storage.Client(project='zoomcamp-mod3-datawarehouse')


BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-"
MONTHS = [f"{i:02d}" for i in range(1, 7)]
DOWNLOAD_DIR = "."

CHUNK_SIZE = 8 * 1024 * 1024

os.makedirs(DOWNLOAD_DIR, exist_ok=True)

bucket = client.bucket(BUCKET_NAME)


def download_file(month):
    url = f"{BASE_URL}{month}.parquet"
    file_path = os.path.join(DOWNLOAD_DIR, f"yellow_tripdata_2024-{month}.parquet")

    try:
        print(f"Downloading {url}...")
        urllib.request.urlretrieve(url, file_path)
        print(f"Downloaded: {file_path}")
        return file_path
    except Exception as e:
        print(f"Failed to download {url}: {e}")
        return None


def create_bucket(bucket_name):
    try:
        # Get bucket details
        bucket = client.get_bucket(bucket_name)

        # Check if the bucket belongs to the current project
        project_bucket_ids = [bckt.id for bckt in client.list_buckets()]
        if bucket_name in project_bucket_ids:
            print(
                f"Bucket '{bucket_name}' exists and belongs to your project. Proceeding..."
            )
        else:
            print(
                f"A bucket with the name '{bucket_name}' already exists, but it does not belong to your project."
            )
            sys.exit(1)

    except NotFound:
        # If the bucket doesn't exist, create it
        bucket = client.create_bucket(bucket_name)
        print(f"Created bucket '{bucket_name}'")
    except Forbidden:
        # If the request is forbidden, it means the bucket exists but you don't have access to see details
        print(
            f"A bucket with the name '{bucket_name}' exists, but it is not accessible. Bucket name is taken. Please try a different bucket name."
        )
        sys.exit(1)


def verify_gcs_upload(blob_name):
    return storage.Blob(bucket=bucket, name=blob_name).exists(client)


def upload_to_gcs(file_path, max_retries=3):
    blob_name = os.path.basename(file_path)
    blob = bucket.blob(blob_name)
    blob.chunk_size = CHUNK_SIZE

    create_bucket(BUCKET_NAME)

    for attempt in range(max_retries):
        try:
            print(f"Uploading {file_path} to {BUCKET_NAME} (Attempt {attempt + 1})...")
            blob.upload_from_filename(file_path)
            print(f"Uploaded: gs://{BUCKET_NAME}/{blob_name}")

            if verify_gcs_upload(blob_name):
                print(f"Verification successful for {blob_name}")
                return
            else:
                print(f"Verification failed for {blob_name}, retrying...")
        except Exception as e:
            print(f"Failed to upload {file_path} to GCS: {e}")

        time.sleep(5)

    print(f"Giving up on {file_path} after {max_retries} attempts.")


if __name__ == "__main__":
    create_bucket(BUCKET_NAME)

    with ThreadPoolExecutor(max_workers=4) as executor:
        file_paths = list(executor.map(download_file, MONTHS))

    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(upload_to_gcs, filter(None, file_paths))  # Remove None values

    print("All files processed and verified.")


================================================
FILE: cohorts/2026/04-analytics-engineering/homework.md
================================================
# Module 4 Homework: Analytics Engineering with dbt

In this homework, we'll use the dbt project in `04-analytics-engineering/taxi_rides_ny/` to transform NYC taxi data and answer questions by querying the models.

## Setup

1. Set up your dbt project following the [setup guide](../../../04-analytics-engineering/setup/)
2. Load the Green and Yellow taxi data for 2019-2020 and the FHV trip data for 2019 into your warehouse. Use the static tables from the [DTC GitHub repository](https://github.com/DataTalksClub/nyc-tlc-data/) rather than the official TLC tables, because some values in the official tables change from time to time.
3. Run `dbt build --target prod` to create all models and run tests

> **Note:** By default, dbt uses the `dev` target. You must use `--target prod` to build the models in the production dataset, which is required for the homework queries below.

After a successful build, you should have models like `fct_trips`, `dim_zones`, and `fct_monthly_zone_revenue` in your warehouse.

---

### Question 1. dbt Lineage and Execution

Given a dbt project with the following structure:

```
models/
├── staging/
│   ├── stg_green_tripdata.sql
│   └── stg_yellow_tripdata.sql
└── intermediate/
    └── int_trips_unioned.sql (depends on stg_green_tripdata & stg_yellow_tripdata)
```

If you run `dbt run --select int_trips_unioned`, what models will be built?

- `stg_green_tripdata`, `stg_yellow_tripdata`, and `int_trips_unioned` (upstream dependencies)
- Any model with upstream and downstream dependencies to `int_trips_unioned`
- `int_trips_unioned` only
- `int_trips_unioned`, `int_trips`, and `fct_trips` (downstream dependencies)

---

### Question 2. dbt Tests

You've configured a generic test like this in your `schema.yml`:

```yaml
columns:
  - name: payment_type
    data_tests:
      - accepted_values:
          arguments:
            values: [1, 2, 3, 4, 5]
            quote: false
```

Your model `fct_trips` has been running successfully for months. A new value `6` now appears in the source data.

What happens when you run `dbt test --select fct_trips`?

- dbt will skip the test because the model didn't change
- dbt will fail the test, returning a non-zero exit code
- dbt will pass the test with a warning about the new value
- dbt will update the configuration to include the new value

---

### Question 3. Counting Records in `fct_monthly_zone_revenue`

After running your dbt project, query the `fct_monthly_zone_revenue` model.

What is the count of records in the `fct_monthly_zone_revenue` model?

- 12,998
- 14,120
- 12,184
- 15,421

---

### Question 4. Best Performing Zone for Green Taxis (2020)

Using the `fct_monthly_zone_revenue` table, find the pickup zone with the **highest total revenue** (`revenue_monthly_total_amount`) for **Green** taxi trips in 2020.

Which zone had the highest revenue?

- East Harlem North
- Morningside Heights
- East Harlem South
- Washington Heights South

---

### Question 5. Green Taxi Trip Counts (October 2019)

Using the `fct_monthly_zone_revenue` table, what is the **total number of trips** (`total_monthly_trips`) for Green taxis in October 2019?

- 500,234
- 350,891
- 384,624
- 421,509

---

### Question 6. Build a Staging Model for FHV Data

Create a staging model for the **For-Hire Vehicle (FHV)** trip data for 2019.

1. Load the [FHV trip data for 2019](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv) into your data warehouse
2. Create a staging model `stg_fhv_tripdata` with these requirements:
   - Filter out records where `dispatching_base_num IS NULL`
   - Rename fields to match your project's naming conventions (e.g., `PUlocationID` → `pickup_location_id`)

What is the count of records in `stg_fhv_tripdata`?

- 42,084,899
- 43,244,693
- 22,998,722
- 44,112,187

---

## Submitting the solutions

- Form for submitting: <https://courses.datatalks.club/de-zoomcamp-2026/homework/hw4>

## Learning in Public

We encourage everyone to share what they learned. This is called "learning in public".

Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

### Example post for LinkedIn

```
🚀 Week 4 of Data Engineering Zoomcamp by @DataTalksClub complete!

Just finished Module 4 - Analytics Engineering with dbt. Learned how to:

✅ Build transformation models with dbt
✅ Create staging, intermediate, and fact tables
✅ Write tests to ensure data quality
✅ Understand lineage and model dependencies
✅ Analyze revenue patterns across NYC zones

Transforming raw data into analytics-ready models - the T in ELT!

Here's my homework solution: <LINK>

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

### Example post for Twitter/X

```
📈 Module 4 of Data Engineering Zoomcamp done!

- Analytics Engineering with dbt
- Transformation models & tests
- Data lineage & dependencies
- NYC taxi revenue analysis

My solution: <LINK>

Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```


================================================
FILE: cohorts/2026/05-data-platforms/homework.md
================================================
# Module 5 Homework: Data Platforms with Bruin

In this homework, we'll use Bruin to build a complete data pipeline, from ingestion to reporting.

## Setup

1. Install Bruin CLI: `curl -LsSf https://getbruin.com/install/cli | sh`
2. Initialize the zoomcamp template: `bruin init zoomcamp my-pipeline`
3. Configure your `.bruin.yml` with a DuckDB connection
4. Follow the tutorial in the [main module README](../../../05-data-platforms/)

After completing the setup, you should have a working NYC taxi data pipeline.

---

### Question 1. Bruin Pipeline Structure

In a Bruin project, what are the required files/directories?

- `bruin.yml` and `assets/`
- `.bruin.yml` and `pipeline.yml` (assets can be anywhere)
- `.bruin.yml` and `pipeline/` with `pipeline.yml` and `assets/`
- `pipeline.yml` and `assets/` only

---

### Question 2. Materialization Strategies

You're building a pipeline that processes NYC taxi data organized by month based on `pickup_datetime`. Which incremental strategy is best for reprocessing a specific time interval by deleting and re-inserting the data for that interval?

- `append` - always add new rows
- `replace` - truncate and rebuild entirely
- `time_interval` - incremental based on a time column
- `view` - create a virtual table only

---

### Question 3. Pipeline Variables

You have the following variable defined in `pipeline.yml`:

```yaml
variables:
  taxi_types:
    type: array
    items:
      type: string
    default: ["yellow", "green"]
```

How do you override this when running the pipeline to only process yellow taxis?

- `bruin run --taxi-types yellow`
- `bruin run --var taxi_types=yellow`
- `bruin run --var 'taxi_types=["yellow"]'`
- `bruin run --set taxi_types=["yellow"]`

---

### Question 4. Running with Dependencies

You've modified the `ingestion/trips.py` asset and want to run it plus all downstream assets. Which command should you use?

- `bruin run ingestion.trips --all`
- `bruin run ingestion/trips.py --downstream`
- `bruin run pipeline/trips.py --recursive`
- `bruin run --select ingestion.trips+`

---

### Question 5. Quality Checks

You want to ensure the `pickup_datetime` column in your trips table never has NULL values. Which quality check should you add to your asset definition?

- `name: unique`
- `name: not_null`
- `name: positive`
- `name: accepted_values, value: [not_null]`

---

### Question 6. Lineage and Dependencies

After building your pipeline, you want to visualize the dependency graph between assets. Which Bruin command should you use?

- `bruin graph`
- `bruin dependencies`
- `bruin lineage`
- `bruin show`

---

### Question 7. First-Time Run

You're running a Bruin pipeline for the first time on a new DuckDB database. What flag should you use to ensure tables are created from scratch?

- `--create`
- `--init`
- `--full-refresh`
- `--truncate`

---

## Submitting the solutions

- Form for submitting: <https://courses.datatalks.club/de-zoomcamp-2026/homework/hw5>

## Learning in Public

We encourage everyone to share what they learned. This is called "learning in public".

Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

### Example post for LinkedIn

```
🚀 Week 5 of Data Engineering Zoomcamp by @DataTalksClub complete!

Just finished Module 5 - Data Platforms with Bruin. Learned how to:

✅ Build end-to-end ELT pipelines with Bruin
✅ Configure environments and connections
✅ Use materialization strategies for incremental processing
✅ Add data quality checks to ensure data integrity
✅ Deploy pipelines from local to cloud (BigQuery)

Modern data platforms in a single CLI tool - no vendor lock-in!

Here's my homework solution: <LINK>

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

### Example post for Twitter/X

```
📊 Module 5 of Data Engineering Zoomcamp done!

- Data Platforms with Bruin
- End-to-end ELT pipelines
- Data quality & lineage
- Deployment to BigQuery

My solution: <LINK>

Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```


================================================
FILE: cohorts/2026/06-batch/homework.md
================================================
# Module 6 Homework

In this homework we'll put what we learned about Spark into practice.

For this homework we will be using the Yellow 2025-11 data from the official website:

```bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet
```


## Question 1: Install Spark and PySpark

- Install Spark
- Run PySpark
- Create a local Spark session
- Execute `spark.version`

What's the output?

> [!NOTE]
> To install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/06-batch/setup/)


## Question 2: Yellow November 2025

Read the November 2025 Yellow taxi data into a Spark DataFrame.

Repartition the DataFrame to 4 partitions and save it in Parquet format.

What is the average size, in MB, of the Parquet files that were created (the files ending with the `.parquet` extension)? Select the answer that most closely matches.

- 6MB
- 25MB
- 75MB
- 100MB
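
To compare the output files against the options above, the file sizes can be averaged with a few lines of standard-library Python. This is a minimal sketch: the helper name `average_parquet_size_mb` is hypothetical, and it assumes 1 MB means 1024 × 1024 bytes.

```python
from pathlib import Path


def average_parquet_size_mb(output_dir: str) -> float:
    """Return the mean size, in MB, of the *.parquet files directly inside output_dir."""
    # Spark also writes marker and checksum files (for example _SUCCESS and
    # *.crc); they do not end in .parquet, so this glob pattern skips them.
    sizes = [p.stat().st_size for p in Path(output_dir).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {output_dir}")
    # Assumption: 1 MB = 1024 * 1024 bytes.
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

Call it with whatever directory you passed to the Parquet writer.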


## Question 3: Count records

How many taxi trips were there on the 15th of November?

Consider only trips that started on the 15th of November.

- 62,610
- 102,340
- 162,604
- 225,768


## Question 4: Longest trip

What is the length of the longest trip in the dataset in hours?

- 22.7
- 58.2
- 90.6
- 134.5
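
"Length" here is a duration, so it comes from the difference between the dropoff and pickup timestamps. A minimal sketch of the arithmetic on toy values (the helper `trip_hours` is hypothetical, and the rows below are made up, not taken from the dataset):

```python
from datetime import datetime


def trip_hours(pickup: datetime, dropoff: datetime) -> float:
    """Trip duration in hours, computed from the two timestamps."""
    return (dropoff - pickup).total_seconds() / 3600


# Toy rows for illustration only.
trips = [
    (datetime(2025, 11, 1, 8, 0), datetime(2025, 11, 1, 8, 30)),
    (datetime(2025, 11, 2, 23, 0), datetime(2025, 11, 3, 2, 0)),
]
longest = max(trip_hours(p, d) for p, d in trips)
print(longest)  # 3.0
```

In Spark, the same subtraction would be applied to the pickup and dropoff timestamp columns of the DataFrame before taking the maximum.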


## Question 5: User Interface

On which local port does the Spark UI, which shows the application's dashboard, run?

- 80
- 443
- 4040
- 8080


## Question 6: Least frequent pickup location zone

Load the zone lookup data into a temp view in Spark:

```bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
```

Using the zone lookup data and the Yellow November 2025 data, what is the name of the LEAST frequent pickup location Zone?

- Governor's Island/Ellis Island/Liberty Island
- Arden Heights
- Rikers Island
- Jamaica Bay

If multiple answers are correct, select any of them.

## Submitting the solutions

- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw6
- Deadline: See the website


## Learning in Public

We encourage everyone to share what they learned. This is called "learning in public".

Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

### Example post for LinkedIn

```
🚀 Week 6 of Data Engineering Zoomcamp by @DataTalksClub complete!

Just finished Module 6 - Batch Processing with Spark. Learned how to:

✅ Set up PySpark and create Spark sessions
✅ Read and process Parquet files at scale
✅ Repartition data for optimal performance
✅ Analyze millions of taxi trips with DataFrames
✅ Use Spark UI for monitoring jobs

Processing 4M+ taxi trips with Spark - distributed computing is powerful! 💪

Here's my homework solution: <LINK>

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

### Example post for Twitter/X

```
⚡ Module 6 of Data Engineering Zoomcamp done!

- Batch processing with Spark 🔥
- PySpark & DataFrames
- Parquet file optimization
- Spark UI on port 4040

My solution: <LINK>

Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```


================================================
FILE: cohorts/2026/07-streaming/homework.md
================================================
# Homework

In this homework, we'll practice streaming with Kafka (Redpanda) and PyFlink.

We use Redpanda, a drop-in replacement for Kafka. It implements the same
protocol, so any Kafka client library works with it unchanged.

For this homework we will be using Green Taxi Trip data from October 2025:

- [green_tripdata_2025-10.parquet](https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-10.parquet)


## Setup

We'll use the same infrastructure from the [workshop](../../../07-streaming/workshop/).

Follow the setup instructions: build the Docker image, start the services:

```bash
cd 07-streaming/workshop/
docker compose build
docker compose up -d
```

This gives us:

- Redpanda (Kafka-compatible broker) on `localhost:9092`
- Flink Job Manager at http://localhost:8081
- Flink Task Manager
- PostgreSQL on `localhost:5432` (user: `postgres`, password: `postgres`)

If you previously ran the workshop and have old containers/volumes,
do a clean start:

```bash
docker compose down -v
docker compose build
docker compose up -d
```

Note: the container names (like `workshop-redpanda-1`) assume the
directory is called `workshop`. If you renamed it, adjust accordingly.


## Question 1. Redpanda version

Run `rpk version` inside the Redpanda container:

```bash
docker exec -it workshop-redpanda-1 rpk version
```

What version of Redpanda are you running?


## Question 2. Sending data to Redpanda

Create a topic called `green-trips`:

```bash
docker exec -it workshop-redpanda-1 rpk topic create green-trips
```

Now write a producer to send the green taxi data to this topic.

Read the parquet file and keep only these columns:

- `lpep_pickup_datetime`
- `lpep_dropoff_datetime`
- `PULocationID`
- `DOLocationID`
- `passenger_count`
- `trip_distance`
- `tip_amount`
- `total_amount`

Convert each row to a dictionary and send it to the `green-trips` topic.
You'll need to handle the datetime columns - convert them to strings
before serializing to JSON.
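
One possible way to do the conversion is a `default` hook for `json.dumps`. The sketch below is an assumption about how you might structure it: the helper name `serialize_row` is hypothetical, and the output format is chosen to match the `'yyyy-MM-dd HH:mm:ss'` pattern used in the Flink DDL later in this homework.

```python
import json
from datetime import datetime


def serialize_row(row: dict) -> bytes:
    """Encode one trip as UTF-8 JSON, turning datetime values into strings."""

    def to_jsonable(value):
        # json.dumps calls this only for values it cannot encode itself.
        if isinstance(value, datetime):
            return value.strftime("%Y-%m-%d %H:%M:%S")
        raise TypeError(f"cannot serialize {type(value).__name__}")

    return json.dumps(row, default=to_jsonable).encode("utf-8")
```

A function like this can be passed to the producer as its value serializer. Note that `json.dumps` writes missing float values as `NaN`, which strict JSON parsers may reject, so you may want to replace them with `None` first.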

Measure the time it takes to send the entire dataset and flush:

```python
from time import time

t0 = time()

# send all rows ...

producer.flush()

t1 = time()
print(f'took {(t1 - t0):.2f} seconds')
```

How long did it take to send the data?

- 10 seconds
- 60 seconds
- 120 seconds
- 300 seconds


## Question 3. Consumer - trip distance

Write a Kafka consumer that reads all messages from the `green-trips` topic
(set `auto_offset_reset='earliest'`).

Count how many trips have a `trip_distance` greater than 5.0 miles (the unit used in the TLC data).
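
The counting logic can be kept separate from the Kafka client, which makes it easy to check on a few hand-written messages. A minimal sketch, assuming each message value is the JSON produced in Question 2 (the function name `count_long_trips` is hypothetical):

```python
import json
from typing import Iterable


def count_long_trips(messages: Iterable[bytes], threshold: float = 5.0) -> int:
    """Count messages whose trip_distance is strictly greater than threshold."""
    count = 0
    for raw in messages:
        trip = json.loads(raw)
        distance = trip.get("trip_distance")
        # Skip rows where the distance is missing or null.
        if distance is not None and distance > threshold:
            count += 1
    return count
```

In a real consumer, `messages` would be the values of the records read from the topic. A consumer loop does not end on its own, so you need some stopping condition; with kafka-python, for instance, the `consumer_timeout_ms` setting ends iteration after a quiet period.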

How many trips have `trip_distance` > 5?

- 6506
- 7506
- 8506
- 9506


## Part 2: PyFlink (Questions 4-6)

For the PyFlink questions, you'll adapt the workshop code to work with
the green taxi data. The key differences from the workshop:

- Topic name: `green-trips` (instead of `rides`)
- Datetime columns use `lpep_` prefix (instead of `tpep_`)
- You'll need to handle timestamps as strings (not epoch milliseconds)

You can convert string timestamps to Flink timestamps in your source DDL:

```sql
lpep_pickup_datetime VARCHAR,
event_timestamp AS TO_TIMESTAMP(lpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss'),
WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '5' SECOND
```

Before running the Flink jobs, create the necessary PostgreSQL tables
for your results.

Important notes for the Flink jobs:

- Place your job files in `workshop/src/job/` - this directory is
  mounted into the Flink containers at `/opt/src/job/`
- Submit jobs with:
  `docker exec -it workshop-jobmanager-1 flink run -py /opt/src/job/your_job.py`
- The `green-trips` topic has 1 partition, so set parallelism to 1
  in your Flink jobs (`env.set_parallelism(1)`). With higher parallelism,
  idle consumer subtasks prevent the watermark from advancing.
- Flink streaming jobs run continuously. Let the job run for a minute
  or two until results appear in PostgreSQL, then query the results.
  You can cancel the job from the Flink UI at http://localhost:8081
- If you sent data to the topic multiple times, delete and recreate
  the topic to avoid duplicates:
  `docker exec -it workshop-redpanda-1 rpk topic delete green-trips`
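
The parallelism note above follows from how watermarks combine: an operator with several inputs advances only as far as its slowest input. A toy model of that rule, assuming no idleness timeout is configured (the function name and the millisecond values are illustrative, not Flink API):

```python
import math


def combined_watermark(subtask_watermarks):
    """A downstream operator's watermark is the minimum over its inputs."""
    return min(subtask_watermarks)


# One partition, parallelism 1: the single subtask drives the watermark.
print(combined_watermark([1_000]))             # 1000
# One partition, parallelism 2: the second subtask never sees data, so its
# watermark stays at the initial minimum and holds everything back.
print(combined_watermark([1_000, -math.inf]))  # -inf
```

Because windows fire only once the watermark passes their end, a watermark stuck at its initial value means no results are ever emitted.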


## Question 4. Tumbling window - pickup location

Create a Flink job that reads from `green-trips` and uses a 5-minute
tumbling window to count trips per `PULocationID`.
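
To build intuition for what the Flink job computes, here is a plain-Python sketch of tumbling-window assignment. It is not Flink code: the helper names are hypothetical, and it assumes windows are aligned to the Unix epoch and that pickup times arrive as `'yyyy-MM-dd HH:mm:ss'` strings.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)  # assumption: windows are aligned to the epoch


def window_start(event_time: datetime, size: timedelta = WINDOW) -> datetime:
    """Floor event_time to the start of the tumbling window that contains it."""
    return event_time - (event_time - EPOCH) % size


def count_per_window(events):
    """events: iterable of (pickup_string, PULocationID) pairs."""
    counts = Counter()
    for pickup, location in events:
        ts = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S")
        # Every event belongs to exactly one window, so windows never overlap.
        counts[(window_start(ts), location)] += 1
    return counts
```

Each `(window_start, PULocationID)` key corresponds to one row the Flink job would write to PostgreSQL.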

Write the results to a PostgreSQL table with columns:
`window_start`, `PULocationID`, `num_trips`.

After the job processes all data, query the results:

```sql
SELECT PULocationID, num_trips
FROM <your_table>
ORDER BY num_trips DESC
LIMIT 3;
```

Which `PULocationID` had the most trips in a single 5-minute window?

- 42
- 74
- 75
- 166


## Question 5. Session window - longest streak

Create another Flink job that uses a session window with a 5-minute gap
on `PULocationID`, using `lpep_pickup_datetime` as the event time
with a 5-second watermark tolerance.

A session window groups events whose timestamps are within 5 minutes of each other.
When there's a gap of more than 5 minutes, the window closes.
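
The grouping rule can be sketched in plain Python for a single `PULocationID`. This is an illustration of the idea rather than Flink code, and the function name `session_sizes` is hypothetical:

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=5)


def session_sizes(event_times, gap: timedelta = GAP):
    """Return the number of events in each session for one key."""
    sizes = []
    previous = None
    for ts in sorted(event_times):
        # A gap of more than `gap` since the previous event starts a new session.
        if previous is None or ts - previous > gap:
            sizes.append(0)
        sizes[-1] += 1
        previous = ts
    return sizes
```

The longest session for a key is then `max(session_sizes(times))`. Unlike a tumbling window, a session has no fixed length: it keeps growing as long as events keep arriving within the gap.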

Write the results to a PostgreSQL table and find the `PULocationID`
with the longest session (most trips in a single session).

How many trips were in the longest session?

- 12
- 31
- 51
- 81


## Question 6. Tumbling window - largest tip

Create a Flink job that uses a 1-hour tumbling window to compute the
total `tip_amount` per hour (across all locations).

Which hour had the highest total tip amount?

- 2025-10-01 18:00:00
- 2025-10-16 18:00:00
- 2025-10-22 08:00:00
- 2025-10-30 16:00:00


## Submitting the solutions

- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw7


## Learning in public

We encourage everyone to share what they learned.
Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

## Example post for LinkedIn

```
Week 7 of Data Engineering Zoomcamp by @DataTalksClub complete!

Just finished Module 7 - Streaming with PyFlink. Learned how to:

- Set up Redpanda as a Kafka replacement
- Build Kafka producers and consumers in Python
- Create tumbling and session windows in Flink
- Analyze real-time taxi trip data with stream processing

Here's my homework solution: <LINK>

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

## Example post for Twitter/X

```
Module 7 of Data Engineering Zoomcamp done!

- Kafka producers and consumers
- PyFlink tumbling and session windows
- Real-time taxi data analysis
- Redpanda as Kafka replacement

My solution: <LINK>

Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```


================================================
FILE: cohorts/2026/README.md
================================================
## Data Engineering Zoomcamp 2026 Cohort

* [Pre-launch Q&A stream](https://www.youtube.com/watch?v=WB6b1lcguaA)
* [Launch stream with course overview](https://www.youtube.com/watch?v=JgspdlKXS-w)
* [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html)
* [Course Playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)


[**Module 1: Introduction & Prerequisites**](01-docker-terraform/)

* [Homework](01-docker-terraform/homework.md)


[**Module 2: Workflow Orchestration**](02-workflow-orchestration)

* [Homework](02-workflow-orchestration/homework.md)
* Office hours

[**Workshop 1: Data Ingestion**](workshops/dlt/README.md)

* Workshop with dlt
* [Homework](workshops/dlt/README.md)

[**Workshop 2: AI-Assisted Data Ingestion with dlt**](workshops/dlt.md)

* [Workshop details and registration](workshops/dlt.md)


[**Module 3: Data Warehouse**](03-data-warehouse)

* [Homework](03-data-warehouse/homework.md)


[**Module 4: Analytics Engineering**](04-analytics-engineering/)

* [Homework](04-analytics-engineering/homework.md)


[**Module 5: Data Platforms**](05-data-platforms/)

* [Homework](05-data-platforms/homework.md)


[**Module 6: Batch processing**](06-batch/)

* [Homework](06-batch/homework.md)


[**Module 7: Stream Processing**](07-streaming)

* [Homework](07-streaming/homework.md)


[**Project**](project.md)

More information [here](project.md)


================================================
FILE: cohorts/2026/project.md
================================================
## Course Project

The goal of this project is to apply everything we learned
in this course and build an end-to-end data pipeline.

You will have two attempts to submit your project. If you don't have 
time to submit your project by the end of attempt #1 (you started the 
course late, you have vacation plans, life/work got in the way, etc.)
or you fail your first attempt, 
then you will have a second chance to submit your project as attempt
#2. 

There are only two attempts.

Remember that to pass the project, you must evaluate 3 peers. If you don't do that,
your project can't be considered complete.

To find the projects assigned to you, use the peer review assignments link 
and find your hash in the first column. You will see three rows: you need to evaluate 
each of these projects. For each project, you need to submit the form once,
so in total, you will make three submissions. 


### Submitting

#### Project Attempt #1

* Project: https://courses.datatalks.club/de-zoomcamp-2026/project/project1
* Review: https://courses.datatalks.club/de-zoomcamp-2026/project/project1/eval

#### Project Attempt #2

* Project: https://courses.datatalks.club/de-zoomcamp-2026/project/project2
* Review: https://courses.datatalks.club/de-zoomcamp-2026/project/project2/eval

> **Important**: update your "Certificate name" here: https://courses.datatalks.club/de-zoomcamp-2026/enrollment -
this is what we will use when generating certificates for you.

### Evaluation criteria

See [here](../../projects/README.md)


================================================
FILE: cohorts/2026/workshops/dlt/README.md
================================================
# From APIs to Warehouses: AI-Assisted Data Ingestion with dlt

Welcome to the **Data Engineering Zoomcamp 2026** workshop!

In this workshop, you'll use an AI-powered IDE to build a complete data pipeline. Using simple prompts, you can go from an API to a local data warehouse with [dlt](https://dlthub.com/docs) (data load tool). The AI handles the code generation. You focus on the results.

## What You'll Build

By the end of this workshop, you will have:

1. A working dlt pipeline that extracts data from the [Open Library API](https://openlibrary.org/developers/api)
2. Normalized relational tables stored in DuckDB
3. The ability to query, inspect, and visualize your data
4. Experience using AI-assisted development for data engineering

**No API key required!** The Open Library API is completely open and doesn't require authentication. You can start building immediately.

---

## Prerequisites

Before the workshop, make sure you have the following set up:

### 1. Understand What dlt Does (Recommended for Beginners)

If you're unfamiliar with dlt and what the library does, we recommend reading through the included Jupyter notebook before the workshop.

**[Open the notebook in Google Colab](https://colab.research.google.com/github/anair123/data-engineering-zoomcamp/blob/workshop/dlt_2026/cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb)**

It walks through dlt step by step:

- What a dlt source and pipeline are
- How data moves through Extract, Normalize, and Load
- How to inspect the loaded data

Understanding these concepts will help you know what the agent-generated code is actually doing.

> You do not need to clone the repo to follow the workshop. The `dlt init` command scaffolds everything you need.

### 2. An Agentic IDE

You'll need an AI-powered code editor that can understand context and generate code from natural language. We recommend:

| IDE | Description |
|-----|-------------|
| [**Cursor**](https://cursor.sh) | VS Code fork with built-in AI assistance (recommended) |
| [Windsurf](https://codeium.com/windsurf) | Alternative agentic IDE |
| [VS Code + GitHub Copilot](https://github.com/features/copilot) | Works, but less integrated |

### 3. Python 3.11+

```bash
python --version  # Should be 3.11 or higher
```

### 4. uv (Recommended) or pip

We use [uv](https://docs.astral.sh/uv/) for fast dependency management:

```bash
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
```

---

## Workshop Instructions

### Step 1: Create a New Project Folder

Create a fresh folder for your pipeline and open it in Cursor (or your preferred agentic IDE):

```bash
mkdir my-dlt-pipeline
cd my-dlt-pipeline
```

### Step 2: Add the dlt MCP Server Config

Choose the setup for your IDE:

Cursor - go to **Settings → Tools & MCP → New MCP Server** and add:

```json
{
  "mcpServers": {
    "dlt": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "dlt[duckdb]",
        "--with",
        "dlt-mcp[search]",
        "python",
        "-m",
        "dlt_mcp"
      ]
    }
  }
}
```

VS Code (Copilot) - create `.vscode/mcp.json` in your project folder:

```json
{
  "servers": {
    "dlt": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "dlt[duckdb]",
        "--with",
        "dlt-mcp[search]",
        "python",
        "-m",
        "dlt_mcp"
      ]
    }
  }
}
```

Claude Code - run in your terminal:

```bash
claude mcp add dlt -- uv run --with "dlt[duckdb]" --with "dlt-mcp[search]" python -m dlt_mcp
```

This enables the dlt MCP server, which gives the AI access to dlt documentation, code examples, and your pipeline metadata.

### Step 3: Install dlt Workspace

```bash
pip install "dlt[workspace]"
```

### Step 4: Initialize the dlt Project

```bash
dlt init dlthub:open_library duckdb
```

This scaffolds the pipeline files and configuration for Open Library. You now have everything you need to start prompting.

> 📖 **Reference:** [Open Library Workspace Instructions](https://dlthub.com/workspace/source/open-library)

### Step 5: Prompt the Agent to Build and Run the Pipeline

This is where the magic happens. The `dlt init` command scaffolds sample prompts you can use. Here's an example to get started:

```
Please generate a REST API Source for Open Library API, as specified in @open_library-docs.yaml
Start with endpoint(s) books and skip incremental loading for now.
Place the code in open_library_pipeline.py and name the pipeline open_library_pipeline.
If the file exists, use it as a starting point.
Do not add or modify any other files.
Use @dlt rest api as a tutorial.
After adding the endpoints, allow the user to run the pipeline with python open_library_pipeline.py and await further instructions.
```

Feel free to tweak the prompt based on your objective. The agent will:
1. Generate the pipeline code
2. Run the pipeline
3. Load data into your local DuckDB database

All from a single prompt.

### Step 6: Debug with the Agent

If there are any errors, paste them into the chat and let the AI resolve them. This is the power of AI-assisted development: you iterate quickly without getting stuck.

### Step 7: Inspect Pipeline Data with the dlt Dashboard

Once your pipeline runs successfully, launch the dashboard to inspect your data and metadata:

```bash
dlt pipeline open_library_pipeline show
```

This opens a web app where you can:
- View pipeline state and run history
- Explore schemas, tables, and columns
- Query the loaded data
- Debug any issues

> 📖 **Reference:** [dlt Dashboard Documentation](https://dlthub.com/docs/general-usage/dashboard)

### Step 8: Inspect the Pipeline via Chat

With the dlt MCP server configured, you can ask the AI about your pipeline directly:

> "What tables were created in the pipeline?"  
> "Show me the schema for the books table."  
> "How many rows were loaded?"

The agent has access to your pipeline metadata and can answer these questions.

### Step 9 (Bonus): Build Visualizations with marimo + ibis

Take your analysis further by creating interactive reports with [marimo](https://marimo.io/) notebooks and [ibis](https://ibis-project.org/).

Prompt the agent to build a visualization:

> "Create a marimo notebook that visualizes the top 10 authors by book count. Use ibis for data access. Reference: https://dlthub.com/docs/general-usage/dataset-access/marimo"

By providing the docs link, the agent will use the correct stack.

Run your notebook:

```bash
# Edit mode (for development)
marimo edit your_notebook.py

# Run mode (view the report)
marimo run your_notebook.py
```

> 📖 **Reference:** [Explore Data with marimo](https://dlthub.com/docs/general-usage/dataset-access/marimo)

---

## Homework

You've seen me do it, now it's your turn!

See [dlt_homework.md](dlt_homework.md) for instructions.

---

## Resources

| Resource | Link |
|----------|------|
| dlt Documentation | [dlthub.com/docs](https://dlthub.com/docs) |
| Open Library Workspace Guide | [dlthub.com/workspace/source/open-library](https://dlthub.com/workspace/source/open-library) |
| dlt Dashboard Docs | [dlthub.com/docs/general-usage/dashboard](https://dlthub.com/docs/general-usage/dashboard) |
| marimo + dlt Guide | [dlthub.com/docs/general-usage/dataset-access/marimo](https://dlthub.com/docs/general-usage/dataset-access/marimo) |
| Open Library API | [openlibrary.org/developers/api](https://openlibrary.org/developers/api) |

---

*Workshop by [dltHub](https://dlthub.com) for the Data Engineering Zoomcamp 2026*


================================================
FILE: cohorts/2026/workshops/dlt/analysis.py
================================================
import marimo

__generated_with = "0.19.9"
app = marimo.App(width="medium")


@app.cell
def _():
    import marimo as mo
    import dlt
    import ibis
    import altair as alt
    from dlt.helpers.marimo import render, load_package_viewer

    return alt, dlt, ibis, load_package_viewer, mo, render


@app.cell
def _(mo):
    mo.md(r"""
    # 📚 Open Library Harry Potter Books Analysis

    This notebook analyzes Harry Potter-related books from the Open Library API using dlt's dataset interface.
    """)
    return


@app.cell
def _(dlt):
    # Access the pipeline and dataset using dlt's native interface
    pipeline = dlt.attach("open_library_pipeline")
    dataset = pipeline.dataset()
    # Get ibis connection for rich data exploration
    ibis_con = dataset.ibis()
    return (ibis_con,)


@app.cell
async def _(load_package_viewer, render):
    # Display the dlt package viewer widget
    await render(load_package_viewer)
    return


@app.cell
def _(mo):
    mo.md(r"""
    ## 📊 Books by Author
    """)
    return


@app.cell
def _(alt, ibis, ibis_con):
    # Query for books by author (top 15) using ibis
    author_table = ibis_con.table("books__author_name")
    author_query = (
        author_table
        .group_by("value")
        .agg(book_count=author_table.value.count())
        .order_by(ibis.desc("book_count"))
        .limit(15)
    )
    author_df = author_query.to_pandas()
    author_df = author_df.rename(columns={"value": "author"})

    # Bar chart for authors
    author_chart = alt.Chart(author_df).mark_bar(color="#6366f1").encode(
        x=alt.X("book_count:Q", title="Number of Books"),
        y=alt.Y("author:N", sort="-x", title="Author"),
        tooltip=["author", "book_count"]
    ).properties(
        title="Top 15 Authors by Number of Books",
        width=600,
        height=400
    )
    author_chart
    return


@app.cell
def _(mo):
    mo.md(r"""
    ## 📈 Books Published Per Year
    """)
    return


@app.cell
def _(alt, ibis_con):
    # Query for books by year using ibis
    books_table = ibis_con.table("books")
    year_query = (
        books_table
        .filter((books_table.first_publish_year >= 1997) & (books_table.first_publish_year <= 2025))
        .group_by("first_publish_year")
        .agg(books=books_table.first_publish_year.count())
        .order_by("first_publish_year")
    )
    year_df = year_query.to_pandas()
    year_df = year_df.rename(columns={"first_publish_year": "year"})

    # Line chart for publication years
    year_chart = alt.Chart(year_df).mark_line(
        point=True,
        color="#10b981"
    ).encode(
        x=alt.X("year:O", title="Year"),
        y=alt.Y("books:Q", title="Number of Books"),
        tooltip=["year", "books"]
    ).properties(
        title="Harry Potter-Related Books Published Per Year (1997-2025)",
        width=700,
        height=350
    )
    year_chart
    return


@app.cell
def _(mo):
    mo.md(r"""
    ## 🌍 Books by Language
    """)
    return


@app.cell
def _(alt, ibis, ibis_con):
    # Query for books by language using ibis
    lang_table = ibis_con.table("books__language")
    lang_query = (
        lang_table
        .group_by("value")
        .agg(count=lang_table.value.count())
        .order_by(ibis.desc("count"))
        .limit(10)
    )
    language_df = lang_query.to_pandas()

    # Map language codes to full names
    lang_map = {
        'eng': 'English', 'ger': 'German', 'fre': 'French',
        'spa': 'Spanish', 'ita': 'Italian', 'chi': 'Chinese',
        'por': 'Portuguese', 'rus': 'Russian', 'kor': 'Korean', 'pol': 'Polish'
    }
    language_df["language"] = language_df["value"].map(lambda x: lang_map.get(x, x))

    # Pie chart for languages
    language_chart = alt.Chart(language_df).mark_arc(innerRadius=50).encode(
        theta=alt.Theta("count:Q", title="Count"),
        color=alt.Color("language:N", title="Language", scale=alt.Scale(scheme="tableau10")),
        tooltip=["language", "count"]
    ).properties(
        title="Proportion of Books by Language (Top 10)",
        width=400,
        height=400
    )
    language_chart
    return


@app.cell
def _(mo):
    mo.md(r"""
    ## 📋 Summary Statistics

    Key insights from the Open Library Harry Potter books dataset.
    """)
    return


@app.cell
def _(ibis_con, mo):
    # Get summary stats using ibis
    total_books = ibis_con.table("books").count().to_pandas()
    total_authors = ibis_con.table("books__author_name").value.nunique().to_pandas()
    total_languages = ibis_con.table("books__language").value.nunique().to_pandas()

    mo.md(f"""
    | Metric | Value |
    |--------|-------|
    | **Total Books** | {total_books:,} |
    | **Unique Authors** | {total_authors:,} |
    | **Languages** | {total_languages} |
    """)
    return


@app.cell
def _():
    return


@app.cell
def _():
    return


if __name__ == "__main__":
    app.run()


================================================
FILE: cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bPVVve29bu6Z"
   },
   "source": [
    "# Building a Data Pipeline with dlt\n",
    "\n",
    "In this notebook, we will build a complete data pipeline from scratch using **dlt**.\n",
    "\n",
    "Our goal is simple:\n",
    "\n",
    "→ Fetch real data from an API  \n",
    "→ Turn it into clean relational tables  \n",
    "→ Load it into a database  \n",
    "→ Explore and analyze it  \n",
    "\n",
    "We will use the **Open Library API** as our data source and **DuckDB** as our database.\n",
    "\n",
    "Along the way, you will learn:\n",
    "\n",
    "- What a dlt source is  \n",
    "- What a dlt pipeline does  \n",
    "- How data moves through Extract → Normalize → Load  \n",
    "- How to inspect and explore the final dataset  \n",
    "\n",
    "By the end, you will understand not just how to run a pipeline, but what happens at each stage.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "u9eCv60qV5PS"
   },
   "source": [
    "## 📦 Step 0: Install Dependencies\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "Arp4d7KZNRTS"
   },
   "outputs": [],
   "source": [
    "# install dependencies first (the quotes stop shells like zsh from expanding the brackets)\n",
    "!pip -q install \"dlt[duckdb]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "x7VGYS5hWNKQ"
   },
   "source": [
    "<p>In this notebook we will use:</p>\n",
    "\n",
    "<ul>\n",
    "  <li><strong>dlt</strong> to extract, normalize, and load data</li>\n",
    "  <li><strong>DuckDB</strong> as the destination database (runs locally inside Colab)</li>\n",
    "</ul>\n",
    "\n",
    "<p>\n",
    "  DuckDB is great for beginners because it requires no setup and no credentials.\n",
    "</p>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aQTSvnvnHWBd"
   },
   "source": [
    "## 📚 Step 1: Import Libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YFQGLTECWkpn"
   },
   "source": [
    "\n",
    "<p>In this cell we import the libraries we will use throughout the notebook:</p>\n",
    "\n",
    "<ul>\n",
    "  <li><strong>dlt</strong> is the main library for building and running the pipeline</li>\n",
    "  <li><strong>rest_api_source</strong> helps us define an API source using a simple configuration</li>\n",
    "  <li><strong>islice</strong> (from <code>itertools</code>) is a small Python helper for previewing only a few records</li>\n",
    "</ul>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "id": "Lm8AbbHBImjI"
   },
   "outputs": [],
   "source": [
    "import dlt\n",
    "from itertools import islice\n",
    "from dlt.sources.rest_api import rest_api_source"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UFoBTwDVhzRL"
   },
   "source": [
    "## 🔗 Step 2: Define the API Source (Open Library)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VdKrEM-VXEY2"
   },
   "source": [
    "<p>\n",
    "  In <strong>dlt</strong>, a <strong>source</strong> is the part of your pipeline that knows how to fetch data from somewhere.\n",
    "  In this notebook, our source fetches data from the <strong>Open Library Search API</strong>.\n",
    "</p>\n",
    "\n",
    "<p>\n",
    "  We define the source using <code>rest_api_source</code>, which lets us describe an API in a simple\n",
    "  Python dictionary instead of writing lots of request code.\n",
    "</p>\n",
    "\n",
    "<p>\n",
    "  📖 <strong>Open Library Search API docs:</strong><br>\n",
    "  <a href=\"https://openlibrary.org/dev/docs/api/search\" target=\"_blank\">\n",
    "    https://openlibrary.org/dev/docs/api/search\n",
    "  </a>\n",
    "</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "id": "hOxkEKy4Kaj4"
   },
   "outputs": [],
   "source": [
    "def openlibrary_source(query: str = \"harry potter\"):\n",
    "\n",
    "    return rest_api_source({\n",
    "        \"client\": {\n",
    "            \"base_url\": \"https://openlibrary.org\",\n",
    "        },\n",
    "        \"resource_defaults\": {\n",
    "            \"primary_key\": \"key\",\n",
    "            \"write_disposition\": \"replace\",\n",
    "        },\n",
    "        \"resources\": [\n",
    "            {\n",
    "                \"name\": \"books\",\n",
    "                \"endpoint\": {\n",
    "                    \"path\": \"search.json\",\n",
    "                    \"params\": {\n",
    "                        \"q\": query,\n",
    "                        \"limit\": 100,\n",
    "                    },\n",
    "                    \"data_selector\": \"docs\",\n",
    "                    \"paginator\": {\n",
    "                        \"type\": \"offset\",\n",
    "                        \"limit\": 100,\n",
    "                        \"offset_param\": \"offset\",\n",
    "                        \"limit_param\": \"limit\",\n",
    "                        \"total_path\": \"numFound\",\n",
    "                    },\n",
    "                },\n",
    "            },\n",
    "        ],\n",
    "    })\n"
   ]
  },
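  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `offset` paginator configured above asks for one page of `limit` records at a time and advances `offset` until it reaches the total reported in `numFound`. The sketch below only illustrates that idea in plain Python. `offset_pages` is a hypothetical helper written for this note, not dlt's actual implementation, and the total of 3756 is simply the row count from the recorded run in this notebook.\n",
    "\n",
    "```python\n",
    "def offset_pages(total, limit=100):\n",
    "    # Yield the (offset, limit) pair for every request needed to cover `total` records.\n",
    "    offset = 0\n",
    "    while offset < total:\n",
    "        yield offset, limit\n",
    "        offset += limit\n",
    "\n",
    "\n",
    "pages = list(offset_pages(total=3756, limit=100))\n",
    "print(len(pages))  # 38 requests\n",
    "print(pages[-1])   # (3700, 100)\n",
    "```\n",
    "\n",
    "Each pair corresponds to one request such as `search.json?q=harry+potter&limit=100&offset=3700`."
   ]
  },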
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ntKAVaEGYFgw"
   },
   "source": [
    "## 🔧 Step 3: Create the dlt Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bxpFEetGh3lS"
   },
   "outputs": [],
   "source": [
    "pipeline = dlt.pipeline(\n",
    "    pipeline_name=\"ol_demo\",\n",
    "    destination=\"duckdb\",\n",
    "    dataset_name=\"ol_data\",\n",
    "    progress=\"log\",  # log pipeline progress to the console (optional)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "y7CJ9A2HXsFb"
   },
   "source": [
    "## 🔍 Understanding the Pipeline\n",
    "\n",
    "At this point we have defined two key building blocks:\n",
    "\n",
    "- **The source** describes where the data comes from and how to fetch it from the API.  \n",
    "- **The pipeline** describes where the data should go (DuckDB) and keeps track of tables, schemas, and run history.  \n",
    "\n",
    "---\n",
    "\n",
    "Instead of running everything at once, we will now run the pipeline in three separate phases so you can clearly see what happens at each stage:\n",
    "\n",
    "1. **Extract**: download raw data from the API  \n",
    "2. **Normalize**: turn nested JSON into relational tables  \n",
    "3. **Load**: write those tables into DuckDB  \n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![ETL Diagram](./images/etl_diagram.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pAYgUUJIw-c4"
   },
   "source": [
    "Once these steps make sense, we will run the full workflow again using one command:\n",
    "\n",
    "```python\n",
    "pipeline.run(openlibrary_source())\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JsfcBA7McJMo"
   },
   "source": [
    "## ⬇️ Step 4: Extract\n",
    "\n",
    "Now we run the first stage of the pipeline: **Extract**.\n",
    "\n",
    "Extract means:\n",
    "\n",
    "- dlt sends requests to the Open Library API\n",
    "- the raw JSON responses are downloaded\n",
    "- the results are stored in dlt’s local working folder\n",
    "\n",
    "At this stage, the data is **not** in DuckDB yet. We are just confirming that we successfully pulled data from the API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "id": "yifCIPxSKJZ4"
   },
   "outputs": [],
   "source": [
    "extract_info = pipeline.extract(openlibrary_source())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NLRRVLnLcNgl"
   },
   "source": [
    "---\n",
    "\n",
    "### What we will print\n",
    "\n",
    "After extraction, we will print a small summary showing:\n",
    "\n",
    "- which **resources** were extracted\n",
    "- which **tables** will be created later\n",
    "- how many rows were extracted per resource\n",
    "\n",
    "This helps confirm that the pipeline is working before we move on to normalization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "wtDasHRNNNN0",
    "outputId": "51c71eeb-5435-40a1-8728-ea48c59bfd58"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Resources: ['books']\n",
      "Tables: ['books']\n",
      "Load ID: 1770907406.962898\n",
      "\n",
      "Resource: books\n",
      "rows extracted: 3756\n",
      "\n"
     ]
    }
   ],
   "source": [
    "load_id = extract_info.loads_ids[-1]\n",
    "m = extract_info.metrics[load_id][0]\n",
    "\n",
    "print(\"Resources:\", list(m[\"resource_metrics\"].keys()))\n",
    "print(\"Tables:\", list(m[\"table_metrics\"].keys()))\n",
    "print(\"Load ID:\", load_id)\n",
    "print()\n",
    "\n",
    "for resource, rm in m[\"resource_metrics\"].items():\n",
    "    print(f\"Resource: {resource}\")\n",
    "    print(f\"rows extracted: {rm.items_count}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "f6MwYtznc3UX"
   },
   "source": [
    "### What you should see after Extract\n",
    "\n",
    "In our case, Extract shows only **one resource and one table**:\n",
    "\n",
    "- **Resources:** `['books']`  \n",
    "- **Tables:** `['books']`\n",
    "\n",
    "That is expected.\n",
    "\n",
    "The `search` endpoint returns a list of book results, so dlt stores those rows in a single table called `books`. The interesting part comes next, because many fields inside each row are lists or nested objects. Those will turn into additional tables during **Normalize**.\n",
    "\n",
    "In the recorded run above:\n",
    "\n",
    "- **3756 rows extracted** means we pulled 3,756 search results (books)  \n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lQVLZMcyXWkm"
   },
   "source": [
    "## 🔄 Step 5: Normalize\n",
    "\n",
    "Now we run **Normalize**. This is where dlt transforms raw JSON into a clean relational structure.\n",
    "\n",
    "During normalization, dlt does three key things:\n",
    "\n",
    "### 1. Adds Tracking Columns to the Main Table\n",
    "\n",
    "dlt adds two tracking columns to the main `books` table:\n",
    "- `_dlt_id`: A unique identifier for each row\n",
    "- `_dlt_load_id`: Links each row to the load job that created it\n",
    "\n",
    "### 2. Flattens Nested Data into Child Tables\n",
    "\n",
    "APIs often return nested JSON. For example, a book can have multiple authors (a list), multiple editions, and multiple identifiers.\n",
    "\n",
    "dlt flattens these nested structures into separate **child tables** with names like:\n",
    "- `books__author_name`\n",
    "- `books__author_key`\n",
    "- `books__language`\n",
    "\n",
    "Each child table has a `_dlt_parent_id` column that references `_dlt_id` in the parent table. This is how dlt maintains relationships.\n",
    "\n",
    "### 3. Creates Metadata Tables\n",
    "\n",
    "dlt also creates internal tables to track pipeline state:\n",
    "- `_dlt_loads`: Tracks load history (when data was loaded, status)\n",
    "- `_dlt_pipeline_state`: Stores pipeline state for incremental loading\n",
    "- `_dlt_version`: Tracks schema versions\n",
    "\n",
    "In the next cell, we will print a summary showing which tables were created.\n"
   ]
  },
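  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the parent-child idea concrete, here is a deliberately simplified sketch of flattening one record. It is not dlt's real normalizer: `flatten` is a hypothetical helper, the IDs are random UUIDs chosen for illustration, and only lists of plain values are handled. The column names (`value`, `_dlt_id`, `_dlt_parent_id`, `_dlt_list_idx`) mirror the ones dlt uses in child tables.\n",
    "\n",
    "```python\n",
    "import uuid\n",
    "\n",
    "\n",
    "def flatten(record, table='books'):\n",
    "    # Split one nested record into a parent row plus child-table rows.\n",
    "    parent = {'_dlt_id': uuid.uuid4().hex}  # hypothetical ID scheme\n",
    "    tables = {table: [parent]}\n",
    "    for key, value in record.items():\n",
    "        if isinstance(value, list):\n",
    "            # Each list becomes its own child table named <parent>__<field>.\n",
    "            tables[f'{table}__{key}'] = [\n",
    "                {\n",
    "                    'value': item,\n",
    "                    '_dlt_parent_id': parent['_dlt_id'],\n",
    "                    '_dlt_list_idx': idx,\n",
    "                    '_dlt_id': uuid.uuid4().hex,\n",
    "                }\n",
    "                for idx, item in enumerate(value)\n",
    "            ]\n",
    "        else:\n",
    "            parent[key] = value\n",
    "    return tables\n",
    "\n",
    "\n",
    "book = {'key': '/works/OL1W', 'title': 'Example', 'author_name': ['A', 'B']}\n",
    "for name, rows in flatten(book).items():\n",
    "    print(name, len(rows))\n",
    "```\n",
    "\n",
    "Every child row carries the `_dlt_id` of its parent in `_dlt_parent_id`, which is the column you join on when querying the real tables."
   ]
  },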
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "id": "LCmiiG3tXXwh"
   },
   "outputs": [],
   "source": [
    "normalize_info = pipeline.normalize()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "-kNiY112Xvuk",
    "outputId": "502bff6b-edb2-4bd8-a9e9-1f1b88f20c48"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Load ID: 1770907406.962898\n",
      "\n",
      "Tables created/updated:\n",
      "  - books: 3756 rows\n",
      "  - books__author_key: 4600 rows\n",
      "  - books__author_name: 4600 rows\n",
      "  - books__ia: 3422 rows\n",
      "  - books__ia_collection: 2724 rows\n",
      "  - books__language: 3748 rows\n",
      "  - books__id_standard_ebooks: 12 rows\n",
      "  - books__id_librivox: 60 rows\n",
      "  - books__id_project_gutenberg: 54 rows\n"
     ]
    }
   ],
   "source": [
    "load_id = normalize_info.loads_ids[-1]\n",
    "m = normalize_info.metrics[load_id][0]\n",
    "\n",
    "print(\"Load ID:\", load_id)\n",
    "print()\n",
    "\n",
    "print(\"Tables created/updated:\")\n",
    "for table_name, tm in m[\"table_metrics\"].items():\n",
    "    # skip dlt internal tables to keep it beginner-friendly\n",
    "    if table_name.startswith(\"_dlt\"):\n",
    "        continue\n",
    "    print(f\"  - {table_name}: {tm.items_count} rows\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ctHuJ0yEdNaq"
   },
   "source": [
    "### What happened during Normalize?\n",
    "\n",
    "After running `pipeline.normalize()`, we now see multiple tables instead of just one.\n",
    "\n",
    "Tables created/updated:\n",
    "\n",
    "- `books`\n",
    "- `books__author_key`\n",
    "- `books__author_name`\n",
    "- `books__ia`\n",
    "- `books__ia_collection`\n",
    "- `books__language`\n",
    "- `books__id_standard_ebooks`\n",
    "- `books__id_librivox`\n",
    "- `books__id_project_gutenberg`\n",
    "\n",
    "---\n",
    "\n",
    "### What does this mean?\n",
    "\n",
    "We started with **N book search results** in the `books` table (3756 in the recorded run).\n",
    "\n",
    "During normalization:\n",
    "\n",
    "- A book can have **more than one author**, so the author lists were split into:\n",
    "  - `books__author_name`\n",
    "  - `books__author_key`\n",
    "\n",
    "- The `ia` field (Internet Archive IDs) is a list, so it became:\n",
    "  - `books__ia`\n",
    "\n",
    "- The `ia_collection` field is also a list, so it became:\n",
    "  - `books__ia_collection`\n",
    "\n",
    "- The `language` field is a list, so it became:\n",
    "  - `books__language`\n",
    "\n",
    "- Some books carry lists of **external identifiers**, which became:\n",
    "  - `books__id_standard_ebooks`\n",
    "  - `books__id_librivox`\n",
    "  - `books__id_project_gutenberg`\n",
    "\n",
    "This is the key moment in the pipeline.\n",
    "\n",
    "The data has been transformed from nested JSON into a **relational structure** with multiple linked tables. This makes it much easier to query and analyze.\n",
    "\n",
    "---\n",
    "\n",
    "### Schema Visualization\n",
    "\n",
    "dlt can render the schema as a visual diagram. Run the next cell to see the parent-child table relationships:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<script src=\"//d3js.org/d3.v7.min.js\"></script>\n",
       "<script src=\"https://unpkg.com/@hpcc-js/wasm@2.20.0/dist/graphviz.umd.js\"></script>\n",
       "<script src=\"https://unpkg.com/d3-graphviz@5.6.0/build/d3-graphviz.js\"></script>\n",
       "\n",
       "<div id=\"graph\" style=\"width:100%;height:100vh;display:flex;justify-content:center;align-items:center;\"></div>\n",
       "<script>\n",
       "    d3.select(\"#graph\")\n",
       "      .graphviz({fit: true})\n",
       "      .renderDot(\n",
       "        `\n",
       "        digraph rest_api {\n",
       "    graph [fontname=\"helvetica\", fontcolor=\"{TABLE_BORDER_COLOR}\", rankdir=\"BT\", ranksep=5, layout=\"twopi\", root=\"_dlt_loads\"];\n",
       "    node [penwidth=0, margin=0, fontname=\"helvetica\"];\n",
       "    edge [fontname=\"helvetica\", fontcolor=\"{TABLE_BORDER_COLOR}\", color=\"{TABLE_BORDER_COLOR}\"];\n",
       "\n",
       "\"books\" [id=\"books\"; label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>books</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">cover_edition_key</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">cover_i</td>\n",
       "                        <td align=\"right\"><font>bigint</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">ebook_access</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">edition_count</td>\n",
       "                        <td align=\"right\"><font>bigint</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f5\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">first_publish_year</td>\n",
       "                        <td align=\"right\"><font>bigint</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f6\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">has_fulltext</td>\n",
       "                        <td align=\"right\"><font>bool</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f7\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\"><B>key🔑</B></td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f8\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">lending_edition_s</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f9\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">lending_identifier_s</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f10\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">public_scan_b</td>\n",
       "                        <td align=\"right\"><font>bool</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f11\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">title</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f12\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_load_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f13\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f14\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">subtitle</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"books__author_key\" [id=\"books__author_key\"; label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>books__author_key</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">value</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_parent_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_list_idx</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"books__author_name\" [id=\"books__author_name\"; label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>books__author_name</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">value</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_parent_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_list_idx</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"books__ia\" [id=\"books__ia\"; label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>books__ia</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">value</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_parent_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_list_idx</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"books__ia_collection\" [id=\"books__ia_collection\"; label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>books__ia_collection</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">value</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_parent_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_list_idx</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"books__language\" [id=\"books__language\"; label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>books__language</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">value</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_parent_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_list_idx</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"books__id_standard_ebooks\" [id=\"books__id_standard_ebooks\"; label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>books__id_standard_ebooks</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">value</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_parent_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_list_idx</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"books__id_librivox\" [id=\"books__id_librivox\"; label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>books__id_librivox</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">value</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_parent_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_list_idx</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"books__id_project_gutenberg\" [id=\"books__id_project_gutenberg\"; label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>books__id_project_gutenberg</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">value</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_parent_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_list_idx</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"_dlt_version\" [id=\"_dlt_version\";tooltip=\"Created by DLT. Tracks schema updates\";label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>_dlt_version</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">version</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">engine_version</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">inserted_at</td>\n",
       "                        <td align=\"right\"><font>timestamp <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">schema_name</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f5\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">version_hash</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f6\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">schema</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"_dlt_loads\" [id=\"_dlt_loads\";tooltip=\"Created by DLT. Tracks completed loads\";label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>_dlt_loads</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">load_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">schema_name</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">status</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">inserted_at</td>\n",
       "                        <td align=\"right\"><font>timestamp <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f5\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">schema_version_hash</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "\"_dlt_pipeline_state\" [id=\"_dlt_pipeline_state\"; label=<\n",
       "    <table border=\"0\" color=\"#1c1c34\" cellborder=\"1\" cellspacing=\"0\" cellpadding=\"6\">\n",
       "                <tr>\n",
       "            <td port=\"p0\" bgcolor=\"#bbca06\">\n",
       "                <font color=\"#1c1c34\"><b>_dlt_pipeline_state</b></font>\n",
       "            </td>\n",
       "        </tr>\n",
       "\n",
       "        <tr>\n",
       "            <td align=\"left\" port=\"f1\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">version</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f2\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">engine_version</td>\n",
       "                        <td align=\"right\"><font>bigint <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f3\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">pipeline_name</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f4\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">state</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f5\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">created_at</td>\n",
       "                        <td align=\"right\"><font>timestamp <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f6\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">version_hash</td>\n",
       "                        <td align=\"right\"><font>text</font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f7\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_load_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr><tr>\n",
       "            <td align=\"left\" port=\"f8\" bgcolor=\"#e7e2dd\">\n",
       "                <table cellpadding=\"0\" cellspacing=\"0\" border=\"0\">\n",
       "                    <tr>\n",
       "                        <td align=\"left\">_dlt_id</td>\n",
       "                        <td align=\"right\"><font>text <B>NN</B></font></td>\n",
       "                    </tr>\n",
       "                </table>\n",
       "            </td>\n",
       "        </tr>\n",
       "    </table>\n",
       ">];\n",
       "\n",
       "books:p0 -> _dlt_loads:p0 [style=invis]\n",
       "books:f12:_ -> _dlt_loads:f1:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "_dlt_pipeline_state:p0 -> _dlt_loads:p0 [style=invis]\n",
       "_dlt_pipeline_state:f7:_ -> _dlt_loads:f1:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "books__author_key:p0 -> books:p0 [style=invis]\n",
       "books__author_key:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "books__author_name:p0 -> books:p0 [style=invis]\n",
       "books__author_name:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "books__ia:p0 -> books:p0 [style=invis]\n",
       "books__ia:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "books__ia_collection:p0 -> books:p0 [style=invis]\n",
       "books__ia_collection:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "books__language:p0 -> books:p0 [style=invis]\n",
       "books__language:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "books__id_standard_ebooks:p0 -> books:p0 [style=invis]\n",
       "books__id_standard_ebooks:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "books__id_librivox:p0 -> books:p0 [style=invis]\n",
       "books__id_librivox:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "books__id_project_gutenberg:p0 -> books:p0 [style=invis]\n",
       "books__id_project_gutenberg:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "_dlt_version:p0 -> _dlt_loads:p0 [style=invis]\n",
       "_dlt_version:f5:_ -> _dlt_loads:f5:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "_dlt_version:p0 -> _dlt_loads:p0 [style=invis]\n",
       "_dlt_version:f4:_ -> _dlt_loads:f2:_ [dir=both, penwidth=1, color=\"#1c1c34\", arrowtail=\"vee\", arrowhead=\"dot\"];\n",
       "}\n",
       "        `\n",
       "      );\n",
       "</script>\n"
      ],
      "text/plain": [
       "<dlt.Schema(name='rest_api', version=2, tables=['_dlt_version', '_dlt_loads', 'books', '_dlt_pipeline_state', 'books__author_key', 'books__author_name', 'books__ia', 'books__ia_collection', 'books__language', 'books__id_standard_ebooks', 'books__id_librivox', 'books__id_project_gutenberg'], version_hash='ZJIabaQJ9DAYgsR04wEVeXOgU80roBUfdvrR2YoBEyU=')>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display schema \n",
    "pipeline.default_schema"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lJ5QzSnYdidK"
   },
   "source": [
    "## 📤 Step 6: Load\n",
    "\n",
    "Now we run the final stage of the pipeline: **Load**.\n",
    "\n",
    "Load means:\n",
    "\n",
    "- dlt creates tables in DuckDB (if they do not already exist)\n",
    "- the normalized rows are inserted into those tables\n",
    "- the pipeline records the load in its internal tracking tables\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "id": "d9Xb67c5XfL5"
   },
   "outputs": [],
   "source": [
    "load_info = pipeline.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ehkz8lESGGdm"
   },
   "source": [
    "\n",
    "After this step, the data is fully stored in the database and ready to query.\n",
    "\n",
    "At this point:\n",
    "\n",
    "- The `books` table contains our books\n",
    "- The related tables (such as `books__author_name` and `books__editions__docs`) contain the exploded nested data\n",
    "- Everything is now queryable using `pipeline.dataset()` or SQL\n",
    "\n",
    "This is the moment where the data officially moves from “pipeline processing” into a database you can explore."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jBznxM00eCOF"
   },
   "source": [
    "## 🚀 Step 7: Run the Full Pipeline\n",
    "\n",
    "Now that we have walked through each step individually, we can run the entire workflow using a single command:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "id": "YQLigkh-f7Ey"
   },
   "outputs": [],
   "source": [
    "load_info = pipeline.run(openlibrary_source())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SbLkA8W7eNPb"
   },
   "source": [
    "<h3>What does <code>pipeline.run()</code> do?</h3>\n",
    "\n",
    "<p>\n",
    "  <code>pipeline.run()</code> simply combines the three steps we already executed manually:\n",
    "</p>\n",
    "\n",
    "<ol>\n",
    "  <li><strong>Extract</strong> – fetch data from the Open Library API</li>\n",
    "  <li><strong>Normalize</strong> – convert nested JSON into relational tables</li>\n",
    "  <li><strong>Load</strong> – write those tables into DuckDB</li>\n",
    "</ol>\n",
    "\n",
    "<p>In other words, this:</p>\n",
    "\n",
    "<pre><code>pipeline.run(source)</code></pre>\n",
    "\n",
    "<p>is equivalent to:</p>\n",
    "\n",
    "<pre><code>pipeline.extract(source)\n",
    "pipeline.normalize()\n",
    "pipeline.load()</code></pre>\n",
    "\n",
    "<p>\n",
    "  There is no hidden magic. It just runs the full ELT process in order.\n",
    "</p>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7ViMq6gIfJj_"
   },
   "source": [
    "## 🔎 Step 8: Inspect the Loaded Data\n",
    "\n",
    "Now that the data is loaded into DuckDB, we can inspect it using `pipeline.dataset()`.\n",
    "\n",
    "This gives us a convenient Python interface for exploring the tables that dlt created, without writing SQL.\n",
    "\n",
    "---\n",
    "\n",
    "### List available tables\n",
    "\n",
    "First, let’s see what tables exist in the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "id": "bmnrK1aVZXPO"
   },
   "outputs": [],
   "source": [
    "ds = pipeline.dataset()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "SV6J6AtBf0xq",
    "outputId": "19ad26bf-f34a-4f8e-c30c-5acd3342c3c5"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['books',\n",
       " 'books__author_key',\n",
       " 'books__author_name',\n",
       " 'books__ia',\n",
       " 'books__ia_collection',\n",
       " 'books__language',\n",
       " 'books__id_standard_ebooks',\n",
       " 'books__id_librivox',\n",
       " 'books__id_project_gutenberg',\n",
       " '_dlt_version',\n",
       " '_dlt_loads',\n",
       " '_dlt_pipeline_state']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.tables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 315
    },
    "id": "WLa4yN7lf1TF",
    "outputId": "d2da841b-a8bf-461f-a011-eb1db644656f"
   },
   "outputs": [
    {
     "data": {
      "application/vnd.google.colaboratory.intrinsic+json": {
       "summary": "{\n  \"name\": \"df\",\n  \"rows\": 3756,\n  \"fields\": [\n    {\n      \"column\": \"cover_edition_key\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 1192,\n        \"samples\": [\n          \"OL24951484M\",\n          \"OL9131663M\",\n          \"OL47198575M\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"cover_i\",\n      \"properties\": {\n        \"dtype\": \"Int64\",\n        \"num_unique_values\": 1288,\n        \"samples\": [\n          842156,\n          10365881,\n          3341732\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"ebook_access\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"printdisabled\",\n          \"unclassified\",\n          \"no_ebook\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"edition_count\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 108,\n        \"min\": 0,\n        \"max\": 3546,\n        \"num_unique_values\": 62,\n        \"samples\": [\n          44,\n          92,\n          396\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"first_publish_year\",\n      \"properties\": {\n        \"dtype\": \"Int64\",\n        \"num_unique_values\": 127,\n        \"samples\": [\n          2008,\n          1622,\n          1962\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"has_fulltext\",\n      \"properties\": {\n        \"dtype\": \"boolean\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          false,\n          true\n        ],\n        \"semantic_type\": \"\",\n     
   \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"key\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3756,\n        \"samples\": [\n          \"/works/OL34662215W\",\n          \"/works/OL39702699W\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"lending_edition_s\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 281,\n        \"samples\": [\n          \"OL45637056M\",\n          \"OL26064272M\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"lending_identifier_s\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 281,\n        \"samples\": [\n          \"alicesadventures0000unse_v7d2\",\n          \"harrypottermagic0000unse_n5w6\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"public_scan_b\",\n      \"properties\": {\n        \"dtype\": \"boolean\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          true,\n          false\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"title\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2984,\n        \"samples\": [\n          \"1000 Facts and Trivia about Marvel Cinematic Universe, Game of Thrones, Disney, Star Wars, Harry Potter 1\",\n          \"The Unofficial Harry Potter Insults Handbook\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"_dlt_load_id\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 1,\n        \"samples\": [\n          \"1770819876.9353185\"\n        ],\n        \"semantic_type\": \"\",\n 
       \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"_dlt_id\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 3756,\n        \"samples\": [\n          \"ZN3UfCkWBXFxSw\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"subtitle\",\n      \"properties\": {\n        \"dtype\": \"category\",\n        \"num_unique_values\": 59,\n        \"samples\": [\n          \"Hogwarts Through the Years\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}",
       "type": "dataframe",
       "variable_name": "df"
      },
      "text/html": [
       "\n",
       "  <div id=\"df-78b9c6b5-669d-4905-9f29-f9885ad30a9d\" class=\"colab-df-container\">\n",
       "    <div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cover_edition_key</th>\n",
       "      <th>cover_i</th>\n",
       "      <th>ebook_access</th>\n",
       "      <th>edition_count</th>\n",
       "      <th>first_publish_year</th>\n",
       "      <th>has_fulltext</th>\n",
       "      <th>key</th>\n",
       "      <th>lending_edition_s</th>\n",
       "      <th>lending_identifier_s</th>\n",
       "      <th>public_scan_b</th>\n",
       "      <th>title</th>\n",
       "      <th>_dlt_load_id</th>\n",
       "      <th>_dlt_id</th>\n",
       "      <th>subtitle</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>OL61027601M</td>\n",
       "      <td>15155833</td>\n",
       "      <td>borrowable</td>\n",
       "      <td>396</td>\n",
       "      <td>1997</td>\n",
       "      <td>True</td>\n",
       "      <td>/works/OL82563W</td>\n",
       "      <td>OL38565767M</td>\n",
       "      <td>harrypotterylapi0000rowl_q5r6</td>\n",
       "      <td>False</td>\n",
       "      <td>Harry Potter and the Philosopher's Stone</td>\n",
       "      <td>1770819876.9353185</td>\n",
       "      <td>lGJrV2BS8Z9qJQ</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>OL26378158M</td>\n",
       "      <td>15158660</td>\n",
       "      <td>printdisabled</td>\n",
       "      <td>144</td>\n",
       "      <td>2007</td>\n",
       "      <td>True</td>\n",
       "      <td>/works/OL82586W</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>False</td>\n",
       "      <td>Harry Potter and the Deathly Hallows</td>\n",
       "      <td>1770819876.9353185</td>\n",
       "      <td>F9W0WQlLwgvsFw</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>OL26234270M</td>\n",
       "      <td>10580435</td>\n",
       "      <td>borrowable</td>\n",
       "      <td>278</td>\n",
       "      <td>1999</td>\n",
       "      <td>True</td>\n",
       "      <td>/works/OL82536W</td>\n",
       "      <td>OL48101764M</td>\n",
       "      <td>bdrc-W8LS66814</td>\n",
       "      <td>False</td>\n",
       "      <td>Harry Potter and the Prisoner of Azkaban</td>\n",
       "      <td>1770819876.9353185</td>\n",
       "      <td>kSdfO1XbBVAjmQ</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "    <div class=\"colab-df-buttons\">\n",
       "\n",
       "  <div class=\"colab-df-container\">\n",
       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-78b9c6b5-669d-4905-9f29-f9885ad30a9d')\"\n",
       "            title=\"Convert this dataframe to an interactive table.\"\n",
       "            style=\"display:none;\">\n",
       "\n",
       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
       "  </svg>\n",
       "    </button>\n",
       "\n",
       "  <style>\n",
       "    .colab-df-container {\n",
       "      display:flex;\n",
       "      gap: 12px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert {\n",
       "      background-color: #E8F0FE;\n",
       "      border: none;\n",
       "      border-radius: 50%;\n",
       "      cursor: pointer;\n",
       "      display: none;\n",
       "      fill: #1967D2;\n",
       "      height: 32px;\n",
       "      padding: 0 0 0 0;\n",
       "      width: 32px;\n",
       "    }\n",
       "\n",
       "    .colab-df-convert:hover {\n",
       "      background-color: #E2EBFA;\n",
       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
       "      fill: #174EA6;\n",
       "    }\n",
       "\n",
       "    .colab-df-buttons div {\n",
       "      margin-bottom: 4px;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert {\n",
       "      background-color: #3B4455;\n",
       "      fill: #D2E3FC;\n",
       "    }\n",
       "\n",
       "    [theme=dark] .colab-df-convert:hover {\n",
       "      background-color: #434B5C;\n",
       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
       "      fill: #FFFFFF;\n",
       "    }\n",
       "  </style>\n",
       "\n",
       "    <script>\n",
       "      const buttonEl =\n",
       "        document.querySelector('#df-78b9c6b5-669d-4905-9f29-f9885ad30a9d button.colab-df-convert');\n",
       "      buttonEl.style.display =\n",
       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
       "\n",
       "      async function convertToInteractive(key) {\n",
       "        const element = document.querySelector('#df-78b9c6b5-669d-4905-9f29-f9885ad30a9d');\n",
       "        const dataTable =\n",
       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
       "                                                    [key], {});\n",
       "        if (!dataTable) return;\n",
       "\n",
       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
       "          + ' to learn more about interactive tables.';\n",
       "        element.innerHTML = '';\n",
       "        dataTable['output_type'] = 'display_data';\n",
       "        await google.colab.output.renderOutput(dataTable, element);\n",
       "        const docLink = document.createElement('div');\n",
       "        docLink.innerHTML = docLinkHtml;\n",
       "        element.appendChild(docLink);\n",
       "      }\n",
       "    </script>\n",
       "  </div>\n",
       "\n",
       "\n",
       "    </div>\n",
       "  </div>\n"
      ],
      "text/plain": [
       "  cover_edition_key   cover_i   ebook_access  edition_count  \\\n",
       "0       OL61027601M  15155833     borrowable            396   \n",
       "1       OL26378158M  15158660  printdisabled            144   \n",
       "2       OL26234270M  10580435     borrowable            278   \n",
       "\n",
       "   first_publish_year  has_fulltext              key lending_edition_s  \\\n",
       "0                1997          True  /works/OL82563W       OL38565767M   \n",
       "1                2007          True  /works/OL82586W              None   \n",
       "2                1999          True  /works/OL82536W       OL48101764M   \n",
       "\n",
       "            lending_identifier_s  public_scan_b  \\\n",
       "0  harrypotterylapi0000rowl_q5r6          False   \n",
       "1                           None          False   \n",
       "2                 bdrc-W8LS66814          False   \n",
       "\n",
       "                                      title        _dlt_load_id  \\\n",
       "0  Harry Potter and the Philosopher's Stone  1770819876.9353185   \n",
       "1      Harry Potter and the Deathly Hallows  1770819876.9353185   \n",
       "2  Harry Potter and the Prisoner of Azkaban  1770819876.9353185   \n",
       "\n",
       "          _dlt_id subtitle  \n",
       "0  lGJrV2BS8Z9qJQ     None  \n",
       "1  F9W0WQlLwgvsFw     None  \n",
       "2  kSdfO1XbBVAjmQ     None  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = ds.books.df()      # main table\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OWFqaH2wgCWR"
   },
   "source": [
    "## 💡 Conclusion\n",
    "\n",
    "### What dlt handled for us\n",
    "\n",
    "✔ API requests  \n",
    "✔ JSON normalization  \n",
    "✔ Table creation  \n",
    "✔ Database loading  \n",
    "✔ Simple dataset inspection  \n",
    "\n",
    "---\n",
    "\n",
    "### But there are still friction points\n",
    "\n",
    "• Getting the REST API config exactly right  \n",
    "• Remembering paginator syntax  \n",
    "• Remembering how to inspect tables  \n",
    "• Debugging schema or pagination issues  \n",
    "• Writing Python or SQL to get insights  \n",
    "\n",
    "It works... but it still takes effort.\n",
    "\n",
    "---\n",
    "\n",
    "## 🚀 Next Up: LLM-Powered Workflows\n",
    "\n",
    "dlt now integrates LLMs directly into the workflow to make:\n",
    "\n",
    "• Pipeline runs easier  \n",
    "• Debugging faster  \n",
    "• Schema inspection simpler  \n",
    "• Data analysis more natural  \n",
    "\n",
    "Instead of writing glue code, you can use natural language.\n",
    "\n",
    "In the workshop, we will see what that looks like.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "id": "BweSVO3igErN"
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}


================================================
FILE: cohorts/2026/workshops/dlt/dlt_homework.md
================================================
# Homework: Build Your Own dlt Pipeline

You've seen how to build a pipeline with a scaffolded source. Now it's your turn to do it from scratch with a **custom API**.

## Workshop Content

* [Workshop README](README.md)
* [dlt Pipeline Overview Notebook (Google Colab)](https://colab.research.google.com/github/anair123/data-engineering-zoomcamp/blob/workshop/dlt_2026/cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb)
* [Workshop registration page](https://luma.com/hzis1yzp)

## The Challenge

For this homework, build a dlt pipeline that loads NYC taxi trip data from a custom API into DuckDB and then answer some questions using the loaded data.

## Data Source

You'll be working with **NYC Yellow Taxi trip data** from a custom API (not available as a dlt scaffold). This dataset contains records of individual taxi trips in New York City.

| Property | Value |
|----------|-------|
| Base URL | `https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api` |
| Format | Paginated JSON |
| Page Size | 1,000 records per page |
| Pagination | Stop when an empty page is returned |
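
The stopping rule in the table can be sketched in plain Python. This is a minimal illustration, not the API's documented client: `iter_records`, `fetch_page`, and the `fake_api` stand-in are hypothetical names, and starting at page 1 is an assumption.

```python
from typing import Callable, Iterator

Record = dict
Page = list[Record]


def iter_records(fetch_page: Callable[[int], Page], first_page: int = 1) -> Iterator[Record]:
    """Yield records page by page until the API returns an empty page."""
    page = first_page
    while True:
        records = fetch_page(page)
        if not records:  # an empty page signals the end of the data
            return
        yield from records
        page += 1


# Stand-in for a real HTTP call: two pages of data, then an empty page.
fake_api = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}], 3: []}
rows = list(iter_records(lambda page: fake_api.get(page, [])))
print(len(rows))  # 3
```

In a real pipeline, `fetch_page` would issue the HTTP request; dlt's REST API source can take over this loop for you once the paginator is configured.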

## Setup Instructions

Since this API is custom (not one of the scaffolds in the dlt workspace), the setup is slightly different.

### Step 1: Create a New Project (or Reuse Your Demo Project)

If you already created a project folder while following along with the workshop demo, you can reuse that folder. Otherwise, create a new one:

```bash
mkdir taxi-pipeline
cd taxi-pipeline
```

Open this folder in Cursor (or your preferred agentic IDE).

### Step 2: Set Up the dlt MCP Server (If Not Already Done)

Choose the setup for your IDE:

Cursor - go to **Settings → Tools & MCP → New MCP Server** and add:

```json
{
  "mcpServers": {
    "dlt": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "dlt[duckdb]",
        "--with",
        "dlt-mcp[search]",
        "python",
        "-m",
        "dlt_mcp"
      ]
    }
  }
}
```

VS Code (Copilot) - create `.vscode/mcp.json` in your project folder:

```json
{
  "servers": {
    "dlt": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "dlt[duckdb]",
        "--with",
        "dlt-mcp[search]",
        "python",
        "-m",
        "dlt_mcp"
      ]
    }
  }
}
```

Claude Code - run in your terminal:

```bash
claude mcp add dlt -- uv run --with "dlt[duckdb]" --with "dlt-mcp[search]" python -m dlt_mcp
```

This enables the dlt MCP server, giving the AI access to dlt documentation, code examples, and your pipeline metadata.

### Step 3: Install dlt

```bash
pip install "dlt[workspace]"
```

### Step 4: Initialize the Project

```bash
dlt init dlthub:taxi_pipeline duckdb
```

You can name the project whatever you like. Since this API has no scaffold, the command will create:
- The dlt project files
- Cursor rules for AI assistance

**But no YAML file with API metadata.** You will need to provide the API information yourself.

### Step 5: Prompt the Agent

Now use your AI assistant to build the pipeline. You'll need to provide the API details in your prompt since there's no scaffold.

Here's an example to get you started:

```
Build a REST API source for NYC taxi data.

API details:
- Base URL: https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api
- Data format: Paginated JSON (1,000 records per page)
- Pagination: Stop when an empty page is returned

Place the code in taxi_pipeline.py and name the pipeline taxi_pipeline.
Use @dlt rest api as a tutorial.
```
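
For orientation, the code the agent produces may resemble the sketch below. It is not a verified solution. The resource name `trips`, the dataset name `taxi_data`, the `page` query parameter, and the first page being 1 are all assumptions to check against the API, and the base URL is split into host and path purely for illustration.

```python
from typing import Any


def taxi_rest_api_config() -> dict[str, Any]:
    """Build a rest_api config dict for the taxi API (a sketch under stated assumptions)."""
    return {
        "client": {
            # Host part of the base URL given in the API details.
            "base_url": "https://us-central1-dlthub-analytics.cloudfunctions.net",
        },
        "resources": [
            {
                "name": "trips",  # hypothetical table name
                "endpoint": {
                    "path": "data_engineering_zoomcamp_api",
                    "paginator": {
                        "type": "page_number",
                        "base_page": 1,        # assumption: the first page is 1
                        "page_param": "page",  # assumption: query parameter name
                        "total_path": None,    # assumption: no total count in the response
                        "stop_after_empty_page": True,
                    },
                },
            },
        ],
    }


def run_pipeline() -> None:
    # Imported lazily so the config above can be inspected without dlt installed.
    import dlt
    from dlt.sources.rest_api import rest_api_source

    pipeline = dlt.pipeline(
        pipeline_name="taxi_pipeline",
        destination="duckdb",
        dataset_name="taxi_data",  # hypothetical dataset name
    )
    print(pipeline.run(rest_api_source(taxi_rest_api_config())))
```

Calling `run_pipeline()` at the bottom of `taxi_pipeline.py` would start the load. Keeping the config in its own function makes it easy to print and compare with whatever the agent generates.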

### Step 6: Run and Debug

Run your pipeline and iterate with the agent until it works:

```bash
python taxi_pipeline.py
```

---

## Questions

Once your pipeline has run successfully, use the methods covered in the workshop to investigate the following:

- **dlt Dashboard**: `dlt pipeline taxi_pipeline show`
- **dlt MCP Server**: Ask the agent questions about your pipeline
- **Marimo Notebook**: Build visualizations and run queries

We challenge you to try out the different methods explored in the workshop when answering these questions to see what works best for you. Feel free to share your thoughts on what worked (or didn't) in your submission!

### Question 1: What is the start date and end date of the dataset?

- 2009-01-01 to 2009-01-31
- 2009-06-01 to 2009-07-01
- 2024-01-01 to 2024-02-01
- 2024-06-01 to 2024-07-01

### Question 2: What proportion of trips are paid with credit card?

- 16.66%
- 26.66%
- 36.66%
- 46.66%

### Question 3: What is the total amount of money generated in tips?

- $4,063.41
- $6,063.41
- $8,063.41
- $10,063.41
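
The three questions reduce to three aggregate queries. The sketch below shows their shape on a tiny made-up table. It uses Python's built-in `sqlite3` only so it runs without extra dependencies (your data lives in DuckDB), and the table name `trips`, the column names `pickup_at`, `payment_type`, and `tip_amount`, and the value `'card'` are hypothetical. Check the real schema in the dlt Dashboard first.

```python
import sqlite3

# In-memory stand-in table with invented rows; not the homework data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_at TEXT, payment_type TEXT, tip_amount REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("2020-03-01 08:00:00", "card", 2.0),
        ("2020-03-02 09:30:00", "cash", 0.0),
        ("2020-03-05 18:45:00", "card", 3.5),
        ("2020-03-09 23:10:00", "cash", 0.0),
    ],
)

# Question 1 shape: earliest and latest pickup timestamps.
date_range = conn.execute("SELECT MIN(pickup_at), MAX(pickup_at) FROM trips").fetchone()

# Question 2 shape: share of trips with a given payment type, in percent.
card_share = conn.execute(
    "SELECT 100.0 * SUM(CASE WHEN payment_type = 'card' THEN 1 ELSE 0 END) / COUNT(*) FROM trips"
).fetchone()[0]

# Question 3 shape: sum of all tips.
total_tips = conn.execute("SELECT SUM(tip_amount) FROM trips").fetchone()[0]

print(date_range, card_share, total_tips)
```

The same `SELECT` statements, with the real table and column names substituted, can be run from a marimo notebook or any SQL client connected to the pipeline's DuckDB file.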


### Resources

| Resource | Link |
|----------|------|
| dlt Dashboard Docs | [dlthub.com/docs/general-usage/dashboard](https://dlthub.com/docs/general-usage/dashboard) |
| marimo + dlt Guide | [dlthub.com/docs/general-usage/dataset-access/marimo](https://dlthub.com/docs/general-usage/dataset-access/marimo) |
| dlt Documentation | [dlthub.com/docs](https://dlthub.com/docs) |

---

## Submitting the solutions

- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/dlt
- Deadline: See the website

## Tips

- The API returns paginated data. Make sure your pipeline handles pagination correctly.
- If the agent gets stuck, paste the error into the chat and let it debug.
- Use the dlt MCP server to ask questions about your pipeline metadata.


## Learning in Public

We encourage everyone to share what they learned. This is called "learning in public".

Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

### Example post for LinkedIn

```
🚀 dlt Workshop of Data Engineering Zoomcamp by @DataTalksClub complete!

Just finished the Data Ingestion workshop with @dltHub. Learned how to:

✅ Build REST API data pipelines with dlt
✅ Use AI-assisted development with dlt MCP Server
✅ Load paginated API data into DuckDB
✅ Inspect pipeline data with dlt Dashboard and marimo notebooks

Built a full NYC taxi data pipeline from a custom API - AI-assisted data engineering is the future!

Here's my homework solution: <LINK>

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

### Example post for Twitter/X

```
🔄 dlt Workshop of Data Engineering Zoomcamp done!

- REST API pipelines with @dltHub
- AI-assisted pipeline building
- DuckDB as local data warehouse
- dlt Dashboard & marimo notebooks

My solution: <LINK>

Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```


================================================
FILE: cohorts/2026/workshops/dlt/open_library_pipeline.py
================================================
"""Pipeline to ingest data from the Open Library Search API."""

import dlt
from dlt.sources.rest_api import rest_api_source


def open_library_source(query: str = "harry potter"):
    """
    Create a dlt source for the Open Library Search API.
    
    Args:
        query: Search query string (default: "harry potter")
    """
    return rest_api_source({
        "client": {
            "base_url": "https://openlibrary.org",
        },
        "resource_defaults": {
            "primary_key": "key",
            "write_disposition": "replace",
        },
        "resources": [
            {
                "name": "books",
                "endpoint": {
                    "path": "search.json",
                    "params": {
                        "q": query,
                        "limit": 100,
                    },
                    "data_selector": "docs",
                    "paginator": {
                        "type": "offset",
                        "limit": 100,
                        "offset_param": "offset",
                        "limit_param": "limit",
                        "total_path": "numFound",
                    },
                },
            },
        ],
    })


if __name__ == "__main__":
    pipeline = dlt.pipeline(
        pipeline_name="open_library_pipeline",
        destination="duckdb",
        dataset_name="open_library_data",
        progress="log",
    )

    # Load Harry Potter books from Open Library
    load_info = pipeline.run(open_library_source(query="harry potter"))
    print(load_info)


================================================
FILE: cohorts/2026/workshops/dlt/pyproject.toml
================================================
[project]
name = "zoomcamp-workshop-prep"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "altair>=6.0.0",
    "dlt[workspace]>=1.21.0",
    "ibis-framework[duckdb]>=12.0.0",
    "jupyterlab>=4.5.4",
    "marimo>=0.19.9",
]


================================================
FILE: cohorts/2026/workshops/dlt.md
================================================
# From APIs to Warehouses: AI-Assisted Data Ingestion with dlt

[Video](https://www.youtube.com/watch?v=5eMytPBgmVs)

This hands-on workshop focuses on building reliable data ingestion pipelines to data warehouses (for example, Snowflake) using dlt (data load tool), enhanced with LLMs, the dlt dashboard, and dlt MCP.

## What you'll learn

You'll work through the key building blocks of a production-ready ingestion setup, including:

- Extracting data from APIs, files, and databases
- Normalizing data into consistent schemas
- Writing data to a data warehouse (e.g. Snowflake)
- Using LLMs to accelerate dlt pipeline development
- Validating data and schema changes using the dlt dashboard and dlt MCP

The session is fully practical and code-driven. By the end of the workshop, you'll understand how to design maintainable, scalable ingestion pipelines and use AI and validation tools to build them faster and with confidence.

## Materials

* [Workshop instructions](dlt/README.md)
* [dlt Pipeline Overview Notebook (Google Colab)](https://colab.research.google.com/github/anair123/data-engineering-zoomcamp/blob/workshop/dlt_2026/cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb)
* [Homework](dlt/dlt_homework.md)
* [Homework submission form](https://courses.datatalks.club/de-zoomcamp-2026/homework/dlt)

## About the Speaker

**Aashish Nair** is a Data Engineer at dltHub and the creator of the famous _dlt deployment_ course, where he teaches best practices for running dlt pipelines in production.


================================================
FILE: learning-in-public.md
================================================
# Learning in public

Most people learn in private: they consume content but don't tell
anyone about it. There's nothing wrong with that.

But we want to encourage you to document your progress and
share it publicly on social media.

It helps you get noticed and will lead to:

* Expanding your network: meeting new people and making new friends
* Being invited to meetups, conferences and podcasts
* Landing a job or getting clients
* Many other good things

Here's a more comprehensive reading on why you want to do it: https://github.com/readme/guides/publishing-your-work


## Learning in Public for Zoomcamps

When you submit your homework or project, you can also submit
learning in public posts:

<img src="https://github.com/DataTalksClub/mlops-zoomcamp/raw/main/images/learning-in-public-links.png" />

You can watch this video to see what your learning in public posts may look like:

<a href="https://www.loom.com/share/710e3297487b409d94df0e8da1c984ce" target="_blank">
    <img src="https://github.com/DataTalksClub/mlops-zoomcamp/raw/main/images/learning-in-public.png" height="240" />
</a>

## Daily Documentation

- **Post Daily Diaries**: Document what you learn each day, including the challenges faced and the methods used to overcome them.
- **Create Quick Videos**: Make short videos showcasing your work and upload them to GitHub.

Send a PR if you want to suggest improvements for this document.


================================================
FILE: projects/README.md
================================================
## Course Project

[🎥 Projects how-to (watch it!)](https://www.youtube.com/watch?v=BL0E8xO8OnE)


### Objective

The goal of this project is to apply everything we have learned
in this course to build an end-to-end data pipeline.

### Problem statement

Develop a dashboard with two tiles by:

* Selecting a dataset of interest (see [Datasets](#datasets))
* Creating a pipeline for processing this dataset and putting it into a data lake
* Creating a pipeline for moving the data from the lake to a data warehouse
* Transforming the data in the data warehouse: prepare it for the dashboard
* Building a dashboard to visualize the data


## Data Pipeline 

The pipeline could be **stream** or **batch**: this is the first thing you'll need to decide.

* **Stream**: If you want to consume data in real time and put it into a data lake
* **Batch**: If you want to run things periodically (e.g. hourly/daily)

## Technologies 

You don't have to limit yourself to technologies covered in the course. You can use alternatives as well:

* **Cloud**: AWS, GCP, Azure, ...
* **Infrastructure as code (IaC)**: Terraform, Pulumi, CloudFormation, ...
* **Workflow orchestration**: Airflow, Prefect, Luigi, ...
* **Data Warehouse**: BigQuery, Snowflake, Redshift, ...
* **Batch processing**: Spark, Flink, AWS Batch, ...
* **Stream processing**: Kafka, Pulsar, Kinesis, ...

If you use a tool that wasn't covered in the course, be sure to explain what that tool does.

If you're not certain about some tools, ask in Slack.

## Dashboard

You can use any of the tools shown in the course (Looker Studio or Streamlit) or any other BI tool of your choice to build a dashboard. If you do use another tool, please specify and make sure that the dashboard is somehow accessible to your peers. 

Your dashboard should contain at least two tiles. We suggest you include:

- 1 graph that shows the distribution of some categorical data 
- 1 graph that shows the distribution of the data across a temporal line

Ensure that your graphs are easy to understand by adding references and titles.
 
Example dashboard: ![image](https://user-images.githubusercontent.com/4315804/159771458-b924d0c1-91d5-4a8a-8c34-f36c25c31a3c.png)


## Peer reviewing

> [!IMPORTANT]  
> To evaluate the projects, we'll use peer reviewing. This is a great opportunity for you to learn from each other.
> * To get points for your project, you need to evaluate 3 projects of your peers
> * You get 3 extra points for each evaluation

## Evaluation Criteria

* Problem description
    * 0 points: Problem is not described
    * 2 points: Problem is described, but only briefly or not clearly
    * 4 points: Problem is well described and it's clear what problem the project solves
* Cloud
    * 0 points: Cloud is not used, things run only locally
    * 2 points: The project is developed in the cloud
    * 4 points: The project is developed in the cloud and IaC tools are used
* Data ingestion (choose either batch or stream)
    * Batch / Workflow orchestration
        * 0 points: No workflow orchestration
        * 2 points: Partial workflow orchestration: some steps are orchestrated, some run manually
        * 4 points: End-to-end pipeline: multiple steps in the DAG, uploading data to data lake
    * Stream
        * 0 points: No streaming system (like Kafka, Pulsar, etc)
        * 2 points: A simple pipeline with one consumer and one producer
        * 4 points: Using consumers/producers and streaming technologies (like Kafka streaming, Spark streaming, Flink, etc)
* Data warehouse
    * 0 points: No DWH is used
    * 2 points: Tables are created in DWH, but not optimized
    * 4 points: Tables are partitioned and clustered in a way that makes sense for the upstream queries (with explanation)
* Transformations (dbt, spark, etc)
    * 0 points: No transformations
    * 2 points: Simple SQL transformation (no dbt or similar tools)
    * 4 points: Transformations are defined with dbt, Spark or similar technologies
* Dashboard
    * 0 points: No dashboard
    * 2 points: A dashboard with 1 tile
    * 4 points: A dashboard with 2 tiles
* Reproducibility
    * 0 points: No instructions on how to run the code at all
    * 2 points: Some instructions are there, but they are not complete
    * 4 points: Instructions are clear, it's easy to run the code, and the code works


> [!NOTE]
> It's highly recommended to create a new repository for your project (not inside an existing repo) with a meaningful title, such as
> "Quake Analytics Dashboard" or "Bike Data Insights", and to include as many details as possible in the README file. ChatGPT can assist you with this. Doing so will not only make it easier to showcase your project for potential job opportunities but also allow it to be featured on the [Projects Gallery App](#projects-gallery).
> If you leave the README file empty or with minimal details, there may be point deductions as per the [Evaluation Criteria](#evaluation-criteria).

## Going the extra mile (Optional)

> [!NOTE]
> The following things are not covered in the course, are entirely optional, and will not be graded.

However, implementing these could significantly enhance the quality of your project:

* Add tests
* Use make
* Add CI/CD pipeline

If you intend to include this project in your portfolio, adding these additional features will definitely help you to stand out from others.

## Cheating and plagiarism

Plagiarism in any form is not allowed. Examples of plagiarism:

* Taking somebody else's notebooks and projects (in full or partly) and using them for the capstone project
* Re-using your own projects (in full or partly) from other courses and bootcamps
* Re-using your midterm project from ML Zoomcamp in the capstone
* Re-using your ML Zoomcamp project from previous iterations of the course

Violating any of these rules will result in 0 points for this project.

## Resources

### Datasets

Refer to the provided [datasets](datasets.md) for possible selection.

### Helpful Links

* [Unit Tests + CI for Airflow](https://www.astronomer.io/events/recaps/testing-airflow-to-bulletproof-your-code/)
* [CI/CD for Airflow (with Gitlab & GCP state file)](https://engineering.ripple.com/building-ci-cd-with-airflow-gitlab-and-terraform-in-gcp)
* [CI/CD for Airflow (with GitHub and S3 state file)](https://programmaticponderings.com/2021/12/14/devops-for-dataops-building-a-ci-cd-pipeline-for-apache-airflow-dags/)
* [CD for Terraform](https://medium.com/towards-data-science/git-actions-terraform-for-data-engineers-scientists-gcp-aws-azure-448dc7c60fcc)
* [Spark + Airflow](https://medium.com/doubtnut/github-actions-airflow-for-automating-your-spark-pipeline-c9dff32686b)


### Projects Gallery

Explore a collection of projects completed by members of our community. The projects cover a wide range of topics and utilize different tools and techniques. Feel free to delve into any project and see how others have tackled real-world problems with data, structured their code, and presented their findings. It's a great resource to learn and get ideas for your own projects.

[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://datatalksclub-projects.streamlit.app/)

### DE Zoomcamp 2023

* [2023 Projects](../cohorts/2023/project.md)

### DE Zoomcamp 2022

* [2022 Projects](../cohorts/2022/project.md)


================================================
FILE: projects/datasets.md
================================================
## Datasets

Here are some datasets that you could use for the project:


* [Kaggle](https://www.kaggle.com/datasets)
* [AWS datasets](https://registry.opendata.aws/)
* [UK government open data](https://data.gov.uk/)
* [Github archive](https://www.gharchive.org)
* [Awesome public datasets](https://github.com/awesomedata/awesome-public-datasets)
* [Million songs dataset](http://millionsongdataset.com)
* [Some random datasets](https://components.one/datasets/)
* [COVID Datasets](https://www.reddit.com/r/datasets/comments/n3ph2d/coronavirus_datsets/)
* [Datasets from Azure](https://docs.microsoft.com/en-us/azure/azure-sql/public-data-sets)
* [Datasets from BigQuery](https://cloud.google.com/bigquery/public-data/)
* [Dataset search engine from Google](https://datasetsearch.research.google.com/)
* [Public datasets offered by different GCP services](https://cloud.google.com/solutions/datasets)
* [European statistics datasets](https://ec.europa.eu/eurostat/data/database)
* [Datasets for streaming](https://github.com/ColinEberhardt/awesome-public-streaming-datasets)
* [Dataset for Santander bicycle rentals in London](https://cycling.data.tfl.gov.uk/)
* [Common crawl data](https://commoncrawl.org/) (copy of the internet)
* [NASA's EarthData](https://search.earthdata.nasa.gov/search) (May require introductory geospatial analysis)
* Collection Of Data Repositories
  * [part 1](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-1.html) (from agriculture and finance to government)
  * [part 2](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-2.html) (from healthcare to transportation)
* [Data For Good by Meta](https://dataforgood.facebook.com/dfg/tools)

PRs with more datasets are welcome!

It's not mandatory that you use a dataset from this list. You can use any dataset you want.


================================================
FILE: workshop-best-practices.md
================================================
# Workshop Best Practices

Preferences and patterns learned from building the PyFlink streaming workshop.

## Structure and Pacing

- Introduce services one at a time, not all at once. Start with one container
  (e.g., Redpanda), explain it, use it. Then add the next (PostgreSQL), etc.
- Start with the simplest version that works (plain Python consumer), then
  motivate the more complex tool (Flink) by showing what's missing.
- Use `docker compose up <service> -d` to start services selectively during
  the gradual buildup. `docker compose up --build -d` only when everything
  is ready.

## Data

- Use real datasets, not fake test data. NYC taxi data
  (`yellow_tripdata_YYYY-MM.parquet`) is a good go-to.
- Limit to manageable sizes (e.g., first 1000 rows) for workshop speed.
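
A minimal sketch of the row limit, using only the standard library. The helper name `first_rows` and the default of 1000 are assumptions made for this illustration, not code from the workshop.

```python
from itertools import islice


def first_rows(rows, limit=1000):
    """Return at most `limit` rows from any iterable of rows."""
    # islice stops reading after `limit` items, so the rest of the
    # source is never consumed
    return list(islice(rows, limit))


# Stand-in for a real dataset: any iterable of rows works
sample = first_rows(range(5000))
print(len(sample))
```

If the data is already in a pandas DataFrame, `df.head(1000)` gives the same result.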

## Project Setup

- Assume starting from scratch: `uv init -p 3.12` + `uv add <package>`.
- Add dependencies gradually as they're needed in the narrative
  (e.g., `uv add kafka-python pandas pyarrow` first, `uv add psycopg2-binary`
  later when PostgreSQL is introduced).
- Always note "if you cloned the repo, run `uv sync` instead" as a blockquote.

## Code Delivery

- Break large code blocks into small, focused blocks. Each block should do
  one thing. Don't dump a full script in one block.
- Pattern for code blocks: short intro line (what it does), then the code,
  then the explanation of how it works below. Don't put detailed
  explanations before the code - let the reader see the code first.
- Keep imports local to each block - don't introduce all imports upfront.
  Each block should only import what it uses.
- Introduce functions and utilities where they're first used, not earlier.
  For example, show `dataclasses.asdict()` in the block that calls it, not
  in the block that defines the dataclass.
- When introducing a function, show a test with sample data before using it
  in the real code. For example, create a test binary string to verify a
  deserializer, then pass it to the consumer.
- Prefer named functions over inline lambdas. A named function is reusable,
  testable, and easier to explain step by step. For example,
  `value_deserializer=ride_deserializer` instead of
  `value_deserializer=lambda m: json.loads(m.decode('utf-8'))`.
- Extract repetitive logic into named functions. For example, row-to-object
  conversion that appears in multiple places should be a function like
  `ride_from_row(row)`.
- Split one-liner functions into multiple lines. Each step (decode, parse,
  construct) on its own line is easier to follow and explain.
- Show the simple approach first, then improve it. For example, show a
  generic `json_serializer` with manual `dataclasses.asdict()` calls, then
  introduce a specialized `ride_serializer` that handles the conversion
  internally. Let the student feel the friction before showing the fix.
- Extract shared code (dataclasses, serializers, deserializers, converters)
  into shared modules (e.g., `models.py`) so multiple scripts can import
  from one place.
- Reference the complete script at the end (e.g., "> The complete script is
  in `src/producers/producer.py`.").
- For infrastructure files that are long or complex (Dockerfile, YAML configs),
  link to the file on GitHub and provide a short summary list of what it does.
  Use `wget` to download from the GitHub repo instead of asking students to
  type them.
- Mention that students can run Python code in Jupyter notebooks
  (`uv add jupyter`, `uv run jupyter lab`) as an alternative to .py scripts.
  The small-block style maps naturally to notebook cells.
- Flink jobs must remain as .py files (they're submitted to the cluster via
  `docker compose exec`). Add a note explaining this distinction.
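
Several of the guidelines above (named functions instead of lambdas, one step per line, a test with sample data before real use) can be sketched together. This is an illustration only: the `Ride` fields are assumed column names picked for the example, and the function bodies are one possible way to write them, not the workshop's actual code.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Ride:
    # Assumed fields, chosen only for this illustration
    PULocationID: int
    DOLocationID: int
    trip_distance: float


def ride_from_row(row):
    """Convert a row (a mapping of column name to value) into a Ride."""
    return Ride(
        PULocationID=int(row["PULocationID"]),
        DOLocationID=int(row["DOLocationID"]),
        trip_distance=float(row["trip_distance"]),
    )


def ride_serializer(ride):
    """Turn a Ride into JSON bytes; the asdict() call happens internally."""
    ride_dict = asdict(ride)
    json_str = json.dumps(ride_dict)
    return json_str.encode("utf-8")


def ride_deserializer(data):
    """Turn JSON bytes back into a Ride, one step per line."""
    json_str = data.decode("utf-8")
    ride_dict = json.loads(json_str)
    return Ride(**ride_dict)


# Test with sample data before wiring the functions into a producer or consumer
sample_row = {"PULocationID": "142", "DOLocationID": "236", "trip_distance": "2.1"}
sample_ride = ride_from_row(sample_row)
sample_bytes = ride_serializer(sample_ride)
assert ride_deserializer(sample_bytes) == sample_ride
```

Once the round trip passes, the named function can be handed to the consumer as `value_deserializer=ride_deserializer`. Keeping these definitions in a shared module such as `models.py` lets several scripts import them from one place.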

## Formatting

- No bold formatting (`**text**`) in README files. Use plain text.
- No em dashes. Use hyphens with spaces (` - `) instead.
- Use `python` not `python3`.
- Use `docker compose` not `docker-compose`.
- Use `uvx pgcli` not just `pgcli`.
- Use `uv run python` not `python` for running scripts.

## Naming

- Use meaningful names that reflect purpose, not generic placeholders.
  For example, `group_id='rides-console'` or `group_id='rides-to-postgres'`,
  not `group_id='test-consumer-group'`.

## Explanations

- For complex configurations (like Redpanda's docker-compose command), explain
  every parameter in a table or list.
- Explain the "why" not just the "what" (why two Kafka addresses? why
  checkpointing every 10 seconds? why watermarks?).
- Use tables for parameter explanations and comparisons.
- Include sample output for every command students will run.
- Use `>` blockquotes for tips, notes about the repo, and common mistakes from
  original workshops/streams.
- For complex concepts (watermarks, task slots, parallelism), pull the
  explanation out of bullet lists into its own multi-paragraph section. State
  the value or syntax in the bullet, then explain the concept below in
  separate paragraphs for easier reading.
- Use lists for multi-point summaries instead of packing everything into one
  long sentence.
- When showing a development shortcut (like mounting local files into Docker),
  add a note explaining how it works in production. Students benefit from
  understanding real-world deployment patterns alongside the workshop setup.

## Code Organization

- Define the source (where you read from) before the sink (where you write
  to) when presenting code blocks. Set up the consumer/reader first, then
  the database connection or output destination.

## Docker Compose

- Don't use `container_name` or `hostname` - Docker Compose handles naming
  automatically.
- Don't use `extra_hosts` unless specifically needed.
- Service names are automatically resolvable as hostnames within the Docker
  network.
- Prefer short service names (e.g., `redpanda` not `redpanda-1`).
- Keep `restart: on-failure` only for services that need it (like databases).

## Dependencies and Versions

- Always use the latest stable versions of images and libraries.
- Pin exact versions for Flink and its connectors (they must match).
- Use `uv` for everything Python-related (package management, running scripts,
  even installing Python itself inside Docker).
- Prefer `COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/` in Dockerfiles
  instead of `apt-get install`.

## Workshop Header

- Credit the original stream/video at the top with a link.
- If the new video is not yet available, put "TBA" with a sign-up link
  (e.g., Luma).
- Brief description of what we'll build and prerequisites.

## Workshop Flow Template

1. Introduce the first component (message broker, database, etc.)
2. Set up with docker-compose (explain parameters)
3. Create a simple producer/writer
4. Create a simple consumer/reader
5. Add a database, save data
6. Show limitations of the simple approach
7. Introduce the framework (Flink, Spark, etc.)
8. Reproduce the simple case with the framework
9. Do something the simple approach can't (aggregation, windowing)
10. Explain advanced concepts (window types, offsets, etc.)
11. Cleanup
12. Q&A - questions and answers from the original stream. Include production
    deployment topics here rather than as standalone sections.