Repository: DataTalksClub/data-engineering-zoomcamp Branch: main Commit: ef44b885b9bb Files: 381 Total size: 12.5 MB Directory structure: gitextract_ffam8vtv/ ├── .github/ │ └── FUNDING.yml ├── .gitignore ├── 01-docker-terraform/ │ ├── README.md │ ├── docker-sql/ │ │ ├── 01-introduction.md │ │ ├── 02-virtual-environment.md │ │ ├── 03-dockerizing-pipeline.md │ │ ├── 04-postgres-docker.md │ │ ├── 05-data-ingestion.md │ │ ├── 06-ingestion-script.md │ │ ├── 07-pgadmin.md │ │ ├── 08-dockerizing-ingestion.md │ │ ├── 09-docker-compose.md │ │ ├── 10-sql-refresher.md │ │ ├── 11-cleanup.md │ │ ├── README.md │ │ └── pipeline/ │ │ ├── .python-version │ │ ├── Dockerfile │ │ ├── docker-compose.yaml │ │ ├── docker-helper-scripts/ │ │ │ ├── docker-ingest.sh │ │ │ ├── docker-pgadmin.sh │ │ │ └── docker-postgres.sh │ │ ├── ingest_data.py │ │ └── pyproject.toml │ └── terraform/ │ ├── 1_terraform_overview.md │ ├── 2_gcp_overview.md │ ├── README.md │ ├── terraform/ │ │ ├── README.md │ │ ├── terraform_basic/ │ │ │ └── main.tf │ │ ├── terraform_with_variable_AWS/ │ │ │ ├── README.md │ │ │ ├── main.tf │ │ │ ├── terraform.tfvars │ │ │ └── variables.tf │ │ └── terraform_with_variables/ │ │ ├── main.tf │ │ └── variables.tf │ └── windows.md ├── 02-workflow-orchestration/ │ ├── README.md │ ├── docker-compose.yml │ └── flows/ │ ├── 01_hello_world.yaml │ ├── 02_python.yaml │ ├── 03_getting_started_data_pipeline.yaml │ ├── 04_postgres_taxi.yaml │ ├── 05_postgres_taxi_scheduled.yaml │ ├── 06_gcp_kv.yaml │ ├── 07_gcp_setup.yaml │ ├── 08_gcp_taxi.yaml │ ├── 09_gcp_taxi_scheduled.yaml │ ├── 10_chat_without_rag.yaml │ └── 11_chat_with_rag.yaml ├── 03-data-warehouse/ │ ├── README.md │ ├── big_query.sql │ ├── big_query_hw.sql │ ├── big_query_ml.sql │ ├── extract_model.md │ └── extras/ │ ├── .env-example │ ├── .gitignore │ ├── README.md │ ├── pyproject.toml │ ├── web_to_gcs.py │ └── web_to_gcs_with_progress_bar.py ├── 04-analytics-engineering/ │ ├── README.md │ ├── class_notes/ │ │ ├── 
4_1_1_analytics_engineering_basics.md │ │ ├── 4_1_2_what_is_dbt.md │ │ ├── 4_2_1_dbt_core_vs_dbt_cloud.md │ │ ├── 4_3_1_dbt_project_structure.md │ │ ├── 4_3_2_dbt_sources.md │ │ ├── 4_4_1_dbt_models.md │ │ ├── 4_4_2_dbt_seeds_and_macros.md │ │ ├── 4_5_1_documentation.md │ │ ├── 4_5_2_dbt_tests.md │ │ ├── 4_5_3_dbt_packages.md │ │ └── 4_6_1_dbt_commands.md │ ├── refreshers/ │ │ └── SQL.md │ ├── setup/ │ │ ├── cloud_setup.md │ │ ├── duckdb_troubleshooting.md │ │ └── local_setup.md │ └── taxi_rides_ny/ │ ├── .gitignore │ ├── dbt_project.yml │ ├── macros/ │ │ ├── get_trip_duration_minutes.sql │ │ ├── get_vendor_data.sql │ │ ├── macros_properties.yml │ │ └── safe_cast.sql │ ├── models/ │ │ ├── intermediate/ │ │ │ ├── int_trips.sql │ │ │ ├── int_trips_unioned.sql │ │ │ └── schema.yml │ │ ├── marts/ │ │ │ ├── dim_vendors.sql │ │ │ ├── dim_zones.sql │ │ │ ├── fct_trips.sql │ │ │ ├── reporting/ │ │ │ │ ├── fct_monthly_zone_revenue.sql │ │ │ │ └── schema.yml │ │ │ └── schema.yml │ │ └── staging/ │ │ ├── schema.yml │ │ ├── sources.yml │ │ ├── stg_green_tripdata.sql │ │ └── stg_yellow_tripdata.sql │ ├── package-lock.yml │ ├── packages.yml │ ├── seeds/ │ │ └── seeds_properties.yml │ ├── snapshots/ │ │ └── .gitkeep │ └── tests/ │ └── .gitkeep ├── 05-data-platforms/ │ ├── README.md │ └── notes/ │ ├── 01-introduction.md │ ├── 02-getting-started.md │ ├── 03-nyc-taxi-pipeline.md │ ├── 04-bruin-mcp.md │ ├── 05-bruin-cloud.md │ ├── 06-core-01-projects.md │ ├── 06-core-02-pipelines.md │ ├── 06-core-03-assets.md │ ├── 06-core-04-variables.md │ └── 06-core-05-commands.md ├── 06-batch/ │ ├── .gitignore │ ├── README.md │ ├── code/ │ │ ├── 03_test.ipynb │ │ ├── 04_pyspark.ipynb │ │ ├── 05_taxi_schema.ipynb │ │ ├── 06_spark_sql.ipynb │ │ ├── 06_spark_sql.py │ │ ├── 06_spark_sql_big_query.py │ │ ├── 07_groupby_join.ipynb │ │ ├── 08_rdds.ipynb │ │ ├── 09_spark_gcs.ipynb │ │ ├── cloud.md │ │ ├── download_data.sh │ │ └── homework.ipynb │ └── setup/ │ ├── config/ │ │ ├── core-site.xml │ │ ├── 
spark-defaults.conf │ │ └── spark.dockerfile │ ├── hadoop-yarn.md │ ├── linux.md │ ├── macos.md │ └── windows.md ├── 07-streaming/ │ ├── .gitignore │ ├── README.md │ ├── extras/ │ │ ├── README.md │ │ ├── ksqldb/ │ │ │ └── commands.md │ │ ├── pyflink/ │ │ │ ├── .gitignore │ │ │ ├── Dockerfile.flink │ │ │ ├── LICENSE │ │ │ ├── Makefile │ │ │ ├── README.md │ │ │ ├── docker-compose.yml │ │ │ ├── homework.md │ │ │ ├── requirements.txt │ │ │ └── src/ │ │ │ ├── job/ │ │ │ │ ├── aggregation_job.py │ │ │ │ ├── start_job.py │ │ │ │ └── taxi_job.py │ │ │ └── producers/ │ │ │ ├── load_taxi_data.py │ │ │ └── producer.py │ │ └── python/ │ │ ├── README.md │ │ ├── avro_example/ │ │ │ ├── consumer.py │ │ │ ├── producer.py │ │ │ ├── ride_record.py │ │ │ ├── ride_record_key.py │ │ │ └── settings.py │ │ ├── docker/ │ │ │ ├── README.md │ │ │ ├── docker-compose.yml │ │ │ ├── kafka/ │ │ │ │ └── docker-compose.yml │ │ │ └── spark/ │ │ │ ├── build.sh │ │ │ ├── cluster-base.Dockerfile │ │ │ ├── docker-compose.yml │ │ │ ├── jupyterlab.Dockerfile │ │ │ ├── spark-base.Dockerfile │ │ │ ├── spark-master.Dockerfile │ │ │ └── spark-worker.Dockerfile │ │ ├── json_example/ │ │ │ ├── consumer.py │ │ │ ├── producer.py │ │ │ ├── ride.py │ │ │ └── settings.py │ │ ├── redpanda_example/ │ │ │ ├── README.md │ │ │ ├── consumer.py │ │ │ ├── docker-compose.yaml │ │ │ ├── producer.py │ │ │ ├── ride.py │ │ │ └── settings.py │ │ ├── requirements.txt │ │ ├── resources/ │ │ │ └── schemas/ │ │ │ ├── taxi_ride_key.avsc │ │ │ └── taxi_ride_value.avsc │ │ └── streams-example/ │ │ ├── faust/ │ │ │ ├── branch_price.py │ │ │ ├── producer_taxi_json.py │ │ │ ├── stream.py │ │ │ ├── stream_count_vendor_trips.py │ │ │ ├── taxi_rides.py │ │ │ └── windowing.py │ │ ├── pyspark/ │ │ │ ├── README.md │ │ │ ├── consumer.py │ │ │ ├── producer.py │ │ │ ├── settings.py │ │ │ ├── spark-submit.sh │ │ │ ├── streaming-notebook.ipynb │ │ │ └── streaming.py │ │ └── redpanda/ │ │ ├── README.md │ │ ├── consumer.py │ │ ├── docker-compose.yaml 
│ │ ├── producer.py │ │ ├── settings.py │ │ ├── spark-submit.sh │ │ ├── streaming-notebook.ipynb │ │ └── streaming.py │ ├── theory/ │ │ ├── README.md │ │ └── java/ │ │ └── kafka_examples/ │ │ ├── .gitignore │ │ ├── build/ │ │ │ └── generated-main-avro-java/ │ │ │ └── schemaregistry/ │ │ │ ├── RideRecord.java │ │ │ ├── RideRecordCompatible.java │ │ │ └── RideRecordNoneCompatible.java │ │ ├── build.gradle │ │ ├── gradle/ │ │ │ └── wrapper/ │ │ │ ├── gradle-wrapper.jar │ │ │ └── gradle-wrapper.properties │ │ ├── gradlew │ │ ├── gradlew.bat │ │ ├── settings.gradle │ │ └── src/ │ │ ├── main/ │ │ │ ├── avro/ │ │ │ │ ├── rides.avsc │ │ │ │ ├── rides_compatible.avsc │ │ │ │ └── rides_non_compatible.avsc │ │ │ └── java/ │ │ │ └── org/ │ │ │ └── example/ │ │ │ ├── AvroProducer.java │ │ │ ├── JsonConsumer.java │ │ │ ├── JsonKStream.java │ │ │ ├── JsonKStreamJoins.java │ │ │ ├── JsonKStreamWindow.java │ │ │ ├── JsonProducer.java │ │ │ ├── JsonProducerPickupLocation.java │ │ │ ├── Secrets.java │ │ │ ├── Topics.java │ │ │ ├── customserdes/ │ │ │ │ └── CustomSerdes.java │ │ │ └── data/ │ │ │ ├── PickupLocation.java │ │ │ ├── Ride.java │ │ │ └── VendorInfo.java │ │ └── test/ │ │ └── java/ │ │ └── org/ │ │ └── example/ │ │ ├── JsonKStreamJoinsTest.java │ │ ├── JsonKStreamTest.java │ │ └── helper/ │ │ └── DataGeneratorHelper.java │ └── workshop/ │ ├── .python-version │ ├── Dockerfile.flink │ ├── Dockerfile_ARM64.flink │ ├── Makefile │ ├── README.md │ ├── docker-compose.yml │ ├── flink-config.yaml │ ├── live/ │ │ ├── .gitignore │ │ ├── .python-version │ │ ├── Dockerfile.flink │ │ ├── README.md │ │ ├── docker-compose.yaml │ │ ├── flink-config.yaml │ │ ├── main.py │ │ ├── notebooks/ │ │ │ ├── consumer_db.ipynb │ │ │ ├── models.py │ │ │ └── producer.ipynb │ │ ├── pyproject.flink.toml │ │ ├── pyproject.toml │ │ └── src/ │ │ ├── job/ │ │ │ ├── aggregation_job.py │ │ │ └── pass_through_job.py │ │ └── producers/ │ │ ├── models.py │ │ └── producer_realtime.py │ ├── pyproject.flink.toml │ ├── 
pyproject.toml │ └── src/ │ ├── consumers/ │ │ ├── consumer.py │ │ └── consumer_postgres.py │ ├── job/ │ │ ├── aggregation_job.py │ │ ├── aggregation_job_demo.py │ │ └── pass_through_job.py │ ├── models.py │ └── producers/ │ ├── producer.py │ └── producer_realtime.py ├── README.md ├── after-sign-up.md ├── asking-questions.md ├── awesome-data-engineering.md ├── certificates.md ├── cohorts/ │ ├── 2022/ │ │ ├── README.md │ │ ├── project.md │ │ ├── week_1_basics_n_setup/ │ │ │ └── homework.md │ │ ├── week_2_data_ingestion/ │ │ │ ├── README.md │ │ │ ├── airflow/ │ │ │ │ ├── .env_example │ │ │ │ ├── 1_setup_official.md │ │ │ │ ├── 2_setup_nofrills.md │ │ │ │ ├── Dockerfile │ │ │ │ ├── README.md │ │ │ │ ├── dags/ │ │ │ │ │ └── data_ingestion_gcs_dag.py │ │ │ │ ├── dags_local/ │ │ │ │ │ ├── data_ingestion_local.py │ │ │ │ │ └── ingest_script.py │ │ │ │ ├── docker-compose-nofrills.yml │ │ │ │ ├── docker-compose.yaml │ │ │ │ ├── docker-compose_2.3.4.yaml │ │ │ │ ├── docs/ │ │ │ │ │ └── 1_concepts.md │ │ │ │ ├── extras/ │ │ │ │ │ ├── data_ingestion_gcs_dag_ex2.py │ │ │ │ │ └── web_to_gcs.sh │ │ │ │ ├── requirements.txt │ │ │ │ └── scripts/ │ │ │ │ └── entrypoint.sh │ │ │ ├── homework/ │ │ │ │ ├── homework.md │ │ │ │ └── solution.py │ │ │ └── transfer_service/ │ │ │ └── README.md │ │ ├── week_3_data_warehouse/ │ │ │ └── airflow/ │ │ │ ├── .env_example │ │ │ ├── 1_setup_official.md │ │ │ ├── 2_setup_nofrills.md │ │ │ ├── README.md │ │ │ ├── dags/ │ │ │ │ └── gcs_to_bq_dag.py │ │ │ ├── docker-compose-nofrills.yml │ │ │ ├── docker-compose.yaml │ │ │ └── scripts/ │ │ │ └── entrypoint.sh │ │ ├── week_5_batch_processing/ │ │ │ └── homework.md │ │ └── week_6_stream_processing/ │ │ └── homework.md │ ├── 2023/ │ │ ├── README.md │ │ ├── leaderboard.md │ │ ├── project.md │ │ ├── week_1_docker_sql/ │ │ │ └── homework.md │ │ ├── week_1_terraform/ │ │ │ └── homework.md │ │ ├── week_2_workflow_orchestration/ │ │ │ ├── README.md │ │ │ └── homework.md │ │ ├── week_3_data_warehouse/ │ │ │ └── 
homework.md │ │ ├── week_4_analytics_engineering/ │ │ │ └── homework.md │ │ ├── week_5_batch_processing/ │ │ │ └── homework.md │ │ ├── week_6_stream_processing/ │ │ │ ├── client.properties │ │ │ ├── homework.md │ │ │ ├── producer_confluent.py │ │ │ ├── settings.py │ │ │ ├── spark-submit.sh │ │ │ └── streaming_confluent.py │ │ └── workshops/ │ │ └── piperider.md │ ├── 2024/ │ │ ├── 01-docker-terraform/ │ │ │ ├── homework.md │ │ │ └── solutions.md │ │ ├── 02-workflow-orchestration/ │ │ │ ├── README.md │ │ │ └── homework.md │ │ ├── 03-data-warehouse/ │ │ │ └── homework.md │ │ ├── 04-analytics-engineering/ │ │ │ └── homework.md │ │ ├── 05-batch/ │ │ │ └── homework.md │ │ ├── 06-streaming/ │ │ │ ├── docker-compose.yml │ │ │ └── homework.md │ │ ├── README.md │ │ ├── leaderboard.md │ │ ├── project.md │ │ └── workshops/ │ │ ├── dlt.md │ │ ├── dlt_resources/ │ │ │ ├── data_ingestion_workshop.md │ │ │ ├── homework_solution.ipynb │ │ │ ├── homework_starter.ipynb │ │ │ └── workshop.ipynb │ │ └── rising-wave.md │ ├── 2025/ │ │ ├── 01-docker-terraform/ │ │ │ └── homework.md │ │ ├── 02-workflow-orchestration/ │ │ │ ├── README.md │ │ │ ├── flows/ │ │ │ │ ├── 01_getting_started_data_pipeline.yaml │ │ │ │ ├── 02_postgres_taxi.yaml │ │ │ │ ├── 02_postgres_taxi_scheduled.yaml │ │ │ │ ├── 03_postgres_dbt.yaml │ │ │ │ ├── 04_gcp_kv.yaml │ │ │ │ ├── 05_gcp_setup.yaml │ │ │ │ ├── 06_gcp_taxi.yaml │ │ │ │ ├── 06_gcp_taxi_scheduled.yaml │ │ │ │ └── 07_gcp_dbt.yaml │ │ │ └── homework.md │ │ ├── 03-data-warehouse/ │ │ │ ├── DLT_upload_to_GCP.ipynb │ │ │ ├── homework.md │ │ │ └── load_yellow_taxi_data.py │ │ ├── 04-analytics-engineering/ │ │ │ └── homework.md │ │ ├── 05-batch/ │ │ │ └── homework.md │ │ ├── 06-streaming/ │ │ │ ├── homework/ │ │ │ │ └── homework.ipynb │ │ │ └── homework.md │ │ ├── README.md │ │ ├── project.md │ │ └── workshops/ │ │ ├── dlt/ │ │ │ ├── README.md │ │ │ ├── data_ingestion_workshop.md │ │ │ └── dlt_homework.md │ │ └── dynamic_load_dlt.py │ └── 2026/ │ ├── 
01-docker-terraform/ │ │ └── homework.md │ ├── 02-workflow-orchestration/ │ │ └── homework.md │ ├── 03-data-warehouse/ │ │ ├── DLT_upload_to_GCP.ipynb │ │ ├── homework.md │ │ └── load_yellow_taxi_data.py │ ├── 04-analytics-engineering/ │ │ └── homework.md │ ├── 05-data-platforms/ │ │ └── homework.md │ ├── 06-batch/ │ │ └── homework.md │ ├── 07-streaming/ │ │ └── homework.md │ ├── README.md │ ├── project.md │ └── workshops/ │ ├── dlt/ │ │ ├── README.md │ │ ├── analysis.py │ │ ├── dlt_Pipeline_Overview.ipynb │ │ ├── dlt_homework.md │ │ ├── open_library_pipeline.py │ │ └── pyproject.toml │ └── dlt.md ├── learning-in-public.md ├── projects/ │ ├── README.md │ └── datasets.md └── workshop-best-practices.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/FUNDING.yml ================================================ github: alexeygrigorev ================================================ FILE: .gitignore ================================================ .DS_Store .idea *.tfstate *.tfstate.* **.terraform **.terraform.lock.* **google_credentials.json **logs/ **.env **__pycache__/ .history **/ny_taxi_postgres_data/* serving_dir .ipynb_checkpoints/ !week_6_stream_processing/avro_example/data/rides.csv *.parquet *.csv *.duckdb ================================================ FILE: 01-docker-terraform/README.md ================================================ # Introduction [![](https://markdown-videos-api.jorgenkh.no/youtube/JgspdlKXS-w)](https://www.youtube.com/watch?v=JgspdlKXS-w) We suggest watching videos in the same order as in this document. 
# Docker + Postgres ## Workshop [![](https://markdown-videos-api.jorgenkh.no/youtube/lP8xXebHmuE)](https://youtu.be/lP8xXebHmuE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10) * Video: https://www.youtube.com/watch?v=lP8xXebHmuE * Follow the instructions here: [docker-sql/](docker-sql/) ## :movie_camera: SQL refresher [![](https://markdown-videos-api.jorgenkh.no/youtube/QEcps_iskgg)](https://youtu.be/QEcps_iskgg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10) * Video: https://www.youtube.com/watch?v=QEcps_iskgg * SQL queries: [10-sql-refresher.md](docker-sql/10-sql-refresher.md) # GCP ## :movie_camera: Introduction to GCP (Google Cloud Platform) [![](https://markdown-videos-api.jorgenkh.no/youtube/18jIzE41fJ4)](https://youtu.be/18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=3) # Terraform [Code and notes](terraform/) ## :movie_camera: Introduction Terraform: Concepts and Overview, a primer [![](https://markdown-videos-api.jorgenkh.no/youtube/s2bOYDCKl_M)](https://youtu.be/s2bOYDCKl_M&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=11) ## :movie_camera: Terraform Basics: Simple one file Terraform Deployment [![](https://markdown-videos-api.jorgenkh.no/youtube/Y2ux7gq3Z0o)](https://youtu.be/Y2ux7gq3Z0o&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=12) ## :movie_camera: Deployment with a Variables File [![](https://markdown-videos-api.jorgenkh.no/youtube/PBi0hHjLftk)](https://youtu.be/PBi0hHjLftk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=13) ## Configuring terraform and GCP SDK on Windows * [Instructions](terraform/windows.md) # Homework * [Homework](../cohorts/2026/01-docker-terraform/homework.md) # Community notes
Did you take notes? You can share them here * [Notes from Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/1_intro.md) * [Notes from Abd](https://itnadigital.notion.site/Week-1-Introduction-f18de7e69eb4453594175d0b1334b2f4) * [Notes from Aaron](https://github.com/ABZ-Aaron/DataEngineerZoomCamp/blob/master/week_1_basics_n_setup/README.md) * [Notes from Faisal](https://github.com/FaisalMohd/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/Notes/DE%20Zoomcamp%20Week-1.pdf) * [Michael Harty's Notes](https://github.com/mharty3/data_engineering_zoomcamp_2022/tree/main/week01) * [Blog post from Isaac Kargar](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/18/data-engineering-w1.html) * [Handwritten Notes By Mahmoud Zaher](https://github.com/zaherweb/DataEngineering/blob/master/week%201.pdf) * [Notes from Candace Williams](https://teacherc.github.io/data-engineering/2023/01/18/zoomcamp1.html) * [Notes from Marcos Torregrosa](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-1/) * [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd) * [Notes from Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week1) * [Notes from froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_1_basics_n_setup/notes/notes_week_01.md) * [Notes from adamiaonr](https://github.com/adamiaonr/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/2_docker_sql/NOTES.md) * [Notes from Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/01/week-1-data-engineering-zoomcamp-notes/) * [Notes from Balaji](https://github.com/Balajirvp/DE-Zoomcamp/blob/main/Week%201/Detailed%20Week%201%20Notes.ipynb) * [Notes from Erik](https://twitter.com/ehub96/status/1621351266281730049) * [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week1.md) * Notes on [Docker, Docker Compose, and setting up a proper Python 
environment](https://medium.com/@verazabeida/zoomcamp-2023-week-1-f4f94cb360ae), by Vera * [Setting up the development environment on Google Virtual Machine](https://itsadityagupta.hashnode.dev/setting-up-the-development-environment-on-google-virtual-machine), blog post by Aditya Gupta * [Notes from Zharko Cekovski](https://www.zharconsulting.com/contents/data/data-engineering-bootcamp-2024/week-1-postgres-docker-and-ingestion-scripts/) * [2024 Module-01 Walkthough video by ellacharmed on youtube](https://youtu.be/VUZshlVAnk4) * [2024 Companion Module Walkthough slides by ellacharmed](https://github.com/ellacharmed/data-engineering-zoomcamp/blob/ella2024/cohorts/2024/01-docker-terraform/walkthrough-01.pdf) * [2024 Module-01 Environment setup video by ellacharmed on youtube](https://youtu.be/Zce_Hd37NGs) * [Docker Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/1a-docker_sql/readme.md) • [Terraform Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/1b-terraform_gcp/readme.md) * [Notes from Hammad Tariq](https://github.com/hamad-tariq/HammadTariq-ZoomCamp2024/blob/9c8b4908416eb8cade3d7ec220e7664c003e9b11/week_1_basics_n_setup/README.md) * [Hung's Notes](https://hung.bearblog.dev/docker/) & [Docker Cheatsheet](https://github.com/HangenYuu/docker-cheatsheet) * [Kemal's Notes](https://github.com/kemaldahha/data-engineering-course/blob/main/week_1_notes.md) * [Notes from Manuel Guerra (Windows+WSL2 Environment)](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/1_Containerization-and-Infrastructure-as-Code/README.md) * [Notes from Horeb SEIDOU](https://spotted-hardhat-eea.notion.site/Week-1-Containerization-and-Infrastructure-as-Code-15729780dc4a80a08288e497ba937a37) * [2025 Gitbook Notes from Tinker0425](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/introduction/introduction-and-set-up) * [Alex's Docker 
Notes](https://github.com/alexg9010/2025_data_engineering_zoomcamp/blob/master/01_docker/README.md) | [Alex's Terraform Notes](https://github.com/alexg9010/2025_data_engineering_zoomcamp/blob/master/01_3_terraform/README.md) * [2025 SQL Refresher - Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/01_docker_postgress/0_sql_refresh.ipynb) * [2025 Setting up the Environment - Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/01_docker_postgress/_setting_up.md) * [Notes from Mercy Markus: Linux/Fedora Tweaks and Tips](https://mercymarkus.com/posts/2025/series/dtc-dez-jan-2025/dtc-dez-2025-module-1/) * [[2026 tutorial video - Khanh Nguyen] Setting up the environment for homework-w1](https://youtu.be/_iqCWi_UoOc) * Add your notes above this line
================================================
FILE: 01-docker-terraform/docker-sql/01-introduction.md
================================================
# Introduction to Docker

**[↑ Up](README.md)** | **[← Previous](README.md)** | **[Next →](02-virtual-environment.md)**

Docker is _containerization software_ that lets us isolate software in a way similar to virtual machines, but much leaner.

A Docker image is a _snapshot_ of a container that we can define to run our software, or in this case our data pipelines. By exporting our Docker images to cloud providers such as Amazon Web Services or Google Cloud Platform, we can run our containers there.

## Why Docker?

Docker provides the following advantages:

- Reproducibility: Same environment everywhere
- Isolation: Applications run independently
- Portability: Run anywhere Docker is installed

Containers are used in many situations:

- Integration tests: CI/CD pipelines
- Running pipelines on the cloud: AWS Batch, Kubernetes jobs
- Spark: Analytics engine for large-scale data processing
- Serverless: AWS Lambda, Google Functions

## Basic Docker Commands

Check Docker version:

```bash
docker --version
```

Run a simple container:

```bash
docker run hello-world
```

Run something more complex:

```bash
docker run ubuntu
```

Nothing happens. We need to run it in interactive mode with `-it`:

```bash
docker run -it ubuntu
```

We don't have `python` there, so let's install it:

```bash
apt update && apt install python3
python3 -V
```

## Stateless Containers

Important: Docker containers are stateless - any changes made inside a container will NOT be saved when the container is killed and started again.

When you exit the container and run the image again, the changes are gone:

```bash
docker run -it ubuntu
python3 -V
```

This is good, because it doesn't affect your host system. Let's say you do something crazy like this:

```bash
docker run -it ubuntu
rm -rf / # don't run it on your computer!
```

Next time we run it, all the files are back.
## Managing Containers

But this is not _completely_ correct. The state is saved somewhere. We can see stopped containers:

```bash
docker ps -a
```

We can restart one of them, but we won't do it, because it's not a good practice. They take space, so let's delete them:

```bash
docker rm $(docker ps -aq)
```

Next time we run something, we add `--rm`:

```bash
docker run -it --rm ubuntu
```

## Different Base Images

There are other base images besides `hello-world` and `ubuntu`. For example, Python:

```bash
docker run -it --rm python:3.9.16
# add -slim to get a smaller version
```

This one starts `python`. If we want bash, we need to override the `entrypoint`:

```bash
docker run -it \
    --rm \
    --entrypoint=bash \
    python:3.9.16-slim
```

## Volumes

So, we know that with Docker we can restore any container to its initial state in a reproducible manner. But what about data? A common way to persist data and share it with the host is with _volumes_.

Let's create some data in `test`:

```bash
mkdir test
cd test
touch file1.txt file2.txt file3.txt
echo "Hello from host" > file1.txt
cd ..
```

Now let's create a simple script `test/list_files.py` that shows the files in the folder:

```python
from pathlib import Path

current_dir = Path.cwd()
current_file = Path(__file__).name

print(f"Files in {current_dir}:")

for filepath in current_dir.iterdir():
    if filepath.name == current_file:
        continue

    print(f" - {filepath.name}")

    if filepath.is_file():
        content = filepath.read_text(encoding='utf-8')
        print(f" Content: {content}")
```

Now let's map this to a Python container:

```bash
docker run -it \
    --rm \
    -v $(pwd)/test:/app/test \
    --entrypoint=bash \
    python:3.9.16-slim
```

Inside the container, run:

```bash
cd /app/test
ls -la
cat file1.txt
python list_files.py
```

You'll see the files from your host machine are accessible in the container!
**[↑ Up](README.md)** | **[← Previous](README.md)** | **[Next →](02-virtual-environment.md)**

================================================
FILE: 01-docker-terraform/docker-sql/02-virtual-environment.md
================================================
# Virtual Environments and Data Pipelines

**[↑ Up](README.md)** | **[← Previous](01-introduction.md)** | **[Next →](03-dockerizing-pipeline.md)**

A **data pipeline** is a service that receives data as input and outputs more data. For example, reading a CSV file, transforming the data somehow and storing it as a table in a PostgreSQL database.

```mermaid
graph LR
    A[CSV File] --> B[Data Pipeline]
    B --> C[Parquet File]
    B --> D[PostgreSQL Database]
    B --> E[Data Warehouse]
    style B fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff
```

In this workshop, we'll build pipelines that:

- Download CSV data from the web
- Transform and clean the data with pandas
- Load it into PostgreSQL for querying
- Process data in chunks to handle large files

## Creating a Simple Pipeline

Let's create an example pipeline. First, create a directory `pipeline` and inside, create a file `pipeline.py`:

```python
import sys
print("arguments", sys.argv)

day = int(sys.argv[1])
print(f"Running pipeline for day {day}")
```

Now let's add pandas (append this to `pipeline.py`, since it reuses `sys` from above):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
print(df.head())

df.to_parquet(f"output_day_{sys.argv[1]}.parquet")
```

## Why Virtual Environments?

We need pandas, but we don't have it. We want to test the script locally before we run it in a container.

We can install it with `pip`:

```bash
pip install pandas pyarrow
```

But this installs it globally on your system. This can cause conflicts if different projects need different versions of the same package.

Instead, we want to use a **virtual environment** - an isolated Python environment that keeps dependencies for this project separate from other projects and from your system Python.
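
To make the isolation concrete, here is a minimal sketch (not part of the workshop code) that reports whether the running interpreter belongs to a virtual environment. It relies only on the standard library: inside a venv-style environment, `sys.prefix` points at the environment while `sys.base_prefix` points at the base installation. The helper name `in_virtualenv` is our own.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix is the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

Running this with the system Python and again from inside an activated environment should show different interpreter paths.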
## Using uv - Modern Python Package Manager

We'll use `uv` - a modern, fast Python package and project manager written in Rust. It's much faster than pip and handles virtual environments automatically.

```bash
pip install uv
```

Now initialize a Python project with uv:

```bash
uv init --python=3.13
```

This creates a `pyproject.toml` file for managing dependencies and a `.python-version` file.

### Comparing Python Versions

```bash
uv run which python  # Python in the virtual environment
uv run python -V

which python  # System Python
python -V
```

You'll see they're different - `uv run` uses the isolated environment.

### Adding Dependencies

Now let's add pandas:

```bash
uv add pandas pyarrow
```

This adds pandas and pyarrow to your `pyproject.toml` and installs them in the virtual environment.

### Running the Pipeline

Now we can execute the file:

```bash
uv run python pipeline.py 10
```

We will see output like this, followed by the printed DataFrame:

* `arguments ['pipeline.py', '10']`
* `Running pipeline for day 10`

## Git Configuration

This script produces a binary (parquet) file, so let's make sure we don't accidentally commit it to git by adding the parquet extension to `.gitignore`:

```
*.parquet
```

**[↑ Up](README.md)** | **[← Previous](01-introduction.md)** | **[Next →](03-dockerizing-pipeline.md)**

================================================
FILE: 01-docker-terraform/docker-sql/03-dockerizing-pipeline.md
================================================
# Dockerizing the Pipeline

**[↑ Up](README.md)** | **[← Previous](02-virtual-environment.md)** | **[Next →](04-postgres-docker.md)**

Now let's containerize the script. Create the following `Dockerfile` file:

## Simple Dockerfile with pip

```dockerfile
# base Docker image that we will build on
FROM python:3.13.11-slim

# set up our image by installing prerequisites; pandas in this case
RUN pip install pandas pyarrow

# set up the working directory inside the container
WORKDIR /app

# copy the script to the container. 1st name is source file, 2nd is destination
COPY pipeline.py pipeline.py

# define what to do first when the container runs
# in this example, we will just run the script
ENTRYPOINT ["python", "pipeline.py"]
```

**Explanation:**

- `FROM`: Base image (Python 3.13)
- `RUN`: Execute commands during build
- `WORKDIR`: Set working directory
- `COPY`: Copy files into the image
- `ENTRYPOINT`: Default command to run

### Build and Run

Let's build the image:

```bash
docker build -t test:pandas .
```

* The image name will be `test` and its tag will be `pandas`. If the tag isn't specified it will default to `latest`.

We can now run the container and pass an argument to it, so that our pipeline will receive it:

```bash
docker run -it test:pandas some_number
```

Here `some_number` is a placeholder for an integer such as `10`, since the script converts its argument with `int()`. You should get the same output you did when you ran the pipeline script by itself.

> Note: these instructions assume that `pipeline.py` and `Dockerfile` are in the same directory. The Docker commands should also be run from the same directory as these files.

## Dockerfile with uv

What about uv?
Let's use it instead of using pip:

```dockerfile
# Start with slim Python 3.13 image
FROM python:3.13.10-slim

# Copy uv binary from official uv image (multi-stage build pattern)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

# Set working directory
WORKDIR /app

# Add virtual environment to PATH so we can use installed packages
ENV PATH="/app/.venv/bin:$PATH"

# Copy dependency files first (better layer caching)
COPY "pyproject.toml" "uv.lock" ".python-version" ./

# Install dependencies from lock file (ensures reproducible builds)
RUN uv sync --locked

# Copy application code
COPY pipeline.py pipeline.py

# Set entry point
ENTRYPOINT ["uv", "run", "python", "pipeline.py"]
```

**[↑ Up](README.md)** | **[← Previous](02-virtual-environment.md)** | **[Next →](04-postgres-docker.md)**

================================================
FILE: 01-docker-terraform/docker-sql/04-postgres-docker.md
================================================
# Running PostgreSQL with Docker

**[↑ Up](README.md)** | **[← Previous](03-dockerizing-pipeline.md)** | **[Next →](05-data-ingestion.md)**

Now we want to do real data engineering. Let's use a Postgres database for that.

You can run a containerized version of Postgres that doesn't require any installation steps. You only need to provide a few _environment variables_ to it as well as a _volume_ for storing data.

## Running PostgreSQL in a Container

Create a folder anywhere you'd like for Postgres to store data in. We will use the example folder `ny_taxi_postgres_data`.
Here's how to run the container:

```bash
docker run -it --rm \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  postgres:18
```

### Explanation of Parameters

* `-e` sets environment variables (user, password, database name)
* `-v ny_taxi_postgres_data:/var/lib/postgresql` creates a **named volume**
  * Docker manages this volume automatically
  * Data persists even after container is removed
  * Volume is stored in Docker's internal storage
* `-p 5432:5432` maps port 5432 from container to host
* `postgres:18` uses PostgreSQL version 18 (latest as of Dec 2025)

### Alternative Approach - Bind Mount

First create the directory, then map it:

```bash
mkdir ny_taxi_postgres_data

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql \
  -p 5432:5432 \
  postgres:18
```

### Named Volume vs Bind Mount

* **Named volume** (`name:/path`): Managed by Docker, easier
* **Bind mount** (`/host/path:/container/path`): Direct mapping to host filesystem, more control

## Connecting to PostgreSQL

Once the container is running, we can log into our database with [pgcli](https://www.pgcli.com/).

Install pgcli:

```bash
uv add --dev pgcli
```

The `--dev` flag marks this as a development dependency (not needed in production). It will be added to the `[dependency-groups]` section of `pyproject.toml` instead of the main `dependencies` section.

Now use it to connect to Postgres:

```bash
uv run pgcli -h localhost -p 5432 -u root -d ny_taxi
```

* `uv run` executes a command in the context of the virtual environment
* `-h` is the host. Since we're running locally we can use `localhost`.
* `-p` is the port.
* `-u` is the username.
* `-d` is the database name.
* The password is not provided; it will be requested after running the command.
When prompted, enter the password: `root` ## Basic SQL Commands Try some SQL commands: ```sql -- List tables \dt -- Create a test table CREATE TABLE test (id INTEGER, name VARCHAR(50)); -- Insert data INSERT INTO test VALUES (1, 'Hello Docker'); -- Query data SELECT * FROM test; -- Exit \q ``` **[↑ Up](README.md)** | **[← Previous](03-dockerizing-pipeline.md)** | **[Next →](05-data-ingestion.md)** ================================================ FILE: 01-docker-terraform/docker-sql/05-data-ingestion.md ================================================ # NY Taxi Dataset and Data Ingestion **[↑ Up](README.md)** | **[← Previous](04-postgres-docker.md)** | **[Next →](06-ingestion-script.md)** We will now create a Jupyter Notebook `notebook.ipynb` file which we will use to read a CSV file and export it to Postgres. ## Setting up Jupyter Install Jupyter: ```bash uv add --dev jupyter ``` Let's create a Jupyter notebook to explore the data: ```bash uv run jupyter notebook ``` ## The NYC Taxi Dataset We will use data from the [NYC TLC Trip Record Data website](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Specifically, we will use the [Yellow taxi trip records CSV file for January 2021](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz). This data used to be csv, but later they switched to parquet. We want to keep using CSV because we need to do a bit of extra pre-processing (for the purposes of learning it). A dictionary to understand each field is available [here](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf). > Note: The CSV data is stored as gzipped files. Pandas can read them directly. 
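
The note about gzipped files can be checked locally without downloading anything. The following is a minimal sketch, assuming pandas is installed: it writes a tiny gzipped CSV with made-up rows (the values are illustrative, not taken from the real dataset) and reads it back without any explicit decompression step.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Made-up sample rows, only for illustration.
sample_csv = "VendorID,trip_distance\n1,2.5\n2,0.8\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.csv.gz"

    # Write the CSV text through gzip so the file on disk is compressed.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(sample_csv)

    # compression="infer" is the default, so pandas picks gzip from the ".gz" suffix.
    df = pd.read_csv(path)

print(df.shape)
```

The same inference applies when the path is a URL ending in `.csv.gz`, which is why the later `pd.read_csv` calls need no extra arguments for decompression.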
## Explore the Data Create a new notebook and run: ```python import pandas as pd # Read a sample of the data prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/' df = pd.read_csv(prefix + 'yellow_tripdata_2021-01.csv.gz', nrows=100) # Display first rows df.head() # Check data types df.dtypes # Check data shape df.shape ``` ### Handling Data Types We have a warning: (Note that this warning might pop up later for some users, so it's best to follow the instructions below) ``` /tmp/ipykernel_25483/2933316018.py:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False. ``` So we need to specify the types: ```python dtype = { "VendorID": "Int64", "passenger_count": "Int64", "trip_distance": "float64", "RatecodeID": "Int64", "store_and_fwd_flag": "string", "PULocationID": "Int64", "DOLocationID": "Int64", "payment_type": "Int64", "fare_amount": "float64", "extra": "float64", "mta_tax": "float64", "tip_amount": "float64", "tolls_amount": "float64", "improvement_surcharge": "float64", "total_amount": "float64", "congestion_surcharge": "float64" } parse_dates = [ "tpep_pickup_datetime", "tpep_dropoff_datetime" ] df = pd.read_csv( prefix + 'yellow_tripdata_2021-01.csv.gz', nrows=100, dtype=dtype, parse_dates=parse_dates ) ``` ## Ingesting Data into Postgres In the Jupyter notebook, we create code to: 1. Download the CSV file 2. Read it in chunks with pandas 3. Convert datetime columns 4. 
Insert data into PostgreSQL using SQLAlchemy ### Install SQLAlchemy ```bash uv add sqlalchemy "psycopg[binary,pool]" ``` ### Create Database Connection ```python from sqlalchemy import create_engine engine = create_engine('postgresql+psycopg://root:root@localhost:5432/ny_taxi') ``` ### Get DDL Schema ```python print(pd.io.sql.get_schema(df, name='yellow_taxi_data', con=engine)) ``` Output: ```sql CREATE TABLE yellow_taxi_data ( "VendorID" BIGINT, tpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE, tpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE, passenger_count BIGINT, trip_distance FLOAT(53), "RatecodeID" BIGINT, store_and_fwd_flag TEXT, "PULocationID" BIGINT, "DOLocationID" BIGINT, payment_type BIGINT, fare_amount FLOAT(53), extra FLOAT(53), mta_tax FLOAT(53), tip_amount FLOAT(53), tolls_amount FLOAT(53), improvement_surcharge FLOAT(53), total_amount FLOAT(53), congestion_surcharge FLOAT(53) ) ``` ### Create the Table ```python df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace') ``` `head(n=0)` makes sure we only create the table, we don't add any data yet. ## Ingesting Data in Chunks We don't want to insert all the data at once. 
Let's do it in batches and use an iterator for that: ```python df_iter = pd.read_csv( prefix + 'yellow_tripdata_2021-01.csv.gz', dtype=dtype, parse_dates=parse_dates, iterator=True, chunksize=100000 ) ``` ### Iterate Over Chunks ```python for df_chunk in df_iter: print(len(df_chunk)) ``` ### Inserting Data ```python df_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append') ``` ### Complete Ingestion Loop ```python first = True for df_chunk in df_iter: if first: # Create table schema (no data) df_chunk.head(0).to_sql( name="yellow_taxi_data", con=engine, if_exists="replace" ) first = False print("Table created") # Insert chunk df_chunk.to_sql( name="yellow_taxi_data", con=engine, if_exists="append" ) print("Inserted:", len(df_chunk)) ``` ### Alternative Approach (Without First Flag) ```python first_chunk = next(df_iter) first_chunk.head(0).to_sql( name="yellow_taxi_data", con=engine, if_exists="replace" ) print("Table created") first_chunk.to_sql( name="yellow_taxi_data", con=engine, if_exists="append" ) print("Inserted first chunk:", len(first_chunk)) for df_chunk in df_iter: df_chunk.to_sql( name="yellow_taxi_data", con=engine, if_exists="append" ) print("Inserted chunk:", len(df_chunk)) ``` ## Adding Progress Bar Add `tqdm` to see progress: ```bash uv add tqdm ``` Put it around the iterable: ```python from tqdm.auto import tqdm for df_chunk in tqdm(df_iter): ... ``` To see progress in terms of total chunks, you would have to add the `total` argument to `tqdm(df_iter)`. In our scenario, the pragmatic way is to hardcode a value based on the number of entries in the table. ## Verify the Data Connect to it using pgcli: ```bash uv run pgcli -h localhost -p 5432 -u root -d ny_taxi ``` And explore the data. 
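
Once the row count is known (for example from a `SELECT COUNT(1)` while verifying the data), the `total` value for the progress bar described earlier can be derived instead of guessed. This is a small sketch; `expected_chunks` is a hypothetical helper and the row count below is a placeholder, not the real size of the dataset.

```python
import math


def expected_chunks(total_rows: int, chunksize: int) -> int:
    """Number of chunks a chunked reader yields for total_rows rows."""
    return math.ceil(total_rows / chunksize)


# Placeholder row count for illustration only; use the count from your own table.
total_rows = 1_250_000
chunksize = 100_000  # same chunk size as in the read_csv call

n_chunks = expected_chunks(total_rows, chunksize)
print(n_chunks)  # 13: twelve full chunks plus one partial chunk
```

The result can then be passed as `tqdm(df_iter, total=n_chunks)` so the bar shows progress towards a known end.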
**[↑ Up](README.md)** | **[← Previous](04-postgres-docker.md)** | **[Next →](06-ingestion-script.md)** ================================================ FILE: 01-docker-terraform/docker-sql/06-ingestion-script.md ================================================ # Creating the Data Ingestion Script **[↑ Up](README.md)** | **[← Previous](05-data-ingestion.md)** | **[Next →](07-pgadmin.md)** Now let's convert the notebook to a Python script. ## Convert Notebook to Script ```bash uv run jupyter nbconvert --to=script notebook.ipynb mv notebook.py ingest_data.py ``` ## The Complete Ingestion Script See the `pipeline/` directory for the complete script with click integration. Here's the core structure: ```python import pandas as pd from sqlalchemy import create_engine from tqdm.auto import tqdm dtype = { "VendorID": "Int64", "passenger_count": "Int64", "trip_distance": "float64", "RatecodeID": "Int64", "store_and_fwd_flag": "string", "PULocationID": "Int64", "DOLocationID": "Int64", "payment_type": "Int64", "fare_amount": "float64", "extra": "float64", "mta_tax": "float64", "tip_amount": "float64", "tolls_amount": "float64", "improvement_surcharge": "float64", "total_amount": "float64", "congestion_surcharge": "float64" } parse_dates = [ "tpep_pickup_datetime", "tpep_dropoff_datetime" ] ``` ## Click Integration The script uses `click` for command-line argument parsing: ```python import click @click.command() @click.option('--pg-user', default='root', help='PostgreSQL user') @click.option('--pg-pass', default='root', help='PostgreSQL password') @click.option('--pg-host', default='localhost', help='PostgreSQL host') @click.option('--pg-port', default=5432, type=int, help='PostgreSQL port') @click.option('--pg-db', default='ny_taxi', help='PostgreSQL database name') @click.option('--target-table', default='yellow_taxi_data', help='Target table name') def run(pg_user, pg_pass, pg_host, pg_port, pg_db, target_table): # Ingestion logic here pass ``` ## Running the Script The 
script reads data in chunks (100,000 rows at a time) to handle large files efficiently without running out of memory. Example usage: ```bash uv run python ingest_data.py \ --pg-user=root \ --pg-pass=root \ --pg-host=localhost \ --pg-port=5432 \ --pg-db=ny_taxi \ --target-table=yellow_taxi_trips ``` **[↑ Up](README.md)** | **[← Previous](05-data-ingestion.md)** | **[Next →](07-pgadmin.md)** ================================================ FILE: 01-docker-terraform/docker-sql/07-pgadmin.md ================================================ # pgAdmin - Database Management Tool **[↑ Up](README.md)** | **[← Previous](06-ingestion-script.md)** | **[Next →](08-dockerizing-ingestion.md)** `pgcli` is a handy tool but it's cumbersome to use for complex queries and database management. [`pgAdmin` is a web-based tool](https://www.pgadmin.org/) that makes it more convenient to access and manage our databases. It's possible to run pgAdmin as a container along with the Postgres container, but both containers will have to be in the same _virtual network_ so that they can find each other. ## Run pgAdmin Container ```bash docker run -it \ -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \ -e PGADMIN_DEFAULT_PASSWORD="root" \ -v pgadmin_data:/var/lib/pgadmin \ -p 8085:80 \ dpage/pgadmin4 ``` The `-v pgadmin_data:/var/lib/pgadmin` volume mapping saves pgAdmin settings (server connections, preferences) so you don't have to reconfigure it every time you restart the container. ### Parameters Explained * The container needs 2 environment variables: a login email and a password. We use `admin@admin.com` and `root` in this example. * pgAdmin is a web app and its default port is 80; we map it to 8085 in our localhost to avoid any possible conflicts. * The actual image name is `dpage/pgadmin4`. **Note:** This won't work yet because pgAdmin can't see the PostgreSQL container. They need to be on the same Docker network! 
## Docker Networks Let's create a virtual Docker network called `pg-network`: ```bash docker network create pg-network ``` > You can remove the network later with the command `docker network rm pg-network`. You can look at the existing networks with `docker network ls`. ### Run Containers on the Same Network Stop both containers and re-run them with the network configuration: ```bash # Run PostgreSQL on the network docker run -it \ -e POSTGRES_USER="root" \ -e POSTGRES_PASSWORD="root" \ -e POSTGRES_DB="ny_taxi" \ -v ny_taxi_postgres_data:/var/lib/postgresql \ -p 5432:5432 \ --network=pg-network \ --name pgdatabase \ postgres:18 # In another terminal, run pgAdmin on the same network docker run -it \ -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \ -e PGADMIN_DEFAULT_PASSWORD="root" \ -v pgadmin_data:/var/lib/pgadmin \ -p 8085:80 \ --network=pg-network \ --name pgadmin \ dpage/pgadmin4 ``` * Just like with the Postgres container, we specify a network and a name for pgAdmin. * The container names (`pgdatabase` and `pgadmin`) allow the containers to find each other within the network. ## Connect pgAdmin to PostgreSQL You should now be able to load pgAdmin on a web browser by browsing to `http://localhost:8085`. Use the same email and password you used for running the container to log in. 1. Open browser and go to `http://localhost:8085` 2. Login with email: `admin@admin.com`, password: `root` 3. Right-click "Servers" → Register → Server 4. Configure: - **General tab**: Name: `Local Docker` - **Connection tab**: - Host: `pgdatabase` (the container name) - Port: `5432` - Username: `root` - Password: `root` 5. Save Now you can explore the database using the pgAdmin interface! 
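
The hostname rule above applies to any client, not only pgAdmin. As a rough sketch (the `build_pg_url` helper is hypothetical, not part of the course code), the only part of the connection URL that changes between running on the host and running inside `pg-network` is the host:

```python
from urllib.parse import quote_plus


def build_pg_url(user: str, password: str, host: str, port: int, db: str) -> str:
    """Build a SQLAlchemy-style PostgreSQL URL; user and password are URL-escaped."""
    return (
        f"postgresql+psycopg://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


# From the host machine, Postgres is reached through the published port.
host_url = build_pg_url("root", "root", "localhost", 5432, "ny_taxi")

# From another container on pg-network, the container name is the hostname.
network_url = build_pg_url("root", "root", "pgdatabase", 5432, "ny_taxi")

print(host_url)
print(network_url)
```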
**[↑ Up](README.md)** | **[← Previous](06-ingestion-script.md)** | **[Next →](08-dockerizing-ingestion.md)** ================================================ FILE: 01-docker-terraform/docker-sql/08-dockerizing-ingestion.md ================================================ # Dockerizing the Ingestion Script **[↑ Up](README.md)** | **[← Previous](07-pgadmin.md)** | **[Next →](09-docker-compose.md)** Now let's containerize the ingestion script so we can run it in Docker. ## The Dockerfile The `pipeline/Dockerfile` shows how to containerize the ingestion script: ```dockerfile FROM python:3.13.11-slim COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/ WORKDIR /code ENV PATH="/code/.venv/bin:$PATH" COPY pyproject.toml .python-version uv.lock ./ RUN uv sync --locked COPY ingest_data.py . ENTRYPOINT ["uv", "run", "python", "ingest_data.py"] ``` ### Explanation - `FROM python:3.13.11-slim`: Start with slim Python 3.13 image for smaller size - `COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/`: Copy uv binary from official uv image - `WORKDIR /code`: Set working directory inside container - `ENV PATH="/code/.venv/bin:$PATH"`: Add virtual environment to PATH - `COPY pyproject.toml .python-version uv.lock ./`: Copy dependency files first (better caching) - `RUN uv sync --locked`: Install all dependencies from lock file (ensures reproducible builds) - `COPY ingest_data.py .`: Copy ingestion script - `ENTRYPOINT ["uv", "run", "python", "ingest_data.py"]`: Set entry point to run the ingestion script ## Build the Docker Image ```bash cd pipeline docker build -t taxi_ingest:v001 . ``` ## Run the Containerized Ingestion ```bash docker run -it \ --network=pg-network \ taxi_ingest:v001 \ --pg-user=root \ --pg-pass=root \ --pg-host=pgdatabase \ --pg-port=5432 \ --pg-db=ny_taxi \ --target-table=yellow_taxi_trips ``` ### Important Notes * We need to provide the network for Docker to find the Postgres container. It goes before the name of the image. 
* Since Postgres is running on a separate container, the host argument will have to point to the container name of Postgres (`pgdatabase`). * You can drop the table in pgAdmin beforehand if you want, but the script will automatically replace the pre-existing table. **[↑ Up](README.md)** | **[← Previous](07-pgadmin.md)** | **[Next →](09-docker-compose.md)** ================================================ FILE: 01-docker-terraform/docker-sql/09-docker-compose.md ================================================ # Docker Compose **[↑ Up](README.md)** | **[← Previous](08-dockerizing-ingestion.md)** | **[Next →](10-sql-refresher.md)** `docker-compose` allows us to launch multiple containers using a single configuration file, so that we don't have to run multiple complex `docker run` commands separately. Docker compose makes use of YAML files. Here's the `docker-compose.yaml` file: ```yaml services: pgdatabase: image: postgres:18 environment: POSTGRES_USER: "root" POSTGRES_PASSWORD: "root" POSTGRES_DB: "ny_taxi" volumes: - "ny_taxi_postgres_data:/var/lib/postgresql" ports: - "5432:5432" pgadmin: image: dpage/pgadmin4 environment: PGADMIN_DEFAULT_EMAIL: "admin@admin.com" PGADMIN_DEFAULT_PASSWORD: "root" volumes: - "pgadmin_data:/var/lib/pgadmin" ports: - "8085:80" volumes: ny_taxi_postgres_data: pgadmin_data: ``` ### Explanation * We don't have to specify a network because `docker compose` takes care of it: every single container (or "service", as the file states) will run within the same network and will be able to find each other according to their names (`pgdatabase` and `pgadmin` in this example). * All other details from the `docker run` commands (environment variables, volumes and ports) are mentioned accordingly in the file following YAML syntax. ## Start Services with Docker Compose We can now run Docker compose by running the following command from the same directory where `docker-compose.yaml` is found. 
Make sure that all previous containers aren't running anymore: ```bash docker-compose up ``` ### Detached Mode If you want to run the containers again in the background rather than in the foreground (thus freeing up your terminal), you can run them in detached mode: ```bash docker-compose up -d ``` ## Stop Services You will have to press `Ctrl+C` in order to shut down the containers when running in foreground mode. The proper way of shutting them down is with this command: ```bash docker-compose down ``` ## Other Useful Commands ```bash # View logs docker-compose logs # Stop and remove volumes docker-compose down -v ``` ## Benefits of Docker Compose - Single command to start all services - Automatic network creation - Easy configuration management - Declarative infrastructure ## Running the Ingestion Script with Docker Compose If you want to re-run the dockerized ingest script when you run Postgres and pgAdmin with `docker compose`, you will have to find the name of the virtual network that Docker compose created for the containers. 
```bash
# check the network name:
docker network ls

# it's pipeline_default (or similar, based on the directory name)
# now run the script:
docker run -it --rm \
  --network=pipeline_default \
  taxi_ingest:v001 \
  --pg-user=root \
  --pg-pass=root \
  --pg-host=pgdatabase \
  --pg-port=5432 \
  --pg-db=ny_taxi \
  --target-table=yellow_taxi_trips
```

**[↑ Up](README.md)** | **[← Previous](08-dockerizing-ingestion.md)** | **[Next →](10-sql-refresher.md)**

================================================
FILE: 01-docker-terraform/docker-sql/10-sql-refresher.md
================================================
# SQL Refresher

**[↑ Up](README.md)** | **[← Previous](09-docker-compose.md)** | **[Next →](11-cleanup.md)**

[![](https://markdown-videos-api.jorgenkh.no/youtube/QEcps_iskgg)](https://youtu.be/QEcps_iskgg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10)

Pre-Requisites: If you followed the course in the given order, Docker Compose should already be running with pgdatabase and pgAdmin. Once the services are up, you can go to http://localhost:8085/browser/ to access pgAdmin. If you don't see the new table, right-click the server or database and refresh it. The queries below also expect a `zones` lookup table with `LocationID`, `Borough` and `Zone` columns alongside `yellow_taxi_trips`.

Now start querying!
## Inner Joins

### Implicit INNER JOIN

Joining the Yellow Taxi table with the Zones lookup table (implicit INNER JOIN):

```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t,
    zones zpu,
    zones zdo
WHERE
    t."PULocationID" = zpu."LocationID"
    AND t."DOLocationID" = zdo."LocationID"
LIMIT 100;
```

### Explicit INNER JOIN

```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
-- JOIN is shorthand for INNER JOIN; PostgreSQL treats the two identically
JOIN zones zpu
    ON t."PULocationID" = zpu."LocationID"
JOIN zones zdo
    ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;
```

## Data Quality Checks

### Checking for NULL Location IDs

```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_trips
WHERE
    "PULocationID" IS NULL
    OR "DOLocationID" IS NULL
LIMIT 100;
```

### Checking for Location IDs NOT IN Zones Table

```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_trips
WHERE
    "DOLocationID" NOT IN (SELECT "LocationID" FROM zones)
    OR "PULocationID" NOT IN (SELECT "LocationID" FROM zones)
LIMIT 100;
```

## LEFT, RIGHT, and OUTER JOINS

Use LEFT, RIGHT, and OUTER JOINs when some location IDs are missing from one of the tables. The example first deletes a zone so that some trips may no longer have a match:

```sql
DELETE FROM zones WHERE "LocationID" = 142;

SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
LEFT JOIN zones zpu
    ON t."PULocationID" = zpu."LocationID"
JOIN zones zdo
    ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;
```
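
To see what the LEFT JOIN above changes compared with a plain JOIN, the behaviour can be reproduced on a toy dataset. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` rather than PostgreSQL, and the `trips` table and its rows are made up for illustration. The join semantics shown are the same in both databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE zones ("LocationID" INTEGER, "Zone" TEXT);
    CREATE TABLE trips ("PULocationID" INTEGER, total_amount REAL);
    INSERT INTO zones VALUES (1, 'Zone A'), (2, 'Zone B');
    -- The third trip points at a location ID that has no row in zones.
    INSERT INTO trips VALUES (1, 10.0), (2, 20.0), (3, 30.0);
""")

# INNER JOIN: trips without a matching zone are dropped.
inner_rows = conn.execute("""
    SELECT t.total_amount, z."Zone"
    FROM trips t
    JOIN zones z ON t."PULocationID" = z."LocationID"
    ORDER BY t.total_amount
""").fetchall()

# LEFT JOIN: every trip is kept; missing zone columns come back as NULL (None).
left_rows = conn.execute("""
    SELECT t.total_amount, z."Zone"
    FROM trips t
    LEFT JOIN zones z ON t."PULocationID" = z."LocationID"
    ORDER BY t.total_amount
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The inner join returns two rows, while the left join returns three, with `None` in place of the missing zone name.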
```sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
RIGHT JOIN zones zpu
    ON t."PULocationID" = zpu."LocationID"
JOIN zones zdo
    ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;
```

```sql
-- PostgreSQL requires FULL OUTER JOIN (or FULL JOIN); a bare OUTER JOIN is a syntax error
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", ' | ', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", ' | ', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t
FULL OUTER JOIN zones zpu
    ON t."PULocationID" = zpu."LocationID"
JOIN zones zdo
    ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;
```

## GROUP BY

### Calculate Number of Trips Per Day

```sql
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1)
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
LIMIT 100;
```

## ORDER BY

### Ordering by Day

```sql
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1)
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
ORDER BY
    "day" ASC
LIMIT 100;
```

### Ordering by Count

```sql
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1) AS "count"
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
ORDER BY
    "count" DESC
LIMIT 100;
```

## Other Aggregations

```sql
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    COUNT(1) AS "count",
    MAX(total_amount) AS "total_amount",
    MAX(passenger_count) AS "passenger_count"
FROM
    yellow_taxi_trips
GROUP BY
    CAST(tpep_dropoff_datetime AS DATE)
ORDER BY
    "count" DESC
LIMIT 100;
```

## Grouping by Multiple Fields

```sql
SELECT
    CAST(tpep_dropoff_datetime AS DATE) AS "day",
    "DOLocationID",
    COUNT(1) AS "count",
    MAX(total_amount) AS "total_amount",
    MAX(passenger_count) AS "passenger_count"
FROM
    yellow_taxi_trips
GROUP BY
    1, 2
ORDER BY
    "day" ASC,
    "DOLocationID" ASC
LIMIT 100;
```

**[↑ Up](README.md)** | **[← Previous](09-docker-compose.md)** | **[Next →](11-cleanup.md)**
================================================
FILE: 01-docker-terraform/docker-sql/11-cleanup.md
================================================
# Cleanup

**[↑ Up](README.md)** | **[← Previous](10-sql-refresher.md)** | **[Next →](../README.md)**

When you're done with the workshop, clean up Docker resources to free up disk space.

## Stop All Running Containers

```bash
docker-compose down
```

## Remove Specific Containers

```bash
# List all containers
docker ps -a

# Remove specific container (use an ID or name from the list above)
docker rm <container_id>

# Remove all stopped containers
docker container prune
```

## Remove Docker Images

```bash
# List all images
docker images

# Remove specific image
docker rmi taxi_ingest:v001

# Remove all unused images
docker image prune -a
```

## Remove Docker Volumes

```bash
# List volumes
docker volume ls

# Remove specific volumes
docker volume rm ny_taxi_postgres_data
docker volume rm pgadmin_data

# Remove all unused volumes
docker volume prune
```

## Remove Docker Networks

```bash
# List networks
docker network ls

# Remove specific network
docker network rm pg-network

# Remove all unused networks
docker network prune
```

## Complete Cleanup

Removes ALL Docker resources - use with caution!

```bash
# ⚠️ Warning: This removes ALL Docker resources!
docker system prune -a --volumes
```

## Clean Up Local Files

```bash
# Remove parquet files
rm *.parquet

# Remove Python cache
rm -rf __pycache__ .pytest_cache

# Remove virtual environment (if using venv)
rm -rf .venv
```

---

That's all for today. Happy learning!
🐳📊 **[↑ Up](README.md)** | **[← Previous](10-sql-refresher.md)** | **[Next →](../README.md)** ================================================ FILE: 01-docker-terraform/docker-sql/README.md ================================================ # Docker and PostgreSQL: Data Engineering Workshop * Video: [link](https://www.youtube.com/watch?v=lP8xXebHmuE) * Slides: [link](https://docs.google.com/presentation/d/19pXcInDwBnlvKWCukP5sDoCAb69SPqgIoxJ_0Bikr00/edit?usp=sharing) * Code: [pipeline/](pipeline/) In this workshop, we will explore Docker fundamentals and data engineering workflows using Docker containers. This workshop is part of Module 1 of the [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp). **Data Engineering** is the design and development of systems for collecting, storing and analyzing data at scale. ## Prerequisites - Basic understanding of Python - Basic SQL knowledge (helpful but not required) - Docker and Python installed on your machine - Git (optional) ## Workshop Contents 1. [Introduction to Docker](01-introduction.md) - What is Docker, why use it, basic commands 2. [Virtual Environments and Data Pipelines](02-virtual-environment.md) - Setting up Python environments with uv 3. [Dockerizing the Pipeline](03-dockerizing-pipeline.md) - Creating a Dockerfile for a simple pipeline 4. [Running PostgreSQL with Docker](04-postgres-docker.md) - Dockerizing PostgreSQL database 5. [NY Taxi Dataset and Data Ingestion](05-data-ingestion.md) - Working with real data, pandas, SQLAlchemy 6. [Creating the Data Ingestion Script](06-ingestion-script.md) - Converting notebook to Python script 7. [pgAdmin - Database Management Tool](07-pgadmin.md) - Web-based database management 8. [Dockerizing the Ingestion Script](08-dockerizing-ingestion.md) - Containerizing the pipeline 9. [Docker Compose](09-docker-compose.md) - Multi-container orchestration 10. [SQL Refresher](10-sql-refresher.md) - SQL joins, aggregations, and queries 11. 
[Cleanup](11-cleanup.md) - Cleaning up Docker resources ================================================ FILE: 01-docker-terraform/docker-sql/pipeline/.python-version ================================================ 3.13 ================================================ FILE: 01-docker-terraform/docker-sql/pipeline/Dockerfile ================================================ FROM python:3.13.11-slim COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/ WORKDIR /code ENV PATH="/code/.venv/bin:$PATH" COPY pyproject.toml .python-version uv.lock ./ RUN uv sync --locked COPY ingest_data.py . ENTRYPOINT ["python", "ingest_data.py"] ================================================ FILE: 01-docker-terraform/docker-sql/pipeline/docker-compose.yaml ================================================ services: pgdatabase: image: postgres:18 environment: POSTGRES_USER: "root" POSTGRES_PASSWORD: "root" POSTGRES_DB: "ny_taxi" volumes: - ny_taxi_postgres_data:/var/lib/postgresql ports: - "5432:5432" pgadmin: image: dpage/pgadmin4 environment: PGADMIN_DEFAULT_EMAIL: "admin@admin.com" PGADMIN_DEFAULT_PASSWORD: "root" volumes: - pgadmin_data:/var/lib/pgadmin ports: - "8085:80" volumes: ny_taxi_postgres_data: pgadmin_data: ================================================ FILE: 01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-ingest.sh ================================================ #!/usr/bin/env bash ## bash script to run the ingestion container echo "Running data ingestion for January 2021..." 
docker run -it --rm \ --network=pg-network \ taxi_ingest:v001 \ --year=2021 \ --month=1 \ --pg-user=root \ --pg-pass=root \ --pg-host=pgdatabase \ --pg-port=5432 \ --pg-db=ny_taxi \ --chunksize=100000 \ --target-table=yellow_taxi_trips ================================================ FILE: 01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-pgadmin.sh ================================================ #!/usr/bin/env bash ## bash script to start pgadmin echo "Starting pgAdmin container..." mkdir -p ../pgadmin_data docker run -it \ -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \ -e PGADMIN_DEFAULT_PASSWORD="root" \ -v ../pgadmin_data:/var/lib/pgadmin \ -p 8085:80 \ --network=pg-network \ --name pgadmin \ dpage/pgadmin4 ================================================ FILE: 01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-postgres.sh ================================================ #!/usr/bin/env bash ## bash script to start the Postgres container mkdir -p ../ny_taxi_postgres_data echo "Starting PostgreSQL container..." 
docker run -it \ -e POSTGRES_USER="root" \ -e POSTGRES_PASSWORD="root" \ -e POSTGRES_DB="ny_taxi" \ -v ../ny_taxi_postgres_data:/var/lib/postgresql \ -p 5432:5432 \ --network=pg-network \ --name pgdatabase \ postgres:18 # to use the pgcli # pgcli -h localhost -p 5432 -u root -d ny_taxi ================================================ FILE: 01-docker-terraform/docker-sql/pipeline/ingest_data.py ================================================ #!/usr/bin/env python # coding: utf-8 import click import pandas as pd from sqlalchemy import create_engine from tqdm.auto import tqdm dtype = { "VendorID": "Int64", "passenger_count": "Int64", "trip_distance": "float64", "RatecodeID": "Int64", "store_and_fwd_flag": "string", "PULocationID": "Int64", "DOLocationID": "Int64", "payment_type": "Int64", "fare_amount": "float64", "extra": "float64", "mta_tax": "float64", "tip_amount": "float64", "tolls_amount": "float64", "improvement_surcharge": "float64", "total_amount": "float64", "congestion_surcharge": "float64" } parse_dates = [ "tpep_pickup_datetime", "tpep_dropoff_datetime" ] @click.command() @click.option('--pg-user', default='root', help='PostgreSQL user') @click.option('--pg-pass', default='root', help='PostgreSQL password') @click.option('--pg-host', default='localhost', help='PostgreSQL host') @click.option('--pg-port', default=5432, type=int, help='PostgreSQL port') @click.option('--pg-db', default='ny_taxi', help='PostgreSQL database name') @click.option('--year', default=2021, type=int, help='Year of the data') @click.option('--month', default=1, type=int, help='Month of the data') @click.option('--target-table', default='yellow_taxi_data', help='Target table name') @click.option('--chunksize', default=100000, type=int, help='Chunk size for reading CSV') def run(pg_user, pg_pass, pg_host, pg_port, pg_db, year, month, target_table, chunksize): """Ingest NYC taxi data into PostgreSQL database.""" prefix = 
'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow' url = f'{prefix}/yellow_tripdata_{year}-{month:02d}.csv.gz' engine = create_engine(f'postgresql+psycopg://{pg_user}:{pg_pass}@{pg_host}:{pg_port}/{pg_db}') df_iter = pd.read_csv( url, dtype=dtype, parse_dates=parse_dates, iterator=True, chunksize=chunksize, ) first = True for df_chunk in tqdm(df_iter): if first: df_chunk.head(0).to_sql( name=target_table, con=engine, if_exists='replace' ) first = False df_chunk.to_sql( name=target_table, con=engine, if_exists='append' ) if __name__ == '__main__': run() ================================================ FILE: 01-docker-terraform/docker-sql/pipeline/pyproject.toml ================================================ [project] name = "pipeline" version = "0.1.0" description = "Add your description here" readme = "README.md" requires-python = ">=3.13" dependencies = [ "click>=8.3.1", "pandas>=2.3.3", "psycopg2-binary>=2.9.11", "pyarrow>=22.0.0", "sqlalchemy>=2.0.44", "tqdm>=4.67.1", ] [dependency-groups] dev = [ "jupyter>=1.1.1", "pgcli>=4.3.0", ] ================================================ FILE: 01-docker-terraform/terraform/1_terraform_overview.md ================================================ ## Terraform Overview [Video](https://www.youtube.com/watch?v=18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=2) ### Concepts #### Introduction 1. What is [Terraform](https://www.terraform.io)? * open-source tool by [HashiCorp](https://www.hashicorp.com), used for provisioning infrastructure resources * supports DevOps best practices for change management * Managing configuration files in source control to maintain an ideal provisioning state for testing and production environments 2. What is IaC? * Infrastructure-as-Code * build, change, and manage your infrastructure in a safe, consistent, and repeatable way by defining resource configurations that you can version, reuse, and share. 3. 
Some advantages * Infrastructure lifecycle management * Version control commits * Very useful for stack-based deployments, and with cloud providers such as AWS, GCP, Azure, K8S… * State-based approach to track resource changes throughout deployments #### Files * `main.tf` * `variables.tf` * Optional: `resources.tf`, `output.tf` * `.tfstate` #### Declarations * `terraform`: configure basic Terraform settings to provision your infrastructure * `required_version`: minimum Terraform version to apply to your configuration * `backend`: stores Terraform's "state" snapshots, to map real-world resources to your configuration. * `local`: stores state file locally as `terraform.tfstate` * `required_providers`: specifies the providers required by the current module * `provider`: * adds a set of resource types and/or data sources that Terraform can manage * The Terraform Registry is the main directory of publicly available providers from most major infrastructure platforms. * `resource` * blocks to define components of your infrastructure * Project modules/resources: google_storage_bucket, google_bigquery_dataset, google_bigquery_table * `variable` & `locals` * runtime arguments and constants #### Execution steps 1. `terraform init`: * Initializes & configures the backend, installs plugins/providers, & checks out an existing configuration from a version control 2. `terraform plan`: * Matches/previews local changes against a remote state, and proposes an Execution Plan. 3. `terraform apply`: * Asks for approval to the proposed plan, and applies changes to cloud 4. 
`terraform destroy`
   * Removes your stack from the Cloud

### Terraform Workshop to create GCP Infra

Continue [here](./terraform): `week_1_basics_n_setup/1_terraform_gcp/terraform`

### References

https://learn.hashicorp.com/collections/terraform/gcp-get-started

================================================
FILE: 01-docker-terraform/terraform/2_gcp_overview.md
================================================
## GCP Overview

[Video](https://www.youtube.com/watch?v=18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=2)

### Project infrastructure modules in GCP:

* Google Cloud Storage (GCS): Data Lake
* BigQuery: Data Warehouse

(Concepts explained in Week 2 - Data Ingestion)

### Initial Setup

For this course, we'll use a free version (up to EUR 300 credits).

1. Create an account with your Google email ID
2. Set up your first [project](https://console.cloud.google.com/) if you haven't already
   * e.g. "DTC DE Course", and note down the "Project ID" (we'll use this later when deploying infra with TF)
3. Set up [service account & authentication](https://cloud.google.com/docs/authentication/getting-started) for this project
   * Grant `Viewer` role to begin with.
   * Download service-account-keys (.json) for auth.
4. Download [SDK](https://cloud.google.com/sdk/docs/quickstart) for local setup
5. Set environment variable to point to your downloaded GCP keys:

```shell
export GOOGLE_APPLICATION_CREDENTIALS=".json"

# Refresh token/session, and verify authentication
gcloud auth application-default login
```

### Setup for Access

1. [IAM Roles](https://cloud.google.com/storage/docs/access-control/iam-roles) for Service account:
   * Go to the *IAM* section of *IAM & Admin* https://console.cloud.google.com/iam-admin/iam
   * Click the *Edit principal* icon for your service account.
   * Add these roles in addition to *Viewer*: **Storage Admin** + **Storage Object Admin** + **BigQuery Admin**

2.
Enable these APIs for your project: * https://console.cloud.google.com/apis/library/iam.googleapis.com * https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com 3. Please ensure `GOOGLE_APPLICATION_CREDENTIALS` env-var is set. ```shell export GOOGLE_APPLICATION_CREDENTIALS=".json" ``` ### Terraform Workshop to create GCP Infra Continue [here](./terraform): `week_1_basics_n_setup/1_terraform_gcp/terraform` ================================================ FILE: 01-docker-terraform/terraform/README.md ================================================ ## Local Setup for Terraform and GCP ### Pre-Requisites 1. Terraform client installation: https://www.terraform.io/downloads 2. Cloud Provider account: https://console.cloud.google.com/ ### Terraform Concepts [Terraform Overview](1_terraform_overview.md) ### GCP setup 1. [Setup for First-time](2_gcp_overview.md#initial-setup) * [Only for Windows](windows.md) - Steps 4 & 5 2. [IAM / Access specific to this course](2_gcp_overview.md#setup-for-access) ### Terraform Workshop for GCP Infra Your setup is ready! Now head to the [terraform](terraform) directory, and perform the execution steps to create your infrastructure. 
================================================ FILE: 01-docker-terraform/terraform/terraform/README.md ================================================ ### Concepts * [Terraform_overview](../1_terraform_overview.md) * If you were unable to generate a service account keyfile due to organizational policies, refer to the instructions [below](#fallback) ### Execution ```shell # Refresh service-account's auth-token for this session gcloud auth application-default login # Initialize state file (.tfstate) terraform init # Check changes to new infra plan terraform plan -var="project=" ``` ```shell # Create new infra terraform apply -var="project=" ``` ```shell # Delete infra after your work, to avoid costs on any running services terraform destroy ``` ### Warning Remember to use a [proper gitignore](https://github.com/github/gitignore/blob/main/Terraform.gitignore) file before publishing your code on GitHub ### Fallback 1. Give yourself the token creator role on the pertinent service account ```bash gcloud iam service-accounts add-iam-policy-binding \ \ --member="user:YOUR_EMAIL@gmail.com" \ --role="roles/iam.serviceAccountTokenCreator" ``` 2. Add the sections below the first block to your main terraform configuration ```terraform # Connect to gcp using ADC (identity verification) provider "google" { project = var.project region = var.region zone = var.zone } /* add these data blocks */ # This data source gets a temporary token for the service account data "google_service_account_access_token" "default" { provider = google target_service_account = "" scopes = ["https://www.googleapis.com/auth/cloud-platform"] lifetime = "3600s" } # This second provider block uses that temporary token and does the real work provider "google" { alias = "impersonated" access_token = data.google_service_account_access_token.default.access_token project = var.project region = var.region zone = var.zone } ``` 3. 
Now, you can follow the instructions [above](#execution) ================================================ FILE: 01-docker-terraform/terraform/terraform/terraform_basic/main.tf ================================================ terraform { required_providers { google = { source = "hashicorp/google" version = "4.51.0" } } } provider "google" { # Credentials only needs to be set if you do not have the GOOGLE_APPLICATION_CREDENTIALS set # credentials = project = "" region = "us-central1" } resource "google_storage_bucket" "data-lake-bucket" { name = "" location = "US" # Optional, but recommended settings: storage_class = "STANDARD" uniform_bucket_level_access = true versioning { enabled = true } lifecycle_rule { action { type = "Delete" } condition { age = 30 // days } } force_destroy = true } resource "google_bigquery_dataset" "dataset" { dataset_id = "" project = "" location = "US" } ================================================ FILE: 01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/README.md ================================================ # AWS Terraform Data Lake (GCP Equivalent) ## 📌 Overview This repository contains an **AWS-based Terraform implementation** that mirrors the **Google Cloud Platform (GCP)** infrastructure used in the Data Engineering course (e.g. GCS + BigQuery), but implemented using **AWS services**. 
The goal is to help learners who:

- Are enrolled in a **GCP-focused Data Engineering course**
- Prefer or need to work with **AWS**
- Want to understand **cloud-agnostic data engineering concepts**

This setup focuses on building a **basic data lake foundation** using:

- **Amazon S3** (equivalent to GCS)
- **AWS Glue Data Catalog** (equivalent to BigQuery datasets / metadata layer)
- **Terraform** as Infrastructure as Code (IaC)

---

## 🏗️ Architecture Mapping (GCP → AWS)

| GCP Service | AWS Equivalent | Purpose |
|------------|---------------|---------|
| Google Cloud Storage (GCS) | Amazon S3 | Data Lake storage |
| Uniform Bucket Level Access | S3 Public Access Block | Secure bucket access |
| Object Lifecycle Rules | S3 Lifecycle Configuration | Automatic data expiration |
| BigQuery Dataset | AWS Glue Catalog Database | Metadata & query layer |
| Terraform (GCP provider) | Terraform (AWS provider) | Infrastructure as Code |

---

## 📁 Project Structure

```text
.
├── main.tf            # Core infrastructure resources
├── variables.tf       # Input variable definitions
├── terraform.tfvars   # Environment-specific values
└── README.md          # Project documentation
```

================================================
FILE: 01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/main.tf
================================================
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

# S3 bucket to store data, equivalent to a GCS bucket in GCP
resource "aws_s3_bucket" "data_lake_bucket" {
  bucket        = var.bucket_name
  force_destroy = true
}

# Bucket versioning
resource "aws_s3_bucket_versioning" "versioning" {
  bucket = aws_s3_bucket.data_lake_bucket.id # Reference the S3 bucket created above
  versioning_configuration {
    status = "Enabled" # Enable versioning
  }
}

# "Uniform bucket level access" ~ controlled via policy/ACL; recommended: block public access
resource "aws_s3_bucket_public_access_block" "block_public_access" {
  bucket = aws_s3_bucket.data_lake_bucket.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lifecycle: delete objects older than 30 days (equivalent to lifecycle_rule age=30)
resource "aws_s3_bucket_lifecycle_configuration" "lifecycle_rules" {
  bucket = aws_s3_bucket.data_lake_bucket.id

  rule {
    id     = "Delete_old_older_than_30_days"
    status = "Enabled"

    expiration {
      days = 30
    }

    filter {
      prefix = "" # Apply to all objects in the bucket
    }
  }
}

resource "aws_glue_catalog_database" "dataset" {
  name = var.dataset_name
}

================================================
FILE: 01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/terraform.tfvars
================================================
bucket_name  = "my-unique-data-lake-bucket-12345"
dataset_name = "ny_taxi_dataset"

================================================
FILE: 01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/variables.tf
================================================
# Specifies the geographic location for AWS resource deployment.
# Defaulting to Stockholm (eu-north-1) to keep latency low for European users.
variable "aws_region" {
  description = "AWS region to deploy resources in"
  type        = string
  default     = "eu-north-1"
}

# The unique identifier for the S3 bucket where raw data will be stored.
# S3 bucket names must be globally unique across all AWS accounts.
variable "bucket_name" {
  description = "Name of the S3 bucket"
  type        = string
  default     = "data-engineering-zoomcamp-1568692036"
}

# Defines the logical grouping for metadata in the AWS Glue Catalog.
# This allows tools like Athena to query the S3 data using SQL.
variable "dataset_name" { description = "Glue Catalog database name (logical dataset for Athena/Glue)" type = string default = "ny_taxi_database" } ================================================ FILE: 01-docker-terraform/terraform/terraform/terraform_with_variables/main.tf ================================================ terraform { required_providers { google = { source = "hashicorp/google" version = "5.6.0" } } } provider "google" { credentials = file(var.credentials) project = var.project region = var.region } resource "google_storage_bucket" "demo-bucket" { name = var.gcs_bucket_name location = var.location force_destroy = true lifecycle_rule { condition { age = 1 } action { type = "AbortIncompleteMultipartUpload" } } } resource "google_bigquery_dataset" "demo_dataset" { dataset_id = var.bq_dataset_name location = var.location } ================================================ FILE: 01-docker-terraform/terraform/terraform/terraform_with_variables/variables.tf ================================================ variable "credentials" { description = "My Credentials" default = "" #ex: if you have a directory where this file is called keys with your service account json file #saved there as my-creds.json you could use default = "./keys/my-creds.json" } variable "project" { description = "Project" default = "" } variable "region" { description = "Region" #Update the below to your desired region default = "us-central1" } variable "location" { description = "Project Location" #Update the below to your desired location default = "US" } variable "bq_dataset_name" { description = "My BigQuery Dataset Name" #Update the below to what you want your dataset to be called default = "demo_dataset" } variable "gcs_bucket_name" { description = "My Storage Bucket Name" #Update the below to a unique bucket name default = "terraform-demo-terra-bucket" } variable "gcs_storage_class" { description = "Bucket Storage Class" default = "STANDARD" } 
================================================ FILE: 01-docker-terraform/terraform/windows.md ================================================ ## GCP and Terraform on Windows You don't need these instructions if you use WSL. It's only for "plain Windows" ### Google Cloud SDK * For this tutorial, you'll need a Linux-like environment, e.g. [GitBash](https://gitforwindows.org/), [MinGW](https://www.mingw-w64.org/) or [cygwin](https://www.cygwin.com/) * Power Shell should also work, but will require adjustments * Download SDK in zip: https://dl.google.com/dl/cloudsdk/channels/rapid/google-cloud-sdk.zip * source: https://cloud.google.com/sdk/docs/downloads-interactive * Unzip it and run the `install.sh` script When installing it, you might see something like that: ``` The installer is unable to automatically update your system PATH. Please add C:\tools\google-cloud-sdk\bin ``` * To fix that, adjust your `.bashrc` to include this in `PATH` ([instructions](https://unix.stackexchange.com/questions/26047/how-to-correctly-add-a-path-to-path)) * You can also do it system-wide ([instructions](https://gist.github.com/nex3/c395b2f8fd4b02068be37c961301caa7)) Now we need to point it to correct Python installation. Assuming you use [Anaconda](https://www.anaconda.com/products/individual): ```bash export CLOUDSDK_PYTHON=~/Anaconda3/python ``` Now let's check that it works: ```bash $ gcloud version Google Cloud SDK 367.0.0 bq 2.0.72 core 2021.12.10 gsutil 5.5 ``` ### Google Cloud SDK Authentication * Now create a service account and generate keys like shown in the videos * Download the key and put it to some location, e.g. 
`.gc/ny-rides.json` * Set `GOOGLE_APPLICATION_CREDENTIALS` to point to the file ```bash export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/ny-rides.json ``` Now authenticate: ```bash gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS ``` Alternatively, you can authenticate using OAuth like shown in the video ```bash gcloud auth application-default login ``` If you get a message like `quota exceeded` > WARNING: > Cannot find a quota project to add to ADC. You might receive a "quota exceeded" or "API not enabled" error. > Run `$ gcloud auth application-default set-quota-project` to add a quota project. Then run this: ```bash PROJECT_NAME="ny-rides-alexey" gcloud auth application-default set-quota-project ${PROJECT_NAME} ``` ### Terraform * [Download Terraform](https://www.terraform.io/downloads) * Put it to a folder in [PATH](https://gist.github.com/nex3/c395b2f8fd4b02068be37c961301caa7) * Go to the location with Terraform files and initialize it ```bash terraform init ``` Optionally you can configure your terraform files (`variables.tf`) to include your project id: ```bash variable "project" { description = "Your GCP Project ID" default = "ny-rides-alexey" type = string } ``` * Now [follow the instructions](1_terraform_overview.md#execution-steps) * Run `terraform plan` * Next, run `terraform apply` If you get an error like that: > Error: googleapi: Error 403: terraform@ny-rides-alexey.iam.gserviceaccount.com does not have > storage.buckets.create access to the Google Cloud project., forbidden Then you need to give your service account all the permissions. 
Make sure you follow the instructions in the videos * You can also use [this file](https://docs.google.com/document/d/e/2PACX-1vSZapy7gIj0TP-EFzub2OpAlAkuifGEVJ4XpkA1RvxZ45NjiQi29b6OhLuetdXXHWAn2lbbKxnbzMdd/pub), but it doesn't list all the required permissions ================================================ FILE: 02-workflow-orchestration/README.md ================================================ # Workflow Orchestration Welcome to Module 2 of the Data Engineering Zoomcamp! This week, we’ll dive into workflow orchestration using [Kestra](https://go.kestra.io/de-zoomcamp/github). Kestra is an open-source, event-driven orchestration platform that simplifies building both scheduled and event-driven workflows. By adopting Infrastructure as Code practices for data and process orchestration, Kestra enables you to build reliable workflows with just a few lines of YAML. > [!NOTE] >You can find all videos for this week in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist). --- ## Course Structure - [2.1 - Introduction to Workflow Orchestration](#21-introduction-to-workflow-orchestration) - [2.2 - Getting Started With Kestra](#22-getting-started-with-kestra) - [2.3 - Hands-On Coding Project: Build ETL Data Pipelines with Kestra](#23-hands-on-coding-project-build-data-pipelines-with-kestra) - [2.4 - ELT Pipelines in Kestra: Google Cloud Platform](#24-elt-pipelines-in-kestra-google-cloud-platform) - [2.5 - Using AI for Data Engineering in Kestra](#25-using-ai-for-data-engineering-in-kestra) - [2.6 - Bonus](#26-bonus-deploy-to-the-cloud-optional) ## 2.1 Introduction to Workflow Orchestration In this section, you’ll learn the foundations of workflow orchestration, its importance, and how Kestra fits into the orchestration landscape. ### 2.1.1 - What is Workflow Orchestration? Think of a music orchestra. There's a variety of different instruments. Some more than others, all with different roles when it comes to playing music. 
To make sure they all come together at the right time, they follow a conductor who helps the orchestra to play together. Now replace the instruments with tools and the conductor with an orchestrator. We often have multiple tools and platforms that we need to work together. Sometimes on a routine schedule, other times based on events that happen. That's where the orchestrator comes in to help all of these tools work together. A workflow orchestrator might do the following tasks: - Run workflows which contain a number of predefined steps - Monitor and log errors, as well as taking a number of extra steps when they occur - Automatically run workflows based on schedules and events In data engineering, you often need to move data from one place, to another, sometimes with some modifications made to the data in the middle. This is where a workflow orchestrator can help out by managing these steps, while giving us visibility into it at the same time. In this module, we're going to build our own data pipeline using ETL (Extract, Transform Load) with Kestra at the core of the operation, but first we need to understand a bit more about how Kestra works before we can get building! #### Videos - **2.1.1 - What is Workflow Orchestration?** [![2.1.1 - What is Workflow Orchestration?](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F-JLnp-iLins)](https://youtu.be/-JLnp-iLins) ### 2.1.2 - What is Kestra? Kestra is an open-source, infinitely-scalable orchestration platform that enables all engineers to manage business-critical workflows. 
Kestra is a great choice for workflow orchestration: - Build with Flow code (YAML), No-code or with the AI Copilot - flexibility in how you build your workflows - 1000+ Plugins - integrate with all the tools you use - Support for any programming language - pick the right tool for the job - Schedule or Event Based Triggers - have your workflows respond to data #### Videos - **2.1.2 - What is Kestra?** [![2.1.2 - What is Kestra?](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FZvVN_NmB_1s)](https://youtu.be/ZvVN_NmB_1s) ### Resources - [Quickstart Guide](https://go.kestra.io/de-zoomcamp/quickstart) - [What is an Orchestrator?](https://go.kestra.io/de-zoomcamp/what-is-an-orchestrator) --- ## 2.2 Getting Started with Kestra In this section, you'll learn how to install Kestra, as well as the key concepts required to build your first workflow. Once our first workflow is built, we can extend this further by executing a Python script inside of a workflow. You will: 1. Install Kestra using Docker Compose 2. Learn the concepts of Kestra to build your first workflow 3. Execute a Python script inside of a Kestra Flow ### 2.2.1 - Installing Kestra To install Kestra, we are going to use Docker Compose. We already have a Postgres database set up, along with pgAdmin from Module 1. We can continue to use these with Kestra but we'll need to make a few modifications to our Docker Compose file. Use [this example Docker Compose file](docker-compose.yml) to correctly add the 2 new services and set up the volumes correctly. Add information about setting a username and password. We'll set up Kestra using Docker Compose containing one container for the Kestra server and another for the Postgres database: ```bash cd 02-workflow-orchestration docker compose up -d ``` **Note:** Check that `pgAdmin` isn't running on the same ports as Kestra. If so, check out the [FAQ](#troubleshooting-tips) at the bottom of the README. 
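
A quick way to see whether something already listens on a port before starting the stack is sketched below. It assumes the Kestra UI is published on port 8080; the helper name `port_in_use` is hypothetical and not part of the course code.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Assumption: 8080 is the port the Kestra UI is mapped to on the host
    state = "already in use" if port_in_use(8080) else "free"
    print(f"Port 8080 is {state}")
```

If the port is reported as in use before `docker compose up -d`, another service (for example pgAdmin from Module 1) is probably bound to it, and one of the two port mappings needs to change.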
Once the container starts, you can access the Kestra UI at [http://localhost:8080](http://localhost:8080). To shut down Kestra, go to the same directory and run the following command: ```bash docker compose down ``` #### Add Flows to Kestra Flows can be added to Kestra by copying and pasting the YAML directly into the editor, or by adding via Kestra's API. See below for adding programmatically.
Add Flows to Kestra programmatically

If you prefer to add flows programmatically using Kestra's API, run the following commands:

```bash
# Import all flows: assuming username admin@kestra.io and password Admin1234! (adjust to match your username and password)
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/01_hello_world.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/02_python.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/03_getting_started_data_pipeline.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/04_postgres_taxi.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/05_postgres_taxi_scheduled.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/06_gcp_kv.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/07_gcp_setup.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/08_gcp_taxi.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/09_gcp_taxi_scheduled.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/10_chat_without_rag.yaml
curl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/11_chat_with_rag.yaml
```
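
The repeated curl calls above can also be generated in a loop. The sketch below is one way to do that, assuming the same endpoint and example credentials as the commands above; `build_import_commands` is a hypothetical helper, not part of the course repository. It only builds and prints the commands, so nothing is sent to Kestra until you choose to run them.

```python
from pathlib import Path

# Endpoint and credentials copied from the curl commands above; adjust to your setup.
IMPORT_URL = "http://localhost:8080/api/v1/flows/import"
CREDENTIALS = "admin@kestra.io:Admin1234!"


def build_import_commands(flow_dir, url=IMPORT_URL, credentials=CREDENTIALS):
    """Return one curl argument list per *.yaml file in flow_dir, sorted by name."""
    commands = []
    for flow_file in sorted(Path(flow_dir).glob("*.yaml")):
        commands.append([
            "curl", "-X", "POST",
            "-u", credentials,
            url,
            "-F", f"fileUpload=@{flow_file.as_posix()}",
        ])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in build_import_commands("flows"):
        print(" ".join(command))
```

To actually import the flows, each argument list can be passed to `subprocess.run(command, check=True)` from the `02-workflow-orchestration` directory.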
#### Videos

- **2.2.1 - Installing Kestra**

  [![2.2.1 - Installing Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FwgPxC4UjoLM)](https://youtu.be/wgPxC4UjoLM)

#### Resources

- [Install Kestra with Docker Compose](https://go.kestra.io/de-zoomcamp/docker-compose)

### 2.2.2 - Kestra Concepts

To start building workflows in Kestra, we need to understand a number of concepts.

- [Flow](https://go.kestra.io/de-zoomcamp/flow) - a container for tasks and their orchestration logic.
- [Tasks](https://go.kestra.io/de-zoomcamp/tasks) - the steps within a flow.
- [Inputs](https://go.kestra.io/de-zoomcamp/inputs) - dynamic values passed to the flow at runtime.
- [Outputs](https://go.kestra.io/de-zoomcamp/outputs) - pass data between tasks and flows.
- [Triggers](https://go.kestra.io/de-zoomcamp/triggers) - mechanism that automatically starts the execution of a flow.
- [Execution](https://go.kestra.io/de-zoomcamp/execution) - a single run of a flow with a specific state.
- [Variables](https://go.kestra.io/de-zoomcamp/variables) - key–value pairs that let you reuse values across tasks.
- [Plugin Defaults](https://go.kestra.io/de-zoomcamp/plugin-defaults) - default values applied to every task of a given type within one or more flows.
- [Concurrency](https://go.kestra.io/de-zoomcamp/concurrency) - control how many executions of a flow can run at the same time.

While there are more concepts used for building powerful workflows, these are the ones we're going to use to build our data pipelines.

The flow [`01_hello_world.yaml`](flows/01_hello_world.yaml) showcases all of these concepts inside of one workflow:

- The flow has 5 tasks: 3 log tasks, a return task, and a sleep task.
- The flow takes an input called `name`.
- There is a variable that takes the `name` input to generate a full welcome message.
- An output is generated from the return task and is logged in a later log task.
- There is a trigger to execute this flow every day at 10am.
- Plugin Defaults are used to make all log tasks send their messages as `ERROR` level.
- We have a concurrency limit of 2 executions. Any further ones made while 2 are running will fail.

#### Videos

- **2.2.2 - Kestra Concepts**

  [![2.2.2 - Kestra Concepts](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FMNOKVx8780E)](https://youtu.be/MNOKVx8780E)

#### Resources

- [Tutorial](https://go.kestra.io/de-zoomcamp/tutorial)
- [Workflow Components Documentation](https://go.kestra.io/de-zoomcamp/workflow-components)

### 2.2.3 - Orchestrate Python Code

Now that we've built our first workflow, we can take it a step further by adding Python code into our flow. In Kestra, we can run Python code from a dedicated file or write it directly inside of our workflow.

While Kestra has a huge variety of plugins available for building your workflows, you also have the option to write your own code and have Kestra execute that based on schedules or events. This means you can pick the right tools for your pipelines, rather than being limited to a fixed set.

In our example Python workflow, [`02_python.yaml`](flows/02_python.yaml), our code fetches the number of Docker image pulls from DockerHub and returns it as an output to Kestra. This is useful as we can access this output with other tasks, even though it was generated inside of our Python script.

#### Videos

- **2.2.3 - Orchestrate Python Code**

  [![2.2.3 - Orchestrate Python Code](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FVAHm0R_XjqI)](https://youtu.be/VAHm0R_XjqI)

#### Resources

- [How-to Guide: Python](https://go.kestra.io/de-zoomcamp/python)

## 2.3 Hands-On Coding Project: Build Data Pipelines with Kestra

Next, we're going to build ETL pipelines for Yellow and Green Taxi data from NYC’s Taxi and Limousine Commission (TLC). You will:

1. Extract data from [CSV files](https://github.com/DataTalksClub/nyc-tlc-data/releases).
2. Load it into Postgres or Google Cloud (GCS + BigQuery).
3.
Explore scheduling and backfilling workflows. ### 2.3.1 Getting Started Pipeline This introductory flow is added just to demonstrate a simple data pipeline which extracts data via HTTP REST API, transforms that data in Python and then queries it using DuckDB. For this stage, a new separate Postgres database is created for the exercises. ```mermaid graph LR Extract[Extract Data via HTTP REST API] --> Transform[Transform Data in Python] Transform --> Query[Query Data with DuckDB] ``` Add the flow [`03_getting_started_data_pipeline.yaml`](flows/03_getting_started_data_pipeline.yaml) from the UI if you haven't already and execute it to see the results. Inspect the Gantt and Logs tabs to understand the flow execution. #### Videos - **2.3.1 - Getting Started Pipeline** [![Create an ETL Pipeline with Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F-KmwrCqRhic)](https://youtu.be/-KmwrCqRhic) #### Resources - [ETL Tutorial Video](https://go.kestra.io/de-zoomcamp/etl-tutorial) - [ETL in 3 Minutes](https://go.kestra.io/de-zoomcamp/etl-get-started) ### 2.3.2 Local DB: Load Taxi Data to Postgres Before we start loading data to GCP, we'll first play with the Yellow and Green Taxi data using a local Postgres database running in a Docker container. We will use the same database from Module 1 which should be in the same Docker Compose file as Kestra. The flow will extract CSV data partitioned by year and month, create tables, load data to the monthly table, and finally merge the data to the final destination table. 
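
The load-to-monthly-table-then-merge pattern described above can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` rather than Postgres, with hypothetical table and column names (`trips`, `trips_staging`, `row_id`); the actual flow defines its own schema and merge statement.

```python
import sqlite3


def load_month(conn, rows):
    """Load one month of rows through a staging table, then merge into the final table."""
    cur = conn.cursor()
    # Final destination table: row_id is the (hypothetical) unique key used for deduplication
    cur.execute(
        "CREATE TABLE IF NOT EXISTS trips "
        "(row_id TEXT PRIMARY KEY, pickup TEXT, amount REAL)"
    )
    # Monthly staging table: emptied before every load
    cur.execute(
        "CREATE TABLE IF NOT EXISTS trips_staging "
        "(row_id TEXT, pickup TEXT, amount REAL)"
    )
    cur.execute("DELETE FROM trips_staging")
    cur.executemany("INSERT INTO trips_staging VALUES (?, ?, ?)", rows)
    # Merge: rows whose row_id already exists in the final table are skipped
    cur.execute(
        "INSERT OR IGNORE INTO trips (row_id, pickup, amount) "
        "SELECT row_id, pickup, amount FROM trips_staging"
    )
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM trips").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    january = [("a", "2019-01-01", 10.0), ("b", "2019-01-02", 5.5)]
    print(load_month(conn, january))  # first load
    print(load_month(conn, january))  # re-running the same month adds no duplicates
```

Because the merge skips keys that already exist, re-running the same month is safe, which is what makes scheduled runs and backfills repeatable.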
```mermaid graph LR Start[Select Year & Month] --> SetLabel[Set Labels] SetLabel --> Extract[Extract CSV Data] Extract -->|Taxi=Yellow| YellowFinalTable[Create Yellow Final Table]:::yellow Extract -->|Taxi=Green| GreenFinalTable[Create Green Final Table]:::green YellowFinalTable --> YellowMonthlyTable[Create Yellow Monthly Table]:::yellow GreenFinalTable --> GreenMonthlyTable[Create Green Monthly Table]:::green YellowMonthlyTable --> YellowCopyIn[Load Data to Monthly Table]:::yellow GreenMonthlyTable --> GreenCopyIn[Load Data to Monthly Table]:::green YellowCopyIn --> YellowMerge[Merge Yellow Data]:::yellow GreenCopyIn --> GreenMerge[Merge Green Data]:::green classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px,color:#000; classDef green fill:#32CD32,stroke:#000,stroke-width:1px,color:#000; ``` The flow code: [`04_postgres_taxi.yaml`](flows/04_postgres_taxi.yaml). > [!NOTE] > The NYC Taxi and Limousine Commission (TLC) Trip Record Data provided on the [nyc.gov](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) website is currently available only in a Parquet format, but this is NOT the dataset we're going to use in this course. For the purpose of this course, we'll use the **CSV files** available [here on GitHub](https://github.com/DataTalksClub/nyc-tlc-data/releases). This is because the Parquet format can be challenging to understand by newcomers, and we want to make the course as accessible as possible — the CSV format can be easily introspected using tools like Excel or Google Sheets, or even a simple text editor. #### Videos - **2.3.2 - Local DB: Load Taxi Data to Postgres** [![Local DB: Load Taxi Data to Postgres](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FZ9ZmmwtXDcU)](https://youtu.be/Z9ZmmwtXDcU) #### Resources - [Docker Compose with Kestra, Postgres and pgAdmin](docker-compose.yml) ### 2.3.3 Local DB: Learn Scheduling and Backfills We can now schedule the same pipeline shown above to run daily at 9 AM UTC. 
We'll also demonstrate how to backfill the data pipeline to run on historical data.

Note: given the large dataset, we'll backfill only data for the green taxi dataset for the year 2019.

The flow code: [`05_postgres_taxi_scheduled.yaml`](flows/05_postgres_taxi_scheduled.yaml).

#### Videos

- **2.3.3 - Scheduling and Backfills**

  [![Scheduling and Backfills](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F1pu_C_oOAMA)](https://youtu.be/1pu_C_oOAMA)

---

## 2.4 ELT Pipelines in Kestra: Google Cloud Platform

Now that you've learned how to build ETL pipelines locally using Postgres, we are ready to move to the cloud. In this section, we'll load the same Yellow and Green Taxi data to Google Cloud Platform (GCP) using:

1. Google Cloud Storage (GCS) as a data lake
2. BigQuery as a data warehouse.

### 2.4.1 - ETL vs ELT

In 2.3, we made an ETL pipeline inside of Kestra:

- **Extract:** Firstly, we extract the dataset from GitHub
- **Transform:** Next, we transform it with Python
- **Load:** Finally, we load it into our Postgres database

While this is very standard across the industry, sometimes it makes sense to change the order when working with the cloud. If you're working with a large dataset, like the Yellow Taxi data, there can be benefits to extracting and loading straight into a data warehouse, and then performing transformations directly in the data warehouse. When working with BigQuery, we will use ELT:

- **Extract:** Firstly, we extract the dataset from GitHub
- **Load:** Next, we load this dataset (in this case, a CSV file) into a data lake (Google Cloud Storage)
- **Transform:** Finally, we can create a table inside of our data warehouse (BigQuery) which uses the data from our data lake to perform our transformations.

Loading into the data warehouse before transforming means we can use the cloud's compute to transform large datasets.
What might take a long time on a local machine can take a fraction of the time in the cloud.

Over the next few videos, we'll look at setting up BigQuery and transforming the Yellow Taxi dataset.

#### Videos

- **2.4.1 - ETL vs ELT**
  [![ETL vs ELT](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FE04yurp1tSU)](https://youtu.be/E04yurp1tSU)

#### Resources

- [ETL vs ELT Video](https://go.kestra.io/de-zoomcamp/etl-vs-elt)
- [Data Warehouse 101 Video](https://go.kestra.io/de-zoomcamp/data-warehouse-101)
- [Data Lakes 101 Video](https://go.kestra.io/de-zoomcamp/data-lakes-101)

### 2.4.2 Setup Google Cloud Platform (GCP)

Before we start loading data to GCP, we need to set up the Google Cloud Platform.

First, adjust the following flow [`06_gcp_kv.yaml`](flows/06_gcp_kv.yaml) to include your service account, GCP project ID, BigQuery dataset and GCS bucket name (_along with their location_) as KV Store values:

- GCP_PROJECT_ID
- GCP_LOCATION
- GCP_BUCKET_NAME
- GCP_DATASET

#### Create GCP Resources

If you haven't already created the GCS bucket and BigQuery dataset in the first week of the course, you can use this flow to create them: [`07_gcp_setup.yaml`](flows/07_gcp_setup.yaml).

> [!WARNING]
> The `GCP_CREDS` service account contains sensitive information. Ensure you keep it secure and do not commit it to Git. Keep it as secure as your passwords.

#### Videos

- **2.4.2 - Setup Google Cloud Platform**
  [![Setup Google Cloud Platform](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FTLGFAOHpOYM)](https://youtu.be/TLGFAOHpOYM)

#### Resources

- [Set up Google Cloud Service Account in Kestra](https://go.kestra.io/de-zoomcamp/google-sa)

### 2.4.3 GCP Workflow: Load Taxi Data to BigQuery

Now that Google Cloud is set up with a storage bucket, we can start the ELT process.
```mermaid
graph LR
  SetLabel[Set Labels] --> Extract[Extract CSV Data]
  Extract --> UploadToGCS[Upload Data to GCS]
  UploadToGCS -->|Taxi=Yellow| BQYellowTripdata[Main Yellow Tripdata Table]:::yellow
  UploadToGCS -->|Taxi=Green| BQGreenTripdata[Main Green Tripdata Table]:::green
  BQYellowTripdata --> BQYellowTableExt[External Table]:::yellow
  BQGreenTripdata --> BQGreenTableExt[External Table]:::green
  BQYellowTableExt --> BQYellowTableTmp[Monthly Table]:::yellow
  BQGreenTableExt --> BQGreenTableTmp[Monthly Table]:::green
  BQYellowTableTmp --> BQYellowMerge[Merge to Main Table]:::yellow
  BQGreenTableTmp --> BQGreenMerge[Merge to Main Table]:::green
  BQYellowMerge --> PurgeFiles[Purge Files]
  BQGreenMerge --> PurgeFiles[Purge Files]

  classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px,color:#000
  classDef green fill:#32CD32,stroke:#000,stroke-width:1px,color:#000
```

The flow code: [`08_gcp_taxi.yaml`](flows/08_gcp_taxi.yaml).

#### Videos

- **2.4.3 - Create an ETL Pipeline with GCS and BigQuery in Kestra**
  [![Create an ETL Pipeline with GCS and BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F52u9X_bfTAo)](https://youtu.be/52u9X_bfTAo)

### 2.4.4 GCP Workflow: Schedule and Backfill Full Dataset

We can now schedule the same pipeline shown above to run daily at 9 AM UTC for the green dataset and at 10 AM UTC for the yellow dataset. You can backfill historical data directly from the Kestra UI.

Since we now process data in a cloud environment with elastically scalable storage and compute, we can backfill the entire dataset for both the yellow and green taxi data without the risk of running out of resources on our local machine.

The flow code: [`09_gcp_taxi_scheduled.yaml`](flows/09_gcp_taxi_scheduled.yaml).
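
To get a feel for how much work a full backfill involves, here is a minimal Python sketch that lists the monthly CSV files a backfill would touch. It assumes the `<taxi>_tripdata_<yyyy-MM>.csv` naming pattern used by the flow variables and the 2019 to 2020 range offered by the year input in `04_postgres_taxi.yaml`. The helper name `monthly_files` is hypothetical and is not part of Kestra; it only illustrates the idea.

```python
from datetime import date


def monthly_files(taxi: str, start: date, end: date) -> list[str]:
    """Return one CSV file name per calendar month between start and end (inclusive).

    Hypothetical helper: it mirrors the file-name pattern of the course flows,
    but Kestra itself derives the name from the trigger date of each execution.
    """
    files = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        files.append(f"{taxi}_tripdata_{year:04d}-{month:02d}.csv")
        month += 1
        if month > 12:  # roll over to January of the next year
            year, month = year + 1, 1
    return files


# Assumed backfill window: January 2019 through December 2020.
for taxi in ("green", "yellow"):
    files = monthly_files(taxi, date(2019, 1, 1), date(2020, 12, 31))
    print(taxi, len(files), files[0], files[-1])
```

Under these assumptions each taxi type maps to 24 monthly files, which is why running the full backfill is more comfortable in the cloud than on a local machine.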
#### Videos

- **2.4.4 - GCP Workflow: Schedule and Backfills**
  [![GCP Workflow: Schedule and Backfills](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fb-6KhfWfk2M)](https://youtu.be/b-6KhfWfk2M)

---

## 2.5 Using AI for Data Engineering in Kestra

This section builds on what you learned earlier in Module 2 to show you how AI can speed up workflow development.

By the end of this section, you will:

- Understand why context engineering matters when collaborating with LLMs
- Use AI Copilot to build Kestra flows faster
- Use Retrieval Augmented Generation (RAG) in data pipelines

### Prerequisites

- Completion of earlier sections in Module 2 (Workflow Orchestration with Kestra)
- Kestra running locally
- Google Cloud account with access to Gemini API (there's a generous free tier!)

---

### 2.5.1 Introduction: Why AI for Workflows?

As data engineers, we spend significant time writing boilerplate code, searching documentation, and structuring data pipelines. AI tools can help us:

- **Generate workflows faster**: Describe what you want to accomplish in natural language instead of writing YAML from scratch
- **Avoid errors**: Get syntax-correct, up-to-date workflow code that follows best practices

However, AI is only as good as the context we provide. This section teaches you how to engineer that context for reliable, production-ready data workflows.

#### Videos

- **2.5.1 - Using AI for Data Engineering**
  [![Using AI for Data Engineering](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FGHPtRDAv044)](https://youtu.be/GHPtRDAv044)

---

### 2.5.2 Context Engineering with ChatGPT

Let's start by seeing what happens when AI lacks proper context.

#### Experiment: ChatGPT Without Context

1. **Open ChatGPT in a private browser window** (to avoid any existing chat context): https://chatgpt.com

2. **Enter this prompt:**
   ```
   Create a Kestra flow that loads NYC taxi data from a CSV file to BigQuery.
   The flow should extract data, upload to GCS, and load to BigQuery.
   ```

3. **Observe the results:**
   - ChatGPT will generate a Kestra flow, but it likely contains:
     - **Outdated plugin syntax** e.g., old task types that have been renamed
     - **Incorrect property names** e.g., properties that don't exist in current versions
     - **Hallucinated features** e.g., tasks, triggers or properties that never existed

#### Why Does This Happen?

Large Language Models (LLMs) like GPT models from OpenAI are trained on data up to a specific point in time (knowledge cutoff). They don't automatically know about:

- Software updates and new releases
- Renamed plugins or changed APIs

This is the fundamental challenge of using AI: **the model can only work with information it has access to.**

#### Key Learning: Context is Everything

Without proper context:

- ❌ Generic AI assistants hallucinate outdated or incorrect code
- ❌ You can't trust the output for production use

With proper context:

- ✅ AI generates accurate, current, production-ready code
- ✅ You can iterate faster by letting AI generate boilerplate workflow code

In the next section, we'll see how Kestra's AI Copilot solves this problem.

#### Videos

- **2.5.2 - Context Engineering with ChatGPT**
  [![Context Engineering with ChatGPT](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FLmnfjGKwnVU)](https://youtu.be/LmnfjGKwnVU)

---

### 2.5.3 AI Copilot in Kestra

Kestra's AI Copilot is specifically designed to generate and modify Kestra flows with full context about the latest plugins, workflow syntax, and best practices.

#### Setup AI Copilot

Before using AI Copilot, you need to configure Gemini API access in your Kestra instance.

**Step 1: Get Your Gemini API Key**

1. Visit Google AI Studio: https://aistudio.google.com/app/apikey
2. Sign in with your Google account
3. Click "Create API Key"
4. Copy the generated key (keep it secure!)

> [!WARNING]
> Never commit API keys to Git.
> Always use environment variables or Kestra's KV Store.

**Step 2: Configure Kestra AI Copilot**

Add the following to your Kestra configuration. You can do this by modifying your `docker-compose.yml` file from 2.2:

```yaml
services:
  kestra:
    environment:
      KESTRA_CONFIGURATION: |
        kestra:
          ai:
            type: gemini
            gemini:
              model-name: gemini-2.5-flash
              api-key: ${GEMINI_API_KEY}
```

Then restart Kestra from the directory that contains `docker-compose.yml`:

```bash
cd 02-workflow-orchestration
export GEMINI_API_KEY="your-api-key-here"
docker compose up -d
```

#### Exercise: ChatGPT vs AI Copilot Comparison

**Objective:** Learn why context engineering matters.

1. **Open Kestra UI** at http://localhost:8080
2. **Create a new flow** and open the Code editor panel
3. **Click the AI Copilot button** (sparkle icon ✨) in the top-right corner
4. **Enter the same exact prompt** we used with ChatGPT:
   ```
   Create a Kestra flow that loads NYC taxi data from a CSV file to BigQuery.
   The flow should extract data, upload to GCS, and load to BigQuery.
   ```
5. **Compare the outputs:**
   - ✅ Copilot generates executable, working YAML
   - ✅ Copilot uses correct plugin types and properties
   - ✅ Copilot follows current Kestra best practices

**Key Learning:** Context matters! AI Copilot has access to current Kestra documentation, so it generates Kestra flows more reliably than a generic ChatGPT assistant.

#### Videos

- **2.5.3 - AI Copilot in Kestra**
  [![AI Copilot in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F3IbjHfC8bMg)](https://youtu.be/3IbjHfC8bMg)

### 2.5.4 Bonus: Retrieval Augmented Generation (RAG)

To further learn how to provide context to your prompts, this bonus section demonstrates how to use RAG.

#### What is RAG?

**RAG (Retrieval Augmented Generation)** is a technique that:

1. **Retrieves** relevant information from your data sources
2. **Augments** the AI prompt with this context
3. **Generates** a response grounded in real data

This mitigates the hallucination problem by ensuring the AI has access to current, accurate information at query time.

#### How RAG Works in Kestra

```mermaid
graph LR
    A[Ask AI] --> B[Fetch Docs]
    B --> C[Create Embeddings]
    C --> D[Find Similar Content]
    D --> E[Add Context to Prompt]
    E --> F[LLM Answer]
```

**The Process:**

1. **Ingest documents**: Load documentation, release notes, or other data sources
2. **Create embeddings**: Convert text into vector representations using an embedding model
3. **Store embeddings**: Save vectors in Kestra's KV Store (or a vector database)
4. **Query with context**: When you ask a question, retrieve the most relevant embeddings and include the matching text in the prompt
5. **Generate response**: The LLM has real context and provides more accurate answers

#### Exercise: Retrieval With vs Without Context

**Objective:** Understand how RAG reduces hallucinations by grounding LLM responses in real data.

**Part A: Without RAG**

1. Navigate to the [`10_chat_without_rag.yaml`](flows/10_chat_without_rag.yaml) flow in your Kestra UI
2. Click **Execute**
3. Wait for the execution to complete
4. Open the **Logs** tab
5. Read the output - notice how the response about "Kestra 1.1 features" is:
   - Vague or generic
   - Potentially incorrect
   - Missing specific details
   - Based only on the model's training data (which may be outdated)

**Part B: With RAG**

1. Navigate to the [`11_chat_with_rag.yaml`](flows/11_chat_with_rag.yaml) flow
2. Click **Execute**
3. Watch the execution:
   - First task: **Ingests** Kestra 1.1 release documentation, creates **embeddings** and stores them
   - Second task: **Prompts LLM** with context retrieved from stored embeddings
4. Open the **Logs** tab
5. Compare this output with the previous one - notice how it's:
   - ✅ Specific and detailed
   - ✅ Accurate with real features from the release
   - ✅ Grounded in actual documentation

**Key Learning:** RAG (Retrieval Augmented Generation) grounds AI responses in current documentation, reducing hallucinations and producing more accurate, context-aware answers.

#### RAG Best Practices

1. **Keep documents updated**: Regularly re-ingest to ensure current information
2. **Chunk appropriately**: Break large documents into meaningful chunks
3. **Test retrieval quality**: Verify that the right documents are retrieved

#### Additional AI Resources

Kestra Documentation:

- [AI Tools Overview](https://go.kestra.io/de-zoomcamp/ai-tools)
- [AI Copilot](https://go.kestra.io/de-zoomcamp/ai-copilot)
- [RAG Workflows](https://go.kestra.io/de-zoomcamp/rag-workflows)
- [AI Workflows](https://go.kestra.io/de-zoomcamp/ai-workflows)
- [Kestra Blueprints](https://go.kestra.io/de-zoomcamp/blueprints) - Pre-built workflow examples

Kestra Plugin Documentation:

- [AI Plugin](https://go.kestra.io/de-zoomcamp/ai-plugin)
- [RAG Tasks](https://go.kestra.io/de-zoomcamp/ai-rag-task)

External Documentation:

- [Google Gemini](https://go.kestra.io/de-zoomcamp/gemini-docs)
- [Google AI Studio](https://go.kestra.io/de-zoomcamp/ai-studio)

#### Videos

- **2.5.4 (Bonus) - Retrieval Augmented Generation**
  [![Retrieval Augmented Generation](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FXuPDQ1UcNyI)](https://youtu.be/XuPDQ1UcNyI)

## 2.6 Bonus: Deploy to the Cloud (Optional)

Now that we've got all our pipelines working and we know how to quickly create new flows with Kestra's AI Copilot, we can deploy Kestra to the cloud so it can continue to orchestrate our scheduled pipelines.

In this bonus section, we'll cover how you can deploy Kestra on Google Cloud and automatically sync your workflows from a Git repository.
Note: When committing your workflows to a Git repository, make sure they don't contain any sensitive information. You can use [Secrets](https://go.kestra.io/de-zoomcamp/secret) and the [KV Store](https://go.kestra.io/de-zoomcamp/kv-store) to keep sensitive data out of your workflow logic.

#### Resources

- [Install Kestra on Google Cloud](https://go.kestra.io/de-zoomcamp/gcp-install)
- [Moving from Development to Production](https://go.kestra.io/de-zoomcamp/dev-to-prod)
- [Using Git in Kestra](https://go.kestra.io/de-zoomcamp/git)
- [Deploy Flows with GitHub Actions](https://go.kestra.io/de-zoomcamp/deploy-github-actions)

## 2.7 Additional Resources 📚

- Check [Kestra Docs](https://go.kestra.io/de-zoomcamp/docs)
- Explore our [Blueprints](https://go.kestra.io/de-zoomcamp/blueprints) library
- Browse over 600 [plugins](https://go.kestra.io/de-zoomcamp/plugins) available in Kestra
- Give us a star on [GitHub](https://go.kestra.io/de-zoomcamp/github)
- Join our [Slack community](https://go.kestra.io/de-zoomcamp/slack) if you have any questions
- Find all the videos in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist)

### Troubleshooting tips

If you face any issues with Kestra flows in Module 2, make sure to use the following Docker images/ports:

- `image: kestra/kestra:v1.1` - pin your Kestra Docker image to this version so we can ensure reproducibility; do NOT use `kestra/kestra:develop` as this is a bleeding-edge development version that might contain bugs
- `postgres:18` - make sure to pin your Postgres image to version 18
- If you run `pgAdmin` or something else on port 8080, you can adjust Kestra `docker-compose` to use a different port, e.g. change port mapping to 18080 instead of 8080, and then access Kestra UI in your browser from http://localhost:18080/ instead of from http://localhost:8080/

If you are still facing any issues, stop and remove your existing Kestra + Postgres containers and start them again using `docker-compose up -d`.
If this doesn't help, post your question on the DataTalksClub Slack or on Kestra's Slack http://kestra.io/slack.

If you encounter an error similar to:

```
BigQueryError{reason=invalid, location=null, message=Error while reading table: kestra-sandbox.zooomcamp.yellow_tripdata_2020_01, error message: CSV table references column position 17, but line contains only 14 columns.; line_number: 2103925 byte_offset_to_start_of_line: 194863028 column_index: 17 column_name: "congestion_surcharge" column_type: NUMERIC File: gs://anna-geller/yellow_tripdata_2020-01.csv}
```

it means that the CSV file you're trying to load into BigQuery has a mismatch in the number of columns between the external source table (i.e. file in GCS) and the destination table in BigQuery. This can happen when, due to network/transfer issues, the file is not fully downloaded from GitHub or not correctly uploaded to GCS. The error suggests schema issues, but that's not the case. Simply rerun the entire execution, including redownloading the CSV file and reuploading it to GCS. This should resolve the issue.

---

## Homework

See the [2026 cohort folder](../cohorts/2026/02-workflow-orchestration/homework.md)

---

# Community notes

Did you take notes? You can share them by creating a PR to this file!
* Add your notes above this line --- # Previous Cohorts * 2022: [notes](../cohorts/2022/week_2_data_ingestion#community-notes) and [videos](../cohorts/2022/week_2_data_ingestion) * 2023: [notes](../cohorts/2023/week_2_workflow_orchestration#community-notes) and [videos](../cohorts/2023/week_2_workflow_orchestration) * 2024: [notes](../cohorts/2024/02-workflow-orchestration#community-notes) and [videos](../cohorts/2024/02-workflow-orchestration) * 2025: [notes](../cohorts/2025/02-workflow-orchestration/README.md#community-notes) and [videos](../cohorts/2025/02-workflow-orchestration) ================================================ FILE: 02-workflow-orchestration/docker-compose.yml ================================================ volumes: ny_taxi_postgres_data: driver: local kestra_postgres_data: driver: local kestra_data: driver: local kestra_tmp: driver: local services: pgdatabase: image: postgres:18 environment: POSTGRES_USER: root POSTGRES_PASSWORD: root POSTGRES_DB: ny_taxi ports: - "5432:5432" volumes: - ny_taxi_postgres_data:/var/lib/postgresql depends_on: kestra: condition: service_started pgadmin: image: dpage/pgadmin4 environment: - PGADMIN_DEFAULT_EMAIL=admin@admin.com - PGADMIN_DEFAULT_PASSWORD=root ports: - "8085:80" depends_on: pgdatabase: condition: service_started kestra_postgres: image: postgres:18 volumes: - kestra_postgres_data:/var/lib/postgresql environment: POSTGRES_DB: kestra POSTGRES_USER: kestra POSTGRES_PASSWORD: k3str4 healthcheck: test: ["CMD-SHELL", "pg_isready -d $${POSTGRES_DB} -U $${POSTGRES_USER}"] interval: 30s timeout: 10s retries: 10 kestra: image: kestra/kestra:v1.1 pull_policy: always # Note that this setup with a root user is intended for development purpose. 
# Our base image runs without root, but the Docker Compose implementation needs root to access the Docker socket # To run Kestra in a rootless mode in production, see: https://kestra.io/docs/installation/podman-compose user: "root" command: server standalone volumes: - kestra_data:/app/storage - /var/run/docker.sock:/var/run/docker.sock - kestra_tmp:/tmp/kestra-wd environment: KESTRA_CONFIGURATION: | datasources: postgres: url: jdbc:postgresql://kestra_postgres:5432/kestra driverClassName: org.postgresql.Driver username: kestra password: k3str4 kestra: server: basicAuth: username: "admin@kestra.io" # it must be a valid email address password: Admin1234! repository: type: postgres storage: type: local local: basePath: "/app/storage" queue: type: postgres tasks: tmpDir: path: /tmp/kestra-wd/tmp url: http://localhost:8080/ ports: - "8080:8080" - "8081:8081" depends_on: kestra_postgres: condition: service_started ================================================ FILE: 02-workflow-orchestration/flows/01_hello_world.yaml ================================================ id: 01_hello_world namespace: zoomcamp inputs: - id: name type: STRING defaults: Will concurrency: behavior: FAIL limit: 2 variables: welcome_message: "Hello, {{ inputs.name }}!" tasks: - id: hello_message type: io.kestra.plugin.core.log.Log message: "{{ render(vars.welcome_message) }}" - id: generate_output type: io.kestra.plugin.core.debug.Return format: I was generated during this workflow. - id: sleep type: io.kestra.plugin.core.flow.Sleep duration: PT15S - id: log_output type: io.kestra.plugin.core.log.Log message: "This is an output: {{ outputs.generate_output.value }}" - id: goodbye_message type: io.kestra.plugin.core.log.Log message: "Goodbye, {{ inputs.name }}!" 
pluginDefaults: - type: io.kestra.plugin.core.log.Log values: level: ERROR triggers: - id: schedule type: io.kestra.plugin.core.trigger.Schedule cron: "0 10 * * *" inputs: name: Sarah disabled: true ================================================ FILE: 02-workflow-orchestration/flows/02_python.yaml ================================================ id: 02_python namespace: zoomcamp description: This flow will install the pip package in a Docker container, and use kestra's Python library to generate outputs (number of downloads of the Kestra Docker image) and metrics (duration of the script). tasks: - id: collect_stats type: io.kestra.plugin.scripts.python.Script taskRunner: type: io.kestra.plugin.scripts.runner.docker.Docker containerImage: python:slim dependencies: - requests - kestra script: | from kestra import Kestra import requests def get_docker_image_downloads(image_name: str = "kestra/kestra"): """Queries the Docker Hub API to get the number of downloads for a specific Docker image.""" url = f"https://hub.docker.com/v2/repositories/{image_name}/" response = requests.get(url) data = response.json() downloads = data.get('pull_count', 'Not available') return downloads downloads = get_docker_image_downloads() outputs = { 'downloads': downloads } Kestra.outputs(outputs) ================================================ FILE: 02-workflow-orchestration/flows/03_getting_started_data_pipeline.yaml ================================================ id: 03_getting_started_data_pipeline namespace: zoomcamp inputs: - id: columns_to_keep type: ARRAY itemType: STRING defaults: - brand - price tasks: - id: extract type: io.kestra.plugin.core.http.Download uri: https://dummyjson.com/products - id: transform type: io.kestra.plugin.scripts.python.Script containerImage: python:3.11-alpine inputFiles: data.json: "{{outputs.extract.uri}}" outputFiles: - "*.json" env: COLUMNS_TO_KEEP: "{{inputs.columns_to_keep}}" script: | import json import os columns_to_keep_str = 
os.getenv("COLUMNS_TO_KEEP") columns_to_keep = json.loads(columns_to_keep_str) with open("data.json", "r") as file: data = json.load(file) filtered_data = [ {column: product.get(column, "N/A") for column in columns_to_keep} for product in data["products"] ] with open("products.json", "w") as file: json.dump(filtered_data, file, indent=4) - id: query type: io.kestra.plugin.jdbc.duckdb.Queries inputFiles: products.json: "{{outputs.transform.outputFiles['products.json']}}" sql: | INSTALL json; LOAD json; SELECT brand, round(avg(price), 2) as avg_price FROM read_json_auto('{{workingDir}}/products.json') GROUP BY brand ORDER BY avg_price DESC; fetchType: STORE ================================================ FILE: 02-workflow-orchestration/flows/04_postgres_taxi.yaml ================================================ id: 04_postgres_taxi namespace: zoomcamp description: | The CSV Data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases inputs: - id: taxi type: SELECT displayName: Select taxi type values: [yellow, green] defaults: yellow - id: year type: SELECT displayName: Select year values: ["2019", "2020"] defaults: "2019" - id: month type: SELECT displayName: Select month values: ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"] defaults: "01" variables: file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv" staging_table: "public.{{inputs.taxi}}_tripdata_staging" table: "public.{{inputs.taxi}}_tripdata" data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}" tasks: - id: set_label type: io.kestra.plugin.core.execution.Labels labels: file: "{{render(vars.file)}}" taxi: "{{inputs.taxi}}" - id: extract type: io.kestra.plugin.scripts.shell.Commands outputFiles: - "*.csv" taskRunner: type: io.kestra.plugin.core.runner.Process commands: - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | 
gunzip > {{render(vars.file)}} - id: if_yellow_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'yellow'}}" then: - id: yellow_create_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.table)}} ( unique_row_id text, filename text, VendorID text, tpep_pickup_datetime timestamp, tpep_dropoff_datetime timestamp, passenger_count integer, trip_distance double precision, RatecodeID text, store_and_fwd_flag text, PULocationID text, DOLocationID text, payment_type integer, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, improvement_surcharge double precision, total_amount double precision, congestion_surcharge double precision ); - id: yellow_create_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} ( unique_row_id text, filename text, VendorID text, tpep_pickup_datetime timestamp, tpep_dropoff_datetime timestamp, passenger_count integer, trip_distance double precision, RatecodeID text, store_and_fwd_flag text, PULocationID text, DOLocationID text, payment_type integer, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, improvement_surcharge double precision, total_amount double precision, congestion_surcharge double precision ); - id: yellow_truncate_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | TRUNCATE TABLE {{render(vars.staging_table)}}; - id: yellow_copy_in_to_staging_table type: io.kestra.plugin.jdbc.postgresql.CopyIn format: CSV from: "{{render(vars.data)}}" table: "{{render(vars.staging_table)}}" header: true columns: 
[VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge] - id: yellow_add_unique_id_and_filename type: io.kestra.plugin.jdbc.postgresql.Queries sql: | UPDATE {{render(vars.staging_table)}} SET unique_row_id = md5( COALESCE(CAST(VendorID AS text), '') || COALESCE(CAST(tpep_pickup_datetime AS text), '') || COALESCE(CAST(tpep_dropoff_datetime AS text), '') || COALESCE(PULocationID, '') || COALESCE(DOLocationID, '') || COALESCE(CAST(fare_amount AS text), '') || COALESCE(CAST(trip_distance AS text), '') ), filename = '{{render(vars.file)}}'; - id: yellow_merge_data type: io.kestra.plugin.jdbc.postgresql.Queries sql: | MERGE INTO {{render(vars.table)}} AS T USING {{render(vars.staging_table)}} AS S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT ( unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge ) VALUES ( S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge ); - id: if_green_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'green'}}" then: - id: green_create_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.table)}} ( unique_row_id text, filename text, VendorID text, lpep_pickup_datetime timestamp, lpep_dropoff_datetime timestamp, store_and_fwd_flag text, 
RatecodeID text, PULocationID text, DOLocationID text, passenger_count integer, trip_distance double precision, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, ehail_fee double precision, improvement_surcharge double precision, total_amount double precision, payment_type integer, trip_type integer, congestion_surcharge double precision ); - id: green_create_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} ( unique_row_id text, filename text, VendorID text, lpep_pickup_datetime timestamp, lpep_dropoff_datetime timestamp, store_and_fwd_flag text, RatecodeID text, PULocationID text, DOLocationID text, passenger_count integer, trip_distance double precision, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, ehail_fee double precision, improvement_surcharge double precision, total_amount double precision, payment_type integer, trip_type integer, congestion_surcharge double precision ); - id: green_truncate_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | TRUNCATE TABLE {{render(vars.staging_table)}}; - id: green_copy_in_to_staging_table type: io.kestra.plugin.jdbc.postgresql.CopyIn format: CSV from: "{{render(vars.data)}}" table: "{{render(vars.staging_table)}}" header: true columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge] - id: green_add_unique_id_and_filename type: io.kestra.plugin.jdbc.postgresql.Queries sql: | UPDATE {{render(vars.staging_table)}} SET unique_row_id = md5( COALESCE(CAST(VendorID AS text), '') || COALESCE(CAST(lpep_pickup_datetime AS text), '') 
|| COALESCE(CAST(lpep_dropoff_datetime AS text), '') || COALESCE(PULocationID, '') || COALESCE(DOLocationID, '') || COALESCE(CAST(fare_amount AS text), '') || COALESCE(CAST(trip_distance AS text), '') ), filename = '{{render(vars.file)}}'; - id: green_merge_data type: io.kestra.plugin.jdbc.postgresql.Queries sql: | MERGE INTO {{render(vars.table)}} AS T USING {{render(vars.staging_table)}} AS S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT ( unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge ) VALUES ( S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge ); - id: purge_files type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles description: This will remove output files. If you'd like to explore Kestra outputs, disable it. pluginDefaults: - type: io.kestra.plugin.jdbc.postgresql values: url: jdbc:postgresql://pgdatabase:5432/ny_taxi username: root password: root ================================================ FILE: 02-workflow-orchestration/flows/05_postgres_taxi_scheduled.yaml ================================================ id: 05_postgres_taxi_scheduled namespace: zoomcamp description: | Best to add a label `backfill:true` from the UI to track executions created via a backfill. 
CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases concurrency: limit: 1 inputs: - id: taxi type: SELECT displayName: Select taxi type values: [yellow, green] defaults: yellow variables: file: "{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv" staging_table: "public.{{inputs.taxi}}_tripdata_staging" table: "public.{{inputs.taxi}}_tripdata" data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}" tasks: - id: set_label type: io.kestra.plugin.core.execution.Labels labels: file: "{{render(vars.file)}}" taxi: "{{inputs.taxi}}" - id: extract type: io.kestra.plugin.scripts.shell.Commands outputFiles: - "*.csv" taskRunner: type: io.kestra.plugin.core.runner.Process commands: - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}} - id: if_yellow_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'yellow'}}" then: - id: yellow_create_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.table)}} ( unique_row_id text, filename text, VendorID text, tpep_pickup_datetime timestamp, tpep_dropoff_datetime timestamp, passenger_count integer, trip_distance double precision, RatecodeID text, store_and_fwd_flag text, PULocationID text, DOLocationID text, payment_type integer, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, improvement_surcharge double precision, total_amount double precision, congestion_surcharge double precision ); - id: yellow_create_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} ( unique_row_id text, filename text, VendorID text, tpep_pickup_datetime timestamp, tpep_dropoff_datetime timestamp, passenger_count integer, trip_distance 
double precision, RatecodeID text, store_and_fwd_flag text, PULocationID text, DOLocationID text, payment_type integer, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, improvement_surcharge double precision, total_amount double precision, congestion_surcharge double precision ); - id: yellow_truncate_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | TRUNCATE TABLE {{render(vars.staging_table)}}; - id: yellow_copy_in_to_staging_table type: io.kestra.plugin.jdbc.postgresql.CopyIn format: CSV from: "{{render(vars.data)}}" table: "{{render(vars.staging_table)}}" header: true columns: [VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge] - id: yellow_add_unique_id_and_filename type: io.kestra.plugin.jdbc.postgresql.Queries sql: | UPDATE {{render(vars.staging_table)}} SET unique_row_id = md5( COALESCE(CAST(VendorID AS text), '') || COALESCE(CAST(tpep_pickup_datetime AS text), '') || COALESCE(CAST(tpep_dropoff_datetime AS text), '') || COALESCE(PULocationID, '') || COALESCE(DOLocationID, '') || COALESCE(CAST(fare_amount AS text), '') || COALESCE(CAST(trip_distance AS text), '') ), filename = '{{render(vars.file)}}'; - id: yellow_merge_data type: io.kestra.plugin.jdbc.postgresql.Queries sql: | MERGE INTO {{render(vars.table)}} AS T USING {{render(vars.staging_table)}} AS S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT ( unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge ) VALUES ( S.unique_row_id, 
S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge ); - id: if_green_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'green'}}" then: - id: green_create_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.table)}} ( unique_row_id text, filename text, VendorID text, lpep_pickup_datetime timestamp, lpep_dropoff_datetime timestamp, store_and_fwd_flag text, RatecodeID text, PULocationID text, DOLocationID text, passenger_count integer, trip_distance double precision, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, ehail_fee double precision, improvement_surcharge double precision, total_amount double precision, payment_type integer, trip_type integer, congestion_surcharge double precision ); - id: green_create_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} ( unique_row_id text, filename text, VendorID text, lpep_pickup_datetime timestamp, lpep_dropoff_datetime timestamp, store_and_fwd_flag text, RatecodeID text, PULocationID text, DOLocationID text, passenger_count integer, trip_distance double precision, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, ehail_fee double precision, improvement_surcharge double precision, total_amount double precision, payment_type integer, trip_type integer, congestion_surcharge double precision ); - id: green_truncate_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | TRUNCATE TABLE {{render(vars.staging_table)}}; - id: 
green_copy_in_to_staging_table type: io.kestra.plugin.jdbc.postgresql.CopyIn format: CSV from: "{{render(vars.data)}}" table: "{{render(vars.staging_table)}}" header: true columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge] - id: green_add_unique_id_and_filename type: io.kestra.plugin.jdbc.postgresql.Queries sql: | UPDATE {{render(vars.staging_table)}} SET unique_row_id = md5( COALESCE(CAST(VendorID AS text), '') || COALESCE(CAST(lpep_pickup_datetime AS text), '') || COALESCE(CAST(lpep_dropoff_datetime AS text), '') || COALESCE(PULocationID, '') || COALESCE(DOLocationID, '') || COALESCE(CAST(fare_amount AS text), '') || COALESCE(CAST(trip_distance AS text), '') ), filename = '{{render(vars.file)}}'; - id: green_merge_data type: io.kestra.plugin.jdbc.postgresql.Queries sql: | MERGE INTO {{render(vars.table)}} AS T USING {{render(vars.staging_table)}} AS S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT ( unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge ) VALUES ( S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge ); - id: purge_files type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles description: To avoid cluttering your storage, 
we will remove the downloaded files pluginDefaults: - type: io.kestra.plugin.jdbc.postgresql values: url: jdbc:postgresql://pgdatabase:5432/ny_taxi username: root password: root triggers: - id: green_schedule type: io.kestra.plugin.core.trigger.Schedule cron: "0 9 1 * *" inputs: taxi: green - id: yellow_schedule type: io.kestra.plugin.core.trigger.Schedule cron: "0 10 1 * *" inputs: taxi: yellow ================================================ FILE: 02-workflow-orchestration/flows/06_gcp_kv.yaml ================================================ id: 06_gcp_kv namespace: zoomcamp tasks: - id: gcp_project_id type: io.kestra.plugin.core.kv.Set key: GCP_PROJECT_ID kvType: STRING value: kestra-sandbox # TODO replace with your project id - id: gcp_location type: io.kestra.plugin.core.kv.Set key: GCP_LOCATION kvType: STRING value: europe-west2 - id: gcp_bucket_name type: io.kestra.plugin.core.kv.Set key: GCP_BUCKET_NAME kvType: STRING value: your-name-kestra # TODO make sure it's globally unique! - id: gcp_dataset type: io.kestra.plugin.core.kv.Set key: GCP_DATASET kvType: STRING value: zoomcamp ================================================ FILE: 02-workflow-orchestration/flows/07_gcp_setup.yaml ================================================ id: 07_gcp_setup namespace: zoomcamp tasks: - id: create_gcs_bucket type: io.kestra.plugin.gcp.gcs.CreateBucket ifExists: SKIP storageClass: REGIONAL name: "{{kv('GCP_BUCKET_NAME')}}" # make sure it's globally unique! 
- id: create_bq_dataset type: io.kestra.plugin.gcp.bigquery.CreateDataset name: "{{kv('GCP_DATASET')}}" ifExists: SKIP pluginDefaults: - type: io.kestra.plugin.gcp values: serviceAccount: "{{secret('GCP_CREDS')}}" projectId: "{{kv('GCP_PROJECT_ID')}}" location: "{{kv('GCP_LOCATION')}}" bucket: "{{kv('GCP_BUCKET_NAME')}}" ================================================ FILE: 02-workflow-orchestration/flows/08_gcp_taxi.yaml ================================================ id: 08_gcp_taxi namespace: zoomcamp description: | The CSV Data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases inputs: - id: taxi type: SELECT displayName: Select taxi type values: [yellow, green] defaults: green - id: year type: SELECT displayName: Select year values: ["2019", "2020"] defaults: "2019" allowCustomValue: true # allows you to type 2021 from the UI for the homework 🤗 - id: month type: SELECT displayName: Select month values: ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"] defaults: "01" variables: file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv" gcs_file: "gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}" table: "{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{inputs.year}}_{{inputs.month}}" data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}" tasks: - id: set_label type: io.kestra.plugin.core.execution.Labels labels: file: "{{render(vars.file)}}" taxi: "{{inputs.taxi}}" - id: extract type: io.kestra.plugin.scripts.shell.Commands outputFiles: - "*.csv" taskRunner: type: io.kestra.plugin.core.runner.Process commands: - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}} - id: upload_to_gcs type: io.kestra.plugin.gcp.gcs.Upload from: "{{render(vars.data)}}" to: "{{render(vars.gcs_file)}}" - id: if_yellow_taxi type: io.kestra.plugin.core.flow.If condition: 
"{{inputs.taxi == 'yellow'}}" then: - id: bq_yellow_tripdata type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` ( unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'), filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'), VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 
1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) PARTITION BY DATE(tpep_pickup_datetime); - id: bq_yellow_table_ext type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext` ( VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle.
This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. 
Does not include cash tips.'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) OPTIONS ( format = 'CSV', uris = ['{{render(vars.gcs_file)}}'], skip_leading_rows = 1, ignore_unknown_values = TRUE ); - id: bq_yellow_table_tmp type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` AS SELECT MD5(CONCAT( COALESCE(CAST(VendorID AS STRING), ""), COALESCE(CAST(tpep_pickup_datetime AS STRING), ""), COALESCE(CAST(tpep_dropoff_datetime AS STRING), ""), COALESCE(CAST(PULocationID AS STRING), ""), COALESCE(CAST(DOLocationID AS STRING), "") )) AS unique_row_id, "{{render(vars.file)}}" AS filename, * FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`; - id: bq_yellow_merge type: io.kestra.plugin.gcp.bigquery.Query sql: | MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge) VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge); - id: if_green_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'green'}}" then: - id: bq_green_tripdata type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` ( unique_row_id BYTES 
OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'), filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'), VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. 
Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), ehail_fee NUMERIC, improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) PARTITION BY DATE(lpep_pickup_datetime); - id: bq_green_table_ext type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext` ( VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 
1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), ehail_fee NUMERIC, improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 
1= Street-hail 2= Dispatch'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) OPTIONS ( format = 'CSV', uris = ['{{render(vars.gcs_file)}}'], skip_leading_rows = 1, ignore_unknown_values = TRUE ); - id: bq_green_table_tmp type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` AS SELECT MD5(CONCAT( COALESCE(CAST(VendorID AS STRING), ""), COALESCE(CAST(lpep_pickup_datetime AS STRING), ""), COALESCE(CAST(lpep_dropoff_datetime AS STRING), ""), COALESCE(CAST(PULocationID AS STRING), ""), COALESCE(CAST(DOLocationID AS STRING), "") )) AS unique_row_id, "{{render(vars.file)}}" AS filename, * FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`; - id: bq_green_merge type: io.kestra.plugin.gcp.bigquery.Query sql: | MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge) VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge); - id: purge_files type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles description: If you'd like to explore Kestra outputs, disable it. 
disabled: false pluginDefaults: - type: io.kestra.plugin.gcp values: serviceAccount: "{{secret('GCP_CREDS')}}" projectId: "{{kv('GCP_PROJECT_ID')}}" location: "{{kv('GCP_LOCATION')}}" bucket: "{{kv('GCP_BUCKET_NAME')}}" ================================================ FILE: 02-workflow-orchestration/flows/09_gcp_taxi_scheduled.yaml ================================================ id: 09_gcp_taxi_scheduled namespace: zoomcamp description: | Best to add a label `backfill:true` from the UI to track executions created via a backfill. CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases inputs: - id: taxi type: SELECT displayName: Select taxi type values: [yellow, green] defaults: green variables: file: "{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv" gcs_file: "gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}" table: "{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy_MM')}}" data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}" tasks: - id: set_label type: io.kestra.plugin.core.execution.Labels labels: file: "{{render(vars.file)}}" taxi: "{{inputs.taxi}}" - id: extract type: io.kestra.plugin.scripts.shell.Commands outputFiles: - "*.csv" taskRunner: type: io.kestra.plugin.core.runner.Process commands: - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}} - id: upload_to_gcs type: io.kestra.plugin.gcp.gcs.Upload from: "{{render(vars.data)}}" to: "{{render(vars.gcs_file)}}" - id: if_yellow_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'yellow'}}" then: - id: bq_yellow_tripdata type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` ( unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip 
attributes.'), filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'), VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges.
Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) PARTITION BY DATE(tpep_pickup_datetime); - id: bq_yellow_table_ext type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext` ( VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip.
1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. 
Does not include cash tips.'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) OPTIONS ( format = 'CSV', uris = ['{{render(vars.gcs_file)}}'], skip_leading_rows = 1, ignore_unknown_values = TRUE ); - id: bq_yellow_table_tmp type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` AS SELECT MD5(CONCAT( COALESCE(CAST(VendorID AS STRING), ""), COALESCE(CAST(tpep_pickup_datetime AS STRING), ""), COALESCE(CAST(tpep_dropoff_datetime AS STRING), ""), COALESCE(CAST(PULocationID AS STRING), ""), COALESCE(CAST(DOLocationID AS STRING), "") )) AS unique_row_id, "{{render(vars.file)}}" AS filename, * FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`; - id: bq_yellow_merge type: io.kestra.plugin.gcp.bigquery.Query sql: | MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge) VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge); - id: if_green_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'green'}}" then: - id: bq_green_tripdata type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` ( unique_row_id BYTES 
OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'), filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'), VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. 
Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), ehail_fee NUMERIC, improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) PARTITION BY DATE(lpep_pickup_datetime); - id: bq_green_table_ext type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext` ( VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 
1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), ehail_fee NUMERIC, improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 
1= Street-hail 2= Dispatch'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) OPTIONS ( format = 'CSV', uris = ['{{render(vars.gcs_file)}}'], skip_leading_rows = 1, ignore_unknown_values = TRUE ); - id: bq_green_table_tmp type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` AS SELECT MD5(CONCAT( COALESCE(CAST(VendorID AS STRING), ""), COALESCE(CAST(lpep_pickup_datetime AS STRING), ""), COALESCE(CAST(lpep_dropoff_datetime AS STRING), ""), COALESCE(CAST(PULocationID AS STRING), ""), COALESCE(CAST(DOLocationID AS STRING), "") )) AS unique_row_id, "{{render(vars.file)}}" AS filename, * FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`; - id: bq_green_merge type: io.kestra.plugin.gcp.bigquery.Query sql: | MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge) VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge); - id: purge_files type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles description: To avoid cluttering your storage, we will remove the downloaded files pluginDefaults: - type: io.kestra.plugin.gcp values: serviceAccount: 
"{{secret('GCP_CREDS')}}" projectId: "{{kv('GCP_PROJECT_ID')}}" location: "{{kv('GCP_LOCATION')}}" bucket: "{{kv('GCP_BUCKET_NAME')}}" triggers: - id: green_schedule type: io.kestra.plugin.core.trigger.Schedule cron: "0 9 1 * *" inputs: taxi: green - id: yellow_schedule type: io.kestra.plugin.core.trigger.Schedule cron: "0 10 1 * *" inputs: taxi: yellow ================================================ FILE: 02-workflow-orchestration/flows/10_chat_without_rag.yaml ================================================ id: 10_chat_without_rag namespace: zoomcamp description: | This flow demonstrates what happens when you query an LLM WITHOUT RAG. The model can only rely on its training data, which may be outdated or incomplete. After running this, check out 11_chat_with_rag.yaml to see how RAG fixes these issues. tasks: - id: chat_without_rag type: io.kestra.plugin.ai.completion.ChatCompletion description: Query about Kestra 1.1 features WITHOUT RAG provider: type: io.kestra.plugin.ai.provider.GoogleGemini modelName: gemini-2.5-flash apiKey: "{{ kv('GEMINI_API_KEY') }}" messages: - type: USER content: | Which features were released in Kestra 1.1? Please list at least 5 major features with brief descriptions. - id: log_results type: io.kestra.plugin.core.log.Log message: | ❌ Response WITHOUT RAG (no retrieved context): {{ outputs.chat_without_rag.textOutput }} 🤔 Did you notice that this response seems to be: - Incorrect - Vague/generic - Listing features that haven't been added in exactly this version but rather a long time ago 👉 This is why context matters. Run `11_chat_with_rag.yaml` to see the accurate, context-grounded response. 
================================================
FILE: 02-workflow-orchestration/flows/11_chat_with_rag.yaml
================================================
id: 11_chat_with_rag
namespace: zoomcamp

description: |
  This flow demonstrates RAG (Retrieval Augmented Generation) by ingesting Kestra release documentation
  and using it to answer questions accurately.
  Compare this with 10_chat_without_rag.yaml to see the difference RAG makes.

tasks:
  - id: ingest_release_notes
    type: io.kestra.plugin.ai.rag.IngestDocument
    description: Ingest Kestra 1.1 release notes to create embeddings
    provider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-001
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    drop: true
    fromExternalURLs:
      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/src/contents/blogs/release-1-1/index.md

  - id: chat_with_rag
    type: io.kestra.plugin.ai.rag.ChatCompletion
    description: Query about Kestra 1.1 features with RAG context
    chatProvider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-2.5-flash
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddingProvider:
      type: io.kestra.plugin.ai.provider.GoogleGemini
      modelName: gemini-embedding-001
      apiKey: "{{ kv('GEMINI_API_KEY') }}"
    embeddings:
      type: io.kestra.plugin.ai.embeddings.KestraKVStore
    systemMessage: |
      You are a helpful assistant that answers questions about Kestra.
      Use the provided documentation to give accurate, specific answers.
      If you don't find the information in the context, say so.
    prompt: |
      Which features were released in Kestra 1.1?
      Please list at least 5 major features with brief descriptions.

  - id: log_results
    type: io.kestra.plugin.core.log.Log
    message: |
      ✅ RAG Response (with retrieved context):
      {{ outputs.chat_with_rag.textOutput }}

      Note that this response is detailed, accurate, and grounded in the actual release documentation.
      Compare this with the output from 10_chat_without_rag.yaml.

================================================ FILE: 03-data-warehouse/README.md ================================================ # Data Warehouse and BigQuery - [Slides](https://docs.google.com/presentation/d/1a3ZoBAXFk8-EhUsd7rAZd-5p_HpltkzSeujjRGB2TAI/edit?usp=sharing) - [Big Query basic SQL](big_query.sql) # Videos ## Data Warehouse - Data Warehouse and BigQuery [![](https://markdown-videos-api.jorgenkh.no/youtube/jrHljAoD6nM)](https://youtu.be/jrHljAoD6nM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=34) ## :movie_camera: Partitioning and clustering - Partitioning vs Clustering [![](https://markdown-videos-api.jorgenkh.no/youtube/-CqXf7vhhDs)](https://youtu.be/-CqXf7vhhDs?si=p1sYQCAs8dAa7jIm&t=193&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=35) ## :movie_camera: Best practices [![](https://markdown-videos-api.jorgenkh.no/youtube/k81mLJVX08w)](https://youtu.be/k81mLJVX08w&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=36) ## :movie_camera: Internals of BigQuery [![](https://markdown-videos-api.jorgenkh.no/youtube/eduHi1inM4s)](https://youtu.be/eduHi1inM4s&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=37) ## Advanced topics ### :movie_camera: Machine Learning in Big Query [![](https://markdown-videos-api.jorgenkh.no/youtube/B-WtpB0PuG4)](https://youtu.be/B-WtpB0PuG4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=34) * [SQL for ML in BigQuery](big_query_ml.sql) **Important links** - [BigQuery ML Tutorials](https://cloud.google.com/bigquery-ml/docs/tutorials) - [BigQuery ML Reference Parameter](https://cloud.google.com/bigquery-ml/docs/analytics-reference-patterns) - [Hyper Parameter tuning](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm) - [Feature preprocessing](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-preprocess-overview) ### :movie_camera: Deploying Machine Learning model from BigQuery 
[![](https://markdown-videos-api.jorgenkh.no/youtube/BjARzEWaznU)](https://youtu.be/BjARzEWaznU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=39)

- [Steps to extract and deploy model with docker](extract_model.md)

# Homework

* [2026 Homework](../cohorts/2026/03-data-warehouse/homework.md)

# Community notes

Did you take notes? You can share them here:

* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/3_data_warehouse.md)
* [Isaac Kargar's blog post](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/30/data-engineering-w3.html)
* [Marcos Torregrosa's blog post](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-3/)
* [Notes by Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week3)
* [Notes from Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/02/week-3-data-engineering-zoomcamp-notes-data-warehouse-and-bigquery/)
* [Bigger picture summary on Data Lakes, Data Warehouses, and tooling](https://medium.com/@verazabeida/zoomcamp-week-4-b8bde661bf98), by Vera
* [Notes by froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_3_data_warehouse/notes/notes_week_03.md)
* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week3.md)
* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)
* [2024 videos transcript week3](https://drive.google.com/drive/folders/1quIiwWO-tJCruqvtlqe_Olw8nvYSmmDJ?usp=sharing) by Maria Fisher
* [Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/3a-data-warehouse/readme.md)
* [Jonah Oliver's blog post](https://www.jonahboliver.com/blog/de-zc-w3)
* [2024 - steps to send data from Mage to GCS + creating external table](https://drive.google.com/file/d/1GIi6xnS4070a8MUlIg-ozITt485_-ePB/view?usp=drive_link) by Maria Fisher
* [2024 - mage dataloader script to load the parquet files from a remote URL and push it to Google bucket as parquet file](https://github.com/amohan601/dataengineering-zoomcamp2024/blob/main/week_3_data_warehouse/mage_scripts/green_taxi_2022_v2.py) by Anju Mohan
* [Notes by HongWei](https://github.com/hwchua0209/data-engineering-zoomcamp-submission/blob/main/03-data-warehouse/README.md)
* [2025 Notes by Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/3_Data-Warehouse/README.md)
* [Notes from Horeb SEIDOU](https://spotted-hardhat-eea.notion.site/Week-3-Data-Warehouse-and-BigQuery-17c29780dc4a80c8a226f372543ae388)
* [2025 - Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/03_data_warehouse/00_notes.md)
* [2025 Gitbook Notes Tinker0425](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/module-3/introduction-to-module-3)
* [2025 Notes from Daniel Lachner](https://drive.google.com/file/d/105zjtLFi0sRqqFFgdMSCTzfcLPx2rfv4/view?usp=sharing)
* [2026 Notes from Catherine Frost](https://docs.google.com/document/d/1j3jeNnBI2fw1nq7JwEauPx2G8FybDfTqmMk7eRu0vSo/edit?tab=t.0)
* Add your notes here (above this line)

================================================
FILE: 03-data-warehouse/big_query.sql
================================================
-- Query a publicly available table
SELECT station_id, name
FROM bigquery-public-data.new_york_citibike.citibike_stations
LIMIT 100;

-- Creating external table referring to gcs path
CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.external_yellow_tripdata`
OPTIONS (
  format = 'CSV',
  uris = ['gs://nyc-tl-data/trip data/yellow_tripdata_2019-*.csv', 'gs://nyc-tl-data/trip data/yellow_tripdata_2020-*.csv']
);

-- Check yellow trip data
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata limit 10;

-- Create a non partitioned table from external table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_non_partitioned AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;

-- Create a partitioned table from external table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitioned
PARTITION BY
  DATE(tpep_pickup_datetime) AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;

-- Impact of partition
-- Scanning 1.6GB of data
SELECT DISTINCT(VendorID)
FROM taxi-rides-ny.nytaxi.yellow_tripdata_non_partitioned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

-- Scanning ~106 MB of data
SELECT DISTINCT(VendorID)
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitioned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

-- Let's look into the partitions
SELECT table_name, partition_id, total_rows
FROM `nytaxi.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'yellow_tripdata_partitioned'
ORDER BY total_rows DESC;

-- Creating a partition and cluster table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;

-- Query scans 1.1 GB
SELECT count(*) as trips
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitioned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'
  AND VendorID=1;

-- Query scans 864.5 MB
SELECT count(*) as trips
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'
  AND VendorID=1;

================================================
FILE: 03-data-warehouse/big_query_hw.sql
================================================
CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.fhv_tripdata`
OPTIONS (
  format = 'CSV',
  uris = ['gs://nyc-tl-data/trip data/fhv_tripdata_2019-*.csv']
);

SELECT count(*) FROM `taxi-rides-ny.nytaxi.fhv_tripdata`;

SELECT COUNT(DISTINCT(dispatching_base_num)) FROM `taxi-rides-ny.nytaxi.fhv_tripdata`;

CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.fhv_nonpartitioned_tripdata`
AS SELECT * FROM `taxi-rides-ny.nytaxi.fhv_tripdata`;

CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.fhv_partitioned_tripdata`
PARTITION BY DATE(dropoff_datetime)
CLUSTER BY dispatching_base_num AS (
  SELECT * FROM `taxi-rides-ny.nytaxi.fhv_tripdata`
);

SELECT count(*) FROM `taxi-rides-ny.nytaxi.fhv_nonpartitioned_tripdata`
WHERE DATE(dropoff_datetime) BETWEEN '2019-01-01' AND '2019-03-31'
  AND dispatching_base_num IN ('B00987', 'B02279', 'B02060');

SELECT count(*) FROM `taxi-rides-ny.nytaxi.fhv_partitioned_tripdata`
WHERE DATE(dropoff_datetime) BETWEEN '2019-01-01' AND '2019-03-31'
  AND dispatching_base_num IN ('B00987', 'B02279', 'B02060');

================================================
FILE: 03-data-warehouse/big_query_ml.sql
================================================
-- SELECT THE COLUMNS OF INTEREST
SELECT passenger_count, trip_distance, PULocationID, DOLocationID, payment_type, fare_amount, tolls_amount, tip_amount
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
WHERE fare_amount != 0;

-- CREATE A ML TABLE WITH APPROPRIATE TYPE
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_ml` (
  `passenger_count` INTEGER,
  `trip_distance` FLOAT64,
  `PULocationID` STRING,
  `DOLocationID` STRING,
  `payment_type` STRING,
  `fare_amount` FLOAT64,
  `tolls_amount` FLOAT64,
  `tip_amount` FLOAT64
) AS (
  SELECT passenger_count, trip_distance, CAST(PULocationID AS STRING), CAST(DOLocationID AS STRING),
    CAST(payment_type AS STRING), fare_amount, tolls_amount, tip_amount
  FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
  WHERE fare_amount != 0
);

-- CREATE MODEL WITH DEFAULT SETTING
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_model`
OPTIONS
  (model_type='linear_reg',
  input_label_cols=['tip_amount'],
  DATA_SPLIT_METHOD='AUTO_SPLIT') AS
SELECT *
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE tip_amount IS NOT NULL;

-- CHECK FEATURES
SELECT * FROM ML.FEATURE_INFO(MODEL `taxi-rides-ny.nytaxi.tip_model`);

-- EVALUATE THE MODEL
SELECT *
FROM ML.EVALUATE(MODEL `taxi-rides-ny.nytaxi.tip_model`,
  (
    SELECT *
    FROM `taxi-rides-ny.nytaxi.yellow_tripdata_ml`
    WHERE tip_amount IS NOT NULL
  ));

-- PREDICT THE MODEL
SELECT *
FROM ML.PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,
  (
    SELECT *
    FROM `taxi-rides-ny.nytaxi.yellow_tripdata_ml`
    WHERE tip_amount IS NOT NULL
  ));

-- PREDICT AND EXPLAIN
SELECT *
FROM ML.EXPLAIN_PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,
  (
    SELECT *
    FROM `taxi-rides-ny.nytaxi.yellow_tripdata_ml`
    WHERE tip_amount IS NOT NULL
  ), STRUCT(3 as top_k_features));

-- HYPERPARAMETER TUNING
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_hyperparam_model`
OPTIONS
  (model_type='linear_reg',
  input_label_cols=['tip_amount'],
  DATA_SPLIT_METHOD='AUTO_SPLIT',
  num_trials=5,
  max_parallel_trials=2,
  l1_reg=hparam_range(0, 20),
  l2_reg=hparam_candidates([0, 0.1, 1, 10])) AS
SELECT *
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE tip_amount IS NOT NULL;

================================================
FILE: 03-data-warehouse/extract_model.md
================================================
## Model deployment
[Tutorial](https://cloud.google.com/bigquery-ml/docs/export-model-tutorial)

### Steps
- gcloud auth login
- bq 
--project_id taxi-rides-ny extract -m nytaxi.tip_model gs://taxi_ml_model/tip_model
- mkdir /tmp/model
- gsutil cp -r gs://taxi_ml_model/tip_model /tmp/model
- mkdir -p serving_dir/tip_model/1
- cp -r /tmp/model/tip_model/* serving_dir/tip_model/1
- docker pull tensorflow/serving
- docker run -p 8501:8501 --mount type=bind,source=`pwd`/serving_dir/tip_model,target=/models/tip_model -e MODEL_NAME=tip_model -t tensorflow/serving &
- curl -d '{"instances": [{"passenger_count":1, "trip_distance":12.2, "PULocationID":"193", "DOLocationID":"264", "payment_type":"2","fare_amount":20.4,"tolls_amount":0.0}]}' -X POST http://localhost:8501/v1/models/tip_model:predict
- http://localhost:8501/v1/models/tip_model

================================================
FILE: 03-data-warehouse/extras/.env-example
================================================
GCP_GCS_BUCKET="your_bucket_name"
GOOGLE_APPLICATION_CREDENTIALS=Path/to/key/GCP_service_account_key.json

================================================
FILE: 03-data-warehouse/extras/.gitignore
================================================
*.env
*.parquet
*.csv*

================================================
FILE: 03-data-warehouse/extras/README.md
================================================
Quick hack to load files directly to GCS, without Airflow. Downloads csv files from https://nyc-tlc.s3.amazonaws.com/trip+data/ and uploads them to your Cloud Storage Account as parquet files.

1. Install pre-reqs with `uv sync`
2. Run: `uv run python web_to_gcs_with_progress_bar.py`
2. 
or Run: `uv run python web_to_gcs.py` for less verbose output (if you have a fast upload connection)

================================================
FILE: 03-data-warehouse/extras/pyproject.toml
================================================
[project]
name = "extras"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.14"
dependencies = [
    "google-cloud-storage>=3.8.0",
    "pandas>=3.0.0",
    "pyarrow>=23.0.0",
    "python-dotenv>=1.2.1",
    "requests>=2.32.5",
    "tqdm>=4.67.1",
]

================================================
FILE: 03-data-warehouse/extras/web_to_gcs.py
================================================
import os
import requests
import pandas as pd
from google.cloud import storage
from dotenv import load_dotenv

"""
Pre-reqs:
1. run `uv sync` from this 'extras' folder (create venv and install dependencies from pyproject.toml)
2. rename .env-example to .env (not committed thanks to .gitignore)
3. in .env,
  - set GCP_GCS_BUCKET as your bucket or change default value of BUCKET
  - Set GOOGLE_APPLICATION_CREDENTIALS to your project/service-account json key (or don't set it if you use google ADC)
"""

# load env vars from .env
load_dotenv()

# services = ['fhv','green','yellow']
init_url = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"
# if not done in .env, switch out the default bucketname
BUCKET = os.environ.get("GCP_GCS_BUCKET", "dtc-data-lake-bucketname")


def upload_to_gcs(bucket, object_name, local_file):
    """
    Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
    """
    # # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.
    # # (Ref: https://github.com/googleapis/python-storage/issues/74)
    # storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
    # storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB

    client = storage.Client()
    bucket = client.bucket(bucket)
    blob = bucket.blob(object_name)
    blob.upload_from_filename(local_file)


def web_to_gcs(year, service):
    for i in range(12):
        # sets the month part of the file_name string
        month = "0" + str(i + 1)
        month = month[-2:]

        # csv file_name
        file_name = f"{service}_tripdata_{year}-{month}.csv.gz"

        # download it using requests and save it locally
        request_url = f"{init_url}{service}/{file_name}"
        r = requests.get(request_url)
        with open(file_name, "wb") as f:
            f.write(r.content)
        print(f"Local: {file_name}")

        # read it back and convert it into a parquet file
        # enforce types so parquet columns will directly have good types
        # (as we did in module 1 in ingest.py script)
        dtypes = {
            "VendorID": "Int64",
            "RatecodeID": "Int64",
            "PULocationID": "Int64",
            "DOLocationID": "Int64",
            "passenger_count": "Int64",
            "payment_type": "Int64",
            "trip_type": "Int64",  # only in green but ignored if missing column
            "store_and_fwd_flag": "string",
            "trip_distance": "float64",
            "fare_amount": "float64",
            "extra": "float64",
            "mta_tax": "float64",
            "tip_amount": "float64",
            "tolls_amount": "float64",
            "ehailfee": "float64",  # only in green but ignored if missing column
            "improvement_surcharge": "float64",
            "total_amount": "float64",
            "congestion_surcharge": "float64",
        }
        if service == "yellow":
            parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
        else:
            parse_dates = ["lpep_pickup_datetime", "lpep_dropoff_datetime"]

        df = pd.read_csv(
            file_name, dtype=dtypes, parse_dates=parse_dates, compression="gzip"
        )
        file_name = file_name.replace(".csv.gz", ".parquet")
        df.to_parquet(file_name, engine="pyarrow")
        print(f"Parquet: {file_name}")

        # upload it to gcs
        upload_to_gcs(BUCKET, f"{service}/{file_name}", file_name)
        print(f"GCS: {service}/{file_name}")


web_to_gcs("2019", "green")
web_to_gcs("2020", "green")
web_to_gcs("2021", "green")  # fails when reaching 08 (expected, the file is not on GitHub :)
# web_to_gcs("2019", "yellow")
# web_to_gcs("2020", "yellow")
# web_to_gcs("2021", "yellow")  # fails when reaching 08 (expected, the file is not on GitHub :)

================================================
FILE: 03-data-warehouse/extras/web_to_gcs_with_progress_bar.py
================================================
import os
import requests
import pandas as pd
from google.cloud import storage
from dotenv import load_dotenv
from tqdm import tqdm
import gzip
import pyarrow as pa
import pyarrow.parquet as pq

"""
Pre-reqs:
1. run `uv sync` from this 'extras' folder (create venv and install dependencies from pyproject.toml)
2. rename .env-example to .env (not committed thanks to .gitignore)
3. in .env,
  - set GCP_GCS_BUCKET as your bucket or change default value of BUCKET
  - Set GOOGLE_APPLICATION_CREDENTIALS to your project/service-account json key (or don't set it if you use google ADC)
"""

# load env vars from .env
load_dotenv()

# services = ['fhv','green','yellow']
init_url = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/"
# if not done in .env, switch out the default bucketname
BUCKET = os.environ.get("GCP_GCS_BUCKET", "dtc-data-lake-bucketname")


def download_with_progress(url: str, local_path: str, desc: str = "Downloading"):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total = int(r.headers.get("content-length", 0))

        # Configure tqdm for bytes
        with (
            open(local_path, "wb") as f,
            tqdm(
                total=total,
                unit="B",
                unit_scale=True,
                unit_divisor=1024,
                desc=desc,
            ) as bar,
        ):
            for chunk in r.iter_content(chunk_size=1024 * 1024):  # 1 MB
                if not chunk:
                    continue
                size = f.write(chunk)
                bar.update(size)


def csv_to_parquet_with_progress(
    csv_path: str, parquet_path: str, service_color: str, chunksize: int = 100_000
):
    # 1) Count rows (gzip-aware)
    with gzip.open(csv_path, mode="rt") as f:
        total_rows = sum(1 for _ in f) - 1  # minus header

    if total_rows <= 0:
        raise ValueError("CSV appears to be empty")

    # 2) Read in chunks with fixed dtypes so parquet columns will directly have good types
    # (as we did in module 1 in ingest.py script)
    dtypes = {
        "VendorID": "Int64",
        "RatecodeID": "Int64",
        "PULocationID": "Int64",
        "DOLocationID": "Int64",
        "passenger_count": "Int64",
        "payment_type": "Int64",
        "trip_type": "Int64",  # only in green but ignored if missing column
        "store_and_fwd_flag": "string",
        "trip_distance": "float64",
        "fare_amount": "float64",
        "extra": "float64",
        "mta_tax": "float64",
        "tip_amount": "float64",
        "tolls_amount": "float64",
        "ehailfee": "float64",  # only in green but ignored if missing column
        "improvement_surcharge": "float64",
        "total_amount": "float64",
        "congestion_surcharge": "float64",
    }
    if service_color == "yellow":
        parse_dates = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
    else:
        parse_dates = ["lpep_pickup_datetime", "lpep_dropoff_datetime"]

    reader = pd.read_csv(
        csv_path,
        dtype=dtypes,
        parse_dates=parse_dates,
        compression="gzip",
        chunksize=chunksize,
        low_memory=False,
    )

    writer = None
    with tqdm(total=total_rows, unit="rows", desc=f"Parquet {csv_path}") as bar:
        for chunk in reader:
            table = pa.Table.from_pandas(chunk)
            if writer is None:
                writer = pq.ParquetWriter(parquet_path, table.schema)
            else:
                # Optional safety: align to first schema
                table = table.cast(writer.schema)
            writer.write_table(table)
            bar.update(len(chunk))

    if writer is not None:
        writer.close()


def upload_to_gcs_with_progress(bucket: str, object_name: str, local_file: str):
    # # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.
    # # (Ref: https://github.com/googleapis/python-storage/issues/74)
    # Optional: tune chunk size (must be multiple of 256 KiB)
    storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
    storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB

    client = storage.Client()
    bucket_obj = client.bucket(bucket)
    blob = bucket_obj.blob(object_name)

    if blob.exists(client):
        print(f"Skipping upload, already in GCS: gs://{bucket}/{object_name}")
        return

    file_size = os.path.getsize(local_file)

    with open(local_file, "rb") as f:
        with tqdm.wrapattr(
            f,
            "read",
            total=file_size,
            miniters=1,
            unit="B",
            unit_scale=True,
            unit_divisor=1024,
            desc=f"Uploading {os.path.basename(local_file)}",
        ) as wrapped_file:
            blob.upload_from_file(
                wrapped_file,
                size=file_size,  # important so the library knows total bytes
            )

    print(f"Uploaded to GCS: gs://{bucket}/{object_name}")


def web_to_gcs(year, service):
    client = storage.Client()
    bucket_obj = client.bucket(BUCKET)

    for i in tqdm(range(12), desc=f"{service} {year}", unit="month"):
        month = f"{i + 1:02d}"
        csv_file_name = f"{service}_tripdata_{year}-{month}.csv.gz"
        parquet_file_name = csv_file_name.replace(".csv.gz", ".parquet")
        object_name = f"{service}/{parquet_file_name}"

        # 1) Check if parquet already in GCS
        blob = bucket_obj.blob(object_name)
        if blob.exists(client):
            print(f"Already in GCS, skipping: gs://{BUCKET}/{object_name}")
            continue

        # 2) Check if CSV already downloaded locally
        if os.path.exists(csv_file_name):
            print(f"CSV already exists locally, skipping download: {csv_file_name}")
        else:
            request_url = f"{init_url}{service}/{csv_file_name}"
            download_with_progress(
                request_url, csv_file_name, desc=f"Downloading {csv_file_name}"
            )

        # 3) Check if Parquet already exists locally
        if os.path.exists(parquet_file_name):
            print(
                f"Parquet already exists locally, skipping conversion: {parquet_file_name}"
            )
        else:
            csv_to_parquet_with_progress(csv_file_name, parquet_file_name, service)
            print(f"Parquet: {parquet_file_name}")

        # 4) Upload with per-byte progress bar
upload_to_gcs_with_progress(BUCKET, object_name, parquet_file_name) web_to_gcs("2019", "green") web_to_gcs("2020", "green") web_to_gcs( "2021", "green" ) # will fail when reaching 08 (normal, file does not exists in github :) # web_to_gcs("2019", "yellow") # web_to_gcs("2020", "yellow") # web_to_gcs("2021", "yellow") # will fail when reaching 08 (normal, file does not exists in github :) ================================================ FILE: 04-analytics-engineering/README.md ================================================ # Module 4: Analytics Engineering Goal: Transforming the data loaded in DWH into Analytical Views developing a [dbt project](taxi_rides_ny/README.md). ### Prerequisites The prerequisites depend on which setup path you choose: **For Cloud Setup (BigQuery):** - Completed [Module 3: Data Warehouse](../03-data-warehouse/) with: - A GCP project with BigQuery enabled - Service account with BigQuery permissions - NYC taxi data loaded into BigQuery (yellow and green taxi data for 2019-2020) **For Local Setup (DuckDB):** - No prerequisites! The local setup guide will walk you through downloading and loading the data. > [!NOTE] > This module focuses on **yellow and green taxi data** (2019-2020). While Module 3 may have included FHV data, it is not used in this dbt project. 
## Setting up your environment

Choose your setup path:

### 🏠 [Local Setup](setup/local_setup.md)
- **Stack**: DuckDB + dbt Core
- **Cost**: Free
- [→ Get Started](setup/local_setup.md)

### ☁️ [Cloud Setup](setup/cloud_setup.md)
- **Stack**: BigQuery + dbt Cloud
- **Cost**: Free tier available (dbt Cloud Developer), BigQuery costs vary
- **Requires**: Completed Module 3 with BigQuery data
- [→ Get Started](setup/cloud_setup.md)

## Content

### Introduction to Analytics Engineering

[![](https://markdown-videos-api.jorgenkh.no/youtube/HxMIsPrIyGQ)](https://www.youtube.com/watch?v=HxMIsPrIyGQ)

### Introduction to data modeling

[![](https://markdown-videos-api.jorgenkh.no/youtube/uF76d5EmdtU)](https://www.youtube.com/watch?v=uF76d5EmdtU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=40)

### What is dbt?

[![](https://markdown-videos-api.jorgenkh.no/youtube/gsKuETFJr54)](https://www.youtube.com/watch?v=gsKuETFJr54&list=PLaNLNpjZpzwgneiI-Gl8df8GCsPYp_6Bs&index=5)

### Differences between dbt Core and dbt Cloud

[![](https://markdown-videos-api.jorgenkh.no/youtube/auzcdLRyEIk)](https://www.youtube.com/watch?v=auzcdLRyEIk)

### Project Setup

| Alternative A | Alternative B |
|-----------------------------|--------------------------------|
| BigQuery + dbt Platform | DuckDB + dbt core |
| [![](https://markdown-videos-api.jorgenkh.no/youtube/GFbwlrt6f54)](https://www.youtube.com/watch?v=GFbwlrt6f54) | [![](https://markdown-videos-api.jorgenkh.no/youtube/GoFAbJYfvlw)](https://www.youtube.com/watch?v=GoFAbJYfvlw) |

### dbt Course

| dbt Project Structure | dbt Sources | dbt Models | Seeds and Macros |
|-----------------------|-------------|------------|------------------|
| [![](https://markdown-videos-api.jorgenkh.no/youtube/2dYDS4OQbT0)](https://www.youtube.com/watch?v=2dYDS4OQbT0) | [![](https://markdown-videos-api.jorgenkh.no/youtube/7CrrXazV_8k)](https://www.youtube.com/watch?v=7CrrXazV_8k) | [![](https://markdown-videos-api.jorgenkh.no/youtube/JQYz-8sl1aQ)](https://www.youtube.com/watch?v=JQYz-8sl1aQ) | [![](https://markdown-videos-api.jorgenkh.no/youtube/lT4fmTDEqVk)](https://www.youtube.com/watch?v=lT4fmTDEqVk) |

| dbt Tests | Documentation | dbt Packages | dbt Commands |
|-----------|---------------|----------------------|---------------|
| [![](https://markdown-videos-api.jorgenkh.no/youtube/bvZ-rJm7uMU)](https://www.youtube.com/watch?v=bvZ-rJm7uMU) | [![](https://markdown-videos-api.jorgenkh.no/youtube/UqoWyMjcqrA)](https://www.youtube.com/watch?v=UqoWyMjcqrA) | [![](https://markdown-videos-api.jorgenkh.no/youtube/KfhUA9Kfp8Y)](https://www.youtube.com/watch?v=KfhUA9Kfp8Y) | [![](https://markdown-videos-api.jorgenkh.no/youtube/t4OeWHW3SsA)](https://www.youtube.com/watch?v=t4OeWHW3SsA) |

## Troubleshooting

- [DuckDB Troubleshooting Guide](setup/duckdb_troubleshooting.md): if you're getting OOM errors during `dbt build` with DuckDB

## Extra resources

> [!NOTE]
> If you find the videos above overwhelming, we recommend completing the [dbt Fundamentals](https://learn.getdbt.com/courses/dbt-fundamentals) course and then rewatching the module. It provides a solid foundation for all the key concepts you need in this module.

## SQL refresher

The homework for this module focuses heavily on window functions and CTEs. If you need a refresher on these topics, you can refer to these notes.

* [SQL refresher](refreshers/SQL.md)

## Homework

* [2026 Homework](../cohorts/2026/04-analytics-engineering/homework.md)

# Community notes

Did you take notes? You can share them here

* [Slides used in previous years](https://docs.google.com/presentation/d/1xSll_jv0T8JF4rYZvLHfkJXYqUjPtThA/edit?usp=sharing&ouid=114544032874539580154&rtpof=true&sd=true)
* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/4_analytics.md)
* [Sandy's DE learning blog](https://learningdataengineering540969211.wordpress.com/2022/02/17/week-4-setting-up-dbt-cloud-with-bigquery/)
* [Notes by Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week4)
* [Marcos Torregrosa's blog (Spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-4/)
* [Notes by froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_4_analytics_engineering/notes/notes_week_04.md)
* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week4.md)
* [Setting up Prefect with dbt by Vera](https://medium.com/@verazabeida/zoomcamp-week-5-5b6a9d53a3a0)
* [Blog by Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/02/week-4-data-engineering-zoomcamp-notes-analytics-engineering-and-dbt/)
* [Setting up DBT with BigQuery by Tofag](https://medium.com/@fagbuyit/setting-up-your-dbt-cloud-dej-9-d18e5b7c96ba)
* [Blog post by Dewi Oktaviani](https://medium.com/@oktavianidewi/de-zoomcamp-2023-learning-week-4-analytics-engineering-with-dbt-53f781803d3e)
* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)
* [Notes from Balaji](https://github.com/Balajirvp/DE-Zoomcamp/blob/main/Week%204/Data%20Engineering%20Zoomcamp%20Week%204.ipynb)
* [Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/4-analytics-engineering/readme.md)
* [2024 - Videos transcript week4](https://drive.google.com/drive/folders/1V2sHWOotPEMQTdMT4IMki1fbMPTn3jOP?usp=drive)
* [Blog Post](https://www.jonahboliver.com/blog/de-zc-w4) by Jonah Oliver
* [2025 Notes by Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/4_Analytics-Engineering/README.md)
* [2025 Notes by Horeb SEIDOU](https://spotted-hardhat-eea.notion.site/Week-4-Analytics-Engineering-18929780dc4a808692e4e0ee488bf49c?pvs=74)
* [2025 Notes by Daniel Lachner](https://github.com/mossdet/dlp_data_eng/blob/main/Notes/04_01_Analytics_Engineering.pdf)
* [2026 Notes by Sharad K. Gupta](https://github.com/sharadgupta27/data-engineering/blob/main/Notes/dbt_commands.md)
* [Analytical Engineering overview](https://github.com/khanhnguyen7802/DataEngineer101/tree/main/week4-analytics-engineering#readme)
* [2026 Notes about dbt](https://github.com/khanhnguyen7802/DataEngineer101/blob/main/week4-analytics-engineering/dbt_installation.md) | [dbt + Duckdb setup using Docker](https://github.com/khanhnguyen7802/DataEngineer101/blob/main/week4-analytics-engineering/dbt_installation.md) by Khanh Nguyen
* Add your notes here (above this line)

================================================
FILE: 04-analytics-engineering/class_notes/4_1_1_analytics_engineering_basics.md
================================================
# DE Zoomcamp 4.1.1: Analytics Engineering Basics

> 📄 Video: [Analytics Engineering Basics](https://www.youtube.com/watch?v=uF76d5EmdtU)
> 📄 Further reading: [What is Analytics Engineering?](https://docs.getdbt.com/docs/introduction)
> 📄 Kimball's Dimensional Modeling: *The Data Warehouse Toolkit* (Ralph Kimball & Margy Ross)

This is the kickoff video for Module 4. No hands-on coding here; it's all about setting the stage: why analytics engineering exists, what it actually does, and which data modeling concepts we'll be leaning on for the rest of the module. Worth sitting with before diving into the dbt stuff.

---

## Why analytics engineering exists

A few shifts in the data world created a gap that nobody was filling:

- **Cloud data warehouses** (BigQuery, Snowflake, Redshift) made storage and compute cheap. You no longer have to be surgical about what data you load.
- **EL tools** like Fivetran and Stitch made getting data into the warehouse almost trivial: the extract and load steps are basically automated now.
- **SQL-first BI tools** like Looker brought version control into the data workflow, and tools like Mode enabled self-service analytics for business users.
- **Data governance** became a bigger conversation as more people started touching data.

All of this changed how data teams work and how stakeholders consume data. But it left a gap between the people building the infrastructure and the people using the data.

### The traditional data team

In the old model you had three roles and a pretty clean split:

- **Data Engineer**: builds and maintains the infrastructure. Great software engineer, but not necessarily close to how the business actually uses the data.
- **Data Analyst**: uses the data to answer questions and solve business problems. Understands the business well, but not trained as a software engineer.
- **Data Scientist**: similar story to the analyst. Writing more and more code these days, but software engineering best practices weren't part of the training.

### The gap

Analysts and scientists are writing more code, but they weren't trained for it. Engineers are great at building systems, but they don't always know how the data gets consumed downstream. Nobody was bridging that gap.

### Analytics Engineer

The analytics engineer is the bridge. They bring software engineering best practices (version control, testing, documentation, modularity) into the work that analysts and scientists are already doing. It's a role that sits at the intersection of the data engineer and the data analyst.

In terms of the toolchain, an analytics engineer might touch:

- **Data loading**: tools like Fivetran, Stitch (the EL layer)
- **Data storing**: cloud data warehouses, shared territory with data engineers
- **Data modeling**: this is the core of it. Tools like dbt or Dataform. This is where most of Module 4 lives.
- **Data presentation**: BI tools like Google Looker Studio. The end product that business users actually see.

The focus this week is on modeling and presentation: everything in between "data is in the warehouse" and "business user sees a dashboard."

---

## ETL vs ELT: a quick recap

Two philosophies for getting data transformed and ready:

**ETL (Extract → Transform → Load)**: you transform the data *before* it hits the warehouse. Takes longer to set up because the transformation logic has to be built first, but the data in the warehouse is clean and stable from day one.

**ELT (Extract → Load → Transform)**: you load the raw data first, then transform it *inside* the warehouse. Faster and more flexible. This is the approach that cloud warehouses made possible: storage is cheap, so just load everything and figure out the transformations later.
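
To make the difference between the two orderings concrete, here is a minimal sketch that runs both on the same three rows. It assumes SQLite (Python's built-in `sqlite3`) as a stand-in for a real warehouse; the sample rows, the table names `raw_trips` and `trips`, and the helpers `run_etl` and `run_elt` are all hypothetical and only exist for this illustration.

```python
import sqlite3

# Hypothetical raw extract: everything arrives as text, as it would from a CSV.
RAW_TRIPS = [
    ("1", "2019-01-01 00:10:00", "12.50"),
    ("2", "2019-01-01 00:20:00", "7.00"),
    ("", "2019-01-01 00:30:00", "3.25"),  # missing vendor id
]

QUERY = "SELECT vendor_id, fare_amount FROM trips ORDER BY vendor_id"


def run_etl(rows):
    """ETL: clean the rows in Python first, then load only the clean result."""
    cleaned = [(int(v), ts, float(fare)) for v, ts, fare in rows if v]
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE trips (vendor_id INTEGER, pickup_datetime TEXT, fare_amount REAL)"
    )
    con.executemany("INSERT INTO trips VALUES (?, ?, ?)", cleaned)
    return con.execute(QUERY).fetchall()


def run_elt(rows):
    """ELT: load the raw rows untouched, then transform with SQL in the database."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE raw_trips (vendor_id TEXT, pickup_datetime TEXT, fare_amount TEXT)"
    )
    con.executemany("INSERT INTO raw_trips VALUES (?, ?, ?)", rows)
    # The "T" happens here, in SQL, after the load.
    con.execute(
        """
        CREATE VIEW trips AS
        SELECT CAST(vendor_id AS INTEGER) AS vendor_id,
               pickup_datetime,
               CAST(fare_amount AS REAL) AS fare_amount
        FROM raw_trips
        WHERE vendor_id <> ''
        """
    )
    return con.execute(QUERY).fetchall()


etl_result = run_etl(RAW_TRIPS)
elt_result = run_elt(RAW_TRIPS)
print(etl_result == elt_result)
```

Both paths end with the same clean rows. The practical difference is that the ELT version still has `raw_trips` available, so a changed transformation only means rewriting the view, not re-extracting the data.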

ELT is the dominant approach now, and it's the one we'll be working with. dbt fits squarely into the "T" of ELT: it runs transformations inside the warehouse using SQL.

---

## Dimensional Modeling: the key concepts

This is Kimball's framework, and it's the main mental model for how we'll structure our data this week. The goal is twofold: make the data **understandable to business users**, and make **queries fast**.

Note: unlike third normal form (3NF), dimensional modeling deliberately allows some data redundancy. The priority is usability and performance, not eliminating duplication.

### Fact tables vs Dimension tables (Star Schema)

The two building blocks:

- **Fact tables**: measurements, metrics, business events. Think of them as **verbs**. "A sale happened." "An order was placed." They correspond to a business process.
- **Dimension tables**: the context around those facts. Think of them as **nouns**. "Who bought it? What product? When?" They correspond to a business entity like a customer or a product.

Together they form a **star schema**: the fact table in the center, dimension tables radiating out around it. It's the classic layout you'll see in most data warehouses.

### The Kitchen Analogy

Kimball's book uses a restaurant analogy to describe how data flows through a warehouse. It maps pretty cleanly onto what we'll be doing in the project:

- **Staging area (the pantry)**: raw data lands here. Not meant for business users. Only people who know what they're doing should be poking around in it.
- **Processing area (the kitchen)**: this is where raw data gets transformed into proper data models. Again, limited to the people doing the cooking (the data engineers and analytics engineers). The focus here is on efficiency and following standards.
- **Presentation area (the dining hall)**: the final, polished output. This is what business stakeholders actually see and interact with. Clean, structured, ready to consume.
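
The three areas and the fact/dimension split can be sketched in a few lines. This is only an illustration under stated assumptions: SQLite stands in for the warehouse, and the tables `stg_trips`, `dim_zones`, and `fct_trips`, their columns, and the sample rows are hypothetical rather than the course's actual models.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Staging area: raw data lands here as-is (hypothetical sample rows).
con.execute("CREATE TABLE stg_trips (pickup_zone_id INTEGER, fare_amount REAL)")
con.executemany(
    "INSERT INTO stg_trips VALUES (?, ?)",
    [(1, 10.0), (1, 5.5), (2, 20.0), (None, 3.0)],
)

# Dimension table (a "noun"): one row per zone, a business entity.
con.execute("CREATE TABLE dim_zones (zone_id INTEGER PRIMARY KEY, zone_name TEXT)")
con.executemany(
    "INSERT INTO dim_zones VALUES (?, ?)", [(1, "Airport"), (2, "Downtown")]
)

# Processing area: build the fact table (a "verb": a trip happened),
# dropping rows that cannot be tied to a zone.
con.execute(
    """
    CREATE TABLE fct_trips AS
    SELECT pickup_zone_id AS zone_id, fare_amount
    FROM stg_trips
    WHERE pickup_zone_id IS NOT NULL
    """
)

# Presentation area: the kind of query a dashboard would run on the star schema.
revenue_by_zone = con.execute(
    """
    SELECT z.zone_name, SUM(f.fare_amount) AS revenue
    FROM fct_trips AS f
    JOIN dim_zones AS z ON z.zone_id = f.zone_id
    GROUP BY z.zone_name
    ORDER BY z.zone_name
    """
).fetchall()

print(revenue_by_zone)
```

The business-facing query only touches `fct_trips` and `dim_zones`; the staging table stays out of sight, which is the point of the layering.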

We'll be building exactly this layered structure in our dbt project throughout the module.

================================================
FILE: 04-analytics-engineering/class_notes/4_1_2_what_is_dbt.md
================================================
# DE Zoomcamp 4.1.2: What is dbt?

> 📄 Video: [What is dbt?](https://www.youtube.com/watch?v=gsKuETFJr54)
> 📄 Official docs: [Introduction to dbt](https://docs.getdbt.com/docs/introduction)
> 📄 dbt Cloud vs Core: [Choose your dbt](https://docs.getdbt.com/docs/cloud/about-cloud/dbt-cloud-features)

This is the big-picture overview of dbt before we start building anything. What it is, what problems it solves, and how we'll be using it in the course. No hands-on work yet, just the framing.

---

## What is dbt?

dbt is a transformation workflow tool. It sits on top of your data warehouse and helps you turn raw data into something useful for downstream consumers (analysts, BI tools, ML pipelines, whatever needs clean, structured data).

You write SQL (or Python) to define your transformations, and dbt handles the rest: compiling it, running it against the warehouse, managing dependencies, and persisting the results as tables or views.

In a real company setup, you'd have data flowing in from all over the place: backend systems, frontend apps, third-party APIs like weather data. All of that gets loaded into your warehouse (BigQuery, Snowflake, Databricks, whatever), and dbt is the layer that transforms that raw data into something the business can actually consume.

---

## What problems it solves

The transformation step has always existed. What dbt brings to the table is **software engineering best practices for analytics code**.
Things that software engineers have been doing for years but didn't have a clear path into the analytics world:

- **Version control**: your transformations live in git, just like any other code
- **Modularity**: break complex logic into reusable pieces instead of massive spaghetti queries
- **Testing**: automated data quality checks that run with every deployment
- **Documentation**: generated from your code, not a separate wiki that gets out of date
- **Environments**: separate dev and prod. Each developer gets their own sandbox to work in without stepping on each other's toes
- **CI/CD**: automated deployments with validation and rollback

The result is higher-quality pipelines that are easier to maintain and less prone to breaking in production.

---

## How it works: the mechanics

You write a SQL file. It looks like a normal `SELECT` statement. dbt takes that file, figures out where it should go in the warehouse (which schema, which dataset, what environment), wraps it in the necessary DDL/DML, compiles it with any Jinja templating you've used, and runs it.

When you run `dbt run`, it:

1. Compiles your SQL (resolves `ref()` calls, `source()` calls, Jinja macros, everything)
2. Sends the compiled SQL to your warehouse
3. Materializes the result as a table, view, incremental table, or ephemeral CTE (whatever you configured)

You don't write `CREATE TABLE` statements yourself. You just write the `SELECT`, and dbt handles the rest.

---

## dbt Core vs dbt Cloud

There are two ways to use dbt, and it's worth understanding the difference:

### dbt Core

Open source. Free. You install it locally on your machine (or wherever) and run commands from the terminal.

You're responsible for:

- Setting up your dev environment
- Orchestrating production runs (Airflow, cron jobs, whatever you want)
- Hosting documentation if you want it accessible
- Managing logs and metadata

It's the raw engine.
You get full control, but you also have to build the surrounding infrastructure yourself.

### dbt Cloud

SaaS product that runs dbt Core under the hood. It gives you:

- A web-based IDE for writing transformations (or you can use a Cloud CLI if you prefer local development)
- Environment management: dev/staging/prod, all handled for you
- Built-in orchestration (job scheduling, triggers, dependencies)
- Hosted documentation (automatically generated and served)
- Logging and observability
- APIs for administration and metadata access
- A semantic layer for metrics (if you need it)

There's a free Developer plan that works for small teams or individual learning. For anything bigger, it's a paid product.

---

## The course setup: two paths

The Zoomcamp gives you two options, and the videos will alternate between them (version A and version B):

### Option A: BigQuery + dbt Cloud (recommended)

- Data warehouse: BigQuery (assuming you set this up in previous weeks)
- dbt: dbt Cloud Developer plan (free account, web IDE)
- No local installation needed

This is the path most of the videos will follow. It's the fastest way to get started and closest to how teams actually use dbt in production.

### Option B: DuckDB + dbt Core

- Data warehouse: DuckDB (local or however you've got it set up)
- dbt: dbt Core installed locally
- Dev environment: your own IDE (VS Code, etc.)
- Orchestration: you'll need to handle this separately (Airflow, Prefect, whatever)

This path gives you more hands-on control but requires more setup.

---

## The project flow

By the time we get to the end of the module, here's what we'll have built:

1. Raw data sitting in the warehouse: trip data from previous weeks, plus a lookup table to demonstrate joining multiple sources
2. dbt transformations that turn that raw data into properly modeled tables following the dimensional modeling concepts from 4.1.1
3. Dashboards that consume the final output and make it useful for business stakeholders

The next videos will walk through actually setting this up and building it out step by step.

================================================
FILE: 04-analytics-engineering/class_notes/4_2_1_dbt_core_vs_dbt_cloud.md
================================================
# DE Zoomcamp 4.2.1: dbt Core vs dbt Cloud

> 📄 Official feature comparison: [dbt Core vs dbt Cloud](https://www.getdbt.com/product/dbt-core-vs-dbt-cloud)

## dbt Core

- Born in **2016** as a fully **open-source, command-line tool**
- 100% free, runs locally on your own machine
- All code is available on GitHub (can fork, modify, etc.)

## dbt Cloud

- Introduced **two years after dbt Core** (~2018) by dbt Labs (originally called Fishtown Analytics)
- Sold as a **paid SaaS platform**, so there is no need to manage infrastructure yourself
- Handles the heavy lifting:
  - Hosting dbt documentation
  - Orchestration
  - Environment setup
  - Backups of dbt artifacts (e.g. for Slim CI)
- Comes with **collaboration and security features** useful for teams/companies

## How They Were Used Together (Hybrid Approach)

- Common pattern: more technical users worked with dbt Core; less technical users used dbt Cloud
- The two were designed to be **compatible**: e.g. developers could work locally with dbt Core while production runs were executed through dbt Cloud
- dbt Labs published an article in **October 2024** outlining how both products were meant to coexist side by side → [How we think about dbt Core and dbt Cloud](https://www.getdbt.com/blog/how-we-think-about-dbt-core-and-dbt-cloud)

## dbt Fusion: The Future

- In **May 2025**, dbt Labs announced a **full rewrite of the code base** using a new engine called **Fusion**
- Key improvements:
  - **Faster compilation** of dbt code (up to 30x faster in some cases)
  - **Better developer experience**: catches many errors *before* running/building, saving time and money
- dbt Core will continue to be maintained, but **Fusion is the future direction** for both Core and Cloud

### Fusion Limitations

- **Not supported by all adapters**: as of early 2026, Fusion supports major adapters like Snowflake, Databricks, Postgres (and derivatives), BigQuery, and Redshift
- Notably **does not support DuckDB** (yet) or many community-maintained adapters
- If you use a less common adapter, dbt Fusion and the newest versions of dbt Cloud may not work for you
- Adapter support is being actively expanded; check the official docs for the current list

> 📄 Fusion upgrade guide: [Upgrading to the dbt Fusion engine](https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-fusion)
> 📄 Full adapter support list: [Supported features](https://docs.getdbt.com/docs/fusion/supported-features)

## New Vision: Unified License

- Instead of splitting users between Core and Cloud, Fusion envisions **everyone having a dbt license**
- Users can choose to work in:
  - The **dbt Cloud IDE**, or
  - **VS Code** using the official dbt Labs extension
- Both options are backed by the same Fusion engine

## Course Decisions & Recommendations

- This course uses **DuckDB + dbt Core** (local, via VS Code) because:
  - It forces learners to understand what's actually happening under the hood
  - dbt Cloud abstracts a lot away; understanding Core first makes Cloud easier to pick up later
- If you follow along with dbt Cloud + BigQuery, the concepts transfer well
- dbt Labs' own documentation and courses are excellent resources for learning dbt Cloud specifically → [dbt Developer Hub](https://docs.getdbt.com)
- **Bottom line:** It doesn't matter much which one you learn first; especially as a consultant, you'll likely use both. Focus on the shared fundamentals.

---

*Note: This document was last updated February 2026. For the latest information on dbt Fusion and adapter support, always consult the official dbt documentation.*

================================================
FILE: 04-analytics-engineering/class_notes/4_3_1_dbt_project_structure.md
================================================
# DE Zoomcamp 4.3.1: dbt Project Structure

> 📄 Video: [dbt Project Structure](https://www.youtube.com/watch?v=2dYDS4OQbT0)
> 📄 Official docs: [About dbt projects](https://docs.getdbt.com/docs/build/projects)
> 📄 Best practices: [How we structure our dbt projects](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview)

When you run `dbt init`, dbt automatically creates a set of files and folders. This video walks through each one and explains its purpose. The structure below applies to both dbt Core and dbt Cloud (the DuckDB database file and `data/` folder are local-only artifacts and can be ignored here).

---

## Top-Level Files & Folders

### `analysis/`

- A place for **ad-hoc SQL scripts** that you don't necessarily want to share with stakeholders
- Not heavily used by everyone, but handy for things like **data quality reports** or **administrative checks**
- Think of it as a scratchpad: if you want to investigate how bad a data quality issue is, drop a SQL script here

### `dbt_project.yml`

- **The most important file in a dbt project**
- Every time you run a dbt command, dbt looks for this file first; if it's missing, the command fails
- Key things it contains:
  - Project name
  - Profile name (must match your `profiles.yml`; critical for dbt Core users)
  - Default materializations
  - Variables
- Also a place to set project-wide defaults and configuration

> 📄 [dbt_project.yml reference](https://docs.getdbt.com/reference/dbt_project.yml)

### `macros/`

- Macros behave like **reusable functions** (similar to Python functions or UDFs)
- Use them when you find yourself **repeating the same SQL logic** in multiple places, or when you want to **encapsulate a piece of logic** in one place
- Benefits:
  - Easier to test (you're testing a small, isolated chunk)
  - If a definition changes, you only update it in one place
- Common use cases:
  - **Calendar conversions** (e.g. converting standard dates to a company's fiscal calendar)
  - **Tax rates or regulatory definitions** that might change over time
  - Any reusable business logic that shouldn't be duplicated across models

> 📄 [Jinja and macros](https://docs.getdbt.com/docs/build/jinja-macros)

### `models/`

- The **most important directory**: this is where all your SQL transformation logic lives
- dbt suggests breaking it into **three subfolders** (see below)

### `README.md`

- Standard project documentation: the first thing someone sees when they open your project
- dbt creates a default one, but most teams customize it
- Good things to include:
  - How to run the project
  - Whether you need credentials or onboarding
  - Contact information
  - Installation/setup guides

### `seeds/`

- A place to **upload CSV or flat files** and ingest them as dbt models in your database
- Considered a **quick-and-dirty** approach; if you have the option, it's better to load data properly at the source
- Useful for:
  - **Lookup tables**
  - Quick experiments or prototypes
  - Showing a stakeholder something before fully committing to a data load
- Use when you don't have the right permissions, or the data is expected to change frequently during experimentation

> 📄 [Seeds](https://docs.getdbt.com/docs/build/seeds)

### `snapshots/`

- Solves a specific problem: a source table has a column that **overwrites itself**, but you need to **keep the history**
- Example: an `orders` table with a `current_status` column that only ever shows the latest status. For analytics, you want to know *when* each status changed
- How it works: a snapshot takes a **"picture" of a table at a point in time**. Each time you run it, if a value has changed, a new row is recorded with a timestamp, without overwriting the previous value
- Like seeds, this is a **workaround**; ideally you'd solve this at the source. But if you don't control the source, snapshots work well

> 📄 [Snapshots](https://docs.getdbt.com/docs/build/snapshots)

### `tests/`

- A place for **singular tests** written as SQL assertions
- The logic is simple: **if the query returns more than zero rows, the dbt build fails**
- Example from the course: a client needed to ensure that vehicle timestamps always covered exactly 24 hours per day. A test query checked for any day where the total hours deviated from 24, catching logic errors like accidental filters or bad joins early
- This is one of several ways to test in dbt, but singular tests are especially good for **custom business rules** that don't fit standard schema tests

> 📄 [Data tests (singular & generic)](https://docs.getdbt.com/docs/build/data-tests)

---

## The `models/` Subfolders

dbt suggests organizing models into three layers:

### `staging/`

- Contains two things:
  - **Source definitions**: telling dbt where your raw data lives in the database
  - **Staging models**: a **1:1 copy** of each source table with only **minimal cleaning** applied
- Minimal cleaning means things like:
  - Fixing data types
  - Renaming columns
  - Filtering out clearly empty rows
  - Removing unnecessary columns
  - Standardizing values
- Keep it **1:1**: same number of rows and columns as the raw source. Breaking this rule is occasionally convenient but should be the exception

### `intermediate/`

- Everything that is **not raw** and **not ready to expose** to end users
- A catch-all for:
  - Complex joins
  - Heavy-duty cleaning or standardization
  - Data quality processing
- No strict guidelines on what goes here; if it doesn't fit neatly into staging or marts, it belongs in intermediate

### `marts/`

- Where all the **final, consumption-ready** tables live
- If it's in marts, it's **ready for end users**
- In a well-governed dbt project, **only marts tables should be exposed** to BI tools, analysts, and business stakeholders, nothing else
- Typically contains:
  - Tables ready for dashboards
  - Properly modeled, clean tables
  - Often star schemas, but not necessarily

---

## A Note on Conventions

The `staging → intermediate → marts` structure is dbt's recommendation, but it's not mandatory. The instructor has seen teams use:

- **Medallion architecture** naming: `bronze`, `silver`, `gold`
- Numbered layers: `first`, `second`, `third`, `last`
- Other custom conventions

If your organization already has a convention, follow it. Otherwise, stick with dbt's default structure; it's well thought out and what this course uses.

================================================
FILE: 04-analytics-engineering/class_notes/4_3_2_dbt_sources.md
================================================
# DE Zoomcamp 4.3.2: dbt Sources

> 📄 Video: [dbt Sources](https://www.youtube.com/watch?v=7CrrXazV_8k)
> 📄 Official docs: [Sources](https://docs.getdbt.com/docs/build/sources)
> 📄 Best practices: [How we structure our dbt projects: Staging](https://docs.getdbt.com/best-practices/how-we-structure/03-staging)

This video is about telling dbt where your raw data actually lives. Sources are how dbt knows which tables to pull from before any transformation happens. Everything in this video takes place inside the `models/staging/` folder that we set up in 4.3.1.
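
As a rough mental model of what declaring a source buys you, the sketch below maps a (source name, table name) pair to a fully qualified table name and fails loudly when the pair was never declared. It is a toy in plain Python, not how dbt is implemented; `SOURCES`, `resolve_source`, and every name inside them are hypothetical placeholders.

```python
# Toy model of a source declaration: NOT dbt's implementation, just the idea.
# All names below are illustrative placeholders.
SOURCES = {
    "raw_data": {
        "database": "my_database",
        "schema": "my_schema",
        "tables": {"green_tripdata", "yellow_tripdata"},
    }
}


def resolve_source(source_name: str, table_name: str) -> str:
    """Return the fully qualified table name for a declared source table."""
    if source_name not in SOURCES:
        raise KeyError(f"source '{source_name}' is not declared")
    source = SOURCES[source_name]
    if table_name not in source["tables"]:
        raise KeyError(
            f"table '{table_name}' is not declared in source '{source_name}'"
        )
    return f"{source['database']}.{source['schema']}.{table_name}"


resolved = resolve_source("raw_data", "green_tripdata")
print(resolved)
```

The useful property is the single point of truth: if the raw tables move to another database or schema, only the declaration changes, and every query that goes through the lookup picks up the new location.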

---

## Defining Sources

### `sources.yml`

- A **YAML file** inside `models/staging/` that tells dbt where your raw data is
- The **name** of the file is arbitrary; common choices are `sources.yml`, `_sources.yml` (underscore so it sorts to the top), or something named after the origin like `bigquery_sources.yml`
- You give your source a **name**. This is arbitrary too. Think of it as a label: `raw`, `raw_data`, or something more descriptive like `google_analytics_data` or `finance_data`
- Then you provide three fields that are **not** arbitrary; they must exactly match your warehouse:
  - **database**: the database name or GCP project
  - **schema**: the schema inside that database or BigQuery dataset
  - **tables**: the individual tables you want to reference

```yaml
sources:
  - name: nytaxi
    database: taxi_rides_ny  # Or name of your GCP project
    schema: prod  # Or name of your BigQuery dataset
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
```

> 📄 [Sources: full reference](https://docs.getdbt.com/docs/build/sources)

### Local (DuckDB) vs BigQuery: what goes where

The meaning of database, schema, and tables changes depending on your setup:

| Field | Local (DuckDB) | BigQuery |
|---|---|---|
| **database** | `taxi_rides_ny` | Your GCP Project ID |
| **schema** | `prod` | Your BigQuery Dataset name (e.g. `trips_data_all`) |
| **tables** | `green_tripdata`, `yellow_tripdata` | Same table names |

- If you followed the default local setup, these names should be exactly right out of the box
- If you're on BigQuery, just double-check that your table names match what you actually have in your dataset

---

## Using Sources in Your Models

### The `source()` function

- Instead of hard-coding the full path to your table (e.g. `FROM production.trips_data_all.green_tripdata`), you use the **`source()`** function
- It's a **Jinja macro**; you'll recognize it by the double curly brackets `{{ }}`
- It takes two arguments:
  - The **source name**: the one you defined in your YAML (e.g. `nytaxi`)
  - The **table name**: must match exactly what you put under `tables` in the YAML
- As long as there's a YAML file somewhere in your project with a matching source declaration, this will resolve correctly at compile time

```sql
select * from {{ source('nytaxi', 'green_tripdata') }}
```

- Run a preview and you should see the raw table data come back
- If it works, that's the foundation; everything else builds on this

---

## Building a Proper Staging Model

### Naming convention

- Prefix your staging model files with **`stg_`** to make it clear what layer they belong to
- So `green_tripdata.sql` becomes `stg_green_tripdata.sql`
- Other common prefixes: `int_` for intermediate, and sometimes nothing at all for final mart models

### Rename and reorder columns

- List out every column explicitly and give them **cleaner aliases**
- Be purposeful about the **order**; it should follow a logical grouping:
  - **Identifiers first**: `vendor_id`, `trip_id`, anything that's an ID
  - **Timestamps next**: `pickup_datetime`, `dropoff_datetime`
  - **Trip details**: `passenger_count`, `trip_distance`, `trip_type`
  - **Payment info last**: `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`, `total_amount`, `payment_type`

### Cast data types explicitly

- Don't rely on whatever the source gave you; cast everything to the type you actually want:
  - IDs → `integer`
  - Timestamps → `timestamp`
  - Counts → `integer`
  - Monetary values → `numeric` or `float` (depends on your platform)

```sql
with tripdata as (
    select *
    from {{ source('nytaxi', 'green_tripdata') }}
    where vendorid is not null
),

renamed as (
    select
        -- identifiers
        cast(vendorid as integer) as vendorid,
        cast(ratecodeid as integer) as ratecodeid,
        cast(pulocationid as integer) as pickup_locationid,
        cast(dolocationid as integer) as dropoff_locationid,

        -- timestamps
        cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
        cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,

        -- trip info
        store_and_fwd_flag,
        cast(passenger_count as integer) as passenger_count,
        cast(trip_distance as numeric) as trip_distance,
        cast(trip_type as integer) as trip_type,

        -- payment info
        cast(fare_amount as numeric) as fare_amount,
        cast(extra as numeric) as extra,
        cast(mta_tax as numeric) as mta_tax,
        cast(tip_amount as numeric) as tip_amount,
        cast(tolls_amount as numeric) as tolls_amount,
        cast(ehail_fee as numeric) as ehail_fee,
        cast(improvement_surcharge as numeric) as improvement_surcharge,
        cast(total_amount as numeric) as total_amount,
        cast(payment_type as integer) as payment_type,
        {{ get_payment_type_description('payment_type') }} as payment_type_description
    from tripdata
)

select * from renamed
```

---

## A Note on Filtering

- The general recommendation is to keep staging models as **1:1 copies** of the source: same number of rows, same number of columns, just cleaned up
- That said, this dataset has some data quality issues (we'll cover those later), so it makes sense to filter out rows where **`vendorid IS NULL`** right here in staging
- It's a deviation from convention, but a practical one for this project

---

## Your Exercise

Do the same thing for the **yellow tripdata** table. The columns are almost identical to green, so it shouldn't be too painful.
By the end you should have: - A `sources.yml` that declares both tables - A `stg_green_tripdata.sql` staging model - A `stg_yellow_tripdata.sql` staging model ================================================ FILE: 04-analytics-engineering/class_notes/4_4_1_dbt_models.md ================================================ # DE Zoomcamp 4.4.1 — dbt Models > 📄 Video: [dbt Models](https://www.youtube.com/watch?v=JQYz-8sl1aQ) > 📄 Official docs: [SQL models](https://docs.getdbt.com/docs/build/sql-models) > 📄 ref() function: [About ref](https://docs.getdbt.com/reference/dbt-jinja-functions/ref) Staging is done. From here on out it's not just typing SQL behind a computer — you need to actually **explore the data**, understand what's in it, and get some **business context**. In a real org that means querying exhaustively until you understand the common data quality issues, what a normal row looks like, and talking to people about what the codes mean and when rows trigger. All of that understanding eventually gets encoded as SQL. --- ## What are we building? Before writing any code, it helps to think about what the end result should look like. There are generally two things you want in your marts: ### Reports and dashboards - If there's an important dashboard or data application out there — especially one that requires a lot of manual work or spreadsheet maintenance — that's a sign it should become a dbt model - Example: imagine there's a dashboard with a dataset called **monthly revenue per location**. That's something we want to build and version-control properly ### A dimensional model - Beyond reports, you want a proper **star schema** — the kind of structure you see in data warehouses - Two key table types to know: - **Fact tables** — one row per event/process. One row per trip, one row per sale, one row per order. Named with a `fct_` prefix (e.g. `fct_trips`) - **Dimension tables** — attributes of an entity. Named with a `dim_` prefix (e.g. 
`dim_zones`, or `dim_vendors`, which is not shown here)
- The power of a good star schema: answering "how many?" questions becomes trivial. *How many zones do we have?* → `COUNT(*)` on `dim_zones`. *How many trips?* → `COUNT(*)` on `fct_trips`. Simple, focused tables that you join when you need something more complex

### What we're building in this course

- `dim_zones`: zone/location attributes
- `fct_trips`: one row per trip (yellow + green combined)
- A report model for monthly revenue per zone (`fct_monthly_zone_revenue`, inside the `models/marts/reporting/` folder)

---

## source() vs ref(): the key distinction

This is an important moment in the course. Up until now we've been using `{{ source() }}` to pull in raw data. But that's **only** for things declared in your sources YAML, i.e. raw tables that live outside of dbt.

If the input to your model is **another dbt model**, you use `{{ ref() }}` instead.

- `{{ source('name', 'table') }}` → raw data defined in your YAML
- `{{ ref('model_name') }}` → another dbt model

> 📄 [ref(): full reference](https://docs.getdbt.com/reference/dbt-jinja-functions/ref)

This distinction matters because `ref()` also does something useful under the hood: it automatically builds the **dependency graph**. dbt knows that if model B refs model A, then A has to run first. You never have to manage run order yourself.

---

## The intermediate layer: why it exists

We want `fct_trips` to be a union of yellow and green trip data. But doing that union directly inside the fact model would make it messy. So we put it in an **intermediate model** instead, something that's not raw and not ready to expose to end users.

- Convention: prefix intermediate models with `int_`
- In this case: `int_trips_unioned.sql`
- The idea is to keep intermediate work out of marts.
Marts should only contain things that are consumption-ready ```sql with green_data as ( select *, 'Green' as service_type from {{ ref('stg_green_tripdata') }} ), yellow_data as ( select *, 'Yellow' as service_type from {{ ref('stg_yellow_tripdata') }} ), trips_unioned as ( select * from green_data union all select * from yellow_data ) select * from trips_unioned ``` --- ## The union problem — yellow and green aren't identical When you try to union the two staging models, it fails. The error: *set operation can only be applied with expressions with the same number of columns*. Turns out green has **two extra columns** that yellow doesn't: ### `trip_type` - Values are `1` or `2` - `1` = street hail (you flag down the taxi) - `2` = booked via phone or app - Yellow taxis **don't have this column** because by law you can only get a yellow taxi by hailing it on the street — it's always type 1 - Fix: add `trip_type` to the yellow staging model and hard-code it as `1` (street hail) ### `ehail_fee` (e-hail fee) - An extra fee that can apply when you request a taxi through an app - In practice, most of this data is null — the feature isn't consistently implemented across vendors - Yellow taxis by definition **never** have an e-hail fee - Fix: add `ehail_fee` to the yellow staging model and hard-code it as `0` ```sql -- Updated stg_yellow_tripdata.sql to match green schema with tripdata as ( select * from {{ source('staging','yellow_tripdata') }} where vendorid is not null ), renamed as ( select -- identifiers cast(vendorid as integer) as vendor_id, cast(ratecodeid as integer) as ratecode_id, cast(pulocationid as integer) as pickup_location_id, cast(dolocationid as integer) as dropoff_location_id, -- timestamps cast(tpep_pickup_datetime as timestamp) as pickup_datetime, cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime, -- trip info store_and_fwd_flag, cast(passenger_count as integer) as passenger_count, cast(trip_distance as numeric) as trip_distance, cast(1 as 
integer) as trip_type,  -- Yellow only does street-hail

        -- payment info
        cast(fare_amount as numeric) as fare_amount,
        cast(extra as numeric) as extra,
        cast(mta_tax as numeric) as mta_tax,
        cast(tip_amount as numeric) as tip_amount,
        cast(tolls_amount as numeric) as tolls_amount,
        cast(0 as numeric) as ehail_fee,  -- Yellow doesn't have ehail
        cast(improvement_surcharge as numeric) as improvement_surcharge,
        cast(total_amount as numeric) as total_amount,
        cast(payment_type as integer) as payment_type
    from tripdata
)

select * from renamed
```

A note: adding these columns directly in staging is technically a break from the "1:1 copy" rule. It's done here to keep things simple, but in a stricter project you'd handle this in the intermediate layer. Either way, `union all` over `select *` matches columns by position rather than by name, so both staging models have to expose the same columns in the same order. The same applies to any other column that exists in only one of the two models.

**Updated union after schema alignment:**

```sql
-- models/intermediate/int_trips_unioned.sql
with green_data as (
    select
        *,
        'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_data as (
    select
        *,
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
),

trips_unioned as (
    select * from green_data
    union all
    select * from yellow_data
)

select * from trips_unioned
```

---

## Why the business context matters

The column discrepancy between yellow and green isn't just a technical problem; it's a **business story**. Yellow and green taxis exist because of how NYC taxi licensing works: yellow cabs may pick up anywhere in the city but in practice concentrate on Manhattan and the airports, while green cabs were introduced in 2013 so that people in upper Manhattan and the outer boroughs could get street-hail rides too.

Understanding that context is what lets you make the right call on how to handle `trip_type` and `ehail_fee`, not just technically but semantically. This is the part of analytics engineering where you stop just writing SQL and start understanding what the data actually represents.
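The stricter alternative mentioned in the note above, keeping both staging models as 1:1 copies and aligning the schemas in the intermediate layer, might look roughly like the sketch below. It is an illustration under assumptions, not the course's implementation: it assumes a `stg_yellow_tripdata` that does *not* add `trip_type` or `ehail_fee`, and that both staging models expose the snake_case aliases used in the yellow model above. Adjust the column lists to whatever your staging models actually produce.

```sql
-- models/intermediate/int_trips_unioned.sql (sketch)
-- Explicit column lists make the union independent of column order in staging.

with green_data as (
    select
        vendor_id, ratecode_id, pickup_location_id, dropoff_location_id,
        pickup_datetime, dropoff_datetime,
        store_and_fwd_flag, passenger_count, trip_distance,
        trip_type,
        fare_amount, extra, mta_tax, tip_amount, tolls_amount,
        ehail_fee,
        improvement_surcharge, total_amount, payment_type,
        'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_data as (
    select
        vendor_id, ratecode_id, pickup_location_id, dropoff_location_id,
        pickup_datetime, dropoff_datetime,
        store_and_fwd_flag, passenger_count, trip_distance,
        cast(1 as integer) as trip_type,   -- yellow is treated as street-hail only
        fare_amount, extra, mta_tax, tip_amount, tolls_amount,
        cast(0 as numeric) as ehail_fee,   -- yellow has no e-hail fee column
        improvement_surcharge, total_amount, payment_type,
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

select * from green_data
union all
select * from yellow_data
```

The trade-off is more typing in exchange for staging models that stay faithful to the source, plus a union that fails loudly if a column is renamed rather than silently misaligning.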
================================================ FILE: 04-analytics-engineering/class_notes/4_4_2_dbt_seeds_and_macros.md ================================================ # DE Zoomcamp 4.4.2 — dbt Seeds and Macros > 📄 Video: [dbt Seeds and Macros](https://www.youtube.com/watch?v=lT4fmTDEqVk) > 📄 Seeds docs: [Seeds](https://docs.getdbt.com/docs/build/seeds) > 📄 Macros docs: [Jinja and macros](https://docs.getdbt.com/docs/build/jinja-macros) The union model is done, but right now vendor IDs and location IDs are just numbers — meaningless codes. This video is about enriching that data. Two dbt features come in: **seeds** for bringing in lookup data, and **macros** for turning reusable SQL logic into something you don't have to copy-paste everywhere. --- ## The problem — codes everywhere If you query `vendor_id`, you get values: 1 and 2. Those map to real companies: - **1** → Creative Mobile Technologies - **2** → VeriFone Inc. Same story with locations — 265 location IDs that could have names, boroughs, coordinates, and more. The raw data just doesn't have any of that. So how do we add it? --- ## Seeds — bringing in lookup data ### What seeds are - A way to **upload a CSV file** and make it available as a dbt model - You drop the CSV into the `seeds/` directory, run `dbt seed`, and it becomes queryable just like any other model - You reference it with `{{ ref('filename') }}` — same as any other model ### When to use them - **Lookup tables** that don't exist anywhere in your warehouse yet - Cases where you don't have write permissions to load data properly - Quick experiments or local testing before committing to a proper data load - Small, static datasets ### When NOT to use them - **Never commit confidential data** — seeds go into your git repo - Keep the data **small** — large CSVs in git will slow down pulls and pushes - If you have the option to load the data properly at the source, do that instead. 
Seeds are a quick-and-dirty workaround > 📄 [Seeds — full reference](https://docs.getdbt.com/docs/build/seeds) --- ## dim_zones — using a seed in practice The taxi zone lookup CSV has exactly what we need: location ID, borough, zone name, and service area. Drop it into `seeds/`, run `dbt seed`, and it's live. Now we build `dim_zones`. The model simply selects from the seed and renames columns to something cleaner. ```sql select locationid as location_id, borough, zone, service_zone from {{ ref('taxi_zone_lookup') }} ``` That's it — first dimension table done. The seed did the heavy lifting. --- ## dim_vendors — the CASE WHEN problem (not implemented in this project, but shown for learning) For vendors, we could pull distinct `vendor_id` from the intermediate union model using `ref()`. Easy enough. But we want to enrich it with vendor **names**. ### The naive approach: CASE WHEN You could just write it inline: ```sql with vendors as ( select distinct vendorid from {{ ref('stg_green_tripdata') }} ) select vendorid, case when vendorid = 1 then 'Creative Mobile Technologies, LLC' when vendorid = 2 then 'VeriFone Inc.' else 'Unknown' end as vendor_name from vendors ``` This works. But it has a real problem: **what happens when a new vendor appears, or a vendor changes its name?** You have to open this file, find the CASE block, and add another line. And if you need the same mapping somewhere else in the project, you copy-paste the whole thing. Eventually someone forgets to update one of the copies. ### The better approach: macros Macros are dbt's answer to this. Think of them as **reusable SQL functions** — same idea as a Python function, but for SQL snippets. > 📄 [Jinja and macros — full reference](https://docs.getdbt.com/docs/build/jinja-macros) ### How macros work - Defined in `.sql` files inside the `macros/` directory - You wrap your SQL logic in `{% macro macro_name(argument) %}` ... 
`{% endmacro %}`
- The argument works just like a function parameter: you pass in a value when you call it
- You call it in your models with `{{ macro_name(argument) }}`
- dbt compiles it down: the final SQL looks exactly like you typed the CASE block inline, but your source code stays clean

```sql
{% macro get_vendor_data(vendor_id_column) %}

{# Mapping of vendor ID to vendor name; edit here and every caller picks it up #}
{% set vendors = {
    1: 'Creative Mobile Technologies',
    2: 'VeriFone Inc.',
    4: 'Unknown/Other'
} %}

{# No else branch: IDs missing from the mapping compile to NULL #}
case {{ vendor_id_column }}
    {% for vendor_id, vendor_name in vendors.items() %}
    when {{ vendor_id }} then '{{ vendor_name }}'
    {% endfor %}
end

{% endmacro %}
```

**Using the macro in a model:**

```sql
with trips as (
    select * from {{ ref('fct_trips') }}
),

vendors as (
    select distinct
        vendor_id,
        {{ get_vendor_data('vendor_id') }} as vendor_name
    from trips
)

select * from vendors
```

### Why this is better

- **Reusable**: need the same vendor mapping somewhere else? Just call the macro again
- **Single source of truth**: a vendor changes its name or a new one appears? Update the macro in one place and it's fixed everywhere
- **Testable**: the logic is isolated in its own file, which makes it easier to reason about

---

## Homework preview: fct_trips

The fact trips model is left as an exercise. Here's what's expected:

- **One row per trip**: yellow and green combined (the union is already done in the intermediate model)
- **Add a primary key** (`trip_id`): it has to be **unique**
- **Find and fix duplicates**: there are quite a few in this dataset. Some come from the source, some get introduced during the union. Find them, understand why they happen, and fix them
- **Enrich `payment_type`** (there is a seed for this in the repo).
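As a starting point for the first three requirements, here is one possible shape for the model. Treat it as a sketch rather than the reference solution: it assumes the `dbt_utils` package is installed (packages are covered in 4.5.3), that the unioned model exposes the snake_case column names used in the yellow staging model, and that the listed columns are enough to identify a trip, which you should verify against your own data.

```sql
-- models/marts/fct_trips.sql (sketch)

with unioned as (
    select * from {{ ref('int_trips_unioned') }}
),

keyed as (
    select
        -- Assumed key columns; validate that this combination is really unique per trip
        {{ dbt_utils.generate_surrogate_key([
            'vendor_id',
            'pickup_datetime',
            'dropoff_datetime',
            'pickup_location_id',
            'service_type'
        ]) }} as trip_id,
        *
    from unioned
),

deduplicated as (
    select
        *,
        -- Number the rows that share a key so all but the first can be dropped
        row_number() over (
            partition by trip_id
            order by dropoff_datetime
        ) as row_num
    from keyed
)

-- row_num is still part of the output here; list columns explicitly to drop it
select *
from deduplicated
where row_num = 1
```

Keeping the key generation and the deduplication in separate CTEs makes it easy to query `keyed` on its own and count how many rows share a `trip_id` before deciding which one to keep.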
================================================ FILE: 04-analytics-engineering/class_notes/4_5_1_documentation.md ================================================ # DE Zoomcamp 4.5.1 — Documentation > 📄 Video: [Documentation](https://www.youtube.com/watch?v=UqoWyMjcqrA) > 📄 Official docs: [Documentation](https://docs.getdbt.com/docs/build/documentation) > 📄 Model properties: [Model properties](https://docs.getdbt.com/reference/model-properties) The models are built. Now it's time to make sure other people can actually understand what they do. This video covers how dbt's documentation system works — what you write, where you write it, and what dbt does with it. --- ## Where documentation lives — YAML files You've already seen YAML files in the context of sources. But they do more than just declare where raw data lives — they're also the **primary place to document your entire project**. The most common convention is to have a single file called `schema.yml` per directory. Some teams prefer **one YAML file per model** — that's fine too, it keeps things from getting unwieldy when projects get large. For this course we stick with `schema.yml`. > 📄 [Model properties — full reference](https://docs.getdbt.com/reference/model-properties) --- ## What you can document Almost everything in dbt can be documented. The structure is the same pattern regardless of what you're documenting: ### Sources You already have a `sources.yml` — you can add descriptions to the source itself and to each table inside it. ```yaml version: 2 sources: - name: staging description: > Raw NYC taxi trip data loaded from BigQuery external tables. Contains both yellow and green taxi trip records for 2019-2020. database: production schema: trips_data_all tables: - name: green_tripdata description: > Green taxi trip records. Green taxis operate primarily in outer boroughs (outside Manhattan). 
      - name: yellow_tripdata
        description: Yellow taxi trips, primarily from Manhattan
```

### Models

In `schema.yml`, you switch from `sources:` to `models:`. Same idea: give each model a name and a description, then drill down into columns.

```yaml
version: 2

models:
  - name: dim_zones
    description: >
      Zone lookup table containing location ID, borough, zone name
      and service zone. One row per taxi zone in NYC.
    columns:
      # Column names must match the model's output; dim_zones renames locationid to location_id
      - name: location_id
        description: Primary key for taxi zones
        tests:
          - unique
          - not_null
      - name: borough
        description: NYC borough name (Manhattan, Queens, Brooklyn, Bronx, Staten Island, EWR)
      - name: zone
        description: Taxi zone name/neighborhood
      - name: service_zone
        description: Service zone type (Yellow Zone, Boro Zone, Airports, or EWR)
```

### Columns

Under each model, you can list every column with:

- **name**: must match the actual column name in the model's output
- **description**: what it means
- **data_type**: what type it should be (informational, and not enforced unless the model has an enforced contract)
- **tests**: we'll cover these in the next video, but the slot is here
- **meta**: custom key-value tags (more on this below)

### Macros and seeds

You can document these too, using the same YAML pattern. Same `version: 2` header, just different top-level keys.

---

## Multi-line descriptions

If you need more than one line for a description, use the YAML **pipe operator** (`|`) or **greater-than operator** (`>`). Everything indented under it becomes part of the description. The `>` folds newlines into spaces, while `|` preserves them.

```yaml
version: 2

models:
  - name: fct_trips
    description: |
      Fact table containing all taxi trips from both yellow and green taxis.
      This is the core analytical table for trip-level analysis.

      Each row represents a single trip with:
      - Trip identifiers and service type
      - Pickup and dropoff locations and timestamps
      - Trip details (distance, passenger count, etc.)
      - Payment information and amounts

      Data is filtered for 2019-2020 only and excludes records with
      unknown pickup or dropoff locations.
``` --- ## Meta tags — custom metadata The `meta` field lets you attach arbitrary key-value pairs to any column or model. There's no predefined set — you and your team decide what matters. Common examples: - **PII** — flag columns that contain personally identifiable information - **owner** — who's responsible for this data asset, who to contact if something breaks - **importance** — mark which columns or models are critical vs. informational These don't affect how dbt runs anything. They're purely for governance, discoverability, and helping your team navigate the project. --- ## Generating and viewing the docs Two commands, run them in order: ### `dbt docs generate` - Compiles everything — your YAML descriptions, your model code, and metadata from the warehouse (like actual column types and table sizes) — into a JSON file - In **dbt Cloud**, this happens automatically. There's even a checkbox for it - In **dbt Core**, you have to run it yourself ### `dbt docs serve` - Takes the generated JSON and spins up a local website (defaults to `localhost:8080`) - Only needed if you're on **dbt Core** — dbt Cloud hosts the docs for you - If you want other people to see it, you'll need to host it somewhere (S3, Netlify, etc.) ### What the docs site shows you - **Model code** — both the Jinja version you wrote and the compiled SQL that actually hits the database - **Column info** — types, descriptions, anything you added - **Lineage graph** — a visual DAG showing sources in green, all the way through to your final mart models. You can see exactly what depends on what, and whether a change might break something downstream - **Project structure** — toggle between a folder view and a database view It's more of a **technical documentation** tool than a pretty data catalog. It's not going to replace something like Looker or Confluent's data catalog for non-technical stakeholders. 
But for the people building the models, it's genuinely useful — you can see at a glance what data assets exist, how they connect, and how they work. ================================================ FILE: 04-analytics-engineering/class_notes/4_5_2_dbt_tests.md ================================================ # DE Zoomcamp 4.5.2 — dbt Tests > 📄 Video: [dbt Tests](https://www.youtube.com/watch?v=bvZ-rJm7uMU) > 📄 Official docs: [Data tests](https://docs.getdbt.com/docs/build/data-tests) | [Unit tests](https://docs.getdbt.com/docs/build/unit-tests) | [Model contracts](https://docs.getdbt.com/docs/mesh/govern/model-contracts) Wrong KPIs in dashboards, bad numbers in reports — there are really only two causes: the underlying data wasn't what you expected, or you messed up the SQL. As an analytics engineer, if you can't tell which one it is, both are technically your fault. Tests are how you stay on top of this proactively. dbt ships with a pretty large suite of testing options, and this video walks through all of them. --- ## 1. Singular tests The simplest kind of test. You write a plain SQL query, stick it in the `tests/` directory, and that's it — it's now a test. The logic is straightforward: **if the query returns any rows, the test fails.** You're writing a query that selects for the "bad" cases. Zero rows back means everything checks out. ```sql -- tests/assert_positive_fare_amount.sql -- Fare amounts should always be positive select tripid, fare_amount from {{ ref('fct_trips') }} where fare_amount <= 0 ``` These are great for one-off business rules that are very specific to your organization — the kind of thing no generic test is going to cover out of the box. > 📄 [Singular data tests — docs](https://docs.getdbt.com/docs/build/data-tests#singular-data-tests) --- ## 2. Source freshness tests These live in your source YAML, not in a separate file. You add a `freshness` block to a source and tell dbt which column indicates when data was last loaded. 
Then you run `dbt source freshness` and dbt checks whether that timestamp is recent enough. You can set both `warn_after` and `error_after` thresholds — one to flag it, one to actually fail. ```yaml version: 2 sources: - name: staging database: production schema: trips_data_all tables: - name: green_tripdata loaded_at_field: lpep_pickup_datetime freshness: warn_after: {count: 6, period: hour} error_after: {count: 12, period: hour} - name: yellow_tripdata loaded_at_field: tpep_pickup_datetime freshness: warn_after: {count: 6, period: hour} error_after: {count: 12, period: hour} ``` Not something you see everywhere, but for pipelines where stale data would cause real problems it's a lifesaver. > 📄 [Source freshness — docs](https://docs.getdbt.com/reference/resource-properties/freshness) --- ## 3. Generic tests This is the big one — the most common type of test you'll see in dbt projects. Generic tests are defined in your YAML right alongside your column descriptions. They're parameterized and reusable, so you write the logic once and apply it across as many columns and models as you need. 
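Under the hood, a generic test follows the same rule as a singular test: it is a query that returns the failing rows, with the model and column filled in as parameters. As a rough sketch (an approximation for illustration, not dbt's exact source), a uniqueness check on the `tripid` column of `stg_green_tripdata` boils down to something like this:

```sql
-- Approximate shape of a uniqueness check: any row returned is a failure
select
    tripid as unique_field,
    count(*) as n_records
from {{ ref('stg_green_tripdata') }}
where tripid is not null      -- nulls are left to a separate not-null check
group by tripid
having count(*) > 1           -- a value that appears more than once is a duplicate
```

Seen this way, the YAML entry is just shorthand: you name the check and the column, and dbt generates and runs the query for you.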
### The four built-in generic tests dbt ships with exactly four: - **unique** — no duplicate values in this column - **not_null** — no nulls allowed - **accepted_values** — column values must be within a defined list - **relationships** — every value in this column must exist in another model (referential integrity) ```yaml version: 2 models: - name: stg_green_tripdata description: Staged green taxi data columns: - name: tripid description: Primary key for trips tests: - unique - not_null - name: vendorid tests: - not_null - name: payment_type description: Payment method code tests: - accepted_values: values: [1, 2, 3, 4, 5, 6] - name: pickup_locationid description: Taxi zone where trip started tests: - relationships: to: ref('taxi_zone_lookup') field: locationid ``` > 📄 [Generic data tests — docs](https://docs.getdbt.com/docs/build/data-tests#generic-data-tests) ### Writing your own custom generic tests Four tests won't cover everything. You can write your own — they're SQL files that live in `tests/generic/`. The syntax uses Jinja test blocks, and dbt will pick them up and make them available just like the built-ins. ```sql -- tests/generic/test_positive_values.sql {% test positive_values(model, column_name) %} select * from {{ model }} where {{ column_name }} < 0 {% endtest %} ``` **Usage in schema.yml:** ```yaml models: - name: fct_trips columns: - name: fare_amount tests: - positive_values - name: trip_distance tests: - positive_values ``` And here's the thing — you probably don't need to write as many custom tests as you'd expect. The dbt community has already built a ton of them in open-source packages (dbt-utils, dbt-expectations, etc.). Worth checking those before rolling your own. > 📄 [Writing custom generic tests — docs](https://docs.getdbt.com/best-practices/writing-custom-generic-tests) --- ## 4. Unit tests Available from dbt v1.8 onwards (released in mid-2024). 
Unit tests let you test your SQL logic in isolation, without reading any real data from the warehouse.

The idea: you define a small set of mock input rows and the expected output rows. dbt runs your model's SQL against those mocks and checks whether the output matches what you said it should be.

This is especially handy for complex logic (rolling windows, regex, edge cases) because you can test for scenarios that haven't even shown up in your real data yet.

Mock rows only need the columns that matter for the test; anything you leave out is null. That cuts both ways: `stg_green_tripdata` filters on `vendorid is not null`, so the mock rows below set `vendorid`, otherwise every row would be filtered out before the comparison. Mock columns also have to exist in the input being mocked, so the rows use raw source columns only.

```yaml
version: 2

unit_tests:
  - name: test_payment_type_mapping
    description: Test that payment type codes map to correct descriptions
    model: stg_green_tripdata
    given:
      - input: source('staging', 'green_tripdata')
        rows:
          # vendorid is set so the rows survive the model's not-null filter
          - {vendorid: 1, payment_type: 1}
          - {vendorid: 1, payment_type: 2}
          - {vendorid: 2, payment_type: 5}
    expect:
      rows:
        # Only the columns listed here are compared
        - {payment_type: 1, payment_type_description: 'Credit card'}
        - {payment_type: 2, payment_type_description: 'Cash'}
        - {payment_type: 5, payment_type_description: 'Unknown'}
```

Unit tests are defined in YAML in your `models/` directory, and currently only support SQL models. Since the inputs are static, there's no reason to run them in production; use them in development and CI.

As of early 2026, unit tests have been available for nearly two years and are seeing increasing adoption, especially for teams with complex transformation logic or strict data quality requirements. They're particularly useful in CI/CD pipelines where you want to catch logic errors before they hit production data.

> 📄 [Unit tests: docs](https://docs.getdbt.com/docs/build/unit-tests)

---

## 5. Model contracts

The last type covered in this video, and a bit different from the others. Model contracts aren't about catching bad data after the fact; they're about **preventing your model from building at all** if it doesn't match a defined shape.

You define the expected columns, data types, and optionally constraints in your YAML. Then you flip on `contract: enforced: true` in the model's config.
From that point on, if your model's output doesn't match — wrong column name, wrong type, missing column — dbt will error out before anything gets materialized. ```yaml version: 2 models: - name: fct_trips config: contract: enforced: true columns: - name: tripid data_type: string constraints: - type: not_null - type: unique - name: pickup_datetime data_type: timestamp constraints: - type: not_null - name: service_type data_type: string - name: total_amount data_type: numeric ``` The idea behind this comes from the concept of **data contracts** — you sit down with your stakeholder, agree on what the output dataset should look like (column names, types, freshness expectations), and the contract enforces that agreement automatically. If someone changes the model in a way that breaks it, they'll know immediately. > 📄 [Model contracts — docs](https://docs.getdbt.com/docs/mesh/govern/model-contracts) ================================================ FILE: 04-analytics-engineering/class_notes/4_5_3_dbt_packages.md ================================================ # DE Zoomcamp 4.5.3 — dbt Packages > 📄 Video: [dbt Packages](https://www.youtube.com/watch?v=KfhUA9Kfp8Y) > 📄 Official docs: [Packages](https://docs.getdbt.com/docs/build/packages) > 📄 Package Hub: [hub.getdbt.com](https://hub.getdbt.com) One of the things that makes dbt's community so strong is packages. A dbt package is basically a self-contained dbt project — it has its own macros, tests, models, sources — but instead of using it yourself, you distribute it so other people can drop it into their own projects. Think Python libraries, but for dbt. This video covers the most useful packages out there and how to actually install and use them. --- ## Packages worth knowing about ### dbt-utils The big one. Maintained by dbt Labs, so it's well-kept and safe to use. 
It bundles a ton of common SQL utilities as macros — things like generating surrogate keys, deduplicating, pivoting, safe division, extracting URL parameters. Stuff most of us have written ourselves at some point. The real kicker is **cross-database compatibility**. dbt-utils macros compile down to the correct SQL dialect depending on your warehouse. So the same macro works on BigQuery, DuckDB, Snowflake, etc. — no need to maintain separate versions of your code. ### dbt-codegen A massive time-saver for the YAML grind. Codegen does two things: - **YAML from SQL** — point it at a model or source and it auto-generates the `schema.yml` with all the columns listed out. No more manually typing hundreds of column names. - **SQL from YAML** — the reverse. Give it a YAML spec and it generates a staging model SQL file following dbt conventions (single CTE for renaming, proper file naming, etc.). ### dbt-project-evaluator Scores your dbt project against best practices. Good for teams that want a quick sanity check on whether they're following conventions. ### dbt-audit-helper Handy when you're refactoring. It compares an old model against a new one and validates that they produce the same results — same columns, same row counts, same values. Takes the anxiety out of rewriting existing SQL. ### dbt-expectations This is the one that makes custom tests almost unnecessary. It's a massive library of pre-built generic tests covering almost every assertion you can think of — row counts, value ranges, consistent casing, regex matching, approximate equality, and way more. In practice, if you need to test something, there's a very good chance dbt-expectations already has it. > 📄 [dbt-expectations on the Package Hub](https://hub.getdbt.com/calogica/dbt_expectations/latest/) ### Warehouse-specific packages The hub has plenty of packages tailored to specific platforms — Snowflake, BigQuery, etc. 
These typically come with models or macros for monitoring spend, evaluating best practices, applying constraints, or working with platform-specific features like semantic views. --- ## A note on trust Packages on the dbt Hub have gone through a vetting process by dbt Labs — they're generally safe to use. Packages you find floating around on GitHub that aren't on the Hub? Take a closer look at what they actually do before dropping them into your project. --- ## How to install a package — the demo The video walks through installing dbt-utils and using it to generate surrogate keys. Here's the workflow: ### 1. Create packages.yml At the root of your dbt project (same level as `dbt_project.yml`), create a file called `packages.yml`. Declare the package and pin the version. ```yaml packages: - package: dbt-labs/dbt_utils version: 1.1.1 ``` ### 2. Run `dbt deps` This downloads and installs the package. After it runs, two things appear: - A `package-lock.yml` file — contains a hash of exactly what was installed. Commit this to version control so everyone on your team gets the same versions. - A `dbt_packages/` directory — this is where the installed package code lives. It's git-ignored by default (you don't want to commit other people's source code into your repo), but you can browse it if you're curious how the macros work. ### 3. Use it Once installed, the package's macros are immediately available. You call them with the standard Jinja syntax, prefixing with the package name. 
**Before (manual surrogate key):**

```sql
select
    -- Manual concatenation approach
    concat(
        cast(vendorid as string),
        '-',
        cast(lpep_pickup_datetime as string)
    ) as tripid,
    vendorid,
    -- the source column is lpep_pickup_datetime; rename it on the way out
    lpep_pickup_datetime as pickup_datetime
from {{ source('staging', 'green_tripdata') }}
```

**After (using dbt_utils.generate_surrogate_key):**

```sql
select
    -- Clean, cross-database macro
    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    vendorid,
    lpep_pickup_datetime as pickup_datetime
from {{ source('staging', 'green_tripdata') }}
```

That's it. The macro handles the rest: it casts the columns to strings, replaces nulls with a placeholder, concatenates everything, and wraps the result in an MD5 hash, all compiled to the right SQL dialect for whatever warehouse you're targeting.

> 📄 [dbt deps command - docs](https://docs.getdbt.com/reference/commands/deps)

================================================
FILE: 04-analytics-engineering/class_notes/4_6_1_dbt_commands.md
================================================
# DE Zoomcamp 4.6.1 - dbt Commands

> 📄 Video: [dbt Commands](https://www.youtube.com/watch?v=t4OeWHW3SsA)
> 📄 Official docs: [dbt command reference](https://docs.getdbt.com/reference/dbt-commands)
> 📄 Selection syntax: [Node selection syntax](https://docs.getdbt.com/reference/node-selection/syntax)

We've been using dbt commands throughout the series without really stopping to talk about all of them. This video is the full tour: every command you'll actually use, plus the flags that make them powerful. Good one to bookmark.

---

## The setup commands: run these once (or when needed)

### dbt init

Creates your dbt project from scratch. Generates the full directory structure: `models/`, `seeds/`, `snapshots/`, `tests/`, `analyses/`, all of it. You only ever run this once, at the very start.

### dbt debug

Checks that your `profiles.yml` is valid and that dbt can actually connect to your warehouse. Run this whenever you're setting up a new environment or something feels off with your connection.

### dbt deps

Installs packages from your `packages.yml`.
We covered this in 4.5.3 — just know it lives here in the command lineup too. ### dbt clean Deletes the directories listed under `clean-targets` in your `dbt_project.yml`. By default that's `target/` and `dbt_packages/`. Useful for a fresh start, but remember you'll need to run `dbt deps` again after cleaning if you deleted `dbt_packages/`. You can add other directories to `clean-targets` if you want. > 📄 [dbt clean — docs](https://docs.getdbt.com/reference/commands/clean) --- ## The feature-specific commands These are tied to specific dbt features rather than being general-purpose. ### dbt seed Loads all the CSVs in your `seeds/` directory into the warehouse. Quick and simple — great for reference data or small lookup tables. ### dbt snapshot Runs any snapshots you've defined in your project. Snapshots are dbt's way of tracking how source data changes over time (think SCD Type 2). Not something you use every day, but it's there when you need it. ### dbt source freshness Checks whether your source data is stale. If you've defined `freshness` blocks in your source YAML (we covered this in 4.5.2), this is the command that actually runs the check. ### dbt docs generate / dbt docs serve `dbt docs generate` compiles your YAML documentation, model code, and warehouse metadata into a `catalog.json` artifact in `target/`. `dbt docs serve` spins up a local website (localhost:8080) so you can browse it. On dbt Cloud, `docs serve` isn't needed — it's handled automatically. For dbt Core users, finding a scalable way to host that docs site is something you'll need to sort out yourself. > 📄 [dbt docs commands — docs](https://docs.getdbt.com/reference/commands/cmd-docs) --- ## The big four — these are your daily drivers ### dbt compile Looks like it's doing nothing, but it's actually super useful. Takes all your models — with their Jinja, `ref()`, `source()` calls and everything — and outputs the fully resolved SQL into `target/compiled/`. 
No data moves, nothing hits the warehouse. It's just pure SQL sitting there for you to inspect. Why bother? Two reasons. First, it's the fastest way to catch Jinja errors — way quicker than waiting for a full `dbt run`. Second, it's completely free — no compute, no warehouse cost. Good habit to run after making changes. > 📄 [dbt compile — docs](https://docs.getdbt.com/reference/commands/compile) ### dbt run Materializes every model in your project. Views become views, tables become tables, incremental models get incremental logic applied — whatever you configured. Models run in dependency order, so dbt figures out the sequence for you. This is your go-to during active development when you just want to see your models built. > 📄 [dbt run — docs](https://docs.getdbt.com/reference/commands/run) ### dbt test Runs all the tests in your project — generic tests, singular tests, unit tests, all of it. Reports pass/fail at the end. Nothing gets built here, it just validates what's already in the warehouse. > 📄 [dbt test — docs](https://docs.getdbt.com/reference/commands/test) ### dbt build ⭐ The most important command. It's a smart combination of `dbt run` + `dbt test` + `dbt seed` + `dbt snapshot`, all in one. But it's not just running them sequentially — it's DAG-aware. It knows the right order, and if something fails along the way, it skips everything downstream of that failure rather than wasting compute on models that are going to break anyway. This is what you want for CI, production runs, or any time you need confidence that your whole project is solid. > 📄 [dbt build — docs](https://docs.getdbt.com/reference/commands/build) ### dbt retry If a `dbt build` or `dbt run` fails partway through, don't just re-run the whole thing from scratch. `dbt retry` re-executes from the point of failure by reading the `run_results.json` file from the previous run. It automatically identifies which nodes failed and re-runs those nodes plus everything downstream of them. 
How it works:

- dbt looks at `target/run_results.json` from the last command
- It identifies failed nodes and skipped nodes (anything downstream of a failure)
- It re-runs only those nodes, reusing the same selection criteria from the original command
- If the previous command completed successfully, `dbt retry` finishes as a no-op

Saves a lot of time on big projects, especially when a single model fails deep in the DAG.

---

## Flags: the important ones

### --help / -h

Works on any command. `dbt --help` gives you the full list, `dbt run --help` gives you flags specific to `run`. Standard stuff, but worth knowing it's there.

### --version / -V

Tells you which version of dbt you have installed. Also lets you know if there's an update available.

### --full-refresh / -f

Used with `dbt run` or `dbt build`. When you have an incremental model, a normal run only processes new rows and appends or merges them into the existing table. `--full-refresh` drops the whole thing and rebuilds from scratch. Handy when historical data has changed, you've got duplicates, or you just want to make sure everything is clean. Most teams do this on a regular schedule (maybe once a month) just to keep things tidy.

```bash
dbt run --full-refresh
```

### --fail-fast / -x

Stops the run as soon as the first resource fails. Normally dbt keeps going after a failure, building everything that doesn't depend on the failed node and reporting all errors at the end; with `--fail-fast` it exits immediately instead. Note that this is about failures, not warnings: turning warnings into errors is the job of the separate `--warn-error` flag. Good for CI or any time you want quick feedback. Better to fail loud than to be permissive and find surprises later.

### --target / -t

Controls which profile target dbt runs against. By default dbt uses the `target:` set in your profile, which is usually `dev`. But you can override it:

```bash
dbt run --target prod
```

Works with `dbt run`, `dbt build`, `dbt test`, `dbt snapshot`, basically any command that touches the warehouse. Best practice: developers work in `dev`, production runs use `--target prod`.

### --select / -s

This is the big one. Lets you run only specific parts of your project instead of everything.
There are a few ways to use it: **By model name** — just give it the model name (no `.sql` needed): ```bash dbt run --select stg_green_tripdata ``` **By directory path** — everything in a folder: ```bash dbt run --select models/staging ``` **By tag:** ```bash dbt run --select tag:nightly ``` **With graph operators (the + sign)** — this is where it gets really useful. The `+` lets you pull in upstream or downstream dependencies: ```bash # Run stg_green_tripdata and all upstream dependencies dbt run --select +stg_green_tripdata # Run fct_trips and all downstream dependencies dbt run --select fct_trips+ # Run dim_zones plus everything upstream AND downstream dbt run --select +dim_zones+ ``` - `+my_model` — builds `my_model` and everything upstream of it (all its ancestors) - `my_model+` — builds `my_model` and everything downstream of it (all its descendants) - `+my_model+` — both directions. Everything upstream, the model itself, and everything downstream > 📄 [Graph operators — docs](https://docs.getdbt.com/reference/node-selection/graph-operators) **With state selectors** — instead of guessing what changed, let dbt figure it out: ```bash dbt build --select state:modified+ --state ./prod-artifacts ``` - `state:new` — only files you just created - `state:modified` — anything that's changed since the last run - Add `+` after to include downstream dependencies of modified models How state comparison works: - You need artifacts from a **previous run** stored somewhere persistent (not the same `target/` directory you're currently writing to) - On **dbt Cloud**, this is handled automatically — production artifacts are stored and accessible for comparison - On **dbt Core**, you need to manually store artifacts (especially `manifest.json`) somewhere — a cloud bucket, a separate directory, version control, etc. 
- Point `--state` to where those previous artifacts live
- dbt compares your current code against those artifacts to determine what's new or modified

The key is that you're comparing against a *different environment's artifacts* (usually production) or a *previous point in time*, not against the directory you're currently building into. This lets you run only what's changed since your last production deployment, which is incredibly useful for CI/CD workflows. Storing those JSON artifacts persistently is also just good practice in general: you can use them to analyze how your project evolves over time.

> 📄 [Node selection syntax - docs](https://docs.getdbt.com/reference/node-selection/syntax)

================================================
FILE: 04-analytics-engineering/refreshers/SQL.md
================================================
# SQL Refresher

### Table of contents

- [Window Functions](#window-functions)
- [Row Number](#row-number)
- [Rank and Dense Rank](#rank-and-dense-rank)
- [Lag and Lead](#lag-and-lead)
- [Percentile Cont](#percentile-cont)
- [Common Table Expression](#common-table-expression)
- [dbt models and CTEs](#dbt-models-and-ctes)

## Window Functions

A window function performs a calculation across a set of table rows that are related to the current row within a specific "window" or subset of data. This is comparable to the type of calculation that can be done with an aggregate function (such as SUM(), AVG(), COUNT(), etc.). But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row; the rows retain their separate identities.

**Syntax:**

```sql
FUNCTION() OVER (PARTITION BY column_name ORDER BY column_name)
```

A window function always has two components: the function itself and the `OVER` clause.
The second component, the `OVER` clause, defines your window:

```sql
OVER (PARTITION BY column_name ORDER BY column_name)
```

The window describes how the rows are grouped and ordered when the function is applied:

- PARTITION BY: divides the result set into groups (optional).
- ORDER BY: defines the order of processing rows within the partition.

**Common Window Functions:**

Ranking Functions:

- ROW_NUMBER(): Assigns a unique row number within a partition.
- RANK(): Similar to ROW_NUMBER(), but assigns the same rank to duplicate values, skipping numbers.
- DENSE_RANK(): Like RANK(), but without gaps in numbering.

Aggregate Functions as Window Functions:

- SUM() OVER(): Computes a running total (when the window has an ORDER BY).
- AVG() OVER(): Computes a moving average (when the window has an ORDER BY).

Lag and Lead Functions:

- LAG(): Retrieves the value from a previous row.
- LEAD(): Retrieves the value from the next row.

### Row Number

ROW_NUMBER() does just what it sounds like: it displays the number of a given row. It starts at 1 and numbers the rows according to the ORDER BY part of the window statement. Using the PARTITION BY clause will allow you to begin counting 1 again in each partition.

**Syntax:**

```sql
ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name)
```

**Common Uses:**

- Removing Duplicates: You can use ROW_NUMBER() to identify duplicate rows and keep only one by filtering out rows with a row number greater than 1.
- Ranking Data: Used when ranking rows based on specific criteria but requiring unique row numbers.
- Selecting the Latest Record: Helps in selecting the most recent entry per category when combined with PARTITION BY.

**Example 1:**

```sql
SELECT
  total_amount,
  ROW_NUMBER() OVER (ORDER BY total_amount DESC) AS ranking
FROM `greentaxi_trips`
-- Order the output too: without this, LIMIT may return any 10 rows
ORDER BY ranking
LIMIT 10;
```

The query returns the 10 highest total_amount values from the table, along with a row number indicating their ranking.
| total_amount | ranking |
|--------------|---------|
| 4012.3       | 1       |
| 2878.3       | 2       |
| 2438.8       | 3       |
| 2156.3       | 4       |
| 2109.8       | 5       |
| 2017.3       | 6       |
| 1971.05      | 7       |
| 1958.8       | 8       |
| 1762.8       | 9       |
| 1600.8       | 10      |

The column generated with ROW_NUMBER() is temporary and does not modify the original table. It is just a calculation applied to the data in the query result.

**Example 2:**

Let's modify the previous query to add a partition by pickup location ID:

```sql
SELECT
  total_amount,
  PULocationID,
  ROW_NUMBER() OVER (PARTITION BY PULocationID ORDER BY total_amount DESC) AS ranking
FROM `greentaxi_trips`
LIMIT 10;
```

This SQL query assigns a ranking to each row based on total_amount in descending order within each PULocationID group. Because there is no outer ORDER BY, `LIMIT 10` returns an arbitrary slice of the result:

| total_amount | PULocationID | ranking |
|--------------|--------------|---------|
| 8.51         | 224          | 432     |
| 8.3          | 224          | 433     |
| 8.3          | 224          | 434     |
| 7.3          | 224          | 435     |
| 3.3          | 224          | 436     |
| 86.42        | 234          | 1       |
| 73.5         | 234          | 2       |
| 62.7         | 234          | 3       |
| 61.94        | 234          | 4       |
| 61.94        | 234          | 5       |

Notice how the ranking restarts at 1 when PULocationID changes from 224 to 234: that is the effect of the PARTITION BY clause.

### Rank and Dense Rank

ROW_NUMBER(), RANK(), and DENSE_RANK() are window functions used to assign a ranking to rows based on a specified order. However, they behave differently when there are duplicate values in the ranking column.

RANK() assigns a ranking, but skips numbers if there are ties. DENSE_RANK() is similar to RANK(), but does not skip numbers when there are ties. For example:

| Score | ROW_NUMBER() | RANK() | DENSE_RANK() |
|-------|--------------|--------|--------------|
| 95    | 1            | 1      | 1            |
| 90    | 2            | 2      | 2            |
| 90    | 3            | 2      | 2            |
| 85    | 4            | 4      | 3            |

### Lag and Lead

It can often be useful to compare rows to preceding or following rows. You can use LAG or LEAD to create columns that pull values from other rows without the need for a self-join. All you need to do is enter which column to pull from and how many rows away you'd like to do the pull.
LAG pulls from previous rows and LEAD pulls from following rows.

**Syntax:**

```sql
LAG(expression [, offset]) OVER (PARTITION BY partition_expression ORDER BY order_expression)
```

- expression: The column whose value you want to retrieve from the previous row.
- offset (optional): The number of rows back from the current row to look. The default is 1, meaning it looks at the immediate previous row.
- PARTITION BY (optional): Divides the result set into partitions to apply the function to each partition separately.
- ORDER BY: Specifies the order in which the rows are processed.

LEAD takes the same arguments, but looks forward instead of back.

**Example:**

```sql
SELECT
  lpep_pickup_datetime,
  total_amount,
  LAG(total_amount) OVER (ORDER BY lpep_pickup_datetime) AS prev_total_amount,
  LEAD(total_amount) OVER (ORDER BY lpep_pickup_datetime) AS next_total_amount
FROM `greentaxi_trips`
ORDER BY lpep_pickup_datetime
```

The query retrieves the lpep_pickup_datetime, total_amount, the previous trip's total_amount, and the next trip's total_amount.

| lpep_pickup_datetime    | total_amount | prev_total_amount | next_total_amount |
|-------------------------|--------------|-------------------|-------------------|
| 2008-12-31 23:33:38 UTC | 7.3          | 6.3               | 5.3               |
| 2008-12-31 23:42:31 UTC | 5.3          | 7.3               | 14.55             |
| 2008-12-31 23:47:51 UTC | 14.55        | 5.3               | 19.55             |
| 2008-12-31 23:57:46 UTC | 19.55        | 14.55             | 9.8               |
| 2009-01-01 00:00:00 UTC | 9.8          | 19.55             | 81.3              |

### Percentile Cont

Computes the specified percentile value for the value_expression, with linear interpolation.
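To make "linear interpolation" concrete, here is a minimal sketch in the same BigQuery dialect as the rest of this refresher. The inline array and the alias `x` are made up for illustration. With five sorted values, the 0.9 percentile sits at position 0.9 × (5 − 1) = 3.6 counting from zero, which is 60% of the way from the fourth value (40) to the fifth (50). Under that rule the result should be 40 + 0.6 × 10 = 46, a value that does not appear in the data itself.

```sql
-- Illustrative only: a tiny inline dataset instead of a real table.
-- The empty OVER () makes the window the whole input;
-- DISTINCT collapses the five identical results into one row.
SELECT DISTINCT
  PERCENTILE_CONT(x, 0.9) OVER () AS p90
FROM UNNEST([10, 20, 30, 40, 50]) AS x;
```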
**Syntax:**

```sql
PERCENTILE_CONT(value_expression, percentile) OVER (PARTITION BY partition_expression)
```

**Example:**

Let's calculate the 90th percentile of total_amount for each unique pickup location (PULocationID):

```sql
SELECT
  PULocationID,
  total_amount,
  PERCENTILE_CONT(total_amount, 0.9) OVER (PARTITION BY PULocationID) AS p90
FROM `greentaxi_trips`
```

- PERCENTILE_CONT(total_amount, 0.9): calculates the 90th percentile (p90) of total_amount.
- PARTITION BY PULocationID: groups the calculations by PULocationID, so the 90th percentile is computed separately for each location.

The query results look like this:

| PULocationID | total_amount | p90  |
|--------------|--------------|------|
| 224          | 17.3         | 51.9 |
| 224          | 20.67        | 51.9 |
| 224          | 21           | 51.9 |
| 224          | 26.06        | 51.9 |
| 224          | 27.13        | 51.9 |
| 224          | 40.14        | 51.9 |
| 224          | 55.46        | 51.9 |
| 224          | 25.74        | 51.9 |
| 224          | 27.02        | 51.9 |
| 224          | 37           | 51.9 |

The P90 value is essentially the amount below which 90% of the values fall. In this table, the P90 is constant at 51.9, which means that for location "224", 90% of the total amounts are below 51.9.

## Common Table Expression

A CTE, short for Common Table Expression, is like a query within a query. With the WITH statement, you can create temporary named result sets, making complex queries more readable and maintainable. These temporary results exist only for the duration of the main query.

CTEs and subqueries are both powerful tools and can be used to achieve similar goals, but they have different use cases and advantages. The main differences are that a CTE can be referenced several times within the same query and is usually more readable.

By declaring CTEs at the beginning of the query, you enhance code readability, enabling a clearer grasp of your analysis logic.
**Syntax:**

```sql
WITH cte_name AS (
  SELECT column1, column2
  FROM some_table
  WHERE condition
)
SELECT * FROM cte_name;
```

**Example:** Let's find the trip with the second largest total_amount.

```sql
WITH cte AS (
  SELECT
    lpep_pickup_datetime,
    total_amount,
    RANK() OVER (ORDER BY total_amount DESC) AS rank
  FROM `greentaxi_trips`
)
SELECT * FROM cte WHERE rank = 2;
```

The query starts with a Common Table Expression (CTE) named cte. We use the RANK() window function to assign a ranking (rank) to each row based on total_amount in descending order (from highest to lowest).

Then the main query reads from the CTE and keeps only the row ranked second: `SELECT * FROM cte WHERE rank = 2;`

Result of the query:

| lpep_pickup_datetime    | total_amount | rank |
|-------------------------|--------------|------|
| 2019-10-10 15:22:49 UTC | 2878.3       | 2    |

## dbt models and CTEs

CTEs and window functions will be used a lot in module 4 on dbt. Let's look at how they show up in a dbt model.

**Example:**

Suppose we start from the FHV dataset and we want to create a dbt model that enriches the data by calculating the trip duration and the 90th percentile.

```sql
WITH trip_duration_calculated AS (
  SELECT
    *,
    timestamp_diff(dropOff_datetime, pickup_datetime, second) AS trip_duration
  FROM `fhv_trips`
)
SELECT
  PUlocationID,
  trip_duration,
  PERCENTILE_CONT(trip_duration, 0.90) OVER (PARTITION BY PUlocationID) AS trip_duration_p90
FROM trip_duration_calculated
```

**Step 1: Understanding the CTE**

The WITH clause creates a CTE named trip_duration_calculated. This CTE acts as a temporary table that contains all columns from the fhv_trips table. Additionally, it calculates the trip duration in seconds for each ride.

**Step 2: Main Query using the CTE and Window Function**

This query computes the 90th percentile of trip duration for each PUlocationID using a window function. The PARTITION BY PUlocationID clause ensures that the percentile calculation is performed separately for each unique PUlocationID.
The 90th percentile means that 90% of the trips have a duration equal to or below this value.

**Query result looks like this:**

| PUlocationID | trip_duration | trip_duration_p90 |
|--------------|---------------|-------------------|
| 190          | 451           | 2170.0            |
| 190          | 1373          | 2170.0            |
| 190          | 817           | 2170.0            |
| 190          | 589           | 2170.0            |
| 190          | 1648          | 2170.0            |
| 32           | 546           | 1988.0            |
| 32           | 151           | 1988.0            |
| 32           | 1752          | 1988.0            |
| 32           | 2426          | 1988.0            |
| 32           | 888           | 1988.0            |

- For PUlocationID = 190, 90% of trips have a duration ≤ 2170.0 seconds.
- For PUlocationID = 32, 90% of trips have a duration ≤ 1988.0 seconds.

================================================
FILE: 04-analytics-engineering/setup/cloud_setup.md
================================================
# Cloud Setup Guide

This guide walks you through setting up dbt to work with the BigQuery data warehouse you created in Module 3.
[![dbt](https://img.shields.io/badge/dbt-FF694B?style=for-the-badge&logo=dbt&logoColor=white)](https://www.getdbt.com/) [![BigQuery](https://img.shields.io/badge/BigQuery-4285F4?style=for-the-badge&logo=google-cloud&logoColor=white)](https://cloud.google.com/bigquery)
> [!NOTE] > This guide assumes you've completed [Module 3: Data Warehouse](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/03-data-warehouse) where you: > - Created a GCP project and enabled the BigQuery API > - Created a service account with BigQuery permissions > - Learned how to load data into BigQuery (in the `nytaxi` dataset) > > Module 4 uses **different data** than Module 3 (green and yellow taxi data for 2019-2020 instead of yellow-only 2024). You'll load the new data in [Step 1](#load-the-taxi-data) below. ## Step 1: Verify Your BigQuery Setup Before setting up dbt Cloud, confirm you have the required data and credentials from Module 3. ### Check Your Service Account You should already have a service account JSON key file from Module 3. Make sure it has these permissions: - **BigQuery Data Editor** - **BigQuery Job User** - **BigQuery User** If you need to create a new service account or download a new key, follow the instructions below. ### How to Download Service Account JSON Key If you don't have the JSON key file or need to download a new one: 1. Go to [Google Cloud Console](https://console.cloud.google.com/) 2. Navigate to **IAM & Admin** > **Service Accounts** - Or use the search bar and type "Service Accounts" 3. Find your service account in the list - It should look like: `service-account-name@project-id.iam.gserviceaccount.com` - If you don't have a service account yet, click **+ CREATE SERVICE ACCOUNT** and: - Enter a name (e.g., `dbt-bigquery-service-account`) - Click **CREATE AND CONTINUE** - Add these roles: - **BigQuery Admin** (or at minimum: BigQuery Data Editor, BigQuery Job User, BigQuery User) - Click **CONTINUE** > **DONE** 4. Click on your service account name to open its details 5. Go to the **KEYS** tab 6. Click **ADD KEY** > **Create new key** 7. Select **JSON** as the key type 8. Click **CREATE** 9. 
The JSON key file will automatically download to your computer - Save it in a secure location - **Never commit this file to Git or share it publicly** - it contains credentials to access your GCP resources The downloaded JSON file will look something like this: ```json { "type": "service_account", "project_id": "your-project-id", "private_key_id": "...", "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n", "client_email": "service-account-name@project-id.iam.gserviceaccount.com", ... } ``` You'll use this JSON file in Step 4 to connect dbt Cloud to BigQuery. ### Load the Taxi Data This module uses **yellow and green taxi data for 2019-2020**, which is different from the data you loaded in Module 3. Using the same approach you learned in Module 3, load the following data into your BigQuery `nytaxi` dataset: - **Yellow taxi trip records** for all months of 2019 and 2020 - **Green taxi trip records** for all months of 2019 and 2020 > [!IMPORTANT] > Download the data from the [DataTalksClub NYC TLC Data repository](https://github.com/DataTalksClub/nyc-tlc-data/releases), **not** from the official NYC TLC website. The official site has been retroactively updated over the years, so its data differs from what the homework answers are based on. After loading, verify your data: 1. Go to [BigQuery Console](https://console.cloud.google.com/bigquery) 2. In the Explorer panel on the left, expand your project 3. You should see the `nytaxi` dataset 4. Expand the `nytaxi` dataset - you should see tables: - `green_tripdata` - `yellow_tripdata` ### Note Your Dataset Location When you created your BigQuery datasets in Module 3, you chose a location (e.g., `US`, `EU`, `us-central1`). You'll need to use the same location when configuring dbt. **To check your dataset location:** 1. In BigQuery Console, click on the `nytaxi` dataset 2. 
Look for **Data location** in the dataset details ## Step 2: Sign Up for dbt Platform dbt Platform is dbt's cloud-based development environment with a web IDE, scheduler, and collaboration features. dbt offers a **free Developer plan**. This should be more than enough to learn dbt and follow the course. ## Step 3: Create a New dbt Project Now you'll create a fresh dbt project from scratch in dbt Cloud. 1. Navigate to **Account settings** (gear icon in the top-right corner) and click **+ New Project** 2. Enter a project name: - Project name: `taxi_rides_ny` 3. Click **Continue** ## Step 4: Configure BigQuery Connection After clicking **Continue** in the previous step, dbt Cloud will prompt you to configure your data warehouse connection. > [!TIP] > If you're not automatically taken to the connection setup, you can also configure it from **Account settings** > **Projects** > **taxi_rides_ny** > **Connection**. ### Upload Service Account JSON 1. For the connection type, select **BigQuery** 2. Click **Upload a Service Account JSON file** 3. Select the service account JSON key file from Module 3 4. dbt will automatically extract: - Your GCP project ID - Authentication credentials ### Configure Connection Settings 1. **Dataset**: Enter `dbt_prod` - This is the base schema name where dbt will create datasets - dbt will organize your models into schemas like: - `dbt_prod_staging` - for staging models - `dbt_prod_intermediate` - for intermediate models - `dbt_prod_marts` - for final analytics tables 2. **Location**: Select the same location as your `nytaxi` dataset from Module 3 - Example: `US`, `EU`, or `us-central1` - **This must match your nytaxi dataset location** - You can find this under **Optional Settings** or **Advanced Settings** depending on your UI version 3. **Timeout**: `300` seconds 4. **Maximum Bytes Billed**: (optional) - Leave blank for unlimited, OR - Set a limit like `1000000000` (1 GB) to prevent runaway queries ### Test the Connection 1. 
Click **Test Connection** 2. You should see a success message: "Connection test succeeded" 3. Click **Continue** ## Step 5: Set Up Your Repository dbt Cloud needs a Git repository to store your project code. You have two options: - Let dbt Manage the Repository (Recommended for Beginners) - Connect Your Own GitHub Repository (Recommended for Production) It doesn't matter which one you prefer for this course. ## Step 6: Verify Your Development Environment ### What Are Environments in dbt? In dbt, **environments** define different contexts where your data transformations run: - **Development Environment**: Your personal workspace for building and testing models - Uses your personal credentials - Creates temporary schemas with your name (e.g., `dbt_`) - Changes only affect your work, not production - Used when working in the dbt Cloud IDE - **Deployment Environment**: The production workspace where final models run on schedule - Uses service account credentials - Creates production schemas (e.g., `dbt_prod_staging`, `dbt_prod_marts`) - Used by scheduled jobs that keep your data warehouse updated Think of it like having a draft folder (development) and a published folder (deployment) for your analytics code. ### Check Your Development Environment dbt Cloud **automatically creates a development environment** when you set up a project. You don't need to create one manually. To verify it was created: 1. Navigate to **Deploy** > **Environments** in the top navigation bar 2. You should see a **Development** environment already listed ### Customize Your Development Credentials (Optional) If you need to change how dbt connects to BigQuery during development, or adjust your development schema: 1. Click your profile icon (bottom-left corner) > **Your Profile** > **Credentials** 2. Select the credential linked to your project 3. 
From here you can update: - **Development Schema**: Where your personal development models will be created - dbt automatically suggests: `dbt_` (e.g., `dbt_john_smith`) - This schema is separate from production (`dbt_prod`) - **Target Name**: Leave as `dev` (default) ## Step 7: Start Developing Once your project, connection, and repository are configured, you're ready to start building dbt models. 1. Click **Start developing in the Studio IDE** - If you don't see this option, navigate to **Develop** in the top navigation bar 2. dbt Cloud will initialize your workspace (this may take a minute) 3. Once the IDE loads, you'll have a fresh project ready for development! ## Additional Resources * [BigQuery Documentation](https://cloud.google.com/bigquery/docs) * [dbt Documentation](https://docs.getdbt.com/docs/cloud/about-cloud/dbt-cloud-features) * [BigQuery Best Practices](https://cloud.google.com/bigquery/docs/best-practices) * [NYC Taxi Data Dictionary](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) ================================================ FILE: 04-analytics-engineering/setup/duckdb_troubleshooting.md ================================================ # Troubleshooting DuckDB Out of Memory Errors If you're getting `Out of Memory` errors while running dbt build commands, don't panic. This is a common issue, especially on machines with limited RAM. This guide explains why it happens and what you can do about it. ## Why does this happen? DuckDB is an **in-process database**, which means it runs inside your computer's memory (RAM) rather than on a remote server. The NYC taxi dataset we use in this project contains **tens of millions of rows** across 24 months of yellow and green taxi data. When dbt builds models, DuckDB needs to load, transform, and write this data (all using your local RAM). 
Some operations are more memory-intensive than others:

| Operation | Why it's expensive | Where it happens |
|---|---|---|
| `QUALIFY` with window functions | Requires sorting and partitioning the entire dataset in memory | `int_trips.sql` (deduplication) |
| `UNION ALL` on large tables | Combines two large datasets into one | `int_trips_unioned.sql` |
| Surrogate key generation (`generate_surrogate_key`) | Computes hashes across the full dataset | `int_trips.sql` |
| `JOIN` on large fact tables | Expands memory footprint when enriching trips with zones | `fct_trips.sql` |

## Check your available RAM

Before troubleshooting, know what you're working with. You can usually find the amount of installed RAM in your operating system's settings.

As a rule of thumb:

- **4 GB RAM**: You will very likely hit OOM. Consider using GitHub Codespaces or the Cloud Setup instead.
- **8 GB RAM**: You might hit OOM on some models. Adjust memory settings or use GitHub Codespaces.
- **16+ GB RAM**: You should be fine with default settings.

## Option A: Use GitHub Codespaces or Cloud Setup

If your local machine doesn't have enough RAM, the easiest solution is to avoid running DuckDB locally altogether.

### GitHub Codespaces

Run the project in a **GitHub Codespace**. The free tier includes machines with **4 cores / 8 GB RAM**, and **8 cores / 16 GB RAM** is available within the free monthly quota for personal accounts. A 16 GB machine can comfortably run this entire project without any of the workarounds below.

To get started:

1. Go to the [course repository on GitHub](https://github.com/DataTalksClub/data-engineering-zoomcamp).
2. Click **Code** > **Codespaces** > **Create codespace on main**.
3. Select the **8-core** machine type for the best experience.

Codespaces come with Python, pip, and git pre-installed, so setup is minimal.

### Cloud Setup (BigQuery)

Alternatively, use the **Cloud Setup (BigQuery)** path. BigQuery runs on Google's servers, so your local RAM doesn't matter.
See the [Cloud Setup Guide](cloud_setup.md).

## Option B: Make it work on your local machine

If you prefer to run the project locally, follow the steps below to reduce memory usage.

### Step 1: Adjust DuckDB memory settings in `profiles.yml`

Your `~/.dbt/profiles.yml` controls how much memory DuckDB can use. Here's what you can tune:

- **`memory_limit`**: By default, DuckDB will try to use up to 80% of your system's RAM. That sounds reasonable, but your operating system, browser, IDE, and other apps also need memory. If DuckDB claims too much, the OS may kill the process, and that's your OOM error. Setting an explicit limit (roughly **50% of your total RAM**) leaves enough room for everything else. So if you have 8 GB, try `'4GB'`.

- **`threads`**: This controls how many **dbt models** are built in parallel. Lowering `threads` to `1` means fewer concurrent models, which reduces overall memory pressure.

- **`preserve_insertion_order: false`**: Tells DuckDB it doesn't need to maintain row order, which saves memory.

### Step 2: Use `dbt retry` after a failure

If your `dbt build` fails partway through, you **don't need to rebuild everything from scratch**. Use:

```bash
dbt retry
```

This command picks up where the last run left off, only running the models that failed or were skipped. This is very useful when an OOM error kills a single model: fix the issue, then retry without re-running the models that already succeeded.

### Step 3: Build models selectively with `--select`

Instead of building the entire project at once, build one model at a time to reduce peak memory usage:

```bash
dbt build --select stg_yellow_tripdata --target prod
dbt build --select stg_green_tripdata --target prod
dbt build --select int_trips_unioned --target prod
dbt build --select int_trips --target prod
dbt build --select fct_trips --target prod
```

This way, DuckDB only needs to handle one model at a time.
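
If you'd rather not type the Step 3 commands by hand, the same one-model-at-a-time sequence can be scripted. The following is only a minimal sketch: it assumes `dbt` is on your `PATH` and that you run it from inside `taxi_rides_ny/`. The helper names `build_command` and `build_sequentially` are made up for this illustration and are not part of dbt or of the course code.

```python
import subprocess

# Models in the same order as the Step 3 commands (upstream models first).
MODELS = [
    "stg_yellow_tripdata",
    "stg_green_tripdata",
    "int_trips_unioned",
    "int_trips",
    "fct_trips",
]


def build_command(model, target="prod"):
    """Return the `dbt build` invocation for a single model as an argument list."""
    return ["dbt", "build", "--select", model, "--target", target]


def build_sequentially(models=MODELS, target="prod", runner=subprocess.run):
    """Build each model in turn and stop at the first failure.

    Returns the name of the model that failed, or None if every build succeeded.
    `runner` is injectable (hypothetical convenience) so the loop can be
    exercised without actually calling dbt.
    """
    for model in models:
        result = runner(build_command(model, target))
        if result.returncode != 0:
            return model
    return None
```

Calling `build_sequentially()` runs the five builds in order and returns the first model that failed, so you know where to adjust settings before trying again.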
### Step 4: Leverage incremental models

The `fct_trips` model in this project is already configured as **incremental**. This means that after the first full build, subsequent runs only process **new records** instead of reprocessing the entire dataset.

If your first full build fails due to OOM but some models succeeded, use `dbt retry` (Step 2). Once `fct_trips` is built for the first time, future runs will be much lighter on memory.

## DuckDB performance best practices

These tips come from [DuckDB's official performance guide](https://duckdb.org/docs/guides/performance/environment.html):

1. **Close other applications**: Browsers, IDEs, and other apps compete for RAM. Close what you don't need before running `dbt build`.
2. **Use an SSD**: DuckDB spills to disk when it runs out of memory. An SSD makes this spill-to-disk process much faster than an HDD.
3. **Avoid running inside Docker** (if possible): Docker containers have memory limits that may be lower than your system's total RAM. If you must use Docker, increase the container's memory limit.

## Still stuck?

If you've tried everything above and still can't build the project, ask for help in the [course Slack channel](https://datatalks-club.slack.com/). Include your RAM, OS, and the exact error message.

================================================
FILE: 04-analytics-engineering/setup/local_setup.md
================================================
# Local Setup Guide

This guide walks you through setting up a local analytics engineering environment using DuckDB and dbt.
[![dbt Core](https://img.shields.io/badge/dbt-FF694B?style=for-the-badge&logo=dbt&logoColor=white)](https://www.getdbt.com/) [![DuckDB](https://img.shields.io/badge/DuckDB-FFF000?style=for-the-badge&logo=duckdb&logoColor=black)](https://duckdb.org/)
> [!NOTE]
> *This guide will explain how to do the setup manually. If you want an additional challenge, try to run this setup using Docker Compose or a Python virtual environment.*

**Important**: All dbt commands must be run from inside the `taxi_rides_ny/` directory.

The setup steps below will guide you through:

1. Installing the necessary tools
2. Configuring your connection to DuckDB
3. Loading the NYC taxi data
4. Verifying everything works

## Step 1: Install DuckDB

DuckDB is a fast, in-process SQL database that works great for local analytics workloads. To install DuckDB, follow the instructions on the [official site](https://duckdb.org/docs/installation) for your specific operating system.

> [!TIP]
> *You can install DuckDB in two ways. You can install the CLI or install the client API for your favorite programming language (in the case of Python, you can use `pip install duckdb`). I personally prefer installing the CLI, but either way is fine.*

## Step 2: Install dbt

```bash
pip install dbt-duckdb
```

This installs:

* `dbt-core`: The core dbt framework
* `dbt-duckdb`: The DuckDB adapter for dbt

## Step 3: Configure dbt Profile

Since this repository already contains a dbt project (`taxi_rides_ny/`), you don't need to run `dbt init`. Instead, you need to configure your dbt profile to connect to DuckDB.

### Create or Update `~/.dbt/profiles.yml`

The dbt profile tells dbt how to connect to your database.
Create or update the file `~/.dbt/profiles.yml` with the following content:

```yaml
taxi_rides_ny:
  target: dev
  outputs:
    # DuckDB Development profile
    dev:
      type: duckdb
      path: taxi_rides_ny.duckdb
      schema: dev
      threads: 1
      extensions:
        - parquet
      settings:
        memory_limit: '2GB'
        preserve_insertion_order: false

    # DuckDB Production profile
    prod:
      type: duckdb
      path: taxi_rides_ny.duckdb
      schema: prod
      threads: 1
      extensions:
        - parquet
      settings:
        memory_limit: '2GB'
        preserve_insertion_order: false

# Troubleshooting:
# - If you have less than 4GB RAM, try setting memory_limit to '1GB'
# - If you have 16GB+ RAM, you can increase to '4GB' for faster builds
# - Expected build time: 5-10 minutes on most systems
```

## Step 4: Download and Ingest Data

Now that your dbt profile is configured, let's load the taxi data into DuckDB. Navigate to the dbt project directory and run the ingestion script:

```python
import duckdb
import requests
from pathlib import Path

BASE_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"


def download_and_convert_files(taxi_type):
    data_dir = Path("data") / taxi_type
    data_dir.mkdir(exist_ok=True, parents=True)

    for year in [2019, 2020]:
        for month in range(1, 13):
            parquet_filename = f"{taxi_type}_tripdata_{year}-{month:02d}.parquet"
            parquet_filepath = data_dir / parquet_filename

            if parquet_filepath.exists():
                print(f"Skipping {parquet_filename} (already exists)")
                continue

            # Download CSV.gz file
            csv_gz_filename = f"{taxi_type}_tripdata_{year}-{month:02d}.csv.gz"
            csv_gz_filepath = data_dir / csv_gz_filename

            response = requests.get(f"{BASE_URL}/{taxi_type}/{csv_gz_filename}", stream=True)
            response.raise_for_status()

            with open(csv_gz_filepath, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)

            print(f"Converting {csv_gz_filename} to Parquet...")
            con = duckdb.connect()
            con.execute(f"""
                COPY (SELECT * FROM read_csv_auto('{csv_gz_filepath}'))
                TO '{parquet_filepath}' (FORMAT PARQUET)
            """)
            con.close()

            # Remove the CSV.gz file to save space
            csv_gz_filepath.unlink()
            print(f"Completed {parquet_filename}")


def update_gitignore():
    gitignore_path = Path(".gitignore")

    # Read existing content or start with empty string
    content = gitignore_path.read_text() if gitignore_path.exists() else ""

    # Add data/ if not already present
    if 'data/' not in content:
        with open(gitignore_path, 'a') as f:
            f.write('\n# Data directory\ndata/\n' if content else '# Data directory\ndata/\n')


if __name__ == "__main__":
    # Update .gitignore to exclude data directory
    update_gitignore()

    for taxi_type in ["yellow", "green"]:
        download_and_convert_files(taxi_type)

    con = duckdb.connect("taxi_rides_ny.duckdb")
    con.execute("CREATE SCHEMA IF NOT EXISTS prod")

    for taxi_type in ["yellow", "green"]:
        con.execute(f"""
            CREATE OR REPLACE TABLE prod.{taxi_type}_tripdata AS
            SELECT * FROM read_parquet('data/{taxi_type}/*.parquet', union_by_name=true)
        """)

    con.close()
```

This script downloads yellow and green taxi data from 2019-2020, creates the `prod` schema, and loads the raw data into DuckDB. The download may take several minutes depending on your internet connection.

## Step 5: Test the dbt Connection

Verify dbt can connect to your DuckDB database:

```bash
dbt debug
```

## Step 6: Install dbt Power User Extension (VS Code Users)

If you're using Visual Studio Code, install the **dbt Power User** extension to enhance your dbt development experience.

### What is dbt Power User?

dbt Power User is a VS Code extension that provides:

* SQL syntax highlighting and formatting for dbt models
* Inline column-level lineage visualization
* Auto-completion for dbt models, sources, and macros
* Interactive documentation preview
* Model compilation and execution directly from the editor

### Why Not Use the Official dbt Extension?

dbt Labs released an official VS Code extension called [dbt Extension](https://marketplace.visualstudio.com/items?itemName=dbtLabsInc.dbt) powered by the new dbt Fusion engine.
However, this extension **requires dbt Fusion** and does not support dbt Core.

Since we're using **dbt Core** with DuckDB for local development, we need the community-maintained **dbt Power User by AltimateAI** extension instead. This extension:

* Works seamlessly with dbt Core (not just dbt Cloud)
* Supports all dbt adapters, including DuckDB
* Is actively maintained and open source
* Provides a rich feature set for local development

### Installation

1. Open VS Code
2. Go to Extensions (Ctrl+Shift+X / Cmd+Shift+X)
3. Search for "dbt Power User"
4. Install **dbt Power User by AltimateAI** (not the dbt Labs version)

Alternatively, install it from the [VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user).

> [!NOTE]
> At this point, your local dbt environment is fully configured and ready to use. The next steps (running models, tests, and building documentation) will be covered in the tutorial videos.

## Additional Resources

* [DuckDB Documentation](https://duckdb.org/docs/)
* [dbt Documentation](https://docs.getdbt.com/)
* [dbt-duckdb Adapter](https://github.com/duckdb/dbt-duckdb)
* [NYC Taxi Data Dictionary](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/.gitignore
================================================
# you shouldn't commit these into source control
# these are the default directory names, adjust/add to fit your needs
target/
dbt_packages/
logs/
profiles.yml
.user.yml

# Data files for DuckDB
data/green_tripdata/
data/yellow_tripdata/
data/
*.duckdb
*.duckdb.wal
.duckdb_temp/

# Parquet data files
*.parquet

# Python artifacts
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Virtual environments
venv/
env/
ENV/
env.bak/
venv.bak/
.venv/

# PyCharm
.idea/

# VS Code
.vscode/

# Jupyter Notebook
.ipynb_checkpoints
*.ipynb

# pyenv
.python-version

# pytest
.pytest_cache/
.coverage
htmlcov/

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# GCP credentials and service account keys
*-key.json
*-keys.json
*key*.json
*credential*.json
*service-account*.json
*serviceaccount*.json
service-account.json
serviceaccount.json
gcp-*.json
google-*.json

# Environment variables
.env
.env.local
.env.*.local
*.env

dbt_internal_packages/

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/dbt_project.yml
================================================
name: 'taxi_rides_ny'
version: '1.0.0'

# Require a specific dbt version for reproducibility
require-dbt-version: [">=1.7.0", "<3.0.0"]

# This setting configures which "profile" dbt uses for this project.
profile: 'taxi_rides_ny'

# These configurations specify where dbt should look for different types of files.
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets:
  - "target"
  - "dbt_packages"

# Project-level variables
vars:
  # Date range for dev environment sampling
  dev_start_date: '2019-01-01'
  dev_end_date: '2019-02-01'

# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
models:
  taxi_rides_ny:
    staging:
      +materialized: view
    intermediate:
      +materialized: table
    marts:
      +materialized: table

flags:
  require_generic_test_arguments_property: true

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/macros/get_trip_duration_minutes.sql
================================================
{#
    Calculate trip duration in minutes from pickup and dropoff timestamps.

    Uses dbt's built-in cross-database datediff macro.
    This works seamlessly across DuckDB, BigQuery, Snowflake, Redshift, PostgreSQL, etc.
    Returns: Trip duration as a numeric value in minutes
#}

{% macro get_trip_duration_minutes(pickup_datetime, dropoff_datetime) %}
    {{ dbt.datediff(pickup_datetime, dropoff_datetime, 'minute') }}
{% endmacro %}

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/macros/get_vendor_data.sql
================================================
{#
    Macro to generate vendor_name column using Jinja dictionary.

    This approach works seamlessly across BigQuery, DuckDB, Snowflake, etc.
    by generating a CASE statement at compile time.

    Usage: {{ get_vendor_data('vendor_id') }}
    Returns: SQL CASE expression that maps vendor_id to vendor_name
#}

{% macro get_vendor_data(vendor_id_column) %}

{% set vendors = {
    1: 'Creative Mobile Technologies',
    2: 'VeriFone Inc.',
    4: 'Unknown/Other'
} %}

case {{ vendor_id_column }}
    {% for vendor_id, vendor_name in vendors.items() %}
    when {{ vendor_id }} then '{{ vendor_name }}'
    {% endfor %}
end

{% endmacro %}

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/macros/macros_properties.yml
================================================
macros:
  - name: get_trip_duration_minutes
    description: >
      Calculates trip duration in minutes from pickup and dropoff timestamps.
      This macro is cross-database compatible, supporting both DuckDB and BigQuery.
      Returns a numeric value representing the duration in minutes.
    arguments:
      - name: pickup_datetime
        type: timestamp
        description: The pickup timestamp
      - name: dropoff_datetime
        type: timestamp
        description: The dropoff timestamp

  - name: get_vendor_data
    description: >
      Generates a CASE statement that maps vendor_id to vendor_name.
      This macro is cross-database compatible and generates SQL at compile time
      using a Jinja dictionary. Supports vendor IDs: 1 (Creative Mobile Technologies),
      2 (VeriFone Inc.), 4 (Unknown/Other).
    arguments:
      - name: vendor_id_column
        type: integer
        description: The column name containing the vendor ID

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/macros/safe_cast.sql
================================================
{% macro safe_cast(column, data_type) %}
    {% if target.type == 'bigquery' %}
        safe_cast({{ column }} as {{ data_type }})
    {% else %}
        cast({{ column }} as {{ data_type }})
    {% endif %}
{% endmacro %}

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/intermediate/int_trips.sql
================================================
-- Enrich and deduplicate trip data
-- Demonstrates enrichment and surrogate key generation
-- Note: Data quality analysis available in analyses/trips_data_quality.sql

with unioned as (
    select * from {{ ref('int_trips_unioned') }}
),

payment_types as (
    select * from {{ ref('payment_type_lookup') }}
),

cleaned_and_enriched as (
    select
        -- Generate unique trip identifier (surrogate key pattern)
        {{ dbt_utils.generate_surrogate_key(['u.vendor_id', 'u.pickup_datetime', 'u.pickup_location_id', 'u.service_type']) }} as trip_id,

        -- Identifiers
        u.vendor_id,
        u.service_type,
        u.rate_code_id,

        -- Location IDs
        u.pickup_location_id,
        u.dropoff_location_id,

        -- Timestamps
        u.pickup_datetime,
        u.dropoff_datetime,

        -- Trip details
        u.store_and_fwd_flag,
        u.passenger_count,
        u.trip_distance,
        u.trip_type,

        -- Payment breakdown
        u.fare_amount,
        u.extra,
        u.mta_tax,
        u.tip_amount,
        u.tolls_amount,
        u.ehail_fee,
        u.improvement_surcharge,
        u.total_amount,

        -- Enrich with payment type description
        coalesce(u.payment_type, 0) as payment_type,
        coalesce(pt.description, 'Unknown') as payment_type_description

    from unioned u
    left join payment_types pt
        on coalesce(u.payment_type, 0) = pt.payment_type
)

select * from cleaned_and_enriched

-- Deduplicate: if multiple trips match (same vendor, second, location, service), keep first
qualify row_number() over(
    partition by vendor_id,
        pickup_datetime, pickup_location_id, service_type
    order by dropoff_datetime
) = 1

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/intermediate/int_trips_unioned.sql
================================================
-- Union green and yellow taxi data into a single dataset
-- Demonstrates how to combine data from multiple sources with slightly different schemas

with green_trips as (
    select
        vendor_id,
        rate_code_id,
        pickup_location_id,
        dropoff_location_id,
        pickup_datetime,
        dropoff_datetime,
        store_and_fwd_flag,
        passenger_count,
        trip_distance,
        trip_type,
        fare_amount,
        extra,
        mta_tax,
        tip_amount,
        tolls_amount,
        ehail_fee,
        improvement_surcharge,
        total_amount,
        payment_type,
        'Green' as service_type
    from {{ ref('stg_green_tripdata') }}
),

yellow_trips as (
    select
        vendor_id,
        rate_code_id,
        pickup_location_id,
        dropoff_location_id,
        pickup_datetime,
        dropoff_datetime,
        store_and_fwd_flag,
        passenger_count,
        trip_distance,
        cast(1 as integer) as trip_type,  -- Yellow taxis only do street-hail (code 1)
        fare_amount,
        extra,
        mta_tax,
        tip_amount,
        tolls_amount,
        cast(0 as numeric) as ehail_fee,  -- Yellow taxis don't have ehail_fee
        improvement_surcharge,
        total_amount,
        payment_type,
        'Yellow' as service_type
    from {{ ref('stg_yellow_tripdata') }}
)

select * from green_trips
union all
select * from yellow_trips

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/intermediate/schema.yml
================================================
models:
  - name: int_trips_unioned
    description: Union of green and yellow taxi trip data with normalized schema
    columns:
      - name: vendor_id
        description: Taxi technology provider ID
      - name: rate_code_id
        description: Rate code at end of trip (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)
      - name: pickup_location_id
        description: TLC Taxi Zone where trip started
      - name: dropoff_location_id
        description: TLC Taxi Zone where trip ended
      - name: pickup_datetime
        description: Timestamp when meter was engaged
      - name: dropoff_datetime
        description: Timestamp when meter was disengaged
      - name: store_and_fwd_flag
        description: Trip record stored in vehicle memory (Y/N)
      - name: passenger_count
        description: Number of passengers in the vehicle
      - name: trip_distance
        description: Trip distance in miles
      - name: trip_type
        description: Trip type (1=Street-hail, 2=Dispatch)
      - name: fare_amount
        description: Time and distance fare
      - name: extra
        description: Miscellaneous extras and surcharges
      - name: mta_tax
        description: MTA tax
      - name: tip_amount
        description: Tip amount (credit card only)
      - name: tolls_amount
        description: Total tolls paid
      - name: ehail_fee
        description: E-hail service fee
      - name: improvement_surcharge
        description: Improvement surcharge
      - name: total_amount
        description: Total amount charged to passenger
      - name: payment_type
        description: Payment method code
      - name: service_type
        description: Type of taxi service (Green or Yellow)

  - name: int_trips
    description: Cleaned, enriched, and deduplicated trip data ready for marts
    columns:
      - name: trip_id
        description: Unique trip identifier (surrogate key)
        data_tests:
          - unique
          - not_null
      - name: vendor_id
        description: Taxi technology provider ID
        data_tests:
          - not_null
      - name: service_type
        description: Type of taxi service (Green or Yellow)
        data_tests:
          - not_null
          - accepted_values:
              arguments:
                values: ['Green', 'Yellow']
      - name: rate_code_id
        description: Rate code at end of trip
      - name: pickup_location_id
        description: TLC Taxi Zone where trip started
      - name: dropoff_location_id
        description: TLC Taxi Zone where trip ended
      - name: pickup_datetime
        description: Timestamp when meter was engaged
        data_tests:
          - not_null
      - name: dropoff_datetime
        description: Timestamp when meter was disengaged
      - name: store_and_fwd_flag
        description: Trip record stored in vehicle memory (Y/N)
      - name: passenger_count
        description: Number of passengers in the vehicle
      - name: trip_distance
        description: Trip distance in miles
      - name: trip_type
        description: Trip type (1=Street-hail, 2=Dispatch)
      - name: fare_amount
        description: Time and distance fare
      - name: extra
        description: Miscellaneous extras and surcharges
      - name: mta_tax
        description: MTA tax
      - name: tip_amount
        description: Tip amount (credit card only)
      - name: tolls_amount
        description: Total tolls paid
      - name: ehail_fee
        description: E-hail service fee
      - name: improvement_surcharge
        description: Improvement surcharge
      - name: total_amount
        description: Total amount charged to passenger
        data_tests:
          - not_null
      - name: payment_type
        description: Payment method code
      - name: payment_type_description
        description: Human-readable payment method description

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/dim_vendors.sql
================================================
-- Dimension table for taxi technology vendors
-- Small static dimension defining vendor codes and their company names

with trips as (
    select * from {{ ref('fct_trips') }}
),

vendors as (
    select distinct
        vendor_id,
        {{ get_vendor_data('vendor_id') }} as vendor_name
    from trips
)

select * from vendors

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/dim_zones.sql
================================================
-- Dimension table for NYC taxi zones
-- This is a simple pass-through from the seed file, but having it as a model
-- allows for future enhancements (e.g., adding calculated fields, filtering)

select
    locationid as location_id,
    borough,
    zone,
    service_zone
from {{ ref('taxi_zone_lookup') }}

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/fct_trips.sql
================================================
{{
  config(
    materialized='incremental',
    unique_key='trip_id',
    incremental_strategy='merge',
    on_schema_change='append_new_columns'
  )
}}

-- Fact table containing all taxi trips enriched with zone information
-- This is a classic star schema design: fact table (trips) joined to dimension table (zones)
-- Materialized incrementally to handle large datasets efficiently

select
    -- Trip identifiers
    trips.trip_id,
    trips.vendor_id,
    trips.service_type,
    trips.rate_code_id,

    -- Location details (enriched with human-readable zone names from dimension)
    trips.pickup_location_id,
    pz.borough as pickup_borough,
    pz.zone as pickup_zone,
    trips.dropoff_location_id,
    dz.borough as dropoff_borough,
    dz.zone as dropoff_zone,

    -- Trip timing
    trips.pickup_datetime,
    trips.dropoff_datetime,
    trips.store_and_fwd_flag,

    -- Trip metrics
    trips.passenger_count,
    trips.trip_distance,
    trips.trip_type,
    {{ get_trip_duration_minutes('trips.pickup_datetime', 'trips.dropoff_datetime') }} as trip_duration_minutes,

    -- Payment breakdown
    trips.fare_amount,
    trips.extra,
    trips.mta_tax,
    trips.tip_amount,
    trips.tolls_amount,
    trips.ehail_fee,
    trips.improvement_surcharge,
    trips.total_amount,
    trips.payment_type,
    trips.payment_type_description

from {{ ref('int_trips') }} as trips
-- LEFT JOIN preserves all trips even if zone information is missing or unknown
left join {{ ref('dim_zones') }} as pz
    on trips.pickup_location_id = pz.location_id
left join {{ ref('dim_zones') }} as dz
    on trips.dropoff_location_id = dz.location_id

{% if is_incremental() %}
  -- Only process new trips based on pickup datetime
  where trips.pickup_datetime > (select max(pickup_datetime) from {{ this }})
{% endif %}

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/reporting/fct_monthly_zone_revenue.sql
================================================
-- Data mart for monthly revenue analysis by pickup zone and service type
-- This aggregation is optimized for business reporting and dashboards
-- Enables analysis of revenue trends across different zones and taxi types

select
    -- Grouping dimensions
    coalesce(pickup_zone, 'Unknown Zone') as pickup_zone,
    {% if target.type == 'bigquery' %}cast(date_trunc(pickup_datetime, month) as date)
    {% elif target.type == 'duckdb' %}date_trunc('month', pickup_datetime)
    {% endif %} as revenue_month,
    service_type,

    -- Revenue breakdown (summed by zone, month, and service type)
    sum(fare_amount) as revenue_monthly_fare,
    sum(extra) as revenue_monthly_extra,
    sum(mta_tax) as revenue_monthly_mta_tax,
    sum(tip_amount) as revenue_monthly_tip_amount,
    sum(tolls_amount) as revenue_monthly_tolls_amount,
    sum(ehail_fee) as revenue_monthly_ehail_fee,
    sum(improvement_surcharge) as revenue_monthly_improvement_surcharge,
    sum(total_amount) as revenue_monthly_total_amount,

    -- Additional metrics for operational analysis
    count(trip_id) as total_monthly_trips,
    avg(passenger_count) as avg_monthly_passenger_count,
    avg(trip_distance) as avg_monthly_trip_distance

from {{ ref('fct_trips') }}
group by pickup_zone, revenue_month, service_type

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/reporting/schema.yml
================================================
models:
  - name: fct_monthly_zone_revenue
    description: Monthly revenue aggregation by pickup zone and service type for business reporting
    data_tests:
      - dbt_utils.unique_combination_of_columns:
          arguments:
            combination_of_columns:
              - pickup_zone
              - revenue_month
              - service_type
    columns:
      - name: pickup_zone
        description: Pickup zone where revenue was generated
        data_tests:
          - not_null
      - name: revenue_month
        description: Month for revenue aggregation
        data_tests:
          - not_null
      - name: service_type
        description: Service type (Green or Yellow)
        data_tests:
          - not_null
          - accepted_values:
              arguments:
                values: ['Green', 'Yellow']
      - name: revenue_monthly_total_amount
        description: Monthly sum of total fares
        data_tests:
          - not_null
      - name: total_monthly_trips
        description: Count of trips in the month
        data_tests:
          - not_null

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/marts/schema.yml
================================================
models:
  - name: dim_zones
    description: Taxi zone dimension table with location details
    columns:
      - name: location_id
        description: Unique identifier for each taxi zone
        data_tests:
          - unique
          - not_null
      - name: borough
        description: NYC borough name
      - name: zone
        description: Specific zone name within the borough
      - name: service_zone
        description: Service zone classification

  - name: dim_vendors
    description: Taxi technology vendor dimension table
    columns:
      - name: vendor_id
        description: Unique vendor identifier
        data_tests:
          - unique
          - not_null
      - name: vendor_name
        description: Company name of the vendor

  - name: fct_trips
    description: Fact table with all taxi trips including trip and payment details
    config:
      contract:
        enforced: true
    columns:
      - name: trip_id
        description: Unique trip identifier
        data_type: string
        data_tests:
          - unique
          - not_null
      - name: vendor_id
        description: Taxi technology provider
        data_type: integer
        data_tests:
          - not_null
      - name: service_type
        description: Type of taxi service (Green or Yellow)
        data_type: string
        data_tests:
          - accepted_values:
              arguments:
                values: ['Green', 'Yellow']
          - not_null
      - name: rate_code_id
        description: Final rate code
        data_type: integer
      - name: pickup_location_id
        description: TLC Taxi Zone where trip started
        data_type: integer
        data_tests:
          - relationships:
              arguments:
                to: ref('dim_zones')
                field: location_id
      - name: pickup_borough
        description: NYC borough where trip started
        data_type: string
      - name: pickup_zone
        description: Specific zone where trip started
        data_type: string
      - name: dropoff_location_id
        description: TLC Taxi Zone where trip ended
        data_type: integer
        data_tests:
          - relationships:
              arguments:
                to: ref('dim_zones')
                field: location_id
      - name: dropoff_borough
        description: NYC borough where trip ended
        data_type: string
      - name: dropoff_zone
        description: Specific zone where trip ended
        data_type: string
      - name: pickup_datetime
        description: Timestamp when meter was engaged
        data_type: timestamp
        data_tests:
          - not_null
      - name: dropoff_datetime
        description: Timestamp when meter was disengaged
        data_type: timestamp
      - name: store_and_fwd_flag
        description: Trip record stored in vehicle memory (Y/N)
        data_type: string
      - name: passenger_count
        description: Number of passengers
        data_type: integer
      - name: trip_distance
        description: Trip distance in miles
        data_type: numeric
      - name: trip_type
        description: Trip type (1=Street-hail, 2=Dispatch)
        data_type: integer
      - name: trip_duration_minutes
        description: Trip duration in minutes (calculated using cross-database macro)
        data_type: bigint
      - name: fare_amount
        description: Time and distance fare
        data_type: numeric
      - name: extra
        description: Miscellaneous extras and surcharges
        data_type: numeric
      - name: mta_tax
        description: MTA tax
        data_type: numeric
      - name: tip_amount
        description: Tip amount (credit card only)
        data_type: numeric
      - name: tolls_amount
        description: Total tolls paid
        data_type: numeric
      - name: ehail_fee
        description: E-hail service fee
        data_type: numeric
      - name: improvement_surcharge
        description: Improvement surcharge
        data_type: numeric
      - name: total_amount
        description: Total amount charged
        data_type: numeric
        data_tests:
          - not_null
      - name: payment_type
        description: Payment method code
        data_type: integer
      - name: payment_type_description
        description: Human-readable payment method description
        data_type: string

================================================
FILE: 04-analytics-engineering/taxi_rides_ny/models/staging/schema.yml
================================================
models:
  - name: stg_green_tripdata
    description: >
      Staging model for green taxi trip data. This model standardizes column names
      and data types from the raw green_tripdata source, providing a clean foundation
      for downstream transformations.
    columns:
      - name: vendor_id
        description: Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)
data_tests: - not_null - name: rate_code_id description: Rate code at end of trip (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group) - name: pickup_location_id description: TLC Taxi Zone where the meter was engaged - name: dropoff_location_id description: TLC Taxi Zone where the meter was disengaged - name: pickup_datetime description: Date and time when the meter was engaged data_tests: - not_null - name: dropoff_datetime description: Date and time when the meter was disengaged - name: store_and_fwd_flag description: Flag indicating if trip record was held in vehicle memory (Y/N) - name: passenger_count description: Number of passengers in the vehicle (driver-entered value) - name: trip_distance description: Trip distance in miles reported by the taximeter - name: trip_type description: Code for trip type (1=Street-hail, 2=Dispatch) - name: fare_amount description: Time and distance fare calculated by the meter - name: extra description: Miscellaneous extras and surcharges (rush hour, overnight) - name: mta_tax description: $0.50 MTA tax automatically triggered based on meter rate - name: tip_amount description: Tip amount (credit card tips only, cash tips not included) - name: tolls_amount description: Total amount of all tolls paid during the trip - name: ehail_fee description: E-hail service fee - name: improvement_surcharge description: Improvement surcharge assessed on hailed trips - name: total_amount description: Total amount charged to passengers (does not include cash tips) - name: payment_type description: Payment method code (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided) - name: stg_yellow_tripdata description: > Staging model for yellow taxi trip data. This model standardizes column names and data types from the raw yellow_tripdata source, providing a clean foundation for downstream transformations. 
columns: - name: vendor_id description: Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.) data_tests: - not_null - name: rate_code_id description: Rate code at end of trip (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group) - name: pickup_location_id description: TLC Taxi Zone where the meter was engaged - name: dropoff_location_id description: TLC Taxi Zone where the meter was disengaged - name: pickup_datetime description: Date and time when the meter was engaged data_tests: - not_null - name: dropoff_datetime description: Date and time when the meter was disengaged - name: store_and_fwd_flag description: Flag indicating if trip record was held in vehicle memory (Y/N) - name: passenger_count description: Number of passengers in the vehicle (driver-entered value) - name: trip_distance description: Trip distance in miles reported by the taximeter - name: fare_amount description: Time and distance fare calculated by the meter - name: extra description: Miscellaneous extras and surcharges (rush hour, overnight) - name: mta_tax description: $0.50 MTA tax automatically triggered based on meter rate - name: tip_amount description: Tip amount (credit card tips only, cash tips not included) - name: tolls_amount description: Total amount of all tolls paid during the trip - name: improvement_surcharge description: Improvement surcharge assessed on hailed trips - name: total_amount description: Total amount charged to passengers (does not include cash tips) - name: payment_type description: Payment method code (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided) ================================================ FILE: 04-analytics-engineering/taxi_rides_ny/models/staging/sources.yml ================================================ sources: - name: raw description: Raw taxi trip data from NYC TLC database: | {%- if target.type == 'bigquery' -%} {{ env_var('GCP_PROJECT_ID', 'please-add-your-gcp-project-id-here') 
}} {%- else -%} taxi_rides_ny {%- endif -%} schema: | {%- if target.type == 'bigquery' -%} nytaxi {%- else -%} prod {%- endif -%} tables: - name: green_tripdata description: Raw green taxi trip records columns: - name: vendorid description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.) - Note: Raw data may contain nulls, filtered in staging" - name: lpep_pickup_datetime description: Date and time when the meter was engaged - name: lpep_dropoff_datetime description: Date and time when the meter was disengaged - name: passenger_count description: Number of passengers in the vehicle - name: trip_distance description: Trip distance in miles - name: pulocationid description: TLC Taxi Zone where the meter was engaged - name: dolocationid description: TLC Taxi Zone where the meter was disengaged - name: ratecodeid description: Rate code (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group) - name: store_and_fwd_flag description: Trip record held in vehicle memory (Y/N) - name: payment_type description: Payment method (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided) - name: fare_amount description: Time and distance fare - name: extra description: Miscellaneous extras and surcharges - name: mta_tax description: MTA tax - name: tip_amount description: Tip amount (credit card only) - name: tolls_amount description: Total tolls paid - name: improvement_surcharge description: Improvement surcharge - name: total_amount description: Total amount charged - name: trip_type description: Trip type (1=Street-hail, 2=Dispatch) - name: ehail_fee description: E-hail fee config: loaded_at_field: lpep_pickup_datetime - name: yellow_tripdata description: Raw yellow taxi trip records columns: - name: vendorid description: "Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.) 
- Note: Raw data may contain nulls, filtered in staging" - name: tpep_pickup_datetime description: Date and time when the meter was engaged - name: tpep_dropoff_datetime description: Date and time when the meter was disengaged - name: passenger_count description: Number of passengers in the vehicle - name: trip_distance description: Trip distance in miles - name: pulocationid description: TLC Taxi Zone where the meter was engaged - name: dolocationid description: TLC Taxi Zone where the meter was disengaged - name: ratecodeid description: Rate code (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group) - name: store_and_fwd_flag description: Trip record held in vehicle memory (Y/N) - name: payment_type description: Payment method (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided) - name: fare_amount description: Time and distance fare - name: extra description: Miscellaneous extras and surcharges - name: mta_tax description: MTA tax - name: tip_amount description: Tip amount (credit card only) - name: tolls_amount description: Total tolls paid - name: improvement_surcharge description: Improvement surcharge - name: total_amount description: Total amount charged config: loaded_at_field: tpep_pickup_datetime config: freshness: warn_after: {count: 24, period: hour} error_after: {count: 48, period: hour} ================================================ FILE: 04-analytics-engineering/taxi_rides_ny/models/staging/stg_green_tripdata.sql ================================================ with source as ( select * from {{ source('raw', 'green_tripdata') }} ), renamed as ( select -- identifiers cast(vendorid as integer) as vendor_id, {{ safe_cast('ratecodeid', 'integer') }} as rate_code_id, cast(pulocationid as integer) as pickup_location_id, cast(dolocationid as integer) as dropoff_location_id, -- timestamps cast(lpep_pickup_datetime as timestamp) as pickup_datetime, -- lpep = Livery Passenger Enhancement Program (green taxis)
cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime, -- trip info cast(store_and_fwd_flag as string) as store_and_fwd_flag, cast(passenger_count as integer) as passenger_count, cast(trip_distance as numeric) as trip_distance, {{ safe_cast('trip_type', 'integer') }} as trip_type, -- payment info cast(fare_amount as numeric) as fare_amount, cast(extra as numeric) as extra, cast(mta_tax as numeric) as mta_tax, cast(tip_amount as numeric) as tip_amount, cast(tolls_amount as numeric) as tolls_amount, cast(ehail_fee as numeric) as ehail_fee, cast(improvement_surcharge as numeric) as improvement_surcharge, cast(total_amount as numeric) as total_amount, {{ safe_cast('payment_type', 'integer') }} as payment_type from source -- Filter out records with null vendor_id (data quality requirement) where vendorid is not null ) select * from renamed -- Sample records for dev environment using deterministic date filter {% if target.name == 'dev' %} where pickup_datetime >= '2019-01-01' and pickup_datetime < '2019-02-01' {% endif %} ================================================ FILE: 04-analytics-engineering/taxi_rides_ny/models/staging/stg_yellow_tripdata.sql ================================================ with source as ( select * from {{ source('raw', 'yellow_tripdata') }} ), renamed as ( select -- identifiers (standardized naming for consistency across yellow/green) cast(vendorid as integer) as vendor_id, cast(ratecodeid as integer) as rate_code_id, cast(pulocationid as integer) as pickup_location_id, cast(dolocationid as integer) as dropoff_location_id, -- timestamps (standardized naming) cast(tpep_pickup_datetime as timestamp) as pickup_datetime, -- tpep = Taxicab Passenger Enhancement Program (yellow taxis) cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime, -- trip info cast(store_and_fwd_flag as string) as store_and_fwd_flag, cast(passenger_count as integer) as passenger_count, cast(trip_distance as numeric) as trip_distance, -- payment info 
cast(fare_amount as numeric) as fare_amount, cast(extra as numeric) as extra, cast(mta_tax as numeric) as mta_tax, cast(tip_amount as numeric) as tip_amount, cast(tolls_amount as numeric) as tolls_amount, cast(improvement_surcharge as numeric) as improvement_surcharge, cast(total_amount as numeric) as total_amount, cast(payment_type as integer) as payment_type from source -- Filter out records with null vendor_id (data quality requirement) where vendorid is not null ) select * from renamed -- Sample records for dev environment using deterministic date filter {% if target.name == 'dev' %} where pickup_datetime >= '2019-01-01' and pickup_datetime < '2019-02-01' {% endif %} ================================================ FILE: 04-analytics-engineering/taxi_rides_ny/package-lock.yml ================================================ packages: - name: dbt_utils package: dbt-labs/dbt_utils version: 1.3.3 - name: codegen package: dbt-labs/codegen version: 0.14.0 sha1_hash: 01f31e0d658d76121f50e62b998342ebf138df11 ================================================ FILE: 04-analytics-engineering/taxi_rides_ny/packages.yml ================================================ packages: - package: dbt-labs/dbt_utils version: [">=1.3.0", "<2.0.0"] - package: dbt-labs/codegen version: [">=0.14.0", "<1.0.0"] ================================================ FILE: 04-analytics-engineering/taxi_rides_ny/seeds/seeds_properties.yml ================================================ seeds: - name: taxi_zone_lookup description: > Taxi Zones roughly based on NYC Department of City Planning's Neighborhood Tabulation Areas (NTAs) and are meant to approximate neighborhoods, so you can see which neighborhood a passenger was picked up in, and which neighborhood they were dropped off in. Includes associated service_zone (EWR, Boro Zone, Yellow Zone) - name: payment_type_lookup description: > Payment type reference data mapping payment type codes to their descriptions. 
Used as a dimension table for payment method analysis. columns: - name: payment_type description: Numeric code for payment type data_tests: - unique - not_null - name: description description: Human-readable description of payment method ================================================ FILE: 04-analytics-engineering/taxi_rides_ny/snapshots/.gitkeep ================================================ ================================================ FILE: 04-analytics-engineering/taxi_rides_ny/tests/.gitkeep ================================================ ================================================ FILE: 05-data-platforms/README.md ================================================ # Module 5: Data Platforms ## Overview In this module, you'll learn about data platforms - tools that help you manage the entire data lifecycle from ingestion to analytics. We'll use [Bruin](https://getbruin.com/) as an example of a data platform. Bruin puts multiple tools under one platform: - Data ingestion (extract from sources to your warehouse) - Data transformation (cleaning, modeling, aggregating) - Data orchestration (scheduling and dependency management) - Data quality (built-in checks and validation) - Metadata management (lineage, documentation) ## Tutorial Follow the complete hands-on tutorial at: [Bruin Data Engineering Zoomcamp Template](https://github.com/bruin-data/bruin/tree/main/templates/zoomcamp) The template is a TODO-based learning exercise — run `bruin init zoomcamp my-taxi-pipeline` and fill in the configuration and code guided by inline comments. The [notes](notes/) contain completed reference implementations. 
## Videos

### :movie_camera: 5.1 - Introduction to Bruin

[![](https://markdown-videos-api.jorgenkh.no/youtube/f6vg7lGqZx0)](https://youtu.be/f6vg7lGqZx0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=1)

Introduction to the Bruin data platform: what it is, what a modern data stack looks like (ETL/ELT, orchestration, data quality), and how Bruin brings all of these together into a single project.

- [Notes](notes/01-introduction.md)

### :movie_camera: 5.2 - Getting Started with Bruin

[![](https://markdown-videos-api.jorgenkh.no/youtube/JJwHKSidX_c)](https://youtu.be/JJwHKSidX_c&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=2)

Install Bruin, set up the VS Code/Cursor extension and Bruin MCP, and create a first project using `bruin init`. Walk through environments, connections (DuckDB, Chess.com), pipeline YAML configuration, and running Python, YAML ingestor, and SQL assets.

- [Notes](notes/02-getting-started.md)

### :movie_camera: 5.3 - Building an End-to-End Pipeline with NYC Taxi Data

[![](https://markdown-videos-api.jorgenkh.no/youtube/q0k_iz9kWsI)](https://youtu.be/q0k_iz9kWsI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=3)

Build a full pipeline with a three-layered architecture (ingestion, staging, reports) using NYC taxi data and DuckDB.

- [Notes](notes/03-nyc-taxi-pipeline.md)

### :movie_camera: 5.4 - Using Bruin MCP with AI Agents

[![](https://markdown-videos-api.jorgenkh.no/youtube/224xH7h8OaQ)](https://youtu.be/224xH7h8OaQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=4)

Install the Bruin MCP in Cursor/VS Code and use an AI agent to build the entire NYC taxi pipeline end to end. Query data conversationally, ask questions about pipeline logic, and troubleshoot issues, all through natural language.

- [Notes](notes/04-bruin-mcp.md)

### :movie_camera: 5.5 - Deploying to Bruin Cloud

[![](https://markdown-videos-api.jorgenkh.no/youtube/uBqjLEwF8rc)](https://youtu.be/uBqjLEwF8rc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=5)

Register for Bruin Cloud, connect your GitHub repository, set up data warehouse connections, deploy and monitor your pipelines with a fully managed infrastructure.

- [Notes](notes/05-bruin-cloud.md)

## Bruin Core Concepts

Short videos covering the fundamental concepts of Bruin: projects, pipelines, assets, variables, and commands.

### :movie_camera: Projects

[![](https://markdown-videos-api.jorgenkh.no/youtube/YWDjnSxbBtY)](https://www.youtube.com/watch?v=YWDjnSxbBtY)

The root directory where you create your Bruin data pipeline. Learn about project initialization, the `.bruin.yml` configuration file, environments, and connections.

- [Notes](notes/06-core-01-projects.md)

### :movie_camera: Pipelines

[![](https://markdown-videos-api.jorgenkh.no/youtube/uzp_DiR4Sok)](https://www.youtube.com/watch?v=uzp_DiR4Sok)

A grouping mechanism for organizing assets based on their execution schedule. Each pipeline has a single schedule and its own configuration file.

- [Notes](notes/06-core-02-pipelines.md)

### :movie_camera: Assets

[![](https://markdown-videos-api.jorgenkh.no/youtube/ZElY5SoqrwI)](https://www.youtube.com/watch?v=ZElY5SoqrwI)

Single files that perform specific tasks, creating or updating tables/views in your database. Covers SQL, Python, and YAML asset types with examples.

- [Notes](notes/06-core-03-assets.md)

### :movie_camera: Variables

[![](https://markdown-videos-api.jorgenkh.no/youtube/XCx0nDmhhxA)](https://www.youtube.com/watch?v=XCx0nDmhhxA)

Dynamic values initialized at each pipeline run. Learn about built-in variables (start_date, end_date) and custom variables for parameterizing your pipelines.

- [Notes](notes/06-core-04-variables.md)

### :movie_camera: Commands

[![](https://markdown-videos-api.jorgenkh.no/youtube/3nykPEs_V7E)](https://www.youtube.com/watch?v=3nykPEs_V7E)

CLI commands for interacting with your Bruin project: `bruin run`, `bruin validate`, `bruin lineage`, and more with practical examples.

- [Notes](notes/06-core-05-commands.md)

## Resources

- [Bruin Documentation](https://getbruin.com/docs)
- [Bruin GitHub Repository](https://github.com/bruin-data/bruin)
- [Bruin MCP (AI Integration)](https://getbruin.com/docs/bruin/getting-started/bruin-mcp)
- [Bruin Cloud](https://getbruin.com/): managed deployment and monitoring

# Homework

* [2026 Homework](../cohorts/2026/05-data-platforms/homework.md)

# Community notes
Did you take notes? You can share them here:

* Add your notes here (above this line)
================================================
FILE: 05-data-platforms/notes/01-introduction.md
================================================
# 5.1 - Introduction to Bruin

## What is Bruin?

Bruin is an end-to-end data platform that combines ingestion, transformations, orchestration, data quality checks, metadata, and lineage into a single tool. Instead of using five or six different tools configured separately, Bruin lets you have your code logic, configurations, dependencies, and quality checks all in the same place.

## The modern data stack

A typical data stack involves several components:

- Extract/ingest data from third-party sources or databases into a data warehouse or data lake
- Run transformations: clean data, create reports, push results to a warehouse, lake, or third-party application
- Orchestrate: tell different scripts and services when to run, how to run, and how to communicate with each other
- Data quality and governance: ensure accuracy, completeness, and consistency of data before delivering it to consumers

Bruin brings all of these together so you don't need to be a DevOps person, data infrastructure engineer, and data architect just to build a pipeline.

## Learning goals for the tutorial series

- Bruin project structure
- What is a pipeline and what are assets
- How to configure pipelines
- Materialization strategies supported by Bruin
- Lineage and how to build dependencies between assets
- Metadata created automatically and manually
- Parameterizing pipelines with custom variables

================================================
FILE: 05-data-platforms/notes/02-getting-started.md
================================================
# 5.2 - Getting Started with Bruin

## Installation

Install Bruin CLI:

```bash
curl -LsSf https://getbruin.com/install/cli | sh
bruin version
```

Install the Bruin extension for VS Code or Cursor. This adds a Bruin render panel that lets you run assets and pipelines directly from the IDE.
## Bruin MCP

Bruin provides an MCP (Model Context Protocol) server that you can add to your IDE (Cursor, VS Code) to use AI agents for creating pipelines.

Add the Bruin MCP under your IDE settings > Tools and MCP.

### Bruin MCP Integration for VS Code

1. **Create `.vscode/mcp.json` in your repository:** in the root directory of your project (the same level as your `.git` folder or `package.json`), create a `.vscode` folder if it does not already exist, and add a new file named `mcp.json` inside it.
2. **Add the configuration:** open the `mcp.json` file and paste the following JSON configuration into it:

```json
{
  "servers": {
    "bruin": {
      "type": "stdio",
      "command": "bruin",
      "args": [
        "mcp"
      ]
    }
  },
  "inputs": []
}
```

This configuration instructs VS Code to launch the `bruin mcp` command, establishing a standard input/output connection with the Bruin MCP server.

## Initializing a project

```bash
bruin init default my-first-pipeline
cd my-first-pipeline
```

This creates a project from a template, initializes git, adds a `.gitignore`, and creates the `.bruin.yml` file.

Bruin requires the project to be git-initialized. The `bruin init` command handles this automatically.

## Project structure

```text
my-first-pipeline/
├── .bruin.yml              # Environment and connection configuration
├── pipeline.yml            # Pipeline name, schedule, default connections
└── assets/
    ├── players.asset.yml   # Ingestr asset (data ingestion)
    ├── player_stats.sql    # SQL asset with quality checks
    └── my_python_asset.py  # Python asset
```

### .bruin.yml

- Stays local only (auto-added to `.gitignore`)
- Never push this to your repo: it contains database connections and secrets
- Defines environments (default, production, staging, etc.)
- Under each environment, define connections (e.g. DuckDB, Chess.com, custom secrets)

```yaml
default_environment: default
environments:
  default:
    connections:
      duckdb:
        - name: duckdb-default
          path: duckdb.db
```

### pipeline.yml

Configures the pipeline: name, schedule, default connection, start date.
```yaml
name: my-pipeline
schedule: daily
start_date: "2022-01-01"
default_connections:
  duckdb: duckdb-default
```

## Asset types

### Python asset

Simplest form: a Python script with a name that prints or processes data. Run from the Bruin panel in your IDE.

### YAML ingestor asset

Uses Bruin's built-in ingestor. Define source connection, destination, and table. Supports many built-in sources and destinations: Redshift, MySQL, Postgres, MotherDuck, BigQuery, etc. Automatically creates the destination database/table if it doesn't exist.

### SQL asset

Runs SQL queries against your database. Define dependencies on other assets: when a dependency finishes, this asset runs automatically.

## Intervals and incremental ingestion

- Set `start_date` and `end_date` parameters to ingest data for a specific time range
- Bruin provides these as variables you can inject into your code
- Built-in ingestion assets automatically use the start/end dates

## Dependencies and lineage

- Define dependencies between assets so they run in the correct order
- When the first asset completes, it automatically triggers the next dependent asset
- Bruin builds a lineage graph from these dependencies

## Key CLI commands

| Command | Purpose |
|---------|---------|
| `bruin validate` | Check syntax and dependencies without running |
| `bruin run` | Execute pipeline or individual asset |
| `bruin run --downstream` | Run asset and all downstream dependencies |
| `bruin run --full-refresh` | Truncate and rebuild tables from scratch |
| `bruin lineage` | View asset dependencies |
| `bruin query --connection --query "..."` | Execute ad-hoc SQL queries |

================================================
FILE: 05-data-platforms/notes/03-nyc-taxi-pipeline.md
================================================
# 5.3 - Building an End-to-End Pipeline with NYC Taxi Data

## Architecture

Three-layered pipeline using DuckDB as a locally hosted database:

1. Ingestion layer: extract data and store in raw format
2. Staging layer: pre-process, clean, transform, join with lookup tables
3. Reports layer: aggregate data and run calculations

All assets have dependencies that create the data lineage Bruin uses for orchestration.

## Project setup

Initialize from the zoomcamp template:

```bash
bruin init zoomcamp my-taxi-pipeline
cd my-taxi-pipeline
```

Project structure:

```text
zoomcamp/
├── .bruin.yml
├── README.md
└── pipeline/
    ├── pipeline.yml
    └── assets/
        ├── ingestion/
        │   ├── trips.py
        │   ├── requirements.txt
        │   ├── payment_lookup.asset.yml
        │   └── payment_lookup.csv
        ├── staging/
        │   └── trips.sql
        └── reports/
            └── trips_report.sql
```

### .bruin.yml

```yaml
default_environment: default
environments:
  default:
    connections:
      duckdb:
        - name: duckdb-default
          path: duckdb.db
```

### pipeline.yml

```yaml
name: nyc_taxi
schedule: daily
start_date: "2022-01-01"
default_connections:
  duckdb: duckdb-default
variables:
  taxi_types:
    type: array
    items:
      type: string
    default: ["yellow"]
```

- `start_date`: when running a full refresh, process data starting from this date
- Custom variables: `taxi_types` lets you control which taxi types to ingest (yellow, green, or both)
- Variables can be overridden at runtime with `--var`

## Ingestion layer

### Python asset: trips.py

The Python asset connects to the NYC taxi API and extracts data.
```python
"""@bruin
name: ingestion.trips
type: python
image: python:3.11
materialization:
  type: table
  strategy: append
columns:
  - name: pickup_datetime
    type: timestamp
    description: "When the meter was engaged"
  - name: dropoff_datetime
    type: timestamp
    description: "When the meter was disengaged"
@bruin"""

import os
import json
import pandas as pd


def materialize():
    start_date = os.environ["BRUIN_START_DATE"]
    end_date = os.environ["BRUIN_END_DATE"]
    taxi_types = json.loads(os.environ["BRUIN_VARS"]).get("taxi_types", ["yellow"])

    # Generate list of months between start and end dates
    # Fetch parquet files from:
    # https://d37ci6vzurychx.cloudfront.net/trip-data/{taxi_type}_tripdata_{year}-{month}.parquet

    # final_dataframe stands for the concatenated result of the fetch step,
    # which this outline leaves out
    return final_dataframe
```

- `materialize()` returns a DataFrame; Bruin handles inserting it into the destination
- `append` strategy: each run inserts data without touching existing rows
- Uses `BRUIN_START_DATE` / `BRUIN_END_DATE` environment variables for the time window
- Uses `BRUIN_VARS` to read the `taxi_types` pipeline variable

### Seed file: payment_lookup.asset.yml

Seed files ingest data from local CSV files into the database.

```yaml
name: ingestion.payment_lookup
type: duckdb.seed
parameters:
  path: payment_lookup.csv
columns:
  - name: payment_type_id
    type: integer
    description: "Numeric code for payment type"
    primary_key: true
    checks:
      - name: not_null
      - name: unique
  - name: payment_type_name
    type: string
    description: "Human-readable payment type"
    checks:
      - name: not_null
```

payment_lookup.csv:

```csv
payment_type_id,payment_type_name
0,flex_fare
1,credit_card
2,cash
3,no_charge
4,dispute
5,unknown
6,voided_trip
```

Quality checks (`not_null`, `unique`) run automatically after the asset finishes.

### requirements.txt

```
pandas
requests
pyarrow
python-dateutil
```

Bruin handles the environment and installs dependencies locally within the pipeline.
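The month-generation step that `trips.py` only describes in comments could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the template's actual implementation. The helper names `month_starts` and `parquet_urls` are hypothetical, and the end date is treated as exclusive to match the `<` filter used in the staging query. The zero-padded month in the file name is also an assumption about how the `{month}` placeholder is filled.

```python
from datetime import date


def month_starts(start: date, end: date) -> list[date]:
    """First day of each month touched by the half-open window [start, end)."""
    if start >= end:
        return []
    months = []
    year, month = start.year, start.month
    while date(year, month, 1) < end:
        months.append(date(year, month, 1))
        # Roll over to January of the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def parquet_urls(taxi_types: list[str], start: date, end: date) -> list[str]:
    """One URL per (taxi type, month); zero-padded month is an assumption."""
    base = "https://d37ci6vzurychx.cloudfront.net/trip-data"
    return [
        f"{base}/{taxi_type}_tripdata_{m.year}-{m.month:02d}.parquet"
        for taxi_type in taxi_types
        for m in month_starts(start, end)
    ]
```

Inside `materialize()`, the two environment variables could be parsed with `date.fromisoformat(...)` if they arrive as plain `YYYY-MM-DD` strings (an assumption worth checking against your Bruin version), and each resulting URL could then be read and concatenated into the returned DataFrame.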
## Staging layer

### SQL asset: staging/trips.sql

```sql
/* @bruin
name: staging.trips
type: duckdb.sql
depends:
  - ingestion.trips
  - ingestion.payment_lookup
materialization:
  type: table
  strategy: time_interval
  incremental_key: pickup_datetime
  time_granularity: timestamp
columns:
  - name: pickup_datetime
    type: timestamp
    primary_key: true
    checks:
      - name: not_null
custom_checks:
  - name: row_count_greater_than_zero
    query: |
      SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END
      FROM staging.trips
    value: 1
@bruin */

SELECT
    t.pickup_datetime,
    t.dropoff_datetime,
    t.pickup_location_id,
    t.dropoff_location_id,
    t.fare_amount,
    t.taxi_type,
    p.payment_type_name
FROM ingestion.trips t
LEFT JOIN ingestion.payment_lookup p
    ON t.payment_type = p.payment_type_id
WHERE t.pickup_datetime >= '{{ start_datetime }}'
  AND t.pickup_datetime < '{{ end_datetime }}'
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY t.pickup_datetime, t.dropoff_datetime,
                 t.pickup_location_id, t.dropoff_location_id, t.fare_amount
    ORDER BY t.pickup_datetime
) = 1
```

- `time_interval` strategy: deletes rows in the time window, then inserts the query result
- The `WHERE` clause must filter to the same time window to avoid duplicates
- `QUALIFY ROW_NUMBER()` deduplicates using a composite key
- Dependencies on both `ingestion.trips` and `ingestion.payment_lookup` ensure this runs after ingestion

## Reports layer

### SQL asset: reports/trips_report.sql

```sql
/* @bruin
name: reports.trips_report
type: duckdb.sql
depends:
  - staging.trips
materialization:
  type: table
  strategy: time_interval
  incremental_key: trip_date
  time_granularity: date
columns:
  - name: trip_date
    type: date
    primary_key: true
  - name: taxi_type
    type: string
    primary_key: true
  - name: payment_type
    type: string
    primary_key: true
  - name: trip_count
    type: bigint
    checks:
      - name: non_negative
@bruin */

SELECT
    CAST(pickup_datetime AS DATE) AS trip_date,
    taxi_type,
    payment_type_name AS payment_type,
    COUNT(*) AS trip_count,
    SUM(fare_amount) AS total_fare,
    AVG(fare_amount) AS avg_fare
FROM staging.trips
WHERE pickup_datetime >= '{{ start_datetime }}'
  AND pickup_datetime < '{{ end_datetime }}'
GROUP BY 1, 2, 3
```

## Running the full pipeline

```bash
# Validate structure and definitions
bruin validate ./pipeline/pipeline.yml

# Run with a small date range for testing
bruin run ./pipeline/pipeline.yml --start-date 2022-01-01 --end-date 2022-02-01

# Full refresh
bruin run ./pipeline/pipeline.yml --full-refresh

# Query results
bruin query --connection duckdb-default --query "SELECT COUNT(*) FROM ingestion.trips"
```

Open the pipeline YAML file in the Bruin panel and view the lineage tab to see all assets and their dependencies.

Execution order:

1. Ingestion assets run first (trips + lookup, in parallel)
2. Staging asset runs after both ingestion assets complete
3. Report asset runs after staging completes

## Materialization strategies summary

| Strategy | Behavior |
|----------|----------|
| `table` | Drop and recreate the table each time |
| `append` | Insert new data without touching existing rows |
| `merge` | Upsert based on key columns |
| `time_interval` | Delete rows in date range, then re-insert |
| `delete+insert` | Delete matching rows, then insert |
| `create+replace` | Create or replace the table |

================================================
FILE: 05-data-platforms/notes/04-bruin-mcp.md
================================================
# 5.4 - Using Bruin MCP with AI Agents

## What is Bruin MCP?

MCP stands for **Model Context Protocol**. Bruin MCP is a way for AI agents (in Cursor, VS Code, Claude, etc.) to communicate with Bruin: querying documentation, running commands on your behalf, going through your code, troubleshooting, and analyzing data.
With the Bruin MCP and an AI agent, you can:

- Write pipeline code and asset configurations
- Write documentation and metadata
- Troubleshoot errors and debug issues
- Run queries and analyze data using natural language
- Ask questions about your pipeline logic and structure

## Installing Bruin MCP

Make sure you have [Bruin CLI installed](https://getbruin.com/docs/bruin/getting-started/introduction/installation) first.

### Cursor

Go to **Settings → Tools & MCP → New MCP Server** and add:

```json
{
  "mcpServers": {
    "bruin": {
      "command": "bruin",
      "args": ["mcp"]
    }
  }
}
```

If it shows a failure/error, close and reopen your IDE; you should then see "Bruin enabled".

### VS Code (Copilot)

Create `.vscode/mcp.json` in your project folder:

```json
{
  "servers": {
    "bruin": {
      "command": "bruin",
      "args": ["mcp"]
    }
  }
}
```

### Claude Code

```bash
claude mcp add bruin -- bruin mcp
```

See the full [Bruin MCP documentation](https://getbruin.com/docs/bruin/getting-started/bruin-mcp) for other agents and troubleshooting.

## Building a pipeline with MCP

### Using the template prompt

The zoomcamp template includes an example prompt in its README that you can give to the AI agent to create the entire pipeline end-to-end:

```bash
bruin init zoomcamp my-taxi-pipeline
```

Open the generated `README.md`: it contains a prompt you can paste into the agent to scaffold the entire pipeline automatically.

### What the agent does

When given the pipeline prompt, the agent will:

1. Create all pipeline assets (ingestion, staging, reports)
2. Configure materialization strategies and dependencies
3. Set up quality checks and column metadata
4. Validate the pipeline with `bruin validate`
5. Run the pipeline with a test date range
6. Run custom checks to validate query logic
7. Execute verification queries using `bruin query`

### Working incrementally

In practice, you may prefer working asset by asset rather than generating everything at once. This lets you be involved in every design choice:

- Create and test the ingestion asset first
- Then build the staging layer
- Then add the reports layer
- Review and adjust quality checks at each step

## Querying data with the agent

Once your pipeline has run, you can use the agent conversationally to query your data:

**Example queries:**

- "Query the staging table and tell me how many days of data we have"
- "Which day had the highest number of trips and total fare?"
- "In which asset are we aggregating data?"

The agent understands the context of your pipeline: it knows the table structures, can write SQL queries, and can explain the logic behind each asset.

This is useful for:

- Ad hoc analysis without writing SQL manually
- Understanding unfamiliar pipeline logic
- Data validation and troubleshooting
- Onboarding new team members to an existing pipeline

================================================
FILE: 05-data-platforms/notes/05-bruin-cloud.md
================================================
# 5.5 - Deploying to Bruin Cloud

## What is Bruin Cloud?

Bruin Cloud is a fully managed infrastructure for your data pipelines. It is powered by the same open-source CLI tool you use locally for development.

Everything lives in the same place:

- Ingestions and transformations
- Quality checks and monitoring
- Lineage and metadata
- Data governance
- AI-powered features (automatic metadata generation, conversational data analysis)

## Registration

1. Go to [Bruin Cloud](https://getbruin.com/) and sign up
2. Fill out your name, email, and set a password
3. Verify your email by clicking the link in the verification email
4. Choose to join an existing team or create a new organization
5. Give your organization a name

## Connecting your GitHub repository

You have two options:

1. **Direct GitHub connection** (recommended): connect your GitHub account directly and select your repo from a dropdown
2. **Personal Access Token**: provide a GitHub personal access token and your repo link manually

## Setting up connections

After connecting your repo, set up your data warehouse connections. These are the same connections you configure locally in `.bruin.yml`, but stored securely in the cloud.

1. Go to the connections page
2. Select your connection type (MotherDuck, BigQuery, Redshift, etc.)
3. Give it the same connection name you use locally
4. Provide the required credentials (e.g., service token, database name)
5. The connection will be validated and tested automatically

Read the Bruin documentation for details on how secrets are stored securely.

## Deploying pipelines

1. Navigate to the **Pipelines** page to see the list of pipelines from your repository
2. Bruin will validate every asset and ensure lineage and connections work (this takes a moment)
3. Once ready, **enable** the pipeline

When you enable a pipeline with a schedule, Bruin automatically creates a run for the last interval. For example, a monthly pipeline will immediately process the previous month's data.

## Monitoring

After a pipeline runs:

- Check the status of each asset (success/failure)
- Review quality check results
- View lineage across all assets
- Use AI-powered features to analyze data or ask questions about your pipelines

## Getting help

- Join the [Bruin Slack community](https://getbruin.com/) for questions and feature requests
- Submit issues on [GitHub](https://github.com/bruin-data/bruin)

================================================
FILE: 05-data-platforms/notes/06-core-01-projects.md
================================================
# 5.6 - Core Concepts: Projects

🎥 [Bruin Core Concepts | Projects](https://www.youtube.com/watch?v=YWDjnSxbBtY) (3:03)

## What is a Project?

A **Project** is the root directory where you create your entire Bruin data pipeline. It serves as the foundation for organizing all your data assets, configurations, and connections.
## Project Initialization The project must be initialized with `bruin init` so the CLI tool can understand the directory structure and navigate files correctly. ```bash bruin init zoomcamp my-pipeline cd my-pipeline ``` ## The `.bruin.yml` File Located at the root of your project, this file defines environments, connections, and secrets. **Important:** This file is always added to `.gitignore` to protect secrets. It stays local only and should never be pushed to your repo. ### Environments Define different environments for various stages: ```yaml default_environment: default environments: default: connections: duckdb: - name: duckdb-default path: duckdb.db motherduck: - name: motherduck token: production: connections: bigquery: - name: bq-prod project: my-project dataset: production ``` **Benefits:** - Run pipelines locally or on servers without exposing production credentials - Different teams can have different connection access - Default to `dev` environment to prevent accidental production runs ### Connection Types Built-in connections include: - DuckDB, MotherDuck - PostgreSQL, MySQL - BigQuery, Redshift, Snowflake - Custom connections (for API keys, secrets, etc.) ### Default Environment Set which environment is used by default: ```yaml default_environment: dev ``` This ensures pipelines run on development unless explicitly told to use production. ## Quick Reference ```bash # Initialize a new project bruin init zoomcamp my-pipeline # Navigate to your project cd my-pipeline # Check project is valid bruin validate . 
``` ## Further Reading - [Bruin Documentation - Projects](https://getbruin.com/docs/bruin/core-concepts/project.html) - [Bruin GitHub - Templates](https://github.com/bruin-data/bruin/tree/main/templates) ================================================ FILE: 05-data-platforms/notes/06-core-02-pipelines.md ================================================ # 5.6 - Core Concepts: Pipelines 🎥 [Bruin Core Concepts | Pipelines](https://www.youtube.com/watch?v=uzp_DiR4Sok) (3:13) ## What is a Pipeline? A **Pipeline** is a grouping mechanism for organizing assets based on their execution schedule and configuration requirements. Within a project, you can have multiple pipelines. ## Key Characteristics ### Single Schedule Each pipeline has **one schedule** - this is the primary reason to group assets together: - Assets with the same schedule belong in the same pipeline - Common schedules: `hourly`, `daily`, `monthly`, or cron expressions ### Pipeline Structure Each pipeline has its own folder containing a `pipeline.yml` file: ```text project/ ├── .bruin.yml ├── pipelines/ │ ├── nyc-taxi/ │ │ ├── pipeline.yml │ │ └── assets/ │ └── another-pipeline/ │ ├── pipeline.yml │ └── assets/ ``` ## The `pipeline.yml` File ```yaml name: nyc_taxi schedule: monthly start_date: "2019-01-01" default_connections: duckdb: duckdb-default ``` ### Configuration Options | Setting | Description | |---------|-------------| | `name` | Pipeline identifier | | `schedule` | When to run (cron, daily, monthly, etc.) | | `start_date` | When the pipeline starts being active | | `default_connections` | Which connections to use | | `variables` | Custom variables for the pipeline | ### Connection Scoping Even though connections are defined at the project level (`.bruin.yml`), each pipeline specifies **which connections it uses**. 
**Why this matters:** - In large organizations, different teams may need different credentials - Prevents unnecessary exposure of secrets - Only initializes connections needed for the specific pipeline run - Security isolation between departments ## Quick Reference ```bash # Validate a pipeline bruin validate ./pipelines/nyc-taxi/pipeline.yml # View pipeline lineage bruin lineage ./pipelines/nyc-taxi/pipeline.yml # Run the entire pipeline bruin run ./pipelines/nyc-taxi/pipeline.yml ``` ## Further Reading - [Bruin Documentation - Pipelines](https://getbruin.com/docs/bruin/pipelines/definition.html) - [Pipeline Configuration Reference](https://getbruin.com/docs/bruin/pipelines/definition.html) ================================================ FILE: 05-data-platforms/notes/06-core-03-assets.md ================================================ # 5.6 - Core Concepts: Assets 🎥 [Bruin Core Concepts | Assets](https://www.youtube.com/watch?v=ZElY5SoqrwI) (6:11) ## What is an Asset? An **Asset** is a single file that performs a specific task, almost always related to creating or updating a table or view in the destination database. Each asset file contains two parts: 1. **Definition** (Configuration) - Metadata, name, type, connection 2. **Content** (Code) - The actual SQL, Python, or R code to execute ## Asset Types | Type | Description | Use Case | |------|-------------|----------| | **Python** | Python scripts | Ingestion, data processing, ML models | | **SQL** | SQL queries | Transformations, aggregations | | **YAML/Seed** | File-based tables | Reference data, static lookups | | **R** | R scripts | Statistical analysis, R-specific workflows | ## Asset Naming The asset name can be: 1. **Explicitly defined** in the decorator 2. 
**Inferred from file path** (default behavior) **Convention:** Group assets by schema/dataset: - `assets/raw/trips_raw.py` → Creates table `raw.trips_raw` - `assets/staging/trips_summary.sql` → Creates table `staging.trips_summary` ## SQL Asset Example ```sql @bruin.asset( name="staging.trips_summary", type="sql", connection="duckdb-default", materialization="table" ) SELECT pickup_date, COUNT(*) as trip_count, SUM(fare_amount) as total_fare FROM raw.trips_raw WHERE pickup_date >= '{{ start_date }}' AND pickup_date < '{{ end_date }}' GROUP BY pickup_date ``` ### Materialization Strategies | Strategy | Behavior | |----------|----------| | `table` | Recreates the table on each run | | `view` | Creates a view (no data stored) | | `insert` | Appends new data to existing table | | `incremental` | Smart merge based on key columns | ## Python Asset Example (Ingestion) ```python @bruin.asset( name="raw.trips_raw", type="python", connection="duckdb-default" ) def ingest_trips(): import requests import pandas as pd # Connect to API, fetch data response = requests.get("https://api.example.com/trips") data = response.json() # Return pandas DataFrame # Bruin handles materialization to database return pd.DataFrame(data) ``` ## YAML/Seed Asset Example ```yaml @bruin.asset( name="lookup.taxi_types", type="seed", connection="duckdb-default" ) path: reference_data/taxi_types.csv ``` Simply loads a local CSV file and creates a table in the destination database. 
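To make the seed idea concrete, here is a minimal Python sketch of what "load a CSV and create a table" amounts to. It is not Bruin's implementation: it uses the standard-library `csv` and `sqlite3` modules as a stand-in for a real destination database, and the function name `load_seed`, the table name `taxi_types`, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3


def load_seed(conn, table, csv_text):
    """Create `table` from CSV text and return the number of rows loaded.

    Illustrative only: every column is stored as TEXT and identifiers are
    quoted naively, which is acceptable for trusted reference data but not
    for arbitrary input.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines

    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)

    # Drop and recreate, so re-running the seed always yields the same table.
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# Hypothetical contents of reference_data/taxi_types.csv
SAMPLE_CSV = "taxi_type,description\nyellow,Street-hail taxi\ngreen,Boro taxi\n"

conn = sqlite3.connect(":memory:")
loaded = load_seed(conn, "taxi_types", SAMPLE_CSV)
seed_rows = conn.execute(
    'SELECT taxi_type FROM "taxi_types" ORDER BY taxi_type'
).fetchall()
```

The drop-and-recreate step is a deliberate simplification: static lookup data is small, so rebuilding the table on every run keeps the result reproducible.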
## Lineage & Dependencies Assets automatically define dependencies based on what they read: - If Asset B reads from Asset A's table, **B depends on A** - Visualized in VS Code extension - Used for execution ordering during runs ```sql -- This asset depends on raw.trips_raw @bruin.asset(name="staging.trips_summary", type="sql") SELECT * FROM raw.trips_raw -- Creates dependency ``` ## Quick Reference ```bash # Run a specific asset bruin run ./pipeline.yml --asset raw.trips_raw # Run asset with all downstream dependencies bruin run ./pipeline.yml --asset raw.trips_raw --downstream # Run asset with all upstream dependencies bruin run ./pipeline.yml --asset staging.trips_summary --upstream # View lineage for an asset bruin lineage ./pipeline.yml --asset raw.trips_raw ``` ## Further Reading - [Bruin Documentation - Assets](https://getbruin.com/docs/bruin/assets/definition-schema.html) - [Materialization Strategies](https://getbruin.com/docs/bruin/assets/materialization.html) ================================================ FILE: 05-data-platforms/notes/06-core-04-variables.md ================================================ # 5.6 - Core Concepts: Variables 🎥 [Bruin Core Concepts | Variables](https://www.youtube.com/watch?v=XCx0nDmhhxA) (6:03) ## What are Variables? **Variables** are dynamically initialized each time a pipeline run is created. They allow you to parameterize your pipelines and pass dynamic values at runtime. ## Variable Types ### 1. 
Built-in Variables Always provided by Bruin automatically: | Variable | Description | |----------|-------------| | `start_date` | Beginning of the scheduled interval | | `end_date` | End of the scheduled interval | These dates are determined by the pipeline's schedule: | Schedule | Start Date | End Date | |----------|------------|----------| | **Monthly** | First day of month | Last day of month | | **Daily** | Start of day | End of day | | **Hourly** | Start of hour | End of hour | #### SQL Assets - Jinja Format In SQL, variables are injected using Jinja templating: ```sql @bruin.asset(name="staging.monthly_trips", type="sql") SELECT * FROM raw.trips WHERE pickup_date >= '{{ start_date }}' AND pickup_date < '{{ end_date }}' ``` Use the **Bruin Render panel** in VS Code to see the compiled query with actual values. #### Python Assets - Environment Variables In Python, variables are accessed via environment variables: ```python import os from datetime import datetime @bruin.asset(name="raw.monthly_data", type="python") def ingest_monthly_data(): start_date = os.environ['BRUIN_VAR_START_DATE'] end_date = os.environ['BRUIN_VAR_END_DATE'] # Parse and use dates to fetch data for specific period start = datetime.fromisoformat(start_date) end = datetime.fromisoformat(end_date) # Loop through months in range # ... ``` ### 2. Custom Variables User-defined variables set at the pipeline level. 
#### Definition in `pipeline.yml`

```yaml
variables:
  - name: taxi_types
    type: array
    default:
      - "yellow"
```

#### Override at Runtime

Change default values when creating a run. Wrap the whole `key=value` pair in single quotes; otherwise the shell strips the inner double quotes and the value is no longer a valid JSON array:

```bash
bruin run ./pipeline.yml --var 'taxi_types=["green","fhv"]'
```

#### Accessing Custom Variables in Python

```python
import os
import json

@bruin.asset(name="example.asset", type="python")
def example_asset():
    # Custom variables are prefixed with BRUIN_VAR_
    taxi_types_json = os.environ['BRUIN_VAR_TAXI_TYPES']
    taxi_types = json.loads(taxi_types_json)

    # Use the variable in your code
    for taxi_type in taxi_types:
        # Process each taxi type
        pass
```

## VS Code Extension Panel

From the Bruin panel in VS Code/Cursor:

1. **Variable Override** - Set custom variable values before running
2. **Bruin Render** - See how Jinja templates are compiled with actual values
3. **Run Configuration** - Set dates, environment, and variables

## Practical Use Cases

| Use Case | Description |
|----------|-------------|
| **Date-based partitioning** | Extract data for specific time periods |
| **Multi-tenant processing** | Run same pipeline for different customers |
| **Parameterized transformations** | Change logic based on variables |
| **A/B testing** | Test different configurations without code changes |

## Quick Reference

```bash
# Run with custom dates
bruin run ./pipeline.yml --start-date 2020-01-01 --end-date 2020-01-31

# Run with variable override (array); quote the pair so the JSON survives the shell
bruin run ./pipeline.yml --var 'taxi_types=["green","fhv"]'

# Run with variable override (string)
bruin run ./pipeline.yml --var customer_id=12345

# Run with full refresh (affects materialization)
bruin run ./pipeline.yml --full-refresh

# Set end date as exclusive
bruin run ./pipeline.yml --exclusive-end-date
```

## Further Reading

- [Bruin Documentation - Variables](https://getbruin.com/docs/bruin/core-concepts/variables.html)
- [Pipeline Runtime Options](https://getbruin.com/docs/bruin/commands/run.html)

================================================
FILE: 05-data-platforms/notes/06-core-05-commands.md
================================================
# 5.6 - Core Concepts: Commands

🎥 [Bruin Core Concepts | Commands](https://www.youtube.com/watch?v=3nykPEs_V7E) (6:46)

## Bruin CLI Commands

Commands are how you interact with your Bruin project - running pipelines, validating configurations, querying data, and more.

## `bruin run` - Execute a Pipeline

Creates a **single execution instance** (a "run") of your pipeline.

### Basic Usage

```bash
bruin run ./pipelines/nyc-taxi/pipeline.yml
```

### Run Scope Options

| Option | Description |
|--------|-------------|
| Entire pipeline | Runs all assets in dependency order |
| Single asset | `--asset staging.trips_summary` |
| With upstream | `--asset X --upstream` - Runs X plus all dependencies |
| With downstream | `--asset X --downstream` - Runs X plus all dependents |

### Common Run Flags

| Flag | Description |
|------|-------------|
| `--start-date DATE` | Set execution start date |
| `--end-date DATE` | Set execution end date |
| `--full-refresh` | Drop and recreate tables (overrides incremental) |
| `--exclusive-end-date` | End date is exclusive (default: inclusive) |
| `--environment ENV` | Use specific environment (dev/prod) |
| `--var KEY=VALUE` | Override custom variables |

### Example Run Commands

```bash
# Simple run
bruin run ./pipelines/nyc-taxi/pipeline.yml

# With date range
bruin run ./pipelines/nyc-taxi/pipeline.yml \
  --start-date 2020-01-01 \
  --end-date 2020-01-31

# Full refresh with variables
# (single quotes keep the JSON array intact when the shell parses the command)
bruin run ./pipelines/nyc-taxi/pipeline.yml \
  --full-refresh \
  --var 'taxi_types=["yellow","green"]' \
  --environment default
```

## `bruin validate` - Validate Pipeline

Checks for configuration issues before running:

```bash
bruin validate ./pipelines/nyc-taxi/pipeline.yml
```

**Validates:**

- No circular dependencies in lineage
- Asset definitions are correct
- Connections exist and are properly configured
- No broken references

**Always validate before running!**
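
The first check in that list, no circular dependencies, is easy to picture with a small example. The sketch below is not how `bruin validate` is implemented; it uses Python's standard-library `graphlib` on a hypothetical lineage dictionary to show how a dependency graph either yields a valid execution order or exposes a cycle. The asset names `ingestion.payment_lookup` and `reports.trips_report` and the helper `execution_order` are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(lineage):
    """Return assets in a valid run order, or raise CycleError.

    `lineage` maps each asset name to the set of assets it reads from
    (its upstream dependencies).
    """
    return list(TopologicalSorter(lineage).static_order())


# Hypothetical lineage: report <- staging <- two ingestion assets.
lineage = {
    "ingestion.trips": set(),
    "ingestion.payment_lookup": set(),
    "staging.trips": {"ingestion.trips", "ingestion.payment_lookup"},
    "reports.trips_report": {"staging.trips"},
}
order = execution_order(lineage)

# Introduce a circular dependency: ingestion now reads from the report.
broken = dict(lineage)
broken["ingestion.trips"] = {"reports.trips_report"}
try:
    execution_order(broken)
    cycle_found = False
except CycleError:
    cycle_found = True
```

In the valid graph both ingestion assets come before `staging.trips`, and the report comes last; in the broken graph no such ordering exists, which is the situation a validation step has to reject before anything runs.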
## `bruin lineage` - View Dependency Graph Visualize how assets are connected: ```bash bruin lineage ./pipelines/nyc-taxi/pipeline.yml ``` Shows upstream and downstream relationships between assets. ## `bruin query` - Query Data Run ad-hoc queries against your connections: ```bash bruin query --connection duckdb-default \ --query "SELECT * FROM ingestion.trips LIMIT 10" ``` ## What is a "Run"? A **run** is a single instance of pipeline execution: - Has unique start/end times - May run all assets or a subset - Has its own variable values - Creates execution logs and results ## Putting It All Together The complete Bruin workflow: ``` 1. Project (root, initialized) └── .bruin.yml (environments, connections) 2. Pipeline (scheduled grouping) └── pipeline.yml (schedule, default connection, variables) 3. Assets (the actual work) ├── Python (ingestion, processing) ├── SQL (transformations) └── YAML/Seed (static data) 4. Commands (make it happen) ├── bruin run (execute) ├── bruin validate (check) └── bruin query (inspect) ``` ## Quick Reference ```bash # Initialize new project bruin init zoomcamp my-pipeline # Validate before running bruin validate ./pipeline/pipeline.yml # Run entire pipeline bruin run ./pipeline/pipeline.yml # Run with date range bruin run ./pipeline/pipeline.yml \ --start-date 2020-01-01 \ --end-date 2020-01-31 # Run single asset with downstream bruin run ./pipeline/pipeline.yml \ --asset raw.trips \ --downstream # View lineage bruin lineage ./pipeline/pipeline.yml # Query a table bruin query --connection duckdb-default \ --query "SELECT COUNT(*) FROM staging.trips" ``` ## Further Reading - [Bruin Documentation - CLI Reference](https://getbruin.com/docs/bruin/commands/overview.html) - [Bruin GitHub Repository](https://github.com/bruin-data/bruin) ================================================ FILE: 06-batch/.gitignore ================================================ ================================================ FILE: 06-batch/README.md 
================================================ # Module 6: Batch Processing ## 6.1 Introduction * :movie_camera: 6.1.1 Introduction to Batch Processing [![](https://markdown-videos-api.jorgenkh.no/youtube/dcHe5Fl3MF8)](https://youtu.be/dcHe5Fl3MF8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=51) * :movie_camera: 6.1.2 Introduction to Spark [![](https://markdown-videos-api.jorgenkh.no/youtube/FhaqbEOuQ8U)](https://youtu.be/FhaqbEOuQ8U&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=52) ## 6.2 Installation Follow [these instructions](setup/) to install Spark: * [Windows](setup/windows.md) * [Linux](setup/linux.md) * [MacOS](setup/macos.md) :movie_camera: 6.2.1 (Optional) Installing Spark (Linux) [![](https://markdown-videos-api.jorgenkh.no/youtube/hqUbB9c8sKg)](https://youtu.be/hqUbB9c8sKg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=53) Alternatively, if the setups above don't work, you can run Spark in Google Colab. > [!NOTE] > It's advisable to invest some time in setting things up locally rather than immediately jumping into this solution * [Google Colab Instructions](https://medium.com/gitconnected/launch-spark-on-google-colab-and-connect-to-sparkui-342cad19b304) * [Google Colab Starter Notebook](https://github.com/aaalexlit/medium_articles/blob/main/Spark_in_Colab.ipynb) ## 6.3 Spark SQL and DataFrames * :movie_camera: 6.3.1 First Look at Spark/PySpark [![](https://markdown-videos-api.jorgenkh.no/youtube/r_Sf6fCB40c)](https://youtu.be/r_Sf6fCB40c&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=54) * :movie_camera: 6.3.2 Spark Dataframes [![](https://markdown-videos-api.jorgenkh.no/youtube/ti3aC1m3rE8)](https://youtu.be/ti3aC1m3rE8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=55) * :movie_camera: 6.3.3 (Optional) Preparing Yellow and Green Taxi Data [![](https://markdown-videos-api.jorgenkh.no/youtube/CI3P4tAtru4)](https://youtu.be/CI3P4tAtru4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=56) Script to prepare the Dataset 
[download_data.sh](code/download_data.sh) > [!NOTE] > The other way to infer the schema (apart from pandas) for the csv files, is to set the `inferSchema` option to `true` while reading the files in Spark. * :movie_camera: 6.3.4 SQL with Spark [![](https://markdown-videos-api.jorgenkh.no/youtube/uAlp2VuZZPY)](https://youtu.be/uAlp2VuZZPY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=57) ## 6.4 Spark Internals * :movie_camera: 6.4.1 Anatomy of a Spark Cluster [![](https://markdown-videos-api.jorgenkh.no/youtube/68CipcZt7ZA)](https://youtu.be/68CipcZt7ZA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=58) * :movie_camera: 6.4.2 GroupBy in Spark [![](https://markdown-videos-api.jorgenkh.no/youtube/9qrDsY_2COo)](https://youtu.be/9qrDsY_2COo&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=59) * :movie_camera: 6.4.3 Joins in Spark [![](https://markdown-videos-api.jorgenkh.no/youtube/lu7TrqAWuH4)](https://youtu.be/lu7TrqAWuH4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=60) ## 6.5 (Optional) Resilient Distributed Datasets * :movie_camera: 6.5.1 Operations on Spark RDDs [![](https://markdown-videos-api.jorgenkh.no/youtube/Bdu-xIrF3OM)](https://youtu.be/Bdu-xIrF3OM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=61) * :movie_camera: 6.5.2 Spark RDD mapPartition [![](https://markdown-videos-api.jorgenkh.no/youtube/k3uB2K99roI)](https://youtu.be/k3uB2K99roI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=62) ## 6.6 Running Spark in the Cloud * :movie_camera: 6.6.1 Connecting to Google Cloud Storage [![](https://markdown-videos-api.jorgenkh.no/youtube/Yyz293hBVcQ)](https://youtu.be/Yyz293hBVcQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=63) * :movie_camera: 6.6.2 Creating a Local Spark Cluster [![](https://markdown-videos-api.jorgenkh.no/youtube/HXBwSlXo5IA)](https://youtu.be/HXBwSlXo5IA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=64) * :movie_camera: 6.6.3 Setting up a Dataproc Cluster 
[![](https://markdown-videos-api.jorgenkh.no/youtube/osAiAYahvh8)](https://youtu.be/osAiAYahvh8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=65) * :movie_camera: 6.6.4 Connecting Spark to Big Query [![](https://markdown-videos-api.jorgenkh.no/youtube/HIm2BOj8C0Q)](https://youtu.be/HIm2BOj8C0Q&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=66) # Homework * [2026 Homework](../cohorts/2026/06-batch/homework.md) # Community notes
Did you take notes? You can share them here:

* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/5_batch_processing.md)
* [Sandy's DE Learning Blog](https://learningdataengineering540969211.wordpress.com/2022/02/24/week-5-de-zoomcamp-5-2-1-installing-spark-on-linux/)
* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week5.md)
* [Alternative: using docker-compose to launch Spark, by rafik](https://gist.github.com/rafik-rahoui/f98df941c4ccced9c46e9ccbdef63a03)
* [Marcos Torregrosa's blog (Spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-5-batch-spark)
* [Notes by Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week5)
* [Notes by Oscar Garcia](https://github.com/ozkary/Data-Engineering-Bootcamp/tree/main/Step5-Batch-Processing)
* [Notes by HongWei](https://github.com/hwchua0209/data-engineering-zoomcamp-submission/blob/main/05-batch-processing/README.md)
* [2024 videos transcript](https://drive.google.com/drive/folders/1XMmP4H5AMm1qCfMFxc_hqaPGw31KIVcb?usp=drive_link) by Maria Fisher
* [2025 Notes by Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/5_Batch-Processing-Spark/README.md)
* [2025 Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/05_batch_processing/00_notes.md)
* [2025 Notes on Installing Spark on MacOS (with Anaconda + brew) by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/05_batch_processing/01_env_setup.md)
* [2025 Notes by Daniel Lachner](https://github.com/mossdet/dlp_data_eng/blob/main/Notes/05_01_Batch_Processing_Spark_GCP.pdf)
* [2026 Notes by Ajay Katte](https://github.com/mushroomsandchai/dtdez/tree/main/06_batch_processing/notes)
* Add your notes here (above this line)
================================================ FILE: 06-batch/code/03_test.ipynb ================================================ { "cells": [ { "cell_type": "code", "execution_count": 1, "id": "72505747", "metadata": {}, "outputs": [], "source": [ "import pyspark" ] }, { "cell_type": "code", "execution_count": 3, "id": "bd55afbe", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/__init__.py'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyspark.__file__" ] }, { "cell_type": "code", "execution_count": 4, "id": "29f1cf4c", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import SparkSession" ] }, { "cell_type": "code", "execution_count": 5, "id": "cf6d80ad", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: An illegal reflective access operation has occurred\n", "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", "WARNING: All illegal access operations will be denied in a future release\n", "22/02/15 22:22:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "spark = SparkSession.builder \\\n", " .master(\"local[*]\") \\\n", " .appName('test') \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": null, "id": "3f604529", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-02-15 22:23:22-- https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv\n", "Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.196.8\n", "Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.196.8|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 12322 (12K) [application/octet-stream]\n", "Saving to: ‘taxi+_zone_lookup.csv’\n", "\n", "taxi+_zone_lookup.c 100%[===================>] 12.03K --.-KB/s in 0s \n", "\n", "2022-02-15 22:23:23 (114 MB/s) - ‘taxi+_zone_lookup.csv’ saved [12322/12322]\n", "\n" ] } ], "source": [ "!wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv" ] }, { "cell_type": "code", "execution_count": null, "id": "12342345", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"LocationID\",\"Borough\",\"Zone\",\"service_zone\"\r\n", "\r\n", "1,\"EWR\",\"Newark Airport\",\"EWR\"\r\n", "\r\n", "2,\"Queens\",\"Jamaica Bay\",\"Boro Zone\"\r\n", "\r\n", "3,\"Bronx\",\"Allerton/Pelham Gardens\",\"Boro Zone\"\r\n", "\r\n", "4,\"Manhattan\",\"Alphabet City\",\"Yellow Zone\"\r\n", "\r\n", "5,\"Staten Island\",\"Arden Heights\",\"Boro Zone\"\r\n", "\r\n", "6,\"Staten Island\",\"Arrochar/Fort Wadsworth\",\"Boro Zone\"\r\n", "\r\n", "7,\"Queens\",\"Astoria\",\"Boro Zone\"\r\n", "\r\n", "8,\"Queens\",\"Astoria Park\",\"Boro Zone\"\r\n", "\r\n", "9,\"Queens\",\"Auburndale\",\"Boro Zone\"\r\n", "\r\n" ] } ], "source": [ "!head taxi_zone_lookup.csv" ] }, { "cell_type": "code", "execution_count": null, "id": "809464d0", "metadata": {}, "outputs": [], "source": [ "df = spark.read \\\n", " .option(\"header\", \"true\") \\\n", " 
.csv('taxi_zone_lookup.csv')" ] }, { "cell_type": "code", "execution_count": 11, "id": "e36dd996", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+-------------+--------------------+------------+\n", "|LocationID| Borough| Zone|service_zone|\n", "+----------+-------------+--------------------+------------+\n", "| 1| EWR| Newark Airport| EWR|\n", "| 2| Queens| Jamaica Bay| Boro Zone|\n", "| 3| Bronx|Allerton/Pelham G...| Boro Zone|\n", "| 4| Manhattan| Alphabet City| Yellow Zone|\n", "| 5|Staten Island| Arden Heights| Boro Zone|\n", "| 6|Staten Island|Arrochar/Fort Wad...| Boro Zone|\n", "| 7| Queens| Astoria| Boro Zone|\n", "| 8| Queens| Astoria Park| Boro Zone|\n", "| 9| Queens| Auburndale| Boro Zone|\n", "| 10| Queens| Baisley Park| Boro Zone|\n", "| 11| Brooklyn| Bath Beach| Boro Zone|\n", "| 12| Manhattan| Battery Park| Yellow Zone|\n", "| 13| Manhattan| Battery Park City| Yellow Zone|\n", "| 14| Brooklyn| Bay Ridge| Boro Zone|\n", "| 15| Queens|Bay Terrace/Fort ...| Boro Zone|\n", "| 16| Queens| Bayside| Boro Zone|\n", "| 17| Brooklyn| Bedford| Boro Zone|\n", "| 18| Bronx| Bedford Park| Boro Zone|\n", "| 19| Queens| Bellerose| Boro Zone|\n", "| 20| Bronx| Belmont| Boro Zone|\n", "+----------+-------------+--------------------+------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "df.show()" ] }, { "cell_type": "code", "execution_count": 12, "id": "cb547351", "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r\n", "[Stage 4:> (0 + 1) / 1]\r\n", "\r\n", " \r" ] } ], "source": [ "df.write.parquet('zones')" ] }, { "cell_type": "code", "execution_count": 14, "id": "02fe2bdb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 28K\r\n", "-rw-rw-r-- 1 alexey alexey 6.8K Feb 15 22:25 Untitled.ipynb\r\n", "-rw-rw-r-- 1 alexey alexey 13K Aug 17 2016 taxi+_zone_lookup.csv\r\n", "drwxr-xr-x 2 alexey alexey 4.0K 
Feb 15 22:25 zones\r\n" ] } ], "source": [ "!ls -lh" ] }, { "cell_type": "code", "execution_count": null, "id": "659f0812", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: 06-batch/code/04_pyspark.ipynb ================================================ { "cells": [ { "cell_type": "code", "execution_count": 1, "id": "07de9dc3", "metadata": {}, "outputs": [], "source": [ "import pyspark\n", "from pyspark.sql import SparkSession" ] }, { "cell_type": "code", "execution_count": 2, "id": "ca5bbb06", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: An illegal reflective access operation has occurred\n", "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", "WARNING: All illegal access operations will be denied in a future release\n", "22/02/16 21:11:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "spark = SparkSession.builder \\\n", " .master(\"local[*]\") \\\n", " .appName('test') \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 3, "id": "cf8de204", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2022-02-16 21:13:50-- https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-01.csv\n", "Resolving nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)... 52.217.84.132\n", "Connecting to nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)|52.217.84.132|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 752335705 (717M) [text/csv]\n", "Saving to: ‘fhvhv_tripdata_2021-01.csv’\n", "\n", "fhvhv_tripdata_2021 100%[===================>] 717.48M 35.6MB/s in 21s \n", "\n", "2022-02-16 21:14:11 (34.4 MB/s) - ‘fhvhv_tripdata_2021-01.csv’ saved [752335705/752335705]\n", "\n" ] } ], "source": [ "!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz" ] }, { "cell_type": "code", "execution_count": null, "id": "201a5957", "metadata": {}, "outputs": [], "source": [ "!gzip -dc fhvhv_tripdata_2021-01.csv.gz > fhvhv_tripdata_2021-01.csv" ] }, { "cell_type": "code", "execution_count": 4, "id": "2a52087c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11908469 fhvhv_tripdata_2021-01.csv\r\n" ] } ], "source": [ "!wc -l fhvhv_tripdata_2021-01.csv" ] }, { "cell_type": "code", "execution_count": 5, "id": "931021a7", "metadata": {}, "outputs": [], "source": [ "df = spark.read \\\n", " .option(\"header\", \"true\") \\\n", " .csv('fhvhv_tripdata_2021-01.csv')" ] }, { "cell_type": "code", "execution_count": 10, "id": "d44b7839", "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"StructType(List(StructField(hvfhs_license_num,StringType,true),StructField(dispatching_base_num,StringType,true),StructField(pickup_datetime,StringType,true),StructField(dropoff_datetime,StringType,true),StructField(PULocationID,StringType,true),StructField(DOLocationID,StringType,true),StructField(SR_Flag,StringType,true)))" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.schema" ] }, { "cell_type": "code", "execution_count": 14, "id": "4249e790", "metadata": {}, "outputs": [], "source": [ "!head -n 1001 fhvhv_tripdata_2021-01.csv > head.csv" ] }, { "cell_type": "code", "execution_count": 15, "id": "6894312c", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 16, "id": "f3ca771b", "metadata": {}, "outputs": [], "source": [ "df_pandas = pd.read_csv('head.csv')" ] }, { "cell_type": "code", "execution_count": 19, "id": "f1066b4f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "hvfhs_license_num object\n", "dispatching_base_num object\n", "pickup_datetime object\n", "dropoff_datetime object\n", "PULocationID int64\n", "DOLocationID int64\n", "SR_Flag float64\n", "dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pandas.dtypes" ] }, { "cell_type": "code", "execution_count": 23, "id": "f8413c9d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "StructType(List(StructField(hvfhs_license_num,StringType,true),StructField(dispatching_base_num,StringType,true),StructField(pickup_datetime,StringType,true),StructField(dropoff_datetime,StringType,true),StructField(PULocationID,LongType,true),StructField(DOLocationID,LongType,true),StructField(SR_Flag,DoubleType,true)))" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.createDataFrame(df_pandas).schema" ] }, { "cell_type": "markdown", "id": "80f252c1", "metadata": {}, "source": [ "Integer 
- 4 bytes\n", "Long - 8 bytes" ] }, { "cell_type": "code", "execution_count": 24, "id": "16937bfd", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import types" ] }, { "cell_type": "code", "execution_count": 26, "id": "fc61a99a", "metadata": {}, "outputs": [], "source": [ "schema = types.StructType([\n", " types.StructField('hvfhs_license_num', types.StringType(), True),\n", " types.StructField('dispatching_base_num', types.StringType(), True),\n", " types.StructField('pickup_datetime', types.TimestampType(), True),\n", " types.StructField('dropoff_datetime', types.TimestampType(), True),\n", " types.StructField('PULocationID', types.IntegerType(), True),\n", " types.StructField('DOLocationID', types.IntegerType(), True),\n", " types.StructField('SR_Flag', types.StringType(), True)\n", "])" ] }, { "cell_type": "code", "execution_count": 32, "id": "f94052ae", "metadata": {}, "outputs": [], "source": [ "df = spark.read \\\n", " .option(\"header\", \"true\") \\\n", " .schema(schema) \\\n", " .csv('fhvhv_tripdata_2021-01.csv')" ] }, { "cell_type": "code", "execution_count": 36, "id": "c270d9d6", "metadata": {}, "outputs": [], "source": [ "df = df.repartition(24)" ] }, { "cell_type": "code", "execution_count": null, "id": "7796c2b2", "metadata": {}, "outputs": [], "source": [ "df.write.parquet('fhvhv/2021/01/')" ] }, { "cell_type": "code", "execution_count": 44, "id": "c3cab876", "metadata": {}, "outputs": [], "source": [ "df = spark.read.parquet('fhvhv/2021/01/')" ] }, { "cell_type": "code", "execution_count": 48, "id": "203b5627", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- hvfhs_license_num: string (nullable = true)\n", " |-- dispatching_base_num: string (nullable = true)\n", " |-- pickup_datetime: timestamp (nullable = true)\n", " |-- dropoff_datetime: timestamp (nullable = true)\n", " |-- PULocationID: integer (nullable = true)\n", " |-- DOLocationID: integer (nullable = true)\n", " |-- SR_Flag: 
string (nullable = true)\n", "\n" ] } ], "source": [ "df.printSchema()" ] }, { "cell_type": "markdown", "id": "64172a47", "metadata": {}, "source": [ "SELECT * FROM df WHERE hvfhs_license_num = 'HV0003'" ] }, { "cell_type": "code", "execution_count": 56, "id": "d24840a0", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import functions as F" ] }, { "cell_type": "code", "execution_count": 61, "id": "3ab1ca44", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+\n", "|hvfhs_license_num|dispatching_base_num| pickup_datetime| dropoff_datetime|PULocationID|DOLocationID|SR_Flag|\n", "+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+\n", "| HV0005| B02510|2021-01-07 06:43:22|2021-01-07 06:55:06| 142| 230| null|\n", "| HV0005| B02510|2021-01-01 16:01:26|2021-01-01 16:20:20| 133| 91| null|\n", "| HV0003| B02764|2021-01-01 00:23:13|2021-01-01 00:30:35| 147| 159| null|\n", "| HV0003| B02869|2021-01-06 11:43:12|2021-01-06 11:55:07| 79| 164| null|\n", "| HV0003| B02884|2021-01-04 15:35:32|2021-01-04 15:52:02| 174| 18| null|\n", "| HV0003| B02875|2021-01-04 13:42:15|2021-01-04 14:04:57| 201| 180| null|\n", "| HV0005| B02510|2021-01-04 18:57:31|2021-01-04 19:09:55| 230| 142| null|\n", "| HV0003| B02872|2021-01-03 18:42:03|2021-01-03 19:12:22| 132| 72| null|\n", "| HV0004| B02800|2021-01-01 05:31:50|2021-01-01 05:40:03| 188| 61| null|\n", "| HV0005| B02510|2021-01-04 20:21:47|2021-01-04 20:26:03| 97| 189| null|\n", "| HV0003| B02764|2021-01-01 01:51:18|2021-01-01 02:05:32| 174| 235| null|\n", "| HV0003| B02871|2021-01-05 10:20:54|2021-01-05 10:32:44| 35| 76| null|\n", "| HV0005| B02510|2021-01-06 02:32:09|2021-01-06 02:43:35| 35| 39| null|\n", "| HV0003| B02882|2021-01-04 12:34:52|2021-01-04 12:38:59| 231| 13| null|\n", "| HV0003| 
B02617|2021-01-02 20:12:56|2021-01-02 20:41:18| 87| 127| null|\n", "| HV0005| B02510|2021-01-02 16:55:48|2021-01-02 17:20:40| 17| 89| null|\n", "| HV0003| B02869|2021-01-02 15:14:38|2021-01-02 15:23:27| 11| 14| null|\n", "| HV0005| B02510|2021-01-01 05:54:50|2021-01-01 06:03:46| 21| 26| null|\n", "| HV0003| B02869|2021-01-04 12:40:42|2021-01-04 12:48:34| 83| 260| null|\n", "| HV0005| B02510|2021-01-01 14:58:57|2021-01-01 15:09:53| 189| 52| null|\n", "+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "df.show()" ] }, { "cell_type": "code", "execution_count": 63, "id": "6d98c2ce", "metadata": {}, "outputs": [], "source": [ "def crazy_stuff(base_num):\n", " num = int(base_num[1:])\n", " if num % 7 == 0:\n", " return f's/{num:03x}'\n", " elif num % 3 == 0:\n", " return f'a/{num:03x}'\n", " else:\n", " return f'e/{num:03x}'" ] }, { "cell_type": "code", "execution_count": 65, "id": "f3175419", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'s/b44'" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "crazy_stuff('B02884')" ] }, { "cell_type": "code", "execution_count": 66, "id": "9bb5d503", "metadata": {}, "outputs": [], "source": [ "crazy_stuff_udf = F.udf(crazy_stuff, returnType=types.StringType())" ] }, { "cell_type": "code", "execution_count": 67, "id": "b38f0465", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+-----------+------------+------------+------------+\n", "|base_id|pickup_date|dropoff_date|PULocationID|DOLocationID|\n", "+-------+-----------+------------+------------+------------+\n", "| e/9ce| 2021-01-07| 2021-01-07| 142| 230|\n", "| e/9ce| 2021-01-01| 2021-01-01| 133| 91|\n", "| e/acc| 2021-01-01| 2021-01-01| 147| 159|\n", "| e/b35| 2021-01-06| 2021-01-06| 79| 164|\n", "| s/b44| 2021-01-04| 2021-01-04| 174| 18|\n", "| 
e/b3b| 2021-01-04| 2021-01-04| 201| 180|\n", "| e/9ce| 2021-01-04| 2021-01-04| 230| 142|\n", "| e/b38| 2021-01-03| 2021-01-03| 132| 72|\n", "| s/af0| 2021-01-01| 2021-01-01| 188| 61|\n", "| e/9ce| 2021-01-04| 2021-01-04| 97| 189|\n", "| e/acc| 2021-01-01| 2021-01-01| 174| 235|\n", "| a/b37| 2021-01-05| 2021-01-05| 35| 76|\n", "| e/9ce| 2021-01-06| 2021-01-06| 35| 39|\n", "| e/b42| 2021-01-04| 2021-01-04| 231| 13|\n", "| e/a39| 2021-01-02| 2021-01-02| 87| 127|\n", "| e/9ce| 2021-01-02| 2021-01-02| 17| 89|\n", "| e/b35| 2021-01-02| 2021-01-02| 11| 14|\n", "| e/9ce| 2021-01-01| 2021-01-01| 21| 26|\n", "| e/b35| 2021-01-04| 2021-01-04| 83| 260|\n", "| e/9ce| 2021-01-01| 2021-01-01| 189| 52|\n", "+-------+-----------+------------+------------+------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "df \\\n", " .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \\\n", " .withColumn('dropoff_date', F.to_date(df.dropoff_datetime)) \\\n", " .withColumn('base_id', crazy_stuff_udf(df.dispatching_base_num)) \\\n", " .select('base_id', 'pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID') \\\n", " .show()" ] }, { "cell_type": "code", "execution_count": 55, "id": "00921644", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Row(pickup_datetime=datetime.datetime(2021, 1, 1, 0, 23, 13), dropoff_datetime=datetime.datetime(2021, 1, 1, 0, 30, 35), PULocationID=147, DOLocationID=159),\n", " Row(pickup_datetime=datetime.datetime(2021, 1, 6, 11, 43, 12), dropoff_datetime=datetime.datetime(2021, 1, 6, 11, 55, 7), PULocationID=79, DOLocationID=164),\n", " Row(pickup_datetime=datetime.datetime(2021, 1, 4, 15, 35, 32), dropoff_datetime=datetime.datetime(2021, 1, 4, 15, 52, 2), PULocationID=174, DOLocationID=18),\n", " Row(pickup_datetime=datetime.datetime(2021, 1, 4, 13, 42, 15), dropoff_datetime=datetime.datetime(2021, 1, 4, 14, 4, 57), PULocationID=201, DOLocationID=180),\n", " Row(pickup_datetime=datetime.datetime(2021, 1, 3, 18, 42, 3), 
dropoff_datetime=datetime.datetime(2021, 1, 3, 19, 12, 22), PULocationID=132, DOLocationID=72)]" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID') \\\n", " .filter(df.hvfhs_license_num == 'HV0003')\n" ] }, { "cell_type": "code", "execution_count": 50, "id": "0866f9c0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hvfhs_license_num,dispatching_base_num,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,SR_Flag\r\n", "\r\n", "HV0003,B02682,2021-01-01 00:33:44,2021-01-01 00:49:07,230,166,\r\n", "\r\n", "HV0003,B02682,2021-01-01 00:55:19,2021-01-01 01:18:21,152,167,\r\n", "\r\n", "HV0003,B02764,2021-01-01 00:23:56,2021-01-01 00:38:05,233,142,\r\n", "\r\n", "HV0003,B02764,2021-01-01 00:42:51,2021-01-01 00:45:50,142,143,\r\n", "\r\n", "HV0003,B02764,2021-01-01 00:48:14,2021-01-01 01:08:42,143,78,\r\n", "\r\n", "HV0005,B02510,2021-01-01 00:06:59,2021-01-01 00:43:01,88,42,\r\n", "\r\n", "HV0005,B02510,2021-01-01 00:50:00,2021-01-01 01:04:57,42,151,\r\n", "\r\n", "HV0003,B02764,2021-01-01 00:14:30,2021-01-01 00:50:27,71,226,\r\n", "\r\n", "HV0003,B02875,2021-01-01 00:22:54,2021-01-01 00:30:20,112,255,\r\n", "\r\n" ] } ], "source": [ "!head -n 10 head.csv" ] }, { "cell_type": "code", "execution_count": null, "id": "aa1b0e18", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: 06-batch/code/05_taxi_schema.ipynb ================================================ { "cells": [ { 
"cell_type": "code", "execution_count": 1, "id": "8c1d0c08", "metadata": {}, "outputs": [], "source": [ "import pyspark\n", "from pyspark.sql import SparkSession" ] }, { "cell_type": "code", "execution_count": 2, "id": "96a248f5", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: An illegal reflective access operation has occurred\n", "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", "WARNING: All illegal access operations will be denied in a future release\n", "22/02/17 21:59:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "spark = SparkSession.builder \\\n", " .master(\"local[*]\") \\\n", " .appName('test') \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 7, "id": "c53274b1", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 12, "id": "5d8434e1", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import types" ] }, { "cell_type": "code", "execution_count": 18, "id": "a84c6c6d", "metadata": {}, "outputs": [], "source": [ "green_schema = types.StructType([\n", " types.StructField(\"VendorID\", types.IntegerType(), True),\n", " types.StructField(\"lpep_pickup_datetime\", types.TimestampType(), True),\n", " types.StructField(\"lpep_dropoff_datetime\", types.TimestampType(), True),\n", " types.StructField(\"store_and_fwd_flag\", types.StringType(), True),\n", " types.StructField(\"RatecodeID\", types.IntegerType(), True),\n", " types.StructField(\"PULocationID\", types.IntegerType(), True),\n", " types.StructField(\"DOLocationID\", types.IntegerType(), True),\n", " types.StructField(\"passenger_count\", types.IntegerType(), True),\n", " types.StructField(\"trip_distance\", types.DoubleType(), True),\n", " types.StructField(\"fare_amount\", types.DoubleType(), True),\n", " types.StructField(\"extra\", types.DoubleType(), True),\n", " types.StructField(\"mta_tax\", types.DoubleType(), True),\n", " types.StructField(\"tip_amount\", types.DoubleType(), True),\n", " types.StructField(\"tolls_amount\", types.DoubleType(), True),\n", " types.StructField(\"ehail_fee\", types.DoubleType(), True),\n", " types.StructField(\"improvement_surcharge\", types.DoubleType(), True),\n", " types.StructField(\"total_amount\", types.DoubleType(), True),\n", " types.StructField(\"payment_type\", types.IntegerType(), True),\n", " types.StructField(\"trip_type\", types.IntegerType(), True),\n", " types.StructField(\"congestion_surcharge\", types.DoubleType(), 
True)\n", "])\n", "\n", "yellow_schema = types.StructType([\n", " types.StructField(\"VendorID\", types.IntegerType(), True),\n", " types.StructField(\"tpep_pickup_datetime\", types.TimestampType(), True),\n", " types.StructField(\"tpep_dropoff_datetime\", types.TimestampType(), True),\n", " types.StructField(\"passenger_count\", types.IntegerType(), True),\n", " types.StructField(\"trip_distance\", types.DoubleType(), True),\n", " types.StructField(\"RatecodeID\", types.IntegerType(), True),\n", " types.StructField(\"store_and_fwd_flag\", types.StringType(), True),\n", " types.StructField(\"PULocationID\", types.IntegerType(), True),\n", " types.StructField(\"DOLocationID\", types.IntegerType(), True),\n", " types.StructField(\"payment_type\", types.IntegerType(), True),\n", " types.StructField(\"fare_amount\", types.DoubleType(), True),\n", " types.StructField(\"extra\", types.DoubleType(), True),\n", " types.StructField(\"mta_tax\", types.DoubleType(), True),\n", " types.StructField(\"tip_amount\", types.DoubleType(), True),\n", " types.StructField(\"tolls_amount\", types.DoubleType(), True),\n", " types.StructField(\"improvement_surcharge\", types.DoubleType(), True),\n", " types.StructField(\"total_amount\", types.DoubleType(), True),\n", " types.StructField(\"congestion_surcharge\", types.DoubleType(), True)\n", "])" ] }, { "cell_type": "code", "execution_count": 27, "id": "3f7e0cb9", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing 
data for 2020/4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/6\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/7\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/8\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/9\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/10\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/11\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/12\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "year = 2020\n", "\n", "for month in range(1, 13):\n", " print(f'processing data for {year}/{month}')\n", "\n", " input_path = f'data/raw/green/{year}/{month:02d}/'\n", " output_path = f'data/pq/green/{year}/{month:02d}/'\n", "\n", " df_green = spark.read \\\n", " .option(\"header\", \"true\") \\\n", " .schema(green_schema) \\\n", " .csv(input_path)\n", "\n", " df_green \\\n", " .repartition(4) \\\n", " .write.parquet(output_path)" ] }, { "cell_type": "code", "execution_count": 26, "id": "96ac2ad7", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing data for 
2021/1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/6\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/7\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 15:> (0 + 1) / 1]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/8\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", " \r" ] }, { "ename": "AnalysisException", "evalue": "Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/green/2021/08;", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/tmp/ipykernel_129101/906373977.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0moutput_path\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34mf'data/pq/green/{year}/{month:02d}/'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0mdf_green\u001b[0m 
\u001b[0;34m=\u001b[0m \u001b[0mspark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0moption\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"header\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"true\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mgreen_schema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/readwriter.py\u001b[0m in \u001b[0;36mcsv\u001b[0;34m(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup)\u001b[0m\n\u001b[1;32m 536\u001b[0m \u001b[0mpath\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 537\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 538\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_df\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jreader\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcsv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_spark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_sc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jvm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPythonUtils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtoSeq\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 539\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mRDD\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 540\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miterator\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1302\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1303\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1304\u001b[0;31m return_value = get_return_value(\n\u001b[0m\u001b[1;32m 1305\u001b[0m answer, self.gateway_client, self.target_id, self.name)\n\u001b[1;32m 1306\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\u001b[0m in 
\u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 132\u001b[0m \u001b[0;31m# Hide where the exception came from that shows a non-Pythonic\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 133\u001b[0m \u001b[0;31m# JVM exception message.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 134\u001b[0;31m \u001b[0mraise_from\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconverted\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 135\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 136\u001b[0m \u001b[0;32mraise\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mraise_from\u001b[0;34m(e)\u001b[0m\n", "\u001b[0;31mAnalysisException\u001b[0m: Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/green/2021/08;" ] } ], "source": [ "year = 2021 \n", "\n", "for month in range(1, 13):\n", " print(f'processing data for {year}/{month}')\n", "\n", " input_path = f'data/raw/green/{year}/{month:02d}/'\n", " output_path = f'data/pq/green/{year}/{month:02d}/'\n", "\n", " df_green = spark.read \\\n", " .option(\"header\", \"true\") \\\n", " .schema(green_schema) \\\n", " .csv(input_path)\n", "\n", " df_green \\\n", " .repartition(4) \\\n", " .write.parquet(output_path)" ] }, { "cell_type": "code", "execution_count": null, "id": "463c7dc8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 23, "id": "6ff4265d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 24, "id": "6e982d29", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 28, "id": "19326bc9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ "processing data for 2020/1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/6\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/7\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/8\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/9\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/10\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/11\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2020/12\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "year = 2020\n", "\n", "for month in range(1, 13):\n", " print(f'processing data for {year}/{month}')\n", "\n", " input_path = 
f'data/raw/yellow/{year}/{month:02d}/'\n", " output_path = f'data/pq/yellow/{year}/{month:02d}/'\n", "\n", " df_yellow = spark.read \\\n", " .option(\"header\", \"true\") \\\n", " .schema(yellow_schema) \\\n", " .csv(input_path)\n", "\n", " df_yellow \\\n", " .repartition(4) \\\n", " .write.parquet(output_path)" ] }, { "cell_type": "code", "execution_count": 29, "id": "aeca811a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/6\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/7\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Stage 78:===========================================> (3 + 1) / 4]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "processing data for 2021/8\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", " \r" ] }, { "ename": "AnalysisException", "evalue": "Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/yellow/2021/08;", "output_type": "error", "traceback": [ 
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAnalysisException\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/tmp/ipykernel_129101/2088663510.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0moutput_path\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34mf'data/pq/yellow/{year}/{month:02d}/'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0mdf_yellow\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mspark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0moption\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"header\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"true\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mschema\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0myellow_schema\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/readwriter.py\u001b[0m in \u001b[0;36mcsv\u001b[0;34m(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup)\u001b[0m\n\u001b[1;32m 536\u001b[0m \u001b[0mpath\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m[\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 537\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 538\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_df\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jreader\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcsv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_spark\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_sc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_jvm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPythonUtils\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtoSeq\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 539\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mRDD\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 540\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0miterator\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m 1302\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1303\u001b[0m \u001b[0manswer\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgateway_client\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msend_command\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcommand\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1304\u001b[0;31m return_value = get_return_value(\n\u001b[0m\u001b[1;32m 1305\u001b[0m answer, self.gateway_client, self.target_id, self.name)\n\u001b[1;32m 1306\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mdeco\u001b[0;34m(*a, **kw)\u001b[0m\n\u001b[1;32m 132\u001b[0m \u001b[0;31m# Hide where the exception came from that shows a non-Pythonic\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 133\u001b[0m \u001b[0;31m# JVM exception message.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 134\u001b[0;31m \u001b[0mraise_from\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mconverted\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 135\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 136\u001b[0m \u001b[0;32mraise\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\u001b[0m in \u001b[0;36mraise_from\u001b[0;34m(e)\u001b[0m\n", "\u001b[0;31mAnalysisException\u001b[0m: Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/yellow/2021/08;" ] } ], "source": [ "year = 2021\n", "\n", "for month in range(1, 13):\n", " print(f'processing data for {year}/{month}')\n", "\n", " input_path = f'data/raw/yellow/{year}/{month:02d}/'\n", " output_path = f'data/pq/yellow/{year}/{month:02d}/'\n", "\n", " df_yellow = spark.read \\\n", " .option(\"header\", \"true\") \\\n", " .schema(yellow_schema) \\\n", " .csv(input_path)\n", 
"\n", " df_yellow \\\n", " .repartition(4) \\\n", " .write.parquet(output_path)" ] }, { "cell_type": "code", "execution_count": null, "id": "d7eb0da9", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: 06-batch/code/06_spark_sql.ipynb ================================================ { "cells": [ { "cell_type": "code", "execution_count": 1, "id": "3307b886", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: An illegal reflective access operation has occurred\n", "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", "WARNING: All illegal access operations will be denied in a future release\n", "22/02/17 22:43:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "import pyspark\n", "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.builder \\\n", " .master(\"local[*]\") \\\n", " .appName('test') \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 2, "id": "1ee1eb1d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "df_green = spark.read.parquet('data/pq/green/*/*')" ] }, { "cell_type": "code", "execution_count": null, "id": "0ca5ee99", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 16, "id": "649bb4da", "metadata": {}, "outputs": [], "source": [ "df_green = df_green \\\n", " .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \\\n", " .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')" ] }, { "cell_type": "code", "execution_count": 5, "id": "90cd6845", "metadata": {}, "outputs": [], "source": [ "df_yellow = spark.read.parquet('data/pq/yellow/*/*')" ] }, { "cell_type": "code", "execution_count": 19, "id": "88822efd", "metadata": {}, "outputs": [], "source": [ "df_yellow = df_yellow \\\n", " .withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \\\n", " .withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')" ] }, { "cell_type": "code", "execution_count": 22, "id": "610167a2", "metadata": {}, "outputs": [], "source": [ "common_columns = []\n", "\n", "yellow_columns = set(df_yellow.columns)\n", "\n", "for col in df_green.columns:\n", " if col in yellow_columns:\n", " common_columns.append(col)" ] }, { "cell_type": "code", "execution_count": 26, "id": "839d773f", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import functions as F" ] }, { "cell_type": "code", "execution_count": 28, "id": "2498810a", "metadata": {}, "outputs": [], "source": [ "df_green_sel = df_green \\\n", " .select(common_columns) \\\n", " .withColumn('service_type', F.lit('green'))" ] }, { "cell_type": "code", 
"execution_count": 29, "id": "19032efc", "metadata": {}, "outputs": [], "source": [ "df_yellow_sel = df_yellow \\\n", " .select(common_columns) \\\n", " .withColumn('service_type', F.lit('yellow'))" ] }, { "cell_type": "code", "execution_count": 30, "id": "f5b0f3d1", "metadata": {}, "outputs": [], "source": [ "df_trips_data = df_green_sel.unionAll(df_yellow_sel)" ] }, { "cell_type": "code", "execution_count": 33, "id": "1bed8b33", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+------------+--------+\n", "|service_type| count|\n", "+------------+--------+\n", "| green| 2304517|\n", "| yellow|39649199|\n", "+------------+--------+\n", "\n" ] } ], "source": [ "df_trips_data.groupBy('service_type').count().show()" ] }, { "cell_type": "code", "execution_count": 40, "id": "28cc8fa3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['VendorID',\n", " 'pickup_datetime',\n", " 'dropoff_datetime',\n", " 'store_and_fwd_flag',\n", " 'RatecodeID',\n", " 'PULocationID',\n", " 'DOLocationID',\n", " 'passenger_count',\n", " 'trip_distance',\n", " 'fare_amount',\n", " 'extra',\n", " 'mta_tax',\n", " 'tip_amount',\n", " 'tolls_amount',\n", " 'improvement_surcharge',\n", " 'total_amount',\n", " 'payment_type',\n", " 'congestion_surcharge',\n", " 'service_type']" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_trips_data.columns" ] }, { "cell_type": "code", "execution_count": 35, "id": "36e90cbc", "metadata": {}, "outputs": [], "source": [ "df_trips_data.registerTempTable('trips_data')" ] }, { "cell_type": "code", "execution_count": 38, "id": "d0e01bf1", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+------------+--------+\n", "|service_type|count(1)|\n", "+------------+--------+\n", "| green| 2304517|\n", "| 
yellow|39649199|\n", "+------------+--------+\n", "\n" ] } ], "source": [ "spark.sql(\"\"\"\n", "SELECT\n", " service_type,\n", " count(1)\n", "FROM\n", " trips_data\n", "GROUP BY \n", " service_type\n", "\"\"\").show()" ] }, { "cell_type": "code", "execution_count": null, "id": "b2ee7038", "metadata": {}, "outputs": [], "source": [ "df_result = spark.sql(\"\"\"\n", "SELECT \n", " -- Revenue grouping \n", " PULocationID AS revenue_zone,\n", " date_trunc('month', pickup_datetime) AS revenue_month, \n", " service_type, \n", "\n", " -- Revenue calculation \n", " SUM(fare_amount) AS revenue_monthly_fare,\n", " SUM(extra) AS revenue_monthly_extra,\n", " SUM(mta_tax) AS revenue_monthly_mta_tax,\n", " SUM(tip_amount) AS revenue_monthly_tip_amount,\n", " SUM(tolls_amount) AS revenue_monthly_tolls_amount,\n", " SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,\n", " SUM(total_amount) AS revenue_monthly_total_amount,\n", " SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,\n", "\n", " -- Additional calculations\n", " AVG(passenger_count) AS avg_monthly_passenger_count,\n", " AVG(trip_distance) AS avg_monthly_trip_distance\n", "FROM\n", " trips_data\n", "GROUP BY\n", " 1, 2, 3\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 49, "id": "f67eeb92", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "df_result.coalesce(1).write.parquet('data/report/revenue/', mode='overwrite')" ] }, { "cell_type": "code", "execution_count": null, "id": "f56a885d", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 } 
================================================
FILE: 06-batch/code/06_spark_sql.py
================================================
#!/usr/bin/env python
# coding: utf-8

import argparse

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


# Input and output locations are passed on the command line
parser = argparse.ArgumentParser()

parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)

args = parser.parse_args()

input_green = args.input_green
input_yellow = args.input_yellow
output = args.output


# No .master() here: the master is expected to come from spark-submit
spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()

df_green = spark.read.parquet(input_green)

# Give both datasets the same pickup/dropoff column names
df_green = df_green \
    .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \
    .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')

df_yellow = spark.read.parquet(input_yellow)

df_yellow = df_yellow \
    .withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \
    .withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')

# Columns present in both the green and the yellow dataset
common_columns = [
    'VendorID',
    'pickup_datetime',
    'dropoff_datetime',
    'store_and_fwd_flag',
    'RatecodeID',
    'PULocationID',
    'DOLocationID',
    'passenger_count',
    'trip_distance',
    'fare_amount',
    'extra',
    'mta_tax',
    'tip_amount',
    'tolls_amount',
    'improvement_surcharge',
    'total_amount',
    'payment_type',
    'congestion_surcharge'
]

df_green_sel = df_green \
    .select(common_columns) \
    .withColumn('service_type', F.lit('green'))

df_yellow_sel = df_yellow \
    .select(common_columns) \
    .withColumn('service_type', F.lit('yellow'))

df_trips_data = df_green_sel.unionAll(df_yellow_sel)

# createOrReplaceTempView replaces the deprecated registerTempTable
df_trips_data.createOrReplaceTempView('trips_data')

df_result = spark.sql("""
SELECT 
    -- Revenue grouping 
    PULocationID AS revenue_zone,
    date_trunc('month', pickup_datetime) AS revenue_month, 
    service_type, 

    -- Revenue calculation 
    SUM(fare_amount) AS revenue_monthly_fare,
    SUM(extra) AS revenue_monthly_extra,
    SUM(mta_tax) AS revenue_monthly_mta_tax,
    SUM(tip_amount) AS revenue_monthly_tip_amount,
    SUM(tolls_amount) AS revenue_monthly_tolls_amount,
    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,
    SUM(total_amount) AS revenue_monthly_total_amount,
    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,

    -- Additional calculations
    AVG(passenger_count) AS avg_monthly_passenger_count,
    AVG(trip_distance) AS avg_monthly_trip_distance
FROM
    trips_data
GROUP BY
    1, 2, 3
""")

# coalesce(1) writes the report as a single parquet file
df_result.coalesce(1) \
    .write.parquet(output, mode='overwrite')

================================================
FILE: 06-batch/code/06_spark_sql_big_query.py
================================================
#!/usr/bin/env python
# coding: utf-8

import argparse

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


# Input locations and the output BigQuery table are passed on the command line
parser = argparse.ArgumentParser()

parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)

args = parser.parse_args()

input_green = args.input_green
input_yellow = args.input_yellow
output = args.output


spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()

# Temporary GCS bucket used when writing to BigQuery
spark.conf.set('temporaryGcsBucket', 'dataproc-temp-europe-west6-828225226997-fckhkym8')

df_green = spark.read.parquet(input_green)

# Give both datasets the same pickup/dropoff column names
df_green = df_green \
    .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \
    .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')

df_yellow = spark.read.parquet(input_yellow)

df_yellow = df_yellow \
    .withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \
    .withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')

# Columns present in both the green and the yellow dataset
common_columns = [
    'VendorID',
    'pickup_datetime',
    'dropoff_datetime',
    'store_and_fwd_flag',
    'RatecodeID',
    'PULocationID',
    'DOLocationID',
    'passenger_count',
    'trip_distance',
    'fare_amount',
    'extra',
    'mta_tax',
    'tip_amount',
    'tolls_amount',
    'improvement_surcharge',
    'total_amount',
    'payment_type',
    'congestion_surcharge'
]

df_green_sel = df_green \
    .select(common_columns) \
    .withColumn('service_type', F.lit('green'))

df_yellow_sel = df_yellow \
    .select(common_columns) \
    .withColumn('service_type', F.lit('yellow'))

df_trips_data = df_green_sel.unionAll(df_yellow_sel)

# createOrReplaceTempView replaces the deprecated registerTempTable
df_trips_data.createOrReplaceTempView('trips_data')

df_result = spark.sql("""
SELECT 
    -- Revenue grouping 
    PULocationID AS revenue_zone,
    date_trunc('month', pickup_datetime) AS revenue_month, 
    service_type, 

    -- Revenue calculation 
    SUM(fare_amount) AS revenue_monthly_fare,
    SUM(extra) AS revenue_monthly_extra,
    SUM(mta_tax) AS revenue_monthly_mta_tax,
    SUM(tip_amount) AS revenue_monthly_tip_amount,
    SUM(tolls_amount) AS revenue_monthly_tolls_amount,
    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,
    SUM(total_amount) AS revenue_monthly_total_amount,
    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,

    -- Additional calculations
    AVG(passenger_count) AS avg_monthly_passenger_count,
    AVG(trip_distance) AS avg_monthly_trip_distance
FROM
    trips_data
GROUP BY
    1, 2, 3
""")

df_result.write.format('bigquery') \
    .option('table', output) \
    .save()

================================================
FILE: 06-batch/code/07_groupby_join.ipynb
================================================
{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "4341e0e6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: An illegal reflective access operation has occurred\n", "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", "WARNING: All illegal access operations will be denied in a future release\n", "22/02/18 21:41:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "import pyspark\n", "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.builder \\\n", " .master(\"local[*]\") \\\n", " .appName('test') \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 2, "id": "cd304aec", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "df_green = spark.read.parquet('data/pq/green/*/*')" ] }, { "cell_type": "code", "execution_count": 3, "id": "243991f3", "metadata": {}, "outputs": [], "source": [ "df_green.registerTempTable('green')" ] }, { "cell_type": "code", "execution_count": 18, "id": "e43764a7", "metadata": {}, "outputs": [], "source": [ "df_green_revenue = spark.sql(\"\"\"\n", "SELECT \n", " date_trunc('hour', lpep_pickup_datetime) AS hour, \n", " PULocationID AS zone,\n", "\n", " SUM(total_amount) AS amount,\n", " COUNT(1) AS number_records\n", "FROM\n", " green\n", "WHERE\n", " lpep_pickup_datetime >= '2020-01-01 00:00:00'\n", "GROUP BY\n", " 1, 2\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 26, "id": "3e00310e", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "df_green_revenue \\\n", " .repartition(20) \\\n", " .write.parquet('data/report/revenue/green', mode='overwrite')" ] }, { "cell_type": "code", "execution_count": 20, "id": "07ebb68c", "metadata": {}, "outputs": [], "source": [ "df_yellow = spark.read.parquet('data/pq/yellow/*/*')\n", "df_yellow.registerTempTable('yellow')" ] }, { "cell_type": "code", "execution_count": 22, "id": "9d5be29d", "metadata": {}, "outputs": [], "source": [ "df_yellow_revenue = spark.sql(\"\"\"\n", "SELECT \n", " date_trunc('hour', 
tpep_pickup_datetime) AS hour, \n", " PULocationID AS zone,\n", "\n", " SUM(total_amount) AS amount,\n", " COUNT(1) AS number_records\n", "FROM\n", " yellow\n", "WHERE\n", " tpep_pickup_datetime >= '2020-01-01 00:00:00'\n", "GROUP BY\n", " 1, 2\n", "\"\"\")" ] }, { "cell_type": "code", "execution_count": 27, "id": "8bd9264e", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "df_yellow_revenue \\\n", " .repartition(20) \\\n", " .write.parquet('data/report/revenue/yellow', mode='overwrite')" ] }, { "cell_type": "code", "execution_count": 46, "id": "fd5d74d7", "metadata": {}, "outputs": [], "source": [ "df_green_revenue = spark.read.parquet('data/report/revenue/green')\n", "df_yellow_revenue = spark.read.parquet('data/report/revenue/yellow')" ] }, { "cell_type": "code", "execution_count": 47, "id": "35015ee6", "metadata": {}, "outputs": [], "source": [ "df_green_revenue_tmp = df_green_revenue \\\n", " .withColumnRenamed('amount', 'green_amount') \\\n", " .withColumnRenamed('number_records', 'green_number_records')\n", "\n", "df_yellow_revenue_tmp = df_yellow_revenue \\\n", " .withColumnRenamed('amount', 'yellow_amount') \\\n", " .withColumnRenamed('number_records', 'yellow_number_records')" ] }, { "cell_type": "code", "execution_count": 48, "id": "ec9f34ea", "metadata": {}, "outputs": [], "source": [ "df_join = df_green_revenue_tmp.join(df_yellow_revenue_tmp, on=['hour', 'zone'], how='outer')" ] }, { "cell_type": "code", "execution_count": 50, "id": "10238be7", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "df_join.write.parquet('data/report/revenue/total', mode='overwrite')" ] }, { "cell_type": "code", "execution_count": 51, "id": "c3af7169", "metadata": {}, "outputs": [], "source": [ "df_join = spark.read.parquet('data/report/revenue/total')" ] }, { "cell_type": "code", "execution_count": 56, "id": "bc2a6680", "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "DataFrame[hour: timestamp, zone: int, green_amount: double, green_number_records: bigint, yellow_amount: double, yellow_number_records: bigint]" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_join" ] }, { "cell_type": "code", "execution_count": 54, "id": "abb46398", "metadata": {}, "outputs": [], "source": [ "df_zones = spark.read.parquet('zones/')" ] }, { "cell_type": "code", "execution_count": 57, "id": "b3cf98a5", "metadata": {}, "outputs": [], "source": [ "df_result = df_join.join(df_zones, df_join.zone == df_zones.LocationID)" ] }, { "cell_type": "code", "execution_count": 62, "id": "5e0614ba", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "df_result.drop('LocationID', 'zone').write.parquet('tmp/revenue-zones')" ] }, { "cell_type": "code", "execution_count": null, "id": "9f5ca913", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: 06-batch/code/08_rdds.ipynb ================================================ { "cells": [ { "cell_type": "code", "execution_count": 1, "id": "d66f42fd", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: An illegal reflective access operation has occurred\n", "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", "WARNING: Please consider reporting this to the maintainers 
of org.apache.spark.unsafe.Platform\n", "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", "WARNING: All illegal access operations will be denied in a future release\n", "22/02/21 22:25:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "import pyspark\n", "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.builder \\\n", " .master(\"local[*]\") \\\n", " .appName('test') \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 2, "id": "646fc343", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 0:> (0 + 1) / 1]\r", "\r", " \r" ] } ], "source": [ "df_green = spark.read.parquet('data/pq/green/*/*')" ] }, { "cell_type": "markdown", "id": "196cccd5", "metadata": {}, "source": [ "```\n", "SELECT \n", " date_trunc('hour', lpep_pickup_datetime) AS hour, \n", " PULocationID AS zone,\n", "\n", " SUM(total_amount) AS amount,\n", " COUNT(1) AS number_records\n", "FROM\n", " green\n", "WHERE\n", " lpep_pickup_datetime >= '2020-01-01 00:00:00'\n", "GROUP BY\n", " 1, 2\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "id": "74fe52cb", "metadata": { "scrolled": true }, "outputs": [], "source": [ "rdd = df_green \\\n", " .select('lpep_pickup_datetime', 'PULocationID', 'total_amount') \\\n", " .rdd" ] }, { "cell_type": "code", "execution_count": 13, "id": "1a0bf382", "metadata": {}, "outputs": [], "source": [ "from datetime import datetime" ] }, { "cell_type": "code", "execution_count": 19, "id": "fa2b00f1", "metadata": {}, "outputs": [], "source": [ "start = datetime(year=2020, month=1, day=1)\n", "\n", "def 
filter_outliers(row):\n", " return row.lpep_pickup_datetime >= start" ] }, { "cell_type": "code", "execution_count": 21, "id": "69dd326d", "metadata": {}, "outputs": [], "source": [ "rows = rdd.take(10)\n", "row = rows[0]" ] }, { "cell_type": "code", "execution_count": 29, "id": "cd4b7006", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Row(lpep_pickup_datetime=datetime.datetime(2020, 1, 16, 19, 49, 27), PULocationID=260, total_amount=14.3)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "row" ] }, { "cell_type": "code", "execution_count": 31, "id": "d99eb089", "metadata": {}, "outputs": [], "source": [ "def prepare_for_grouping(row): \n", " hour = row.lpep_pickup_datetime.replace(minute=0, second=0, microsecond=0)\n", " zone = row.PULocationID\n", " key = (hour, zone)\n", " \n", " amount = row.total_amount\n", " count = 1\n", " value = (amount, count)\n", "\n", " return (key, value)" ] }, { "cell_type": "code", "execution_count": 34, "id": "cb328a44", "metadata": {}, "outputs": [], "source": [ "def calculate_revenue(left_value, right_value):\n", " left_amount, left_count = left_value\n", " right_amount, right_count = right_value\n", " \n", " output_amount = left_amount + right_amount\n", " output_count = left_count + right_count\n", " \n", " return (output_amount, output_count)" ] }, { "cell_type": "code", "execution_count": 39, "id": "2ea260f1", "metadata": {}, "outputs": [], "source": [ "from collections import namedtuple" ] }, { "cell_type": "code", "execution_count": 40, "id": "7dae6064", "metadata": {}, "outputs": [], "source": [ "RevenueRow = namedtuple('RevenueRow', ['hour', 'zone', 'revenue', 'count'])" ] }, { "cell_type": "code", "execution_count": 41, "id": "e0a98ee4", "metadata": {}, "outputs": [], "source": [ "def unwrap(row):\n", " return RevenueRow(\n", " hour=row[0][0], \n", " zone=row[0][1],\n", " revenue=row[1][0],\n", " count=row[1][1]\n", " )" ] }, { "cell_type": "code", "execution_count": 
45, "id": "a09200b8", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import types" ] }, { "cell_type": "code", "execution_count": 46, "id": "5c14d15e", "metadata": {}, "outputs": [], "source": [ "result_schema = types.StructType([\n", " types.StructField('hour', types.TimestampType(), True),\n", " types.StructField('zone', types.IntegerType(), True),\n", " types.StructField('revenue', types.DoubleType(), True),\n", " types.StructField('count', types.IntegerType(), True)\n", "])" ] }, { "cell_type": "code", "execution_count": 47, "id": "56ea72ff", "metadata": {}, "outputs": [], "source": [ "df_result = rdd \\\n", " .filter(filter_outliers) \\\n", " .map(prepare_for_grouping) \\\n", " .reduceByKey(calculate_revenue) \\\n", " .map(unwrap) \\\n", " .toDF(result_schema) " ] }, { "cell_type": "code", "execution_count": 50, "id": "4675bd3f", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "df_result.write.parquet('tmp/green-revenue')" ] }, { "cell_type": "code", "execution_count": 55, "id": "255b5503", "metadata": {}, "outputs": [], "source": [ "columns = ['VendorID', 'lpep_pickup_datetime', 'PULocationID', 'DOLocationID', 'trip_distance']\n", "\n", "duration_rdd = df_green \\\n", " .select(columns) \\\n", " .rdd" ] }, { "cell_type": "code", "execution_count": 67, "id": "645c3190", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 68, "id": "921e4ef9", "metadata": {}, "outputs": [], "source": [ "rows = duration_rdd.take(10)" ] }, { "cell_type": "code", "execution_count": 81, "id": "f50db3eb", "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(rows, columns=columns)" ] }, { "cell_type": "code", "execution_count": 74, "id": "5b8ecc53", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['VendorID',\n", " 'lpep_pickup_datetime',\n", " 'PULocationID',\n", " 'DOLocationID',\n", " 'trip_distance']" ] }, "execution_count": 74, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "columns" ] }, { "cell_type": "code", "execution_count": 76, "id": "6766c0f8", "metadata": {}, "outputs": [], "source": [ "#model = ...\n", "\n", "def model_predict(df):\n", "# y_pred = model.predict(df)\n", " y_pred = df.trip_distance * 5\n", " return y_pred" ] }, { "cell_type": "code", "execution_count": 98, "id": "7437b848", "metadata": {}, "outputs": [], "source": [ "def apply_model_in_batch(rows):\n", " df = pd.DataFrame(rows, columns=columns)\n", " predictions = model_predict(df)\n", " df['predicted_duration'] = predictions\n", "\n", " for row in df.itertuples():\n", " yield row" ] }, { "cell_type": "code", "execution_count": 102, "id": "580b5845", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "df_predicts = duration_rdd \\\n", " .mapPartitions(apply_model_in_batch)\\\n", " .toDF() \\\n", " .drop('Index')" ] }, { "cell_type": "code", "execution_count": 104, "id": "6055d543", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 48:> (0 + 1) / 1]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+------------------+\n", "|predicted_duration|\n", "+------------------+\n", "| 12.95|\n", "| 31.25|\n", "| 14.0|\n", "| 12.75|\n", "| 0.1|\n", "| 11.05|\n", "|11.299999999999999|\n", "|54.349999999999994|\n", "| 15.25|\n", "| 91.75|\n", "| 12.25|\n", "| 3.1|\n", "| 7.5|\n", "|11.899999999999999|\n", "| 78.89999999999999|\n", "| 4.45|\n", "| 23.2|\n", "| 4.85|\n", "| 6.65|\n", "| 15.1|\n", "+------------------+\n", "only showing top 20 rows\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", " \r" ] } ], "source": [ "df_predicts.select('predicted_duration').show()" ] }, { "cell_type": "code", "execution_count": null, "id": "9e91d243", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": 
"python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: 06-batch/code/09_spark_gcs.ipynb ================================================ { "cells": [ { "cell_type": "code", "execution_count": 1, "id": "3307b886", "metadata": {}, "outputs": [], "source": [ "import pyspark\n", "from pyspark.sql import SparkSession\n", "from pyspark.conf import SparkConf\n", "from pyspark.context import SparkContext" ] }, { "cell_type": "code", "execution_count": 2, "id": "9f0ddbff", "metadata": {}, "outputs": [], "source": [ "credentials_location = '/home/alexey/.google/credentials/google_credentials.json'\n", "\n", "conf = SparkConf() \\\n", " .setMaster('local[*]') \\\n", " .setAppName('test') \\\n", " .set(\"spark.jars\", \"./lib/gcs-connector-hadoop3-2.2.5.jar\") \\\n", " .set(\"spark.hadoop.google.cloud.auth.service.account.enable\", \"true\") \\\n", " .set(\"spark.hadoop.google.cloud.auth.service.account.json.keyfile\", credentials_location)" ] }, { "cell_type": "code", "execution_count": 3, "id": "b83404e8", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: An illegal reflective access operation has occurred\n", "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", "WARNING: All illegal access operations will be denied in a future release\n", "22/03/30 12:25:23 WARN 
NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "sc = SparkContext(conf=conf)\n", "\n", "hadoop_conf = sc._jsc.hadoopConfiguration()\n", "\n", "hadoop_conf.set(\"fs.AbstractFileSystem.gs.impl\", \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS\")\n", "hadoop_conf.set(\"fs.gs.impl\", \"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem\")\n", "hadoop_conf.set(\"fs.gs.auth.service.account.json.keyfile\", credentials_location)\n", "hadoop_conf.set(\"fs.gs.auth.service.account.enable\", \"true\")" ] }, { "cell_type": "code", "execution_count": 4, "id": "c4713e2b", "metadata": {}, "outputs": [], "source": [ "spark = SparkSession.builder \\\n", " .config(conf=sc.getConf()) \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 5, "id": "1ee1eb1d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "df_green = spark.read.parquet('gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/*/*')" ] }, { "cell_type": "code", "execution_count": 7, "id": "104b40ab", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/plain": [ "2304517" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_green.count()" ] }, { "cell_type": "code", "execution_count": null, "id": "f56a885d", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: 06-batch/code/cloud.md ================================================ ## Running Spark in the Cloud ### Connecting to Google Cloud Storage Uploading data to GCS: ```bash gsutil -m cp -r pq/ gs://dtc_data_lake_de-zoomcamp-nytaxi/pq ``` Download the jar for connecting to GCS to any location (e.g. the `lib` folder): **Note**: For other versions of GCS connector for Hadoop see [Cloud Storage connector ](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#connector-setup-on-non-dataproc-clusters). ```bash gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar ./lib/ ``` See the notebook with configuration in [09_spark_gcs.ipynb](09_spark_gcs.ipynb) (Thanks Alvin Do for the instructions!) ### Local Cluster and Spark-Submit Creating a stand-alone cluster ([docs](https://spark.apache.org/docs/latest/spark-standalone.html)): ```bash ./sbin/start-master.sh ``` Creating a worker: ```bash URL="spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077" ./sbin/start-slave.sh ${URL} # for newer versions of spark use that: #./sbin/start-worker.sh ${URL} ``` Turn the notebook into a script: ```bash jupyter nbconvert --to=script 06_spark_sql.ipynb ``` Edit the script and then run it: ```bash python 06_spark_sql.py \ --input_green=data/pq/green/2020/*/ \ --input_yellow=data/pq/yellow/2020/*/ \ --output=data/report-2020 ``` Use `spark-submit` for running the script on the cluster ```bash URL="spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077" spark-submit \ --master="${URL}" \ 06_spark_sql.py \ --input_green=data/pq/green/2021/*/ \ --input_yellow=data/pq/yellow/2021/*/ \ --output=data/report-2021 ``` ### Data Proc Upload the script to GCS: ```bash gsutil -m cp -r 06_spark_sql.py gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py ``` Params 
for the job:

* `--input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2021/*/`
* `--input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2021/*/`
* `--output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2021`

Using the Google Cloud SDK to submit the job to Dataproc ([link](https://cloud.google.com/dataproc/docs/guides/submit-job#dataproc-submit-job-gcloud)):

```bash
gcloud dataproc jobs submit pyspark \
    --cluster=de-zoomcamp-cluster \
    --region=europe-west6 \
    gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py \
    -- \
        --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \
        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \
        --output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2020
```

### Big Query

Upload the script to GCS:

```bash
gsutil -m cp -r 06_spark_sql_big_query.py gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql_big_query.py
```

Write results to BigQuery ([docs](https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#pyspark)):

```bash
gcloud dataproc jobs submit pyspark \
    --cluster=de-zoomcamp-cluster \
    --region=europe-west6 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql_big_query.py \
    -- \
        --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \
        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \
        --output=trips_data_all.reports-2020
```

The latest Spark version may not be compatible with the BigQuery connector. Download links to the jar file for each Spark version can be found at: [Spark and BigQuery connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector)

**Note**: Dataproc on GCE 2.1+ images come with the Spark BigQuery connector pre-installed (see [Dataproc Release 2.2](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2)), so there is no need to include the jar file in the job submission.
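The commands in this file all pass `--input_green`, `--input_yellow` and `--output` to the script after its name. Here is a minimal sketch of how a script could read them with `argparse`. The `parse_args` helper and the sample values are illustrative assumptions, not a copy of `06_spark_sql.py`:

```python
import argparse


def parse_args(argv=None):
    """Parse the three path arguments (hypothetical helper for illustration)."""
    parser = argparse.ArgumentParser(description="Report job arguments")
    parser.add_argument("--input_green", required=True)
    parser.add_argument("--input_yellow", required=True)
    parser.add_argument("--output", required=True)
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


# Explicit list instead of sys.argv so the sketch runs anywhere
args = parse_args([
    "--input_green=data/pq/green/2020/*/",
    "--input_yellow=data/pq/yellow/2020/*/",
    "--output=data/report-2020",
])
print(args.input_green, args.input_yellow, args.output)
```

Because the values are plain strings, the same parsing works whether `--output` is a local folder, a `gs://` path, or a BigQuery table name.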
================================================ FILE: 06-batch/code/download_data.sh ================================================ set -e TAXI_TYPE=$1 # "yellow" YEAR=$2 # 2020 URL_PREFIX="https://github.com/DataTalksClub/nyc-tlc-data/releases/download" for MONTH in {1..12}; do FMONTH=`printf "%02d" ${MONTH}` URL="${URL_PREFIX}/${TAXI_TYPE}/${TAXI_TYPE}_tripdata_${YEAR}-${FMONTH}.csv.gz" LOCAL_PREFIX="data/raw/${TAXI_TYPE}/${YEAR}/${FMONTH}" LOCAL_FILE="${TAXI_TYPE}_tripdata_${YEAR}_${FMONTH}.csv.gz" LOCAL_PATH="${LOCAL_PREFIX}/${LOCAL_FILE}" echo "downloading ${URL} to ${LOCAL_PATH}" mkdir -p ${LOCAL_PREFIX} wget ${URL} -O ${LOCAL_PATH} done ================================================ FILE: 06-batch/code/homework.ipynb ================================================ { "cells": [ { "cell_type": "code", "execution_count": 5, "id": "00bc6543", "metadata": {}, "outputs": [], "source": [ "import pyspark\n", "from pyspark.sql import SparkSession\n", "from pyspark.sql import types" ] }, { "cell_type": "code", "execution_count": 2, "id": "cd4a0f3d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: An illegal reflective access operation has occurred\n", "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", "WARNING: All illegal access operations will be denied in a future release\n", "22/03/07 21:55:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "spark = SparkSession.builder \\\n", " .master(\"local[*]\") \\\n", " .appName('test') \\\n", " .getOrCreate()" ] }, { "cell_type": "code", "execution_count": 3, "id": "eb3e4c36", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'3.0.3'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.version" ] }, { "cell_type": "code", "execution_count": 4, "id": "5236cebd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 alexey alexey 700M Oct 29 18:53 fhvhv_tripdata_2021-02.csv\r\n" ] } ], "source": [ "!ls -lh fhvhv_tripdata_2021-02.csv" ] }, { "cell_type": "code", "execution_count": 6, "id": "0a3399a3", "metadata": {}, "outputs": [], "source": [ "schema = types.StructType([\n", " types.StructField('hvfhs_license_num', types.StringType(), True),\n", " types.StructField('dispatching_base_num', types.StringType(), True),\n", " types.StructField('pickup_datetime', types.TimestampType(), True),\n", " types.StructField('dropoff_datetime', types.TimestampType(), True),\n", " types.StructField('PULocationID', types.IntegerType(), True),\n", " types.StructField('DOLocationID', types.IntegerType(), True),\n", " types.StructField('SR_Flag', types.StringType(), True)\n", "])" ] }, { "cell_type": "code", "execution_count": 8, "id": "68bc8b72", "metadata": {}, "outputs": [], "source": [ "df = spark.read \\\n", " .option(\"header\", \"true\") \\\n", " .schema(schema) \\\n", " .csv('fhvhv_tripdata_2021-02.csv')\n", "\n", "df = df.repartition(24)\n", "\n", "df.write.parquet('data/pq/fhvhv/2021/02/', compression=)" ] }, { "cell_type": "code", "execution_count": 16, "id": "58989b55", "metadata": {}, 
"outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 0:> (0 + 1) / 1]\r", "\r", " \r" ] } ], "source": [ "df = spark.read.parquet('data/pq/fhvhv/2021/02/')" ] }, { "cell_type": "markdown", "id": "48b01d2f", "metadata": {}, "source": [ "**Q3**: How many taxi trips were there on February 15?" ] }, { "cell_type": "code", "execution_count": 18, "id": "f7489aea", "metadata": {}, "outputs": [], "source": [ "from pyspark.sql import functions as F" ] }, { "cell_type": "code", "execution_count": 24, "id": "6c2500fd", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/plain": [ "367170" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df \\\n", " .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \\\n", " .filter(\"pickup_date = '2021-02-15'\") \\\n", " .count()" ] }, { "cell_type": "code", "execution_count": 25, "id": "dd7ae60d", "metadata": {}, "outputs": [], "source": [ "df.registerTempTable('fhvhv_2021_02')" ] }, { "cell_type": "code", "execution_count": 28, "id": "6d47c147", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 20:> (0 + 4) / 4]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+--------+\n", "|count(1)|\n", "+--------+\n", "| 367170|\n", "+--------+\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 20:==============> (1 + 3) / 4]\r", "\r", " \r" ] } ], "source": [ "spark.sql(\"\"\"\n", "SELECT\n", " COUNT(1)\n", "FROM \n", " fhvhv_2021_02\n", "WHERE\n", " to_date(pickup_datetime) = '2021-02-15';\n", "\"\"\").show()" ] }, { "cell_type": "markdown", "id": "ae3f533b", "metadata": {}, "source": [ "**Q4**: Longest trip for each day" ] }, { "cell_type": "code", "execution_count": 29, "id": "7befe422", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['hvfhs_license_num',\n", " 'dispatching_base_num',\n", " 
'pickup_datetime',\n", " 'dropoff_datetime',\n", " 'PULocationID',\n", " 'DOLocationID',\n", " 'SR_Flag']" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 36, "id": "279d9161", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 37:==============> (1 + 3) / 4]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+-----------+-------------+\n", "|pickup_date|max(duration)|\n", "+-----------+-------------+\n", "| 2021-02-11| 75540|\n", "| 2021-02-17| 57221|\n", "| 2021-02-20| 44039|\n", "| 2021-02-03| 40653|\n", "| 2021-02-19| 37577|\n", "+-----------+-------------+\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 38:==================================================> (187 + 4) / 200]\r", "\r", " \r" ] } ], "source": [ "df \\\n", " .withColumn('duration', df.dropoff_datetime.cast('long') - df.pickup_datetime.cast('long')) \\\n", " .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \\\n", " .groupBy('pickup_date') \\\n", " .max('duration') \\\n", " .orderBy('max(duration)', ascending=False) \\\n", " .limit(5) \\\n", " .show()" ] }, { "cell_type": "code", "execution_count": 38, "id": "74cf0e8b", "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 43:> (0 + 4) / 4]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+-----------+-----------------+\n", "|pickup_date| duration|\n", "+-----------+-----------------+\n", "| 2021-02-11| 1259.0|\n", "| 2021-02-17|953.6833333333333|\n", "| 2021-02-20|733.9833333333333|\n", "| 2021-02-03| 677.55|\n", "| 2021-02-19|626.2833333333333|\n", "| 2021-02-25| 583.5|\n", "| 2021-02-18|576.8666666666667|\n", "| 2021-02-10|569.4833333333333|\n", "| 2021-02-21| 537.05|\n", "| 2021-02-09|534.7833333333333|\n", "+-----------+-----------------+\n", "\n" ] }, { "name": 
"stderr", "output_type": "stream", "text": [ "\r", "[Stage 44:================================================> (180 + 4) / 200]\r", "\r", " \r" ] } ], "source": [ "spark.sql(\"\"\"\n", "SELECT\n", " to_date(pickup_datetime) AS pickup_date,\n", " MAX((CAST(dropoff_datetime AS LONG) - CAST(pickup_datetime AS LONG)) / 60) AS duration\n", "FROM \n", " fhvhv_2021_02\n", "GROUP BY\n", " 1\n", "ORDER BY\n", " 2 DESC\n", "LIMIT 10;\n", "\"\"\").show()" ] }, { "cell_type": "markdown", "id": "d915096b", "metadata": {}, "source": [ "**Q5**: Most frequent `dispatching_base_num`\n", "\n", "How many stages this spark job has?\n", "\n" ] }, { "cell_type": "code", "execution_count": 44, "id": "25816aa2", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 73:> (0 + 4) / 4]\r", "\r", "[Stage 73:==============> (1 + 3) / 4]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+--------+\n", "|dispatching_base_num|count(1)|\n", "+--------------------+--------+\n", "| B02510| 3233664|\n", "| B02764| 965568|\n", "| B02872| 882689|\n", "| B02875| 685390|\n", "| B02765| 559768|\n", "+--------------------+--------+\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 74:===================================================> (189 + 5) / 200]\r", "\r", " \r" ] } ], "source": [ "spark.sql(\"\"\"\n", "SELECT\n", " dispatching_base_num,\n", " COUNT(1)\n", "FROM \n", " fhvhv_2021_02\n", "GROUP BY\n", " 1\n", "ORDER BY\n", " 2 DESC\n", "LIMIT 5;\n", "\"\"\").show()" ] }, { "cell_type": "code", "execution_count": 46, "id": "a78f9fe3", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 86:> (0 + 4) / 4]\r", "\r", "[Stage 86:=============================> (2 + 2) / 4]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+-------+\n", "|dispatching_base_num| count|\n", "+--------------------+-------+\n", "| 
B02510|3233664|\n", "| B02764| 965568|\n", "| B02872| 882689|\n", "| B02875| 685390|\n", "| B02765| 559768|\n", "+--------------------+-------+\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 87:===========================================> (161 + 5) / 200]\r", "\r", " \r" ] } ], "source": [ "df \\\n", " .groupBy('dispatching_base_num') \\\n", " .count() \\\n", " .orderBy('count', ascending=False) \\\n", " .limit(5) \\\n", " .show()" ] }, { "cell_type": "markdown", "id": "0d10173a", "metadata": {}, "source": [ "**Q6**: Most common locations pair" ] }, { "cell_type": "code", "execution_count": 47, "id": "74b7f664", "metadata": {}, "outputs": [], "source": [ "df_zones = spark.read.parquet('zones')" ] }, { "cell_type": "code", "execution_count": 49, "id": "81642d3b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['LocationID', 'Borough', 'Zone', 'service_zone']" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_zones.columns" ] }, { "cell_type": "code", "execution_count": 51, "id": "4f460dda", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['hvfhs_license_num',\n", " 'dispatching_base_num',\n", " 'pickup_datetime',\n", " 'dropoff_datetime',\n", " 'PULocationID',\n", " 'DOLocationID',\n", " 'SR_Flag']" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 50, "id": "ad8f0101", "metadata": {}, "outputs": [], "source": [ "df_zones.registerTempTable('zones')" ] }, { "cell_type": "code", "execution_count": 57, "id": "6f738414", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Stage 103:==============================================> (176 + 4) / 200]\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+--------+\n", "| pu_do_pair|count(1)|\n", "+--------------------+--------+\n", "|East New York / E...| 
45041|\n", "|Borough Park / Bo...| 37329|\n", "| Canarsie / Canarsie| 28026|\n", "|Crown Heights Nor...| 25976|\n", "|Bay Ridge / Bay R...| 17934|\n", "+--------------------+--------+\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\r", " \r" ] } ], "source": [ "spark.sql(\"\"\"\n", "SELECT\n", " CONCAT(pul.Zone, ' / ', dol.Zone) AS pu_do_pair,\n", " COUNT(1)\n", "FROM \n", " fhvhv_2021_02 fhv LEFT JOIN zones pul ON fhv.PULocationID = pul.LocationID\n", " LEFT JOIN zones dol ON fhv.DOLocationID = dol.LocationID\n", "GROUP BY \n", " 1\n", "ORDER BY\n", " 2 DESC\n", "LIMIT 5;\n", "\"\"\").show()" ] }, { "cell_type": "code", "execution_count": null, "id": "e4b754d1", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: 06-batch/setup/config/core-site.xml ================================================ fs.AbstractFileSystem.gs.impl com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS fs.gs.impl com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem fs.gs.auth.service.account.json.keyfile /home/alexey/.google/credentials/google_credentials.json fs.gs.auth.service.account.enable true ================================================ FILE: 06-batch/setup/config/spark-defaults.conf ================================================ spark-master yarn spark.hadoop.google.cloud.auth.service.account.enable true spark.hadoop.google.cloud.auth.service.account.json.keyfile /home/alexey ================================================ FILE: 06-batch/setup/config/spark.dockerfile ================================================ FROM 
library/openjdk:11

================================================
FILE: 06-batch/setup/hadoop-yarn.md
================================================
## Spark on YARN

For the Spark and Docker module, we need YARN, which comes together with Hadoop. So we need to install Hadoop.

In this document, we'll assume you use Linux. For Windows, use WSL. It should work (supposedly) on macOS as well.

We'll need to run it in a pseudo-distributed mode.

### Configuring ssh

You need to be able to `ssh` to your localhost without typing a password.

In other words, you execute

```bash
ssh localhost
```

And you get ssh access.

If you don't have it, add your `id_rsa.pub` key to the list of keys authorized to access your computer:

```bash
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
```

(This assumes you already have `id_rsa.pub` in `~/.ssh`)

On WSL, you may need to start the ssh service:

```bash
sudo service ssh start
```

### Download Hadoop binaries

Our Spark build expects Hadoop 3.2, so that is the version we'll install.

Go to [Hadoop's website](https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz) to get the closest mirror.

And then download it:

```bash
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
```

Unpack it and go to the extracted directory:

```bash
tar xzfv hadoop-3.2.3.tar.gz
cd hadoop-3.2.3/
```

### YARN on a Single Node

Set `JAVA_HOME` in `etc/hadoop/hadoop-env.sh`:

```bash
echo "export JAVA_HOME=${JAVA_HOME}" >> etc/hadoop/hadoop-env.sh
```

Start YARN:

```bash
./sbin/start-yarn.sh
```

The YARN web UI should now be available on port 8088: http://localhost:8088/

### Running Spark on YARN

For submitting Spark jobs, we'll need to use `master="yarn"`.

Spark needs to know where to look for YARN config files, so we need to set it:

```bash
export HADOOP_HOME="${HOME}/spark/hadoop-3.2.3"
export YARN_CONF_DIR="${HADOOP_HOME}/etc/hadoop"
```

Then run Jupyter or use spark-submit.
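Before starting Jupyter or `spark-submit`, it can help to confirm that the config directory contains the files Spark will look for. The following is a minimal sketch, assuming a POSIX-style path. Both helper names are hypothetical, and `core-site.xml` and `yarn-site.xml` are simply the standard Hadoop file names checked by default:

```python
from pathlib import Path


def yarn_env(hadoop_home):
    """Build the two variables exported above from a Hadoop install path."""
    hadoop_home = str(hadoop_home).rstrip("/")
    return {
        "HADOOP_HOME": hadoop_home,
        "YARN_CONF_DIR": f"{hadoop_home}/etc/hadoop",
    }


def missing_config_files(conf_dir, names=("core-site.xml", "yarn-site.xml")):
    """Return the expected config file names that are absent from conf_dir."""
    conf_dir = Path(conf_dir)
    return [name for name in names if not (conf_dir / name).is_file()]


env = yarn_env("/home/user/spark/hadoop-3.2.3")  # illustrative path
print(env["YARN_CONF_DIR"])
print(missing_config_files(env["YARN_CONF_DIR"]))
```

An empty list from `missing_config_files` means both files were found. Anything else names the files to check before submitting a job.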
### Connecting Spark and YARN to GCS Download the GCS connector: ```bash gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar . ``` Config changes: * Change `${SPARK_HOME}/conf/spark-defaults.conf` (see [here]()) * Change `${YARN_CONF_DIR}/core-site.xml` (see [here](config/core-site.xml)) Template for hadoop properties: ```xml ``` ### Spark and YARN with Docker Copy the config from [here](https://hadoop.apache.org/docs/r3.2.3/hadoop-yarn/hadoop-yarn-site/DockerContainers.html) Running spark-submit: ```bash MOUNTS="$HADOOP_HOME:$HADOOP_HOME:ro,/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro" IMAGE_ID="pyspark-docker:test" spark-submit \ --master yarn \ --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \ --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE_ID} \ --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \ --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE_ID} \ 06_spark_sql.py \ --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2021/*/ \ --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2021/*/ \ --output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2021 ``` ### Sources * https://hadoop.apache.org/docs/r3.2.3/hadoop-project-dist/hadoop-common/SingleCluster.html * https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration ================================================ FILE: 06-batch/setup/linux.md ================================================ ## Linux Here we'll show you how to install Spark 4.x for Linux. We tested it on Ubuntu 24.04 (also WSL), but it should work for other Linux distros as well ### Installing Java Spark 4.x requires Java 17 or 21. 
The simplest way is to install it via your package manager: ```bash sudo apt update sudo apt install default-jdk ``` Check that it works: ```bash java --version ``` Output (example): ``` openjdk 21.0.10 2026-01-20 OpenJDK Runtime Environment (build 21.0.10+7-Ubuntu-124.04) OpenJDK 64-Bit Server VM (build 21.0.10+7-Ubuntu-124.04, mixed mode, sharing) ``` Set `JAVA_HOME` (add to your `.bashrc` or `.zshrc`): ```bash export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java)))) export PATH="${JAVA_HOME}/bin:${PATH}" ``` ### PySpark We recommend using [uv](https://docs.astral.sh/uv/) for managing Python packages: ```bash uv init uv add pyspark ``` Then run your scripts with `uv run`: ```bash uv run python your_script.py ``` Alternatively, you can use pip: ```bash pip install pyspark ``` Both approaches install PySpark along with a bundled Spark distribution - no separate Spark download needed. ### Testing it Create a test script `test_spark.py`: ```python import pyspark from pyspark.sql import SparkSession spark = SparkSession.builder \ .master("local[*]") \ .appName('test') \ .getOrCreate() print(f"Spark version: {spark.version}") df = spark.range(10) df.show() spark.stop() ``` Run it: ```bash uv run python test_spark.py ``` ================================================ FILE: 06-batch/setup/macos.md ================================================ ## MacOS Here we'll show you how to install Spark 4.x for macOS. We tested it on macOS 15 (Sequoia), but it should work for other versions as well. ### Installing Java Spark 4.x requires Java 17. 
Ensure [Homebrew](https://brew.sh/) is installed, then install OpenJDK 17: ```bash brew install openjdk@17 ``` Add the following environment variables to your `.zshrc` (or `.bash_profile`): ```bash export JAVA_HOME=$(brew --prefix openjdk@17) export PATH="$JAVA_HOME/bin:$PATH" ``` Check that Java works correctly: ```bash java --version ``` Output (example): ``` openjdk 17.0.14 2026-01-21 OpenJDK Runtime Environment Homebrew (build 17.0.14+0) OpenJDK 64-Bit Server VM Homebrew (build 17.0.14+0, mixed mode, sharing) ``` ### PySpark We recommend using [uv](https://docs.astral.sh/uv/) for managing Python packages: ```bash uv init uv add pyspark ``` Then run your scripts with `uv run`: ```bash uv run python your_script.py ``` Alternatively, you can use pip: ```bash pip install pyspark ``` Both approaches install PySpark along with a bundled Spark distribution — no separate Spark download needed. > If you previously installed Spark 3.x and have `SPARK_HOME` set in your `.zshrc` or `.bash_profile` (e.g. pointing to a local Spark directory), remove that line. PySpark 4.x bundles its own Spark, so `SPARK_HOME` is no longer needed. If the old `SPARK_HOME` is still set, PySpark 4.x will load the old JARs and fail. ### Testing it Create a test script `test_spark.py`: ```python import pyspark from pyspark.sql import SparkSession spark = SparkSession.builder \ .master("local[*]") \ .appName('test') \ .getOrCreate() print(f"Spark version: {spark.version}") df = spark.range(10) df.show() spark.stop() ``` Run it: ```bash uv run python test_spark.py ``` You may see a warning like `WARNING: Using incubator modules: jdk.incubator.vector` — you can safely ignore it. ================================================ FILE: 06-batch/setup/windows.md ================================================ ## Windows Here we'll show you how to install Spark 4.x for Windows. We tested it on Windows 10 and 11, but it should work for other versions as well. 
In this tutorial, we'll use [MINGW](https://www.mingw-w64.org/)/[Git Bash](https://gitforwindows.org/) for the command line. If you use WSL, follow the instructions from [linux.md](linux.md). ### Installing Java Spark 4.x requires Java 17. Download and unpack the Adoptium JDK 17: ```bash wget https://github.com/adoptium/temurin17-binaries/releases/download/jdk-17.0.18%2B8/OpenJDK17U-jdk_x64_windows_hotspot_17.0.18_8.zip unzip OpenJDK17U-jdk_x64_windows_hotspot_17.0.18_8.zip -d /c/tools/ ``` The full path to JDK will be `/c/tools/jdk-17.0.18+8`. Now let's configure it and add it to `PATH` (add to your `.bashrc`): ```bash export JAVA_HOME="/c/tools/jdk-17.0.18+8" export PATH="${JAVA_HOME}/bin:${PATH}" ``` Check that Java works correctly: ```bash java --version ``` Output: ``` openjdk 17.0.18 2026-01-20 LTS OpenJDK Runtime Environment Temurin-17.0.18+8 (build 17.0.18+8-LTS) OpenJDK 64-Bit Server VM Temurin-17.0.18+8 (build 17.0.18+8-LTS, mixed mode, sharing) ``` ### PySpark We recommend using [uv](https://docs.astral.sh/uv/) for managing Python packages: ```bash uv init uv add pyspark ``` Then run your scripts with `uv run`: ```bash uv run python your_script.py ``` Alternatively, you can use pip: ```bash pip install pyspark ``` Both approaches install PySpark along with a bundled Spark distribution — no separate Spark or Hadoop download needed. > If you previously installed Spark 3.x and have `SPARK_HOME` set in your `.bashrc` (e.g. pointing to `C:/tools/spark-3.3.2-bin-hadoop3`), remove that line. PySpark 4.x bundles its own Spark, so `SPARK_HOME` is no longer needed. If the old `SPARK_HOME` is still set, PySpark 4.x will load the old JARs and fail. 
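To check quickly whether an old `SPARK_HOME` is still present in the current environment, a small script like the one below can be used. This is a sketch only, and the `stale_spark_home` helper is a hypothetical name introduced for illustration:

```python
import os


def stale_spark_home(env=None):
    """Return the SPARK_HOME value if it is set and non-empty, else None."""
    env = os.environ if env is None else env
    return env.get("SPARK_HOME") or None


leftover = stale_spark_home()
if leftover:
    print(f"SPARK_HOME is still set to {leftover}; remove it from your shell profile")
else:
    print("SPARK_HOME is not set")
```

Run it from the same shell you use for `uv run`, since that is the environment PySpark will inherit.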
### Testing it Create a test script `test_spark.py`: ```python import pyspark from pyspark.sql import SparkSession spark = SparkSession.builder \ .master("local[*]") \ .appName('test') \ .getOrCreate() print(f"Spark version: {spark.version}") df = spark.range(10) df.show() spark.stop() ``` Run it: ```bash uv run python test_spark.py ``` At this point you may get a message from Windows Firewall — allow it. You may see a warning like `WARNING: Using incubator modules: jdk.incubator.vector` — you can safely ignore it. ================================================ FILE: 07-streaming/.gitignore ================================================ week6_venv ================================================ FILE: 07-streaming/README.md ================================================ # Module 7: Stream Processing Video: https://www.youtube.com/live/YDUgFeHQzJU - [PyFlink workshop](workshop/) - build a real-time streaming pipeline step by step (Redpanda, Python, Flink, PostgreSQL) - [Homework](../cohorts/2026/07-streaming/homework.md) - [Kafka theory](theory/) - video lectures on Kafka concepts with Java code examples (optional) - [Extras](extras/) - supplementary Python and PyFlink examples from previous years (optional) ## Community notes
Did you take notes? You can share them here:

* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/6_streaming.md)
* [Marcos Torregrosa's blog (Spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-6-stream-processing/)
* [Notes by Oscar Garcia](https://github.com/ozkary/Data-Engineering-Bootcamp/tree/main/Step6-Streaming)
* [Notes by Shayan Shafiee Moghadam](https://github.com/shayansm2/eng-notebook/blob/main/kafka/readme.md)
* Add your notes here (above this line)
================================================ FILE: 07-streaming/extras/README.md ================================================ # Supplementary streaming examples Additional stream processing examples from previous course years. These are not part of the main workshop but may be useful as reference material. ## python/ Python Kafka examples by Irem Erturk, using various libraries. - [json_example/](python/json_example) - producer and consumer using `kafka-python` with JSON serialization - [avro_example/](python/avro_example) - producer and consumer using `confluent-kafka` with Avro serialization and Schema Registry - [redpanda_example/](python/redpanda_example) - same as the JSON example but running against Redpanda instead of Kafka, with a local docker-compose setup - [streams-example/faust/](python/streams-example/faust) - stream processing with [Faust](https://faust-streaming.github.io/faust/), a Python library for Kafka Streams. Includes windowing, branching, and counting examples. - [streams-example/pyspark/](python/streams-example/pyspark) - Spark Structured Streaming consuming from Kafka, with a Jupyter notebook - [streams-example/redpanda/](python/streams-example/redpanda) - same as the PySpark example but using Redpanda as the broker - [docker/](python/docker) - Docker Compose files for running Kafka and Spark clusters locally - [resources/](python/resources) - sample data (rides.csv) and Avro schemas ## pyflink/ PyFlink workshop by Irem Erturk. Uses Apache Flink 1.x with a Makefile-based workflow, PostgreSQL sink, and Docker Compose setup. The [2025 stream with Zach Wilson](https://www.youtube.com/watch?v=P2loELMUUeI) was rewritten into the current [2026 workshop](../workshop/) by Alexey, using Flink 2.2, uv, and a step-by-step README. ## ksqldb/ [commands.md](ksqldb/commands.md) - example ksqlDB queries for creating streams, filtering, grouping, and windowed aggregations over Kafka topics. 
Companion to the [ksqlDB and Connect video](../theory/#kafka-streams) in the theory section. ================================================ FILE: 07-streaming/extras/ksqldb/commands.md ================================================ ## KSQL DB Examples ### Create streams ```sql CREATE STREAM ride_streams ( VendorId varchar, trip_distance double, payment_type varchar ) WITH (KAFKA_TOPIC='rides', VALUE_FORMAT='JSON'); ``` ### Query stream ```sql select * from RIDE_STREAMS EMIT CHANGES; ``` ### Query stream count ```sql SELECT VENDORID, count(*) FROM RIDE_STREAMS GROUP BY VENDORID EMIT CHANGES; ``` ### Query stream with filters ```sql SELECT payment_type, count(*) FROM RIDE_STREAMS WHERE payment_type IN ('1', '2') GROUP BY payment_type EMIT CHANGES; ``` ### Query stream with window functions ```sql CREATE TABLE payment_type_sessions AS SELECT payment_type, count(*) FROM RIDE_STREAMS WINDOW SESSION (60 SECONDS) GROUP BY payment_type EMIT CHANGES; ``` ## KSQL documentation for details [KSQL DB Documentation](https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/quick-reference/) [KSQL DB Java client](https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-clients/java-client/) ================================================ FILE: 07-streaming/extras/pyflink/.gitignore ================================================ data/ postgres-data # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ wheels/ pip-wheel-metadata/ share/python-wheels/ *.egg-info/ .installed.cfg *.egg MANIFEST # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. 
*.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .nox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover *.py,cover .hypothesis/ .pytest_cache/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 db.sqlite3-journal # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder target/ # Jupyter Notebook .ipynb_checkpoints # IPython profile_default/ ipython_config.py # pyenv .python-version # pipenv # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. # However, in case of collaboration, if having platform-specific dependencies or dependencies # having no cross-platform support, pipenv may install dependencies that don't work, or not # install all needed dependencies. #Pipfile.lock # PEP 582; used by e.g. github.com/David-OConnor/pyflow __pypackages__/ # Celery stuff celerybeat-schedule celerybeat.pid # SageMath parsed files *.sage.py # Environments .env .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ .dmypy.json dmypy.json # Pyre type checker .pyre/ dump.sql # Personal workspace files .idea/* .vscode/* ================================================ FILE: 07-streaming/extras/pyflink/Dockerfile.flink ================================================ FROM --platform=linux/amd64 flink:1.16.0-scala_2.12-java8 # install python3: it has updated Python to 3.9 in Debian 11 and so install Python 3.7 from source # it currently only supports Python 3.6, 3.7 and 3.8 in PyFlink officially. 
# ref: https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker RUN apt-get update -y && \ apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev liblzma-dev && \ wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz && \ tar -xvf Python-3.7.9.tgz && \ cd Python-3.7.9 && \ ./configure --without-tests --enable-shared && \ make -j6 && \ make install && \ ldconfig /usr/local/lib && \ cd .. && rm -f Python-3.7.9.tgz && rm -rf Python-3.7.9 && \ ln -s /usr/local/bin/python3 /usr/local/bin/python && \ apt-get clean && \ rm -rf /var/lib/apt/lists/* # install PyFlink COPY requirements.txt . RUN python -m pip install --upgrade pip; \ pip3 install --upgrade google-api-python-client; \ pip3 install -r requirements.txt --no-cache-dir; # Download connector libraries RUN wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/1.16.0/flink-json-1.16.0.jar; \ wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka/1.16.0/flink-sql-connector-kafka-1.16.0.jar; \ wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc/1.16.0/flink-connector-jdbc-1.16.0.jar; \ wget -P /opt/flink/lib/ https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.24/postgresql-42.2.24.jar; RUN echo "taskmanager.memory.jvm-metaspace.size: 512m" >> /opt/flink/conf/flink-conf.yaml; WORKDIR /opt/flink ================================================ FILE: 07-streaming/extras/pyflink/LICENSE ================================================ MIT License Copyright (c) 2025 Sreela Das, Julie Scherer, Zach Wilson Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, 
distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: 07-streaming/extras/pyflink/Makefile ================================================ PLATFORM ?= linux/amd64 # COLORS GREEN := $(shell tput -Txterm setaf 2) YELLOW := $(shell tput -Txterm setaf 3) WHITE := $(shell tput -Txterm setaf 7) RESET := $(shell tput -Txterm sgr0) TARGET_MAX_CHAR_NUM=20 ## Show help with `make help` help: @echo '' @echo 'Usage:' @echo ' ${YELLOW}make${RESET} ${GREEN}${RESET}' @echo '' @echo 'Targets:' @awk '/^[a-zA-Z\-\_0-9]+:/ { \ helpMessage = match(lastLine, /^## (.*)/); \ if (helpMessage) { \ helpCommand = substr($$1, 0, index($$1, ":")-1); \ helpMessage = substr(lastLine, RSTART + 3, RLENGTH); \ printf " ${YELLOW}%-$(TARGET_MAX_CHAR_NUM)s${RESET} ${GREEN}%s${RESET}\n", helpCommand, helpMessage; \ } \ } \ { lastLine = $$0 }' $(MAKEFILE_LIST) .PHONY: build ## Builds the Flink base image with pyFlink and connectors installed build: docker build . 

.PHONY: up
## Builds the base Docker image and starts Flink cluster
up:
	docker compose up --build --remove-orphans -d

.PHONY: down
## Shuts down the Flink cluster
down:
	docker compose down --remove-orphans

.PHONY: job
## Submit the Flink job
job:
	docker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d

.PHONY: aggregation_job
## Submit the Flink aggregation job
aggregation_job:
	docker compose exec jobmanager ./bin/flink run -py /opt/src/job/aggregation_job.py --pyFiles /opt/src -d

.PHONY: stop
## Stops all services in Docker compose
stop:
	docker compose stop

.PHONY: start
## Starts all services in Docker compose
start:
	docker compose start


================================================
FILE: 07-streaming/extras/pyflink/README.md
================================================
# Apache Flink Training
Apache Flink Streaming Pipelines

## :pushpin: Getting started

### :whale: Installations

To run this repo, install the following components:

1. [Docker](https://docs.docker.com/get-docker/) (required)
2. [Docker compose](https://docs.docker.com/compose/install/#installation-scenarios) (required)
3. Make (recommended) -- see below
    - On most Linux distributions and macOS, `make` is typically pre-installed. To check, run `make --version` in your terminal or command prompt. If it's installed, it will display the version information.
    - Otherwise, you can try following the instructions below, or you can copy and paste the commands from the `Makefile` into your terminal or command prompt and run them manually.

        ```bash
        # On Ubuntu or Debian:
        sudo apt-get update
        sudo apt-get install build-essential

        # On CentOS or Fedora:
        sudo dnf install make

        # On macOS:
        xcode-select --install

        # On Windows:
        choco install make # uses Chocolatey, https://chocolatey.org/install
        ```

### :computer: Local setup

Make sure you're in the `pyflink` folder:

```bash
cd 07-streaming/extras/pyflink
```

## :boom: Running the pipeline

1. Build the Docker image and deploy the services in the `docker-compose.yml` file, including the PostgreSQL database and Flink cluster. This should also create the sink table, `processed_events`, that Flink writes the Kafka messages to. If the table is missing, create it manually with the `CREATE TABLE` statement from [homework.md](homework.md).

    ```bash
    make up

    #// if you don't have make, you can run:
    # docker compose up --build --remove-orphans -d
    ```

    **:star: Wait until the Flink UI is running at [http://localhost:8081/](http://localhost:8081/) before proceeding to the next step.** _Note the first time you build the Docker image it can take anywhere from 5 to 30 minutes. Future builds should only take a few seconds, assuming you haven't deleted the image since._

    :information_source: After the image is built, Docker will automatically start up the job manager and task manager services. This will take a minute or so. Check the container logs in Docker Desktop and when you see the line below, you know you're good to move on to the next step.

    ```
    taskmanager Successful registration at resource manager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_* under registration id
    ```

2. Now that the Flink cluster is up and running, it's time to finally run the PyFlink job! :smile:

    ```bash
    make job

    #// if you don't have make, you can run:
    # docker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d
    ```

    After about a minute, you should see a prompt that the job's been submitted (e.g., `Job has been submitted with JobID `). Now go back to the [Flink UI](http://localhost:8081/#/job/running) to see the job running! :tada:

3. When you're done, you can stop and/or clean up the Docker resources by running the commands below.

    ```bash
    make stop # to stop running services in docker compose
    make down # to stop and remove docker compose services
    make clean # to remove the docker container and dangling images
    ```

    :grey_exclamation: The `/var/lib/postgresql/data` directory inside the PostgreSQL container is meant to be mounted to the `./postgres-data` directory on your local machine, so that data persists across container restarts or removals. The `docker-compose.yml` in this folder does not declare that mount, so add a matching `volumes` entry to the `postgres` service if you need the data to survive removing the container.

------

:information_source: To see all the available make commands and what they do, run:

```bash
make help
```

As of the time of writing this, the available commands are listed below. The `Makefile` currently in this folder defines only `help`, `build`, `up`, `down`, `job`, `aggregation_job`, `stop`, and `start`, so the other targets (including `make clean`) will not work unless you add them:

```bash
Usage:
  make

Targets:
  help                 Show help with `make help`
  db-init              Builds and runs the PostgreSQL database service
  build                Builds the Flink base image with pyFlink and connectors installed
  up                   Builds the base Docker image and starts Flink cluster
  down                 Shuts down the Flink cluster
  job                  Submit the Flink job
  stop                 Stops all services in Docker compose
  start                Starts all services in Docker compose
  clean                Stops and removes the Docker container as well as images with tag ``
  psql                 Runs psql to query containerized postgreSQL database in CLI
  postgres-die-mac     Removes mounted postgres data dir on local machine (mac users) and in Docker
  postgres-die-pc      Removes mounted postgres data dir on local machine (PC users) and in Docker
```


================================================
FILE: 07-streaming/extras/pyflink/docker-compose.yml
================================================
version: "3.9"
services:
  redpanda-1:
    image: redpandadata/redpanda:v24.2.18
    container_name: redpanda-1
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda-1:33145
    ports:
      # - 8081:8081
      - 8082:8082
      - 9092:9092
      - 28082:28082
      - 29092:29092

  jobmanager:
    build:
      context: .
      dockerfile: ./Dockerfile.flink
    image: pyflink:1.16.0
    container_name: "flink-jobmanager"
    pull_policy: never
    platform: "linux/amd64"
    hostname: "jobmanager"
    expose:
      - "6123"
    ports:
      - "8081:8081"
    volumes:
      - ./:/opt/flink/usrlib
      - ./keys/:/var/private/ssl/
      - ./src/:/opt/src
    command: jobmanager
    extra_hosts:
      - "host.docker.internal:127.0.0.1" #// Linux
      - "host.docker.internal:host-gateway" #// Access services on the host machine from within the Docker container
    environment:
      - POSTGRES_URL=${POSTGRES_URL:-jdbc:postgresql://host.docker.internal:5432/postgres}
      - POSTGRES_USER=${POSTGRES_USER:-postgres}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-postgres}
      - POSTGRES_DB=${POSTGRES_DB:-postgres}
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager

  # Flink task manager
  taskmanager:
    image: pyflink:1.16.0
    container_name: "flink-taskmanager"
    pull_policy: never
    platform: "linux/amd64"
    expose:
      - "6121"
      - "6122"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    depends_on:
      - jobmanager
    command: taskmanager --taskmanager.registration.timeout 5 min
    extra_hosts:
      - "host.docker.internal:127.0.0.1" #// Linux
      - "host.docker.internal:host-gateway" #// Access services on the host machine from within the Docker container
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 15
        parallelism.default: 3

  postgres:
    image: postgres:14
    restart: on-failure
    container_name: "postgres"
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - "5432:5432"
    extra_hosts:
      - "host.docker.internal:127.0.0.1" #// Linux
      - "host.docker.internal:host-gateway" #// Access services on the host machine from within the Docker container


================================================
FILE: 07-streaming/extras/pyflink/homework.md
================================================
# Homework

For this homework we will be using the Taxi data:

- Green 2019-10 data from [here](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz)

## Start Redpanda, Flink Job Manager, Flink Task Manager, and Postgres

There's a `docker-compose.yml` file in the homework folder (taken from [here](https://github.com/redpanda-data-blog/2023-python-gsg/blob/main/docker-compose.yml)).

Copy this file to your homework directory and run

```bash
docker-compose up
```

(Add `-d` if you want to run in detached mode)

Visit `localhost:8081` to see the Flink Job Manager.

Connect to Postgres with [DBeaver](https://dbeaver.io/).

The connection credentials are:

- Username `postgres`
- Password `postgres`
- Database `postgres`
- Host `localhost`
- Port `5432`

In DBeaver, run this query to create the Postgres landing zone for the first events:

```sql
CREATE TABLE processed_events (
    test_data INTEGER,
    event_timestamp TIMESTAMP
)
```

## Question 1: Connecting to the Kafka server

We need to make sure we can connect to the server, so later we can send some data to its topics.

First, let's install the Kafka client library (up to you if you want to have a separate virtual environment for that):

```bash
pip install kafka-python
```

You can start a Jupyter notebook in your solution folder or create a script.

Let's try to connect to our server:

```python
import json
import time

from kafka import KafkaProducer

def json_serializer(data):
    return json.dumps(data).encode('utf-8')

server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=json_serializer
)

producer.bootstrap_connected()
```

## Question 3: Sending the Trip Data

* Read the green csv.gz file
* We will only need these columns:
  * `'lpep_pickup_datetime',`
  * `'lpep_dropoff_datetime',`
  * `'PULocationID',`
  * `'DOLocationID',`
  * `'passenger_count',`
  * `'trip_distance',`
  * `'tip_amount'`
* Create a topic `green-trips` and send the data there with `load_taxi_data.py`. Note that the script as shipped sends to the `green-data` topic, so update the topic name in the script first
* How much time in seconds did it take? (You can round it to a whole number)
* Make sure you don't include sleeps in your code

## Question 4: Build a Sessionization Window

* Copy `aggregation_job.py` and rename it to `session_job.py`
* Have it read from `green-trips`, fixing the schema
* Use a [session window](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/) with a gap of 5 minutes
* Use `lpep_dropoff_datetime` as your watermark with a 5 second tolerance
* Which pickup and drop off locations have the longest unbroken streak of taxi trips?
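
To build intuition for the session window in Question 4: a session groups consecutive events that are no more than one gap apart, and a new session starts once the gap is exceeded. The sketch below illustrates that rule in plain Python, not PyFlink. The function name `sessionize` and the sample timestamps are made up for illustration, and the sketch ignores watermarks, late events, and keying by location, all of which the Flink job has to handle.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, gap=timedelta(minutes=5)):
    """Group timestamps into sessions.

    Events are sorted first; an event joins the current session when it
    arrives within `gap` of the previous event, otherwise it opens a new one.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


if __name__ == "__main__":
    base = datetime(2019, 10, 1, 12, 0)
    # Three drop-offs close together, then one after a long pause.
    events = [
        base,
        base + timedelta(minutes=2),
        base + timedelta(minutes=4),
        base + timedelta(minutes=20),
    ]
    for session in sessionize(events):
        print(len(session), "trips from", session[0], "to", session[-1])
```

The "longest unbroken streak" in the question corresponds to the session with the most trips for a given pickup and drop-off pair.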
================================================ FILE: 07-streaming/extras/pyflink/requirements.txt ================================================ apache-flink==1.16.0 psycopg2-binary==2.9.1 requests kafka-python ================================================ FILE: 07-streaming/extras/pyflink/src/job/aggregation_job.py ================================================ from pyflink.datastream import StreamExecutionEnvironment from pyflink.table import EnvironmentSettings, DataTypes, TableEnvironment, StreamTableEnvironment from pyflink.common.watermark_strategy import WatermarkStrategy from pyflink.common.time import Duration def create_events_aggregated_sink(t_env): table_name = 'processed_events_aggregated' sink_ddl = f""" CREATE TABLE {table_name} ( event_hour TIMESTAMP(3), test_data INT, num_hits BIGINT, PRIMARY KEY (event_hour, test_data) NOT ENFORCED ) WITH ( 'connector' = 'jdbc', 'url' = 'jdbc:postgresql://postgres:5432/postgres', 'table-name' = '{table_name}', 'username' = 'postgres', 'password' = 'postgres', 'driver' = 'org.postgresql.Driver' ); """ t_env.execute_sql(sink_ddl) return table_name def create_events_source_kafka(t_env): table_name = "events" source_ddl = f""" CREATE TABLE {table_name} ( test_data INTEGER, event_timestamp BIGINT, event_watermark AS TO_TIMESTAMP_LTZ(event_timestamp, 3), WATERMARK for event_watermark as event_watermark - INTERVAL '1' SECOND ) WITH ( 'connector' = 'kafka', 'properties.bootstrap.servers' = 'redpanda-1:29092', 'topic' = 'test-topic', 'scan.startup.mode' = 'earliest-offset', 'properties.auto.offset.reset' = 'earliest', 'format' = 'json' ); """ t_env.execute_sql(source_ddl) return table_name def log_aggregation(): # Set up the execution environment env = StreamExecutionEnvironment.get_execution_environment() env.enable_checkpointing(10 * 1000) env.set_parallelism(3) # Set up the table environment settings = EnvironmentSettings.new_instance().in_streaming_mode().build() t_env = StreamTableEnvironment.create(env, 
environment_settings=settings) watermark_strategy = ( WatermarkStrategy .for_bounded_out_of_orderness(Duration.of_seconds(5)) .with_timestamp_assigner( # This lambda is your timestamp assigner: # event -> The data record # timestamp -> The previously assigned (or default) timestamp lambda event, timestamp: event[2] # We treat the second tuple element as the event-time (ms). ) ) try: # Create Kafka table source_table = create_events_source_kafka(t_env) aggregated_table = create_events_aggregated_sink(t_env) t_env.execute_sql(f""" INSERT INTO {aggregated_table} SELECT window_start as event_hour, test_data, COUNT(*) AS num_hits FROM TABLE( TUMBLE(TABLE {source_table}, DESCRIPTOR(event_watermark), INTERVAL '1' MINUTE) ) GROUP BY window_start, test_data; """).wait() except Exception as e: print("Writing records from Kafka to JDBC failed:", str(e)) if __name__ == '__main__': log_aggregation() ================================================ FILE: 07-streaming/extras/pyflink/src/job/start_job.py ================================================ from pyflink.datastream import StreamExecutionEnvironment from pyflink.table import EnvironmentSettings, DataTypes, TableEnvironment, StreamTableEnvironment def create_processed_events_sink_postgres(t_env): table_name = 'processed_events' sink_ddl = f""" CREATE TABLE {table_name} ( test_data INTEGER, event_timestamp TIMESTAMP ) WITH ( 'connector' = 'jdbc', 'url' = 'jdbc:postgresql://postgres:5432/postgres', 'table-name' = '{table_name}', 'username' = 'postgres', 'password' = 'postgres', 'driver' = 'org.postgresql.Driver' ); """ t_env.execute_sql(sink_ddl) return table_name def create_events_source_kafka(t_env): table_name = "events" pattern = "yyyy-MM-dd HH:mm:ss.SSS" source_ddl = f""" CREATE TABLE {table_name} ( test_data INTEGER, event_timestamp BIGINT, event_watermark AS TO_TIMESTAMP_LTZ(event_timestamp, 3), WATERMARK for event_watermark as event_watermark - INTERVAL '5' SECOND ) WITH ( 'connector' = 'kafka', 
'properties.bootstrap.servers' = 'redpanda-1:29092', 'topic' = 'test-topic', 'scan.startup.mode' = 'latest-offset', 'properties.auto.offset.reset' = 'latest', 'format' = 'json' ); """ t_env.execute_sql(source_ddl) return table_name def log_processing(): # Set up the execution environment env = StreamExecutionEnvironment.get_execution_environment() env.enable_checkpointing(10 * 1000) # env.set_parallelism(1) # Set up the table environment settings = EnvironmentSettings.new_instance().in_streaming_mode().build() t_env = StreamTableEnvironment.create(env, environment_settings=settings) try: # Create Kafka table source_table = create_events_source_kafka(t_env) postgres_sink = create_processed_events_sink_postgres(t_env) # write records to postgres too! t_env.execute_sql( f""" INSERT INTO {postgres_sink} SELECT test_data, TO_TIMESTAMP_LTZ(event_timestamp, 3) as event_timestamp FROM {source_table} """ ).wait() except Exception as e: print("Writing records from Kafka to JDBC failed:", str(e)) if __name__ == '__main__': log_processing() ================================================ FILE: 07-streaming/extras/pyflink/src/job/taxi_job.py ================================================ from pyflink.datastream import StreamExecutionEnvironment from pyflink.table import EnvironmentSettings, DataTypes, TableEnvironment, StreamTableEnvironment def create_taxi_events_sink_postgres(t_env): table_name = 'taxi_events' sink_ddl = f""" CREATE OR REPLACE TABLE {table_name} ( VendorID INTEGER, lpep_pickup_datetime VARCHAR, lpep_dropoff_datetime VARCHAR, store_and_fwd_flag VARCHAR, RatecodeID INTEGER , PULocationID INTEGER, DOLocationID INTEGER, passenger_count INTEGER, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type INTEGER, trip_type INTEGER, congestion_surcharge DOUBLE ) WITH ( 'connector' = 'jdbc', 'url' = 
'jdbc:postgresql://postgres:5432/postgres', 'table-name' = '{table_name}', 'username' = 'postgres', 'password' = 'postgres', 'driver' = 'org.postgresql.Driver' ); """ t_env.execute_sql(sink_ddl) return table_name def create_events_source_kafka(t_env): table_name = "taxi_events" pattern = "yyyy-MM-dd HH:mm:ss" source_ddl = f""" CREATE TABLE {table_name} ( VendorID INTEGER, lpep_pickup_datetime VARCHAR, lpep_dropoff_datetime VARCHAR, store_and_fwd_flag VARCHAR, RatecodeID INTEGER , PULocationID INTEGER, DOLocationID INTEGER, passenger_count INTEGER, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type INTEGER, trip_type INTEGER, congestion_surcharge DOUBLE, pickup_timestamp AS TO_TIMESTAMP(lpep_pickup_datetime, '{pattern}'), WATERMARK FOR pickup_timestamp AS pickup_timestamp - INTERVAL '15' SECOND ) WITH ( 'connector' = 'kafka', 'properties.bootstrap.servers' = 'redpanda-1:29092', 'topic' = 'green-data', 'scan.startup.mode' = 'earliest-offset', 'properties.auto.offset.reset' = 'earliest', 'format' = 'json' ); """ t_env.execute_sql(source_ddl) return table_name def log_processing(): # Set up the execution environment env = StreamExecutionEnvironment.get_execution_environment() env.enable_checkpointing(10 * 1000) # env.set_parallelism(1) # Set up the table environment settings = EnvironmentSettings.new_instance().in_streaming_mode().build() t_env = StreamTableEnvironment.create(env, environment_settings=settings) try: # Create Kafka table source_table = create_events_source_kafka(t_env) postgres_sink = create_taxi_events_sink_postgres(t_env) # write records to postgres too! 
t_env.execute_sql( f""" INSERT INTO {postgres_sink} SELECT * FROM {source_table} """ ).wait() except Exception as e: print("Writing records from Kafka to JDBC failed:", str(e)) if __name__ == '__main__': log_processing() ================================================ FILE: 07-streaming/extras/pyflink/src/producers/load_taxi_data.py ================================================ import csv import json from kafka import KafkaProducer def main(): # Create a Kafka producer producer = KafkaProducer( bootstrap_servers='localhost:9092', value_serializer=lambda v: json.dumps(v).encode('utf-8') ) csv_file = 'data/green_tripdata_2019-10.csv' # change to your CSV file path if needed with open(csv_file, 'r', newline='', encoding='utf-8') as file: reader = csv.DictReader(file) for row in reader: # Each row will be a dictionary keyed by the CSV headers # Send data to Kafka topic "green-data" producer.send('green-data', value=row) # Make sure any remaining messages are delivered producer.flush() producer.close() if __name__ == "__main__": main() ================================================ FILE: 07-streaming/extras/pyflink/src/producers/producer.py ================================================ import json import time from kafka import KafkaProducer def json_serializer(data): return json.dumps(data).encode('utf-8') server = 'localhost:9092' producer = KafkaProducer( bootstrap_servers=[server], value_serializer=json_serializer ) t0 = time.time() topic_name = 'test-topic' for i in range(10, 1000): message = {'test_data': i, 'event_timestamp': time.time() * 1000} producer.send(topic_name, value=message) print(f"Sent: {message}") time.sleep(0.05) producer.flush() t1 = time.time() print(f'took {(t1 - t0):.2f} seconds') ================================================ FILE: 07-streaming/extras/python/README.md ================================================ ### Stream-Processing with Python In this document, you will be finding information about stream processing using 
different Python libraries (`kafka-python`,`confluent-kafka`,`pyspark`, `faust`). This Python module can be separated in following modules. #### 1. Docker Docker module includes, Dockerfiles and docker-compose definitions to run Kafka and Spark in a docker container. Setting up required services is the prerequsite step for running following modules. #### 2. Kafka Producer - Consumer Examples - [Json Producer-Consumer Example](json_example) using `kafka-python` library - [Avro Producer-Consumer Example](avro_example) using `confluent-kafka` library Both of these examples require, up-and running Kafka services, therefore please ensure following steps under [docker-README](docker/README.md) To run the producer-consumer examples in the respective example folder, run following commands ```bash # Start producer script python3 producer.py # Start consumer script python3 consumer.py ``` ================================================ FILE: 07-streaming/extras/python/avro_example/consumer.py ================================================ import os from typing import Dict, List from confluent_kafka import Consumer from confluent_kafka.schema_registry import SchemaRegistryClient from confluent_kafka.schema_registry.avro import AvroDeserializer from confluent_kafka.serialization import SerializationContext, MessageField from ride_record_key import dict_to_ride_record_key from ride_record import dict_to_ride_record from settings import BOOTSTRAP_SERVERS, SCHEMA_REGISTRY_URL, \ RIDE_KEY_SCHEMA_PATH, RIDE_VALUE_SCHEMA_PATH, KAFKA_TOPIC class RideAvroConsumer: def __init__(self, props: Dict): # Schema Registry and Serializer-Deserializer Configurations key_schema_str = self.load_schema(props['schema.key']) value_schema_str = self.load_schema(props['schema.value']) schema_registry_props = {'url': props['schema_registry.url']} schema_registry_client = SchemaRegistryClient(schema_registry_props) self.avro_key_deserializer = 
AvroDeserializer(schema_registry_client=schema_registry_client, schema_str=key_schema_str, from_dict=dict_to_ride_record_key) self.avro_value_deserializer = AvroDeserializer(schema_registry_client=schema_registry_client, schema_str=value_schema_str, from_dict=dict_to_ride_record) consumer_props = {'bootstrap.servers': props['bootstrap.servers'], 'group.id': 'datatalkclubs.taxirides.avro.consumer.2', 'auto.offset.reset': "earliest"} self.consumer = Consumer(consumer_props) @staticmethod def load_schema(schema_path: str): path = os.path.realpath(os.path.dirname(__file__)) with open(f"{path}/{schema_path}") as f: schema_str = f.read() return schema_str def consume_from_kafka(self, topics: List[str]): self.consumer.subscribe(topics=topics) while True: try: # SIGINT can't be handled when polling, limit timeout to 1 second. msg = self.consumer.poll(1.0) if msg is None: continue key = self.avro_key_deserializer(msg.key(), SerializationContext(msg.topic(), MessageField.KEY)) record = self.avro_value_deserializer(msg.value(), SerializationContext(msg.topic(), MessageField.VALUE)) if record is not None: print("{}, {}".format(key, record)) except KeyboardInterrupt: break self.consumer.close() if __name__ == "__main__": config = { 'bootstrap.servers': BOOTSTRAP_SERVERS, 'schema_registry.url': SCHEMA_REGISTRY_URL, 'schema.key': RIDE_KEY_SCHEMA_PATH, 'schema.value': RIDE_VALUE_SCHEMA_PATH, } avro_consumer = RideAvroConsumer(props=config) avro_consumer.consume_from_kafka(topics=[KAFKA_TOPIC]) ================================================ FILE: 07-streaming/extras/python/avro_example/producer.py ================================================ import os import csv from time import sleep from typing import Dict from confluent_kafka import Producer from confluent_kafka.schema_registry import SchemaRegistryClient from confluent_kafka.schema_registry.avro import AvroSerializer from confluent_kafka.serialization import SerializationContext, MessageField from ride_record_key import 
RideRecordKey, ride_record_key_to_dict from ride_record import RideRecord, ride_record_to_dict from settings import RIDE_KEY_SCHEMA_PATH, RIDE_VALUE_SCHEMA_PATH, \ SCHEMA_REGISTRY_URL, BOOTSTRAP_SERVERS, INPUT_DATA_PATH, KAFKA_TOPIC def delivery_report(err, msg): if err is not None: print("Delivery failed for record {}: {}".format(msg.key(), err)) return print('Record {} successfully produced to {} [{}] at offset {}'.format( msg.key(), msg.topic(), msg.partition(), msg.offset())) class RideAvroProducer: def __init__(self, props: Dict): # Schema Registry and Serializer-Deserializer Configurations key_schema_str = self.load_schema(props['schema.key']) value_schema_str = self.load_schema(props['schema.value']) schema_registry_props = {'url': props['schema_registry.url']} schema_registry_client = SchemaRegistryClient(schema_registry_props) self.key_serializer = AvroSerializer(schema_registry_client, key_schema_str, ride_record_key_to_dict) self.value_serializer = AvroSerializer(schema_registry_client, value_schema_str, ride_record_to_dict) # Producer Configuration producer_props = {'bootstrap.servers': props['bootstrap.servers']} self.producer = Producer(producer_props) @staticmethod def load_schema(schema_path: str): path = os.path.realpath(os.path.dirname(__file__)) with open(f"{path}/{schema_path}") as f: schema_str = f.read() return schema_str @staticmethod def delivery_report(err, msg): if err is not None: print("Delivery failed for record {}: {}".format(msg.key(), err)) return print('Record {} successfully produced to {} [{}] at offset {}'.format( msg.key(), msg.topic(), msg.partition(), msg.offset())) @staticmethod def read_records(resource_path: str): ride_records, ride_keys = [], [] with open(resource_path, 'r') as f: reader = csv.reader(f) header = next(reader) # skip the header for row in reader: ride_records.append(RideRecord(arr=[row[0], row[3], row[4], row[9], row[16]])) ride_keys.append(RideRecordKey(vendor_id=int(row[0]))) return zip(ride_keys, 
ride_records) def publish(self, topic: str, records: [RideRecordKey, RideRecord]): for key_value in records: key, value = key_value try: self.producer.produce(topic=topic, key=self.key_serializer(key, SerializationContext(topic=topic, field=MessageField.KEY)), value=self.value_serializer(value, SerializationContext(topic=topic, field=MessageField.VALUE)), on_delivery=delivery_report) except KeyboardInterrupt: break except Exception as e: print(f"Exception while producing record - {value}: {e}") self.producer.flush() sleep(1) if __name__ == "__main__": config = { 'bootstrap.servers': BOOTSTRAP_SERVERS, 'schema_registry.url': SCHEMA_REGISTRY_URL, 'schema.key': RIDE_KEY_SCHEMA_PATH, 'schema.value': RIDE_VALUE_SCHEMA_PATH } producer = RideAvroProducer(props=config) ride_records = producer.read_records(resource_path=INPUT_DATA_PATH) producer.publish(topic=KAFKA_TOPIC, records=ride_records) ================================================ FILE: 07-streaming/extras/python/avro_example/ride_record.py ================================================ from typing import List, Dict class RideRecord: def __init__(self, arr: List[str]): self.vendor_id = int(arr[0]) self.passenger_count = int(arr[1]) self.trip_distance = float(arr[2]) self.payment_type = int(arr[3]) self.total_amount = float(arr[4]) @classmethod def from_dict(cls, d: Dict): return cls(arr=[ d['vendor_id'], d['passenger_count'], d['trip_distance'], d['payment_type'], d['total_amount'] ] ) def __repr__(self): return f'{self.__class__.__name__}: {self.__dict__}' def dict_to_ride_record(obj, ctx): if obj is None: return None return RideRecord.from_dict(obj) def ride_record_to_dict(ride_record: RideRecord, ctx): return ride_record.__dict__ ================================================ FILE: 07-streaming/extras/python/avro_example/ride_record_key.py ================================================ from typing import Dict class RideRecordKey: def __init__(self, vendor_id): self.vendor_id = vendor_id @classmethod def 
from_dict(cls, d: Dict):
        return cls(vendor_id=d['vendor_id'])

    def __repr__(self):
        return f'{self.__class__.__name__}: {self.__dict__}'


def dict_to_ride_record_key(obj, ctx):
    if obj is None:
        return None

    return RideRecordKey.from_dict(obj)


def ride_record_key_to_dict(ride_record_key: RideRecordKey, ctx):
    return ride_record_key.__dict__

================================================
FILE: 07-streaming/extras/python/avro_example/settings.py
================================================
INPUT_DATA_PATH = '../resources/rides.csv'

RIDE_KEY_SCHEMA_PATH = '../resources/schemas/taxi_ride_key.avsc'
RIDE_VALUE_SCHEMA_PATH = '../resources/schemas/taxi_ride_value.avsc'

SCHEMA_REGISTRY_URL = 'http://localhost:8081'
BOOTSTRAP_SERVERS = 'localhost:9092'
KAFKA_TOPIC = 'rides_avro'

================================================
FILE: 07-streaming/extras/python/docker/README.md
================================================

# Running Spark and Kafka Clusters on Docker

### 1. Build Required Images for running Spark

Details on how the Spark images are built in layers can be found in the blog post written by André Perez on [Medium blog - Towards Data Science](https://towardsdatascience.com/apache-spark-cluster-on-docker-ft-a-juyterlab-interface-418383c95445)

```bash
# Build Spark Images
./build.sh
```

### 2. Create Docker Network & Volume

```bash
# Create Network
docker network create kafka-spark-network

# Create Volume
docker volume create --name=hadoop-distributed-file-system
```

### 3. Run Services on Docker
```bash
# Start Docker-Compose (run within both the kafka and spark folders)
docker compose up -d
```

In depth explanation of [Kafka Listeners](https://www.confluent.io/blog/kafka-listeners-explained/)

### 4. Stop Services on Docker
```bash
# Stop Docker-Compose (run within both the kafka and spark folders)
docker compose down
```

### 5. Helpful Commands
```bash
# Delete all Containers
docker rm -f $(docker ps -a -q)

# Delete all volumes
docker volume rm $(docker volume ls -q)
```

================================================
FILE: 07-streaming/extras/python/docker/docker-compose.yml
================================================
version: "3.6"
volumes:
  shared-workspace:
    name: "hadoop-distributed-file-system"
    driver: local
services:
  jupyterlab:
    image: jupyterlab
    container_name: jupyterlab
    ports:
      - 8888:8888
    volumes:
      - shared-workspace:/opt/workspace
  spark-master:
    image: spark-master
    container_name: spark-master
    environment:
      SPARK_LOCAL_IP: 'spark-master'
    ports:
      - 8080:8080
      - 7077:7077
    volumes:
      - shared-workspace:/opt/workspace
  spark-worker-1:
    image: spark-worker
    container_name: spark-worker-1
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=4g
    ports:
      - 8083:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
  spark-worker-2:
    image: spark-worker
    container_name: spark-worker-2
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=4g
    ports:
      - 8082:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
  broker:
    image: confluentinc/cp-kafka:7.2.0
    hostname: broker
    container_name: broker
    depends_on:
      - zookeeper
    ports:
      - '9092:9092'
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      # KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      # KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:9092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_BOB:PLAINTEXT,LISTENER_FRED:PLAINTEXT
      KAFKA_LISTENERS: LISTENER_BOB://broker:29092,LISTENER_FRED://broker:9092
      KAFKA_ADVERTISED_LISTENERS: LISTENER_BOB://broker:29092,LISTENER_FRED://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_BOB
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
  schema-registry:
    image: confluentinc/cp-schema-registry:7.2.0
    hostname: schema-registry
    container_name: schema-registry
    depends_on:
      - zookeeper
      - broker
    ports:
      - "8081:8081"
    environment:
      # SCHEMA_REGISTRY_HOST_NAME: schema-registry # used for intercommunication
      # SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: "zookeeper:2181" #(deprecated)
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: "broker:29092"
      SCHEMA_REGISTRY_HOST_NAME: "localhost"
      SCHEMA_REGISTRY_LISTENERS: "http://0.0.0.0:8081" #(default: http://0.0.0.0:8081)
  zookeeper:
    image: confluentinc/cp-zookeeper:7.2.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - '2181:2181'
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  control-center:
    image: confluentinc/cp-enterprise-control-center:7.2.0
    hostname: control-center
    container_name: control-center
    depends_on:
      - zookeeper
      - broker
      - schema-registry
    ports:
      - "9021:9021"
    environment:
      CONTROL_CENTER_BOOTSTRAP_SERVERS: 'broker:29092'
      CONTROL_CENTER_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      CONTROL_CENTER_SCHEMA_REGISTRY_URL: "http://localhost:8081"
      CONTROL_CENTER_REPLICATION_FACTOR: 1
      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1
      CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1
      CONFLUENT_METRICS_TOPIC_REPLICATION: 1
      PORT: 9021

================================================
FILE: 07-streaming/extras/python/docker/kafka/docker-compose.yml
================================================
version: '3.6'
networks:
  default:
    name: kafka-spark-network
    external: true
services:
  broker:
    image: confluentinc/cp-kafka:7.2.0
    hostname: broker
    container_name: broker
    depends_on:
      - zookeeper
    ports:
      - '9092:9092'
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://broker:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
  schema-registry:
    image: confluentinc/cp-schema-registry:7.2.0
    hostname: schema-registry
    container_name: schema-registry
    depends_on:
      - zookeeper
      - broker
    ports:
      - "8081:8081"
    environment:
      # SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: "zookeeper:2181" #(deprecated)
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: "broker:29092"
      SCHEMA_REGISTRY_HOST_NAME: "localhost"
      SCHEMA_REGISTRY_LISTENERS: "http://0.0.0.0:8081" #(default: http://0.0.0.0:8081)
  zookeeper:
    image: confluentinc/cp-zookeeper:7.2.0
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - '2181:2181'
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  control-center:
    image: confluentinc/cp-enterprise-control-center:7.2.0
    hostname: control-center
    container_name: control-center
    depends_on:
      - zookeeper
      - broker
      - schema-registry
    ports:
      - "9021:9021"
    environment:
      CONTROL_CENTER_BOOTSTRAP_SERVERS: 'broker:29092'
      CONTROL_CENTER_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      CONTROL_CENTER_SCHEMA_REGISTRY_URL: "http://localhost:8081"
      CONTROL_CENTER_REPLICATION_FACTOR: 1
      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1
      CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1
      CONFLUENT_METRICS_TOPIC_REPLICATION: 1
      PORT: 9021

  kafka-rest:
    image: confluentinc/cp-kafka-rest:7.2.0
    hostname: kafka-rest
    ports:
      - "8082:8082"
    depends_on:
      - schema-registry
      - broker
    environment:
      KAFKA_REST_BOOTSTRAP_SERVERS: 'broker:29092'
      KAFKA_REST_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_REST_SCHEMA_REGISTRY_URL: 'http://localhost:8081'
      KAFKA_REST_HOST_NAME: localhost
      KAFKA_REST_LISTENERS: 'http://0.0.0.0:8082'

================================================
FILE: 07-streaming/extras/python/docker/spark/build.sh
================================================
# -- Software Stack Version

SPARK_VERSION="3.3.1"
HADOOP_VERSION="3"
JUPYTERLAB_VERSION="3.6.1"

# -- Building the Images

docker build \
  -f cluster-base.Dockerfile \
  -t cluster-base .

docker build \
  --build-arg spark_version="${SPARK_VERSION}" \
  --build-arg hadoop_version="${HADOOP_VERSION}" \
  -f spark-base.Dockerfile \
  -t spark-base .

docker build \
  -f spark-master.Dockerfile \
  -t spark-master .

docker build \
  -f spark-worker.Dockerfile \
  -t spark-worker .

docker build \
  --build-arg spark_version="${SPARK_VERSION}" \
  --build-arg jupyterlab_version="${JUPYTERLAB_VERSION}" \
  -f jupyterlab.Dockerfile \
  -t jupyterlab .

================================================
FILE: 07-streaming/extras/python/docker/spark/cluster-base.Dockerfile
================================================
# Reference from official Apache Spark repository Dockerfile for Kubernetes
# https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
ARG java_image_tag=17-jre
FROM eclipse-temurin:${java_image_tag}

# -- Layer: OS + Python

ARG shared_workspace=/opt/workspace

RUN mkdir -p ${shared_workspace} && \
    apt-get update -y && \
    apt-get install -y python3 && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    rm -rf /var/lib/apt/lists/*

ENV SHARED_WORKSPACE=${shared_workspace}

# -- Runtime

VOLUME ${shared_workspace}
CMD ["bash"]

================================================
FILE: 07-streaming/extras/python/docker/spark/docker-compose.yml
================================================
version: "3.6"
volumes:
  shared-workspace:
    name: "hadoop-distributed-file-system"
    driver: local
networks:
  default:
    name: kafka-spark-network
    external: true
services:
  jupyterlab:
    image: jupyterlab
    container_name: jupyterlab
    ports:
      - 8888:8888
    volumes:
      - shared-workspace:/opt/workspace
  spark-master:
    image: spark-master
    container_name: spark-master
    environment:
      SPARK_LOCAL_IP: 'spark-master'
    ports:
      - 8080:8080
      - 7077:7077
    volumes:
      - shared-workspace:/opt/workspace
  spark-worker-1:
    image: spark-worker
    container_name: spark-worker-1
    environment:
      - SPARK_WORKER_CORES=1
      - 
SPARK_WORKER_MEMORY=4g ports: - 8083:8081 volumes: - shared-workspace:/opt/workspace depends_on: - spark-master spark-worker-2: image: spark-worker container_name: spark-worker-2 environment: - SPARK_WORKER_CORES=1 - SPARK_WORKER_MEMORY=4g ports: - 8084:8081 volumes: - shared-workspace:/opt/workspace depends_on: - spark-master ================================================ FILE: 07-streaming/extras/python/docker/spark/jupyterlab.Dockerfile ================================================ FROM cluster-base # -- Layer: JupyterLab ARG spark_version=3.3.1 ARG jupyterlab_version=3.6.1 RUN apt-get update -y && \ apt-get install -y python3-pip && \ pip3 install wget pyspark==${spark_version} jupyterlab==${jupyterlab_version} # -- Runtime EXPOSE 8888 WORKDIR ${SHARED_WORKSPACE} CMD jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token= ================================================ FILE: 07-streaming/extras/python/docker/spark/spark-base.Dockerfile ================================================ FROM cluster-base # -- Layer: Apache Spark ARG spark_version=3.3.1 ARG hadoop_version=3 RUN apt-get update -y && \ apt-get install -y curl && \ curl https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-hadoop${hadoop_version}.tgz -o spark.tgz && \ tar -xf spark.tgz && \ mv spark-${spark_version}-bin-hadoop${hadoop_version} /usr/bin/ && \ mkdir /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/logs && \ rm spark.tgz ENV SPARK_HOME /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version} ENV SPARK_MASTER_HOST spark-master ENV SPARK_MASTER_PORT 7077 ENV PYSPARK_PYTHON python3 # -- Runtime WORKDIR ${SPARK_HOME} ================================================ FILE: 07-streaming/extras/python/docker/spark/spark-master.Dockerfile ================================================ FROM spark-base # -- Runtime ARG spark_master_web_ui=8080 EXPOSE ${spark_master_web_ui} ${SPARK_MASTER_PORT} CMD bin/spark-class 
org.apache.spark.deploy.master.Master >> logs/spark-master.out ================================================ FILE: 07-streaming/extras/python/docker/spark/spark-worker.Dockerfile ================================================ FROM spark-base # -- Runtime ARG spark_worker_web_ui=8081 EXPOSE ${spark_worker_web_ui} CMD bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} >> logs/spark-worker.out ================================================ FILE: 07-streaming/extras/python/json_example/consumer.py ================================================ from typing import Dict, List from json import loads from kafka import KafkaConsumer from ride import Ride from settings import BOOTSTRAP_SERVERS, KAFKA_TOPIC class JsonConsumer: def __init__(self, props: Dict): self.consumer = KafkaConsumer(**props) def consume_from_kafka(self, topics: List[str]): self.consumer.subscribe(topics) print('Consuming from Kafka started') print('Available topics to consume: ', self.consumer.subscription()) while True: try: # SIGINT can't be handled when polling, limit timeout to 1 second. 
                # kafka-python's poll() takes its timeout in milliseconds (timeout_ms).
                message = self.consumer.poll(timeout_ms=1000)
                if message is None or message == {}:
                    continue
                for message_key, message_value in message.items():
                    for msg_val in message_value:
                        print(msg_val.key, msg_val.value)
            except KeyboardInterrupt:
                break

        self.consumer.close()


if __name__ == '__main__':
    config = {
        'bootstrap_servers': BOOTSTRAP_SERVERS,
        'auto_offset_reset': 'earliest',
        'enable_auto_commit': True,
        'key_deserializer': lambda key: int(key.decode('utf-8')),
        'value_deserializer': lambda x: loads(x.decode('utf-8'), object_hook=lambda d: Ride.from_dict(d)),
        'group_id': 'consumer.group.id.json-example.1',
    }

    json_consumer = JsonConsumer(props=config)
    json_consumer.consume_from_kafka(topics=[KAFKA_TOPIC])

================================================
FILE: 07-streaming/extras/python/json_example/producer.py
================================================
import csv
import json
from typing import List, Dict
from kafka import KafkaProducer
from kafka.errors import KafkaTimeoutError

from ride import Ride
from settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, KAFKA_TOPIC


class JsonProducer(KafkaProducer):
    def __init__(self, props: Dict):
        self.producer = KafkaProducer(**props)

    @staticmethod
    def read_records(resource_path: str):
        records = []
        with open(resource_path, 'r') as f:
            reader = csv.reader(f)
            header = next(reader)  # skip the header row
            for row in reader:
                records.append(Ride(arr=row))
        return records

    def publish_rides(self, topic: str, messages: List[Ride]):
        for ride in messages:
            try:
                record = self.producer.send(topic=topic, key=ride.pu_location_id, value=ride)
                print('Record {} successfully produced at offset {}'.format(ride.pu_location_id, record.get().offset))
            except KafkaTimeoutError as e:
                print(e.__str__())


if __name__ == '__main__':
    # Config Should match with the KafkaProducer expectation
    config = {
        'bootstrap_servers': BOOTSTRAP_SERVERS,
        'key_serializer': lambda key: str(key).encode(),
        'value_serializer': lambda x: json.dumps(x.__dict__, default=str).encode('utf-8')
    }
producer = JsonProducer(props=config) rides = producer.read_records(resource_path=INPUT_DATA_PATH) producer.publish_rides(topic=KAFKA_TOPIC, messages=rides) ================================================ FILE: 07-streaming/extras/python/json_example/ride.py ================================================ from typing import List, Dict from decimal import Decimal from datetime import datetime class Ride: def __init__(self, arr: List[str]): self.vendor_id = arr[0] self.tpep_pickup_datetime = datetime.strptime(arr[1], "%Y-%m-%d %H:%M:%S"), self.tpep_dropoff_datetime = datetime.strptime(arr[2], "%Y-%m-%d %H:%M:%S"), self.passenger_count = int(arr[3]) self.trip_distance = Decimal(arr[4]) self.rate_code_id = int(arr[5]) self.store_and_fwd_flag = arr[6] self.pu_location_id = int(arr[7]) self.do_location_id = int(arr[8]) self.payment_type = arr[9] self.fare_amount = Decimal(arr[10]) self.extra = Decimal(arr[11]) self.mta_tax = Decimal(arr[12]) self.tip_amount = Decimal(arr[13]) self.tolls_amount = Decimal(arr[14]) self.improvement_surcharge = Decimal(arr[15]) self.total_amount = Decimal(arr[16]) self.congestion_surcharge = Decimal(arr[17]) @classmethod def from_dict(cls, d: Dict): return cls(arr=[ d['vendor_id'], d['tpep_pickup_datetime'][0], d['tpep_dropoff_datetime'][0], d['passenger_count'], d['trip_distance'], d['rate_code_id'], d['store_and_fwd_flag'], d['pu_location_id'], d['do_location_id'], d['payment_type'], d['fare_amount'], d['extra'], d['mta_tax'], d['tip_amount'], d['tolls_amount'], d['improvement_surcharge'], d['total_amount'], d['congestion_surcharge'], ] ) def __repr__(self): return f'{self.__class__.__name__}: {self.__dict__}' ================================================ FILE: 07-streaming/extras/python/json_example/settings.py ================================================ INPUT_DATA_PATH = '../resources/rides.csv' BOOTSTRAP_SERVERS = ['localhost:9092'] KAFKA_TOPIC = 'rides_json' ================================================ FILE: 
07-streaming/extras/python/redpanda_example/README.md
================================================
# Basic PubSub example with Redpanda

The aim of this module is to give you a good grasp of the following foundational Kafka/Redpanda concepts, so that you can submit a capstone project using streaming:
- clusters
- brokers
- topics
- producers
- consumers and consumer groups
- data serialization and deserialization
- replication and retention
- offsets

## 1. Pre-requisites

If you have been following the [module-07](./../../../07-streaming/README.md) videos, you might already have installed the `kafka-python` library, so you can move on to the [Docker](#2-docker) section.

If you have not, this is the only package you need to install in your virtual environment for this Redpanda lesson.

1. activate your environment
2. `pip install kafka-python`

## 2. Docker

Start a Redpanda cluster. Redpanda ships as a single binary image, so it is an easy way to start learning Kafka concepts.

```bash
cd 07-streaming/extras/python/redpanda_example/
docker-compose up -d
```

## 3. Set RPK alias

Redpanda has a console command `rpk`, which means `Redpanda keeper`. It is the CLI tool that ships with Redpanda and is already available in the Docker image.

Set the following `rpk` alias so we can use it directly from our terminal, without having to open a Docker interactive terminal.

```bash
alias rpk="docker exec -ti redpanda-1 rpk"
rpk version
```

At the time of writing, the version is shown as `v23.2.26 (rev 328d83a06e)`. The important part is the major version number `v23`, following the versioning semantics `major.minor[.build[.revision]]`. Staying on the same major version ensures that you get the same results as shown in this document.

> [!TIP]
> If you're reading this after Mar, 2024 and want to update the Docker file to use the latest Redpanda images, just visit [Docker hub](https://hub.docker.com/r/vectorized/redpanda/tags), and paste the new version number.

## 4. Kafka Producer - Consumer Examples

To run the producer-consumer examples, open 2 shell terminals in 2 side-by-side tabs and run the following commands. Be sure to activate your virtual environment in each terminal.

```bash
# Start consumer script, in 1st terminal tab
python consumer.py
# Start producer script, in 2nd terminal tab
python producer.py
```

Run the `python producer.py` command again (and again) to observe that the `consumer` tab automatically consumes messages in real time as new `events` occur.

## 5. Redpanda UI

You can also see the clusters, topics, etc. from the Redpanda Console UI in your browser at [http://localhost:8080](http://localhost:8080)

## 6. rpk commands glossary

Visit [get-started-rpk blog post](https://redpanda.com/blog/get-started-rpk-manage-streaming-data-clusters) for more.

```bash
# set alias for rpk
alias rpk="docker exec -ti redpanda-1 rpk"

# get info on cluster
rpk cluster info

# create topic_name with m partitions and n replication factor
rpk topic create [topic_name] --partitions m --replicas n

# get list of available topics, without extra details and with details
rpk topic list
rpk topic list --detailed

# inspect topic config
rpk topic describe [topic_name]

# consume [topic_name]
rpk topic consume [topic_name]

# list the consumer groups in a Redpanda cluster
rpk group list

# get additional information about a consumer group, from above listed result
rpk group describe my-group
```

## 7. Additional Resources

Redpanda University (needs a Redpanda account; it is free to enrol and take the courses)

- [RP101: Getting Started with Redpanda](https://university.redpanda.com/courses/hands-on-redpanda-getting-started)
- [RP102: Stream Processing with Redpanda](https://university.redpanda.com/courses/take/hands-on-redpanda-stream-processing/lessons/37830192-intro)
- [SF101: Streaming Fundamentals](https://university.redpanda.com/courses/streaming-fundamentals)
- [SF102: Kafka building blocks](https://university.redpanda.com/courses/kafka-building-blocks)

If you feel that you already have a good foundation in streaming and Kafka, feel free to skip these supplementary courses.

================================================
FILE: 07-streaming/extras/python/redpanda_example/consumer.py
================================================
import os
from typing import Dict, List
from json import loads
from kafka import KafkaConsumer

from ride import Ride
from settings import BOOTSTRAP_SERVERS, KAFKA_TOPIC


class JsonConsumer:
    def __init__(self, props: Dict):
        self.consumer = KafkaConsumer(**props)

    def consume_from_kafka(self, topics: List[str]):
        self.consumer.subscribe(topics)
        print('Consuming from Kafka started')
        print('Available topics to consume: ', self.consumer.subscription())
        while True:
            try:
                # SIGINT can't be handled when polling, limit timeout to 1 second.
                # kafka-python's poll() takes its timeout in milliseconds (timeout_ms).
                message = self.consumer.poll(timeout_ms=1000)
                if message is None or message == {}:
                    continue
                for message_key, message_value in message.items():
                    for msg_val in message_value:
                        print(msg_val.key, msg_val.value)
            except KeyboardInterrupt:
                break

        self.consumer.close()


if __name__ == '__main__':
    config = {
        'bootstrap_servers': BOOTSTRAP_SERVERS,
        'auto_offset_reset': 'earliest',
        'enable_auto_commit': True,
        'key_deserializer': lambda key: int(key.decode('utf-8')),
        'value_deserializer': lambda x: loads(x.decode('utf-8'), object_hook=lambda d: Ride.from_dict(d)),
        'group_id': 'consumer.group.id.json-example.1',
    }

    json_consumer = JsonConsumer(props=config)
    json_consumer.consume_from_kafka(topics=[KAFKA_TOPIC])


# JSON carries no schema, so if the schema changes (a column is removed, a new one is added, or a data type changes), the Ride class would still work and producing and consuming messages would still run without a hitch.
# The problem shows up in downstream analytics: the dataset would no longer have that column, so the dashboards would fail and trust in our data and processes would erode.
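
# A minimal sketch, not part of the original example: one way a consumer could
# notice the schema drift described above is to compare the field names of each
# decoded JSON payload with the names it expects. EXPECTED_FIELDS and
# find_schema_drift are hypothetical names introduced here for illustration,
# and the field list is an assumed subset of the attributes of the Ride class.
from typing import Dict, Set, Tuple

EXPECTED_FIELDS = frozenset({
    'vendor_id', 'passenger_count', 'trip_distance', 'pu_location_id', 'total_amount',
})


def find_schema_drift(payload: Dict) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unexpected) field names for one decoded JSON message."""
    present = set(payload)
    missing = set(EXPECTED_FIELDS - present)
    unexpected = set(present - EXPECTED_FIELDS)
    return missing, unexpected

# Possible usage: call find_schema_drift on the decoded dict before building a
# Ride from it, and log or reject the message when either set is non-empty.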
================================================ FILE: 07-streaming/extras/python/redpanda_example/docker-compose.yaml ================================================ version: '3.7' services: # Redpanda cluster redpanda-1: image: docker.redpanda.com/redpandadata/redpanda:v23.2.26 container_name: redpanda-1 command: - redpanda - start - --smp - '1' - --reserve-memory - 0M - --overprovisioned - --node-id - '1' - --kafka-addr - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092 - --advertise-kafka-addr - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092 - --pandaproxy-addr - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082 - --advertise-pandaproxy-addr - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082 - --rpc-addr - 0.0.0.0:33145 - --advertise-rpc-addr - redpanda-1:33145 ports: # - 8081:8081 - 8082:8082 - 9092:9092 - 9644:9644 - 28082:28082 - 29092:29092 # Want a two node Redpanda cluster? Uncomment this block :) # redpanda-2: # image: docker.redpanda.com/redpandadata/redpanda:v23.1.1 # container_name: redpanda-2 # command: # - redpanda # - start # - --smp # - '1' # - --reserve-memory # - 0M # - --overprovisioned # - --node-id # - '2' # - --seeds # - redpanda-1:33145 # - --kafka-addr # - PLAINTEXT://0.0.0.0:29093,OUTSIDE://0.0.0.0:9093 # - --advertise-kafka-addr # - PLAINTEXT://redpanda-2:29093,OUTSIDE://localhost:9093 # - --pandaproxy-addr # - PLAINTEXT://0.0.0.0:28083,OUTSIDE://0.0.0.0:8083 # - --advertise-pandaproxy-addr # - PLAINTEXT://redpanda-2:28083,OUTSIDE://localhost:8083 # - --rpc-addr # - 0.0.0.0:33146 # - --advertise-rpc-addr # - redpanda-2:33146 # ports: # - 8083:8083 # - 9093:9093 redpanda-console: image: docker.redpanda.com/redpandadata/console:v2.2.2 container_name: redpanda-console entrypoint: /bin/sh command: -c "echo \"$$CONSOLE_CONFIG_FILE\" > /tmp/config.yml; /app/console" environment: CONFIG_FILEPATH: /tmp/config.yml CONSOLE_CONFIG_FILE: | kafka: brokers: ["redpanda-1:29092"] schemaRegistry: enabled: false redpanda: adminApi: enabled: 
true urls: ["http://redpanda-1:9644"] connect: enabled: false ports: - 8080:8080 depends_on: - redpanda-1 ================================================ FILE: 07-streaming/extras/python/redpanda_example/producer.py ================================================ import csv import json from typing import List, Dict from kafka import KafkaProducer from kafka.errors import KafkaTimeoutError from ride import Ride from settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, KAFKA_TOPIC class JsonProducer(KafkaProducer): def __init__(self, props: Dict): self.producer = KafkaProducer(**props) @staticmethod def read_records(resource_path: str): records = [] with open(resource_path, 'r') as f: reader = csv.reader(f) header = next(reader) # skip the header row for row in reader: records.append(Ride(arr=row)) return records def publish_rides(self, topic: str, messages: List[Ride]): for ride in messages: try: record = self.producer.send(topic=topic, key=ride.pu_location_id, value=ride) print('Record {} successfully produced at offset {}'.format(ride.pu_location_id, record.get().offset)) except KafkaTimeoutError as e: print(e.__str__()) if __name__ == '__main__': # Config Should match with the KafkaProducer expectation # kafka expects binary format for the key-value pair config = { 'bootstrap_servers': BOOTSTRAP_SERVERS, 'key_serializer': lambda key: str(key).encode(), 'value_serializer': lambda x: json.dumps(x.__dict__, default=str).encode('utf-8') } producer = JsonProducer(props=config) rides = producer.read_records(resource_path=INPUT_DATA_PATH) producer.publish_rides(topic=KAFKA_TOPIC, messages=rides) ================================================ FILE: 07-streaming/extras/python/redpanda_example/ride.py ================================================ from typing import List, Dict from decimal import Decimal from datetime import datetime class Ride: def __init__(self, arr: List[str]): self.vendor_id = arr[0] self.tpep_pickup_datetime = datetime.strptime(arr[1], "%Y-%m-%d 
%H:%M:%S"), self.tpep_dropoff_datetime = datetime.strptime(arr[2], "%Y-%m-%d %H:%M:%S"), self.passenger_count = int(arr[3]) self.trip_distance = Decimal(arr[4]) self.rate_code_id = int(arr[5]) self.store_and_fwd_flag = arr[6] self.pu_location_id = int(arr[7]) self.do_location_id = int(arr[8]) self.payment_type = arr[9] self.fare_amount = Decimal(arr[10]) self.extra = Decimal(arr[11]) self.mta_tax = Decimal(arr[12]) self.tip_amount = Decimal(arr[13]) self.tolls_amount = Decimal(arr[14]) self.improvement_surcharge = Decimal(arr[15]) self.total_amount = Decimal(arr[16]) self.congestion_surcharge = Decimal(arr[17]) @classmethod def from_dict(cls, d: Dict): return cls(arr=[ d['vendor_id'], d['tpep_pickup_datetime'][0], d['tpep_dropoff_datetime'][0], d['passenger_count'], d['trip_distance'], d['rate_code_id'], d['store_and_fwd_flag'], d['pu_location_id'], d['do_location_id'], d['payment_type'], d['fare_amount'], d['extra'], d['mta_tax'], d['tip_amount'], d['tolls_amount'], d['improvement_surcharge'], d['total_amount'], d['congestion_surcharge'], ] ) def __repr__(self): return f'{self.__class__.__name__}: {self.__dict__}' ================================================ FILE: 07-streaming/extras/python/redpanda_example/settings.py ================================================ INPUT_DATA_PATH = '../resources/rides.csv' BOOTSTRAP_SERVERS = ['localhost:9092'] KAFKA_TOPIC = 'rides_json' ================================================ FILE: 07-streaming/extras/python/requirements.txt ================================================ kafka-python==1.4.6 confluent_kafka requests avro faust fastavro ================================================ FILE: 07-streaming/extras/python/resources/schemas/taxi_ride_key.avsc ================================================ { "namespace": "com.datatalksclub.taxi", "type": "record", "name": "RideRecordKey", "fields": [ { "name": "vendor_id", "type": "int" } ] } ================================================ FILE: 
07-streaming/extras/python/resources/schemas/taxi_ride_value.avsc ================================================ { "namespace": "com.datatalksclub.taxi", "type": "record", "name": "RideRecord", "fields": [ { "name": "vendor_id", "type": "int" }, { "name": "passenger_count", "type": "int" }, { "name": "trip_distance", "type": "float" }, { "name": "payment_type", "type": "int" }, { "name": "total_amount", "type": "float" } ] } ================================================ FILE: 07-streaming/extras/python/streams-example/faust/branch_price.py ================================================ import faust from taxi_rides import TaxiRide from faust import current_event app = faust.App('datatalksclub.stream.v3', broker='kafka://localhost:9092', consumer_auto_offset_reset="earliest") topic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide) high_amount_rides = app.topic('datatalks.yellow_taxi_rides.high_amount') low_amount_rides = app.topic('datatalks.yellow_taxi_rides.low_amount') @app.agent(topic) async def process(stream): async for event in stream: if event.total_amount >= 40.0: await current_event().forward(high_amount_rides) else: await current_event().forward(low_amount_rides) if __name__ == '__main__': app.main() ================================================ FILE: 07-streaming/extras/python/streams-example/faust/producer_taxi_json.py ================================================ import csv from json import dumps from kafka import KafkaProducer from time import sleep producer = KafkaProducer(bootstrap_servers=['localhost:9092'], key_serializer=lambda x: dumps(x).encode('utf-8'), value_serializer=lambda x: dumps(x).encode('utf-8')) file = open('../../resources/rides.csv') csvreader = csv.reader(file) header = next(csvreader) for row in csvreader: key = {"vendorId": int(row[0])} value = {"vendorId": int(row[0]), "passenger_count": int(row[3]), "trip_distance": float(row[4]), "payment_type": int(row[9]), "total_amount": float(row[16])} 
producer.send('datatalkclub.yellow_taxi_ride.json', value=value, key=key) print("producing") sleep(1) ================================================ FILE: 07-streaming/extras/python/streams-example/faust/stream.py ================================================ import faust from taxi_rides import TaxiRide app = faust.App('datatalksclub.stream.v2', broker='kafka://localhost:9092') topic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide) @app.agent(topic) async def start_reading(records): async for record in records: print(record) if __name__ == '__main__': app.main() ================================================ FILE: 07-streaming/extras/python/streams-example/faust/stream_count_vendor_trips.py ================================================ import faust from taxi_rides import TaxiRide app = faust.App('datatalksclub.stream.v2', broker='kafka://localhost:9092') topic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide) vendor_rides = app.Table('vendor_rides', default=int) @app.agent(topic) async def process(stream): async for event in stream.group_by(TaxiRide.vendorId): vendor_rides[event.vendorId] += 1 if __name__ == '__main__': app.main() ================================================ FILE: 07-streaming/extras/python/streams-example/faust/taxi_rides.py ================================================ import faust class TaxiRide(faust.Record, validation=True): vendorId: str passenger_count: int trip_distance: float payment_type: int total_amount: float ================================================ FILE: 07-streaming/extras/python/streams-example/faust/windowing.py ================================================ from datetime import timedelta import faust from taxi_rides import TaxiRide app = faust.App('datatalksclub.stream.v2', broker='kafka://localhost:9092') topic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide) vendor_rides = app.Table('vendor_rides_windowed', default=int).tumbling( 
timedelta(minutes=1), expires=timedelta(hours=1), ) @app.agent(topic) async def process(stream): async for event in stream.group_by(TaxiRide.vendorId): vendor_rides[event.vendorId] += 1 if __name__ == '__main__': app.main() ================================================ FILE: 07-streaming/extras/python/streams-example/pyspark/README.md ================================================ # Running PySpark Streaming #### Prerequisite Ensure your Kafka and Spark services are up and running by following the [docker setup readme](./../../docker/README.md). It is important to create the network and volume as described in that document, so verify that both exist: ```bash docker volume ls # should list hadoop-distributed-file-system docker network ls # should list kafka-spark-network ``` ### Running Producer and Consumer ```bash # Run producer python3 producer.py # Run consumer with default settings python3 consumer.py # Run consumer for a specific topic python3 consumer.py --topic ``` ### Running Streaming Script The `spark-submit.sh` script ensures the necessary jars are installed before running `streaming.py`: ```bash ./spark-submit.sh streaming.py ``` ### Additional Resources - [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streaming-programming-guide) - [Structured Streaming + Kafka Integration](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#structured-streaming-kafka-integration-guide-kafka-broker-versio) ================================================ FILE: 07-streaming/extras/python/streams-example/pyspark/consumer.py ================================================ import argparse from typing import Dict, List from kafka import KafkaConsumer from settings import BOOTSTRAP_SERVERS, CONSUME_TOPIC_RIDES_CSV class RideCSVConsumer: def __init__(self, props: Dict): self.consumer = KafkaConsumer(**props) def 
consume_from_kafka(self, topics: List[str]): self.consumer.subscribe(topics=topics) print('Consuming from Kafka started') print('Available topics to consume: ', self.consumer.subscription()) while True: try: # SIGINT can't be handled when polling, limit timeout to 1 second. msg = self.consumer.poll(1.0) if msg is None or msg == {}: continue for msg_key, msg_values in msg.items(): for msg_val in msg_values: print(f'Key:{msg_val.key}-type({type(msg_val.key)}), ' f'Value:{msg_val.value}-type({type(msg_val.value)})') except KeyboardInterrupt: break self.consumer.close() if __name__ == '__main__': parser = argparse.ArgumentParser(description='Kafka Consumer') parser.add_argument('--topic', type=str, default=CONSUME_TOPIC_RIDES_CSV) args = parser.parse_args() topic = args.topic config = { 'bootstrap_servers': [BOOTSTRAP_SERVERS], 'auto_offset_reset': 'earliest', 'enable_auto_commit': True, 'key_deserializer': lambda key: int(key.decode('utf-8')), 'value_deserializer': lambda value: value.decode('utf-8'), 'group_id': 'consumer.group.id.csv-example.1', } csv_consumer = RideCSVConsumer(props=config) csv_consumer.consume_from_kafka(topics=[topic]) ================================================ FILE: 07-streaming/extras/python/streams-example/pyspark/producer.py ================================================ import csv from time import sleep from typing import Dict from kafka import KafkaProducer from settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, PRODUCE_TOPIC_RIDES_CSV def delivery_report(err, msg): if err is not None: print("Delivery failed for record {}: {}".format(msg.key(), err)) return print('Record {} successfully produced to {} [{}] at offset {}'.format( msg.key(), msg.topic(), msg.partition(), msg.offset())) class RideCSVProducer: def __init__(self, props: Dict): self.producer = KafkaProducer(**props) # self.producer = Producer(producer_props) @staticmethod def read_records(resource_path: str): records, ride_keys = [], [] i = 0 with open(resource_path, 'r') 
as f: reader = csv.reader(f) header = next(reader) # skip the header for row in reader: # vendor_id, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, payment_type, total_amount records.append(f'{row[0]}, {row[1]}, {row[2]}, {row[3]}, {row[4]}, {row[9]}, {row[16]}') ride_keys.append(str(row[0])) i += 1 if i == 5: break return zip(ride_keys, records) def publish(self, topic: str, records: [str, str]): for key_value in records: key, value = key_value try: self.producer.send(topic=topic, key=key, value=value) print(f"Producing record for <key: {key}, value:{value}>") except KeyboardInterrupt: break except Exception as e: print(f"Exception while producing record - {value}: {e}") self.producer.flush() sleep(1) if __name__ == "__main__": config = { 'bootstrap_servers': [BOOTSTRAP_SERVERS], 'key_serializer': lambda x: x.encode('utf-8'), 'value_serializer': lambda x: x.encode('utf-8') } producer = RideCSVProducer(props=config) ride_records = producer.read_records(resource_path=INPUT_DATA_PATH) print(ride_records) producer.publish(topic=PRODUCE_TOPIC_RIDES_CSV, records=ride_records) ================================================ FILE: 07-streaming/extras/python/streams-example/pyspark/settings.py ================================================ import pyspark.sql.types as T INPUT_DATA_PATH = '../../resources/rides.csv' BOOTSTRAP_SERVERS = 'localhost:9092' TOPIC_WINDOWED_VENDOR_ID_COUNT = 'vendor_counts_windowed' PRODUCE_TOPIC_RIDES_CSV = CONSUME_TOPIC_RIDES_CSV = 'rides_csv' RIDE_SCHEMA = T.StructType( [T.StructField("vendor_id", T.IntegerType()), T.StructField('tpep_pickup_datetime', T.TimestampType()), T.StructField('tpep_dropoff_datetime', T.TimestampType()), T.StructField("passenger_count", T.IntegerType()), T.StructField("trip_distance", T.FloatType()), T.StructField("payment_type", T.IntegerType()), T.StructField("total_amount", T.FloatType()), ]) ================================================ FILE: 07-streaming/extras/python/streams-example/pyspark/spark-submit.sh 
================================================ # Submit Python code to SparkMaster if [ $# -lt 1 ] then echo "Usage: $0 [ executor-memory ]" echo "(specify memory in string format such as \"512M\" or \"2G\")" exit 1 fi PYTHON_JOB=$1 if [ -z $2 ] then EXEC_MEM="1G" else EXEC_MEM=$2 fi spark-submit --master spark://localhost:7077 --num-executors 2 \ --executor-memory $EXEC_MEM --executor-cores 1 \ --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.3.1 \ $PYTHON_JOB ================================================ FILE: 07-streaming/extras/python/streams-example/pyspark/streaming-notebook.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "c4419168-c0e6-4a65-b56e-8454c42060ac", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "### 0. Spark Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "32bd7cdd-8504-4a54-a461-244bf7878d2a", "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1 pyspark-shell'" ] }, { "cell_type": "code", "execution_count": 2, "id": "3aab2a7e-a685-4925-9c9a-b5adf201af77", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ":: loading settings :: url = jar:file:/usr/local/lib/python3.10/dist-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Ivy Default Cache set to: /root/.ivy2/cache\n", "The jars for the packages stored in: /root/.ivy2/jars\n", "org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\n", "org.apache.spark#spark-avro_2.12 added as a dependency\n", ":: resolving dependencies :: org.apache.spark#spark-submit-parent-5a3a4db6-be91-4d32-9884-8b0f38241b3f;1.0\n", 
"\tconfs: [default]\n", "\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 in central\n", "\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 in central\n", "\tfound org.apache.kafka#kafka-clients;2.8.1 in central\n", "\tfound org.lz4#lz4-java;1.8.0 in central\n", "\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\n", "\tfound org.slf4j#slf4j-api;1.7.32 in central\n", "\tfound org.apache.hadoop#hadoop-client-runtime;3.3.2 in central\n", "\tfound org.spark-project.spark#unused;1.0.0 in central\n", "\tfound org.apache.hadoop#hadoop-client-api;3.3.2 in central\n", "\tfound commons-logging#commons-logging;1.1.3 in central\n", "\tfound com.google.code.findbugs#jsr305;3.0.0 in central\n", "\tfound org.apache.commons#commons-pool2;2.11.1 in central\n", "\tfound org.apache.spark#spark-avro_2.12;3.3.1 in central\n", "\tfound org.tukaani#xz;1.8 in central\n", ":: resolution report :: resolve 544ms :: artifacts dl 11ms\n", "\t:: modules in use:\n", "\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\n", "\tcommons-logging#commons-logging;1.1.3 from central in [default]\n", "\torg.apache.commons#commons-pool2;2.11.1 from central in [default]\n", "\torg.apache.hadoop#hadoop-client-api;3.3.2 from central in [default]\n", "\torg.apache.hadoop#hadoop-client-runtime;3.3.2 from central in [default]\n", "\torg.apache.kafka#kafka-clients;2.8.1 from central in [default]\n", "\torg.apache.spark#spark-avro_2.12;3.3.1 from central in [default]\n", "\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 from central in [default]\n", "\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 from central in [default]\n", "\torg.lz4#lz4-java;1.8.0 from central in [default]\n", "\torg.slf4j#slf4j-api;1.7.32 from central in [default]\n", "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n", "\torg.tukaani#xz;1.8 from central in [default]\n", "\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\n", 
"\t---------------------------------------------------------------------\n", "\t| | modules || artifacts |\n", "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", "\t---------------------------------------------------------------------\n", "\t| default | 14 | 0 | 0 | 0 || 14 | 0 |\n", "\t---------------------------------------------------------------------\n", ":: retrieving :: org.apache.spark#spark-submit-parent-5a3a4db6-be91-4d32-9884-8b0f38241b3f\n", "\tconfs: [default]\n", "\t0 artifacts copied, 14 already retrieved (0kB/8ms)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "23/02/21 21:20:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "from pyspark.sql import SparkSession\n", "import pyspark.sql.types as T\n", "import pyspark.sql.functions as F\n", "\n", "spark = SparkSession \\\n", " .builder \\\n", " .appName(\"Spark-Notebook\") \\\n", " .getOrCreate()" ] }, { "cell_type": "markdown", "id": "6f4b62fa-b3ce-4a1b-a1f4-2ed332a0d55a", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "### 1. 
Reading from Kafka Stream\n", "\n", "through `readStream`" ] }, { "cell_type": "markdown", "id": "f491fa45-4471-4bc5-92f7-48081f687140", "metadata": {}, "source": [ "#### 1.1 Raw Kafka Stream" ] }, { "cell_type": "code", "execution_count": 3, "id": "82c25cb2-2599-4f9b-8849-967fbb604a44", "metadata": { "tags": [] }, "outputs": [], "source": [ "# default for startingOffsets is \"latest\"\n", "df_kafka_raw = spark \\\n", " .readStream \\\n", " .format(\"kafka\") \\\n", " .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n", " .option(\"subscribe\", \"rides_csv\") \\\n", " .option(\"startingOffsets\", \"earliest\") \\\n", " .option(\"checkpointLocation\", \"checkpoint\") \\\n", " .load()" ] }, { "cell_type": "code", "execution_count": 4, "id": "d9149ccd-69b2-4f5b-afc0-43567673c634", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- key: binary (nullable = true)\n", " |-- value: binary (nullable = true)\n", " |-- topic: string (nullable = true)\n", " |-- partition: integer (nullable = true)\n", " |-- offset: long (nullable = true)\n", " |-- timestamp: timestamp (nullable = true)\n", " |-- timestampType: integer (nullable = true)\n", "\n" ] } ], "source": [ "df_kafka_raw.printSchema()" ] }, { "cell_type": "markdown", "id": "62e5e753-89c7-460f-a8be-16868ce5c680", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "#### 1.2 Encoded Kafka Stream" ] }, { "cell_type": "code", "execution_count": 5, "id": "0b745eed-7d74-421e-8e4b-c8343fda4de3", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_kafka_encoded = df_kafka_raw.selectExpr(\"CAST(key AS STRING)\",\"CAST(value AS STRING)\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "6839addc-c7c0-4117-8c9c-d2cd59cbf136", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- key: string (nullable = true)\n", " |-- value: string (nullable = 
true)\n", "\n" ] } ], "source": [ "df_kafka_encoded.printSchema()" ] }, { "cell_type": "markdown", "id": "6749c4de-6f80-4b91-b2b8-b2968c761d75", "metadata": {}, "source": [ "#### 1.3 Structured Streaming DataFrame" ] }, { "cell_type": "code", "execution_count": 7, "id": "ca20ae37-49f0-421f-9859-73fac8d4ca45", "metadata": { "tags": [] }, "outputs": [], "source": [ "def parse_ride_from_kafka_message(df_raw, schema):\n", " \"\"\" take a Spark Streaming df and parse the value column based on the given schema; return a streaming df with the columns in schema \"\"\"\n", " assert df_raw.isStreaming is True, \"DataFrame doesn't receive streaming data\"\n", "\n", " df = df_raw.selectExpr(\"CAST(key AS STRING)\", \"CAST(value AS STRING)\")\n", "\n", " # split attributes to nested array in one Column\n", " col = F.split(df['value'], ', ')\n", "\n", " # expand col to multiple top-level columns\n", " for idx, field in enumerate(schema):\n", " df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))\n", " return df.select([field.name for field in schema])" ] }, { "cell_type": "code", "execution_count": 8, "id": "e1737bd0-146f-4ee2-a70f-a4657af5bbc6", "metadata": { "tags": [] }, "outputs": [], "source": [ "ride_schema = T.StructType(\n", " [T.StructField(\"vendor_id\", T.IntegerType()),\n", " T.StructField('tpep_pickup_datetime', T.TimestampType()),\n", " T.StructField('tpep_dropoff_datetime', T.TimestampType()),\n", " T.StructField(\"passenger_count\", T.IntegerType()),\n", " T.StructField(\"trip_distance\", T.FloatType()),\n", " T.StructField(\"payment_type\", T.IntegerType()),\n", " T.StructField(\"total_amount\", T.FloatType()),\n", " ])" ] }, { "cell_type": "code", "execution_count": 9, "id": "ae2ce896-f54b-4166-b01f-b5532ab292fe", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_rides = parse_ride_from_kafka_message(df_raw=df_kafka_raw, schema=ride_schema)" ] }, { "cell_type": "code", "execution_count": 10, "id": "cd848228-97c5-4325-8457-97f35e533cd8", "metadata": { "tags": [] }, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- vendor_id: integer (nullable = true)\n", " |-- tpep_pickup_datetime: timestamp (nullable = true)\n", " |-- tpep_dropoff_datetime: timestamp (nullable = true)\n", " |-- passenger_count: integer (nullable = true)\n", " |-- trip_distance: float (nullable = true)\n", " |-- payment_type: integer (nullable = true)\n", " |-- total_amount: float (nullable = true)\n", "\n" ] } ], "source": [ "df_rides.printSchema()" ] }, { "cell_type": "markdown", "id": "60277fdc-2797-4b23-9ecf-956b76db5778", "metadata": { "tags": [] }, "source": [ "### 2 Sink Operation & Streaming Query\n", "\n", "through `writeStream`\n", "\n", "---\n", "**Output Sinks**\n", "- File Sink: stores the output to the directory\n", "- Kafka Sink: stores the output to one or more topics in Kafka\n", "- Foreach Sink:\n", "- (for debugging) Console Sink, Memory Sink\n", "\n", "Further details can be found in [Output Sinks](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)\n", "\n", "---\n", "There are three types of **Output Modes**:\n", "- Complete: The whole Result Table will be outputted to the sink after every trigger. This is supported for aggregation queries.\n", "- Append (default): Only new rows are added to the Result Table\n", "- Update: Only updated rows are outputted\n", "\n", "[Output Modes](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes) differs based on the set of transformations applied to the streaming data. \n", "\n", "--- \n", "**Triggers**\n", "\n", "The [trigger settings](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers) of a streaming query define the timing of streaming data processing. 
Spark streaming support micro-batch streamings schema and you can select following options based on requirements.\n", "\n", "- default-micro-batch-mode\n", "- fixed-interval-micro-batch-mode\n", "- one-time-micro-batch-mode\n", "- available-now-micro-batch-mode\n" ] }, { "cell_type": "markdown", "id": "02ca9b08-aa61-46cd-b946-4457ce2cdf5d", "metadata": { "tags": [] }, "source": [ "#### Console and Memory Sink" ] }, { "cell_type": "code", "execution_count": 11, "id": "74c72469-4c37-417c-a866-a1c1ef75ae8b", "metadata": { "tags": [] }, "outputs": [], "source": [ "def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):\n", " write_query = df.writeStream \\\n", " .outputMode(output_mode) \\\n", " .trigger(processingTime=processing_time) \\\n", " .format(\"console\") \\\n", " .option(\"truncate\", False) \\\n", " .start()\n", " return write_query # pyspark.sql.streaming.StreamingQuery" ] }, { "cell_type": "code", "execution_count": 22, "id": "d866c7ba-f8e9-475d-830a-50ffb2c5472b", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "23/02/21 21:46:12 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-289a958e-f6b6-4b38-a87b-50002d82ec8b. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n", "23/02/21 21:46:12 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n", "23/02/21 21:46:12 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0-3, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "23/02/21 21:46:12 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0-3, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\n", "23/02/21 21:46:13 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-4, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "23/02/21 21:46:13 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-4, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\n", "-------------------------------------------\n", "Batch: 0\n", "-------------------------------------------\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "|1 |2020-07-01 00:25:32 |2020-07-01 00:33:39 |1 |1.5 |2 |9.3 |\n", "|1 |2020-07-01 00:03:19 |2020-07-01 00:25:43 |1 |9.5 |1 |27.8 |\n", "|2 |2020-07-01 00:15:11 |2020-07-01 00:29:24 |1 |5.85 |2 |22.3 |\n", "|2 |2020-07-01 00:30:49 |2020-07-01 00:38:26 |1 |1.9 |1 |14.16 |\n", "|2 |2020-07-01 00:31:26 |2020-07-01 00:38:02 |1 |1.25 |2 |7.8 |\n", "|1 |2020-07-01 00:25:32 |2020-07-01 00:33:39 |1 |1.5 |2 |9.3 |\n", "|1 |2020-07-01 00:03:19 |2020-07-01 00:25:43 |1 |9.5 |1 |27.8 |\n", "|2 |2020-07-01 00:15:11 |2020-07-01 00:29:24 |1 |5.85 |2 |22.3 |\n", "|2 
|2020-07-01 00:30:49 |2020-07-01 00:38:26 |1 |1.9 |1 |14.16 |\n", "|2 |2020-07-01 00:31:26 |2020-07-01 00:38:02 |1 |1.25 |2 |7.8 |\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "\n", "23/02/21 22:11:05 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-5, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "23/02/21 22:11:05 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-5, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "-------------------------------------------\n", "Batch: 1\n", "-------------------------------------------\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "|1 |2020-07-01 00:25:32 |2020-07-01 00:33:39 |1 |1.5 |2 |9.3 |\n", "|1 |2020-07-01 00:03:19 |2020-07-01 00:25:43 |1 |9.5 |1 |27.8 |\n", "|2 |2020-07-01 00:15:11 |2020-07-01 00:29:24 |1 |5.85 |2 |22.3 |\n", "|2 |2020-07-01 00:30:49 |2020-07-01 00:38:26 |1 |1.9 |1 |14.16 |\n", "|2 |2020-07-01 00:31:26 |2020-07-01 00:38:02 |1 |1.25 |2 |7.8 |\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "\n" ] } ], "source": [ "write_query = 
sink_console(df_rides, output_mode='append')" ] }, { "cell_type": "code", "execution_count": 15, "id": "a9bfa73f-a8cc-4988-a8cf-bf31ee6c449c", "metadata": { "tags": [] }, "outputs": [], "source": [ "def sink_memory(df, query_name, query_template):\n", " write_query = df \\\n", " .writeStream \\\n", " .queryName(query_name) \\\n", " .format('memory') \\\n", " .start()\n", " query_str = query_template.format(table_name=query_name)\n", " query_results = spark.sql(query_str)\n", " return write_query, query_results" ] }, { "cell_type": "code", "execution_count": 16, "id": "b31d0b76-e917-44e7-a14d-f9ce6901c23a", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "23/02/21 21:31:47 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-b3e2c096-aa06-4083-9cdf-d6f3cf04fc06. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n", "23/02/21 21:31:47 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n", "23/02/21 21:31:48 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0-1, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "23/02/21 21:31:48 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0-1, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\n", "23/02/21 21:31:49 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor-2, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "23/02/21 21:31:49 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor-2, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " \r" ] } ], "source": [ "query_name = 'vendor_id_counts'\n", "query_template = 'select count(distinct(vendor_id)) from {table_name}'\n", "write_query, df_vendor_id_counts = sink_memory(df=df_rides, query_name=query_name, query_template=query_template)" ] }, { "cell_type": "code", "execution_count": 18, "id": "4ba56111-83bf-4028-ac65-565e0190f310", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "{'message': 'Waiting for data to arrive',\n", " 'isDataAvailable': False,\n", " 'isTriggerActive': True}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(type(write_query)) # pyspark.sql.streaming.StreamingQuery\n", "write_query.status" ] }, { "cell_type": "code", "execution_count": 19, "id": "7cc37bda-9cfa-402b-9d42-a6ba5271476b", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"+-------------------------+\n", "|count(DISTINCT vendor_id)|\n", "+-------------------------+\n", "| 2|\n", "+-------------------------+\n", "\n" ] } ], "source": [ "df_vendor_id_counts.show()" ] }, { "cell_type": "code", "execution_count": 20, "id": "88862ca9-4d89-487e-987f-08a2b9e83efe", "metadata": { "tags": [] }, "outputs": [], "source": [ "write_query.stop()" ] }, { "cell_type": "markdown", "id": "443d4041-06db-4a4a-89c1-348848cc7ca8", "metadata": { "tags": [] }, "source": [ "#### Kafka Sink\n", "\n", "To write stream results to `kafka-topic`, the stream dataframe has at least a column with name `value`.\n", "\n", "Therefore before starting `writeStream` in kafka format, dataframe needs to be updated accordingly.\n", "\n", "More information regarding kafka sink expected data structure [here](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka)\n" ] }, { "cell_type": "code", "execution_count": 21, "id": "8b08a013-d039-41cf-94fd-a1a57571d25f", "metadata": { "tags": [] }, "outputs": [], "source": [ "def prepare_dataframe_to_kafka_sink(df, value_columns, key_column=None):\n", " columns = df.columns\n", " df = df.withColumn(\"value\", F.concat_ws(', ',*value_columns)) \n", " if key_column:\n", " df = df.withColumnRenamed(key_column,\"key\")\n", " df = df.withColumn(\"key\",df.key.cast('string'))\n", " return df.select(['key', 'value'])\n", " \n", "def sink_kafka(df, topic, output_mode='append'):\n", " write_query = df.writeStream \\\n", " .format(\"kafka\") \\\n", " .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n", " .outputMode(output_mode) \\\n", " .option(\"topic\", topic) \\\n", " .option(\"checkpointLocation\", \"checkpoint\") \\\n", " .start()\n", " return write_query" ] }, { "cell_type": "markdown", "id": "e4cb2140-9f2e-4914-b74c-be4c18cdbe8a", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": 
"python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: 07-streaming/extras/python/streams-example/pyspark/streaming.py ================================================ from pyspark.sql import SparkSession import pyspark.sql.functions as F from settings import RIDE_SCHEMA, CONSUME_TOPIC_RIDES_CSV, TOPIC_WINDOWED_VENDOR_ID_COUNT def read_from_kafka(consume_topic: str): # Spark Streaming DataFrame, connect to Kafka topic served at host in bootstrap.servers option df_stream = spark \ .readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", "localhost:9092,broker:29092") \ .option("subscribe", consume_topic) \ .option("startingOffsets", "earliest") \ .option("checkpointLocation", "checkpoint") \ .load() return df_stream def parse_ride_from_kafka_message(df, schema): """ take a Spark Streaming df and parse value col based on schema, return streaming df cols in schema """ assert df.isStreaming is True, "DataFrame doesn't receive streaming data" df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") # split attributes to nested array in one Column col = F.split(df['value'], ', ') # expand col to multiple top-level columns for idx, field in enumerate(schema): df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType)) return df.select([field.name for field in schema]) def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'): write_query = df.writeStream \ .outputMode(output_mode) \ .trigger(processingTime=processing_time) \ .format("console") \ .option("truncate", False) \ .start() return write_query # pyspark.sql.streaming.StreamingQuery def sink_memory(df, query_name, query_template): query_df = df \ .writeStream \ .queryName(query_name)
\ .format("memory") \ .start() query_str = query_template.format(table_name=query_name) query_results = spark.sql(query_str) return query_results, query_df def sink_kafka(df, topic): write_query = df.writeStream \ .format("kafka") \ .option("kafka.bootstrap.servers", "localhost:9092,broker:29092") \ .outputMode('complete') \ .option("topic", topic) \ .option("checkpointLocation", "checkpoint") \ .start() return write_query def prepare_df_to_kafka_sink(df, value_columns, key_column=None): columns = df.columns df = df.withColumn("value", F.concat_ws(', ', *value_columns)) if key_column: df = df.withColumnRenamed(key_column, "key") df = df.withColumn("key", df.key.cast('string')) return df.select(['key', 'value']) def op_groupby(df, column_names): df_aggregation = df.groupBy(column_names).count() return df_aggregation def op_windowed_groupby(df, window_duration, slide_duration): df_windowed_aggregation = df.groupBy( F.window(timeColumn=df.tpep_pickup_datetime, windowDuration=window_duration, slideDuration=slide_duration), df.vendor_id ).count() return df_windowed_aggregation if __name__ == "__main__": spark = SparkSession.builder.appName('streaming-examples').getOrCreate() spark.sparkContext.setLogLevel('WARN') # read_streaming data df_consume_stream = read_from_kafka(consume_topic=CONSUME_TOPIC_RIDES_CSV) print(df_consume_stream.printSchema()) # parse streaming data df_rides = parse_ride_from_kafka_message(df_consume_stream, RIDE_SCHEMA) print(df_rides.printSchema()) sink_console(df_rides, output_mode='append') df_trip_count_by_vendor_id = op_groupby(df_rides, ['vendor_id']) df_trip_count_by_pickup_date_vendor_id = op_windowed_groupby(df_rides, window_duration="10 minutes", slide_duration='5 minutes') # write the output out to the console for debugging / testing sink_console(df_trip_count_by_vendor_id) # write the output to the kafka topic df_trip_count_messages = prepare_df_to_kafka_sink(df=df_trip_count_by_pickup_date_vendor_id, value_columns=['count'], 
key_column='vendor_id') kafka_sink_query = sink_kafka(df=df_trip_count_messages, topic=TOPIC_WINDOWED_VENDOR_ID_COUNT) spark.streams.awaitAnyTermination() ================================================ FILE: 07-streaming/extras/python/streams-example/redpanda/README.md ================================================ # Running PySpark Streaming with Redpanda ### 1. Prerequisite It is important to create the network and volume as described in this document. Please ensure that your volume and network are created correctly. ```bash docker volume ls # should list hadoop-distributed-file-system docker network ls # should list kafka-spark-network ``` ### 2. Create Docker Network & Volume If you have not followed any of the other examples and the `ls` commands above show no output, create them now. ```bash # Create Network docker network create kafka-spark-network # Create Volume docker volume create --name=hadoop-distributed-file-system ``` ### Running Producer and Consumer ```bash # Run producer python producer.py # Run consumer with default settings python consumer.py # Run consumer for a specific topic python consumer.py --topic ``` ### Running Streaming Script The spark-submit script ensures the necessary jars are installed before running streaming.py. ```bash ./spark-submit.sh streaming.py ``` ### Additional Resources - [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streaming-programming-guide) - [Structured Streaming + Kafka Integration](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#structured-streaming-kafka-integration-guide-kafka-broker-versio) ================================================ FILE: 07-streaming/extras/python/streams-example/redpanda/consumer.py ================================================ import argparse from typing import Dict, List from kafka import KafkaConsumer from settings import BOOTSTRAP_SERVERS, CONSUME_TOPIC_RIDES_CSV class
RideCSVConsumer: def __init__(self, props: Dict): self.consumer = KafkaConsumer(**props) def consume_from_kafka(self, topics: List[str]): self.consumer.subscribe(topics=topics) print('Consuming from Kafka started') print('Available topics to consume: ', self.consumer.subscription()) while True: try: # SIGINT can't be handled when polling, limit timeout to 1 second. msg = self.consumer.poll(1.0) if msg is None or msg == {}: continue for msg_key, msg_values in msg.items(): for msg_val in msg_values: print(f'Key:{msg_val.key}-type({type(msg_val.key)}), ' f'Value:{msg_val.value}-type({type(msg_val.value)})') except KeyboardInterrupt: break self.consumer.close() if __name__ == '__main__': parser = argparse.ArgumentParser(description='Kafka Consumer') parser.add_argument('--topic', type=str, default=CONSUME_TOPIC_RIDES_CSV) args = parser.parse_args() topic = args.topic config = { 'bootstrap_servers': [BOOTSTRAP_SERVERS], 'auto_offset_reset': 'earliest', 'enable_auto_commit': True, 'key_deserializer': lambda key: int(key.decode('utf-8')), 'value_deserializer': lambda value: value.decode('utf-8'), 'group_id': 'consumer.group.id.csv-example.1', } csv_consumer = RideCSVConsumer(props=config) csv_consumer.consume_from_kafka(topics=[topic]) ================================================ FILE: 07-streaming/extras/python/streams-example/redpanda/docker-compose.yaml ================================================ version: '3.7' volumes: shared-workspace: name: "hadoop-distributed-file-system" driver: local networks: default: name: kafka-spark-network external: true services: # Redpanda cluster redpanda-1: image: docker.redpanda.com/redpandadata/redpanda:v23.2.26 container_name: redpanda-1 command: - redpanda - start - --smp - '1' - --reserve-memory - 0M - --overprovisioned - --node-id - '1' - --kafka-addr - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092 - --advertise-kafka-addr - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092 - --pandaproxy-addr - 
PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082 - --advertise-pandaproxy-addr - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082 - --rpc-addr - 0.0.0.0:33145 - --advertise-rpc-addr - redpanda-1:33145 ports: # - 8081:8081 - 8082:8082 - 9092:9092 - 9644:9644 - 28082:28082 - 29092:29092 volumes: - shared-workspace:/opt/workspace # Want a two node Redpanda cluster? Uncomment this block :) redpanda-2: image: docker.redpanda.com/redpandadata/redpanda:v23.1.1 container_name: redpanda-2 command: - redpanda - start - --smp - '1' - --reserve-memory - 0M - --overprovisioned - --node-id - '2' - --seeds - redpanda-1:33145 - --kafka-addr - PLAINTEXT://0.0.0.0:29093,OUTSIDE://0.0.0.0:9093 - --advertise-kafka-addr - PLAINTEXT://redpanda-2:29093,OUTSIDE://localhost:9093 - --pandaproxy-addr - PLAINTEXT://0.0.0.0:28083,OUTSIDE://0.0.0.0:8083 - --advertise-pandaproxy-addr - PLAINTEXT://redpanda-2:28083,OUTSIDE://localhost:8083 - --rpc-addr - 0.0.0.0:33146 - --advertise-rpc-addr - redpanda-2:33146 ports: - 8083:8083 - 9093:9093 volumes: - shared-workspace:/opt/workspace redpanda-console: image: docker.redpanda.com/redpandadata/console:v2.2.2 container_name: redpanda-console entrypoint: /bin/sh command: -c "echo \"$$CONSOLE_CONFIG_FILE\" > /tmp/config.yml; /app/console" environment: CONFIG_FILEPATH: /tmp/config.yml CONSOLE_CONFIG_FILE: | kafka: brokers: ["redpanda-1:29092"] schemaRegistry: enabled: false redpanda: adminApi: enabled: true urls: ["http://redpanda-1:9644"] connect: enabled: false ports: - 8080:8080 depends_on: - redpanda-1 volumes: - shared-workspace:/opt/workspace ================================================ FILE: 07-streaming/extras/python/streams-example/redpanda/producer.py ================================================ import csv from time import sleep from typing import Dict from kafka import KafkaProducer from settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, PRODUCE_TOPIC_RIDES_CSV def delivery_report(err, msg): if err is not None: print("Delivery 
failed for record {}: {}".format(msg.key(), err)) return print('Record {} successfully produced to {} [{}] at offset {}'.format( msg.key(), msg.topic(), msg.partition(), msg.offset())) class RideCSVProducer: def __init__(self, props: Dict): self.producer = KafkaProducer(**props) # self.producer = Producer(producer_props) @staticmethod def read_records(resource_path: str): records, ride_keys = [], [] i = 0 with open(resource_path, 'r') as f: reader = csv.reader(f) header = next(reader) # skip the header for row in reader: # vendor_id, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, payment_type, total_amount records.append(f'{row[0]}, {row[1]}, {row[2]}, {row[3]}, {row[4]}, {row[9]}, {row[16]}') ride_keys.append(str(row[0])) i += 1 if i == 5: break return zip(ride_keys, records) def publish(self, topic: str, records: [str, str]): for key_value in records: key, value = key_value try: self.producer.send(topic=topic, key=key, value=value) print(f"Producing record for ") except KeyboardInterrupt: break except Exception as e: print(f"Exception while producing record - {value}: {e}") self.producer.flush() sleep(1) if __name__ == "__main__": config = { 'bootstrap_servers': [BOOTSTRAP_SERVERS], 'key_serializer': lambda x: x.encode('utf-8'), 'value_serializer': lambda x: x.encode('utf-8') } producer = RideCSVProducer(props=config) ride_records = producer.read_records(resource_path=INPUT_DATA_PATH) print(ride_records) producer.publish(topic=PRODUCE_TOPIC_RIDES_CSV, records=ride_records) ================================================ FILE: 07-streaming/extras/python/streams-example/redpanda/settings.py ================================================ import pyspark.sql.types as T INPUT_DATA_PATH = '../../resources/rides.csv' BOOTSTRAP_SERVERS = 'localhost:9092' TOPIC_WINDOWED_VENDOR_ID_COUNT = 'vendor_counts_windowed' PRODUCE_TOPIC_RIDES_CSV = CONSUME_TOPIC_RIDES_CSV = 'rides_csv' RIDE_SCHEMA = T.StructType( [T.StructField("vendor_id", T.IntegerType()), T.StructField('tpep_pickup_datetime',
T.TimestampType()), T.StructField('tpep_dropoff_datetime', T.TimestampType()), T.StructField("passenger_count", T.IntegerType()), T.StructField("trip_distance", T.FloatType()), T.StructField("payment_type", T.IntegerType()), T.StructField("total_amount", T.FloatType()), ]) ================================================ FILE: 07-streaming/extras/python/streams-example/redpanda/spark-submit.sh ================================================ # Submit Python code to SparkMaster if [ $# -lt 1 ] then echo "Usage: $0 [ executor-memory ]" echo "(specify memory in string format such as \"512M\" or \"2G\")" exit 1 fi PYTHON_JOB=$1 if [ -z $2 ] then EXEC_MEM="1G" else EXEC_MEM=$2 fi spark-submit --master spark://localhost:7077 --num-executors 2 \ --executor-memory $EXEC_MEM --executor-cores 1 \ --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.spark:spark-avro_2.12:3.5.1,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.5.1 \ $PYTHON_JOB ================================================ FILE: 07-streaming/extras/python/streams-example/redpanda/streaming-notebook.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "id": "c4419168-c0e6-4a65-b56e-8454c42060ac", "metadata": { "tags": [] }, "source": [ "### 0. Spark Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "32bd7cdd-8504-4a54-a461-244bf7878d2a", "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1 pyspark-shell'" ] }, { "cell_type": "code", "execution_count": 2, "id": "3aab2a7e-a685-4925-9c9a-b5adf201af77", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "your 131072x1 screen size is bogus. 
expect trouble\n", "24/03/11 00:28:48 WARN Utils: Your hostname, Cinders resolves to a loopback address: 127.0.1.1; using 172.17.156.62 instead (on interface eth0)\n", "24/03/11 00:28:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ ":: loading settings :: url = jar:file:/home/ellabelle/spark/spark-3.5.1-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Ivy Default Cache set to: /home/ellabelle/.ivy2/cache\n", "The jars for the packages stored in: /home/ellabelle/.ivy2/jars\n", "org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\n", "org.apache.spark#spark-avro_2.12 added as a dependency\n", ":: resolving dependencies :: org.apache.spark#spark-submit-parent-0c8615d6-fa19-46ec-942b-46e9fe0012aa;1.0\n", "\tconfs: [default]\n", "\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 in central\n", "\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 in central\n", "\tfound org.apache.kafka#kafka-clients;2.8.1 in central\n", "\tfound org.lz4#lz4-java;1.8.0 in central\n", "\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\n", "\tfound org.slf4j#slf4j-api;1.7.32 in central\n", "\tfound org.apache.hadoop#hadoop-client-runtime;3.3.2 in central\n", "\tfound org.spark-project.spark#unused;1.0.0 in central\n", "\tfound org.apache.hadoop#hadoop-client-api;3.3.2 in central\n", "\tfound commons-logging#commons-logging;1.1.3 in central\n", "\tfound com.google.code.findbugs#jsr305;3.0.0 in central\n", "\tfound org.apache.commons#commons-pool2;2.11.1 in central\n", "\tfound org.apache.spark#spark-avro_2.12;3.3.1 in central\n", "\tfound org.tukaani#xz;1.8 in central\n", ":: resolution report :: resolve 328ms :: artifacts dl 13ms\n", "\t:: modules in use:\n", "\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\n", "\tcommons-logging#commons-logging;1.1.3 
from central in [default]\n", "\torg.apache.commons#commons-pool2;2.11.1 from central in [default]\n", "\torg.apache.hadoop#hadoop-client-api;3.3.2 from central in [default]\n", "\torg.apache.hadoop#hadoop-client-runtime;3.3.2 from central in [default]\n", "\torg.apache.kafka#kafka-clients;2.8.1 from central in [default]\n", "\torg.apache.spark#spark-avro_2.12;3.3.1 from central in [default]\n", "\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 from central in [default]\n", "\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 from central in [default]\n", "\torg.lz4#lz4-java;1.8.0 from central in [default]\n", "\torg.slf4j#slf4j-api;1.7.32 from central in [default]\n", "\torg.spark-project.spark#unused;1.0.0 from central in [default]\n", "\torg.tukaani#xz;1.8 from central in [default]\n", "\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\n", "\t---------------------------------------------------------------------\n", "\t| | modules || artifacts |\n", "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", "\t---------------------------------------------------------------------\n", "\t| default | 14 | 0 | 0 | 0 || 14 | 0 |\n", "\t---------------------------------------------------------------------\n", ":: retrieving :: org.apache.spark#spark-submit-parent-0c8615d6-fa19-46ec-942b-46e9fe0012aa\n", "\tconfs: [default]\n", "\t0 artifacts copied, 14 already retrieved (0kB/8ms)\n", "24/03/11 00:28:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", "24/03/11 00:28:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.\n" ] } ], "source": [ "from pyspark.sql import SparkSession\n", "import pyspark.sql.types as T\n", "import pyspark.sql.functions as F\n", "\n", "spark = SparkSession \\\n", " .builder \\\n", " .appName(\"Spark-Notebook\") \\\n", " .getOrCreate()" ] }, { "cell_type": "markdown", "id": "6f4b62fa-b3ce-4a1b-a1f4-2ed332a0d55a", "metadata": { "tags": [] }, "source": [ "### 1. Reading from Kafka Stream\n", "\n", "through `readStream`" ] }, { "cell_type": "markdown", "id": "f491fa45-4471-4bc5-92f7-48081f687140", "metadata": {}, "source": [ "#### 1.1 Raw Kafka Stream" ] }, { "cell_type": "code", "execution_count": 3, "id": "82c25cb2-2599-4f9b-8849-967fbb604a44", "metadata": { "tags": [] }, "outputs": [], "source": [ "# default for startingOffsets is \"latest\"\n", "df_kafka_raw = spark \\\n", " .readStream \\\n", " .format(\"kafka\") \\\n", " .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n", " .option(\"subscribe\", \"rides_csv\") \\\n", " .option(\"startingOffsets\", \"earliest\") \\\n", " .option(\"checkpointLocation\", \"checkpoint\") \\\n", " .load()" ] }, { "cell_type": "code", "execution_count": 4, "id": "d9149ccd-69b2-4f5b-afc0-43567673c634", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- key: binary (nullable = true)\n", " |-- value: binary (nullable = true)\n", " |-- topic: string (nullable = true)\n", " |-- partition: integer (nullable = true)\n", " |-- offset: long (nullable = true)\n", " |-- timestamp: timestamp (nullable = true)\n", " |-- timestampType: integer (nullable = true)\n", "\n" ] } ], "source": [ "df_kafka_raw.printSchema()" ] }, { "cell_type": "markdown", "id": "62e5e753-89c7-460f-a8be-16868ce5c680", "metadata": { "tags": [] }, "source": [ "#### 1.2 Encoded Kafka Stream" ] }, { "cell_type": "code", "execution_count": 5, "id": "0b745eed-7d74-421e-8e4b-c8343fda4de3", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_kafka_encoded = 
df_kafka_raw.selectExpr(\"CAST(key AS STRING)\",\"CAST(value AS STRING)\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "6839addc-c7c0-4117-8c9c-d2cd59cbf136", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- key: string (nullable = true)\n", " |-- value: string (nullable = true)\n", "\n" ] } ], "source": [ "df_kafka_encoded.printSchema()" ] }, { "cell_type": "markdown", "id": "6749c4de-6f80-4b91-b2b8-b2968c761d75", "metadata": {}, "source": [ "#### 1.3 Structured Streaming DataFrame" ] }, { "cell_type": "code", "execution_count": 7, "id": "ca20ae37-49f0-421f-9859-73fac8d4ca45", "metadata": { "tags": [] }, "outputs": [], "source": [ "def parse_ride_from_kafka_message(df_raw, schema):\n", " \"\"\" take a Spark Streaming df and parse value col based on schema, return streaming df cols in schema \"\"\"\n", " assert df_raw.isStreaming is True, \"DataFrame doesn't receive streaming data\"\n", "\n", " df = df_raw.selectExpr(\"CAST(key AS STRING)\", \"CAST(value AS STRING)\")\n", "\n", " # split attributes to nested array in one Column\n", " col = F.split(df['value'], ', ')\n", "\n", " # expand col to multiple top-level columns\n", " for idx, field in enumerate(schema):\n", " df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))\n", " return df.select([field.name for field in schema])" ] }, { "cell_type": "code", "execution_count": 8, "id": "e1737bd0-146f-4ee2-a70f-a4657af5bbc6", "metadata": { "tags": [] }, "outputs": [], "source": [ "ride_schema = T.StructType(\n", " [T.StructField(\"vendor_id\", T.IntegerType()),\n", " T.StructField('tpep_pickup_datetime', T.TimestampType()),\n", " T.StructField('tpep_dropoff_datetime', T.TimestampType()),\n", " T.StructField(\"passenger_count\", T.IntegerType()),\n", " T.StructField(\"trip_distance\", T.FloatType()),\n", " T.StructField(\"payment_type\", T.IntegerType()),\n", " T.StructField(\"total_amount\", T.FloatType()),\n", " ])" ] }, {
"cell_type": "code", "execution_count": 9, "id": "ae2ce896-f54b-4166-b01f-b5532ab292fe", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_rides = parse_ride_from_kafka_message(df_raw=df_kafka_raw, schema=ride_schema)" ] }, { "cell_type": "code", "execution_count": 10, "id": "cd848228-97c5-4325-8457-97f35e533cd8", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- vendor_id: integer (nullable = true)\n", " |-- tpep_pickup_datetime: timestamp (nullable = true)\n", " |-- tpep_dropoff_datetime: timestamp (nullable = true)\n", " |-- passenger_count: integer (nullable = true)\n", " |-- trip_distance: float (nullable = true)\n", " |-- payment_type: integer (nullable = true)\n", " |-- total_amount: float (nullable = true)\n", "\n" ] } ], "source": [ "df_rides.printSchema()" ] }, { "cell_type": "code", "execution_count": null, "id": "f1cdb53e-f477-4137-8412-6915d7772125", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_rides.show()" ] }, { "cell_type": "markdown", "id": "60277fdc-2797-4b23-9ecf-956b76db5778", "metadata": { "tags": [] }, "source": [ "### 2 Sink Operation & Streaming Query\n", "\n", "through `writeStream`\n", "\n", "---\n", "**Output Sinks**\n", "- File Sink: stores the output to the directory\n", "- Kafka Sink: stores the output to one or more topics in Kafka\n", "- Foreach Sink:\n", "- (for debugging) Console Sink, Memory Sink\n", "\n", "Further details can be found in [Output Sinks](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)\n", "\n", "---\n", "There are three types of **Output Modes**:\n", "- Complete: The whole Result Table will be outputted to the sink after every trigger. 
This is supported for aggregation queries.\n", "- Append (default): Only new rows added to the Result Table since the last trigger are output to the sink\n", "- Update: Only rows updated in the Result Table since the last trigger are output to the sink\n", "\n", "The supported [Output Modes](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes) differ based on the set of transformations applied to the streaming data.\n", "\n", "--- \n", "**Triggers**\n", "\n", "The [trigger settings](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers) of a streaming query define the timing of streaming data processing. Spark Structured Streaming supports micro-batch processing, and you can select one of the following trigger options based on your requirements.\n", "\n", "- default-micro-batch-mode\n", "- fixed-interval-micro-batch-mode\n", "- one-time-micro-batch-mode\n", "- available-now-micro-batch-mode\n" ] }, { "cell_type": "markdown", "id": "02ca9b08-aa61-46cd-b946-4457ce2cdf5d", "metadata": { "tags": [] }, "source": [ "#### Console and Memory Sink" ] }, { "cell_type": "code", "execution_count": 11, "id": "74c72469-4c37-417c-a866-a1c1ef75ae8b", "metadata": { "tags": [] }, "outputs": [], "source": [ "def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):\n", " write_query = df.writeStream \\\n", " .outputMode(output_mode) \\\n", " .trigger(processingTime=processing_time) \\\n", " .format(\"console\") \\\n", " .option(\"truncate\", False) \\\n", " .start()\n", " return write_query # pyspark.sql.streaming.StreamingQuery" ] }, { "cell_type": "code", "execution_count": 12, "id": "d866c7ba-f8e9-475d-830a-50ffb2c5472b", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "24/03/11 00:30:31 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-2b8e8845-1369-4653-8c23-c45a98e194a9.
If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\n", "24/03/11 00:30:31 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "24/03/11 00:30:32 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n", "24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\n", "24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\n", "24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\n", "24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\n", "24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\n", "24/03/11 00:30:33 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n", " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "-------------------------------------------\n", "Batch: 0\n", "-------------------------------------------\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "|1 |2020-07-01 00:25:32 |2020-07-01 00:33:39 |1 |1.5 |2 |9.3 |\n", "|1 |2020-07-01 00:03:19 |2020-07-01 00:25:43 |1 |9.5 |1 |27.8 |\n", "|2 |2020-07-01 00:15:11 
|2020-07-01 00:29:24 |1 |5.85 |2 |22.3 |\n", "|2 |2020-07-01 00:30:49 |2020-07-01 00:38:26 |1 |1.9 |1 |14.16 |\n", "|2 |2020-07-01 00:31:26 |2020-07-01 00:38:02 |1 |1.25 |2 |7.8 |\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "\n" ] } ], "source": [ "write_query = sink_console(df_rides, output_mode='append')" ] }, { "cell_type": "code", "execution_count": 13, "id": "a9bfa73f-a8cc-4988-a8cf-bf31ee6c449c", "metadata": { "tags": [] }, "outputs": [], "source": [ "def sink_memory(df, query_name, query_template):\n", " write_query = df \\\n", " .writeStream \\\n", " .queryName(query_name) \\\n", " .format('memory') \\\n", " .start()\n", " query_str = query_template.format(table_name=query_name)\n", " query_results = spark.sql(query_str)\n", " return write_query, query_results" ] }, { "cell_type": "code", "execution_count": 14, "id": "b31d0b76-e917-44e7-a14d-f9ce6901c23a", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "24/03/11 00:30:42 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-c7621425-b7fb-47fe-8b42-791c9c5d3186. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. 
Important to know deleting temp checkpoint folder is best effort.\n", "24/03/11 00:30:42 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\n", "24/03/11 00:30:43 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n", "24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\n", "24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\n", "24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\n", "24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\n", "24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "24/03/11 00:30:43 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n", " \r" ] } ], "source": [ "query_name = 'vendor_id_counts'\n", "query_template = 'select count(distinct(vendor_id)) from {table_name}'\n", "write_query, df_vendor_id_counts = sink_memory(df=df_rides, query_name=query_name, query_template=query_template)" ] }, { "cell_type": "code", "execution_count": 15, "id": "4ba56111-83bf-4028-ac65-565e0190f310", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "{'message': 'Waiting for data to arrive',\n", " 'isDataAvailable': False,\n", " 'isTriggerActive': False}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "-------------------------------------------\n", "Batch: 1\n", 
"-------------------------------------------\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "|1 |2020-07-01 00:25:32 |2020-07-01 00:33:39 |1 |1.5 |2 |9.3 |\n", "|1 |2020-07-01 00:03:19 |2020-07-01 00:25:43 |1 |9.5 |1 |27.8 |\n", "|2 |2020-07-01 00:15:11 |2020-07-01 00:29:24 |1 |5.85 |2 |22.3 |\n", "|2 |2020-07-01 00:30:49 |2020-07-01 00:38:26 |1 |1.9 |1 |14.16 |\n", "|2 |2020-07-01 00:31:26 |2020-07-01 00:38:02 |1 |1.25 |2 |7.8 |\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "\n", "-------------------------------------------\n", "Batch: 2\n", "-------------------------------------------\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "|1 |2020-07-01 00:25:32 |2020-07-01 00:33:39 |1 |1.5 |2 |9.3 |\n", "|1 |2020-07-01 00:03:19 |2020-07-01 00:25:43 |1 |9.5 |1 |27.8 |\n", "|2 |2020-07-01 00:15:11 |2020-07-01 00:29:24 |1 |5.85 |2 |22.3 |\n", "|2 |2020-07-01 00:30:49 |2020-07-01 00:38:26 |1 |1.9 |1 |14.16 |\n", "|2 |2020-07-01 00:31:26 |2020-07-01 00:38:02 |1 |1.25 |2 |7.8 |\n", "+---------+--------------------+---------------------+---------------+-------------+------------+------------+\n", "\n" ] } ], "source": [ "print(type(write_query)) # pyspark.sql.streaming.StreamingQuery\n", "write_query.status" ] }, { "cell_type": "code", "execution_count": 18, "id": "7cc37bda-9cfa-402b-9d42-a6ba5271476b", 
"metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------------------+\n", "|count(DISTINCT vendor_id)|\n", "+-------------------------+\n", "| 2|\n", "+-------------------------+\n", "\n" ] } ], "source": [ "df_vendor_id_counts.show()" ] }, { "cell_type": "code", "execution_count": 19, "id": "88862ca9-4d89-487e-987f-08a2b9e83efe", "metadata": { "tags": [] }, "outputs": [], "source": [ "write_query.stop()" ] }, { "cell_type": "markdown", "id": "443d4041-06db-4a4a-89c1-348848cc7ca8", "metadata": { "tags": [] }, "source": [ "#### Kafka Sink\n", "\n", "To write stream results to a Kafka topic, the streaming DataFrame must contain at least a column named `value` (string or binary); a `key` column is optional.\n", "\n", "Therefore, before starting `writeStream` in `kafka` format, the DataFrame needs to be transformed accordingly.\n", "\n", "More information on the data structure the Kafka sink expects is available [here](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka)\n" ] }, { "cell_type": "code", "execution_count": 20, "id": "8b08a013-d039-41cf-94fd-a1a57571d25f", "metadata": { "scrolled": true, "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:34:36 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:37 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:39 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:40 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:41 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:42 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:43 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:44 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:45 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:34:46 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:35:35 WARN KafkaOffsetReaderAdmin: Error in attempt 1 getting Kafka offsets: \n", "java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088535000, tries=1, nextAllowedTryMs=1710088535101) timed out at 1710088535001 after 1 attempt(s)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\n", "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\n", "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\n", "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\n", "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\n", "\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\n", "\tat 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n", "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\n", "\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\n", "\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n", "\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n", "\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n", "\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\n", "\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\n", "\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\n", "\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\n", "\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\n", "\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\n", "\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n", "\tat 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\n", "\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\n", "Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088535000, tries=1, nextAllowedTryMs=1710088535101) timed out at 1710088535001 after 1 attempt(s)\n", "Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\n", "24/03/11 00:35:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:35:36 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n", "24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\n", "24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\n", "24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\n", "24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\n", "24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\n", "24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:35:37 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:35:37 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:36:32 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:33 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:35 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:35 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:36 WARN KafkaOffsetReaderAdmin: Error in attempt 2 getting Kafka offsets: \n", "java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088596058, tries=1, nextAllowedTryMs=1710088596159) timed out at 1710088596059 after 1 attempt(s)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\n", "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\n", "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\n", "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\n", "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\n", "\tat 
org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\n", "\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\n", "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n", "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\n", "\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\n", "\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n", "\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n", "\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n", "\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\n", "\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\n", "\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\n", "\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\n", "\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\n", "\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\n", "\tat 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\n", "\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n", "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\n", "\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat 
org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\n", "Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088596058, tries=1, nextAllowedTryMs=1710088596159) timed out at 1710088596059 after 1 attempt(s)\n", "Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\n", "24/03/11 00:36:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:37 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\n", "24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\n", "24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\n", "24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\n", "24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\n", "24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\n", "24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:38 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:38 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:40 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:41 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:42 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:43 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:44 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:45 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:46 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:36:47 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:47 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:48 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:49 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:50 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:52 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:52 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:54 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:55 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:56 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:36:57 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:58 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:36:59 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:00 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:01 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:02 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:03 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:05 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:05 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:06 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:37:08 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:09 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:10 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:11 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:12 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:13 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:14 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:15 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:16 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:17 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:37:18 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:19 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:20 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:21 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:22 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:23 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:24 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:25 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:26 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:27 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:37:28 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:29 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:31 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:32 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:33 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:34 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:35 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\n", "24/03/11 00:37:36 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:37:37 WARN KafkaOffsetReaderAdmin: Error in attempt 3 getting Kafka offsets: \n", "java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\n", "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\n", "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\n", "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\n", "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\n", "\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\n", "\tat 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n", "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\n", "\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\n", "\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n", "\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n", "\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n", "\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\n", "\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\n", "\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\n", "\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\n", "\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\n", "\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\n", "\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n", "\tat 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\n", "\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\n", "Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\n", "Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\n", "24/03/11 00:37:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\n", "24/03/11 00:37:38 ERROR MicroBatchExecution: Query [id = 4dfba771-eff7-49e7-a3ff-f1aa03a6e840, runId = 0f86ad02-1d50-487a-97c7-72790d8857d8] terminated with error\n", "java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\n", "\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\n", "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\n", "\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\n", "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\n", "\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\n", "\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\n", "\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\n", "\tat 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n", "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\n", "\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\n", "\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n", "\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n", "\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n", "\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\n", "\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\n", "\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\n", "\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\n", "\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\n", "\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\n", "\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\n", "\tat 
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\n", "\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\n", "\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\n", "\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\n", "\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\n", "\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\n", "Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\n", "Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. 
Call: describeTopics\n" ] } ], "source": [ "def prepare_dataframe_to_kafka_sink(df, value_columns, key_column=None):\n", " columns = df.columns\n", " df = df.withColumn(\"value\", F.concat_ws(', ',*value_columns)) \n", " if key_column:\n", " df = df.withColumnRenamed(key_column,\"key\")\n", " df = df.withColumn(\"key\",df.key.cast('string'))\n", " return df.select(['key', 'value'])\n", " \n", "def sink_kafka(df, topic, output_mode='append'):\n", " write_query = df.writeStream \\\n", " .format(\"kafka\") \\\n", " .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n", " .outputMode(output_mode) \\\n", " .option(\"topic\", topic) \\\n", " .option(\"checkpointLocation\", \"checkpoint\") \\\n", " .start()\n", " return write_query" ] }, { "cell_type": "markdown", "id": "e4cb2140-9f2e-4914-b74c-be4c18cdbe8a", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "id": "63abe115-879c-4863-97d3-b22cda7f7469", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 5 }

================================================
FILE: 07-streaming/extras/python/streams-example/redpanda/streaming.py
================================================
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

from settings import RIDE_SCHEMA, CONSUME_TOPIC_RIDES_CSV, TOPIC_WINDOWED_VENDOR_ID_COUNT


def read_from_kafka(consume_topic: str):
    # Spark Streaming DataFrame, connect to Kafka topic served at host in bootstrap.servers option
    df_stream = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092,broker:29092") \
        .option("subscribe", consume_topic) \
        .option("startingOffsets", "earliest") \
        .option("checkpointLocation", "checkpoint") \
        .load()
    return df_stream


def parse_ride_from_kafka_message(df, schema):
    """ take a Spark Streaming df and parse the value col based on the given schema, return streaming df with the cols in schema """
    assert df.isStreaming is True, "DataFrame doesn't receive streaming data"

    df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    # split attributes to nested array in one Column
    col = F.split(df['value'], ', ')

    # expand col to multiple top-level columns
    for idx, field in enumerate(schema):
        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))
    return df.select([field.name for field in schema])


def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):
    write_query = df.writeStream \
        .outputMode(output_mode) \
        .trigger(processingTime=processing_time) \
        .format("console") \
        .option("truncate", False) \
        .start()
    return write_query  # pyspark.sql.streaming.StreamingQuery


def sink_memory(df, query_name, query_template):
    query_df = df \
        .writeStream \
        .queryName(query_name) \
        .format("memory") \
        .start()
    query_str = query_template.format(table_name=query_name)
    query_results = spark.sql(query_str)
    return query_results, query_df


def sink_kafka(df, topic):
    write_query = df.writeStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092,broker:29092") \
        .outputMode('complete') \
        .option("topic", topic) \
        .option("checkpointLocation", "checkpoint") \
        .start()
    return write_query


def prepare_df_to_kafka_sink(df, value_columns, key_column=None):
    columns = df.columns

    df = df.withColumn("value", F.concat_ws(', ', *value_columns))
    if key_column:
        df = df.withColumnRenamed(key_column, "key")
        df = df.withColumn("key", df.key.cast('string'))
    return df.select(['key', 'value'])


def op_groupby(df, column_names):
    df_aggregation = df.groupBy(column_names).count()
    return df_aggregation


def op_windowed_groupby(df, window_duration, slide_duration):
    df_windowed_aggregation = df.groupBy(
        F.window(timeColumn=df.tpep_pickup_datetime, windowDuration=window_duration, slideDuration=slide_duration),
        df.vendor_id
    ).count()
    return df_windowed_aggregation


if __name__ == "__main__":
    spark = SparkSession.builder.appName('streaming-examples').getOrCreate()
    spark.sparkContext.setLogLevel('WARN')

    # read_streaming data
    df_consume_stream = read_from_kafka(consume_topic=CONSUME_TOPIC_RIDES_CSV)
    # printSchema() prints the schema itself and returns None, so it is not wrapped in print()
    df_consume_stream.printSchema()

    # parse streaming data
    df_rides = parse_ride_from_kafka_message(df_consume_stream, RIDE_SCHEMA)
    df_rides.printSchema()

    sink_console(df_rides, output_mode='append')

    df_trip_count_by_vendor_id = op_groupby(df_rides, ['vendor_id'])
    df_trip_count_by_pickup_date_vendor_id = op_windowed_groupby(
        df_rides, window_duration="10 minutes", slide_duration='5 minutes'
    )

    # write the output out to the console for debugging / testing
    sink_console(df_trip_count_by_vendor_id)

    # write the output to the kafka topic
    df_trip_count_messages = prepare_df_to_kafka_sink(
        df=df_trip_count_by_pickup_date_vendor_id,
        value_columns=['count'],
        key_column='vendor_id'
    )
    kafka_sink_query = sink_kafka(df=df_trip_count_messages, topic=TOPIC_WINDOWED_VENDOR_ID_COUNT)

    spark.streams.awaitAnyTermination()

================================================
FILE: 07-streaming/theory/README.md
================================================
# Kafka theory (optional)

Video lectures covering Kafka concepts, with code examples in Java.

Code: [java/kafka_examples](java/kafka_examples)

## Stream processing

- [7.0.1 Introduction](https://youtu.be/hfvju3iOIP0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=67)
- [7.0.2 What is stream processing](https://youtu.be/WxTxKGcfA-k&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=68)
- [7.3 What is Kafka?](https://youtu.be/zPLZUDPi4AY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=69)
- [7.4 Confluent Cloud](https://youtu.be/ZnEZFEYKppw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=70)
- [7.5 Kafka producer consumer](https://youtu.be/aegTuyxX7Yg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=71)
- [7.6 Kafka configuration](https://youtu.be/SXQtWyRpMKs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=72)

Links:

- [Slides](https://docs.google.com/presentation/d/1bCtdCba8v1HxJ_uMm9pwjRUC-NAMeB-6nOG2ng3KujA/edit?usp=sharing)
- [Kafka Configuration Reference](https://docs.confluent.io/platform/current/installation/configuration/)
- [Confluent Cloud trial](https://www.confluent.io/confluent-cloud/tryfree/)

## Kafka Streams

- [7.7 Kafka stream basics](https://youtu.be/dUyA_63eRb0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=73)
- [7.8 Kafka stream join](https://youtu.be/NcpKlujh34Y&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=74)
- [7.9 Kafka stream testing](https://youtu.be/TNx5rmLY8Pk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=75)
- [7.10 Kafka stream windowing](https://youtu.be/r1OuLdwxbRc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=76)
- [7.11 Kafka ksqlDB and Connect](https://youtu.be/DziQ4a4tn9Y&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=77)
- [7.12 Kafka Schema registry](https://youtu.be/tBY_hBuyzwI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=78)

Links:

- [Slides](https://docs.google.com/presentation/d/1fVi9sFa7fL2ZW3ynS5MAZm0bRSZ4jO10fymPmrfTUjE/edit?usp=sharing)
- [Streams Concepts](https://docs.confluent.io/platform/current/streams/concepts.html)

================================================
FILE: 07-streaming/theory/java/kafka_examples/.gitignore
================================================
.gradle
bin
!src/main/resources/rides.csv

build/classes
build/generated
build/libs
build/reports
build/resources
build/test-results
build/tmp


================================================
FILE: 07-streaming/theory/java/kafka_examples/build/generated-main-avro-java/schemaregistry/RideRecord.java
================================================
/**
 * Autogenerated by Avro
 *
 * DO NOT EDIT DIRECTLY
 */
package schemaregistry;

import org.apache.avro.generic.GenericArray;
import org.apache.avro.specific.SpecificData;
import org.apache.avro.util.Utf8;
import org.apache.avro.message.BinaryMessageEncoder;
import org.apache.avro.message.BinaryMessageDecoder;
import org.apache.avro.message.SchemaStore;

@org.apache.avro.specific.AvroGenerated
public class RideRecord extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
  private static final long serialVersionUID = 6805437803204402942L;
  public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"RideRecord\",\"namespace\":\"schemaregistry\",\"fields\":[{\"name\":\"vendor_id\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"passenger_count\",\"type\":\"int\"},{\"name\":\"trip_distance\",\"type\":\"double\"}]}");
  public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }

  private static final SpecificData MODEL$ = new SpecificData();

  private static final BinaryMessageEncoder ENCODER =
      new BinaryMessageEncoder<>(MODEL$, SCHEMA$);

  private static final BinaryMessageDecoder DECODER =
      new BinaryMessageDecoder<>(MODEL$, SCHEMA$);

  /**
   * Return the BinaryMessageEncoder instance used by this class.
   * @return the message encoder used by this class
   */
  public static BinaryMessageEncoder getEncoder() {
    return ENCODER;
  }

  /**
   * Return the BinaryMessageDecoder instance used by this class.
* @return the message decoder used by this class */ public static BinaryMessageDecoder getDecoder() { return DECODER; } /** * Create a new BinaryMessageDecoder instance for this class that uses the specified {@link SchemaStore}. * @param resolver a {@link SchemaStore} used to find schemas by fingerprint * @return a BinaryMessageDecoder instance for this class backed by the given SchemaStore */ public static BinaryMessageDecoder createDecoder(SchemaStore resolver) { return new BinaryMessageDecoder<>(MODEL$, SCHEMA$, resolver); } /** * Serializes this RideRecord to a ByteBuffer. * @return a buffer holding the serialized data for this instance * @throws java.io.IOException if this instance could not be serialized */ public java.nio.ByteBuffer toByteBuffer() throws java.io.IOException { return ENCODER.encode(this); } /** * Deserializes a RideRecord from a ByteBuffer. * @param b a byte buffer holding serialized data for an instance of this class * @return a RideRecord instance decoded from the given buffer * @throws java.io.IOException if the given bytes could not be deserialized into an instance of this class */ public static RideRecord fromByteBuffer( java.nio.ByteBuffer b) throws java.io.IOException { return DECODER.decode(b); } private java.lang.String vendor_id; private int passenger_count; private double trip_distance; /** * Default constructor. Note that this does not initialize fields * to their default values from the schema. If that is desired then * one should use newBuilder(). */ public RideRecord() {} /** * All-args constructor. 
* @param vendor_id The new value for vendor_id * @param passenger_count The new value for passenger_count * @param trip_distance The new value for trip_distance */ public RideRecord(java.lang.String vendor_id, java.lang.Integer passenger_count, java.lang.Double trip_distance) { this.vendor_id = vendor_id; this.passenger_count = passenger_count; this.trip_distance = trip_distance; } @Override public org.apache.avro.specific.SpecificData getSpecificData() { return MODEL$; } @Override public org.apache.avro.Schema getSchema() { return SCHEMA$; } // Used by DatumWriter. Applications should not call. @Override public java.lang.Object get(int field$) { switch (field$) { case 0: return vendor_id; case 1: return passenger_count; case 2: return trip_distance; default: throw new IndexOutOfBoundsException("Invalid index: " + field$); } } // Used by DatumReader. Applications should not call. @Override @SuppressWarnings(value="unchecked") public void put(int field$, java.lang.Object value$) { switch (field$) { case 0: vendor_id = value$ != null ? value$.toString() : null; break; case 1: passenger_count = (java.lang.Integer)value$; break; case 2: trip_distance = (java.lang.Double)value$; break; default: throw new IndexOutOfBoundsException("Invalid index: " + field$); } } /** * Gets the value of the 'vendor_id' field. * @return The value of the 'vendor_id' field. */ public java.lang.String getVendorId() { return vendor_id; } /** * Sets the value of the 'vendor_id' field. * @param value the value to set. */ public void setVendorId(java.lang.String value) { this.vendor_id = value; } /** * Gets the value of the 'passenger_count' field. * @return The value of the 'passenger_count' field. */ public int getPassengerCount() { return passenger_count; } /** * Sets the value of the 'passenger_count' field. * @param value the value to set. */ public void setPassengerCount(int value) { this.passenger_count = value; } /** * Gets the value of the 'trip_distance' field. 
* @return The value of the 'trip_distance' field. */ public double getTripDistance() { return trip_distance; } /** * Sets the value of the 'trip_distance' field. * @param value the value to set. */ public void setTripDistance(double value) { this.trip_distance = value; } /** * Creates a new RideRecord RecordBuilder. * @return A new RideRecord RecordBuilder */ public static schemaregistry.RideRecord.Builder newBuilder() { return new schemaregistry.RideRecord.Builder(); } /** * Creates a new RideRecord RecordBuilder by copying an existing Builder. * @param other The existing builder to copy. * @return A new RideRecord RecordBuilder */ public static schemaregistry.RideRecord.Builder newBuilder(schemaregistry.RideRecord.Builder other) { if (other == null) { return new schemaregistry.RideRecord.Builder(); } else { return new schemaregistry.RideRecord.Builder(other); } } /** * Creates a new RideRecord RecordBuilder by copying an existing RideRecord instance. * @param other The existing instance to copy. * @return A new RideRecord RecordBuilder */ public static schemaregistry.RideRecord.Builder newBuilder(schemaregistry.RideRecord other) { if (other == null) { return new schemaregistry.RideRecord.Builder(); } else { return new schemaregistry.RideRecord.Builder(other); } } /** * RecordBuilder for RideRecord instances. */ @org.apache.avro.specific.AvroGenerated public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase implements org.apache.avro.data.RecordBuilder { private java.lang.String vendor_id; private int passenger_count; private double trip_distance; /** Creates a new Builder */ private Builder() { super(SCHEMA$, MODEL$); } /** * Creates a Builder by copying an existing Builder. * @param other The existing Builder to copy. 
*/ private Builder(schemaregistry.RideRecord.Builder other) { super(other); if (isValidValue(fields()[0], other.vendor_id)) { this.vendor_id = data().deepCopy(fields()[0].schema(), other.vendor_id); fieldSetFlags()[0] = other.fieldSetFlags()[0]; } if (isValidValue(fields()[1], other.passenger_count)) { this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count); fieldSetFlags()[1] = other.fieldSetFlags()[1]; } if (isValidValue(fields()[2], other.trip_distance)) { this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance); fieldSetFlags()[2] = other.fieldSetFlags()[2]; } } /** * Creates a Builder by copying an existing RideRecord instance * @param other The existing instance to copy. */ private Builder(schemaregistry.RideRecord other) { super(SCHEMA$, MODEL$); if (isValidValue(fields()[0], other.vendor_id)) { this.vendor_id = data().deepCopy(fields()[0].schema(), other.vendor_id); fieldSetFlags()[0] = true; } if (isValidValue(fields()[1], other.passenger_count)) { this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count); fieldSetFlags()[1] = true; } if (isValidValue(fields()[2], other.trip_distance)) { this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance); fieldSetFlags()[2] = true; } } /** * Gets the value of the 'vendor_id' field. * @return The value. */ public java.lang.String getVendorId() { return vendor_id; } /** * Sets the value of the 'vendor_id' field. * @param value The value of 'vendor_id'. * @return This builder. */ public schemaregistry.RideRecord.Builder setVendorId(java.lang.String value) { validate(fields()[0], value); this.vendor_id = value; fieldSetFlags()[0] = true; return this; } /** * Checks whether the 'vendor_id' field has been set. * @return True if the 'vendor_id' field has been set, false otherwise. */ public boolean hasVendorId() { return fieldSetFlags()[0]; } /** * Clears the value of the 'vendor_id' field. * @return This builder. 
*/ public schemaregistry.RideRecord.Builder clearVendorId() { vendor_id = null; fieldSetFlags()[0] = false; return this; } /** * Gets the value of the 'passenger_count' field. * @return The value. */ public int getPassengerCount() { return passenger_count; } /** * Sets the value of the 'passenger_count' field. * @param value The value of 'passenger_count'. * @return This builder. */ public schemaregistry.RideRecord.Builder setPassengerCount(int value) { validate(fields()[1], value); this.passenger_count = value; fieldSetFlags()[1] = true; return this; } /** * Checks whether the 'passenger_count' field has been set. * @return True if the 'passenger_count' field has been set, false otherwise. */ public boolean hasPassengerCount() { return fieldSetFlags()[1]; } /** * Clears the value of the 'passenger_count' field. * @return This builder. */ public schemaregistry.RideRecord.Builder clearPassengerCount() { fieldSetFlags()[1] = false; return this; } /** * Gets the value of the 'trip_distance' field. * @return The value. */ public double getTripDistance() { return trip_distance; } /** * Sets the value of the 'trip_distance' field. * @param value The value of 'trip_distance'. * @return This builder. */ public schemaregistry.RideRecord.Builder setTripDistance(double value) { validate(fields()[2], value); this.trip_distance = value; fieldSetFlags()[2] = true; return this; } /** * Checks whether the 'trip_distance' field has been set. * @return True if the 'trip_distance' field has been set, false otherwise. */ public boolean hasTripDistance() { return fieldSetFlags()[2]; } /** * Clears the value of the 'trip_distance' field. * @return This builder. */ public schemaregistry.RideRecord.Builder clearTripDistance() { fieldSetFlags()[2] = false; return this; } @Override @SuppressWarnings("unchecked") public RideRecord build() { try { RideRecord record = new RideRecord(); record.vendor_id = fieldSetFlags()[0] ? 
this.vendor_id : (java.lang.String) defaultValue(fields()[0]); record.passenger_count = fieldSetFlags()[1] ? this.passenger_count : (java.lang.Integer) defaultValue(fields()[1]); record.trip_distance = fieldSetFlags()[2] ? this.trip_distance : (java.lang.Double) defaultValue(fields()[2]); return record; } catch (org.apache.avro.AvroMissingFieldException e) { throw e; } catch (java.lang.Exception e) { throw new org.apache.avro.AvroRuntimeException(e); } } } @SuppressWarnings("unchecked") private static final org.apache.avro.io.DatumWriter WRITER$ = (org.apache.avro.io.DatumWriter)MODEL$.createDatumWriter(SCHEMA$); @Override public void writeExternal(java.io.ObjectOutput out) throws java.io.IOException { WRITER$.write(this, SpecificData.getEncoder(out)); } @SuppressWarnings("unchecked") private static final org.apache.avro.io.DatumReader READER$ = (org.apache.avro.io.DatumReader)MODEL$.createDatumReader(SCHEMA$); @Override public void readExternal(java.io.ObjectInput in) throws java.io.IOException { READER$.read(this, SpecificData.getDecoder(in)); } @Override protected boolean hasCustomCoders() { return true; } @Override public void customEncode(org.apache.avro.io.Encoder out) throws java.io.IOException { out.writeString(this.vendor_id); out.writeInt(this.passenger_count); out.writeDouble(this.trip_distance); } @Override public void customDecode(org.apache.avro.io.ResolvingDecoder in) throws java.io.IOException { org.apache.avro.Schema.Field[] fieldOrder = in.readFieldOrderIfDiff(); if (fieldOrder == null) { this.vendor_id = in.readString(); this.passenger_count = in.readInt(); this.trip_distance = in.readDouble(); } else { for (int i = 0; i < 3; i++) { switch (fieldOrder[i].pos()) { case 0: this.vendor_id = in.readString(); break; case 1: this.passenger_count = in.readInt(); break; case 2: this.trip_distance = in.readDouble(); break; default: throw new java.io.IOException("Corrupt ResolvingDecoder."); } } } } } ================================================ FILE: 
07-streaming/theory/java/kafka_examples/build/generated-main-avro-java/schemaregistry/RideRecordCompatible.java ================================================ /** * Autogenerated by Avro * * DO NOT EDIT DIRECTLY */ package schemaregistry; import org.apache.avro.generic.GenericArray; import org.apache.avro.specific.SpecificData; import org.apache.avro.util.Utf8; import org.apache.avro.message.BinaryMessageEncoder; import org.apache.avro.message.BinaryMessageDecoder; import org.apache.avro.message.SchemaStore; @org.apache.avro.specific.AvroGenerated public class RideRecordCompatible extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord { private static final long serialVersionUID = 7163300507090021229L; public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"RideRecordCompatible\",\"namespace\":\"schemaregistry\",\"fields\":[{\"name\":\"vendorId\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"}},{\"name\":\"passenger_count\",\"type\":\"int\"},{\"name\":\"trip_distance\",\"type\":\"double\"},{\"name\":\"pu_location_id\",\"type\":[\"null\",\"long\"],\"default\":null}]}"); public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; } private static final SpecificData MODEL$ = new SpecificData(); private static final BinaryMessageEncoder ENCODER = new BinaryMessageEncoder<>(MODEL$, SCHEMA$); private static final BinaryMessageDecoder DECODER = new BinaryMessageDecoder<>(MODEL$, SCHEMA$); /** * Return the BinaryMessageEncoder instance used by this class. * @return the message encoder used by this class */ public static BinaryMessageEncoder getEncoder() { return ENCODER; } /** * Return the BinaryMessageDecoder instance used by this class. 
* @return the message decoder used by this class */ public static BinaryMessageDecoder getDecoder() { return DECODER; } /** * Create a new BinaryMessageDecoder instance for this class that uses the specified {@link SchemaStore}. * @param resolver a {@link SchemaStore} used to find schemas by fingerprint * @return a BinaryMessageDecoder instance for this class backed by the given SchemaStore */ public static BinaryMessageDecoder createDecoder(SchemaStore resolver) { return new BinaryMessageDecoder<>(MODEL$, SCHEMA$, resolver); } /** * Serializes this RideRecordCompatible to a ByteBuffer. * @return a buffer holding the serialized data for this instance * @throws java.io.IOException if this instance could not be serialized */ public java.nio.ByteBuffer toByteBuffer() throws java.io.IOException { return ENCODER.encode(this); } /** * Deserializes a RideRecordCompatible from a ByteBuffer. * @param b a byte buffer holding serialized data for an instance of this class * @return a RideRecordCompatible instance decoded from the given buffer * @throws java.io.IOException if the given bytes could not be deserialized into an instance of this class */ public static RideRecordCompatible fromByteBuffer( java.nio.ByteBuffer b) throws java.io.IOException { return DECODER.decode(b); } private java.lang.String vendorId; private int passenger_count; private double trip_distance; private java.lang.Long pu_location_id; /** * Default constructor. Note that this does not initialize fields * to their default values from the schema. If that is desired then * one should use newBuilder(). */ public RideRecordCompatible() {} /** * All-args constructor. 
* @param vendorId The new value for vendorId * @param passenger_count The new value for passenger_count * @param trip_distance The new value for trip_distance * @param pu_location_id The new value for pu_location_id */ public RideRecordCompatible(java.lang.String vendorId, java.lang.Integer passenger_count, java.lang.Double trip_distance, java.lang.Long pu_location_id) { this.vendorId = vendorId; this.passenger_count = passenger_count; this.trip_distance = trip_distance; this.pu_location_id = pu_location_id; } @Override public org.apache.avro.specific.SpecificData getSpecificData() { return MODEL$; } @Override public org.apache.avro.Schema getSchema() { return SCHEMA$; } // Used by DatumWriter. Applications should not call. @Override public java.lang.Object get(int field$) { switch (field$) { case 0: return vendorId; case 1: return passenger_count; case 2: return trip_distance; case 3: return pu_location_id; default: throw new IndexOutOfBoundsException("Invalid index: " + field$); } } // Used by DatumReader. Applications should not call. @Override @SuppressWarnings(value="unchecked") public void put(int field$, java.lang.Object value$) { switch (field$) { case 0: vendorId = value$ != null ? value$.toString() : null; break; case 1: passenger_count = (java.lang.Integer)value$; break; case 2: trip_distance = (java.lang.Double)value$; break; case 3: pu_location_id = (java.lang.Long)value$; break; default: throw new IndexOutOfBoundsException("Invalid index: " + field$); } } /** * Gets the value of the 'vendorId' field. * @return The value of the 'vendorId' field. */ public java.lang.String getVendorId() { return vendorId; } /** * Sets the value of the 'vendorId' field. * @param value the value to set. */ public void setVendorId(java.lang.String value) { this.vendorId = value; } /** * Gets the value of the 'passenger_count' field. * @return The value of the 'passenger_count' field. 
*/ public int getPassengerCount() { return passenger_count; } /** * Sets the value of the 'passenger_count' field. * @param value the value to set. */ public void setPassengerCount(int value) { this.passenger_count = value; } /** * Gets the value of the 'trip_distance' field. * @return The value of the 'trip_distance' field. */ public double getTripDistance() { return trip_distance; } /** * Sets the value of the 'trip_distance' field. * @param value the value to set. */ public void setTripDistance(double value) { this.trip_distance = value; } /** * Gets the value of the 'pu_location_id' field. * @return The value of the 'pu_location_id' field. */ public java.lang.Long getPuLocationId() { return pu_location_id; } /** * Sets the value of the 'pu_location_id' field. * @param value the value to set. */ public void setPuLocationId(java.lang.Long value) { this.pu_location_id = value; } /** * Creates a new RideRecordCompatible RecordBuilder. * @return A new RideRecordCompatible RecordBuilder */ public static schemaregistry.RideRecordCompatible.Builder newBuilder() { return new schemaregistry.RideRecordCompatible.Builder(); } /** * Creates a new RideRecordCompatible RecordBuilder by copying an existing Builder. * @param other The existing builder to copy. * @return A new RideRecordCompatible RecordBuilder */ public static schemaregistry.RideRecordCompatible.Builder newBuilder(schemaregistry.RideRecordCompatible.Builder other) { if (other == null) { return new schemaregistry.RideRecordCompatible.Builder(); } else { return new schemaregistry.RideRecordCompatible.Builder(other); } } /** * Creates a new RideRecordCompatible RecordBuilder by copying an existing RideRecordCompatible instance. * @param other The existing instance to copy. 
* @return A new RideRecordCompatible RecordBuilder */ public static schemaregistry.RideRecordCompatible.Builder newBuilder(schemaregistry.RideRecordCompatible other) { if (other == null) { return new schemaregistry.RideRecordCompatible.Builder(); } else { return new schemaregistry.RideRecordCompatible.Builder(other); } } /** * RecordBuilder for RideRecordCompatible instances. */ @org.apache.avro.specific.AvroGenerated public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase implements org.apache.avro.data.RecordBuilder { private java.lang.String vendorId; private int passenger_count; private double trip_distance; private java.lang.Long pu_location_id; /** Creates a new Builder */ private Builder() { super(SCHEMA$, MODEL$); } /** * Creates a Builder by copying an existing Builder. * @param other The existing Builder to copy. */ private Builder(schemaregistry.RideRecordCompatible.Builder other) { super(other); if (isValidValue(fields()[0], other.vendorId)) { this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId); fieldSetFlags()[0] = other.fieldSetFlags()[0]; } if (isValidValue(fields()[1], other.passenger_count)) { this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count); fieldSetFlags()[1] = other.fieldSetFlags()[1]; } if (isValidValue(fields()[2], other.trip_distance)) { this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance); fieldSetFlags()[2] = other.fieldSetFlags()[2]; } if (isValidValue(fields()[3], other.pu_location_id)) { this.pu_location_id = data().deepCopy(fields()[3].schema(), other.pu_location_id); fieldSetFlags()[3] = other.fieldSetFlags()[3]; } } /** * Creates a Builder by copying an existing RideRecordCompatible instance * @param other The existing instance to copy. 
*/ private Builder(schemaregistry.RideRecordCompatible other) { super(SCHEMA$, MODEL$); if (isValidValue(fields()[0], other.vendorId)) { this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId); fieldSetFlags()[0] = true; } if (isValidValue(fields()[1], other.passenger_count)) { this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count); fieldSetFlags()[1] = true; } if (isValidValue(fields()[2], other.trip_distance)) { this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance); fieldSetFlags()[2] = true; } if (isValidValue(fields()[3], other.pu_location_id)) { this.pu_location_id = data().deepCopy(fields()[3].schema(), other.pu_location_id); fieldSetFlags()[3] = true; } } /** * Gets the value of the 'vendorId' field. * @return The value. */ public java.lang.String getVendorId() { return vendorId; } /** * Sets the value of the 'vendorId' field. * @param value The value of 'vendorId'. * @return This builder. */ public schemaregistry.RideRecordCompatible.Builder setVendorId(java.lang.String value) { validate(fields()[0], value); this.vendorId = value; fieldSetFlags()[0] = true; return this; } /** * Checks whether the 'vendorId' field has been set. * @return True if the 'vendorId' field has been set, false otherwise. */ public boolean hasVendorId() { return fieldSetFlags()[0]; } /** * Clears the value of the 'vendorId' field. * @return This builder. */ public schemaregistry.RideRecordCompatible.Builder clearVendorId() { vendorId = null; fieldSetFlags()[0] = false; return this; } /** * Gets the value of the 'passenger_count' field. * @return The value. */ public int getPassengerCount() { return passenger_count; } /** * Sets the value of the 'passenger_count' field. * @param value The value of 'passenger_count'. * @return This builder. 
*/ public schemaregistry.RideRecordCompatible.Builder setPassengerCount(int value) { validate(fields()[1], value); this.passenger_count = value; fieldSetFlags()[1] = true; return this; } /** * Checks whether the 'passenger_count' field has been set. * @return True if the 'passenger_count' field has been set, false otherwise. */ public boolean hasPassengerCount() { return fieldSetFlags()[1]; } /** * Clears the value of the 'passenger_count' field. * @return This builder. */ public schemaregistry.RideRecordCompatible.Builder clearPassengerCount() { fieldSetFlags()[1] = false; return this; } /** * Gets the value of the 'trip_distance' field. * @return The value. */ public double getTripDistance() { return trip_distance; } /** * Sets the value of the 'trip_distance' field. * @param value The value of 'trip_distance'. * @return This builder. */ public schemaregistry.RideRecordCompatible.Builder setTripDistance(double value) { validate(fields()[2], value); this.trip_distance = value; fieldSetFlags()[2] = true; return this; } /** * Checks whether the 'trip_distance' field has been set. * @return True if the 'trip_distance' field has been set, false otherwise. */ public boolean hasTripDistance() { return fieldSetFlags()[2]; } /** * Clears the value of the 'trip_distance' field. * @return This builder. */ public schemaregistry.RideRecordCompatible.Builder clearTripDistance() { fieldSetFlags()[2] = false; return this; } /** * Gets the value of the 'pu_location_id' field. * @return The value. */ public java.lang.Long getPuLocationId() { return pu_location_id; } /** * Sets the value of the 'pu_location_id' field. * @param value The value of 'pu_location_id'. * @return This builder. */ public schemaregistry.RideRecordCompatible.Builder setPuLocationId(java.lang.Long value) { validate(fields()[3], value); this.pu_location_id = value; fieldSetFlags()[3] = true; return this; } /** * Checks whether the 'pu_location_id' field has been set. 
* @return True if the 'pu_location_id' field has been set, false otherwise. */ public boolean hasPuLocationId() { return fieldSetFlags()[3]; } /** * Clears the value of the 'pu_location_id' field. * @return This builder. */ public schemaregistry.RideRecordCompatible.Builder clearPuLocationId() { pu_location_id = null; fieldSetFlags()[3] = false; return this; } @Override @SuppressWarnings("unchecked") public RideRecordCompatible build() { try { RideRecordCompatible record = new RideRecordCompatible(); record.vendorId = fieldSetFlags()[0] ? this.vendorId : (java.lang.String) defaultValue(fields()[0]); record.passenger_count = fieldSetFlags()[1] ? this.passenger_count : (java.lang.Integer) defaultValue(fields()[1]); record.trip_distance = fieldSetFlags()[2] ? this.trip_distance : (java.lang.Double) defaultValue(fields()[2]); record.pu_location_id = fieldSetFlags()[3] ? this.pu_location_id : (java.lang.Long) defaultValue(fields()[3]); return record; } catch (org.apache.avro.AvroMissingFieldException e) { throw e; } catch (java.lang.Exception e) { throw new org.apache.avro.AvroRuntimeException(e); } } } @SuppressWarnings("unchecked") private static final org.apache.avro.io.DatumWriter WRITER$ = (org.apache.avro.io.DatumWriter)MODEL$.createDatumWriter(SCHEMA$); @Override public void writeExternal(java.io.ObjectOutput out) throws java.io.IOException { WRITER$.write(this, SpecificData.getEncoder(out)); } @SuppressWarnings("unchecked") private static final org.apache.avro.io.DatumReader READER$ = (org.apache.avro.io.DatumReader)MODEL$.createDatumReader(SCHEMA$); @Override public void readExternal(java.io.ObjectInput in) throws java.io.IOException { READER$.read(this, SpecificData.getDecoder(in)); } @Override protected boolean hasCustomCoders() { return true; } @Override public void customEncode(org.apache.avro.io.Encoder out) throws java.io.IOException { out.writeString(this.vendorId); out.writeInt(this.passenger_count); out.writeDouble(this.trip_distance); if 
(this.pu_location_id == null) { out.writeIndex(0); out.writeNull(); } else { out.writeIndex(1); out.writeLong(this.pu_location_id); } } @Override public void customDecode(org.apache.avro.io.ResolvingDecoder in) throws java.io.IOException { org.apache.avro.Schema.Field[] fieldOrder = in.readFieldOrderIfDiff(); if (fieldOrder == null) { this.vendorId = in.readString(); this.passenger_count = in.readInt(); this.trip_distance = in.readDouble(); if (in.readIndex() != 1) { in.readNull(); this.pu_location_id = null; } else { this.pu_location_id = in.readLong(); } } else { for (int i = 0; i < 4; i++) { switch (fieldOrder[i].pos()) { case 0: this.vendorId = in.readString(); break; case 1: this.passenger_count = in.readInt(); break; case 2: this.trip_distance = in.readDouble(); break; case 3: if (in.readIndex() != 1) { in.readNull(); this.pu_location_id = null; } else { this.pu_location_id = in.readLong(); } break; default: throw new java.io.IOException("Corrupt ResolvingDecoder."); } } } } } ================================================ FILE: 07-streaming/theory/java/kafka_examples/build/generated-main-avro-java/schemaregistry/RideRecordNoneCompatible.java ================================================ /** * Autogenerated by Avro * * DO NOT EDIT DIRECTLY */ package schemaregistry; import org.apache.avro.generic.GenericArray; import org.apache.avro.specific.SpecificData; import org.apache.avro.util.Utf8; import org.apache.avro.message.BinaryMessageEncoder; import org.apache.avro.message.BinaryMessageDecoder; import org.apache.avro.message.SchemaStore; @org.apache.avro.specific.AvroGenerated public class RideRecordNoneCompatible extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord { private static final long serialVersionUID = -4618980179396772493L; public static final org.apache.avro.Schema SCHEMA$ = new 
org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"RideRecordNoneCompatible\",\"namespace\":\"schemaregistry\",\"fields\":[{\"name\":\"vendorId\",\"type\":\"int\"},{\"name\":\"passenger_count\",\"type\":\"int\"},{\"name\":\"trip_distance\",\"type\":\"double\"}]}"); public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; } private static final SpecificData MODEL$ = new SpecificData(); private static final BinaryMessageEncoder ENCODER = new BinaryMessageEncoder<>(MODEL$, SCHEMA$); private static final BinaryMessageDecoder DECODER = new BinaryMessageDecoder<>(MODEL$, SCHEMA$); /** * Return the BinaryMessageEncoder instance used by this class. * @return the message encoder used by this class */ public static BinaryMessageEncoder getEncoder() { return ENCODER; } /** * Return the BinaryMessageDecoder instance used by this class. * @return the message decoder used by this class */ public static BinaryMessageDecoder getDecoder() { return DECODER; } /** * Create a new BinaryMessageDecoder instance for this class that uses the specified {@link SchemaStore}. * @param resolver a {@link SchemaStore} used to find schemas by fingerprint * @return a BinaryMessageDecoder instance for this class backed by the given SchemaStore */ public static BinaryMessageDecoder createDecoder(SchemaStore resolver) { return new BinaryMessageDecoder<>(MODEL$, SCHEMA$, resolver); } /** * Serializes this RideRecordNoneCompatible to a ByteBuffer. * @return a buffer holding the serialized data for this instance * @throws java.io.IOException if this instance could not be serialized */ public java.nio.ByteBuffer toByteBuffer() throws java.io.IOException { return ENCODER.encode(this); } /** * Deserializes a RideRecordNoneCompatible from a ByteBuffer. 
* @param b a byte buffer holding serialized data for an instance of this class * @return a RideRecordNoneCompatible instance decoded from the given buffer * @throws java.io.IOException if the given bytes could not be deserialized into an instance of this class */ public static RideRecordNoneCompatible fromByteBuffer( java.nio.ByteBuffer b) throws java.io.IOException { return DECODER.decode(b); } private int vendorId; private int passenger_count; private double trip_distance; /** * Default constructor. Note that this does not initialize fields * to their default values from the schema. If that is desired then * one should use newBuilder(). */ public RideRecordNoneCompatible() {} /** * All-args constructor. * @param vendorId The new value for vendorId * @param passenger_count The new value for passenger_count * @param trip_distance The new value for trip_distance */ public RideRecordNoneCompatible(java.lang.Integer vendorId, java.lang.Integer passenger_count, java.lang.Double trip_distance) { this.vendorId = vendorId; this.passenger_count = passenger_count; this.trip_distance = trip_distance; } @Override public org.apache.avro.specific.SpecificData getSpecificData() { return MODEL$; } @Override public org.apache.avro.Schema getSchema() { return SCHEMA$; } // Used by DatumWriter. Applications should not call. @Override public java.lang.Object get(int field$) { switch (field$) { case 0: return vendorId; case 1: return passenger_count; case 2: return trip_distance; default: throw new IndexOutOfBoundsException("Invalid index: " + field$); } } // Used by DatumReader. Applications should not call. 
@Override @SuppressWarnings(value="unchecked") public void put(int field$, java.lang.Object value$) { switch (field$) { case 0: vendorId = (java.lang.Integer)value$; break; case 1: passenger_count = (java.lang.Integer)value$; break; case 2: trip_distance = (java.lang.Double)value$; break; default: throw new IndexOutOfBoundsException("Invalid index: " + field$); } } /** * Gets the value of the 'vendorId' field. * @return The value of the 'vendorId' field. */ public int getVendorId() { return vendorId; } /** * Sets the value of the 'vendorId' field. * @param value the value to set. */ public void setVendorId(int value) { this.vendorId = value; } /** * Gets the value of the 'passenger_count' field. * @return The value of the 'passenger_count' field. */ public int getPassengerCount() { return passenger_count; } /** * Sets the value of the 'passenger_count' field. * @param value the value to set. */ public void setPassengerCount(int value) { this.passenger_count = value; } /** * Gets the value of the 'trip_distance' field. * @return The value of the 'trip_distance' field. */ public double getTripDistance() { return trip_distance; } /** * Sets the value of the 'trip_distance' field. * @param value the value to set. */ public void setTripDistance(double value) { this.trip_distance = value; } /** * Creates a new RideRecordNoneCompatible RecordBuilder. * @return A new RideRecordNoneCompatible RecordBuilder */ public static schemaregistry.RideRecordNoneCompatible.Builder newBuilder() { return new schemaregistry.RideRecordNoneCompatible.Builder(); } /** * Creates a new RideRecordNoneCompatible RecordBuilder by copying an existing Builder. * @param other The existing builder to copy. 
* @return A new RideRecordNoneCompatible RecordBuilder */ public static schemaregistry.RideRecordNoneCompatible.Builder newBuilder(schemaregistry.RideRecordNoneCompatible.Builder other) { if (other == null) { return new schemaregistry.RideRecordNoneCompatible.Builder(); } else { return new schemaregistry.RideRecordNoneCompatible.Builder(other); } } /** * Creates a new RideRecordNoneCompatible RecordBuilder by copying an existing RideRecordNoneCompatible instance. * @param other The existing instance to copy. * @return A new RideRecordNoneCompatible RecordBuilder */ public static schemaregistry.RideRecordNoneCompatible.Builder newBuilder(schemaregistry.RideRecordNoneCompatible other) { if (other == null) { return new schemaregistry.RideRecordNoneCompatible.Builder(); } else { return new schemaregistry.RideRecordNoneCompatible.Builder(other); } } /** * RecordBuilder for RideRecordNoneCompatible instances. */ @org.apache.avro.specific.AvroGenerated public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase implements org.apache.avro.data.RecordBuilder { private int vendorId; private int passenger_count; private double trip_distance; /** Creates a new Builder */ private Builder() { super(SCHEMA$, MODEL$); } /** * Creates a Builder by copying an existing Builder. * @param other The existing Builder to copy. 
*/ private Builder(schemaregistry.RideRecordNoneCompatible.Builder other) { super(other); if (isValidValue(fields()[0], other.vendorId)) { this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId); fieldSetFlags()[0] = other.fieldSetFlags()[0]; } if (isValidValue(fields()[1], other.passenger_count)) { this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count); fieldSetFlags()[1] = other.fieldSetFlags()[1]; } if (isValidValue(fields()[2], other.trip_distance)) { this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance); fieldSetFlags()[2] = other.fieldSetFlags()[2]; } } /** * Creates a Builder by copying an existing RideRecordNoneCompatible instance * @param other The existing instance to copy. */ private Builder(schemaregistry.RideRecordNoneCompatible other) { super(SCHEMA$, MODEL$); if (isValidValue(fields()[0], other.vendorId)) { this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId); fieldSetFlags()[0] = true; } if (isValidValue(fields()[1], other.passenger_count)) { this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count); fieldSetFlags()[1] = true; } if (isValidValue(fields()[2], other.trip_distance)) { this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance); fieldSetFlags()[2] = true; } } /** * Gets the value of the 'vendorId' field. * @return The value. */ public int getVendorId() { return vendorId; } /** * Sets the value of the 'vendorId' field. * @param value The value of 'vendorId'. * @return This builder. */ public schemaregistry.RideRecordNoneCompatible.Builder setVendorId(int value) { validate(fields()[0], value); this.vendorId = value; fieldSetFlags()[0] = true; return this; } /** * Checks whether the 'vendorId' field has been set. * @return True if the 'vendorId' field has been set, false otherwise. */ public boolean hasVendorId() { return fieldSetFlags()[0]; } /** * Clears the value of the 'vendorId' field. 
* @return This builder. */ public schemaregistry.RideRecordNoneCompatible.Builder clearVendorId() { fieldSetFlags()[0] = false; return this; } /** * Gets the value of the 'passenger_count' field. * @return The value. */ public int getPassengerCount() { return passenger_count; } /** * Sets the value of the 'passenger_count' field. * @param value The value of 'passenger_count'. * @return This builder. */ public schemaregistry.RideRecordNoneCompatible.Builder setPassengerCount(int value) { validate(fields()[1], value); this.passenger_count = value; fieldSetFlags()[1] = true; return this; } /** * Checks whether the 'passenger_count' field has been set. * @return True if the 'passenger_count' field has been set, false otherwise. */ public boolean hasPassengerCount() { return fieldSetFlags()[1]; } /** * Clears the value of the 'passenger_count' field. * @return This builder. */ public schemaregistry.RideRecordNoneCompatible.Builder clearPassengerCount() { fieldSetFlags()[1] = false; return this; } /** * Gets the value of the 'trip_distance' field. * @return The value. */ public double getTripDistance() { return trip_distance; } /** * Sets the value of the 'trip_distance' field. * @param value The value of 'trip_distance'. * @return This builder. */ public schemaregistry.RideRecordNoneCompatible.Builder setTripDistance(double value) { validate(fields()[2], value); this.trip_distance = value; fieldSetFlags()[2] = true; return this; } /** * Checks whether the 'trip_distance' field has been set. * @return True if the 'trip_distance' field has been set, false otherwise. */ public boolean hasTripDistance() { return fieldSetFlags()[2]; } /** * Clears the value of the 'trip_distance' field. * @return This builder. 
*/ public schemaregistry.RideRecordNoneCompatible.Builder clearTripDistance() { fieldSetFlags()[2] = false; return this; } @Override @SuppressWarnings("unchecked") public RideRecordNoneCompatible build() { try { RideRecordNoneCompatible record = new RideRecordNoneCompatible(); record.vendorId = fieldSetFlags()[0] ? this.vendorId : (java.lang.Integer) defaultValue(fields()[0]); record.passenger_count = fieldSetFlags()[1] ? this.passenger_count : (java.lang.Integer) defaultValue(fields()[1]); record.trip_distance = fieldSetFlags()[2] ? this.trip_distance : (java.lang.Double) defaultValue(fields()[2]); return record; } catch (org.apache.avro.AvroMissingFieldException e) { throw e; } catch (java.lang.Exception e) { throw new org.apache.avro.AvroRuntimeException(e); } } } @SuppressWarnings("unchecked") private static final org.apache.avro.io.DatumWriter WRITER$ = (org.apache.avro.io.DatumWriter)MODEL$.createDatumWriter(SCHEMA$); @Override public void writeExternal(java.io.ObjectOutput out) throws java.io.IOException { WRITER$.write(this, SpecificData.getEncoder(out)); } @SuppressWarnings("unchecked") private static final org.apache.avro.io.DatumReader READER$ = (org.apache.avro.io.DatumReader)MODEL$.createDatumReader(SCHEMA$); @Override public void readExternal(java.io.ObjectInput in) throws java.io.IOException { READER$.read(this, SpecificData.getDecoder(in)); } @Override protected boolean hasCustomCoders() { return true; } @Override public void customEncode(org.apache.avro.io.Encoder out) throws java.io.IOException { out.writeInt(this.vendorId); out.writeInt(this.passenger_count); out.writeDouble(this.trip_distance); } @Override public void customDecode(org.apache.avro.io.ResolvingDecoder in) throws java.io.IOException { org.apache.avro.Schema.Field[] fieldOrder = in.readFieldOrderIfDiff(); if (fieldOrder == null) { this.vendorId = in.readInt(); this.passenger_count = in.readInt(); this.trip_distance = in.readDouble(); } else { for (int i = 0; i < 3; i++) { switch 
(fieldOrder[i].pos()) {
        case 0:
          this.vendorId = in.readInt();
          break;

        case 1:
          this.passenger_count = in.readInt();
          break;

        case 2:
          this.trip_distance = in.readDouble();
          break;

        default:
          throw new java.io.IOException("Corrupt ResolvingDecoder.");
        }
      }
    }
  }
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/build.gradle
================================================
plugins {
    id 'java'
    id "com.github.davidmc24.gradle.plugin.avro" version "1.5.0"
}

group 'org.example'
version '1.0-SNAPSHOT'

repositories {
    mavenCentral()
    maven {
        url "https://packages.confluent.io/maven"
    }
}

dependencies {
    implementation 'org.apache.kafka:kafka-clients:3.3.1'
    implementation 'com.opencsv:opencsv:5.7.1'
    implementation 'io.confluent:kafka-json-serializer:7.3.1'
    implementation 'org.apache.kafka:kafka-streams:3.3.1'
    implementation 'io.confluent:kafka-avro-serializer:7.3.1'
    implementation 'io.confluent:kafka-schema-registry-client:7.3.1'
    implementation 'io.confluent:kafka-streams-avro-serde:7.3.1'
    implementation "org.apache.avro:avro:1.11.0"
    testImplementation 'org.junit.jupiter:junit-jupiter-api:5.8.1'
    testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.8.1'
    testImplementation 'org.apache.kafka:kafka-streams-test-utils:3.3.1'
}

sourceSets.main.java.srcDirs = ['build/generated-main-avro-java','src/main/java']

test {
    useJUnitPlatform()
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/gradle/wrapper/gradle-wrapper.properties
================================================
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-7.5.1-bin.zip
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists

================================================
FILE: 07-streaming/theory/java/kafka_examples/gradlew
================================================
#!/bin/sh

#
# Copyright © 2015-2021 the original authors.
# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # https://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. # ############################################################################## # # Gradle start up script for POSIX generated by Gradle. # # Important for running: # # (1) You need a POSIX-compliant shell to run this script. If your /bin/sh is # noncompliant, but you have some other compliant shell such as ksh or # bash, then to run this script, type that shell name before the whole # command line, like: # # ksh Gradle # # Busybox and similar reduced shells will NOT work, because this script # requires all of these POSIX shell features: # * functions; # * expansions «$var», «${var}», «${var:-default}», «${var+SET}», # «${var#prefix}», «${var%suffix}», and «$( cmd )»; # * compound commands having a testable exit status, especially «case»; # * various built-in commands including «command», «set», and «ulimit». # # Important for patching: # # (2) This script targets any POSIX shell, so it avoids extensions provided # by Bash, Ksh, etc; in particular arrays are avoided. # # The "traditional" practice of packing multiple parameters into a # space-separated string is a well documented source of bugs and security # problems, so this is (mostly) avoided, by progressively accumulating # options in "$@", and eventually passing that to Java. # # Where the inherited environment variables (DEFAULT_JVM_OPTS, JAVA_OPTS, # and GRADLE_OPTS) rely on word-splitting, this is performed explicitly; # see the in-line comments for details. 
# # There are tweaks for specific operating systems such as AIX, CygWin, # Darwin, MinGW, and NonStop. # # (3) This script is generated from the Groovy template # https://github.com/gradle/gradle/blob/master/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt # within the Gradle project. # # You can find Gradle at https://github.com/gradle/gradle/. # ############################################################################## # Attempt to set APP_HOME # Resolve links: $0 may be a link app_path=$0 # Need this for daisy-chained symlinks. while APP_HOME=${app_path%"${app_path##*/}"} # leaves a trailing /; empty if no leading path [ -h "$app_path" ] do ls=$( ls -ld "$app_path" ) link=${ls#*' -> '} case $link in #( /*) app_path=$link ;; #( *) app_path=$APP_HOME$link ;; esac done APP_HOME=$( cd "${APP_HOME:-./}" && pwd -P ) || exit APP_NAME="Gradle" APP_BASE_NAME=${0##*/} # Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script. DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"' # Use the maximum available, or set MAX_FD != -1 to use that value. MAX_FD=maximum warn () { echo "$*" } >&2 die () { echo echo "$*" echo exit 1 } >&2 # OS specific support (must be 'true' or 'false'). cygwin=false msys=false darwin=false nonstop=false case "$( uname )" in #( CYGWIN* ) cygwin=true ;; #( Darwin* ) darwin=true ;; #( MSYS* | MINGW* ) msys=true ;; #( NONSTOP* ) nonstop=true ;; esac CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar # Determine the Java command to use to start the JVM. if [ -n "$JAVA_HOME" ] ; then if [ -x "$JAVA_HOME/jre/sh/java" ] ; then # IBM's JDK on AIX uses strange locations for the executables JAVACMD=$JAVA_HOME/jre/sh/java else JAVACMD=$JAVA_HOME/bin/java fi if [ ! -x "$JAVACMD" ] ; then die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME Please set the JAVA_HOME variable in your environment to match the location of your Java installation." 
fi else JAVACMD=java which java >/dev/null 2>&1 || die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH. Please set the JAVA_HOME variable in your environment to match the location of your Java installation." fi # Increase the maximum file descriptors if we can. if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then case $MAX_FD in #( max*) MAX_FD=$( ulimit -H -n ) || warn "Could not query maximum file descriptor limit" esac case $MAX_FD in #( '' | soft) :;; #( *) ulimit -n "$MAX_FD" || warn "Could not set maximum file descriptor limit to $MAX_FD" esac fi # Collect all arguments for the java command, stacking in reverse order: # * args from the command line # * the main class name # * -classpath # * -D...appname settings # * --module-path (only if needed) # * DEFAULT_JVM_OPTS, JAVA_OPTS, and GRADLE_OPTS environment variables. # For Cygwin or MSYS, switch paths to Windows format before running java if "$cygwin" || "$msys" ; then APP_HOME=$( cygpath --path --mixed "$APP_HOME" ) CLASSPATH=$( cygpath --path --mixed "$CLASSPATH" ) JAVACMD=$( cygpath --unix "$JAVACMD" ) # Now convert the arguments - kludge to limit ourselves to /bin/sh for arg do if case $arg in #( -*) false ;; # don't mess with options #( /?*) t=${arg#/} t=/${t%%/*} # looks like a POSIX filepath [ -e "$t" ] ;; #( *) false ;; esac then arg=$( cygpath --path --ignore --mixed "$arg" ) fi # Roll the args list around exactly as many times as the number of # args, so each arg winds up back in the position where it started, but # possibly modified. # # NB: a `for` loop captures its iteration list before it begins, so # changing the positional parameters here affects neither the number of # iterations, nor the values presented in `arg`. 
shift # remove old arg set -- "$@" "$arg" # push replacement arg done fi # Collect all arguments for the java command; # * $DEFAULT_JVM_OPTS, $JAVA_OPTS, and $GRADLE_OPTS can contain fragments of # shell script including quotes and variable substitutions, so put them in # double quotes to make sure that they get re-expanded; and # * put everything else in single quotes, so that it's not re-expanded. set -- \ "-Dorg.gradle.appname=$APP_BASE_NAME" \ -classpath "$CLASSPATH" \ org.gradle.wrapper.GradleWrapperMain \ "$@" # Stop when "xargs" is not available. if ! command -v xargs >/dev/null 2>&1 then die "xargs is not available" fi # Use "xargs" to parse quoted args. # # With -n1 it outputs one arg per line, with the quotes and backslashes removed. # # In Bash we could simply go: # # readarray ARGS < <( xargs -n1 <<<"$var" ) && # set -- "${ARGS[@]}" "$@" # # but POSIX shell has neither arrays nor command substitution, so instead we # post-process each arg (as a line of input to sed) to backslash-escape any # character that might be a shell metacharacter, then use eval to reverse # that process (while maintaining the separation between arguments), and wrap # the whole thing up as a single "set" statement. # # This will of course break if any of these variables contains a newline or # an unmatched quote. # eval "set -- $( printf '%s\n' "$DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS" | xargs -n1 | sed ' s~[^-[:alnum:]+,./:=@_]~\\&~g; ' | tr '\n' ' ' )" '"$@"' exec "$JAVACMD" "$@" ================================================ FILE: 07-streaming/theory/java/kafka_examples/gradlew.bat ================================================ @rem @rem Copyright 2015 the original author or authors. @rem @rem Licensed under the Apache License, Version 2.0 (the "License"); @rem you may not use this file except in compliance with the License. 
@rem You may obtain a copy of the License at @rem @rem https://www.apache.org/licenses/LICENSE-2.0 @rem @rem Unless required by applicable law or agreed to in writing, software @rem distributed under the License is distributed on an "AS IS" BASIS, @rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. @rem See the License for the specific language governing permissions and @rem limitations under the License. @rem @if "%DEBUG%"=="" @echo off @rem ########################################################################## @rem @rem Gradle startup script for Windows @rem @rem ########################################################################## @rem Set local scope for the variables with windows NT shell if "%OS%"=="Windows_NT" setlocal set DIRNAME=%~dp0 if "%DIRNAME%"=="" set DIRNAME=. set APP_BASE_NAME=%~n0 set APP_HOME=%DIRNAME% @rem Resolve any "." and ".." in APP_HOME to make it shorter. for %%i in ("%APP_HOME%") do set APP_HOME=%%~fi @rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script. set DEFAULT_JVM_OPTS="-Xmx64m" "-Xms64m" @rem Find java.exe if defined JAVA_HOME goto findJavaFromJavaHome set JAVA_EXE=java.exe %JAVA_EXE% -version >NUL 2>&1 if %ERRORLEVEL% equ 0 goto execute echo. echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH. echo. echo Please set the JAVA_HOME variable in your environment to match the echo location of your Java installation. goto fail :findJavaFromJavaHome set JAVA_HOME=%JAVA_HOME:"=% set JAVA_EXE=%JAVA_HOME%/bin/java.exe if exist "%JAVA_EXE%" goto execute echo. echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME% echo. echo Please set the JAVA_HOME variable in your environment to match the echo location of your Java installation. 
goto fail

:execute
@rem Setup the command line

set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar

@rem Execute Gradle
"%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %*

:end
@rem End local scope for the variables with windows NT shell
if %ERRORLEVEL% equ 0 goto mainEnd

:fail
rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
rem the _cmd.exe /c_ return code!
set EXIT_CODE=%ERRORLEVEL%
if %EXIT_CODE% equ 0 set EXIT_CODE=1
if not ""=="%GRADLE_EXIT_CONSOLE%" exit %EXIT_CODE%
exit /b %EXIT_CODE%

:mainEnd
if "%OS%"=="Windows_NT" endlocal

:omega

================================================
FILE: 07-streaming/theory/java/kafka_examples/settings.gradle
================================================
pluginManagement {
    repositories {
        gradlePluginPortal()
        mavenCentral()
    }
}
rootProject.name = 'kafka_examples'

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/avro/rides.avsc
================================================
{
  "type": "record",
  "name":"RideRecord",
  "namespace": "schemaregistry",
  "fields":[
    {"name":"vendor_id","type":"string"},
    {"name":"passenger_count","type":"int"},
    {"name":"trip_distance","type":"double"}
  ]
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/avro/rides_compatible.avsc
================================================
{
  "type": "record",
  "name":"RideRecordCompatible",
  "namespace": "schemaregistry",
  "fields":[
    {"name":"vendorId","type":"string"},
    {"name":"passenger_count","type":"int"},
    {"name":"trip_distance","type":"double"},
    {"name":"pu_location_id", "type": [ "null", "long" ], "default": null}
  ]
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/avro/rides_non_compatible.avsc
================================================
{
  "type": "record",
  "name":"RideRecordNoneCompatible",
  "namespace": "schemaregistry",
  "fields":[
    {"name":"vendorId","type":"int"},
    {"name":"passenger_count","type":"int"},
    {"name":"trip_distance","type":"double"}
  ]
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/AvroProducer.java
================================================
package org.example;

import com.opencsv.CSVReader;
import com.opencsv.exceptions.CsvException;
import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.StreamsConfig;
import schemaregistry.RideRecord;

import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;

public class AvroProducer {

    private Properties props = new Properties();

    public AvroProducer() {
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
        props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "https://psrc-kk5gg.europe-west3.gcp.confluent.cloud");
        props.put("basic.auth.credentials.source", "USER_INFO");
        props.put("basic.auth.user.info", Secrets.SCHEMA_REGISTRY_KEY+":"+Secrets.SCHEMA_REGISTRY_SECRET);
    }

    public List<RideRecord> getRides() throws IOException, CsvException {
        var ridesStream = this.getClass().getResource("/rides.csv");
        var reader = new CSVReader(new FileReader(ridesStream.getFile()));
        // Skip the CSV header row
        reader.skip(1);

        return reader.readAll().stream().map(row ->
            RideRecord.newBuilder()
                    .setVendorId(row[0])
                    .setTripDistance(Double.parseDouble(row[4]))
                    .setPassengerCount(Integer.parseInt(row[3]))
                    .build()
                ).collect(Collectors.toList());
    }

    public void publishRides(List<RideRecord> rides) throws ExecutionException, InterruptedException {
        KafkaProducer<String, RideRecord> kafkaProducer = new KafkaProducer<>(props);
        for (RideRecord ride : rides) {
            var record = kafkaProducer.send(new ProducerRecord<>("rides_avro", String.valueOf(ride.getVendorId()), ride), (metadata, exception) -> {
                if (exception != null) {
                    System.out.println(exception.getMessage());
                }
            });
            // Block until the broker acknowledges the record, then print its offset
            System.out.println(record.get().offset());
            Thread.sleep(500);
        }
    }

    public static void main(String[] args) throws IOException, CsvException, ExecutionException, InterruptedException {
        var producer = new AvroProducer();
        var rideRecords = producer.getRides();
        producer.publishRides(rideRecords);
    }
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonConsumer.java
================================================
package org.example;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.example.data.Ride;

import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalUnit;
import java.util.List;
import java.util.Properties;
import
io.confluent.kafka.serializers.KafkaJsonDeserializerConfig;

public class JsonConsumer {

    private Properties props = new Properties();
    private KafkaConsumer<String, Ride> consumer;

    public JsonConsumer() {
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaJsonDeserializer");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "kafka_tutorial_example.jsonconsumer.v2");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(KafkaJsonDeserializerConfig.JSON_VALUE_TYPE, Ride.class);
        consumer = new KafkaConsumer<String, Ride>(props);
        consumer.subscribe(List.of("rides"));
    }

    public void consumeFromKafka() {
        System.out.println("Consuming from kafka started");
        var results = consumer.poll(Duration.of(1, ChronoUnit.SECONDS));
        var i = 0;
        // Keep polling while records arrive, and for at least 10 polls in total
        do {
            for(ConsumerRecord<String, Ride> result: results) {
                System.out.println(result.value().DOLocationID);
            }
            results = consumer.poll(Duration.of(1, ChronoUnit.SECONDS));
            System.out.println("RESULTS:::" + results.count());
            i++;
        }
        while(!results.isEmpty() || i < 10);
    }

    public static void main(String[] args) {
        JsonConsumer jsonConsumer = new JsonConsumer();
        jsonConsumer.consumeFromKafka();
    }
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonKStream.java
================================================
package org.example;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.example.customserdes.CustomSerdes;
import org.example.data.Ride;

import java.util.Properties;

public class JsonKStream {
    private Properties props = new Properties();

    public JsonKStream() {
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka_tutorial.kstream.count.plocation.v1");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
    }

    public Topology createTopology() {
        StreamsBuilder streamsBuilder = new StreamsBuilder();
        var ridesStream = streamsBuilder.stream("rides", Consumed.with(Serdes.String(), CustomSerdes.getSerde(Ride.class)));
        // Count records per message key and emit every update as a stream
        var puLocationCount = ridesStream.groupByKey().count().toStream();
        puLocationCount.to("rides-pulocation-count", Produced.with(Serdes.String(), Serdes.Long()));
        return streamsBuilder.build();
    }

    public void countPLocation() throws InterruptedException {
        var topology = createTopology();
        var kStreams = new KafkaStreams(topology, props);
        kStreams.start();
        while (kStreams.state() != KafkaStreams.State.RUNNING) {
            System.out.println(kStreams.state());
            Thread.sleep(1000);
        }
        System.out.println(kStreams.state());
        Runtime.getRuntime().addShutdownHook(new Thread(kStreams::close));
    }

    public static void main(String[] args) throws InterruptedException {
        var object = new JsonKStream();
        object.countPLocation();
    }
}

================================================
FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonKStreamJoins.java
================================================
package org.example;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler;
import org.apache.kafka.streams.kstream.*;
import org.example.customserdes.CustomSerdes;
import org.example.data.PickupLocation;
import org.example.data.Ride;
import org.example.data.VendorInfo;

import java.time.Duration;
import java.util.Optional;
import java.util.Properties;

public class JsonKStreamJoins {
    private Properties props = new Properties();

    public JsonKStreamJoins() {
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';");
        props.put("sasl.mechanism", "PLAIN");
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("session.timeout.ms", "45000");
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka_tutorial.kstream.joined.rides.pickuplocation.v1");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
    }

    public Topology createTopology() {
        StreamsBuilder streamsBuilder = new StreamsBuilder();
        KStream<String, Ride> rides = streamsBuilder.stream(Topics.INPUT_RIDE_TOPIC, Consumed.with(Serdes.String(), CustomSerdes.getSerde(Ride.class)));
KStream<String, PickupLocation> pickupLocations = streamsBuilder.stream(Topics.INPUT_RIDE_LOCATION_TOPIC, Consumed.with(Serdes.String(), CustomSerdes.getSerde(PickupLocation.class))); var pickupLocationsKeyedOnPUId = pickupLocations.selectKey((key, value) -> String.valueOf(value.PULocationID)); var joined = rides.join(pickupLocationsKeyedOnPUId, (ValueJoiner<Ride, PickupLocation, Optional<VendorInfo>>) (ride, pickupLocation) -> { var period = Duration.between(ride.tpep_dropoff_datetime, pickupLocation.tpep_pickup_datetime); if (period.abs().toMinutes() > 10) return Optional.empty(); else return Optional.of(new VendorInfo(ride.VendorID, pickupLocation.PULocationID, pickupLocation.tpep_pickup_datetime, ride.tpep_dropoff_datetime)); }, JoinWindows.ofTimeDifferenceAndGrace(Duration.ofMinutes(20), Duration.ofMinutes(5)), StreamJoined.with(Serdes.String(), CustomSerdes.getSerde(Ride.class), CustomSerdes.getSerde(PickupLocation.class))); joined.filter(((key, value) -> value.isPresent())).mapValues(Optional::get) .to(Topics.OUTPUT_TOPIC, Produced.with(Serdes.String(), CustomSerdes.getSerde(VendorInfo.class))); return streamsBuilder.build(); } public void joinRidesPickupLocation() throws InterruptedException { var topology = createTopology(); var kStreams = new KafkaStreams(topology, props); kStreams.setUncaughtExceptionHandler(exception -> { System.out.println(exception.getMessage()); return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.SHUTDOWN_APPLICATION; }); kStreams.start(); while (kStreams.state() != KafkaStreams.State.RUNNING) { System.out.println(kStreams.state()); Thread.sleep(1000); } System.out.println(kStreams.state()); Runtime.getRuntime().addShutdownHook(new Thread(kStreams::close)); } public static void main(String[] args) throws InterruptedException { var object = new JsonKStreamJoins(); object.joinRidesPickupLocation(); } } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonKStreamWindow.java
================================================ package org.example; import org.apache.kafka.clients.consumer.ConsumerConfig; import org.apache.kafka.common.serialization.Serdes; import org.apache.kafka.streams.KafkaStreams; import org.apache.kafka.streams.StreamsBuilder; import org.apache.kafka.streams.StreamsConfig; import org.apache.kafka.streams.Topology; import org.apache.kafka.streams.kstream.Consumed; import org.apache.kafka.streams.kstream.Produced; import org.apache.kafka.streams.kstream.TimeWindows; import org.apache.kafka.streams.kstream.WindowedSerdes; import org.example.customserdes.CustomSerdes; import org.example.data.Ride; import java.time.Duration; import java.time.temporal.ChronoUnit; import java.util.Properties; public class JsonKStreamWindow { private Properties props = new Properties(); public JsonKStreamWindow() { props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092"); props.put("security.protocol", "SASL_SSL"); props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';"); props.put("sasl.mechanism", "PLAIN"); props.put("client.dns.lookup", "use_all_dns_ips"); props.put("session.timeout.ms", "45000"); props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka_tutorial.kstream.count.plocation.v1"); props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest"); props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0); } public Topology createTopology() { StreamsBuilder streamsBuilder = new StreamsBuilder(); var ridesStream = streamsBuilder.stream("rides", Consumed.with(Serdes.String(), CustomSerdes.getSerde(Ride.class))); var puLocationCount = ridesStream.groupByKey() .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofSeconds(10), Duration.ofSeconds(5))) .count().toStream(); var windowSerde = WindowedSerdes.timeWindowedSerdeFrom(String.class, 10*1000); 
puLocationCount.to("rides-pulocation-window-count", Produced.with(windowSerde, Serdes.Long())); return streamsBuilder.build(); } public void countPLocationWindowed() { var topology = createTopology(); var kStreams = new KafkaStreams(topology, props); kStreams.start(); Runtime.getRuntime().addShutdownHook(new Thread(kStreams::close)); } public static void main(String[] args) { var object = new JsonKStreamWindow(); object.countPLocationWindowed(); } } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonProducer.java ================================================ package org.example; import com.opencsv.CSVReader; import com.opencsv.exceptions.CsvException; import org.apache.kafka.clients.producer.*; import org.apache.kafka.streams.StreamsConfig; import org.example.data.Ride; import java.io.FileReader; import java.io.IOException; import java.time.LocalDateTime; import java.util.List; import java.util.Properties; import java.util.concurrent.ExecutionException; import java.util.stream.Collectors; public class JsonProducer { private Properties props = new Properties(); public JsonProducer() { props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092"); props.put("security.protocol", "SASL_SSL"); props.put("sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';"); props.put("sasl.mechanism", "PLAIN"); props.put("client.dns.lookup", "use_all_dns_ips"); props.put("session.timeout.ms", "45000"); props.put(ProducerConfig.ACKS_CONFIG, "all"); props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer"); props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaJsonSerializer"); } public List getRides() throws IOException, CsvException { var ridesStream = 
this.getClass().getResource("/rides.csv"); var reader = new CSVReader(new FileReader(ridesStream.getFile())); reader.skip(1); return reader.readAll().stream().map(arr -> new Ride(arr)) .collect(Collectors.toList()); } public void publishRides(List<Ride> rides) throws ExecutionException, InterruptedException { KafkaProducer<String, Ride> kafkaProducer = new KafkaProducer<>(props); for(Ride ride: rides) { ride.tpep_pickup_datetime = LocalDateTime.now().minusMinutes(20); ride.tpep_dropoff_datetime = LocalDateTime.now(); var record = kafkaProducer.send(new ProducerRecord<>("rides", String.valueOf(ride.DOLocationID), ride), (metadata, exception) -> { if(exception != null) { System.out.println(exception.getMessage()); } }); System.out.println(record.get().offset()); System.out.println(ride.DOLocationID); Thread.sleep(500); } } public static void main(String[] args) throws IOException, CsvException, ExecutionException, InterruptedException { var producer = new JsonProducer(); var rides = producer.getRides(); producer.publishRides(rides); } } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonProducerPickupLocation.java ================================================ package org.example; import com.opencsv.exceptions.CsvException; import org.apache.kafka.clients.producer.KafkaProducer; import org.apache.kafka.clients.producer.ProducerConfig; import org.apache.kafka.clients.producer.ProducerRecord; import org.example.data.PickupLocation; import java.io.IOException; import java.time.LocalDateTime; import java.util.Properties; import java.util.concurrent.ExecutionException; public class JsonProducerPickupLocation { private Properties props = new Properties(); public JsonProducerPickupLocation() { props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092"); props.put("security.protocol", "SASL_SSL"); props.put("sasl.jaas.config",
"org.apache.kafka.common.security.plain.PlainLoginModule required username='"+Secrets.KAFKA_CLUSTER_KEY+"' password='"+Secrets.KAFKA_CLUSTER_SECRET+"';"); props.put("sasl.mechanism", "PLAIN"); props.put("client.dns.lookup", "use_all_dns_ips"); props.put("session.timeout.ms", "45000"); props.put(ProducerConfig.ACKS_CONFIG, "all"); props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer"); props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaJsonSerializer"); } public void publish(PickupLocation pickupLocation) throws ExecutionException, InterruptedException { KafkaProducer kafkaProducer = new KafkaProducer(props); var record = kafkaProducer.send(new ProducerRecord<>("rides_location", String.valueOf(pickupLocation.PULocationID), pickupLocation), (metadata, exception) -> { if (exception != null) { System.out.println(exception.getMessage()); } }); System.out.println(record.get().offset()); } public static void main(String[] args) throws IOException, CsvException, ExecutionException, InterruptedException { var producer = new JsonProducerPickupLocation(); producer.publish(new PickupLocation(186, LocalDateTime.now())); } } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/Secrets.java ================================================ package org.example; public class Secrets { public static final String KAFKA_CLUSTER_KEY = "REPLACE_WITH_YOUR_KAFKA_CLUSTER_KEY"; public static final String KAFKA_CLUSTER_SECRET = "REPLACE_WITH_YOUR_KAFKA_CLUSTER_SECRET"; public static final String SCHEMA_REGISTRY_KEY = "REPLACE_WITH_SCHEMA_REGISTRY_KEY"; public static final String SCHEMA_REGISTRY_SECRET = "REPLACE_WITH_SCHEMA_REGISTRY_SECRET"; } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/Topics.java ================================================ 
package org.example; public class Topics { public static final String INPUT_RIDE_TOPIC = "rides"; public static final String INPUT_RIDE_LOCATION_TOPIC = "rides_location"; public static final String OUTPUT_TOPIC = "vendor_info"; } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/customserdes/CustomSerdes.java ================================================ package org.example.customserdes; import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig; import io.confluent.kafka.serializers.KafkaJsonDeserializer; import io.confluent.kafka.serializers.KafkaJsonSerializer; import io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde; import org.apache.avro.specific.SpecificRecordBase; import org.apache.kafka.common.serialization.Deserializer; import org.apache.kafka.common.serialization.Serde; import org.apache.kafka.common.serialization.Serdes; import org.apache.kafka.common.serialization.Serializer; import org.example.data.PickupLocation; import org.example.data.Ride; import org.example.data.VendorInfo; import java.util.HashMap; import java.util.Map; public class CustomSerdes { public static <T> Serde<T> getSerde(Class<T> classOf) { Map<String, Object> serdeProps = new HashMap<>(); serdeProps.put("json.value.type", classOf); final Serializer<T> mySerializer = new KafkaJsonSerializer<>(); mySerializer.configure(serdeProps, false); final Deserializer<T> myDeserializer = new KafkaJsonDeserializer<>(); myDeserializer.configure(serdeProps, false); return Serdes.serdeFrom(mySerializer, myDeserializer); } public static <T extends SpecificRecordBase> SpecificAvroSerde<T> getAvroSerde(boolean isKey, String schemaRegistryUrl) { var serde = new SpecificAvroSerde<T>(); Map<String, Object> serdeProps = new HashMap<>(); serdeProps.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl); serde.configure(serdeProps, isKey); return serde; } } ================================================ FILE:
07-streaming/theory/java/kafka_examples/src/main/java/org/example/data/PickupLocation.java ================================================ package org.example.data; import java.time.LocalDateTime; public class PickupLocation { public PickupLocation(long PULocationID, LocalDateTime tpep_pickup_datetime) { this.PULocationID = PULocationID; this.tpep_pickup_datetime = tpep_pickup_datetime; } public PickupLocation() { } public long PULocationID; public LocalDateTime tpep_pickup_datetime; } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/data/Ride.java ================================================ package org.example.data; import java.nio.DoubleBuffer; import java.time.LocalDate; import java.time.LocalDateTime; import java.time.format.DateTimeFormatter; public class Ride { public Ride(String[] arr) { VendorID = arr[0]; tpep_pickup_datetime = LocalDateTime.parse(arr[1], DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")); tpep_dropoff_datetime = LocalDateTime.parse(arr[2], DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")); passenger_count = Integer.parseInt(arr[3]); trip_distance = Double.parseDouble(arr[4]); RatecodeID = Long.parseLong(arr[5]); store_and_fwd_flag = arr[6]; PULocationID = Long.parseLong(arr[7]); DOLocationID = Long.parseLong(arr[8]); payment_type = arr[9]; fare_amount = Double.parseDouble(arr[10]); extra = Double.parseDouble(arr[11]); mta_tax = Double.parseDouble(arr[12]); tip_amount = Double.parseDouble(arr[13]); tolls_amount = Double.parseDouble(arr[14]); improvement_surcharge = Double.parseDouble(arr[15]); total_amount = Double.parseDouble(arr[16]); congestion_surcharge = Double.parseDouble(arr[17]); } public Ride(){} public String VendorID; public LocalDateTime tpep_pickup_datetime; public LocalDateTime tpep_dropoff_datetime; public int passenger_count; public double trip_distance; public long RatecodeID; public String store_and_fwd_flag; public long PULocationID; public 
long DOLocationID; public String payment_type; public double fare_amount; public double extra; public double mta_tax; public double tip_amount; public double tolls_amount; public double improvement_surcharge; public double total_amount; public double congestion_surcharge; } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/main/java/org/example/data/VendorInfo.java ================================================ package org.example.data; import java.time.LocalDateTime; public class VendorInfo { public VendorInfo(String vendorID, long PULocationID, LocalDateTime pickupTime, LocalDateTime lastDropoffTime) { VendorID = vendorID; this.PULocationID = PULocationID; this.pickupTime = pickupTime; this.lastDropoffTime = lastDropoffTime; } public VendorInfo() { } public String VendorID; public long PULocationID; public LocalDateTime pickupTime; public LocalDateTime lastDropoffTime; } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/test/java/org/example/JsonKStreamJoinsTest.java ================================================ package org.example; import org.apache.kafka.clients.consumer.ConsumerConfig; import org.apache.kafka.common.internals.Topic; import org.apache.kafka.common.serialization.Serdes; import org.apache.kafka.streams.*; import org.example.customserdes.CustomSerdes; import org.example.data.PickupLocation; import org.example.data.Ride; import org.example.data.VendorInfo; import org.example.helper.DataGeneratorHelper; import org.junit.jupiter.api.AfterAll; import org.junit.jupiter.api.BeforeEach; import org.junit.jupiter.api.Test; import javax.xml.crypto.Data; import java.util.Properties; import static org.junit.jupiter.api.Assertions.*; class JsonKStreamJoinsTest { private Properties props = new Properties(); private static TopologyTestDriver testDriver; private TestInputTopic<String, Ride> ridesTopic; private TestInputTopic<String, PickupLocation> pickLocationTopic; private TestOutputTopic<String, VendorInfo>
outputTopic; private Topology topology = new JsonKStreamJoins().createTopology(); @BeforeEach public void setup() { props = new Properties(); props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "testing_count_application"); props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234"); if (testDriver != null) { testDriver.close(); } testDriver = new TopologyTestDriver(topology, props); ridesTopic = testDriver.createInputTopic(Topics.INPUT_RIDE_TOPIC, Serdes.String().serializer(), CustomSerdes.getSerde(Ride.class).serializer()); pickLocationTopic = testDriver.createInputTopic(Topics.INPUT_RIDE_LOCATION_TOPIC, Serdes.String().serializer(), CustomSerdes.getSerde(PickupLocation.class).serializer()); outputTopic = testDriver.createOutputTopic(Topics.OUTPUT_TOPIC, Serdes.String().deserializer(), CustomSerdes.getSerde(VendorInfo.class).deserializer()); } @Test public void testIfJoinWorksOnSameDropOffPickupLocationId() { Ride ride = DataGeneratorHelper.generateRide(); PickupLocation pickupLocation = DataGeneratorHelper.generatePickUpLocation(ride.DOLocationID); ridesTopic.pipeInput(String.valueOf(ride.DOLocationID), ride); pickLocationTopic.pipeInput(String.valueOf(pickupLocation.PULocationID), pickupLocation); assertEquals(outputTopic.getQueueSize(), 1); var expected = new VendorInfo(ride.VendorID, pickupLocation.PULocationID, pickupLocation.tpep_pickup_datetime, ride.tpep_dropoff_datetime); var result = outputTopic.readKeyValue(); assertEquals(result.key, String.valueOf(ride.DOLocationID)); assertEquals(result.value.VendorID, expected.VendorID); assertEquals(result.value.pickupTime, expected.pickupTime); } @AfterAll public static void shutdown() { testDriver.close(); } } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/test/java/org/example/JsonKStreamTest.java ================================================ package org.example; import org.apache.kafka.common.serialization.Serdes; import 
org.apache.kafka.streams.*; import org.example.customserdes.CustomSerdes; import org.example.data.Ride; import org.example.helper.DataGeneratorHelper; import org.junit.jupiter.api.AfterAll; import org.junit.jupiter.api.BeforeEach; import org.junit.jupiter.api.Test; import static org.junit.jupiter.api.Assertions.*; import java.util.Properties; class JsonKStreamTest { private Properties props; private static TopologyTestDriver testDriver; private TestInputTopic inputTopic; private TestOutputTopic outputTopic; private Topology topology = new JsonKStream().createTopology(); @BeforeEach public void setup() { props = new Properties(); props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "testing_count_application"); props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234"); if (testDriver != null) { testDriver.close(); } testDriver = new TopologyTestDriver(topology, props); inputTopic = testDriver.createInputTopic("rides", Serdes.String().serializer(), CustomSerdes.getSerde(Ride.class).serializer()); outputTopic = testDriver.createOutputTopic("rides-pulocation-count", Serdes.String().deserializer(), Serdes.Long().deserializer()); } @Test public void testIfOneMessageIsPassedToInputTopicWeGetCountOfOne() { Ride ride = DataGeneratorHelper.generateRide(); inputTopic.pipeInput(String.valueOf(ride.DOLocationID), ride); assertEquals(outputTopic.readKeyValue(), KeyValue.pair(String.valueOf(ride.DOLocationID), 1L)); assertTrue(outputTopic.isEmpty()); } @Test public void testIfTwoMessageArePassedWithDifferentKey() { Ride ride1 = DataGeneratorHelper.generateRide(); ride1.DOLocationID = 100L; inputTopic.pipeInput(String.valueOf(ride1.DOLocationID), ride1); Ride ride2 = DataGeneratorHelper.generateRide(); ride2.DOLocationID = 200L; inputTopic.pipeInput(String.valueOf(ride2.DOLocationID), ride2); assertEquals(outputTopic.readKeyValue(), KeyValue.pair(String.valueOf(ride1.DOLocationID), 1L)); assertEquals(outputTopic.readKeyValue(), 
KeyValue.pair(String.valueOf(ride2.DOLocationID), 1L)); assertTrue(outputTopic.isEmpty()); } @Test public void testIfTwoMessageArePassedWithSameKey() { Ride ride1 = DataGeneratorHelper.generateRide(); ride1.DOLocationID = 100L; inputTopic.pipeInput(String.valueOf(ride1.DOLocationID), ride1); Ride ride2 = DataGeneratorHelper.generateRide(); ride2.DOLocationID = 100L; inputTopic.pipeInput(String.valueOf(ride2.DOLocationID), ride2); assertEquals(outputTopic.readKeyValue(), KeyValue.pair("100", 1L)); assertEquals(outputTopic.readKeyValue(), KeyValue.pair("100", 2L)); assertTrue(outputTopic.isEmpty()); } @AfterAll public static void tearDown() { testDriver.close(); } } ================================================ FILE: 07-streaming/theory/java/kafka_examples/src/test/java/org/example/helper/DataGeneratorHelper.java ================================================ package org.example.helper; import org.example.data.PickupLocation; import org.example.data.Ride; import org.example.data.VendorInfo; import java.time.LocalDateTime; import java.time.format.DateTimeFormatter; import java.util.List; public class DataGeneratorHelper { public static Ride generateRide() { var arrivalTime = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")); var departureTime = LocalDateTime.now().minusMinutes(30).format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")); return new Ride(new String[]{"1", departureTime, arrivalTime,"1","1.50","1","N","238","75","2","8","0.5","0.5","0","0","0.3","9.3","0"}); } public static PickupLocation generatePickUpLocation(long pickupLocationId) { return new PickupLocation(pickupLocationId, LocalDateTime.now()); } } ================================================ FILE: 07-streaming/workshop/.python-version ================================================ 3.13 ================================================ FILE: 07-streaming/workshop/Dockerfile.flink ================================================ FROM 
flink:2.2.0-scala_2.12-java17 COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/ # ref: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker WORKDIR /opt/pyflink COPY pyproject.flink.toml pyproject.toml RUN uv python install 3.12 && uv sync ENV PATH="/opt/pyflink/.venv/bin:$PATH" # Download connector libraries WORKDIR /opt/flink/lib RUN wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/2.2.0/flink-json-2.2.0.jar; \ wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/4.0.1-2.0/flink-sql-connector-kafka-4.0.1-2.0.jar; \ wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-core/4.0.0-2.0/flink-connector-jdbc-core-4.0.0-2.0.jar; \ wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-postgres/4.0.0-2.0/flink-connector-jdbc-postgres-4.0.0-2.0.jar; \ wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.10/postgresql-42.7.10.jar COPY flink-config.yaml /opt/flink/conf/config.yaml WORKDIR /opt/flink ================================================ FILE: 07-streaming/workshop/Dockerfile_ARM64.flink ================================================ FROM flink:2.2.0-scala_2.12-java17 COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/ USER root # Install a full JDK (not just a runtime) plus native build tools for pemja RUN apt-get update && apt-get install -y --no-install-recommends \ openjdk-17-jdk-headless \ build-essential \ python3-dev \ wget \ ca-certificates \ && rm -rf /var/lib/apt/lists/* # Point JAVA_HOME at the full JDK and make /opt/java/openjdk match what pemja expects RUN JDK_DIR="$(dirname "$(dirname "$(readlink -f "$(command -v javac)")")")" \ && rm -rf /opt/java/openjdk \ && ln -s "${JDK_DIR}" /opt/java/openjdk \ && test -d /opt/java/openjdk/include ENV JAVA_HOME=/opt/java/openjdk ENV PATH="${JAVA_HOME}/bin:${PATH}" # ref: 
https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker WORKDIR /opt/pyflink COPY pyproject.flink.toml pyproject.toml RUN uv python install 3.12 && uv sync ENV PATH="/opt/pyflink/.venv/bin:$PATH" # Download connector libraries # flink-json-2.2.0.jar is already bundled in the base image -- do NOT re-download it. WORKDIR /opt/flink/lib RUN wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/4.0.1-2.0/flink-sql-connector-kafka-4.0.1-2.0.jar \ && wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-core/4.0.0-2.0/flink-connector-jdbc-core-4.0.0-2.0.jar \ && wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-postgres/4.0.0-2.0/flink-connector-jdbc-postgres-4.0.0-2.0.jar \ && wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.10/postgresql-42.7.10.jar COPY flink-config.yaml /opt/flink/conf/config.yaml WORKDIR /opt/flink ================================================ FILE: 07-streaming/workshop/Makefile ================================================ .PHONY: build up down job aggregation_job stop start build: docker compose build up: docker compose up --build --remove-orphans -d down: docker compose down --remove-orphans job: docker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d aggregation_job: docker compose exec jobmanager ./bin/flink run -py /opt/src/job/aggregation_job.py --pyFiles /opt/src -d stop: docker compose stop start: docker compose start ================================================ FILE: 07-streaming/workshop/README.md ================================================ # PyFlink: Stream Processing Workshop Video: https://www.youtube.com/watch?v=YDUgFeHQzJU This workshop is based on the [2025 stream with Zach Wilson](https://www.youtube.com/watch?v=P2loELMUUeI). 
In this workshop, we build a real-time streaming pipeline step by step. We start with the basics - a message broker, a producer, and a consumer - then add a database and finally a stream processing framework. We'll use NYC yellow taxi trip data as our data source. What we'll build by the end: ``` Producer (Python) -> Kafka (Redpanda) -> Flink -> PostgreSQL ``` Prerequisites: - Docker and Docker Compose - [uv](https://docs.astral.sh/uv/) - A SQL client - [pgcli](https://www.pgcli.com/) (`uvx pgcli`), DBeaver, pgAdmin, or DataGrip Code: - [Reference code](./) in this directory (`07-streaming/workshop/`) - [Code created during the workshop](live/) by Alexey The README walks through building everything from scratch - you can follow along step by step or study the existing files and run the commands. ## Redpanda - a Kafka-compatible broker Before we can produce or consume messages, we need a message broker - a service that receives messages from producers, stores them, and delivers them to consumers. We use [Redpanda](https://redpanda.com/), a drop-in replacement for Apache Kafka. Redpanda implements the same protocol, so any Kafka client library works with it unchanged. The `kafka-python` library we'll use doesn't know or care that Redpanda is running instead of Kafka. Why Redpanda instead of Kafka? - No JVM - Kafka runs on Java and needs significant memory for the JVM. Redpanda is written in C++ and starts in seconds with far less overhead. - No ZooKeeper - Kafka traditionally required a separate ZooKeeper cluster for coordination (metadata, leader election). Redpanda handles this internally using the Raft consensus protocol - one less service to run. - Single binary - just one container, nothing else to configure. For this workshop, every time we say "Kafka" we mean the Kafka protocol and concepts. Redpanda is the actual broker running underneath. 
Create `docker-compose.yml` with the Redpanda service: ```yaml services: redpanda: image: redpandadata/redpanda:v25.3.9 command: - redpanda - start - --smp - '1' - --reserve-memory - 0M - --overprovisioned - --node-id - '1' - --kafka-addr - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092 - --advertise-kafka-addr - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092 - --pandaproxy-addr - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082 - --advertise-pandaproxy-addr - PLAINTEXT://redpanda:28082,OUTSIDE://localhost:8082 - --rpc-addr - 0.0.0.0:33145 - --advertise-rpc-addr - redpanda:33145 ports: - 8082:8082 - 9092:9092 - 28082:28082 - 29092:29092 ``` The command has many parameters. Let's go through them. Resource parameters: | Parameter | What it does | |---|---| | `--smp 1` | Use 1 CPU core. Redpanda is built on [Seastar](http://seastar.io/), a framework that pins threads to cores for high performance. For development, 1 core is enough. | | `--reserve-memory 0M` | Reserve no memory for the operating system. By default, Redpanda sets aside part of the machine's memory for the OS and other processes; in a development container we skip this. | | `--overprovisioned` | Don't pin threads to specific CPU cores. On a shared development machine, this avoids contention with other processes. | | `--node-id 1` | Unique identifier for this broker in the cluster. With a single broker it doesn't matter, but the parameter is required. | Networking parameters: Redpanda exposes two separate listeners for the Kafka protocol - one for connections from inside Docker (other containers) and one for connections from outside Docker (your laptop): | Parameter | Internal (Docker) | External (your laptop) | |---|---|---| | `--kafka-addr` | `PLAINTEXT://0.0.0.0:29092` | `OUTSIDE://0.0.0.0:9092` | | `--advertise-kafka-addr` | `PLAINTEXT://redpanda:29092` | `OUTSIDE://localhost:9092` | Why two addresses? Kafka clients use a two-step connection process: 1.
The client connects to a bootstrap server and asks for cluster metadata 2. The broker responds with advertised addresses - where the client should connect for actual data transfer Inside Docker, containers find each other by service name, so the internal advertised address is `redpanda:29092`. From your laptop, you connect via the published port at `localhost:9092`. If we used only one address, either Docker containers or your laptop wouldn't be able to connect. The `--pandaproxy-addr` / `--advertise-pandaproxy-addr` follow the same pattern for Redpanda's HTTP REST API (not used in this workshop). The `--rpc-addr` / `--advertise-rpc-addr` are for internal cluster communication between Redpanda nodes (not relevant with a single node). Published ports: | Port | What it's for | |---|---| | `9092` | Kafka protocol (external) - your Python producer/consumer connects here | | `29092` | Kafka protocol (internal) - Flink containers will connect here later | | `8082` / `28082` | HTTP Proxy - REST API access (not used in this workshop) | Start Redpanda: ```bash docker compose up redpanda -d ``` Verify it's running: ```bash docker compose ps ``` ``` NAME IMAGE SERVICE STATUS workshop-redpanda redpandadata/redpanda:v25.3.9 redpanda Up ``` ## Produce messages to Kafka Initialize a Python project and add the dependencies we need: ```bash uv init -p 3.12 uv add kafka-python pandas pyarrow ``` > If you cloned the repository, `pyproject.toml` already exists. > Run `uv sync` instead. We'll send NYC yellow taxi trip data to Kafka. You can run the code below either as a Python script or in a Jupyter notebook (`uv add jupyter`, then `uv run jupyter lab`). First, download the data. 
We read a parquet file of yellow taxi trips and take the first 1000 rows: ```python import pandas as pd url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet" columns = ['PULocationID', 'DOLocationID', 'trip_distance', 'total_amount', 'tpep_pickup_datetime'] df = pd.read_parquet(url, columns=columns).head(1000) df.head() ``` We only read 5 columns to keep things focused. The full dataset has many more (fare breakdown, rate codes, payment type, etc.). Define a dataclass for our message. This gives us a clear schema for each taxi trip: ```python from dataclasses import dataclass @dataclass class Ride: PULocationID: int DOLocationID: int trip_distance: float total_amount: float tpep_pickup_datetime: int # epoch milliseconds ``` Write a function to convert a DataFrame row into a `Ride`. We convert the pandas Timestamp to epoch milliseconds - that's the format Flink expects later: ```python def ride_from_row(row): return Ride( PULocationID=int(row['PULocationID']), DOLocationID=int(row['DOLocationID']), trip_distance=float(row['trip_distance']), total_amount=float(row['total_amount']), tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000), ) ``` Test it: ```python ride = ride_from_row(df.iloc[0]) ride # Ride(PULocationID=186, DOLocationID=79, trip_distance=1.72, # total_amount=17.31, tpep_pickup_datetime=1730429702000) ``` Next, connect to Kafka. The `bootstrap_servers` is where the broker accepts connections - `localhost:9092` because we're running this from our laptop (outside Docker). In production with multiple brokers, you'd list several for redundancy - if one is down, the client connects through another. 
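The epoch-millisecond conversion that `ride_from_row` performs above can be checked on its own. This is a minimal sketch that uses only the standard library; the helper names `to_epoch_ms` and `from_epoch_ms` are hypothetical and are not part of the workshop code. It assumes an explicit UTC timezone so the result does not depend on the machine's local timezone (pandas treats a naive `Timestamp` as UTC when calling `.timestamp()`, whereas a naive standard-library `datetime` is interpreted as local time).

```python
from datetime import datetime, timezone


def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


def from_epoch_ms(ms: int) -> datetime:
    """Convert epoch milliseconds back to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Hypothetical pickup time, chosen only for illustration
pickup = datetime(2025, 11, 1, 0, 15, 2, tzinfo=timezone.utc)

ms = to_epoch_ms(pickup)
restored = from_epoch_ms(ms)

print(ms)
print(restored.isoformat())
```

Sending a plain integer keeps the JSON message simple, and the round trip back to a `datetime` shows that nothing is lost at whole-second precision.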
Kafka works with raw bytes, so we need a serializer that converts Python dicts to JSON:

```python
import json
from kafka import KafkaProducer

def json_serializer(data):
    return json.dumps(data).encode('utf-8')

server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=json_serializer
)
```

Let's send a single ride to try it out. `dataclasses.asdict(ride)` converts the dataclass to a plain dict, which the serializer turns into JSON bytes. The broker auto-creates the `rides` topic on first use:

```python
import dataclasses

topic_name = 'rides'

producer.send(topic_name, value=dataclasses.asdict(ride))
producer.flush()
```

This works, but calling `dataclasses.asdict()` every time is tedious. We can make a serializer that handles dataclasses directly:

```python
def ride_serializer(ride):
    ride_dict = dataclasses.asdict(ride)
    json_str = json.dumps(ride_dict)
    return json_str.encode('utf-8')
```

Now recreate the producer with the new serializer - we can pass `Ride` objects directly without converting them to dicts first:

```python
producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=ride_serializer
)
```

Send one ride to verify:

```python
producer.send(topic_name, value=ride)
producer.flush()
```

That sent one record. Now let's send all 1000 rides in a loop:

```python
import time

t0 = time.time()

for _, row in df.iterrows():
    ride = ride_from_row(row)
    producer.send(topic_name, value=ride)
    print(f"Sent: {ride}")
    time.sleep(0.01)

producer.flush()

t1 = time.time()
print(f'took {(t1 - t0):.2f} seconds')
```

If you're building from scratch (not using the cloned repo files), create the source directory structure and save the shared data model.
The producer and consumer scripts both import from this file:

```bash
mkdir -p src/producers src/consumers src/job
```

Create `src/models.py`:

```python
import json
from dataclasses import dataclass


@dataclass
class Ride:
    PULocationID: int
    DOLocationID: int
    trip_distance: float
    total_amount: float
    tpep_pickup_datetime: int  # epoch milliseconds


def ride_from_row(row):
    return Ride(
        PULocationID=int(row['PULocationID']),
        DOLocationID=int(row['DOLocationID']),
        trip_distance=float(row['trip_distance']),
        total_amount=float(row['total_amount']),
        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),
    )


def ride_deserializer(data):
    json_str = data.decode('utf-8')
    ride_dict = json.loads(json_str)
    return Ride(**ride_dict)
```

`ride_deserializer` is introduced in the next step - we include it here so the file is complete.

> The complete script is in `src/producers/producer.py`.

Run it:

```bash
uv run python src/producers/producer.py
```

You'll see 1000 taxi trips sent over ~10 seconds:

```
Sent: Ride(PULocationID=..., DOLocationID=..., trip_distance=..., total_amount=..., tpep_pickup_datetime=...)
...
took 10.23 seconds
```

## Consume messages with Python

Now let's read back the messages. The consumer receives raw bytes from Kafka.
Instead of deserializing to a dict and then constructing a `Ride` manually, let's write a function that does both in one step:

```python
import json

def ride_deserializer(data):
    json_str = data.decode('utf-8')
    ride_dict = json.loads(json_str)
    return Ride(**ride_dict)
```

Test it with a sample JSON binary string (this is what Kafka delivers):

```python
test_bytes = json.dumps({
    'PULocationID': 186,
    'DOLocationID': 79,
    'trip_distance': 1.72,
    'total_amount': 17.31,
    'tpep_pickup_datetime': 1730429702000
}).encode('utf-8')

ride_deserializer(test_bytes)
# Ride(PULocationID=186, DOLocationID=79, trip_distance=1.72,
#      total_amount=17.31, tpep_pickup_datetime=1730429702000)
```

Now we can pass `ride_deserializer` directly as the `value_deserializer` - Kafka calls it on every message, so `message.value` is already a `Ride`.

Connect to Kafka as a consumer. `auto_offset_reset='earliest'` means we start reading from the beginning of the topic (without this, new consumers default to `latest` and only see new messages). `group_id` identifies this consumer group - Kafka tracks how far each group has read, so restarting with the same group ID continues where it left off:

```python
from kafka import KafkaConsumer

server = 'localhost:9092'
topic_name = 'rides'

consumer = KafkaConsumer(
    topic_name,
    bootstrap_servers=[server],
    auto_offset_reset='earliest',
    group_id='rides-console',
    value_deserializer=ride_deserializer
)
```

Read messages and print them. Since `value_deserializer` returns a `Ride`, `message.value` is already a `Ride` object - no extra conversion needed:

```python
from datetime import datetime

print(f"Listening to {topic_name}...")

count = 0
for message in consumer:
    ride = message.value
    pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)
    print(f"Received: PU={ride.PULocationID}, DO={ride.DOLocationID}, "
          f"distance={ride.trip_distance}, amount=${ride.total_amount:.2f}, "
          f"pickup={pickup_dt}")
    count += 1
    if count >= 10:
        print(f"\n... received {count} messages so far (stopping after 10 for demo)")
        break

consumer.close()
```

> The complete script is in `src/consumers/consumer.py`.

Run it:

```bash
uv run python src/consumers/consumer.py
```

```
Listening to rides...
Received: PU=..., DO=..., distance=..., amount=$..., pickup=2025-...
...

... received 10 messages so far (stopping after 10 for demo)
```

## Save events to PostgreSQL

Printing to the screen is fine for debugging, but let's save events to a database.

Add the PostgreSQL service to `docker-compose.yml`:

```yaml
postgres:
  image: postgres:18
  restart: on-failure
  environment:
    - POSTGRES_DB=postgres
    - POSTGRES_USER=postgres
    - POSTGRES_PASSWORD=postgres
  ports:
    - "5432:5432"
```

Start it:

```bash
docker compose up postgres -d
```

Connect to PostgreSQL. With `pgcli`:

```bash
uvx pgcli -h localhost -p 5432 -U postgres -d postgres
# password: postgres
```

Or via Docker:

```bash
docker compose exec postgres psql -U postgres -d postgres
```

Create a table for our events:

```sql
CREATE TABLE processed_events (
    PULocationID INTEGER,
    DOLocationID INTEGER,
    trip_distance DOUBLE PRECISION,
    total_amount DOUBLE PRECISION,
    pickup_datetime TIMESTAMP
);
```

Install the PostgreSQL client library:

```bash
uv add psycopg2-binary
```

Create `src/consumers/consumer_postgres.py`.

Set up the Kafka consumer. We reuse the same `ride_deserializer` from the previous step.
The `group_id` is different - each consumer group tracks its offsets independently, so the console consumer and the PostgreSQL consumer each read all messages:

```python
from kafka import KafkaConsumer

server = 'localhost:9092'
topic_name = 'rides'

consumer = KafkaConsumer(
    topic_name,
    bootstrap_servers=[server],
    auto_offset_reset='earliest',
    group_id='rides-to-postgres',
    value_deserializer=ride_deserializer
)
```

Connect to PostgreSQL:

```python
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    port=5432,
    database='postgres',
    user='postgres',
    password='postgres'
)
conn.autocommit = True
cur = conn.cursor()
```

`autocommit = True` means each INSERT is committed immediately - no need to call `conn.commit()` after every row.

Read messages and insert into PostgreSQL:

```python
from datetime import datetime

print(f"Listening to {topic_name} and writing to PostgreSQL...")

count = 0
for message in consumer:
    ride = message.value
    pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)
    cur.execute(
        """INSERT INTO processed_events
           (PULocationID, DOLocationID, trip_distance, total_amount, pickup_datetime)
           VALUES (%s, %s, %s, %s, %s)""",
        (ride.PULocationID, ride.DOLocationID,
         ride.trip_distance, ride.total_amount, pickup_dt)
    )
    count += 1
    if count % 100 == 0:
        print(f"Inserted {count} rows...")

consumer.close()
cur.close()
conn.close()
```

Run it (press Ctrl+C after it processes the data):

```bash
uv run python src/consumers/consumer_postgres.py
```

Check PostgreSQL:

```sql
SELECT count(*) FROM processed_events;
```

```
 count
-------
  1000
```

This works, but think about what's missing:

- What if we want to aggregate by time window? We'd need to implement windowing logic ourselves.
- What if the consumer crashes? We'd need to track offsets ourselves to avoid reprocessing or missing data.
- What about parallelism? We'd need to manage multiple consumer instances and partition assignment.
- What about writing to different sinks?
We'd need to write connector code for each destination.

This is where Flink comes in.

Clear the table before moving on:

```sql
TRUNCATE processed_events;
```

## Why Flink?

Flink is a stream processing framework that handles all the hard parts:

- Windowing - built-in tumbling, sliding, and session windows
- Checkpointing - automatic state recovery after failures (no manual offset tracking)
- Parallelism - distribute processing across multiple workers
- Connectors - built-in JDBC, Kafka, filesystem sinks (no psycopg2 code)
- SQL interface - express stream processing with SQL queries

Flink can also connect to sources beyond Kafka - REST APIs, websockets, filesystems, and more. But Kafka is the most common source in stream processing.

The trade-off is infrastructure complexity - we need the JobManager and TaskManager containers. A streaming job is more like owning a server than running a batch pipeline - it runs 24/7 and needs monitoring. But for anything beyond simple consume-and-write, Flink pays for itself.

## The Flink image and services

Flink doesn't come with Python support out of the box. We need a custom Docker image with Python, PyFlink, and connector JARs.

Download the Flink build files:

```bash
PREFIX="https://raw.githubusercontent.com/DataTalksClub/data-engineering-zoomcamp/main/07-streaming/workshop"

wget ${PREFIX}/Dockerfile.flink
wget ${PREFIX}/pyproject.flink.toml
wget ${PREFIX}/flink-config.yaml
```

> If you cloned the repository, these files are already in the
> `07-streaming/workshop/` directory.
You can look at [`Dockerfile.flink`](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/07-streaming/workshop/Dockerfile.flink) to see what it does:

- Starts from the official Flink image (`flink:2.2.0-scala_2.12-java17`)
- Installs Python 3.12 and PyFlink via uv
- Downloads connector JARs (Kafka, JDBC, PostgreSQL driver)
- Applies a custom Flink config to increase JVM metaspace for PyFlink

Now add the Flink services to `docker-compose.yml`. A Flink cluster has two types of processes - let's add them one at a time.

The JobManager is the coordinator. It accepts jobs, manages checkpoints, and assigns work to task managers. You interact with it through the web UI (port `8081`) and submit jobs via its RPC port (`6123`):

```yaml
jobmanager:
  build:
    context: .
    dockerfile: ./Dockerfile.flink
  image: pyflink-workshop
  pull_policy: never
  expose:
    - "6123"
  ports:
    - "8081:8081"
  volumes:
    - ./:/opt/flink/usrlib
    - ./src/:/opt/src
  command: jobmanager
  environment:
    - |
      FLINK_PROPERTIES=
      jobmanager.rpc.address: jobmanager
      jobmanager.memory.process.size: 1600m
```

- `build` + `image: pyflink-workshop` - builds our custom Docker image and tags it as `pyflink-workshop`. The taskmanager will reuse this same image without rebuilding.
- `pull_policy: never` - don't try to pull `pyflink-workshop` from Docker Hub (it doesn't exist there - we built it locally).
- `volumes` - mount the source code into the container so we can submit jobs without rebuilding the image.
- `FLINK_PROPERTIES` - Flink configuration passed as an environment variable. `jobmanager.rpc.address: jobmanager` tells Flink where the coordinator lives (`jobmanager` is the Docker service name).

The TaskManager is the worker.
It executes the actual data processing:

```yaml
taskmanager:
  image: pyflink-workshop
  pull_policy: never
  expose:
    - "6121"
    - "6122"
  volumes:
    - ./:/opt/flink/usrlib
    - ./src/:/opt/src
  depends_on:
    - jobmanager
  command: taskmanager --taskmanager.registration.timeout 5 min
  environment:
    - |
      FLINK_PROPERTIES=
      jobmanager.rpc.address: jobmanager
      taskmanager.memory.process.size: 1728m
      taskmanager.numberOfTaskSlots: 15
      parallelism.default: 3
```

- `image: pyflink-workshop` - reuses the image built by the jobmanager service, no `build` needed.
- `depends_on: jobmanager` - start after the jobmanager.
- `--taskmanager.registration.timeout 5 min` - give the task manager 5 minutes to find the job manager on startup (useful when services start in parallel).
- `taskmanager.numberOfTaskSlots: 15` - this task manager has 15 slots.
- `parallelism.default: 3` - by default, each pipeline stage runs 3 copies processing data in parallel.

A task slot is a unit of resources (memory, CPU) that can run one parallel instance of a pipeline stage. Think of slots like lanes on a highway - more lanes means more data can flow through at once.

If you submit a job with parallelism 3, that job uses 3 slots. With 15 slots available, you can run 5 such jobs simultaneously on this single task manager.

In production, you'd have multiple task managers across different machines, each contributing slots to the cluster. The job manager decides which slots run which parts of which jobs.

Make sure `src/` exists before starting Docker - the volume mount `./src/:/opt/src` will create it as root if it doesn't exist, causing permission issues later when you try to create files inside it:

```bash
mkdir -p src/job
```

Build the Flink image and start all services:

```bash
docker compose up --build -d
```

The first build takes a few minutes - it installs Python, PyFlink, and downloads the connector JARs.
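While the image builds, here is a quick check of the slot arithmetic described above. This is only a sketch: the `max_concurrent_jobs` helper is hypothetical and assumes every job needs exactly as many slots as its parallelism, as in the example with 15 slots and parallelism 3.

```python
def max_concurrent_jobs(total_slots, parallelism):
    """How many jobs of the given parallelism fit into the available slots."""
    # Assumption: each job occupies `parallelism` slots and nothing else.
    return total_slots // parallelism


jobs = max_concurrent_jobs(total_slots=15, parallelism=3)
print(f"{jobs} jobs fit on this task manager")
```

If you change `taskmanager.numberOfTaskSlots` or `parallelism.default` in the compose file, the same division gives a rough upper bound on how many jobs can run side by side.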
Verify all four services are running:

```bash
docker compose ps
```

```
NAME                   IMAGE                           SERVICE       STATUS
workshop-jobmanager    pyflink-workshop                jobmanager    Up
workshop-taskmanager   pyflink-workshop                taskmanager   Up
workshop-postgres      postgres:18                     postgres      Up
workshop-redpanda      redpandadata/redpanda:v25.3.9   redpanda      Up
```

Check the Flink dashboard at [http://localhost:8081](http://localhost:8081) - you should see 1 task manager with 15 available task slots.

## The pass-through Flink job

Now let's do the same thing our Python consumer did, but with Flink.

Unlike the producer and consumer scripts, Flink jobs can't run from a Jupyter notebook. They are submitted to the Flink cluster as .py files using `docker compose exec`. We cover how job submission works in production in the "Flink in production" section at the end.

Create `src/job/pass_through_job.py`.

The Kafka source table:

```python
def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            tpep_pickup_datetime BIGINT
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda:29092',
            'topic' = 'rides',
            'scan.startup.mode' = 'latest-offset',
            'properties.auto.offset.reset' = 'latest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name
```

This is a Flink SQL DDL statement.
Breaking it down:

- `PULocationID`, `DOLocationID`, `trip_distance`, `total_amount`, `tpep_pickup_datetime` - the JSON fields from our producer
- `'properties.bootstrap.servers' = 'redpanda:29092'` - the internal Docker network address (not `localhost` - Flink runs inside Docker)
- `'scan.startup.mode' = 'latest-offset'` - only read new messages arriving after the job starts
- `'format' = 'json'` - Flink deserializes JSON automatically

The PostgreSQL sink table:

```python
def create_processed_events_sink_postgres(t_env):
    table_name = 'processed_events'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            pickup_datetime TIMESTAMP
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name
```

No psycopg2, no INSERT statements - just declare the table and Flink handles the rest.

The execution:

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

def log_processing():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)  # checkpoint every 10 seconds

    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    source_table = create_events_source_kafka(t_env)
    postgres_sink = create_processed_events_sink_postgres(t_env)

    t_env.execute_sql(
        f"""
        INSERT INTO {postgres_sink}
        SELECT
            PULocationID,
            DOLocationID,
            trip_distance,
            total_amount,
            TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3) as pickup_datetime
        FROM {source_table}
        """
    ).wait()

if __name__ == '__main__':
    log_processing()
```

- Streaming mode - the job runs continuously, waiting for new data
- The `INSERT INTO ... SELECT` is the pipeline - read from Kafka, convert the timestamp, write to PostgreSQL

`enable_checkpointing(10 * 1000)` tells Flink to take a snapshot of the job's state every 10 seconds. A checkpoint captures the Kafka offsets (how far Flink has read) and any in-flight data. If the job crashes, it resumes from the last checkpoint instead of starting from the beginning.

Checkpointing gets especially important with windows. If you have a 5-minute window and the job fails 2 minutes in, Flink doesn't just track the offset - it also serializes the open windows to disk. When it restarts, it picks up right where it left off, with the partially-filled window intact.

The trade-off is resilience versus efficiency. Checkpointing every 1 second is expensive - Flink has to serialize and persist the entire state that often. Checkpointing every 10 minutes means you could lose up to 10 minutes of progress on failure. 10 seconds is a reasonable default for most jobs.

Submit the job:

```bash
docker compose exec jobmanager ./bin/flink run \
    -py /opt/src/job/pass_through_job.py \
    --pyFiles /opt/src -d
```

```
Job has been submitted with JobID 663cff6811b65e97fc1e068d641401f4
```

Check the Flink UI at [http://localhost:8081](http://localhost:8081) - you should see a running job.

Since the job uses `latest-offset`, it's waiting for new messages. Send data:

```bash
uv run python src/producers/producer.py
```

Query PostgreSQL:

```sql
SELECT count(*) FROM processed_events;
```

Compare this to our Python consumer approach - same result, but Flink handles checkpointing, offset management, and PostgreSQL writes automatically.

## Offsets - earliest vs latest

When Flink connects to Kafka, it needs to know where to start reading.
This is the `scan.startup.mode` setting:

| Mode | Behavior |
|---|---|
| `latest-offset` | Only read messages arriving after the job starts |
| `earliest-offset` | Read everything from the beginning of the topic |
| `timestamp` | Start from a specific point in time |

`earliest` is typically used for backfilling or restating data - you're using Flink to process data that's been sitting in Kafka for a while, not real-time data. `latest` is the more common production setting - the job starts up and only processes new events as people click buttons on your website or whatever event feed you're consuming.

Our pass-through job uses `latest-offset`. Let's see what happens with `earliest-offset`:

1. Cancel the running job from the Flink UI (click on the job, then Cancel)
2. Clear the table:

   ```sql
   TRUNCATE processed_events;
   ```

3. Edit `src/job/pass_through_job.py` - change both offset settings:

   ```
   'scan.startup.mode' = 'earliest-offset',
   'properties.auto.offset.reset' = 'earliest',
   ```

4. Resubmit:

   ```bash
   docker compose exec jobmanager ./bin/flink run \
       -py /opt/src/job/pass_through_job.py \
       --pyFiles /opt/src -d
   ```

5. Wait 15 seconds, then check:

   ```sql
   SELECT count(*) FROM processed_events;
   ```

Flink reads all messages from the topic - including data from previous producer runs. If you ran the producer twice before, you'll see ~2000 rows (duplicates of everything already processed).

Why duplicates? Checkpoints are scoped to a specific job instance. When you cancel and resubmit, it's a brand new job that knows nothing about previous checkpoints. With `earliest-offset`, it starts from scratch. The offset setting only matters at startup - once the job is running, checkpointing takes over and tracks progress. But if you kill the job and create a new one, those checkpoints are gone.

There is a third option - `timestamp` mode. If your job was running fine until 2:00 PM and then crashed, you can restart it from exactly 2:00 PM.
This is useful for recovering from failures without reprocessing everything from the beginning or missing the data that arrived while the job was down.

A common production pattern (Lambda architecture): run your streaming job with `latest-offset` for real-time results, and if it goes down, use a separate batch job to backfill the gap. This way the streaming job stays fast and you don't lose data.

> Change the offset back to `latest-offset` when you're done experimenting.

## Aggregation with tumbling windows

Now let's do something our plain Python consumer can't easily do - windowed aggregation. We'll count taxi trips and sum revenue by pickup location per hour.

First, cancel any running jobs. Then create the aggregation table in PostgreSQL:

```sql
CREATE TABLE processed_events_aggregated (
    window_start TIMESTAMP,
    PULocationID INTEGER,
    num_trips BIGINT,
    total_revenue DOUBLE PRECISION,
    PRIMARY KEY (window_start, PULocationID)
);
```

Two important design choices:

1. `PULocationID` is included - we group by both time window and pickup location, so both appear in the output.
2. `PRIMARY KEY` - enables upsert behavior. When Flink sends updated counts for the same window, PostgreSQL updates the existing row instead of creating a duplicate. This matters because late-arriving events can cause Flink to re-evaluate a window it already emitted results for. With upsert, the corrected count replaces the old one automatically.
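To make the upsert behavior concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption, so the example runs without the Docker services). The `upsert` helper and the sample values are hypothetical. The `ON CONFLICT ... DO UPDATE` clause is also valid PostgreSQL syntax, but this is not necessarily the exact statement the Flink JDBC connector generates.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processed_events_aggregated (
        window_start TEXT,
        PULocationID INTEGER,
        num_trips INTEGER,
        total_revenue REAL,
        PRIMARY KEY (window_start, PULocationID)
    )
""")


def upsert(window_start, pu_location_id, num_trips, total_revenue):
    """Hypothetical helper: insert a row, or overwrite it if the key exists."""
    conn.execute(
        """
        INSERT INTO processed_events_aggregated
            (window_start, PULocationID, num_trips, total_revenue)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (window_start, PULocationID) DO UPDATE SET
            num_trips = excluded.num_trips,
            total_revenue = excluded.total_revenue
        """,
        (window_start, pu_location_id, num_trips, total_revenue),
    )


# First result for the window, then a corrected result for the same key.
upsert("2025-11-01 00:00:00", 79, 1, 17.31)
upsert("2025-11-01 00:00:00", 79, 2, 30.0)

rows = conn.execute("SELECT * FROM processed_events_aggregated").fetchall()
print(rows)  # a single row holding the corrected values
```

Without the primary key, the second write would either fail or add a second row for the same window and location, which is the duplicate the design above avoids.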
Now create `src/job/aggregation_job.py`:

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment


def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            tpep_pickup_datetime BIGINT,
            event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),
            WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda:29092',
            'topic' = 'rides',
            'scan.startup.mode' = 'earliest-offset',
            'properties.auto.offset.reset' = 'earliest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name


def create_events_aggregated_sink(t_env):
    table_name = 'processed_events_aggregated'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            window_start TIMESTAMP(3),
            PULocationID INT,
            num_trips BIGINT,
            total_revenue DOUBLE,
            PRIMARY KEY (window_start, PULocationID) NOT ENFORCED
        ) WITH (
            'connector' = 'jdbc',
            'url' = 'jdbc:postgresql://postgres:5432/postgres',
            'table-name' = '{table_name}',
            'username' = 'postgres',
            'password' = 'postgres',
            'driver' = 'org.postgresql.Driver'
        );
        """
    t_env.execute_sql(sink_ddl)
    return table_name


def log_aggregation():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)
    env.set_parallelism(3)

    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    try:
        source_table = create_events_source_kafka(t_env)
        aggregated_table = create_events_aggregated_sink(t_env)

        t_env.execute_sql(f"""
        INSERT INTO {aggregated_table}
        SELECT
            window_start,
            PULocationID,
            COUNT(*) AS num_trips,
            SUM(total_amount) AS total_revenue
        FROM TABLE(
            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '1' HOUR)
        )
        GROUP BY window_start, PULocationID;
        """).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_aggregation()
```

The Kafka source table has two new lines compared to the pass-through job:

- `event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3)` - a computed column that converts epoch milliseconds to a timestamp. The `3` means milliseconds precision.
- `WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND` - tells Flink when to publish window results.

The window defines WHAT you're counting - a 1-hour bucket of taxi trips. But in a stream, events keep arriving. How does Flink know when to stop waiting and publish the count for the 2 PM - 3 PM hour? It can't just look at the clock because some events arrive late. Without a trigger, Flink would accumulate data forever and never write anything to PostgreSQL.

The watermark is that trigger. It tells Flink when to publish. In the SQL:

```
WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '5' SECOND
                                                   ^^^^^^^^^^^^^^^^^^^
                                                   patience = 5 seconds
```

The watermark is always 5 seconds behind the latest event timestamp Flink has seen. When the watermark passes the end of a window, Flink publishes that window's results. The 5 seconds is patience for stragglers - events that happened before the window ended but arrived a few seconds late.

Three pieces working together:

- Window = what bucket to count into (1 hour)
- Watermark = when to publish the result (the trigger)
- Upsert (PRIMARY KEY) = safety net that corrects the result if something arrives after publishing

Here's a concrete example. Two taxi pickups in East Village (PU=79) with a 10-second window and 5-second watermark. Event A is on time, Event B is 8 seconds late (the rider's phone lost signal in a tunnel).
Event B arrives late, but Flink hasn't published yet - both events counted:

```mermaid
sequenceDiagram
    participant P as Producer
    participant K as Kafka
    participant F as Flink
    participant PG as PostgreSQL

    P->>K: Event A (ts=14:00:07, on time)
    K->>F: Event A
    Note over F: watermark = 00:02<br/>window [00:00, 00:10) not published yet<br/>A added to window

    Note over P: 5 seconds pass, phone reconnects
    P->>K: Event B (ts=14:00:04, 8s late)
    K->>F: Event B
    Note over F: watermark = 00:07<br/>window [00:00, 00:10) still not published<br/>B added to window

    Note over F: more events arrive<br/>watermark reaches 00:10<br/>time to publish
    F->>PG: INSERT (window=00:00, PU=79, trips=2)
    Note over PG: both events counted
```

Event B arrived late, but within Flink's patience window. Flink hadn't published the result yet, so B was included in the count.

Now what if Event B were 20 seconds late - arriving after Flink already published?

```mermaid
sequenceDiagram
    participant P as Producer
    participant K as Kafka
    participant F as Flink
    participant PG as PostgreSQL

    P->>K: Event A (ts=14:00:07, on time)
    K->>F: Event A
    Note over F: A added to window [00:00, 00:10)

    Note over F: watermark reaches 00:10<br/>time to publish
    F->>PG: INSERT (window=00:00, PU=79, trips=1)
    Note over PG: published with trips=1

    Note over P: 20 seconds later, phone reconnects
    P->>K: Event B (ts=14:00:04, 20s late)
    K->>F: Event B
    Note over F: window [00:00, 00:10) already published<br/>but B still belongs to it
    F->>PG: UPDATE (window=00:00, PU=79, trips=2)
    Note over PG: upsert via PRIMARY KEY<br/>corrected from 1 to 2
```

Flink already published trips=1, but when Event B finally arrives, the PRIMARY KEY lets Flink send a correction. PostgreSQL updates the row from 1 to 2. Without the PRIMARY KEY (an append-only sink), Event B would be lost - Flink can't re-open a published window in append mode.

The trade-off is latency vs completeness. A larger watermark means more patience for late events, but you wait longer before seeing any results. 5 seconds is a reasonable default. In production, you'd tune this based on how out-of-order your data actually is.

Other differences from the pass-through job:

- The sink has a `PRIMARY KEY` with `NOT ENFORCED` - this enables upsert behavior in the Flink JDBC connector.
- `earliest-offset` - reads all existing data from Kafka.
- `env.set_parallelism(3)` - runs 3 copies processing data in parallel.
- The `TUMBLE` function creates fixed-size, non-overlapping windows. `DESCRIPTOR(event_timestamp)` must reference the column with the `WATERMARK` defined on it, and `INTERVAL '1' HOUR` sets the window size.

Submit and test:

```bash
docker compose exec jobmanager ./bin/flink run \
    -py /opt/src/job/aggregation_job.py \
    --pyFiles /opt/src -d
```

Send data:

```bash
uv run python src/producers/producer.py
```

Wait ~15 seconds for the windows to close, then check:

```sql
SELECT window_start, count(*) as locations, sum(num_trips) as total_trips,
       round(sum(total_revenue)::numeric, 2) as revenue
FROM processed_events_aggregated
GROUP BY window_start
ORDER BY window_start;
```

```
     window_start     | locations | total_trips | revenue
----------------------+-----------+-------------+---------
 2025-11-01 00:00:00  | ...
 2025-11-01 01:00:00  | ...
 ...
```

The 1000 taxi trips were grouped into 1-hour tumbling windows by pickup location. Each row shows how many locations had trips in that hour and the total number of trips.
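The window and watermark mechanics from this section can be sketched in plain Python. This is a toy model, not how Flink is implemented: the names `window_start` and `run`, the 10-second window, and the sample events are all hypothetical, and timestamps are plain integers in seconds. Events that arrive after their window was published are only collected in a `late` list here; what happens to them next (dropped or sent as a correction) is left out of the sketch.

```python
from collections import defaultdict

WINDOW_SIZE = 10     # seconds, hypothetical window size
WATERMARK_DELAY = 5  # seconds of patience, like INTERVAL '5' SECOND


def window_start(ts):
    """Start of the tumbling window that contains timestamp ts."""
    return ts - ts % WINDOW_SIZE


def run(events):
    """Process (timestamp, location) events in arrival order."""
    counts = defaultdict(int)  # (window_start, location) -> number of trips
    published = []             # results emitted once a window closes
    late = []                  # events whose window was already published
    closed = set()             # window starts that were already published
    max_ts = 0

    for ts, location in events:
        start = window_start(ts)
        if start in closed:
            late.append((ts, location))
        else:
            counts[(start, location)] += 1

        # The watermark trails the largest timestamp seen so far.
        max_ts = max(max_ts, ts)
        watermark = max_ts - WATERMARK_DELAY

        # Publish every open window whose end is at or before the watermark.
        newly_closed = set()
        for (w_start, loc), trips in sorted(counts.items()):
            if w_start not in closed and w_start + WINDOW_SIZE <= watermark:
                published.append((w_start, loc, trips))
                newly_closed.add(w_start)
        closed |= newly_closed

    return published, late


# Arrival order: A (ts=7), B (ts=4, out of order), two newer events, one straggler.
events = [(7, 79), (4, 79), (12, 79), (16, 79), (3, 79)]
published, late = run(events)
print(published)  # window [0, 10) published with both early events
print(late)       # the straggler with ts=3
```

The event with `ts=4` arrives out of order but before the watermark reaches 10, so it is still counted. The event with `ts=3` arrives after the window was published and ends up in `late`.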
Try this with a plain Python consumer - you'd need to implement the windowing logic, handle late events, manage state, and write the upsert SQL yourself. With Flink, it's a SQL query.

## Late events and upserts

The producer we used so far replays trips from the parquet file in order, so the watermark never has to handle late arrivals. Let's use a real-time producer that generates synthetic events with occasional delays to see what happens.

Download and run the real-time producer:

```bash
PREFIX="https://raw.githubusercontent.com/DataTalksClub/data-engineering-zoomcamp/main/07-streaming/workshop"
wget ${PREFIX}/src/producers/producer_realtime.py -P src/producers/
```

```bash
uv run python src/producers/producer_realtime.py
```

It generates random taxi trips with current timestamps, but ~20% of events are sent with a timestamp 3-10 seconds in the past (simulating network delays). The output labels each event:

```
on time -> PU=79 ts=14:23:05
on time -> PU=107 ts=14:23:05
LATE (8s) -> PU=234 ts=14:22:58
on time -> PU=48 ts=14:23:06
```

With our 5-second watermark and 1-hour windows, no events will be dropped - even an event 10 seconds late lands well within the current hour window. But the watermark + upsert behavior is still visible: Flink first emits window results when the watermark passes the window end, then late events update those results via the PRIMARY KEY.

To see this in action, open two terminals:

Terminal 1 - run the real-time producer:

```bash
uv run python src/producers/producer_realtime.py
```

Terminal 2 - watch aggregation counts change:

```bash
watch -n 1 'PGPASSWORD=postgres docker compose exec postgres psql -U postgres -d postgres -c "SELECT window_start, sum(num_trips) as trips, round(sum(total_revenue)::numeric, 2) as revenue FROM processed_events_aggregated GROUP BY window_start ORDER BY window_start;"'
```

You'll see the counts for older windows increase as late events arrive and update the aggregation via upsert.
This is why we set up the PRIMARY KEY - without it, late events would either be dropped or create duplicates.

## Understanding window types

We used tumbling windows above. Flink supports three types:

### Tumbling windows

Fixed-size, non-overlapping. Every event belongs to exactly one window.

If you come from the batch world, tumbling windows are the most familiar - they just cut up your data into fixed segments. It's essentially a way to speed up batch processing.

```
|  Window 1  |  Window 2  |  Window 3  |
|   1 hour   |   1 hour   |   1 hour   |
```

Use case: Counting trips per hour, daily revenue summaries.

### Sliding windows

Fixed-size, overlapping. An event can belong to multiple windows.

When you think of a 1-hour window, most people think of 00:00-01:00. But there's also 00:15-01:15, 00:30-01:30 - those are also 1-hour windows, just starting at different points. Sliding windows capture all of them.

```
|--- Window 1 (1 hour) ---|
      |--- Window 2 (1 hour) ---|
            |--- Window 3 (1 hour) ---|
<- 15 min slide ->
```

```sql
HOP(TABLE events, DESCRIPTOR(event_timestamp), INTERVAL '15' MINUTE, INTERVAL '1' HOUR)
```

Use case: finding peaks and valleys - "what was our peak traffic in any 1-hour window?" These overlapping windows let you find the moment in time where you have the highest or lowest values. Good for min-maxing, moving averages, and surge detection (e.g., ride-share surge pricing).

### Session windows

Dynamic windows based on inactivity gaps. Unlike tumbling and sliding windows, the window size isn't fixed - the window doesn't close at a specified time, it closes after a specified amount of inactivity.

```
|--events--|  gap  |--events------|  gap  |--events--|
| Session 1|       | Session 2    |       | Session 3|
```

Use case: grouping user behavior together. Imagine a user logs into an app, clicks a bunch of buttons, leaves for 2 minutes, then comes back - that's still technically the same session.
You set a session gap (say, 30 minutes of inactivity) and Flink groups all the events within that session together. Sessionization is very powerful for behavioral analytics.

## Cleanup

Stop and remove all containers:

```bash
docker compose down
```

To also remove the PostgreSQL data volume:

```bash
docker compose down -v
```

## Q&A

Questions and answers from the [2025 stream with Zach Wilson](https://www.youtube.com/watch?v=P2loELMUUeI).

### What happens when a Flink job dies and restarts? Does it reprocess everything?

The `earliest` offset setting is only for the initial startup. If the job restarts (not re-submitted as a new job), it uses checkpointing to resume from the last snapshot. Without checkpointing, you either reprocess everything (with `earliest`) or skip data (with `latest`).

The catch: checkpoints are scoped to a specific job instance. If you completely kill a job and submit a new one, the new job has no knowledge of the previous checkpoints. To preserve state across redeployments, restart the existing job rather than creating a new one.

### Why can't we just use Kafka consumers? What does Flink actually add?

For simple pass-through (read a message, write it somewhere), a Kafka consumer is fine. For anything involving time windows, watermarks, checkpointing, or parallel processing, Flink saves you from building all that yourself.

You can do windowing, watermarking, late data handling, and job recovery with a plain consumer - go ahead and manage it yourself. But as Zach puts it: "good luck."

With a plain consumer, you'd also need to track checkpoints yourself - save the latest processed timestamp to a file or database and manage it on every restart. Flink keeps the state for you.

It's like asking "why use Spark when you can use Pandas?" You can, but Pandas won't work at higher scale in a distributed way.

### What happens with events delayed beyond the watermark (the "tunnel" scenario)?

There are two types of lateness.
The watermark handles acceptable lateness - small delays where events arrive a few seconds late. For events arriving much later (like after a 5-minute tunnel), Flink has an allowed lateness parameter.

By default, allowed lateness is zero - events arriving after the watermark closes a window are discarded. If you set allowed lateness to 10 minutes, Flink will go back, find the old closed window, create a new aggregation with the late event, and send it to the sink as a brand new record. This means you need deduplication logic on the sink side (a primary key with upsert behavior - exactly what we set up in the aggregation section).

The trade-off: allowed lateness requires Flink to keep the state of all those windows around (in memory or on disk, depending on the state backend) for the duration of the tolerance.

### When do we actually need streaming? For many things micro-batch is enough.

The key question: is something going to happen in real time on the other side? If there is an automated process that will change something based on the data, streaming is a great choice. If a human is just looking at data, real-time is unnecessary and micro-batch is easier to maintain.

In 10 years as a data engineer, Zach had literally two use cases that genuinely needed streaming - Netflix fraud/security detection (5 minutes of delay means 5 more minutes of a hacked account) and Airbnb surge pricing (supply and demand changes rapidly). Everything else was daily batch, or hourly/every-15-minute micro-batch for lower latency needs.

Before committing to streaming, consider the operational cost. A streaming job runs 24/7 - if it breaks at 3 AM, someone needs to fix it. If you're the only person on the team who understands Flink, you'll be on-call for it forever. Talk to your manager before implementing streaming - you'll need to teach your entire team before you can share the on-call burden.

### Spark Streaming vs Flink Streaming?

They are fundamentally different today but will likely converge.
The key difference: Spark Streaming is micro-batch - it pulses every 15-30 seconds, pulling data in small batches (pull architecture). Flink is genuine continuous processing - events flow through as they arrive (push architecture). For most use cases the difference is negligible, but Flink has lower latency for truly real-time needs.

For micro-batch intervals, Zach finds every-5-minutes too frequent with Spark because startup alone takes about a minute, making the overhead-to-work ratio poor. His sweet spots are hourly and every 15 minutes.

### How does job submission work in production?

In this workshop we mount local files into Docker and submit jobs with `docker compose exec` - that's a development convenience. In production, job submission looks different depending on the deployment:

- Managed services (Amazon Managed Service for Apache Flink, formerly Kinesis Data Analytics; Google Cloud Dataflow; Confluent Cloud) - you upload a JAR or Python zip through a web console or CLI. The service handles the cluster.
- Self-hosted Flink on Kubernetes - you typically build a Docker image with your job code baked in, or use the Flink Kubernetes Operator, which pulls job artifacts from S3/GCS at startup.
- Standalone Flink cluster - you use the `flink run` CLI pointing to a local file or an HTTP/S3 URL. CI/CD pipelines often upload the job artifact to S3 and then call `flink run` with that URL.

The common pattern: your code lives in git, CI builds an artifact (JAR, Python zip, or Docker image), pushes it to a registry or object store, and then triggers the Flink cluster to pick it up.
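
The "build an artifact, then trigger the cluster" pattern can be sketched in a few lines of Python. This is only an illustration, not how the workshop or any particular CI system does it: the function names `package_job` and `build_submit_command` and all paths are hypothetical, the upload-to-object-store step is left out, and the sketch assumes a standalone cluster where the `flink` CLI is on the `PATH`. The flags used are the standard `flink run` options `-d` (detached), `-py` (Python entry point), and `-pyfs` (extra Python files).

```python
import shutil


def package_job(src_dir: str, out_base: str) -> str:
    """Zip the job source tree and return the path of the created archive."""
    # make_archive appends ".zip" to out_base and returns the final file name.
    return shutil.make_archive(out_base, "zip", root_dir=src_dir)


def build_submit_command(entrypoint: str, py_files: str, flink_bin: str = "flink") -> list[str]:
    """Build (but do not run) the CLI call that submits a PyFlink job."""
    return [
        flink_bin, "run",
        "-d",              # detached: return as soon as the job is submitted
        "-py", entrypoint,  # Python script containing the job definition
        "-pyfs", py_files,  # zip (or directory) with the job's supporting modules
    ]


if __name__ == "__main__":
    # Hypothetical paths; a real pipeline would take these from CI variables.
    cmd = build_submit_command("src/job/aggregation_job.py", "build/job_bundle.zip")
    print(" ".join(cmd))
```

A CI step would call `package_job`, push the archive wherever the cluster can reach it, and then pass the command list to `subprocess.run(cmd, check=True)` so a failed submission fails the pipeline.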

================================================
FILE: 07-streaming/workshop/docker-compose.yml
================================================
services:
  redpanda:
    image: redpandadata/redpanda:v25.3.9
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda:33145
    ports:
      - 8082:8082
      - 9092:9092
      - 28082:28082
      - 29092:29092

  jobmanager:
    build:
      context: .
      dockerfile: ./Dockerfile.flink
    image: pyflink-workshop
    pull_policy: never
    expose:
      - "6123"
    ports:
      - "8081:8081"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        jobmanager.memory.process.size: 1600m

  taskmanager:
    image: pyflink-workshop
    pull_policy: never
    expose:
      - "6121"
      - "6122"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    depends_on:
      - jobmanager
    command: taskmanager --taskmanager.registration.timeout 5 min
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.memory.process.size: 1728m
        taskmanager.numberOfTaskSlots: 15
        parallelism.default: 3

  postgres:
    image: postgres:18
    restart: on-failure
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - "5432:5432"

================================================
FILE: 07-streaming/workshop/flink-config.yaml
================================================
# Custom Flink config for PyFlink workshop.
# Original: https://github.com/apache/flink/blob/release-2.2/flink-dist/src/main/resources/config.yaml
# Changes from default:
# 1. Added taskmanager.memory.jvm-metaspace.size: 512m (PyFlink needs more metaspace)
# 2. Removed --add-exports=jdk.compiler/... from env.java.opts.all
# (jdk.compiler module is not present in the JRE, causing warnings on every command)

blob:
  server:
    port: '6124'

taskmanager:
  memory:
    process:
      size: 1728m
    jvm-metaspace:
      size: 512m # added for PyFlink
  bind-host: 0.0.0.0
  numberOfTaskSlots: 15

jobmanager:
  execution:
    failover-strategy: region
  rpc:
    address: jobmanager
    port: 6123
  memory:
    process:
      size: 1600m
  bind-host: 0.0.0.0

query:
  server:
    port: '6125'

parallelism:
  default: 1

rest:
  address: 0.0.0.0

env:
  java:
    opts:
      all: >-
        --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED
        --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
        --add-opens=java.base/java.lang=ALL-UNNAMED
        --add-opens=java.base/java.net=ALL-UNNAMED
        --add-opens=java.base/java.io=ALL-UNNAMED
        --add-opens=java.base/java.nio=ALL-UNNAMED
        --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
        --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
        --add-opens=java.base/java.text=ALL-UNNAMED
        --add-opens=java.base/java.time=ALL-UNNAMED
        --add-opens=java.base/java.util=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED

================================================
FILE: 07-streaming/workshop/live/.gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[codz]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py.cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# UV
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
#uv.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
#poetry.toml

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
# https://pdm-project.org/en/latest/usage/project/#working-with-version-control
#pdm.lock
#pdm.toml
.pdm-python
.pdm-build/

# pixi
# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
#pixi.lock
# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
# in the .venv directory. It is recommended not to include this directory in version control.
.pixi

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.envrc
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# Abstra
# Abstra is an AI-powered process automation framework.
# Ignore directories containing user credentials, local state, and settings.
# Learn more at https://abstra.io/docs
.abstra/

# Visual Studio Code
# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
# and can be added to the global gitignore or merged into this file. However, if you prefer,
# you could uncomment the following to ignore the entire vscode folder
# .vscode/

# Ruff stuff:
.ruff_cache/

# PyPI configuration file
.pypirc

# Cursor
# Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
# exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
# refer to https://docs.cursor.com/context/ignore-files
.cursorignore
.cursorindexingignore

# Marimo
marimo/_static/
marimo/_lsp/
__marimo__/

================================================
FILE: 07-streaming/workshop/live/.python-version
================================================
3.12

================================================
FILE: 07-streaming/workshop/live/Dockerfile.flink
================================================
FROM flink:2.2.0-scala_2.12-java17

COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/

# ref: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker
WORKDIR /opt/pyflink
COPY pyproject.flink.toml pyproject.toml
RUN uv python install 3.12 && uv sync
ENV PATH="/opt/pyflink/.venv/bin:$PATH"

# Download connector libraries
WORKDIR /opt/flink/lib
RUN wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/2.2.0/flink-json-2.2.0.jar; \
    wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/4.0.1-2.0/flink-sql-connector-kafka-4.0.1-2.0.jar; \
    wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-core/4.0.0-2.0/flink-connector-jdbc-core-4.0.0-2.0.jar; \
    wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-postgres/4.0.0-2.0/flink-connector-jdbc-postgres-4.0.0-2.0.jar; \
    wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.10/postgresql-42.7.10.jar

COPY flink-config.yaml /opt/flink/conf/config.yaml

WORKDIR /opt/flink

================================================
FILE: 07-streaming/workshop/live/README.md
================================================
# streaming-workshop

================================================
FILE: 07-streaming/workshop/live/docker-compose.yaml
================================================
services:
  redpanda:
    image: redpandadata/redpanda:v25.3.9
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
      - --pandaproxy-addr
      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082
      - --advertise-pandaproxy-addr
      - PLAINTEXT://redpanda:28082,OUTSIDE://localhost:8082
      - --rpc-addr
      - 0.0.0.0:33145
      - --advertise-rpc-addr
      - redpanda:33145
    ports:
      - 8082:8082
      - 9092:9092
      - 28082:28082
      - 29092:29092

  postgres:
    image: postgres:18
    restart: on-failure
    environment:
      - POSTGRES_DB=postgres
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
    ports:
      - "5432:5432"

  jobmanager:
    build:
      context: .
      dockerfile: ./Dockerfile.flink
    image: pyflink-workshop
    pull_policy: never
    expose:
      - "6123"
    ports:
      - "8081:8081"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        jobmanager.memory.process.size: 1600m

  taskmanager:
    image: pyflink-workshop
    pull_policy: never
    expose:
      - "6121"
      - "6122"
    volumes:
      - ./:/opt/flink/usrlib
      - ./src/:/opt/src
    depends_on:
      - jobmanager
    command: taskmanager --taskmanager.registration.timeout 5 min
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.memory.process.size: 1728m
        taskmanager.numberOfTaskSlots: 15
        parallelism.default: 3

================================================
FILE: 07-streaming/workshop/live/flink-config.yaml
================================================
# Custom Flink config for PyFlink workshop.
# Original: https://github.com/apache/flink/blob/release-2.2/flink-dist/src/main/resources/config.yaml
# Changes from default:
# 1. Added taskmanager.memory.jvm-metaspace.size: 512m (PyFlink needs more metaspace)
# 2. Removed --add-exports=jdk.compiler/... from env.java.opts.all
# (jdk.compiler module is not present in the JRE, causing warnings on every command)

blob:
  server:
    port: '6124'

taskmanager:
  memory:
    process:
      size: 1728m
    jvm-metaspace:
      size: 512m # added for PyFlink
  bind-host: 0.0.0.0
  numberOfTaskSlots: 15

jobmanager:
  execution:
    failover-strategy: region
  rpc:
    address: jobmanager
    port: 6123
  memory:
    process:
      size: 1600m
  bind-host: 0.0.0.0

query:
  server:
    port: '6125'

parallelism:
  default: 1

rest:
  address: 0.0.0.0

env:
  java:
    opts:
      all: >-
        --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED
        --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
        --add-opens=java.base/java.lang=ALL-UNNAMED
        --add-opens=java.base/java.net=ALL-UNNAMED
        --add-opens=java.base/java.io=ALL-UNNAMED
        --add-opens=java.base/java.nio=ALL-UNNAMED
        --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
        --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
        --add-opens=java.base/java.text=ALL-UNNAMED
        --add-opens=java.base/java.time=ALL-UNNAMED
        --add-opens=java.base/java.util=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
        --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED

================================================
FILE: 07-streaming/workshop/live/main.py
================================================
def main():
    print("Hello from streaming-workshop!")


if __name__ == "__main__":
    main()

================================================
FILE: 07-streaming/workshop/live/notebooks/consumer_db.ipynb
================================================
{ "cells": [ { "cell_type": "code", "execution_count": 2, "id": "c77749d8", "metadata": {}, "outputs": [], "source": [ "from kafka import KafkaConsumer\n", "\n",
"server = 'localhost:9092'\n", "topic_name = 'rides'" ] }, { "cell_type": "code", "execution_count": 3, "id": "74dcdffe", "metadata": {}, "outputs": [], "source": [ "from models import Ride, ride_deserializer" ] }, { "cell_type": "code", "execution_count": null, "id": "00726e41", "metadata": {}, "outputs": [], "source": [ "consumer = KafkaConsumer(\n", " topic_name,\n", " bootstrap_servers=[server],\n", " auto_offset_reset='earliest',\n", " group_id='rides-database',\n", " value_deserializer=ride_deserializer\n", ")" ] }, { "cell_type": "code", "execution_count": 1, "id": "a2cf7106", "metadata": {}, "outputs": [], "source": [ "import psycopg2\n", "\n", "conn = psycopg2.connect(\n", " host='localhost',\n", " port=5432,\n", " database='postgres',\n", " user='postgres',\n", " password='postgres'\n", ")\n", "\n", "conn.autocommit = True\n", "cur = conn.cursor()" ] }, { "cell_type": "code", "execution_count": 5, "id": "f0902406", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Listening to rides and writing to PostgreSQL...\n", "Inserted 100 rows...\n", "Inserted 200 rows...\n", "Inserted 300 rows...\n", "Inserted 400 rows...\n", "Inserted 500 rows...\n", "Inserted 600 rows...\n", "Inserted 700 rows...\n", "Inserted 800 rows...\n", "Inserted 900 rows...\n", "Inserted 1000 rows...\n" ] }, { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[31m---------------------------------------------------------------------------\u001b[39m", "\u001b[31mKeyboardInterrupt\u001b[39m Traceback (most recent call last)", "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[5]\u001b[39m\u001b[32m, line 6\u001b[39m\n\u001b[32m 3\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33mf\u001b[39m\u001b[33m\"\u001b[39m\u001b[33mListening to \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mtopic_name\u001b[38;5;132;01m}\u001b[39;00m\u001b[33m and writing to PostgreSQL...\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 
5\u001b[39m count = \u001b[32m0\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m6\u001b[39m \u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mmessage\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mconsumer\u001b[49m\u001b[43m:\u001b[49m\n\u001b[32m 7\u001b[39m \u001b[43m \u001b[49m\u001b[43mride\u001b[49m\u001b[43m \u001b[49m\u001b[43m=\u001b[49m\u001b[43m \u001b[49m\u001b[43mmessage\u001b[49m\u001b[43m.\u001b[49m\u001b[43mvalue\u001b[49m\n\u001b[32m 8\u001b[39m \u001b[43m \u001b[49m\u001b[43mpickup_dt\u001b[49m\u001b[43m \u001b[49m\u001b[43m=\u001b[49m\u001b[43m \u001b[49m\u001b[43mdatetime\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfromtimestamp\u001b[49m\u001b[43m(\u001b[49m\u001b[43mride\u001b[49m\u001b[43m.\u001b[49m\u001b[43mtpep_pickup_datetime\u001b[49m\u001b[43m \u001b[49m\u001b[43m/\u001b[49m\u001b[43m \u001b[49m\u001b[32;43m1000\u001b[39;49m\u001b[43m)\u001b[49m\n", "\u001b[36mFile \u001b[39m\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:1213\u001b[39m, in \u001b[36mKafkaConsumer.__next__\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 1211\u001b[39m \u001b[38;5;28mself\u001b[39m._iterator = \u001b[38;5;28mself\u001b[39m._message_generator_v2()\n\u001b[32m 1212\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1213\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mnext\u001b[39;49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_iterator\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1214\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mStopIteration\u001b[39;00m:\n\u001b[32m 1215\u001b[39m \u001b[38;5;28mself\u001b[39m._iterator = \u001b[38;5;28;01mNone\u001b[39;00m\n", "\u001b[36mFile \u001b[39m\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:1185\u001b[39m, in 
\u001b[36mKafkaConsumer._message_generator_v2\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 1183\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m_message_generator_v2\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[32m 1184\u001b[39m timeout_ms = \u001b[32m1000\u001b[39m * \u001b[38;5;28mmax\u001b[39m(\u001b[32m0\u001b[39m, \u001b[38;5;28mself\u001b[39m._consumer_timeout - time.time())\n\u001b[32m-> \u001b[39m\u001b[32m1185\u001b[39m record_map = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mpoll\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtimeout_ms\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtimeout_ms\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mupdate_offsets\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m)\u001b[49m\n\u001b[32m 1186\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m tp, records \u001b[38;5;129;01min\u001b[39;00m six.iteritems(record_map):\n\u001b[32m 1187\u001b[39m \u001b[38;5;66;03m# Generators are stateful, and it is possible that the tp / records\u001b[39;00m\n\u001b[32m 1188\u001b[39m \u001b[38;5;66;03m# here may become stale during iteration -- i.e., we seek to a\u001b[39;00m\n\u001b[32m 1189\u001b[39m \u001b[38;5;66;03m# different offset, pause consumption, or lose assignment.\u001b[39;00m\n\u001b[32m 1190\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m record \u001b[38;5;129;01min\u001b[39;00m records:\n\u001b[32m 1191\u001b[39m \u001b[38;5;66;03m# is_fetchable(tp) should handle assignment changes and offset\u001b[39;00m\n\u001b[32m 1192\u001b[39m \u001b[38;5;66;03m# resets; for all other changes (e.g., seeks) we'll rely on the\u001b[39;00m\n\u001b[32m 1193\u001b[39m \u001b[38;5;66;03m# outer function destroying the existing iterator/generator\u001b[39;00m\n\u001b[32m 1194\u001b[39m \u001b[38;5;66;03m# via self._iterator = None\u001b[39;00m\n", "\u001b[36mFile 
\u001b[39m\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:708\u001b[39m, in \u001b[36mKafkaConsumer.poll\u001b[39m\u001b[34m(self, timeout_ms, max_records, update_offsets)\u001b[39m\n\u001b[32m 706\u001b[39m timer = Timer(timeout_ms)\n\u001b[32m 707\u001b[39m \u001b[38;5;28;01mwhile\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mself\u001b[39m._closed:\n\u001b[32m--> \u001b[39m\u001b[32m708\u001b[39m records = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_poll_once\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtimer\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmax_records\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mupdate_offsets\u001b[49m\u001b[43m=\u001b[49m\u001b[43mupdate_offsets\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 709\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m records:\n\u001b[32m 710\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m records\n", "\u001b[36mFile \u001b[39m\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:757\u001b[39m, in \u001b[36mKafkaConsumer._poll_once\u001b[39m\u001b[34m(self, timer, max_records, update_offsets)\u001b[39m\n\u001b[32m 754\u001b[39m log.debug(\u001b[33m'\u001b[39m\u001b[33mpoll: do not have all fetch positions...\u001b[39m\u001b[33m'\u001b[39m)\n\u001b[32m 755\u001b[39m poll_timeout_ms = \u001b[38;5;28mmin\u001b[39m(poll_timeout_ms, \u001b[38;5;28mself\u001b[39m.config[\u001b[33m'\u001b[39m\u001b[33mretry_backoff_ms\u001b[39m\u001b[33m'\u001b[39m])\n\u001b[32m--> \u001b[39m\u001b[32m757\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_client\u001b[49m\u001b[43m.\u001b[49m\u001b[43mpoll\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtimeout_ms\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpoll_timeout_ms\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 758\u001b[39m \u001b[38;5;66;03m# after the long poll, we should check whether the group needs 
to rebalance\u001b[39;00m\n\u001b[32m 759\u001b[39m \u001b[38;5;66;03m# prior to returning data so that the group can stabilize faster\u001b[39;00m\n\u001b[32m 760\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m._coordinator.need_rejoin():\n", "\u001b[36mFile \u001b[39m\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/client_async.py:685\u001b[39m, in \u001b[36mKafkaClient.poll\u001b[39m\u001b[34m(self, timeout_ms, future)\u001b[39m\n\u001b[32m 678\u001b[39m timeout = \u001b[38;5;28mmin\u001b[39m(\n\u001b[32m 679\u001b[39m user_timeout_ms,\n\u001b[32m 680\u001b[39m metadata_timeout_ms,\n\u001b[32m 681\u001b[39m idle_connection_timeout_ms,\n\u001b[32m 682\u001b[39m request_timeout_ms)\n\u001b[32m 683\u001b[39m timeout = \u001b[38;5;28mmax\u001b[39m(\u001b[32m0\u001b[39m, timeout) \u001b[38;5;66;03m# avoid negative timeouts\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m685\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_poll\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtimeout\u001b[49m\u001b[43m \u001b[49m\u001b[43m/\u001b[49m\u001b[43m \u001b[49m\u001b[32;43m1000\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[32m 687\u001b[39m \u001b[38;5;66;03m# called without the lock to avoid deadlock potential\u001b[39;00m\n\u001b[32m 688\u001b[39m \u001b[38;5;66;03m# if handlers need to acquire locks\u001b[39;00m\n\u001b[32m 689\u001b[39m responses.extend(\u001b[38;5;28mself\u001b[39m._fire_pending_completed_requests())\n", "\u001b[36mFile \u001b[39m\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/client_async.py:781\u001b[39m, in \u001b[36mKafkaClient._poll\u001b[39m\u001b[34m(self, timeout)\u001b[39m\n\u001b[32m 778\u001b[39m \u001b[38;5;28;01mcontinue\u001b[39;00m\n\u001b[32m 780\u001b[39m \u001b[38;5;28mself\u001b[39m._idle_expiry_manager.update(conn.node_id)\n\u001b[32m--> \u001b[39m\u001b[32m781\u001b[39m 
\u001b[38;5;28mself\u001b[39m._pending_completion.extend(\u001b[43mconn\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrecv\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m)\n\u001b[32m 783\u001b[39m \u001b[38;5;66;03m# Check for additional pending SSL bytes\u001b[39;00m\n\u001b[32m 784\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m.config[\u001b[33m'\u001b[39m\u001b[33msecurity_protocol\u001b[39m\u001b[33m'\u001b[39m] \u001b[38;5;129;01min\u001b[39;00m (\u001b[33m'\u001b[39m\u001b[33mSSL\u001b[39m\u001b[33m'\u001b[39m, \u001b[33m'\u001b[39m\u001b[33mSASL_SSL\u001b[39m\u001b[33m'\u001b[39m):\n\u001b[32m 785\u001b[39m \u001b[38;5;66;03m# TODO: optimize\u001b[39;00m\n", "\u001b[36mFile \u001b[39m\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/conn.py:1131\u001b[39m, in \u001b[36mBrokerConnection.recv\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 1126\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mrecv\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[32m 1127\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"Non-blocking network receive.\u001b[39;00m\n\u001b[32m 1128\u001b[39m \n\u001b[32m 1129\u001b[39m \u001b[33;03m Return list of (response, future) tuples\u001b[39;00m\n\u001b[32m 1130\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m1131\u001b[39m responses = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_recv\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1132\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m responses \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mself\u001b[39m.requests_timed_out():\n\u001b[32m 1133\u001b[39m timed_out = \u001b[38;5;28mself\u001b[39m.timed_out_ifrs()\n", "\u001b[36mFile \u001b[39m\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/conn.py:1202\u001b[39m, in 
\u001b[36mBrokerConnection._recv\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 1200\u001b[39m recvd_data = \u001b[33mb\u001b[39m\u001b[33m'\u001b[39m\u001b[33m'\u001b[39m.join(recvd)\n\u001b[32m 1201\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m._sensors:\n\u001b[32m-> \u001b[39m\u001b[32m1202\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_sensors\u001b[49m\u001b[43m.\u001b[49m\u001b[43mbytes_received\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrecord\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mlen\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mrecvd_data\u001b[49m\u001b[43m)\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1204\u001b[39m \u001b[38;5;66;03m# We need to keep the lock through protocol receipt\u001b[39;00m\n\u001b[32m 1205\u001b[39m \u001b[38;5;66;03m# so that we ensure that the processed byte order is the\u001b[39;00m\n\u001b[32m 1206\u001b[39m \u001b[38;5;66;03m# same as the received byte order\u001b[39;00m\n\u001b[32m 1207\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n", "\u001b[36mFile \u001b[39m\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/metrics/stats/sensor.py:77\u001b[39m, in \u001b[36mSensor.record\u001b[39m\u001b[34m(self, value, time_ms)\u001b[39m\n\u001b[32m 74\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28mself\u001b[39m._lock: \u001b[38;5;66;03m# XXX high volume, might be performance issue\u001b[39;00m\n\u001b[32m 75\u001b[39m \u001b[38;5;66;03m# increment all the stats\u001b[39;00m\n\u001b[32m 76\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m stat \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m._stats:\n\u001b[32m---> \u001b[39m\u001b[32m77\u001b[39m \u001b[43mstat\u001b[49m\u001b[43m.\u001b[49m\u001b[43mrecord\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_config\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mvalue\u001b[49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[43mtime_ms\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 78\u001b[39m \u001b[38;5;28mself\u001b[39m._check_quotas(time_ms)\n\u001b[32m 79\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m parent \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m._parents:\n", "\u001b[36mFile \u001b[39m\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/metrics/stats/rate.py:49\u001b[39m, in \u001b[36mRate.record\u001b[39m\u001b[34m(self, config, value, time_ms)\u001b[39m\n\u001b[32m 46\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34munit_name\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[32m 47\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m TimeUnit.get_name(\u001b[38;5;28mself\u001b[39m._unit)\n\u001b[32m---> \u001b[39m\u001b[32m49\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mrecord\u001b[39m(\u001b[38;5;28mself\u001b[39m, config, value, time_ms):\n\u001b[32m 50\u001b[39m \u001b[38;5;28mself\u001b[39m._stat.record(config, value, time_ms)\n\u001b[32m 52\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mmeasure\u001b[39m(\u001b[38;5;28mself\u001b[39m, config, now):\n", "\u001b[31mKeyboardInterrupt\u001b[39m: " ] } ], "source": [ "from datetime import datetime\n", "\n", "print(f\"Listening to {topic_name} and writing to PostgreSQL...\")\n", "\n", "count = 0\n", "for message in consumer:\n", " ride = message.value\n", " pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)\n", " cur.execute(\n", " \"\"\"INSERT INTO processed_events\n", " (PULocationID, DOLocationID, trip_distance, total_amount, pickup_datetime)\n", " VALUES (%s, %s, %s, %s, %s)\"\"\",\n", " (ride.PULocationID, ride.DOLocationID,\n", " ride.trip_distance, ride.total_amount, pickup_dt)\n", " )\n", " count += 1\n", " if count % 100 == 0:\n", " print(f\"Inserted {count} rows...\")\n", "\n", "consumer.close()\n", "cur.close()\n", "conn.close()" ] }, { 
"cell_type": "code", "execution_count": null, "id": "66840c80", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2bec0472", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "streaming-workshop", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" } }, "nbformat": 4, "nbformat_minor": 5 } ================================================ FILE: 07-streaming/workshop/live/notebooks/models.py ================================================ import json import dataclasses from dataclasses import dataclass @dataclass class Ride: PULocationID: int DOLocationID: int trip_distance: float total_amount: float tpep_pickup_datetime: int # epoch milliseconds def ride_from_row(row): return Ride( PULocationID=int(row['PULocationID']), DOLocationID=int(row['DOLocationID']), trip_distance=float(row['trip_distance']), total_amount=float(row['total_amount']), tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000), ) def ride_serializer(ride): ride_dict = dataclasses.asdict(ride) ride_json = json.dumps(ride_dict).encode('utf-8') return ride_json def ride_deserializer(data): json_str = data.decode('utf-8') ride_dict = json.loads(json_str) return Ride(**ride_dict) ================================================ FILE: 07-streaming/workshop/live/notebooks/producer.ipynb ================================================ { "cells": [ { "cell_type": "code", "execution_count": 2, "id": "eebfcff0", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "id": "1e3c198b", "metadata": {}, "outputs": [], "source": [ "url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet'" ] }, 
{ "cell_type": "code", "execution_count": 4, "id": "2113c0a9", "metadata": {}, "outputs": [], "source": [ "columns = ['PULocationID', 'DOLocationID', 'trip_distance', 'total_amount', 'tpep_pickup_datetime']\n", "df = pd.read_parquet(url, columns=columns).head(1000)" ] }, { "cell_type": "code", "execution_count": 39, "id": "05ed66d7", "metadata": {}, "outputs": [], "source": [ "from models import Ride, ride_from_row, ride_serializer" ] }, { "cell_type": "code", "execution_count": 40, "id": "26950bac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ride(PULocationID=142, DOLocationID=237, trip_distance=2.28, total_amount=24.94, tpep_pickup_datetime=1761958147000)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ride = ride_from_row(df.iloc[1])\n", "ride" ] }, { "cell_type": "code", "execution_count": 41, "id": "05cfce95", "metadata": {}, "outputs": [], "source": [ "from kafka import KafkaProducer\n", "\n", "server = 'localhost:9092'\n", "\n", "producer = KafkaProducer(\n", " bootstrap_servers=[server],\n", " value_serializer=ride_serializer\n", ")" ] }, { "cell_type": "code", "execution_count": 46, "id": "21f5fff3", "metadata": {}, "outputs": [], "source": [ "topic_name = 'rides'\n", "\n", "producer.send(topic_name, value=ride)\n", "producer.flush()" ] }, { "cell_type": "code", "execution_count": 48, "id": "b17a175a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sent: Ride(PULocationID=43, DOLocationID=186, trip_distance=1.68, total_amount=22.15, tpep_pickup_datetime=1761956005000)\n", "Sent: Ride(PULocationID=142, DOLocationID=237, trip_distance=2.28, total_amount=24.94, tpep_pickup_datetime=1761958147000)\n", "Sent: Ride(PULocationID=163, DOLocationID=238, trip_distance=2.7, total_amount=25.62, tpep_pickup_datetime=1761955639000)\n", "Sent: Ride(PULocationID=138, DOLocationID=261, trip_distance=12.87, total_amount=86.14, tpep_pickup_datetime=1761955200000)\n", "Sent: 
Ride(PULocationID=138, DOLocationID=37, trip_distance=8.4, total_amount=48.65, tpep_pickup_datetime=1761956330000)\n", "Sent: Ride(PULocationID=90, DOLocationID=100, trip_distance=0.85, total_amount=16.45, tpep_pickup_datetime=1761956471000)\n", "Sent: Ride(PULocationID=142, DOLocationID=170, trip_distance=3.01, total_amount=25.85, tpep_pickup_datetime=1761955651000)\n", "Sent: Ride(PULocationID=237, DOLocationID=144, trip_distance=3.82, total_amount=57.54, tpep_pickup_datetime=1761958012000)\n", "Sent: Ride(PULocationID=162, DOLocationID=161, trip_distance=0.89, total_amount=12.95, tpep_pickup_datetime=1761958619000)\n", "Sent: Ride(PULocationID=234, DOLocationID=162, trip_distance=2.28, total_amount=38.68, tpep_pickup_datetime=1761955843000)\n", "Sent: Ride(PULocationID=158, DOLocationID=88, trip_distance=3.3, total_amount=44.0, tpep_pickup_datetime=1761955203000)\n", "Sent: Ride(PULocationID=88, DOLocationID=148, trip_distance=1.5, total_amount=19.55, tpep_pickup_datetime=1761957833000)\n", "Sent: Ride(PULocationID=148, DOLocationID=236, trip_distance=4.7, total_amount=47.65, tpep_pickup_datetime=1761958682000)\n", "Sent: Ride(PULocationID=87, DOLocationID=255, trip_distance=5.61, total_amount=38.85, tpep_pickup_datetime=1761958368000)\n", "Sent: Ride(PULocationID=231, DOLocationID=43, trip_distance=3.9, total_amount=46.55, tpep_pickup_datetime=1761955553000)\n", "Sent: Ride(PULocationID=141, DOLocationID=262, trip_distance=1.14, total_amount=14.9, tpep_pickup_datetime=1761956024000)\n", "Sent: Ride(PULocationID=238, DOLocationID=24, trip_distance=0.6, total_amount=9.12, tpep_pickup_datetime=1761955398000)\n", "Sent: Ride(PULocationID=236, DOLocationID=147, trip_distance=4.3, total_amount=29.2, tpep_pickup_datetime=1761956395000)\n", "Sent: Ride(PULocationID=231, DOLocationID=137, trip_distance=3.0, total_amount=32.75, tpep_pickup_datetime=1761957955000)\n", "Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.69, total_amount=11.5, 
tpep_pickup_datetime=1761955872000)\n", "Sent: Ride(PULocationID=132, DOLocationID=265, trip_distance=15.47, total_amount=106.63, tpep_pickup_datetime=1761955521000)\n", "Sent: Ride(PULocationID=79, DOLocationID=125, trip_distance=1.29, total_amount=22.26, tpep_pickup_datetime=1761955760000)\n", "Sent: Ride(PULocationID=158, DOLocationID=79, trip_distance=1.66, total_amount=32.34, tpep_pickup_datetime=1761957539000)\n", "Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.25, total_amount=22.25, tpep_pickup_datetime=1761958533000)\n", "Sent: Ride(PULocationID=142, DOLocationID=249, trip_distance=2.68, total_amount=48.68, tpep_pickup_datetime=1761956184000)\n", "Sent: Ride(PULocationID=4, DOLocationID=48, trip_distance=3.16, total_amount=33.15, tpep_pickup_datetime=1761958409000)\n", "Sent: Ride(PULocationID=48, DOLocationID=24, trip_distance=2.8, total_amount=24.55, tpep_pickup_datetime=1761956650000)\n", "Sent: Ride(PULocationID=143, DOLocationID=169, trip_distance=7.45, total_amount=44.04, tpep_pickup_datetime=1761956178000)\n", "Sent: Ride(PULocationID=140, DOLocationID=142, trip_distance=2.02, total_amount=17.8, tpep_pickup_datetime=1761957454000)\n", "Sent: Ride(PULocationID=107, DOLocationID=90, trip_distance=3.46, total_amount=35.7, tpep_pickup_datetime=1761957360000)\n", "Sent: Ride(PULocationID=50, DOLocationID=263, trip_distance=2.89, total_amount=26.46, tpep_pickup_datetime=1761956052000)\n", "Sent: Ride(PULocationID=234, DOLocationID=68, trip_distance=1.2, total_amount=27.3, tpep_pickup_datetime=1761956041000)\n", "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.2, total_amount=13.02, tpep_pickup_datetime=1761957285000)\n", "Sent: Ride(PULocationID=68, DOLocationID=233, trip_distance=2.57, total_amount=26.15, tpep_pickup_datetime=1761957826000)\n", "Sent: Ride(PULocationID=140, DOLocationID=233, trip_distance=1.4, total_amount=24.75, tpep_pickup_datetime=1761956818000)\n", "Sent: Ride(PULocationID=233, DOLocationID=170, 
trip_distance=0.3, total_amount=13.0, tpep_pickup_datetime=1761957934000)\n", "Sent: Ride(PULocationID=170, DOLocationID=137, trip_distance=0.7, total_amount=17.7, tpep_pickup_datetime=1761958462000)\n", "Sent: Ride(PULocationID=141, DOLocationID=236, trip_distance=0.92, total_amount=13.8, tpep_pickup_datetime=1761956649000)\n", "Sent: Ride(PULocationID=43, DOLocationID=151, trip_distance=2.21, total_amount=17.4, tpep_pickup_datetime=1761957030000)\n", "Sent: Ride(PULocationID=151, DOLocationID=116, trip_distance=2.62, total_amount=21.75, tpep_pickup_datetime=1761957624000)\n", "Sent: Ride(PULocationID=164, DOLocationID=114, trip_distance=1.58, total_amount=35.25, tpep_pickup_datetime=1761957104000)\n", "Sent: Ride(PULocationID=138, DOLocationID=74, trip_distance=6.51, total_amount=52.08, tpep_pickup_datetime=1761958025000)\n", "Sent: Ride(PULocationID=166, DOLocationID=262, trip_distance=3.19, total_amount=27.24, tpep_pickup_datetime=1761956016000)\n", "Sent: Ride(PULocationID=238, DOLocationID=238, trip_distance=0.46, total_amount=12.12, tpep_pickup_datetime=1761956726000)\n", "Sent: Ride(PULocationID=238, DOLocationID=166, trip_distance=1.2, total_amount=17.16, tpep_pickup_datetime=1761957605000)\n", "Sent: Ride(PULocationID=66, DOLocationID=246, trip_distance=4.4, total_amount=30.35, tpep_pickup_datetime=1761955924000)\n", "Sent: Ride(PULocationID=68, DOLocationID=239, trip_distance=3.5, total_amount=29.85, tpep_pickup_datetime=1761958430000)\n", "Sent: Ride(PULocationID=233, DOLocationID=161, trip_distance=0.66, total_amount=14.25, tpep_pickup_datetime=1761956248000)\n", "Sent: Ride(PULocationID=43, DOLocationID=239, trip_distance=1.13, total_amount=16.38, tpep_pickup_datetime=1761956861000)\n", "Sent: Ride(PULocationID=238, DOLocationID=166, trip_distance=1.22, total_amount=13.32, tpep_pickup_datetime=1761957749000)\n", "Sent: Ride(PULocationID=138, DOLocationID=48, trip_distance=11.29, total_amount=81.18, tpep_pickup_datetime=1761958324000)\n", "Sent: 
Ride(PULocationID=237, DOLocationID=262, trip_distance=0.86, total_amount=14.2, tpep_pickup_datetime=1761955542000)\n", "Sent: Ride(PULocationID=234, DOLocationID=249, trip_distance=1.47, total_amount=24.78, tpep_pickup_datetime=1761958550000)\n", "Sent: Ride(PULocationID=137, DOLocationID=164, trip_distance=0.52, total_amount=20.47, tpep_pickup_datetime=1761955095000)\n", "Sent: Ride(PULocationID=164, DOLocationID=142, trip_distance=3.99, total_amount=38.22, tpep_pickup_datetime=1761955769000)\n", "Sent: Ride(PULocationID=107, DOLocationID=164, trip_distance=1.03, total_amount=16.35, tpep_pickup_datetime=1761955355000)\n", "Sent: Ride(PULocationID=100, DOLocationID=141, trip_distance=2.47, total_amount=27.75, tpep_pickup_datetime=1761956835000)\n", "Sent: Ride(PULocationID=161, DOLocationID=237, trip_distance=1.6, total_amount=18.45, tpep_pickup_datetime=1761958690000)\n", "Sent: Ride(PULocationID=140, DOLocationID=75, trip_distance=2.11, total_amount=20.52, tpep_pickup_datetime=1761955763000)\n", "Sent: Ride(PULocationID=132, DOLocationID=216, trip_distance=4.7, total_amount=24.3, tpep_pickup_datetime=1761957345000)\n", "Sent: Ride(PULocationID=148, DOLocationID=170, trip_distance=2.08, total_amount=31.5, tpep_pickup_datetime=1761958185000)\n", "Sent: Ride(PULocationID=113, DOLocationID=4, trip_distance=0.9, total_amount=19.72, tpep_pickup_datetime=1761957574000)\n", "Sent: Ride(PULocationID=4, DOLocationID=233, trip_distance=2.2, total_amount=22.05, tpep_pickup_datetime=1761958544000)\n", "Sent: Ride(PULocationID=231, DOLocationID=209, trip_distance=1.1, total_amount=19.7, tpep_pickup_datetime=1761958275000)\n", "Sent: Ride(PULocationID=186, DOLocationID=238, trip_distance=4.98, total_amount=45.75, tpep_pickup_datetime=1761955208000)\n", "Sent: Ride(PULocationID=138, DOLocationID=164, trip_distance=9.43, total_amount=59.54, tpep_pickup_datetime=1761955156000)\n", "Sent: Ride(PULocationID=162, DOLocationID=141, trip_distance=1.14, total_amount=17.22, 
tpep_pickup_datetime=1761957361000)\n", "Sent: Ride(PULocationID=158, DOLocationID=68, trip_distance=1.43, total_amount=23.21, tpep_pickup_datetime=1761955824000)\n", "Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=0.36, total_amount=12.15, tpep_pickup_datetime=1761956775000)\n", "Sent: Ride(PULocationID=186, DOLocationID=186, trip_distance=0.01, total_amount=-10.85, tpep_pickup_datetime=1761957695000)\n", "Sent: Ride(PULocationID=186, DOLocationID=186, trip_distance=0.01, total_amount=10.85, tpep_pickup_datetime=1761957695000)\n", "Sent: Ride(PULocationID=68, DOLocationID=265, trip_distance=4.31, total_amount=90.81, tpep_pickup_datetime=1761958059000)\n", "Sent: Ride(PULocationID=90, DOLocationID=141, trip_distance=3.92, total_amount=42.42, tpep_pickup_datetime=1761956086000)\n", "Sent: Ride(PULocationID=162, DOLocationID=173, trip_distance=7.98, total_amount=50.25, tpep_pickup_datetime=1761958460000)\n", "Sent: Ride(PULocationID=238, DOLocationID=48, trip_distance=3.03, total_amount=35.44, tpep_pickup_datetime=1761956112000)\n", "Sent: Ride(PULocationID=141, DOLocationID=140, trip_distance=0.54, total_amount=13.86, tpep_pickup_datetime=1761956344000)\n", "Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=1.46, total_amount=20.5, tpep_pickup_datetime=1761957032000)\n", "Sent: Ride(PULocationID=229, DOLocationID=48, trip_distance=1.8, total_amount=26.45, tpep_pickup_datetime=1761957521000)\n", "Sent: Ride(PULocationID=48, DOLocationID=233, trip_distance=0.91, total_amount=17.85, tpep_pickup_datetime=1761956210000)\n", "Sent: Ride(PULocationID=233, DOLocationID=113, trip_distance=1.61, total_amount=27.3, tpep_pickup_datetime=1761957229000)\n", "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.33, total_amount=19.25, tpep_pickup_datetime=1761957297000)\n", "Sent: Ride(PULocationID=132, DOLocationID=265, trip_distance=45.7, total_amount=284.39, tpep_pickup_datetime=1761956656000)\n", "Sent: Ride(PULocationID=158, DOLocationID=231, 
trip_distance=1.75, total_amount=20.55, tpep_pickup_datetime=1761956254000)\n", "Sent: Ride(PULocationID=231, DOLocationID=229, trip_distance=4.32, total_amount=45.78, tpep_pickup_datetime=1761957567000)\n", "Sent: Ride(PULocationID=148, DOLocationID=224, trip_distance=1.99, total_amount=22.26, tpep_pickup_datetime=1761955509000)\n", "Sent: Ride(PULocationID=224, DOLocationID=141, trip_distance=2.85, total_amount=29.82, tpep_pickup_datetime=1761956184000)\n", "Sent: Ride(PULocationID=141, DOLocationID=239, trip_distance=1.32, total_amount=18.84, tpep_pickup_datetime=1761957687000)\n", "Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=1.0, total_amount=17.75, tpep_pickup_datetime=1761956921000)\n", "Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.29, total_amount=23.75, tpep_pickup_datetime=1761957629000)\n", "Sent: Ride(PULocationID=125, DOLocationID=186, trip_distance=1.81, total_amount=31.15, tpep_pickup_datetime=1761956520000)\n", "Sent: Ride(PULocationID=186, DOLocationID=249, trip_distance=1.14, total_amount=21.25, tpep_pickup_datetime=1761958570000)\n", "Sent: Ride(PULocationID=138, DOLocationID=237, trip_distance=10.18, total_amount=69.71, tpep_pickup_datetime=1761957894000)\n", "Sent: Ride(PULocationID=236, DOLocationID=43, trip_distance=2.16, total_amount=17.1, tpep_pickup_datetime=1761956402000)\n", "Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=1.7, total_amount=18.0, tpep_pickup_datetime=1761955253000)\n", "Sent: Ride(PULocationID=263, DOLocationID=7, trip_distance=4.5, total_amount=32.34, tpep_pickup_datetime=1761955970000)\n", "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.4, total_amount=27.55, tpep_pickup_datetime=1761955608000)\n", "Sent: Ride(PULocationID=113, DOLocationID=244, trip_distance=10.5, total_amount=57.05, tpep_pickup_datetime=1761957376000)\n", "Sent: Ride(PULocationID=68, DOLocationID=255, trip_distance=7.32, total_amount=56.35, tpep_pickup_datetime=1761955670000)\n", "Sent: 
Ride(PULocationID=230, DOLocationID=261, trip_distance=6.49, total_amount=43.85, tpep_pickup_datetime=1761958702000)\n", "Sent: Ride(PULocationID=45, DOLocationID=97, trip_distance=3.2, total_amount=23.45, tpep_pickup_datetime=1761956745000)\n", "Sent: Ride(PULocationID=144, DOLocationID=186, trip_distance=2.9, total_amount=24.85, tpep_pickup_datetime=1761958647000)\n", "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=0.97, total_amount=23.94, tpep_pickup_datetime=1761956073000)\n", "Sent: Ride(PULocationID=234, DOLocationID=237, trip_distance=1.73, total_amount=22.26, tpep_pickup_datetime=1761957100000)\n", "Sent: Ride(PULocationID=45, DOLocationID=79, trip_distance=1.32, total_amount=22.26, tpep_pickup_datetime=1761957425000)\n", "Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.64, total_amount=24.72, tpep_pickup_datetime=1761958384000)\n", "Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.6, total_amount=23.1, tpep_pickup_datetime=1761956460000)\n", "Sent: Ride(PULocationID=113, DOLocationID=41, trip_distance=7.6, total_amount=49.4, tpep_pickup_datetime=1761957518000)\n", "Sent: Ride(PULocationID=140, DOLocationID=233, trip_distance=1.1, total_amount=22.3, tpep_pickup_datetime=1761956203000)\n", "Sent: Ride(PULocationID=162, DOLocationID=75, trip_distance=2.7, total_amount=23.2, tpep_pickup_datetime=1761957674000)\n", "Sent: Ride(PULocationID=170, DOLocationID=125, trip_distance=3.43, total_amount=49.14, tpep_pickup_datetime=1761957040000)\n", "Sent: Ride(PULocationID=209, DOLocationID=90, trip_distance=3.28, total_amount=31.15, tpep_pickup_datetime=1761956343000)\n", "Sent: Ride(PULocationID=90, DOLocationID=125, trip_distance=2.08, total_amount=34.86, tpep_pickup_datetime=1761958287000)\n", "Sent: Ride(PULocationID=48, DOLocationID=113, trip_distance=2.01, total_amount=41.58, tpep_pickup_datetime=1761956311000)\n", "Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.15, total_amount=10.44, 
tpep_pickup_datetime=1761955368000)\n", "Sent: Ride(PULocationID=246, DOLocationID=48, trip_distance=1.4, total_amount=22.75, tpep_pickup_datetime=1761957364000)\n", "Sent: Ride(PULocationID=142, DOLocationID=68, trip_distance=2.4, total_amount=24.8, tpep_pickup_datetime=1761956279000)\n", "Sent: Ride(PULocationID=262, DOLocationID=90, trip_distance=4.3, total_amount=52.5, tpep_pickup_datetime=1761956236000)\n", "Sent: Ride(PULocationID=68, DOLocationID=246, trip_distance=0.72, total_amount=13.86, tpep_pickup_datetime=1761956524000)\n", "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.38, total_amount=12.55, tpep_pickup_datetime=1761957465000)\n", "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.95, total_amount=17.22, tpep_pickup_datetime=1761958160000)\n", "Sent: Ride(PULocationID=68, DOLocationID=236, trip_distance=4.25, total_amount=46.62, tpep_pickup_datetime=1761955309000)\n", "Sent: Ride(PULocationID=236, DOLocationID=75, trip_distance=0.83, total_amount=11.5, tpep_pickup_datetime=1761955738000)\n", "Sent: Ride(PULocationID=141, DOLocationID=137, trip_distance=2.0, total_amount=20.75, tpep_pickup_datetime=1761957902000)\n", "Sent: Ride(PULocationID=140, DOLocationID=140, trip_distance=0.49, total_amount=10.1, tpep_pickup_datetime=1761956499000)\n", "Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.3, total_amount=17.16, tpep_pickup_datetime=1761957396000)\n", "Sent: Ride(PULocationID=142, DOLocationID=249, trip_distance=3.49, total_amount=30.05, tpep_pickup_datetime=1761958009000)\n", "Sent: Ride(PULocationID=87, DOLocationID=229, trip_distance=6.31, total_amount=40.74, tpep_pickup_datetime=1761958328000)\n", "Sent: Ride(PULocationID=50, DOLocationID=68, trip_distance=1.9, total_amount=20.55, tpep_pickup_datetime=1761955870000)\n", "Sent: Ride(PULocationID=87, DOLocationID=49, trip_distance=3.71, total_amount=25.55, tpep_pickup_datetime=1761957107000)\n", "Sent: Ride(PULocationID=97, DOLocationID=256, 
trip_distance=3.39, total_amount=27.6, tpep_pickup_datetime=1761958421000)\n", "Sent: Ride(PULocationID=148, DOLocationID=236, trip_distance=5.79, total_amount=40.95, tpep_pickup_datetime=1761955769000)\n", "Sent: Ride(PULocationID=236, DOLocationID=143, trip_distance=3.23, total_amount=23.7, tpep_pickup_datetime=1761957945000)\n", "Sent: Ride(PULocationID=231, DOLocationID=232, trip_distance=1.6, total_amount=19.25, tpep_pickup_datetime=1761956045000)\n", "Sent: Ride(PULocationID=141, DOLocationID=229, trip_distance=0.7, total_amount=13.85, tpep_pickup_datetime=1761956383000)\n", "Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.3, total_amount=13.4, tpep_pickup_datetime=1761957107000)\n", "Sent: Ride(PULocationID=263, DOLocationID=263, trip_distance=0.7, total_amount=12.8, tpep_pickup_datetime=1761957502000)\n", "Sent: Ride(PULocationID=263, DOLocationID=229, trip_distance=1.7, total_amount=18.9, tpep_pickup_datetime=1761958191000)\n", "Sent: Ride(PULocationID=107, DOLocationID=148, trip_distance=1.04, total_amount=24.78, tpep_pickup_datetime=1761958597000)\n", "Sent: Ride(PULocationID=233, DOLocationID=265, trip_distance=0.0, total_amount=124.25, tpep_pickup_datetime=1761956790000)\n", "Sent: Ride(PULocationID=79, DOLocationID=229, trip_distance=2.57, total_amount=28.14, tpep_pickup_datetime=1761956518000)\n", "Sent: Ride(PULocationID=170, DOLocationID=48, trip_distance=1.06, total_amount=19.57, tpep_pickup_datetime=1761956704000)\n", "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.32, total_amount=13.02, tpep_pickup_datetime=1761955678000)\n", "Sent: Ride(PULocationID=68, DOLocationID=162, trip_distance=1.73, total_amount=26.46, tpep_pickup_datetime=1761956006000)\n", "Sent: Ride(PULocationID=74, DOLocationID=236, trip_distance=1.55, total_amount=17.16, tpep_pickup_datetime=1761958603000)\n", "Sent: Ride(PULocationID=140, DOLocationID=230, trip_distance=2.16, total_amount=33.69, tpep_pickup_datetime=1761955305000)\n", "Sent: 
Ride(PULocationID=158, DOLocationID=231, trip_distance=1.99, total_amount=31.5, tpep_pickup_datetime=1761955569000)\n", "Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.04, total_amount=19.95, tpep_pickup_datetime=1761956993000)\n", "Sent: Ride(PULocationID=148, DOLocationID=170, trip_distance=1.53, total_amount=19.95, tpep_pickup_datetime=1761958077000)\n", "Sent: Ride(PULocationID=263, DOLocationID=234, trip_distance=2.63, total_amount=25.62, tpep_pickup_datetime=1761958784000)\n", "Sent: Ride(PULocationID=148, DOLocationID=141, trip_distance=4.28, total_amount=38.22, tpep_pickup_datetime=1761958147000)\n", "Sent: Ride(PULocationID=113, DOLocationID=230, trip_distance=3.19, total_amount=42.42, tpep_pickup_datetime=1761955883000)\n", "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.38, total_amount=18.9, tpep_pickup_datetime=1761958302000)\n", "Sent: Ride(PULocationID=261, DOLocationID=186, trip_distance=3.2, total_amount=34.0, tpep_pickup_datetime=1761957372000)\n", "Sent: Ride(PULocationID=158, DOLocationID=68, trip_distance=0.85, total_amount=26.81, tpep_pickup_datetime=1761955469000)\n", "Sent: Ride(PULocationID=68, DOLocationID=237, trip_distance=3.15, total_amount=33.8, tpep_pickup_datetime=1761956993000)\n", "Sent: Ride(PULocationID=170, DOLocationID=114, trip_distance=1.4, total_amount=31.5, tpep_pickup_datetime=1761955287000)\n", "Sent: Ride(PULocationID=114, DOLocationID=230, trip_distance=2.8, total_amount=31.05, tpep_pickup_datetime=1761956826000)\n", "Sent: Ride(PULocationID=163, DOLocationID=43, trip_distance=0.55, total_amount=13.02, tpep_pickup_datetime=1761955330000)\n", "Sent: Ride(PULocationID=142, DOLocationID=143, trip_distance=0.7, total_amount=13.8, tpep_pickup_datetime=1761955615000)\n", "Sent: Ride(PULocationID=87, DOLocationID=141, trip_distance=6.07, total_amount=34.55, tpep_pickup_datetime=1761956252000)\n", "Sent: Ride(PULocationID=148, DOLocationID=186, trip_distance=2.8, total_amount=33.18, 
tpep_pickup_datetime=1761956258000)\n", "Sent: Ride(PULocationID=186, DOLocationID=48, trip_distance=1.54, total_amount=21.35, tpep_pickup_datetime=1761958653000)\n", "Sent: Ride(PULocationID=239, DOLocationID=48, trip_distance=1.3, total_amount=17.35, tpep_pickup_datetime=1761957861000)\n", "Sent: Ride(PULocationID=48, DOLocationID=141, trip_distance=1.6, total_amount=23.95, tpep_pickup_datetime=1761958511000)\n", "Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=4.26, total_amount=39.9, tpep_pickup_datetime=1761956321000)\n", "Sent: Ride(PULocationID=141, DOLocationID=164, trip_distance=2.85, total_amount=28.14, tpep_pickup_datetime=1761958363000)\n", "Sent: Ride(PULocationID=113, DOLocationID=140, trip_distance=3.86, total_amount=42.42, tpep_pickup_datetime=1761956960000)\n", "Sent: Ride(PULocationID=263, DOLocationID=68, trip_distance=4.44, total_amount=34.02, tpep_pickup_datetime=1761958780000)\n", "Sent: Ride(PULocationID=48, DOLocationID=234, trip_distance=2.0, total_amount=28.45, tpep_pickup_datetime=1761956085000)\n", "Sent: Ride(PULocationID=234, DOLocationID=48, trip_distance=2.1, total_amount=37.2, tpep_pickup_datetime=1761957500000)\n", "Sent: Ride(PULocationID=140, DOLocationID=159, trip_distance=4.96, total_amount=30.4, tpep_pickup_datetime=1761956905000)\n", "Sent: Ride(PULocationID=170, DOLocationID=237, trip_distance=1.48, total_amount=21.42, tpep_pickup_datetime=1761955928000)\n", "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.0, total_amount=16.45, tpep_pickup_datetime=1761956435000)\n", "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=0.6, total_amount=13.85, tpep_pickup_datetime=1761957093000)\n", "Sent: Ride(PULocationID=170, DOLocationID=229, trip_distance=1.5, total_amount=24.75, tpep_pickup_datetime=1761957404000)\n", "Sent: Ride(PULocationID=263, DOLocationID=224, trip_distance=4.26, total_amount=50.85, tpep_pickup_datetime=1761955719000)\n", "Sent: Ride(PULocationID=224, DOLocationID=233, 
trip_distance=1.07, total_amount=17.22, tpep_pickup_datetime=1761958748000)\n", "Sent: Ride(PULocationID=249, DOLocationID=162, trip_distance=2.38, total_amount=34.86, tpep_pickup_datetime=1761956435000)\n", "Sent: Ride(PULocationID=162, DOLocationID=24, trip_distance=3.81, total_amount=28.55, tpep_pickup_datetime=1761958415000)\n", "Sent: Ride(PULocationID=246, DOLocationID=246, trip_distance=0.77, total_amount=15.05, tpep_pickup_datetime=1761956032000)\n", "Sent: Ride(PULocationID=262, DOLocationID=262, trip_distance=0.0, total_amount=8.0, tpep_pickup_datetime=1761956477000)\n", "Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=0.8, total_amount=13.5, tpep_pickup_datetime=1761957782000)\n", "Sent: Ride(PULocationID=236, DOLocationID=238, trip_distance=2.0, total_amount=18.84, tpep_pickup_datetime=1761958251000)\n", "Sent: Ride(PULocationID=230, DOLocationID=141, trip_distance=1.78, total_amount=18.09, tpep_pickup_datetime=1761955958000)\n", "Sent: Ride(PULocationID=237, DOLocationID=74, trip_distance=2.49, total_amount=21.6, tpep_pickup_datetime=1761956602000)\n", "Sent: Ride(PULocationID=231, DOLocationID=87, trip_distance=2.61, total_amount=24.78, tpep_pickup_datetime=1761958149000)\n", "Sent: Ride(PULocationID=161, DOLocationID=237, trip_distance=1.84, total_amount=17.15, tpep_pickup_datetime=1761955513000)\n", "Sent: Ride(PULocationID=237, DOLocationID=262, trip_distance=1.05, total_amount=15.48, tpep_pickup_datetime=1761956227000)\n", "Sent: Ride(PULocationID=163, DOLocationID=163, trip_distance=0.45, total_amount=15.54, tpep_pickup_datetime=1761954619000)\n", "Sent: Ride(PULocationID=113, DOLocationID=68, trip_distance=1.8, total_amount=49.98, tpep_pickup_datetime=1761955361000)\n", "Sent: Ride(PULocationID=246, DOLocationID=114, trip_distance=1.88, total_amount=31.94, tpep_pickup_datetime=1761958526000)\n", "Sent: Ride(PULocationID=164, DOLocationID=87, trip_distance=4.4, total_amount=57.3, tpep_pickup_datetime=1761956034000)\n", "Sent: 
Ride(PULocationID=230, DOLocationID=237, trip_distance=0.89, total_amount=15.39, tpep_pickup_datetime=1761958187000)\n", "Sent: Ride(PULocationID=231, DOLocationID=229, trip_distance=4.88, total_amount=43.73, tpep_pickup_datetime=1761958624000)\n", "Sent: Ride(PULocationID=239, DOLocationID=151, trip_distance=1.0, total_amount=14.6, tpep_pickup_datetime=1761956087000)\n", "Sent: Ride(PULocationID=48, DOLocationID=261, trip_distance=5.0, total_amount=50.15, tpep_pickup_datetime=1761957798000)\n", "Sent: Ride(PULocationID=148, DOLocationID=148, trip_distance=0.39, total_amount=23.8, tpep_pickup_datetime=1761955384000)\n", "Sent: Ride(PULocationID=148, DOLocationID=223, trip_distance=8.64, total_amount=60.06, tpep_pickup_datetime=1761956689000)\n", "Sent: Ride(PULocationID=256, DOLocationID=107, trip_distance=3.25, total_amount=40.35, tpep_pickup_datetime=1761955754000)\n", "Sent: Ride(PULocationID=107, DOLocationID=68, trip_distance=1.47, total_amount=18.55, tpep_pickup_datetime=1761957878000)\n", "Sent: Ride(PULocationID=161, DOLocationID=7, trip_distance=4.12, total_amount=31.8, tpep_pickup_datetime=1761955283000)\n", "Sent: Ride(PULocationID=138, DOLocationID=113, trip_distance=9.36, total_amount=75.5, tpep_pickup_datetime=1761957063000)\n", "Sent: Ride(PULocationID=90, DOLocationID=158, trip_distance=0.96, total_amount=23.94, tpep_pickup_datetime=1761957835000)\n", "Sent: Ride(PULocationID=237, DOLocationID=230, trip_distance=1.28, total_amount=20.56, tpep_pickup_datetime=1761955072000)\n", "Sent: Ride(PULocationID=100, DOLocationID=107, trip_distance=1.7, total_amount=24.95, tpep_pickup_datetime=1761956114000)\n", "Sent: Ride(PULocationID=107, DOLocationID=137, trip_distance=0.64, total_amount=13.63, tpep_pickup_datetime=1761957625000)\n", "Sent: Ride(PULocationID=170, DOLocationID=107, trip_distance=1.17, total_amount=27.3, tpep_pickup_datetime=1761958042000)\n", "Sent: Ride(PULocationID=79, DOLocationID=68, trip_distance=3.48, total_amount=37.38, 
tpep_pickup_datetime=1761956556000)\n", "Sent: Ride(PULocationID=161, DOLocationID=75, trip_distance=3.25, total_amount=25.62, tpep_pickup_datetime=1761955328000)\n", "Sent: Ride(PULocationID=43, DOLocationID=237, trip_distance=1.06, total_amount=18.35, tpep_pickup_datetime=1761956831000)\n", "Sent: Ride(PULocationID=43, DOLocationID=236, trip_distance=2.03, total_amount=22.26, tpep_pickup_datetime=1761957676000)\n", "Sent: Ride(PULocationID=68, DOLocationID=246, trip_distance=0.54, total_amount=16.55, tpep_pickup_datetime=1761955938000)\n", "Sent: Ride(PULocationID=246, DOLocationID=236, trip_distance=3.49, total_amount=35.7, tpep_pickup_datetime=1761956552000)\n", "Sent: Ride(PULocationID=75, DOLocationID=238, trip_distance=1.23, total_amount=12.9, tpep_pickup_datetime=1761958460000)\n", "Sent: Ride(PULocationID=138, DOLocationID=229, trip_distance=10.1, total_amount=72.24, tpep_pickup_datetime=1761956382000)\n", "Sent: Ride(PULocationID=162, DOLocationID=230, trip_distance=1.0, total_amount=17.15, tpep_pickup_datetime=1761958646000)\n", "Sent: Ride(PULocationID=132, DOLocationID=177, trip_distance=9.77, total_amount=54.55, tpep_pickup_datetime=1761957231000)\n", "Sent: Ride(PULocationID=239, DOLocationID=262, trip_distance=1.9, total_amount=21.35, tpep_pickup_datetime=1761958531000)\n", "Sent: Ride(PULocationID=125, DOLocationID=239, trip_distance=4.72, total_amount=47.46, tpep_pickup_datetime=1761955991000)\n", "Sent: Ride(PULocationID=239, DOLocationID=262, trip_distance=2.02, total_amount=20.5, tpep_pickup_datetime=1761958263000)\n", "Sent: Ride(PULocationID=224, DOLocationID=231, trip_distance=3.16, total_amount=27.65, tpep_pickup_datetime=1761957209000)\n", "Sent: Ride(PULocationID=239, DOLocationID=90, trip_distance=3.67, total_amount=34.02, tpep_pickup_datetime=1761956212000)\n", "Sent: Ride(PULocationID=90, DOLocationID=90, trip_distance=0.18, total_amount=19.74, tpep_pickup_datetime=1761957726000)\n", "Sent: Ride(PULocationID=90, DOLocationID=79, 
trip_distance=0.98, total_amount=27.3, tpep_pickup_datetime=1761958631000)\n", "Sent: Ride(PULocationID=263, DOLocationID=90, trip_distance=4.7, total_amount=40.74, tpep_pickup_datetime=1761956974000)\n", "Sent: Ride(PULocationID=249, DOLocationID=68, trip_distance=0.85, total_amount=19.55, tpep_pickup_datetime=1761955241000)\n", "Sent: Ride(PULocationID=186, DOLocationID=161, trip_distance=1.37, total_amount=24.94, tpep_pickup_datetime=1761956360000)\n", "Sent: Ride(PULocationID=231, DOLocationID=232, trip_distance=2.09, total_amount=22.26, tpep_pickup_datetime=1761956603000)\n", "Sent: Ride(PULocationID=142, DOLocationID=78, trip_distance=8.9, total_amount=35.0, tpep_pickup_datetime=1761957013000)\n", "Sent: Ride(PULocationID=148, DOLocationID=236, trip_distance=7.09, total_amount=43.65, tpep_pickup_datetime=1761955507000)\n", "Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=2.0, total_amount=18.81, tpep_pickup_datetime=1761957857000)\n", "Sent: Ride(PULocationID=239, DOLocationID=50, trip_distance=1.81, total_amount=20.58, tpep_pickup_datetime=1761958528000)\n", "Sent: Ride(PULocationID=148, DOLocationID=113, trip_distance=1.25, total_amount=24.78, tpep_pickup_datetime=1761956381000)\n", "Sent: Ride(PULocationID=113, DOLocationID=107, trip_distance=1.44, total_amount=24.78, tpep_pickup_datetime=1761957581000)\n", "Sent: Ride(PULocationID=50, DOLocationID=90, trip_distance=2.4, total_amount=33.2, tpep_pickup_datetime=1761955786000)\n", "Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=0.6, total_amount=17.05, tpep_pickup_datetime=1761957365000)\n", "Sent: Ride(PULocationID=246, DOLocationID=68, trip_distance=0.8, total_amount=16.75, tpep_pickup_datetime=1761958508000)\n", "Sent: Ride(PULocationID=249, DOLocationID=246, trip_distance=2.2, total_amount=27.3, tpep_pickup_datetime=1761955399000)\n", "Sent: Ride(PULocationID=48, DOLocationID=90, trip_distance=1.5, total_amount=18.45, tpep_pickup_datetime=1761957057000)\n", "Sent: 
Ride(PULocationID=90, DOLocationID=107, trip_distance=1.1, total_amount=24.8, tpep_pickup_datetime=1761957738000)\n", "Sent: Ride(PULocationID=148, DOLocationID=234, trip_distance=2.5, total_amount=34.0, tpep_pickup_datetime=1761956224000)\n", "Sent: Ride(PULocationID=48, DOLocationID=186, trip_distance=1.2, total_amount=18.9, tpep_pickup_datetime=1761955938000)\n", "Sent: Ride(PULocationID=234, DOLocationID=166, trip_distance=6.1, total_amount=43.3, tpep_pickup_datetime=1761957437000)\n", "Sent: Ride(PULocationID=238, DOLocationID=239, trip_distance=0.7, total_amount=15.5, tpep_pickup_datetime=1761956589000)\n", "Sent: Ride(PULocationID=90, DOLocationID=113, trip_distance=0.37, total_amount=17.15, tpep_pickup_datetime=1761957323000)\n", "Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.2, total_amount=24.8, tpep_pickup_datetime=1761956590000)\n", "Sent: Ride(PULocationID=234, DOLocationID=107, trip_distance=1.0, total_amount=20.65, tpep_pickup_datetime=1761957748000)\n", "Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.34, total_amount=24.15, tpep_pickup_datetime=1761958718000)\n", "Sent: Ride(PULocationID=231, DOLocationID=137, trip_distance=2.4, total_amount=27.78, tpep_pickup_datetime=1761956073000)\n", "Sent: Ride(PULocationID=100, DOLocationID=232, trip_distance=3.1, total_amount=34.85, tpep_pickup_datetime=1761955678000)\n", "Sent: Ride(PULocationID=232, DOLocationID=263, trip_distance=5.4, total_amount=36.55, tpep_pickup_datetime=1761957916000)\n", "Sent: Ride(PULocationID=236, DOLocationID=43, trip_distance=0.51, total_amount=12.96, tpep_pickup_datetime=1761957600000)\n", "Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=1.02, total_amount=17.88, tpep_pickup_datetime=1761958142000)\n", "Sent: Ride(PULocationID=230, DOLocationID=75, trip_distance=3.62, total_amount=27.3, tpep_pickup_datetime=1761957421000)\n", "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.54, total_amount=11.4, 
tpep_pickup_datetime=1761957775000)\n", "Sent: Ride(PULocationID=162, DOLocationID=140, trip_distance=0.89, total_amount=13.02, tpep_pickup_datetime=1761958544000)\n", "Sent: Ride(PULocationID=246, DOLocationID=17, trip_distance=7.47, total_amount=65.94, tpep_pickup_datetime=1761956096000)\n", "Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.8, total_amount=23.94, tpep_pickup_datetime=1761956686000)\n", "Sent: Ride(PULocationID=148, DOLocationID=229, trip_distance=3.43, total_amount=34.02, tpep_pickup_datetime=1761955666000)\n", "Sent: Ride(PULocationID=229, DOLocationID=262, trip_distance=1.62, total_amount=17.05, tpep_pickup_datetime=1761957259000)\n", "Sent: Ride(PULocationID=262, DOLocationID=237, trip_distance=1.32, total_amount=18.0, tpep_pickup_datetime=1761957709000)\n", "Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=0.9, total_amount=16.32, tpep_pickup_datetime=1761958396000)\n", "Sent: Ride(PULocationID=4, DOLocationID=107, trip_distance=1.3, total_amount=19.7, tpep_pickup_datetime=1761957455000)\n", "Sent: Ride(PULocationID=107, DOLocationID=170, trip_distance=0.93, total_amount=16.38, tpep_pickup_datetime=1761955645000)\n", "Sent: Ride(PULocationID=249, DOLocationID=164, trip_distance=1.12, total_amount=24.15, tpep_pickup_datetime=1761955455000)\n", "Sent: Ride(PULocationID=236, DOLocationID=236, trip_distance=0.39, total_amount=11.11, tpep_pickup_datetime=1761957759000)\n", "Sent: Ride(PULocationID=79, DOLocationID=231, trip_distance=1.82, total_amount=24.95, tpep_pickup_datetime=1761955330000)\n", "Sent: Ride(PULocationID=231, DOLocationID=162, trip_distance=6.21, total_amount=43.05, tpep_pickup_datetime=1761956607000)\n", "Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.27, total_amount=11.75, tpep_pickup_datetime=1761956212000)\n", "Sent: Ride(PULocationID=263, DOLocationID=43, trip_distance=1.46, total_amount=15.0, tpep_pickup_datetime=1761958020000)\n", "Sent: Ride(PULocationID=237, DOLocationID=141, 
trip_distance=0.5, total_amount=14.95, tpep_pickup_datetime=1761956740000)\n", "Sent: Ride(PULocationID=234, DOLocationID=87, trip_distance=4.1, total_amount=51.35, tpep_pickup_datetime=1761958460000)\n", "Sent: Ride(PULocationID=132, DOLocationID=116, trip_distance=18.5, total_amount=96.24, tpep_pickup_datetime=1761957126000)\n", "Sent: Ride(PULocationID=138, DOLocationID=17, trip_distance=9.96, total_amount=61.39, tpep_pickup_datetime=1761957438000)\n", "Sent: Ride(PULocationID=231, DOLocationID=170, trip_distance=4.21, total_amount=34.55, tpep_pickup_datetime=1761957583000)\n", "Sent: Ride(PULocationID=4, DOLocationID=263, trip_distance=3.7, total_amount=31.5, tpep_pickup_datetime=1761956232000)\n", "Sent: Ride(PULocationID=263, DOLocationID=263, trip_distance=0.0, total_amount=8.0, tpep_pickup_datetime=1761957635000)\n", "Sent: Ride(PULocationID=234, DOLocationID=229, trip_distance=1.9, total_amount=27.3, tpep_pickup_datetime=1761958640000)\n", "Sent: Ride(PULocationID=68, DOLocationID=164, trip_distance=1.05, total_amount=15.75, tpep_pickup_datetime=1761956099000)\n", "Sent: Ride(PULocationID=164, DOLocationID=170, trip_distance=1.13, total_amount=15.75, tpep_pickup_datetime=1761956710000)\n", "Sent: Ride(PULocationID=170, DOLocationID=256, trip_distance=5.72, total_amount=60.9, tpep_pickup_datetime=1761957394000)\n", "Sent: Ride(PULocationID=255, DOLocationID=49, trip_distance=4.19, total_amount=46.08, tpep_pickup_datetime=1761956379000)\n", "Sent: Ride(PULocationID=170, DOLocationID=148, trip_distance=2.1, total_amount=29.76, tpep_pickup_datetime=1761955716000)\n", "Sent: Ride(PULocationID=144, DOLocationID=249, trip_distance=1.3, total_amount=18.9, tpep_pickup_datetime=1761957598000)\n", "Sent: Ride(PULocationID=249, DOLocationID=87, trip_distance=2.2, total_amount=26.45, tpep_pickup_datetime=1761958322000)\n", "Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.35, total_amount=15.54, tpep_pickup_datetime=1761956616000)\n", "Sent: 
Ride(PULocationID=164, DOLocationID=246, trip_distance=1.21, total_amount=15.75, tpep_pickup_datetime=1761955343000)\n", "Sent: Ride(PULocationID=246, DOLocationID=246, trip_distance=0.83, total_amount=19.95, tpep_pickup_datetime=1761955909000)\n", "Sent: Ride(PULocationID=79, DOLocationID=263, trip_distance=4.38, total_amount=34.02, tpep_pickup_datetime=1761958565000)\n", "Sent: Ride(PULocationID=132, DOLocationID=107, trip_distance=16.64, total_amount=93.44, tpep_pickup_datetime=1761956975000)\n", "Sent: Ride(PULocationID=107, DOLocationID=233, trip_distance=1.44, total_amount=18.45, tpep_pickup_datetime=1761955379000)\n", "Sent: Ride(PULocationID=79, DOLocationID=246, trip_distance=2.65, total_amount=35.55, tpep_pickup_datetime=1761956312000)\n", "Sent: Ride(PULocationID=233, DOLocationID=229, trip_distance=0.71, total_amount=13.86, tpep_pickup_datetime=1761958144000)\n", "Sent: Ride(PULocationID=48, DOLocationID=79, trip_distance=3.44, total_amount=40.74, tpep_pickup_datetime=1761957545000)\n", "Sent: Ride(PULocationID=68, DOLocationID=90, trip_distance=0.6, total_amount=17.22, tpep_pickup_datetime=1761955431000)\n", "Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=1.1, total_amount=19.85, tpep_pickup_datetime=1761956413000)\n", "Sent: Ride(PULocationID=107, DOLocationID=62, trip_distance=8.16, total_amount=53.55, tpep_pickup_datetime=1761956385000)\n", "Sent: Ride(PULocationID=234, DOLocationID=79, trip_distance=1.11, total_amount=26.46, tpep_pickup_datetime=1761955381000)\n", "Sent: Ride(PULocationID=79, DOLocationID=4, trip_distance=0.77, total_amount=21.42, tpep_pickup_datetime=1761956586000)\n", "Sent: Ride(PULocationID=4, DOLocationID=107, trip_distance=1.11, total_amount=18.06, tpep_pickup_datetime=1761957433000)\n", "Sent: Ride(PULocationID=79, DOLocationID=68, trip_distance=2.16, total_amount=28.95, tpep_pickup_datetime=1761956224000)\n", "Sent: Ride(PULocationID=68, DOLocationID=90, trip_distance=0.8, total_amount=19.74, 
tpep_pickup_datetime=1761957722000)\n", "Sent: Ride(PULocationID=90, DOLocationID=246, trip_distance=1.77, total_amount=21.95, tpep_pickup_datetime=1761958441000)\n", "Sent: Ride(PULocationID=87, DOLocationID=79, trip_distance=2.87, total_amount=34.02, tpep_pickup_datetime=1761957364000)\n", "Sent: Ride(PULocationID=43, DOLocationID=140, trip_distance=2.05, total_amount=17.15, tpep_pickup_datetime=1761956821000)\n", "Sent: Ride(PULocationID=263, DOLocationID=263, trip_distance=0.6, total_amount=13.8, tpep_pickup_datetime=1761957600000)\n", "Sent: Ride(PULocationID=43, DOLocationID=249, trip_distance=3.17, total_amount=35.82, tpep_pickup_datetime=1761958522000)\n", "Sent: Ride(PULocationID=107, DOLocationID=170, trip_distance=0.7, total_amount=17.75, tpep_pickup_datetime=1761955603000)\n", "Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=8.9, total_amount=64.74, tpep_pickup_datetime=1761958130000)\n", "Sent: Ride(PULocationID=230, DOLocationID=142, trip_distance=1.9, total_amount=23.9, tpep_pickup_datetime=1761956855000)\n", "Sent: Ride(PULocationID=239, DOLocationID=41, trip_distance=2.4, total_amount=22.2, tpep_pickup_datetime=1761958017000)\n", "Sent: Ride(PULocationID=166, DOLocationID=151, trip_distance=0.71, total_amount=11.7, tpep_pickup_datetime=1761955955000)\n", "Sent: Ride(PULocationID=166, DOLocationID=243, trip_distance=5.58, total_amount=34.32, tpep_pickup_datetime=1761956353000)\n", "Sent: Ride(PULocationID=236, DOLocationID=236, trip_distance=0.64, total_amount=12.12, tpep_pickup_datetime=1761955402000)\n", "Sent: Ride(PULocationID=236, DOLocationID=170, trip_distance=3.58, total_amount=34.02, tpep_pickup_datetime=1761955588000)\n", "Sent: Ride(PULocationID=161, DOLocationID=236, trip_distance=1.3, total_amount=17.2, tpep_pickup_datetime=1761956232000)\n", "Sent: Ride(PULocationID=231, DOLocationID=74, trip_distance=7.55, total_amount=64.26, tpep_pickup_datetime=1761958722000)\n", "Sent: Ride(PULocationID=246, DOLocationID=90, 
trip_distance=1.0, total_amount=24.78, tpep_pickup_datetime=1761955226000)\n", "Sent: Ride(PULocationID=90, DOLocationID=137, trip_distance=1.51, total_amount=26.95, tpep_pickup_datetime=1761956284000)\n", "Sent: Ride(PULocationID=162, DOLocationID=263, trip_distance=2.0, total_amount=19.15, tpep_pickup_datetime=1761958009000)\n", "Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=1.0, total_amount=19.74, tpep_pickup_datetime=1761955806000)\n", "Sent: Ride(PULocationID=170, DOLocationID=233, trip_distance=0.15, total_amount=19.75, tpep_pickup_datetime=1761956644000)\n", "Sent: Ride(PULocationID=113, DOLocationID=256, trip_distance=4.57, total_amount=45.05, tpep_pickup_datetime=1761955504000)\n", "Sent: Ride(PULocationID=237, DOLocationID=161, trip_distance=0.67, total_amount=13.02, tpep_pickup_datetime=1761956393000)\n", "Sent: Ride(PULocationID=158, DOLocationID=158, trip_distance=0.07, total_amount=10.15, tpep_pickup_datetime=1761956927000)\n", "Sent: Ride(PULocationID=158, DOLocationID=107, trip_distance=0.99, total_amount=33.69, tpep_pickup_datetime=1761957291000)\n", "Sent: Ride(PULocationID=234, DOLocationID=100, trip_distance=1.46, total_amount=20.95, tpep_pickup_datetime=1761956040000)\n", "Sent: Ride(PULocationID=100, DOLocationID=233, trip_distance=1.13, total_amount=25.05, tpep_pickup_datetime=1761956974000)\n", "Sent: Ride(PULocationID=239, DOLocationID=68, trip_distance=2.89, total_amount=24.13, tpep_pickup_datetime=1761956628000)\n", "Sent: Ride(PULocationID=100, DOLocationID=233, trip_distance=1.9, total_amount=23.1, tpep_pickup_datetime=1761957676000)\n", "Sent: Ride(PULocationID=246, DOLocationID=68, trip_distance=0.82, total_amount=13.95, tpep_pickup_datetime=1761956420000)\n", "Sent: Ride(PULocationID=90, DOLocationID=142, trip_distance=3.3, total_amount=36.54, tpep_pickup_datetime=1761957077000)\n", "Sent: Ride(PULocationID=143, DOLocationID=79, trip_distance=7.5, total_amount=59.05, tpep_pickup_datetime=1761956774000)\n", "Sent: 
Ride(PULocationID=231, DOLocationID=148, trip_distance=1.08, total_amount=16.38, tpep_pickup_datetime=1761957693000)\n", "Sent: Ride(PULocationID=148, DOLocationID=263, trip_distance=4.19, total_amount=35.44, tpep_pickup_datetime=1761958288000)\n", "Sent: Ride(PULocationID=162, DOLocationID=229, trip_distance=0.52, total_amount=15.54, tpep_pickup_datetime=1761955463000)\n", "Sent: Ride(PULocationID=141, DOLocationID=238, trip_distance=2.32, total_amount=23.04, tpep_pickup_datetime=1761956228000)\n", "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.48, total_amount=12.12, tpep_pickup_datetime=1761956375000)\n", "Sent: Ride(PULocationID=141, DOLocationID=140, trip_distance=1.2, total_amount=16.32, tpep_pickup_datetime=1761956646000)\n", "Sent: Ride(PULocationID=79, DOLocationID=142, trip_distance=4.38, total_amount=38.41, tpep_pickup_datetime=1761955989000)\n", "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.42, total_amount=15.05, tpep_pickup_datetime=1761958537000)\n", "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.3, total_amount=11.55, tpep_pickup_datetime=1761957747000)\n", "Sent: Ride(PULocationID=68, DOLocationID=249, trip_distance=0.98, total_amount=19.74, tpep_pickup_datetime=1761958266000)\n", "Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.86, total_amount=20.58, tpep_pickup_datetime=1761955096000)\n", "Sent: Ride(PULocationID=107, DOLocationID=229, trip_distance=2.06, total_amount=23.94, tpep_pickup_datetime=1761956339000)\n", "Sent: Ride(PULocationID=141, DOLocationID=141, trip_distance=0.89, total_amount=15.54, tpep_pickup_datetime=1761955468000)\n", "Sent: Ride(PULocationID=141, DOLocationID=262, trip_distance=0.97, total_amount=11.5, tpep_pickup_datetime=1761956055000)\n", "Sent: Ride(PULocationID=236, DOLocationID=158, trip_distance=5.59, total_amount=45.78, tpep_pickup_datetime=1761956318000)\n", "Sent: Ride(PULocationID=158, DOLocationID=68, trip_distance=1.17, total_amount=19.57, 
tpep_pickup_datetime=1761958429000)\n", "Sent: Ride(PULocationID=24, DOLocationID=75, trip_distance=1.38, total_amount=15.0, tpep_pickup_datetime=1761956097000)\n", "Sent: Ride(PULocationID=137, DOLocationID=224, trip_distance=0.83, total_amount=19.25, tpep_pickup_datetime=1761958771000)\n", "Sent: Ride(PULocationID=68, DOLocationID=48, trip_distance=1.34, total_amount=24.78, tpep_pickup_datetime=1761955846000)\n", "Sent: Ride(PULocationID=48, DOLocationID=41, trip_distance=3.75, total_amount=31.4, tpep_pickup_datetime=1761957073000)\n", "Sent: Ride(PULocationID=142, DOLocationID=263, trip_distance=2.33, total_amount=23.04, tpep_pickup_datetime=1761958753000)\n", "Sent: Ride(PULocationID=249, DOLocationID=61, trip_distance=7.04, total_amount=58.36, tpep_pickup_datetime=1761957360000)\n", "Sent: Ride(PULocationID=186, DOLocationID=163, trip_distance=1.1, total_amount=21.4, tpep_pickup_datetime=1761955669000)\n", "Sent: Ride(PULocationID=163, DOLocationID=261, trip_distance=5.9, total_amount=52.5, tpep_pickup_datetime=1761956719000)\n", "Sent: Ride(PULocationID=140, DOLocationID=233, trip_distance=1.16, total_amount=16.45, tpep_pickup_datetime=1761956976000)\n", "Sent: Ride(PULocationID=233, DOLocationID=107, trip_distance=0.98, total_amount=18.06, tpep_pickup_datetime=1761957777000)\n", "Sent: Ride(PULocationID=48, DOLocationID=24, trip_distance=3.75, total_amount=26.25, tpep_pickup_datetime=1761957045000)\n", "Sent: Ride(PULocationID=141, DOLocationID=170, trip_distance=2.2, total_amount=21.17, tpep_pickup_datetime=1761957744000)\n", "Sent: Ride(PULocationID=161, DOLocationID=48, trip_distance=1.2, total_amount=18.55, tpep_pickup_datetime=1761955558000)\n", "Sent: Ride(PULocationID=74, DOLocationID=42, trip_distance=2.21, total_amount=15.3, tpep_pickup_datetime=1761955487000)\n", "Sent: Ride(PULocationID=151, DOLocationID=238, trip_distance=0.71, total_amount=11.7, tpep_pickup_datetime=1761957229000)\n", "Sent: Ride(PULocationID=143, DOLocationID=151, 
trip_distance=2.15, total_amount=19.58, tpep_pickup_datetime=1761958183000)\n", "Sent: Ride(PULocationID=239, DOLocationID=230, trip_distance=1.9, total_amount=21.4, tpep_pickup_datetime=1761957928000)\n", "Sent: Ride(PULocationID=230, DOLocationID=230, trip_distance=0.3, total_amount=11.55, tpep_pickup_datetime=1761958796000)\n", "Sent: Ride(PULocationID=113, DOLocationID=233, trip_distance=2.3, total_amount=31.05, tpep_pickup_datetime=1761957266000)\n", "Sent: Ride(PULocationID=233, DOLocationID=262, trip_distance=3.0, total_amount=19.25, tpep_pickup_datetime=1761958773000)\n", "Sent: Ride(PULocationID=162, DOLocationID=50, trip_distance=1.26, total_amount=23.05, tpep_pickup_datetime=1761958750000)\n", "Sent: Ride(PULocationID=90, DOLocationID=186, trip_distance=0.59, total_amount=15.54, tpep_pickup_datetime=1761956784000)\n", "Sent: Ride(PULocationID=186, DOLocationID=48, trip_distance=1.39, total_amount=16.45, tpep_pickup_datetime=1761957244000)\n", "Sent: Ride(PULocationID=48, DOLocationID=238, trip_distance=2.04, total_amount=25.62, tpep_pickup_datetime=1761958076000)\n", "Sent: Ride(PULocationID=68, DOLocationID=141, trip_distance=3.93, total_amount=40.74, tpep_pickup_datetime=1761957260000)\n", "Sent: Ride(PULocationID=186, DOLocationID=4, trip_distance=1.98, total_amount=36.54, tpep_pickup_datetime=1761957796000)\n", "Sent: Ride(PULocationID=234, DOLocationID=112, trip_distance=3.96, total_amount=49.91, tpep_pickup_datetime=1761956053000)\n", "Sent: Ride(PULocationID=211, DOLocationID=137, trip_distance=2.41, total_amount=29.25, tpep_pickup_datetime=1761955544000)\n", "Sent: Ride(PULocationID=239, DOLocationID=141, trip_distance=1.48, total_amount=18.06, tpep_pickup_datetime=1761958664000)\n", "Sent: Ride(PULocationID=170, DOLocationID=138, trip_distance=11.3, total_amount=71.29, tpep_pickup_datetime=1761956418000)\n", "Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=11.04, total_amount=74.46, tpep_pickup_datetime=1761958340000)\n", "Sent: 
Ride(PULocationID=90, DOLocationID=79, trip_distance=1.82, total_amount=44.1, tpep_pickup_datetime=1761957003000)\n", "Sent: Ride(PULocationID=236, DOLocationID=237, trip_distance=0.64, total_amount=14.2, tpep_pickup_datetime=1761955771000)\n", "Sent: Ride(PULocationID=237, DOLocationID=162, trip_distance=1.19, total_amount=19.74, tpep_pickup_datetime=1761956127000)\n", "Sent: Ride(PULocationID=233, DOLocationID=80, trip_distance=5.8, total_amount=52.43, tpep_pickup_datetime=1761956989000)\n", "Sent: Ride(PULocationID=262, DOLocationID=87, trip_distance=6.6, total_amount=66.8, tpep_pickup_datetime=1761955277000)\n", "Sent: Ride(PULocationID=132, DOLocationID=141, trip_distance=14.9, total_amount=81.0, tpep_pickup_datetime=1761956893000)\n", "Sent: Ride(PULocationID=107, DOLocationID=230, trip_distance=2.16, total_amount=34.02, tpep_pickup_datetime=1761955901000)\n", "Sent: Ride(PULocationID=230, DOLocationID=236, trip_distance=2.67, total_amount=23.65, tpep_pickup_datetime=1761957546000)\n", "Sent: Ride(PULocationID=163, DOLocationID=179, trip_distance=3.9, total_amount=29.0, tpep_pickup_datetime=1761955359000)\n", "Sent: Ride(PULocationID=79, DOLocationID=113, trip_distance=1.0, total_amount=17.85, tpep_pickup_datetime=1761955818000)\n", "Sent: Ride(PULocationID=113, DOLocationID=48, trip_distance=3.0, total_amount=34.85, tpep_pickup_datetime=1761956723000)\n", "Sent: Ride(PULocationID=234, DOLocationID=113, trip_distance=0.87, total_amount=29.95, tpep_pickup_datetime=1761955542000)\n", "Sent: Ride(PULocationID=113, DOLocationID=236, trip_distance=3.63, total_amount=41.56, tpep_pickup_datetime=1761957744000)\n", "Sent: Ride(PULocationID=48, DOLocationID=170, trip_distance=1.92, total_amount=25.75, tpep_pickup_datetime=1761956455000)\n", "Sent: Ride(PULocationID=170, DOLocationID=48, trip_distance=1.12, total_amount=21.42, tpep_pickup_datetime=1761957962000)\n", "Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.2, total_amount=15.05, 
tpep_pickup_datetime=1761957985000)\n", "Sent: Ride(PULocationID=233, DOLocationID=263, trip_distance=2.0, total_amount=16.45, tpep_pickup_datetime=1761958639000)\n", "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=1.4, total_amount=16.35, tpep_pickup_datetime=1761956025000)\n", "Sent: Ride(PULocationID=234, DOLocationID=141, trip_distance=2.8, total_amount=27.3, tpep_pickup_datetime=1761958394000)\n", "Sent: Ride(PULocationID=113, DOLocationID=186, trip_distance=1.2, total_amount=24.05, tpep_pickup_datetime=1761956352000)\n", "Sent: Ride(PULocationID=170, DOLocationID=142, trip_distance=2.1, total_amount=22.25, tpep_pickup_datetime=1761957923000)\n", "Sent: Ride(PULocationID=50, DOLocationID=246, trip_distance=1.25, total_amount=16.38, tpep_pickup_datetime=1761958378000)\n", "Sent: Ride(PULocationID=246, DOLocationID=75, trip_distance=5.25, total_amount=37.35, tpep_pickup_datetime=1761958709000)\n", "Sent: Ride(PULocationID=170, DOLocationID=170, trip_distance=0.59, total_amount=14.65, tpep_pickup_datetime=1761958613000)\n", "Sent: Ride(PULocationID=141, DOLocationID=161, trip_distance=1.56, total_amount=18.9, tpep_pickup_datetime=1761956052000)\n", "Sent: Ride(PULocationID=161, DOLocationID=7, trip_distance=3.47, total_amount=31.5, tpep_pickup_datetime=1761956801000)\n", "Sent: Ride(PULocationID=113, DOLocationID=107, trip_distance=0.54, total_amount=14.15, tpep_pickup_datetime=1761955290000)\n", "Sent: Ride(PULocationID=107, DOLocationID=229, trip_distance=1.92, total_amount=30.65, tpep_pickup_datetime=1761955702000)\n", "Sent: Ride(PULocationID=237, DOLocationID=232, trip_distance=5.7, total_amount=56.7, tpep_pickup_datetime=1761955310000)\n", "Sent: Ride(PULocationID=34, DOLocationID=263, trip_distance=8.81, total_amount=54.9, tpep_pickup_datetime=1761956216000)\n", "Sent: Ride(PULocationID=238, DOLocationID=263, trip_distance=1.8, total_amount=17.15, tpep_pickup_datetime=1761956756000)\n", "Sent: Ride(PULocationID=263, DOLocationID=233, 
trip_distance=2.3, total_amount=23.2, tpep_pickup_datetime=1761957345000)\n", "Sent: Ride(PULocationID=233, DOLocationID=163, trip_distance=1.2, total_amount=17.2, tpep_pickup_datetime=1761958050000)\n", "Sent: Ride(PULocationID=237, DOLocationID=42, trip_distance=3.5, total_amount=24.8, tpep_pickup_datetime=1761958674000)\n", "Sent: Ride(PULocationID=143, DOLocationID=81, trip_distance=13.98, total_amount=62.65, tpep_pickup_datetime=1761957611000)\n", "Sent: Ride(PULocationID=148, DOLocationID=263, trip_distance=7.04, total_amount=47.46, tpep_pickup_datetime=1761958101000)\n", "Sent: Ride(PULocationID=230, DOLocationID=230, trip_distance=0.28, total_amount=11.55, tpep_pickup_datetime=1761955440000)\n", "Sent: Ride(PULocationID=265, DOLocationID=265, trip_distance=0.0, total_amount=71.0, tpep_pickup_datetime=1761957594000)\n", "Sent: Ride(PULocationID=265, DOLocationID=265, trip_distance=0.0, total_amount=85.2, tpep_pickup_datetime=1761957770000)\n", "Sent: Ride(PULocationID=246, DOLocationID=87, trip_distance=4.0, total_amount=47.35, tpep_pickup_datetime=1761958153000)\n", "Sent: Ride(PULocationID=231, DOLocationID=231, trip_distance=0.3, total_amount=12.15, tpep_pickup_datetime=1761955983000)\n", "Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=1.9, total_amount=21.35, tpep_pickup_datetime=1761958793000)\n", "Sent: Ride(PULocationID=236, DOLocationID=238, trip_distance=1.6, total_amount=17.0, tpep_pickup_datetime=1761956346000)\n", "Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=0.8, total_amount=13.8, tpep_pickup_datetime=1761956873000)\n", "Sent: Ride(PULocationID=239, DOLocationID=263, trip_distance=2.3, total_amount=22.25, tpep_pickup_datetime=1761957322000)\n", "Sent: Ride(PULocationID=79, DOLocationID=66, trip_distance=4.8, total_amount=31.85, tpep_pickup_datetime=1761954981000)\n", "Sent: Ride(PULocationID=234, DOLocationID=239, trip_distance=4.01, total_amount=37.38, tpep_pickup_datetime=1761955907000)\n", "Sent: 
Ride(PULocationID=237, DOLocationID=249, trip_distance=4.0, total_amount=52.5, tpep_pickup_datetime=1761955461000)\n", "Sent: Ride(PULocationID=249, DOLocationID=236, trip_distance=5.02, total_amount=42.95, tpep_pickup_datetime=1761958553000)\n", "Sent: Ride(PULocationID=144, DOLocationID=262, trip_distance=5.8, total_amount=40.75, tpep_pickup_datetime=1761957082000)\n", "Sent: Ride(PULocationID=239, DOLocationID=68, trip_distance=3.74, total_amount=34.56, tpep_pickup_datetime=1761956465000)\n", "Sent: Ride(PULocationID=237, DOLocationID=75, trip_distance=1.98, total_amount=15.7, tpep_pickup_datetime=1761955693000)\n", "Sent: Ride(PULocationID=107, DOLocationID=158, trip_distance=2.0, total_amount=24.8, tpep_pickup_datetime=1761957491000)\n", "Sent: Ride(PULocationID=43, DOLocationID=170, trip_distance=1.8, total_amount=23.9, tpep_pickup_datetime=1761955415000)\n", "Sent: Ride(PULocationID=107, DOLocationID=113, trip_distance=1.0, total_amount=24.75, tpep_pickup_datetime=1761956410000)\n", "Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=1.4, total_amount=25.45, tpep_pickup_datetime=1761957751000)\n", "Sent: Ride(PULocationID=263, DOLocationID=229, trip_distance=1.5, total_amount=16.75, tpep_pickup_datetime=1761956645000)\n", "Sent: Ride(PULocationID=229, DOLocationID=263, trip_distance=1.9, total_amount=19.7, tpep_pickup_datetime=1761957347000)\n", "Sent: Ride(PULocationID=263, DOLocationID=79, trip_distance=3.7, total_amount=36.5, tpep_pickup_datetime=1761957967000)\n", "Sent: Ride(PULocationID=186, DOLocationID=4, trip_distance=2.35, total_amount=45.04, tpep_pickup_datetime=1761955695000)\n", "Sent: Ride(PULocationID=79, DOLocationID=230, trip_distance=3.06, total_amount=34.02, tpep_pickup_datetime=1761958084000)\n", "Sent: Ride(PULocationID=151, DOLocationID=164, trip_distance=3.53, total_amount=37.38, tpep_pickup_datetime=1761956326000)\n", "Sent: Ride(PULocationID=164, DOLocationID=79, trip_distance=1.62, total_amount=24.05, 
tpep_pickup_datetime=1761958296000)\n", "Sent: Ride(PULocationID=170, DOLocationID=141, trip_distance=1.78, total_amount=22.35, tpep_pickup_datetime=1761956815000)\n", "Sent: Ride(PULocationID=90, DOLocationID=161, trip_distance=3.12, total_amount=37.45, tpep_pickup_datetime=1761956190000)\n", "Sent: Ride(PULocationID=161, DOLocationID=137, trip_distance=0.71, total_amount=14.7, tpep_pickup_datetime=1761958680000)\n", "Sent: Ride(PULocationID=158, DOLocationID=239, trip_distance=4.03, total_amount=36.15, tpep_pickup_datetime=1761957513000)\n", "Sent: Ride(PULocationID=236, DOLocationID=239, trip_distance=1.43, total_amount=17.7, tpep_pickup_datetime=1761955297000)\n", "Sent: Ride(PULocationID=107, DOLocationID=263, trip_distance=4.9, total_amount=33.85, tpep_pickup_datetime=1761955712000)\n", "Sent: Ride(PULocationID=141, DOLocationID=249, trip_distance=4.3, total_amount=55.0, tpep_pickup_datetime=1761957577000)\n", "Sent: Ride(PULocationID=186, DOLocationID=234, trip_distance=0.47, total_amount=15.7, tpep_pickup_datetime=1761955763000)\n", "Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=2.07, total_amount=29.82, tpep_pickup_datetime=1761956324000)\n", "Sent: Ride(PULocationID=232, DOLocationID=232, trip_distance=0.07, total_amount=8.75, tpep_pickup_datetime=1761957625000)\n", "Sent: Ride(PULocationID=232, DOLocationID=224, trip_distance=1.45, total_amount=19.74, tpep_pickup_datetime=1761957883000)\n", "Sent: Ride(PULocationID=224, DOLocationID=229, trip_distance=1.92, total_amount=20.58, tpep_pickup_datetime=1761958559000)\n", "Sent: Ride(PULocationID=234, DOLocationID=255, trip_distance=4.34, total_amount=54.18, tpep_pickup_datetime=1761955952000)\n", "Sent: Ride(PULocationID=68, DOLocationID=249, trip_distance=1.0, total_amount=40.74, tpep_pickup_datetime=1761955307000)\n", "Sent: Ride(PULocationID=249, DOLocationID=148, trip_distance=1.88, total_amount=22.97, tpep_pickup_datetime=1761957572000)\n", "Sent: Ride(PULocationID=148, DOLocationID=25, 
trip_distance=2.62, total_amount=24.78, tpep_pickup_datetime=1761958769000)\n", "Sent: Ride(PULocationID=162, DOLocationID=263, trip_distance=1.95, total_amount=15.75, tpep_pickup_datetime=1761955485000)\n", "Sent: Ride(PULocationID=261, DOLocationID=13, trip_distance=0.83, total_amount=16.83, tpep_pickup_datetime=1761958153000)\n", "Sent: Ride(PULocationID=79, DOLocationID=261, trip_distance=1.9, total_amount=20.75, tpep_pickup_datetime=1761955780000)\n", "Sent: Ride(PULocationID=45, DOLocationID=170, trip_distance=3.1, total_amount=39.9, tpep_pickup_datetime=1761957020000)\n", "Sent: Ride(PULocationID=243, DOLocationID=116, trip_distance=2.99, total_amount=21.72, tpep_pickup_datetime=1761955822000)\n", "Sent: Ride(PULocationID=234, DOLocationID=137, trip_distance=1.41, total_amount=27.55, tpep_pickup_datetime=1761956665000)\n", "Sent: Ride(PULocationID=48, DOLocationID=41, trip_distance=3.29, total_amount=25.45, tpep_pickup_datetime=1761955794000)\n", "Sent: Ride(PULocationID=113, DOLocationID=170, trip_distance=1.17, total_amount=24.78, tpep_pickup_datetime=1761955535000)\n", "Sent: Ride(PULocationID=234, DOLocationID=107, trip_distance=0.91, total_amount=23.94, tpep_pickup_datetime=1761957348000)\n", "Sent: Ride(PULocationID=107, DOLocationID=100, trip_distance=0.93, total_amount=18.06, tpep_pickup_datetime=1761958539000)\n", "Sent: Ride(PULocationID=249, DOLocationID=107, trip_distance=1.65, total_amount=28.85, tpep_pickup_datetime=1761956687000)\n", "Sent: Ride(PULocationID=186, DOLocationID=113, trip_distance=1.3, total_amount=24.78, tpep_pickup_datetime=1761955497000)\n", "Sent: Ride(PULocationID=113, DOLocationID=100, trip_distance=2.19, total_amount=30.45, tpep_pickup_datetime=1761957097000)\n", "Sent: Ride(PULocationID=186, DOLocationID=24, trip_distance=5.13, total_amount=44.1, tpep_pickup_datetime=1761957239000)\n", "Sent: Ride(PULocationID=264, DOLocationID=90, trip_distance=1.4, total_amount=41.55, tpep_pickup_datetime=1761957062000)\n", "Sent: 
Ride(PULocationID=138, DOLocationID=162, trip_distance=10.22, total_amount=65.24, tpep_pickup_datetime=1761955208000)\n", "Sent: Ride(PULocationID=161, DOLocationID=234, trip_distance=1.9, total_amount=23.1, tpep_pickup_datetime=1761956752000)\n", "Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.0, total_amount=-8.75, tpep_pickup_datetime=1761957929000)\n", "Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.0, total_amount=8.75, tpep_pickup_datetime=1761957929000)\n", "Sent: Ride(PULocationID=113, DOLocationID=75, trip_distance=5.69, total_amount=35.95, tpep_pickup_datetime=1761958337000)\n", "Sent: Ride(PULocationID=144, DOLocationID=148, trip_distance=0.8, total_amount=22.25, tpep_pickup_datetime=1761955358000)\n", "Sent: Ride(PULocationID=148, DOLocationID=170, trip_distance=2.0, total_amount=25.6, tpep_pickup_datetime=1761956269000)\n", "Sent: Ride(PULocationID=107, DOLocationID=262, trip_distance=3.6, total_amount=30.65, tpep_pickup_datetime=1761957679000)\n", "Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=0.7, total_amount=11.1, tpep_pickup_datetime=1761958752000)\n", "Sent: Ride(PULocationID=48, DOLocationID=238, trip_distance=2.0, total_amount=22.31, tpep_pickup_datetime=1761957155000)\n", "Sent: Ride(PULocationID=48, DOLocationID=229, trip_distance=1.5, total_amount=19.85, tpep_pickup_datetime=1761958389000)\n", "Sent: Ride(PULocationID=68, DOLocationID=144, trip_distance=1.58, total_amount=30.66, tpep_pickup_datetime=1761955512000)\n", "Sent: Ride(PULocationID=163, DOLocationID=143, trip_distance=1.2, total_amount=18.06, tpep_pickup_datetime=1761956341000)\n", "Sent: Ride(PULocationID=48, DOLocationID=141, trip_distance=2.19, total_amount=23.94, tpep_pickup_datetime=1761957714000)\n", "Sent: Ride(PULocationID=234, DOLocationID=233, trip_distance=1.26, total_amount=25.25, tpep_pickup_datetime=1761956020000)\n", "Sent: Ride(PULocationID=137, DOLocationID=79, trip_distance=1.64, total_amount=27.65, 
tpep_pickup_datetime=1761956631000)\n", "Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.91, total_amount=14.45, tpep_pickup_datetime=1761958456000)\n", "Sent: Ride(PULocationID=237, DOLocationID=68, trip_distance=2.5, total_amount=29.82, tpep_pickup_datetime=1761955599000)\n", "Sent: Ride(PULocationID=90, DOLocationID=87, trip_distance=4.67, total_amount=50.82, tpep_pickup_datetime=1761957483000)\n", "Sent: Ride(PULocationID=237, DOLocationID=164, trip_distance=1.85, total_amount=23.94, tpep_pickup_datetime=1761958438000)\n", "Sent: Ride(PULocationID=143, DOLocationID=162, trip_distance=1.3, total_amount=13.65, tpep_pickup_datetime=1761956802000)\n", "Sent: Ride(PULocationID=162, DOLocationID=158, trip_distance=3.0, total_amount=41.65, tpep_pickup_datetime=1761957357000)\n", "Sent: Ride(PULocationID=37, DOLocationID=143, trip_distance=8.61, total_amount=65.88, tpep_pickup_datetime=1761956548000)\n", "Sent: Ride(PULocationID=87, DOLocationID=148, trip_distance=1.67, total_amount=28.14, tpep_pickup_datetime=1761956271000)\n", "Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.93, total_amount=18.9, tpep_pickup_datetime=1761957806000)\n", "Sent: Ride(PULocationID=137, DOLocationID=236, trip_distance=3.45, total_amount=26.45, tpep_pickup_datetime=1761958735000)\n", "Sent: Ride(PULocationID=148, DOLocationID=87, trip_distance=3.04, total_amount=26.05, tpep_pickup_datetime=1761958064000)\n", "Sent: Ride(PULocationID=238, DOLocationID=236, trip_distance=1.77, total_amount=20.5, tpep_pickup_datetime=1761955394000)\n", "Sent: Ride(PULocationID=141, DOLocationID=265, trip_distance=6.26, total_amount=92.1, tpep_pickup_datetime=1761956388000)\n", "Sent: Ride(PULocationID=152, DOLocationID=82, trip_distance=7.39, total_amount=50.94, tpep_pickup_datetime=1761957298000)\n", "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.38, total_amount=19.25, tpep_pickup_datetime=1761958530000)\n", "Sent: Ride(PULocationID=137, DOLocationID=90, 
trip_distance=0.83, total_amount=18.9, tpep_pickup_datetime=1761956852000)\n", "Sent: Ride(PULocationID=263, DOLocationID=237, trip_distance=1.47, total_amount=14.3, tpep_pickup_datetime=1761955429000)\n", "Sent: Ride(PULocationID=162, DOLocationID=233, trip_distance=0.55, total_amount=19.74, tpep_pickup_datetime=1761956098000)\n", "Sent: Ride(PULocationID=158, DOLocationID=75, trip_distance=6.2, total_amount=53.35, tpep_pickup_datetime=1761956596000)\n", "Sent: Ride(PULocationID=164, DOLocationID=231, trip_distance=2.93, total_amount=35.55, tpep_pickup_datetime=1761958267000)\n", "Sent: Ride(PULocationID=100, DOLocationID=246, trip_distance=0.73, total_amount=14.25, tpep_pickup_datetime=1761957445000)\n", "Sent: Ride(PULocationID=246, DOLocationID=48, trip_distance=1.48, total_amount=23.94, tpep_pickup_datetime=1761955900000)\n", "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.6, total_amount=20.41, tpep_pickup_datetime=1761955372000)\n", "Sent: Ride(PULocationID=90, DOLocationID=137, trip_distance=1.6, total_amount=25.9, tpep_pickup_datetime=1761956526000)\n", "Sent: Ride(PULocationID=107, DOLocationID=233, trip_distance=1.4, total_amount=22.05, tpep_pickup_datetime=1761958784000)\n", "Sent: Ride(PULocationID=107, DOLocationID=234, trip_distance=0.4, total_amount=14.7, tpep_pickup_datetime=1761955781000)\n", "Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=2.1, total_amount=44.9, tpep_pickup_datetime=1761956134000)\n", "Sent: Ride(PULocationID=170, DOLocationID=87, trip_distance=5.1, total_amount=53.34, tpep_pickup_datetime=1761956209000)\n", "Sent: Ride(PULocationID=262, DOLocationID=145, trip_distance=1.73, total_amount=16.45, tpep_pickup_datetime=1761955736000)\n", "Sent: Ride(PULocationID=68, DOLocationID=158, trip_distance=0.68, total_amount=21.95, tpep_pickup_datetime=1761955728000)\n", "Sent: Ride(PULocationID=158, DOLocationID=249, trip_distance=0.22, total_amount=15.54, tpep_pickup_datetime=1761956324000)\n", "Sent: 
Ride(PULocationID=239, DOLocationID=229, trip_distance=2.49, total_amount=24.95, tpep_pickup_datetime=1761955362000)\n", "Sent: Ride(PULocationID=229, DOLocationID=140, trip_distance=1.08, total_amount=15.54, tpep_pickup_datetime=1761956228000)\n", "Sent: Ride(PULocationID=249, DOLocationID=163, trip_distance=4.6, total_amount=47.85, tpep_pickup_datetime=1761955422000)\n", "Sent: Ride(PULocationID=163, DOLocationID=237, trip_distance=1.41, total_amount=19.74, tpep_pickup_datetime=1761958654000)\n", "Sent: Ride(PULocationID=229, DOLocationID=68, trip_distance=3.5, total_amount=26.25, tpep_pickup_datetime=1761958702000)\n", "Sent: Ride(PULocationID=90, DOLocationID=48, trip_distance=1.1, total_amount=24.8, tpep_pickup_datetime=1761955292000)\n", "Sent: Ride(PULocationID=163, DOLocationID=90, trip_distance=1.3, total_amount=21.45, tpep_pickup_datetime=1761956729000)\n", "Sent: Ride(PULocationID=90, DOLocationID=249, trip_distance=0.5, total_amount=20.55, tpep_pickup_datetime=1761957601000)\n", "Sent: Ride(PULocationID=249, DOLocationID=137, trip_distance=2.7, total_amount=36.3, tpep_pickup_datetime=1761958362000)\n", "Sent: Ride(PULocationID=232, DOLocationID=87, trip_distance=1.54, total_amount=22.29, tpep_pickup_datetime=1761957878000)\n", "Sent: Ride(PULocationID=87, DOLocationID=148, trip_distance=1.66, total_amount=29.82, tpep_pickup_datetime=1761958604000)\n", "Sent: Ride(PULocationID=65, DOLocationID=49, trip_distance=0.9, total_amount=11.4, tpep_pickup_datetime=1761956206000)\n", "Sent: Ride(PULocationID=148, DOLocationID=249, trip_distance=1.43, total_amount=21.95, tpep_pickup_datetime=1761958184000)\n", "Sent: Ride(PULocationID=162, DOLocationID=186, trip_distance=1.66, total_amount=26.15, tpep_pickup_datetime=1761955322000)\n", "Sent: Ride(PULocationID=186, DOLocationID=68, trip_distance=0.75, total_amount=16.05, tpep_pickup_datetime=1761956736000)\n", "Sent: Ride(PULocationID=68, DOLocationID=164, trip_distance=1.11, total_amount=21.42, 
tpep_pickup_datetime=1761957396000)\n", "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=2.22, total_amount=49.85, tpep_pickup_datetime=1761956131000)\n", "Sent: Ride(PULocationID=138, DOLocationID=68, trip_distance=11.1, total_amount=79.9, tpep_pickup_datetime=1761955488000)\n", "Sent: Ride(PULocationID=68, DOLocationID=246, trip_distance=0.9, total_amount=33.45, tpep_pickup_datetime=1761958334000)\n", "Sent: Ride(PULocationID=141, DOLocationID=74, trip_distance=2.9, total_amount=23.6, tpep_pickup_datetime=1761955491000)\n", "Sent: Ride(PULocationID=74, DOLocationID=244, trip_distance=3.7, total_amount=22.2, tpep_pickup_datetime=1761956515000)\n", "Sent: Ride(PULocationID=263, DOLocationID=209, trip_distance=7.04, total_amount=51.66, tpep_pickup_datetime=1761957054000)\n", "Sent: Ride(PULocationID=48, DOLocationID=256, trip_distance=7.1, total_amount=61.95, tpep_pickup_datetime=1761955765000)\n", "Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=1.18, total_amount=17.22, tpep_pickup_datetime=1761955439000)\n", "Sent: Ride(PULocationID=239, DOLocationID=161, trip_distance=2.44, total_amount=24.78, tpep_pickup_datetime=1761956004000)\n", "Sent: Ride(PULocationID=237, DOLocationID=236, trip_distance=1.22, total_amount=17.22, tpep_pickup_datetime=1761957596000)\n", "Sent: Ride(PULocationID=236, DOLocationID=142, trip_distance=1.74, total_amount=18.84, tpep_pickup_datetime=1761958175000)\n", "Sent: Ride(PULocationID=137, DOLocationID=229, trip_distance=3.03, total_amount=37.38, tpep_pickup_datetime=1761955292000)\n", "Sent: Ride(PULocationID=162, DOLocationID=75, trip_distance=2.3, total_amount=20.58, tpep_pickup_datetime=1761957307000)\n", "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.27, total_amount=20.15, tpep_pickup_datetime=1761956056000)\n", "Sent: Ride(PULocationID=233, DOLocationID=243, trip_distance=9.04, total_amount=55.02, tpep_pickup_datetime=1761956831000)\n", "Sent: Ride(PULocationID=239, DOLocationID=142, 
trip_distance=0.69, total_amount=15.86, tpep_pickup_datetime=1761955394000)\n", "Sent: Ride(PULocationID=48, DOLocationID=100, trip_distance=1.52, total_amount=21.42, tpep_pickup_datetime=1761957177000)\n", "Sent: Ride(PULocationID=230, DOLocationID=163, trip_distance=0.76, total_amount=14.44, tpep_pickup_datetime=1761957724000)\n", "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.84, total_amount=13.8, tpep_pickup_datetime=1761958711000)\n", "Sent: Ride(PULocationID=68, DOLocationID=113, trip_distance=1.8, total_amount=34.45, tpep_pickup_datetime=1761957122000)\n", "Sent: Ride(PULocationID=162, DOLocationID=233, trip_distance=0.38, total_amount=10.85, tpep_pickup_datetime=1761957862000)\n", "Sent: Ride(PULocationID=186, DOLocationID=233, trip_distance=1.63, total_amount=30.66, tpep_pickup_datetime=1761955249000)\n", "Sent: Ride(PULocationID=162, DOLocationID=79, trip_distance=3.63, total_amount=35.25, tpep_pickup_datetime=1761957099000)\n", "Sent: Ride(PULocationID=148, DOLocationID=87, trip_distance=1.2, total_amount=19.75, tpep_pickup_datetime=1761956635000)\n", "Sent: Ride(PULocationID=209, DOLocationID=170, trip_distance=4.5, total_amount=39.05, tpep_pickup_datetime=1761957901000)\n", "Sent: Ride(PULocationID=249, DOLocationID=249, trip_distance=0.27, total_amount=15.05, tpep_pickup_datetime=1761955403000)\n", "Sent: Ride(PULocationID=68, DOLocationID=239, trip_distance=3.5, total_amount=48.3, tpep_pickup_datetime=1761956136000)\n", "Sent: Ride(PULocationID=48, DOLocationID=170, trip_distance=2.16, total_amount=24.78, tpep_pickup_datetime=1761957922000)\n", "Sent: Ride(PULocationID=238, DOLocationID=166, trip_distance=1.0, total_amount=13.0, tpep_pickup_datetime=1761958113000)\n", "Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.56, total_amount=12.96, tpep_pickup_datetime=1761956773000)\n", "Sent: Ride(PULocationID=263, DOLocationID=74, trip_distance=1.91, total_amount=17.25, tpep_pickup_datetime=1761957198000)\n", "Sent: 
Ride(PULocationID=138, DOLocationID=161, trip_distance=9.31, total_amount=70.26, tpep_pickup_datetime=1761958431000)\n", "Sent: Ride(PULocationID=246, DOLocationID=243, trip_distance=10.4, total_amount=54.25, tpep_pickup_datetime=1761957210000)\n", "Sent: Ride(PULocationID=148, DOLocationID=231, trip_distance=1.17, total_amount=21.42, tpep_pickup_datetime=1761956227000)\n", "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.57, total_amount=19.55, tpep_pickup_datetime=1761958599000)\n", "Sent: Ride(PULocationID=163, DOLocationID=145, trip_distance=1.97, total_amount=22.26, tpep_pickup_datetime=1761956297000)\n", "Sent: Ride(PULocationID=148, DOLocationID=209, trip_distance=1.4, total_amount=18.9, tpep_pickup_datetime=1761958782000)\n", "Sent: Ride(PULocationID=48, DOLocationID=163, trip_distance=0.96, total_amount=13.95, tpep_pickup_datetime=1761958014000)\n", "Sent: Ride(PULocationID=163, DOLocationID=234, trip_distance=1.35, total_amount=19.55, tpep_pickup_datetime=1761958413000)\n", "Sent: Ride(PULocationID=100, DOLocationID=145, trip_distance=4.79, total_amount=42.26, tpep_pickup_datetime=1761956569000)\n", "Sent: Ride(PULocationID=107, DOLocationID=152, trip_distance=7.57, total_amount=57.79, tpep_pickup_datetime=1761957412000)\n", "Sent: Ride(PULocationID=142, DOLocationID=163, trip_distance=0.64, total_amount=12.25, tpep_pickup_datetime=1761957651000)\n", "Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=1.16, total_amount=15.75, tpep_pickup_datetime=1761958651000)\n", "Sent: Ride(PULocationID=148, DOLocationID=80, trip_distance=3.27, total_amount=24.85, tpep_pickup_datetime=1761956017000)\n", "Sent: Ride(PULocationID=138, DOLocationID=70, trip_distance=1.16, total_amount=24.09, tpep_pickup_datetime=1761958172000)\n", "Sent: Ride(PULocationID=138, DOLocationID=252, trip_distance=6.74, total_amount=49.63, tpep_pickup_datetime=1761958767000)\n", "Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=3.51, total_amount=37.38, 
tpep_pickup_datetime=1761955439000)\n", "Sent: Ride(PULocationID=162, DOLocationID=75, trip_distance=2.72, total_amount=18.55, tpep_pickup_datetime=1761958052000)\n", "Sent: Ride(PULocationID=238, DOLocationID=151, trip_distance=0.53, total_amount=12.12, tpep_pickup_datetime=1761957724000)\n", "Sent: Ride(PULocationID=113, DOLocationID=48, trip_distance=2.82, total_amount=38.22, tpep_pickup_datetime=1761955250000)\n", "Sent: Ride(PULocationID=48, DOLocationID=74, trip_distance=5.21, total_amount=37.38, tpep_pickup_datetime=1761957341000)\n", "Sent: Ride(PULocationID=141, DOLocationID=144, trip_distance=4.22, total_amount=42.73, tpep_pickup_datetime=1761956330000)\n", "Sent: Ride(PULocationID=132, DOLocationID=142, trip_distance=21.73, total_amount=87.69, tpep_pickup_datetime=1761956883000)\n", "Sent: Ride(PULocationID=144, DOLocationID=141, trip_distance=4.5, total_amount=46.6, tpep_pickup_datetime=1761955852000)\n", "Sent: Ride(PULocationID=107, DOLocationID=262, trip_distance=3.44, total_amount=33.18, tpep_pickup_datetime=1761956238000)\n", "Sent: Ride(PULocationID=263, DOLocationID=229, trip_distance=1.63, total_amount=19.74, tpep_pickup_datetime=1761958132000)\n", "Sent: Ride(PULocationID=229, DOLocationID=236, trip_distance=1.45, total_amount=16.38, tpep_pickup_datetime=1761958766000)\n", "Sent: Ride(PULocationID=114, DOLocationID=107, trip_distance=1.3, total_amount=21.35, tpep_pickup_datetime=1761955718000)\n", "Sent: Ride(PULocationID=107, DOLocationID=79, trip_distance=1.0, total_amount=20.65, tpep_pickup_datetime=1761956865000)\n", "Sent: Ride(PULocationID=79, DOLocationID=48, trip_distance=3.6, total_amount=29.05, tpep_pickup_datetime=1761957987000)\n", "Sent: Ride(PULocationID=230, DOLocationID=238, trip_distance=2.1, total_amount=21.4, tpep_pickup_datetime=1761957886000)\n", "Sent: Ride(PULocationID=144, DOLocationID=141, trip_distance=3.71, total_amount=32.34, tpep_pickup_datetime=1761958179000)\n", "Sent: Ride(PULocationID=144, DOLocationID=170, 
trip_distance=1.92, total_amount=28.14, tpep_pickup_datetime=1761958413000)\n", "Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=9.95, total_amount=76.61, tpep_pickup_datetime=1761956513000)\n", "Sent: Ride(PULocationID=140, DOLocationID=107, trip_distance=2.2, total_amount=24.15, tpep_pickup_datetime=1761957310000)\n", "Sent: Ride(PULocationID=164, DOLocationID=17, trip_distance=8.9, total_amount=67.49, tpep_pickup_datetime=1761956182000)\n", "Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.33, total_amount=17.16, tpep_pickup_datetime=1761956654000)\n", "Sent: Ride(PULocationID=234, DOLocationID=230, trip_distance=1.8, total_amount=22.25, tpep_pickup_datetime=1761957156000)\n", "Sent: Ride(PULocationID=236, DOLocationID=48, trip_distance=2.3, total_amount=28.35, tpep_pickup_datetime=1761958637000)\n", "Sent: Ride(PULocationID=141, DOLocationID=79, trip_distance=4.6, total_amount=53.34, tpep_pickup_datetime=1761957453000)\n", "Sent: Ride(PULocationID=90, DOLocationID=249, trip_distance=1.3, total_amount=30.45, tpep_pickup_datetime=1761955450000)\n", "Sent: Ride(PULocationID=249, DOLocationID=238, trip_distance=4.5, total_amount=50.0, tpep_pickup_datetime=1761957468000)\n", "Sent: Ride(PULocationID=249, DOLocationID=246, trip_distance=1.83, total_amount=24.75, tpep_pickup_datetime=1761957761000)\n", "Sent: Ride(PULocationID=170, DOLocationID=148, trip_distance=2.5, total_amount=32.85, tpep_pickup_datetime=1761955826000)\n", "Sent: Ride(PULocationID=144, DOLocationID=249, trip_distance=1.9, total_amount=24.15, tpep_pickup_datetime=1761958126000)\n", "Sent: Ride(PULocationID=100, DOLocationID=143, trip_distance=1.98, total_amount=25.36, tpep_pickup_datetime=1761955308000)\n", "Sent: Ride(PULocationID=170, DOLocationID=246, trip_distance=2.17, total_amount=23.45, tpep_pickup_datetime=1761955684000)\n", "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.79, total_amount=28.98, tpep_pickup_datetime=1761955920000)\n", "Sent: 
Ride(PULocationID=107, DOLocationID=237, trip_distance=2.55, total_amount=23.35, tpep_pickup_datetime=1761957446000)\n", "Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=1.66, total_amount=17.2, tpep_pickup_datetime=1761958370000)\n", "Sent: Ride(PULocationID=113, DOLocationID=255, trip_distance=5.69, total_amount=52.15, tpep_pickup_datetime=1761956236000)\n", "Sent: Ride(PULocationID=144, DOLocationID=249, trip_distance=1.0, total_amount=15.05, tpep_pickup_datetime=1761958649000)\n", "Sent: Ride(PULocationID=158, DOLocationID=43, trip_distance=4.24, total_amount=42.45, tpep_pickup_datetime=1761957905000)\n", "Sent: Ride(PULocationID=132, DOLocationID=256, trip_distance=16.31, total_amount=87.85, tpep_pickup_datetime=1761957025000)\n", "Sent: Ride(PULocationID=138, DOLocationID=231, trip_distance=11.65, total_amount=66.33, tpep_pickup_datetime=1761958443000)\n", "Sent: Ride(PULocationID=88, DOLocationID=261, trip_distance=0.43, total_amount=18.06, tpep_pickup_datetime=1761956169000)\n", "Sent: Ride(PULocationID=261, DOLocationID=186, trip_distance=5.42, total_amount=51.66, tpep_pickup_datetime=1761957049000)\n", "Sent: Ride(PULocationID=50, DOLocationID=68, trip_distance=0.93, total_amount=15.54, tpep_pickup_datetime=1761957235000)\n", "Sent: Ride(PULocationID=68, DOLocationID=13, trip_distance=3.6, total_amount=-36.05, tpep_pickup_datetime=1761958091000)\n", "Sent: Ride(PULocationID=68, DOLocationID=13, trip_distance=3.6, total_amount=36.05, tpep_pickup_datetime=1761958091000)\n", "Sent: Ride(PULocationID=246, DOLocationID=37, trip_distance=12.66, total_amount=83.09, tpep_pickup_datetime=1761956632000)\n", "Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=0.69, total_amount=14.65, tpep_pickup_datetime=1761955727000)\n", "Sent: Ride(PULocationID=79, DOLocationID=148, trip_distance=1.23, total_amount=25.62, tpep_pickup_datetime=1761956213000)\n", "Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=1.04, total_amount=16.75, 
tpep_pickup_datetime=1761957350000)\n", "Sent: Ride(PULocationID=79, DOLocationID=231, trip_distance=1.63, total_amount=19.15, tpep_pickup_datetime=1761958074000)\n", "Sent: Ride(PULocationID=79, DOLocationID=255, trip_distance=3.38, total_amount=49.14, tpep_pickup_datetime=1761956923000)\n", "Sent: Ride(PULocationID=234, DOLocationID=113, trip_distance=0.28, total_amount=16.38, tpep_pickup_datetime=1761956238000)\n", "Sent: Ride(PULocationID=113, DOLocationID=161, trip_distance=2.63, total_amount=29.75, tpep_pickup_datetime=1761956822000)\n", "Sent: Ride(PULocationID=43, DOLocationID=158, trip_distance=5.39, total_amount=36.25, tpep_pickup_datetime=1761956931000)\n", "Sent: Ride(PULocationID=158, DOLocationID=186, trip_distance=1.33, total_amount=17.75, tpep_pickup_datetime=1761958537000)\n", "Sent: Ride(PULocationID=50, DOLocationID=263, trip_distance=4.11, total_amount=26.25, tpep_pickup_datetime=1761955276000)\n", "Sent: Ride(PULocationID=231, DOLocationID=144, trip_distance=1.07, total_amount=24.06, tpep_pickup_datetime=1761955616000)\n", "Sent: Ride(PULocationID=144, DOLocationID=158, trip_distance=2.15, total_amount=27.3, tpep_pickup_datetime=1761956698000)\n", "Sent: Ride(PULocationID=158, DOLocationID=249, trip_distance=0.58, total_amount=25.03, tpep_pickup_datetime=1761957879000)\n", "Sent: Ride(PULocationID=107, DOLocationID=246, trip_distance=1.7, total_amount=25.6, tpep_pickup_datetime=1761958272000)\n", "Sent: Ride(PULocationID=100, DOLocationID=75, trip_distance=4.3, total_amount=36.5, tpep_pickup_datetime=1761955917000)\n", "Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.73, total_amount=17.22, tpep_pickup_datetime=1761955730000)\n", "Sent: Ride(PULocationID=230, DOLocationID=243, trip_distance=8.62, total_amount=52.5, tpep_pickup_datetime=1761957246000)\n", "Sent: Ride(PULocationID=236, DOLocationID=163, trip_distance=1.9, total_amount=20.6, tpep_pickup_datetime=1761956718000)\n", "Sent: Ride(PULocationID=163, DOLocationID=234, 
trip_distance=2.3, total_amount=29.8, tpep_pickup_datetime=1761957364000)\n", "Sent: Ride(PULocationID=141, DOLocationID=79, trip_distance=3.15, total_amount=48.56, tpep_pickup_datetime=1761955429000)\n", "Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.09, total_amount=13.02, tpep_pickup_datetime=1761957978000)\n", "Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=2.07, total_amount=23.94, tpep_pickup_datetime=1761955360000)\n", "Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.43, total_amount=12.12, tpep_pickup_datetime=1761956193000)\n", "Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=1.27, total_amount=15.48, tpep_pickup_datetime=1761956860000)\n", "Sent: Ride(PULocationID=239, DOLocationID=229, trip_distance=3.11, total_amount=28.98, tpep_pickup_datetime=1761957265000)\n", "Sent: Ride(PULocationID=229, DOLocationID=79, trip_distance=2.55, total_amount=32.45, tpep_pickup_datetime=1761958387000)\n", "Sent: Ride(PULocationID=226, DOLocationID=7, trip_distance=1.1, total_amount=14.75, tpep_pickup_datetime=1761958172000)\n", "Sent: Ride(PULocationID=107, DOLocationID=137, trip_distance=0.75, total_amount=16.35, tpep_pickup_datetime=1761955477000)\n", "Sent: Ride(PULocationID=161, DOLocationID=236, trip_distance=2.06, total_amount=19.85, tpep_pickup_datetime=1761957775000)\n", "Sent: Ride(PULocationID=132, DOLocationID=145, trip_distance=16.5, total_amount=90.45, tpep_pickup_datetime=1761958648000)\n", "Sent: Ride(PULocationID=186, DOLocationID=107, trip_distance=1.0, total_amount=21.42, tpep_pickup_datetime=1761955658000)\n", "Sent: Ride(PULocationID=107, DOLocationID=170, trip_distance=0.71, total_amount=12.95, tpep_pickup_datetime=1761956423000)\n", "Sent: Ride(PULocationID=170, DOLocationID=170, trip_distance=0.0, total_amount=-19.25, tpep_pickup_datetime=1761956816000)\n", "Sent: Ride(PULocationID=170, DOLocationID=170, trip_distance=0.0, total_amount=19.25, tpep_pickup_datetime=1761956816000)\n", "Sent: 
Ride(PULocationID=113, DOLocationID=224, trip_distance=1.0, total_amount=23.95, tpep_pickup_datetime=1761956447000)\n", "Sent: Ride(PULocationID=224, DOLocationID=79, trip_distance=0.9, total_amount=17.2, tpep_pickup_datetime=1761957800000)\n", "Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.7, total_amount=18.9, tpep_pickup_datetime=1761958452000)\n", "Sent: Ride(PULocationID=114, DOLocationID=113, trip_distance=0.8, total_amount=19.15, tpep_pickup_datetime=1761955618000)\n", "Sent: Ride(PULocationID=79, DOLocationID=7, trip_distance=5.5, total_amount=56.7, tpep_pickup_datetime=1761957926000)\n", "Sent: Ride(PULocationID=238, DOLocationID=48, trip_distance=1.8, total_amount=19.15, tpep_pickup_datetime=1761956880000)\n", "Sent: Ride(PULocationID=230, DOLocationID=164, trip_distance=2.4, total_amount=29.05, tpep_pickup_datetime=1761957816000)\n", "Sent: Ride(PULocationID=234, DOLocationID=237, trip_distance=2.28, total_amount=24.15, tpep_pickup_datetime=1761957069000)\n", "Sent: Ride(PULocationID=107, DOLocationID=164, trip_distance=0.62, total_amount=16.38, tpep_pickup_datetime=1761957817000)\n", "Sent: Ride(PULocationID=186, DOLocationID=238, trip_distance=4.1, total_amount=36.15, tpep_pickup_datetime=1761958381000)\n", "Sent: Ride(PULocationID=158, DOLocationID=125, trip_distance=0.99, total_amount=23.75, tpep_pickup_datetime=1761955374000)\n", "Sent: Ride(PULocationID=90, DOLocationID=158, trip_distance=2.28, total_amount=43.26, tpep_pickup_datetime=1761956336000)\n", "Sent: Ride(PULocationID=141, DOLocationID=237, trip_distance=0.88, total_amount=12.25, tpep_pickup_datetime=1761956351000)\n", "Sent: Ride(PULocationID=141, DOLocationID=263, trip_distance=0.58, total_amount=12.12, tpep_pickup_datetime=1761956881000)\n", "Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.34, total_amount=10.81, tpep_pickup_datetime=1761957100000)\n", "Sent: Ride(PULocationID=262, DOLocationID=140, trip_distance=1.3, total_amount=18.0, 
tpep_pickup_datetime=1761957520000)\n", "Sent: Ride(PULocationID=237, DOLocationID=236, trip_distance=0.51, total_amount=10.4, tpep_pickup_datetime=1761958424000)\n", "Sent: Ride(PULocationID=231, DOLocationID=79, trip_distance=1.71, total_amount=21.35, tpep_pickup_datetime=1761955817000)\n", "Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.77, total_amount=19.25, tpep_pickup_datetime=1761957630000)\n", "Sent: Ride(PULocationID=249, DOLocationID=230, trip_distance=2.52, total_amount=44.94, tpep_pickup_datetime=1761955596000)\n", "Sent: Ride(PULocationID=116, DOLocationID=166, trip_distance=1.05, total_amount=13.52, tpep_pickup_datetime=1761956317000)\n", "Sent: Ride(PULocationID=138, DOLocationID=230, trip_distance=11.01, total_amount=82.02, tpep_pickup_datetime=1761956511000)\n", "Sent: Ride(PULocationID=48, DOLocationID=45, trip_distance=4.35, total_amount=-38.85, tpep_pickup_datetime=1761958593000)\n", "Sent: Ride(PULocationID=48, DOLocationID=45, trip_distance=4.35, total_amount=38.85, tpep_pickup_datetime=1761958593000)\n", "Sent: Ride(PULocationID=140, DOLocationID=263, trip_distance=0.82, total_amount=13.7, tpep_pickup_datetime=1761956267000)\n", "Sent: Ride(PULocationID=263, DOLocationID=74, trip_distance=0.84, total_amount=13.8, tpep_pickup_datetime=1761956776000)\n", "Sent: Ride(PULocationID=48, DOLocationID=186, trip_distance=1.4, total_amount=18.9, tpep_pickup_datetime=1761958274000)\n", "Sent: Ride(PULocationID=138, DOLocationID=28, trip_distance=6.35, total_amount=50.47, tpep_pickup_datetime=1761955330000)\n", "Sent: Ride(PULocationID=138, DOLocationID=262, trip_distance=8.44, total_amount=63.48, tpep_pickup_datetime=1761957553000)\n", "Sent: Ride(PULocationID=246, DOLocationID=48, trip_distance=1.11, total_amount=22.26, tpep_pickup_datetime=1761956847000)\n", "Sent: Ride(PULocationID=48, DOLocationID=87, trip_distance=6.66, total_amount=50.15, tpep_pickup_datetime=1761957913000)\n", "Sent: Ride(PULocationID=239, DOLocationID=151, 
trip_distance=1.41, total_amount=16.32, tpep_pickup_datetime=1761956114000)\n", "Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=0.76, total_amount=10.8, tpep_pickup_datetime=1761957059000)\n", "Sent: Ride(PULocationID=142, DOLocationID=238, trip_distance=1.01, total_amount=13.8, tpep_pickup_datetime=1761957693000)\n", "Sent: Ride(PULocationID=239, DOLocationID=263, trip_distance=1.4, total_amount=18.0, tpep_pickup_datetime=1761958383000)\n", "Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=1.11, total_amount=14.64, tpep_pickup_datetime=1761958732000)\n", "Sent: Ride(PULocationID=261, DOLocationID=229, trip_distance=5.7, total_amount=39.55, tpep_pickup_datetime=1761957098000)\n", "Sent: Ride(PULocationID=107, DOLocationID=148, trip_distance=1.38, total_amount=29.05, tpep_pickup_datetime=1761956006000)\n", "Sent: Ride(PULocationID=144, DOLocationID=141, trip_distance=4.33, total_amount=44.1, tpep_pickup_datetime=1761957811000)\n", "Sent: Ride(PULocationID=132, DOLocationID=238, trip_distance=19.53, total_amount=98.88, tpep_pickup_datetime=1761956173000)\n", "Sent: Ride(PULocationID=234, DOLocationID=43, trip_distance=2.5, total_amount=37.2, tpep_pickup_datetime=1761957579000)\n", "Sent: Ride(PULocationID=138, DOLocationID=68, trip_distance=10.49, total_amount=73.3, tpep_pickup_datetime=1761955850000)\n", "Sent: Ride(PULocationID=90, DOLocationID=48, trip_distance=1.54, total_amount=33.25, tpep_pickup_datetime=1761955446000)\n", "Sent: Ride(PULocationID=50, DOLocationID=97, trip_distance=5.7, total_amount=56.65, tpep_pickup_datetime=1761958001000)\n", "Sent: Ride(PULocationID=90, DOLocationID=148, trip_distance=1.43, total_amount=38.22, tpep_pickup_datetime=1761956136000)\n", "Sent: Ride(PULocationID=148, DOLocationID=36, trip_distance=3.85, total_amount=37.45, tpep_pickup_datetime=1761958185000)\n", "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.09, total_amount=15.15, tpep_pickup_datetime=1761955232000)\n", "Sent: 
Ride(PULocationID=246, DOLocationID=249, trip_distance=1.32, total_amount=23.45, tpep_pickup_datetime=1761955586000)\n", "Sent: Ride(PULocationID=249, DOLocationID=13, trip_distance=3.02, total_amount=39.06, tpep_pickup_datetime=1761956901000)\n", "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.7, total_amount=12.95, tpep_pickup_datetime=1761956438000)\n", "Sent: Ride(PULocationID=140, DOLocationID=244, trip_distance=5.9, total_amount=38.9, tpep_pickup_datetime=1761956922000)\n", "Sent: Ride(PULocationID=107, DOLocationID=68, trip_distance=1.07, total_amount=19.74, tpep_pickup_datetime=1761954560000)\n", "Sent: Ride(PULocationID=68, DOLocationID=263, trip_distance=5.34, total_amount=49.14, tpep_pickup_datetime=1761955328000)\n", "Sent: Ride(PULocationID=48, DOLocationID=237, trip_distance=1.92, total_amount=19.25, tpep_pickup_datetime=1761958528000)\n", "Sent: Ride(PULocationID=132, DOLocationID=10, trip_distance=2.32, total_amount=26.76, tpep_pickup_datetime=1761958227000)\n", "Sent: Ride(PULocationID=79, DOLocationID=162, trip_distance=2.4, total_amount=26.45, tpep_pickup_datetime=1761956453000)\n", "Sent: Ride(PULocationID=162, DOLocationID=48, trip_distance=1.3, total_amount=23.2, tpep_pickup_datetime=1761957702000)\n", "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.1, total_amount=17.05, tpep_pickup_datetime=1761958724000)\n", "Sent: Ride(PULocationID=48, DOLocationID=90, trip_distance=1.8, total_amount=27.3, tpep_pickup_datetime=1761955601000)\n", "Sent: Ride(PULocationID=90, DOLocationID=88, trip_distance=3.1, total_amount=41.6, tpep_pickup_datetime=1761956876000)\n", "Sent: Ride(PULocationID=68, DOLocationID=48, trip_distance=1.5, total_amount=26.45, tpep_pickup_datetime=1761955465000)\n", "Sent: Ride(PULocationID=48, DOLocationID=164, trip_distance=1.3, total_amount=20.85, tpep_pickup_datetime=1761956661000)\n", "Sent: Ride(PULocationID=48, DOLocationID=50, trip_distance=0.5, total_amount=15.5, 
tpep_pickup_datetime=1761957974000)\n", "Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=1.64, total_amount=18.06, tpep_pickup_datetime=1761957850000)\n", "Sent: Ride(PULocationID=229, DOLocationID=4, trip_distance=2.84, total_amount=39.06, tpep_pickup_datetime=1761958097000)\n", "Sent: Ride(PULocationID=113, DOLocationID=164, trip_distance=1.09, total_amount=17.15, tpep_pickup_datetime=1761958351000)\n", "Sent: Ride(PULocationID=68, DOLocationID=88, trip_distance=3.3, total_amount=35.7, tpep_pickup_datetime=1761957349000)\n", "Sent: Ride(PULocationID=142, DOLocationID=229, trip_distance=1.0, total_amount=17.0, tpep_pickup_datetime=1761955657000)\n", "Sent: Ride(PULocationID=163, DOLocationID=223, trip_distance=5.7, total_amount=55.0, tpep_pickup_datetime=1761957596000)\n", "Sent: Ride(PULocationID=148, DOLocationID=4, trip_distance=0.54, total_amount=16.05, tpep_pickup_datetime=1761956220000)\n", "Sent: Ride(PULocationID=4, DOLocationID=246, trip_distance=2.39, total_amount=35.7, tpep_pickup_datetime=1761956807000)\n", "Sent: Ride(PULocationID=148, DOLocationID=140, trip_distance=4.82, total_amount=42.42, tpep_pickup_datetime=1761957016000)\n", "Sent: Ride(PULocationID=249, DOLocationID=231, trip_distance=1.29, total_amount=18.06, tpep_pickup_datetime=1761958584000)\n", "Sent: Ride(PULocationID=230, DOLocationID=238, trip_distance=2.3, total_amount=20.65, tpep_pickup_datetime=1761955264000)\n", "Sent: Ride(PULocationID=143, DOLocationID=263, trip_distance=2.0, total_amount=18.85, tpep_pickup_datetime=1761956722000)\n", "Sent: Ride(PULocationID=263, DOLocationID=107, trip_distance=4.0, total_amount=34.0, tpep_pickup_datetime=1761957356000)\n", "Sent: Ride(PULocationID=107, DOLocationID=249, trip_distance=1.3, total_amount=24.95, tpep_pickup_datetime=1761958780000)\n", "Sent: Ride(PULocationID=163, DOLocationID=229, trip_distance=1.0, total_amount=18.05, tpep_pickup_datetime=1761955689000)\n", "Sent: Ride(PULocationID=229, DOLocationID=107, 
trip_distance=1.7, total_amount=32.15, tpep_pickup_datetime=1761956227000)\n", "Sent: Ride(PULocationID=107, DOLocationID=79, trip_distance=0.9, total_amount=27.55, tpep_pickup_datetime=1761958453000)\n", "Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.35, total_amount=23.94, tpep_pickup_datetime=1761955756000)\n", "Sent: Ride(PULocationID=230, DOLocationID=48, trip_distance=0.7, total_amount=16.38, tpep_pickup_datetime=1761956588000)\n", "Sent: Ride(PULocationID=48, DOLocationID=100, trip_distance=1.02, total_amount=15.75, tpep_pickup_datetime=1761957508000)\n", "Sent: Ride(PULocationID=100, DOLocationID=140, trip_distance=2.04, total_amount=22.75, tpep_pickup_datetime=1761958204000)\n", "Sent: Ride(PULocationID=48, DOLocationID=166, trip_distance=2.9, total_amount=23.35, tpep_pickup_datetime=1761956399000)\n", "Sent: Ride(PULocationID=151, DOLocationID=116, trip_distance=2.6, total_amount=22.55, tpep_pickup_datetime=1761957485000)\n", "Sent: Ride(PULocationID=209, DOLocationID=158, trip_distance=2.7, total_amount=26.95, tpep_pickup_datetime=1761955527000)\n", "Sent: Ride(PULocationID=158, DOLocationID=249, trip_distance=1.3, total_amount=32.8, tpep_pickup_datetime=1761957154000)\n", "Sent: Ride(PULocationID=48, DOLocationID=48, trip_distance=0.3, total_amount=10.15, tpep_pickup_datetime=1761955372000)\n", "Sent: Ride(PULocationID=48, DOLocationID=112, trip_distance=5.3, total_amount=39.9, tpep_pickup_datetime=1761955882000)\n", "Sent: Ride(PULocationID=237, DOLocationID=262, trip_distance=1.43, total_amount=-15.0, tpep_pickup_datetime=1761957456000)\n", "Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=1.43, total_amount=15.0, tpep_pickup_datetime=1761957456000)\n", "Sent: Ride(PULocationID=162, DOLocationID=48, trip_distance=0.7, total_amount=22.95, tpep_pickup_datetime=1761956063000)\n", "Sent: Ride(PULocationID=138, DOLocationID=161, trip_distance=10.23, total_amount=79.5, tpep_pickup_datetime=1761955406000)\n", "Sent: 
Ride(PULocationID=137, DOLocationID=87, trip_distance=4.43, total_amount=50.05, tpep_pickup_datetime=1761956344000)\n", "Sent: Ride(PULocationID=79, DOLocationID=113, trip_distance=0.5, total_amount=13.85, tpep_pickup_datetime=1761956358000)\n", "Sent: Ride(PULocationID=113, DOLocationID=144, trip_distance=0.9, total_amount=15.55, tpep_pickup_datetime=1761956781000)\n", "Sent: Ride(PULocationID=261, DOLocationID=236, trip_distance=9.4, total_amount=59.35, tpep_pickup_datetime=1761958586000)\n", "Sent: Ride(PULocationID=237, DOLocationID=140, trip_distance=0.87, total_amount=13.7, tpep_pickup_datetime=1761955339000)\n", "Sent: Ride(PULocationID=141, DOLocationID=7, trip_distance=3.94, total_amount=27.65, tpep_pickup_datetime=1761956046000)\n", "Sent: Ride(PULocationID=48, DOLocationID=107, trip_distance=2.58, total_amount=34.02, tpep_pickup_datetime=1761958777000)\n", "Sent: Ride(PULocationID=24, DOLocationID=74, trip_distance=1.82, total_amount=17.52, tpep_pickup_datetime=1761957474000)\n", "Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.77, total_amount=23.1, tpep_pickup_datetime=1761955737000)\n", "Sent: Ride(PULocationID=107, DOLocationID=87, trip_distance=3.99, total_amount=40.74, tpep_pickup_datetime=1761957025000)\n", "Sent: Ride(PULocationID=148, DOLocationID=113, trip_distance=0.66, total_amount=17.22, tpep_pickup_datetime=1761956685000)\n", "Sent: Ride(PULocationID=113, DOLocationID=170, trip_distance=1.08, total_amount=21.42, tpep_pickup_datetime=1761957289000)\n", "Sent: Ride(PULocationID=170, DOLocationID=48, trip_distance=1.59, total_amount=25.79, tpep_pickup_datetime=1761958164000)\n", "Sent: Ride(PULocationID=226, DOLocationID=181, trip_distance=9.59, total_amount=47.3, tpep_pickup_datetime=1761955971000)\n", "Sent: Ride(PULocationID=186, DOLocationID=246, trip_distance=0.55, total_amount=12.85, tpep_pickup_datetime=1761956644000)\n", "Sent: Ride(PULocationID=48, DOLocationID=265, trip_distance=2.75, total_amount=102.97, 
tpep_pickup_datetime=1761958631000)\n", "Sent: Ride(PULocationID=231, DOLocationID=13, trip_distance=1.22, total_amount=17.22, tpep_pickup_datetime=1761955724000)\n", "Sent: Ride(PULocationID=107, DOLocationID=140, trip_distance=3.69, total_amount=26.25, tpep_pickup_datetime=1761958367000)\n", "Sent: Ride(PULocationID=4, DOLocationID=144, trip_distance=0.95, total_amount=19.25, tpep_pickup_datetime=1761955694000)\n", "Sent: Ride(PULocationID=144, DOLocationID=79, trip_distance=0.82, total_amount=19.74, tpep_pickup_datetime=1761956737000)\n", "Sent: Ride(PULocationID=234, DOLocationID=49, trip_distance=6.77, total_amount=54.69, tpep_pickup_datetime=1761957574000)\n", "Sent: Ride(PULocationID=239, DOLocationID=41, trip_distance=2.43, total_amount=22.2, tpep_pickup_datetime=1761957454000)\n", "Sent: Ride(PULocationID=138, DOLocationID=148, trip_distance=9.41, total_amount=64.7, tpep_pickup_datetime=1761958000000)\n", "Sent: Ride(PULocationID=100, DOLocationID=211, trip_distance=3.51, total_amount=40.74, tpep_pickup_datetime=1761955201000)\n", "Sent: Ride(PULocationID=211, DOLocationID=158, trip_distance=1.21, total_amount=22.29, tpep_pickup_datetime=1761957218000)\n", "Sent: Ride(PULocationID=4, DOLocationID=239, trip_distance=5.75, total_amount=40.43, tpep_pickup_datetime=1761955800000)\n", "Sent: Ride(PULocationID=161, DOLocationID=262, trip_distance=2.3, total_amount=23.2, tpep_pickup_datetime=1761955464000)\n", "Sent: Ride(PULocationID=263, DOLocationID=79, trip_distance=4.2, total_amount=53.35, tpep_pickup_datetime=1761956651000)\n", "Sent: Ride(PULocationID=137, DOLocationID=233, trip_distance=1.18, total_amount=18.06, tpep_pickup_datetime=1761956725000)\n", "Sent: Ride(PULocationID=137, DOLocationID=79, trip_distance=1.34, total_amount=32.34, tpep_pickup_datetime=1761957567000)\n", "Sent: Ride(PULocationID=238, DOLocationID=263, trip_distance=1.73, total_amount=17.4, tpep_pickup_datetime=1761957998000)\n", "Sent: Ride(PULocationID=234, DOLocationID=163, 
trip_distance=2.2, total_amount=23.86, tpep_pickup_datetime=1761955626000)\n", "Sent: Ride(PULocationID=163, DOLocationID=141, trip_distance=1.11, total_amount=18.9, tpep_pickup_datetime=1761956732000)\n", "Sent: Ride(PULocationID=100, DOLocationID=68, trip_distance=0.4, total_amount=13.02, tpep_pickup_datetime=1761955846000)\n", "Sent: Ride(PULocationID=68, DOLocationID=144, trip_distance=4.01, total_amount=48.15, tpep_pickup_datetime=1761956664000)\n", "Sent: Ride(PULocationID=238, DOLocationID=107, trip_distance=5.4, total_amount=50.45, tpep_pickup_datetime=1761956704000)\n", "Sent: Ride(PULocationID=186, DOLocationID=33, trip_distance=4.16, total_amount=44.1, tpep_pickup_datetime=1761957441000)\n", "Sent: Ride(PULocationID=48, DOLocationID=196, trip_distance=8.15, total_amount=49.99, tpep_pickup_datetime=1761958446000)\n", "Sent: Ride(PULocationID=13, DOLocationID=107, trip_distance=4.97, total_amount=37.38, tpep_pickup_datetime=1761955283000)\n", "Sent: Ride(PULocationID=107, DOLocationID=140, trip_distance=2.6, total_amount=22.75, tpep_pickup_datetime=1761956605000)\n", "Sent: Ride(PULocationID=141, DOLocationID=143, trip_distance=1.47, total_amount=18.84, tpep_pickup_datetime=1761957687000)\n", "Sent: Ride(PULocationID=170, DOLocationID=79, trip_distance=1.59, total_amount=24.75, tpep_pickup_datetime=1761955644000)\n", "Sent: Ride(PULocationID=4, DOLocationID=113, trip_distance=0.99, total_amount=19.74, tpep_pickup_datetime=1761956917000)\n", "Sent: Ride(PULocationID=114, DOLocationID=4, trip_distance=0.88, total_amount=17.45, tpep_pickup_datetime=1761957745000)\n", "Sent: Ride(PULocationID=161, DOLocationID=263, trip_distance=3.05, total_amount=34.86, tpep_pickup_datetime=1761956135000)\n", "Sent: Ride(PULocationID=263, DOLocationID=237, trip_distance=0.79, total_amount=14.2, tpep_pickup_datetime=1761957780000)\n", "Sent: Ride(PULocationID=263, DOLocationID=90, trip_distance=3.1, total_amount=28.98, tpep_pickup_datetime=1761958374000)\n", "Sent: 
Ride(PULocationID=161, DOLocationID=262, trip_distance=2.46, total_amount=24.78, tpep_pickup_datetime=1761957461000)\n", "Sent: Ride(PULocationID=68, DOLocationID=79, trip_distance=6.25, total_amount=77.44, tpep_pickup_datetime=1761956039000)\n", "Sent: Ride(PULocationID=68, DOLocationID=137, trip_distance=1.4, total_amount=27.3, tpep_pickup_datetime=1761956403000)\n", "Sent: Ride(PULocationID=137, DOLocationID=170, trip_distance=0.8, total_amount=19.7, tpep_pickup_datetime=1761957702000)\n", "Sent: Ride(PULocationID=162, DOLocationID=229, trip_distance=0.8, total_amount=11.55, tpep_pickup_datetime=1761958542000)\n", "Sent: Ride(PULocationID=68, DOLocationID=42, trip_distance=6.31, total_amount=50.82, tpep_pickup_datetime=1761955373000)\n", "Sent: Ride(PULocationID=41, DOLocationID=263, trip_distance=1.92, total_amount=19.68, tpep_pickup_datetime=1761958166000)\n", "Sent: Ride(PULocationID=140, DOLocationID=209, trip_distance=6.1, total_amount=39.35, tpep_pickup_datetime=1761957889000)\n", "Sent: Ride(PULocationID=161, DOLocationID=164, trip_distance=0.97, total_amount=16.45, tpep_pickup_datetime=1761955699000)\n", "Sent: Ride(PULocationID=164, DOLocationID=231, trip_distance=2.31, total_amount=37.19, tpep_pickup_datetime=1761956454000)\n", "Sent: Ride(PULocationID=237, DOLocationID=236, trip_distance=1.53, total_amount=18.84, tpep_pickup_datetime=1761955789000)\n", "Sent: Ride(PULocationID=148, DOLocationID=148, trip_distance=0.0, total_amount=8.75, tpep_pickup_datetime=1761957945000)\n", "Sent: Ride(PULocationID=148, DOLocationID=68, trip_distance=2.8, total_amount=28.15, tpep_pickup_datetime=1761957966000)\n", "Sent: Ride(PULocationID=231, DOLocationID=87, trip_distance=0.7, total_amount=12.55, tpep_pickup_datetime=1761958490000)\n", "Sent: Ride(PULocationID=66, DOLocationID=37, trip_distance=3.77, total_amount=31.0, tpep_pickup_datetime=1761958073000)\n", "Sent: Ride(PULocationID=24, DOLocationID=262, trip_distance=2.63, total_amount=24.72, 
tpep_pickup_datetime=1761958756000)\n", "Sent: Ride(PULocationID=144, DOLocationID=186, trip_distance=3.0, total_amount=42.45, tpep_pickup_datetime=1761956480000)\n", "Sent: Ride(PULocationID=249, DOLocationID=231, trip_distance=1.12, total_amount=14.89, tpep_pickup_datetime=1761955444000)\n", "Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.18, total_amount=29.75, tpep_pickup_datetime=1761955863000)\n", "Sent: Ride(PULocationID=148, DOLocationID=234, trip_distance=1.1, total_amount=20.25, tpep_pickup_datetime=1761957725000)\n", "Sent: Ride(PULocationID=234, DOLocationID=246, trip_distance=1.9, total_amount=26.7, tpep_pickup_datetime=1761956765000)\n", "Sent: Ride(PULocationID=163, DOLocationID=142, trip_distance=1.04, total_amount=17.31, tpep_pickup_datetime=1761957754000)\n", "Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.75, total_amount=14.25, tpep_pickup_datetime=1761955857000)\n", "Sent: Ride(PULocationID=107, DOLocationID=163, trip_distance=2.13, total_amount=19.95, tpep_pickup_datetime=1761956490000)\n", "Sent: Ride(PULocationID=142, DOLocationID=142, trip_distance=0.4, total_amount=13.02, tpep_pickup_datetime=1761957827000)\n", "Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.38, total_amount=12.95, tpep_pickup_datetime=1761957395000)\n", "Sent: Ride(PULocationID=113, DOLocationID=151, trip_distance=7.92, total_amount=69.3, tpep_pickup_datetime=1761957823000)\n", "Sent: Ride(PULocationID=234, DOLocationID=163, trip_distance=1.44, total_amount=19.25, tpep_pickup_datetime=1761956112000)\n", "Sent: Ride(PULocationID=230, DOLocationID=256, trip_distance=5.77, total_amount=51.45, tpep_pickup_datetime=1761955515000)\n", "Sent: Ride(PULocationID=263, DOLocationID=236, trip_distance=0.33, total_amount=12.12, tpep_pickup_datetime=1761957509000)\n", "Sent: Ride(PULocationID=148, DOLocationID=4, trip_distance=1.05, total_amount=-19.25, tpep_pickup_datetime=1761957971000)\n", "Sent: Ride(PULocationID=148, DOLocationID=4, 
trip_distance=1.05, total_amount=19.25, tpep_pickup_datetime=1761957971000)\n", "Sent: Ride(PULocationID=88, DOLocationID=33, trip_distance=3.68, total_amount=28.14, tpep_pickup_datetime=1761956756000)\n", "Sent: Ride(PULocationID=33, DOLocationID=25, trip_distance=0.98, total_amount=13.0, tpep_pickup_datetime=1761957514000)\n", "Sent: Ride(PULocationID=148, DOLocationID=211, trip_distance=0.85, total_amount=18.65, tpep_pickup_datetime=1761955144000)\n", "Sent: Ride(PULocationID=211, DOLocationID=97, trip_distance=3.25, total_amount=31.15, tpep_pickup_datetime=1761956474000)\n", "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.26, total_amount=18.06, tpep_pickup_datetime=1761958254000)\n", "Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=1.43, total_amount=15.3, tpep_pickup_datetime=1761955827000)\n", "Sent: Ride(PULocationID=50, DOLocationID=90, trip_distance=2.31, total_amount=27.3, tpep_pickup_datetime=1761956610000)\n", "Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=3.37, total_amount=31.5, tpep_pickup_datetime=1761957803000)\n", "Sent: Ride(PULocationID=148, DOLocationID=113, trip_distance=1.03, total_amount=20.58, tpep_pickup_datetime=1761955629000)\n", "Sent: Ride(PULocationID=137, DOLocationID=170, trip_distance=0.11, total_amount=9.45, tpep_pickup_datetime=1761957550000)\n", "Sent: Ride(PULocationID=79, DOLocationID=68, trip_distance=1.7, total_amount=21.33, tpep_pickup_datetime=1761957416000)\n", "Sent: Ride(PULocationID=68, DOLocationID=231, trip_distance=2.27, total_amount=33.45, tpep_pickup_datetime=1761958648000)\n", "Sent: Ride(PULocationID=90, DOLocationID=233, trip_distance=1.76, total_amount=27.33, tpep_pickup_datetime=1761956324000)\n", "Sent: Ride(PULocationID=231, DOLocationID=186, trip_distance=2.52, total_amount=35.7, tpep_pickup_datetime=1761955460000)\n", "Sent: Ride(PULocationID=107, DOLocationID=236, trip_distance=4.01, total_amount=28.95, tpep_pickup_datetime=1761955447000)\n", "Sent: 
Ride(PULocationID=231, DOLocationID=127, trip_distance=15.53, total_amount=81.65, tpep_pickup_datetime=1761955352000)\n", "Sent: Ride(PULocationID=161, DOLocationID=246, trip_distance=2.33, total_amount=28.98, tpep_pickup_datetime=1761955667000)\n", "Sent: Ride(PULocationID=161, DOLocationID=100, trip_distance=0.23, total_amount=13.1, tpep_pickup_datetime=1761955396000)\n", "Sent: Ride(PULocationID=48, DOLocationID=151, trip_distance=2.59, total_amount=22.65, tpep_pickup_datetime=1761956438000)\n", "Sent: Ride(PULocationID=143, DOLocationID=262, trip_distance=2.21, total_amount=25.4, tpep_pickup_datetime=1761958058000)\n", "Sent: Ride(PULocationID=234, DOLocationID=234, trip_distance=3.02, total_amount=52.32, tpep_pickup_datetime=1761955791000)\n", "Sent: Ride(PULocationID=234, DOLocationID=4, trip_distance=1.64, total_amount=34.86, tpep_pickup_datetime=1761958476000)\n", "Sent: Ride(PULocationID=263, DOLocationID=163, trip_distance=2.58, total_amount=25.62, tpep_pickup_datetime=1761956710000)\n", "Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=0.79, total_amount=20.25, tpep_pickup_datetime=1761956022000)\n", "Sent: Ride(PULocationID=79, DOLocationID=114, trip_distance=0.61, total_amount=17.45, tpep_pickup_datetime=1761957040000)\n", "Sent: Ride(PULocationID=79, DOLocationID=249, trip_distance=2.04, total_amount=32.34, tpep_pickup_datetime=1761957906000)\n", "Sent: Ride(PULocationID=234, DOLocationID=260, trip_distance=6.9, total_amount=50.09, tpep_pickup_datetime=1761956327000)\n", "Sent: Ride(PULocationID=230, DOLocationID=48, trip_distance=0.94, total_amount=15.54, tpep_pickup_datetime=1761957874000)\n", "Sent: Ride(PULocationID=161, DOLocationID=143, trip_distance=1.88, total_amount=17.85, tpep_pickup_datetime=1761953770000)\n", "Sent: Ride(PULocationID=143, DOLocationID=236, trip_distance=2.75, total_amount=26.4, tpep_pickup_datetime=1761954614000)\n", "Sent: Ride(PULocationID=140, DOLocationID=48, trip_distance=2.65, total_amount=34.02, 
tpep_pickup_datetime=1761956414000)\n", "Sent: Ride(PULocationID=170, DOLocationID=229, trip_distance=1.5, total_amount=22.25, tpep_pickup_datetime=1761955922000)\n", "Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.2, total_amount=15.69, tpep_pickup_datetime=1761958070000)\n", "Sent: Ride(PULocationID=234, DOLocationID=152, trip_distance=8.0, total_amount=58.38, tpep_pickup_datetime=1761956935000)\n", "Sent: Ride(PULocationID=166, DOLocationID=116, trip_distance=1.21, total_amount=11.96, tpep_pickup_datetime=1761956038000)\n", "Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.75, total_amount=18.8, tpep_pickup_datetime=1761957587000)\n", "Sent: Ride(PULocationID=48, DOLocationID=90, trip_distance=1.78, total_amount=26.46, tpep_pickup_datetime=1761955155000)\n", "Sent: Ride(PULocationID=237, DOLocationID=107, trip_distance=2.28, total_amount=28.44, tpep_pickup_datetime=1761956921000)\n", "Sent: Ride(PULocationID=107, DOLocationID=162, trip_distance=1.15, total_amount=16.38, tpep_pickup_datetime=1761958446000)\n", "Sent: Ride(PULocationID=236, DOLocationID=75, trip_distance=1.63, total_amount=18.0, tpep_pickup_datetime=1761955216000)\n", "Sent: Ride(PULocationID=68, DOLocationID=186, trip_distance=1.14, total_amount=17.31, tpep_pickup_datetime=1761956032000)\n", "Sent: Ride(PULocationID=234, DOLocationID=90, trip_distance=0.76, total_amount=26.25, tpep_pickup_datetime=1761958542000)\n", "Sent: Ride(PULocationID=151, DOLocationID=116, trip_distance=2.17, total_amount=20.04, tpep_pickup_datetime=1761957776000)\n", "Sent: Ride(PULocationID=239, DOLocationID=79, trip_distance=4.19, total_amount=47.35, tpep_pickup_datetime=1761957071000)\n", "Sent: Ride(PULocationID=138, DOLocationID=164, trip_distance=11.3, total_amount=69.85, tpep_pickup_datetime=1761958772000)\n", "Sent: Ride(PULocationID=164, DOLocationID=158, trip_distance=1.9, total_amount=26.85, tpep_pickup_datetime=1761958643000)\n", "Sent: Ride(PULocationID=141, DOLocationID=263, 
trip_distance=1.1, total_amount=15.45, tpep_pickup_datetime=1761956554000)\n", "Sent: Ride(PULocationID=141, DOLocationID=234, trip_distance=3.1, total_amount=33.15, tpep_pickup_datetime=1761957289000)\n", "Sent: Ride(PULocationID=138, DOLocationID=134, trip_distance=6.05, total_amount=41.65, tpep_pickup_datetime=1761956033000)\n", "Sent: Ride(PULocationID=138, DOLocationID=239, trip_distance=11.5, total_amount=82.86, tpep_pickup_datetime=1761958551000)\n", "Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.72, total_amount=26.25, tpep_pickup_datetime=1761956681000)\n", "Sent: Ride(PULocationID=87, DOLocationID=61, trip_distance=6.25, total_amount=41.45, tpep_pickup_datetime=1761957608000)\n", "Sent: Ride(PULocationID=4, DOLocationID=80, trip_distance=3.17, total_amount=25.55, tpep_pickup_datetime=1761958324000)\n", "Sent: Ride(PULocationID=263, DOLocationID=137, trip_distance=3.33, total_amount=31.45, tpep_pickup_datetime=1761955496000)\n", "Sent: Ride(PULocationID=137, DOLocationID=13, trip_distance=6.92, total_amount=64.26, tpep_pickup_datetime=1761957268000)\n", "Sent: Ride(PULocationID=249, DOLocationID=170, trip_distance=2.41, total_amount=41.58, tpep_pickup_datetime=1761955218000)\n", "Sent: Ride(PULocationID=107, DOLocationID=113, trip_distance=0.62, total_amount=23.94, tpep_pickup_datetime=1761955586000)\n", "Sent: Ride(PULocationID=79, DOLocationID=263, trip_distance=4.62, total_amount=49.14, tpep_pickup_datetime=1761956998000)\n", "Sent: Ride(PULocationID=79, DOLocationID=246, trip_distance=2.12, total_amount=25.45, tpep_pickup_datetime=1761956111000)\n", "Sent: Ride(PULocationID=246, DOLocationID=50, trip_distance=0.39, total_amount=22.14, tpep_pickup_datetime=1761957432000)\n", "Sent: Ride(PULocationID=48, DOLocationID=142, trip_distance=0.9, total_amount=14.7, tpep_pickup_datetime=1761958636000)\n", "Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=0.65, total_amount=13.8, tpep_pickup_datetime=1761957376000)\n", "Sent: 
Ride(PULocationID=141, DOLocationID=263, trip_distance=0.67, total_amount=11.5, tpep_pickup_datetime=1761958327000)\n", "Sent: Ride(PULocationID=48, DOLocationID=68, trip_distance=0.4, total_amount=13.02, tpep_pickup_datetime=1761956732000)\n", "Sent: Ride(PULocationID=186, DOLocationID=41, trip_distance=4.87, total_amount=-36.05, tpep_pickup_datetime=1761957722000)\n", "Sent: Ride(PULocationID=186, DOLocationID=74, trip_distance=4.87, total_amount=36.05, tpep_pickup_datetime=1761957722000)\n", "Sent: Ride(PULocationID=144, DOLocationID=239, trip_distance=5.79, total_amount=53.34, tpep_pickup_datetime=1761957109000)\n", "Sent: Ride(PULocationID=137, DOLocationID=79, trip_distance=0.99, total_amount=19.95, tpep_pickup_datetime=1761956287000)\n", "Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.73, total_amount=23.35, tpep_pickup_datetime=1761957380000)\n", "Sent: Ride(PULocationID=152, DOLocationID=151, trip_distance=1.39, total_amount=14.16, tpep_pickup_datetime=1761955467000)\n", "Sent: Ride(PULocationID=239, DOLocationID=107, trip_distance=3.63, total_amount=32.34, tpep_pickup_datetime=1761956364000)\n", "Sent: Ride(PULocationID=231, DOLocationID=263, trip_distance=6.34, total_amount=52.5, tpep_pickup_datetime=1761955711000)\n", "Sent: Ride(PULocationID=87, DOLocationID=74, trip_distance=8.5, total_amount=54.15, tpep_pickup_datetime=1761955795000)\n", "Sent: Ride(PULocationID=74, DOLocationID=42, trip_distance=2.2, total_amount=17.5, tpep_pickup_datetime=1761957673000)\n", "Sent: Ride(PULocationID=80, DOLocationID=34, trip_distance=4.3, total_amount=26.75, tpep_pickup_datetime=1761957260000)\n", "Sent: Ride(PULocationID=144, DOLocationID=112, trip_distance=4.4, total_amount=44.95, tpep_pickup_datetime=1761956555000)\n", "Sent: Ride(PULocationID=144, DOLocationID=68, trip_distance=2.0, total_amount=47.45, tpep_pickup_datetime=1761955568000)\n", "Sent: Ride(PULocationID=68, DOLocationID=114, trip_distance=1.7, total_amount=27.25, 
tpep_pickup_datetime=1761958278000)\n", "Sent: Ride(PULocationID=162, DOLocationID=179, trip_distance=3.86, total_amount=33.69, tpep_pickup_datetime=1761956794000)\n", "Sent: Ride(PULocationID=264, DOLocationID=229, trip_distance=1.38, total_amount=18.9, tpep_pickup_datetime=1761958759000)\n", "Sent: Ride(PULocationID=79, DOLocationID=233, trip_distance=2.18, total_amount=31.94, tpep_pickup_datetime=1761956161000)\n", "Sent: Ride(PULocationID=233, DOLocationID=262, trip_distance=1.85, total_amount=19.74, tpep_pickup_datetime=1761957605000)\n", "Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=1.47, total_amount=21.25, tpep_pickup_datetime=1761956512000)\n", "Sent: Ride(PULocationID=233, DOLocationID=140, trip_distance=2.02, total_amount=20.58, tpep_pickup_datetime=1761957479000)\n", "Sent: Ride(PULocationID=68, DOLocationID=164, trip_distance=1.9, total_amount=24.63, tpep_pickup_datetime=1761958326000)\n", "Sent: Ride(PULocationID=50, DOLocationID=252, trip_distance=15.65, total_amount=106.85, tpep_pickup_datetime=1761956198000)\n", "Sent: Ride(PULocationID=170, DOLocationID=141, trip_distance=2.03, total_amount=18.9, tpep_pickup_datetime=1761958441000)\n", "Sent: Ride(PULocationID=68, DOLocationID=243, trip_distance=7.66, total_amount=65.94, tpep_pickup_datetime=1761956564000)\n", "Sent: Ride(PULocationID=142, DOLocationID=262, trip_distance=2.73, total_amount=19.9, tpep_pickup_datetime=1761955822000)\n", "Sent: Ride(PULocationID=262, DOLocationID=162, trip_distance=2.05, total_amount=-16.45, tpep_pickup_datetime=1761956792000)\n", "Sent: Ride(PULocationID=262, DOLocationID=162, trip_distance=2.05, total_amount=16.45, tpep_pickup_datetime=1761956792000)\n", "Sent: Ride(PULocationID=229, DOLocationID=229, trip_distance=0.41, total_amount=12.48, tpep_pickup_datetime=1761957355000)\n", "Sent: Ride(PULocationID=68, DOLocationID=233, trip_distance=2.05, total_amount=24.35, tpep_pickup_datetime=1761956966000)\n", "Sent: Ride(PULocationID=137, 
DOLocationID=232, trip_distance=2.58, total_amount=34.86, tpep_pickup_datetime=1761955867000)\n", "Sent: Ride(PULocationID=232, DOLocationID=238, trip_distance=7.27, total_amount=45.21, tpep_pickup_datetime=1761957725000)\n", "Sent: Ride(PULocationID=132, DOLocationID=10, trip_distance=3.38, total_amount=24.31, tpep_pickup_datetime=1761957541000)\n", "Sent: Ride(PULocationID=107, DOLocationID=234, trip_distance=0.5, total_amount=20.58, tpep_pickup_datetime=1761955758000)\n", "Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=4.53, total_amount=48.3, tpep_pickup_datetime=1761957112000)\n", "Sent: Ride(PULocationID=48, DOLocationID=142, trip_distance=1.09, total_amount=17.22, tpep_pickup_datetime=1761955648000)\n", "Sent: Ride(PULocationID=239, DOLocationID=100, trip_distance=2.5, total_amount=27.3, tpep_pickup_datetime=1761956205000)\n", "Sent: Ride(PULocationID=100, DOLocationID=68, trip_distance=1.06, total_amount=20.58, tpep_pickup_datetime=1761957629000)\n", "Sent: Ride(PULocationID=125, DOLocationID=151, trip_distance=6.01, total_amount=38.85, tpep_pickup_datetime=1761956924000)\n", "Sent: Ride(PULocationID=107, DOLocationID=107, trip_distance=0.77, total_amount=16.38, tpep_pickup_datetime=1761955684000)\n", "Sent: Ride(PULocationID=113, DOLocationID=164, trip_distance=1.0, total_amount=18.8, tpep_pickup_datetime=1761955373000)\n", "Sent: Ride(PULocationID=164, DOLocationID=137, trip_distance=1.0, total_amount=23.2, tpep_pickup_datetime=1761956086000)\n", "Sent: Ride(PULocationID=233, DOLocationID=141, trip_distance=1.2, total_amount=12.95, tpep_pickup_datetime=1761956985000)\n", "Sent: Ride(PULocationID=163, DOLocationID=141, trip_distance=1.8, total_amount=19.75, tpep_pickup_datetime=1761958331000)\n", "Sent: Ride(PULocationID=161, DOLocationID=230, trip_distance=0.7, total_amount=16.05, tpep_pickup_datetime=1761956008000)\n", "Sent: Ride(PULocationID=161, DOLocationID=263, trip_distance=1.8, total_amount=21.4, 
tpep_pickup_datetime=1761956913000)\n", "Sent: Ride(PULocationID=263, DOLocationID=142, trip_distance=2.2, total_amount=19.8, tpep_pickup_datetime=1761957773000)\n", "Sent: Ride(PULocationID=163, DOLocationID=263, trip_distance=2.68, total_amount=24.78, tpep_pickup_datetime=1761955217000)\n", "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.56, total_amount=11.62, tpep_pickup_datetime=1761955998000)\n", "Sent: Ride(PULocationID=263, DOLocationID=83, trip_distance=5.16, total_amount=31.85, tpep_pickup_datetime=1761957529000)\n", "Sent: Ride(PULocationID=113, DOLocationID=246, trip_distance=2.6, total_amount=39.9, tpep_pickup_datetime=1761957214000)\n", "Sent: Ride(PULocationID=211, DOLocationID=114, trip_distance=0.8, total_amount=15.65, tpep_pickup_datetime=1761956007000)\n", "Sent: Ride(PULocationID=144, DOLocationID=158, trip_distance=1.9, total_amount=24.78, tpep_pickup_datetime=1761956443000)\n", "Sent: Ride(PULocationID=158, DOLocationID=87, trip_distance=3.82, total_amount=28.45, tpep_pickup_datetime=1761957437000)\n", "Sent: Ride(PULocationID=87, DOLocationID=107, trip_distance=4.34, total_amount=40.69, tpep_pickup_datetime=1761958760000)\n", "Sent: Ride(PULocationID=114, DOLocationID=137, trip_distance=2.0, total_amount=25.6, tpep_pickup_datetime=1761957565000)\n", "Sent: Ride(PULocationID=265, DOLocationID=265, trip_distance=0.0, total_amount=96.0, tpep_pickup_datetime=1761958241000)\n", "Sent: Ride(PULocationID=230, DOLocationID=144, trip_distance=3.5, total_amount=32.55, tpep_pickup_datetime=1761956658000)\n", "Sent: Ride(PULocationID=148, DOLocationID=160, trip_distance=7.45, total_amount=40.45, tpep_pickup_datetime=1761958726000)\n", "Sent: Ride(PULocationID=48, DOLocationID=256, trip_distance=6.03, total_amount=68.46, tpep_pickup_datetime=1761955214000)\n", "Sent: Ride(PULocationID=24, DOLocationID=152, trip_distance=1.3, total_amount=10.4, tpep_pickup_datetime=1761956874000)\n", "Sent: Ride(PULocationID=236, DOLocationID=236, 
trip_distance=0.43, total_amount=11.75, tpep_pickup_datetime=1761956620000)\n", "Sent: Ride(PULocationID=236, DOLocationID=263, trip_distance=1.15, total_amount=16.32, tpep_pickup_datetime=1761956897000)\n", "Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.99, total_amount=19.1, tpep_pickup_datetime=1761957478000)\n", "Sent: Ride(PULocationID=237, DOLocationID=24, trip_distance=2.3, total_amount=19.7, tpep_pickup_datetime=1761955265000)\n", "Sent: Ride(PULocationID=238, DOLocationID=239, trip_distance=0.8, total_amount=13.8, tpep_pickup_datetime=1761956041000)\n", "Sent: Ride(PULocationID=142, DOLocationID=262, trip_distance=2.1, total_amount=20.5, tpep_pickup_datetime=1761956804000)\n", "Sent: Ride(PULocationID=263, DOLocationID=226, trip_distance=4.0, total_amount=23.45, tpep_pickup_datetime=1761957750000)\n", "Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=1.53, total_amount=17.95, tpep_pickup_datetime=1761958179000)\n", "Sent: Ride(PULocationID=144, DOLocationID=231, trip_distance=1.2, total_amount=18.9, tpep_pickup_datetime=1761955678000)\n", "Sent: Ride(PULocationID=113, DOLocationID=229, trip_distance=2.66, total_amount=30.66, tpep_pickup_datetime=1761958375000)\n", "Sent: Ride(PULocationID=113, DOLocationID=107, trip_distance=0.75, total_amount=21.42, tpep_pickup_datetime=1761955353000)\n", "Sent: Ride(PULocationID=107, DOLocationID=88, trip_distance=4.28, total_amount=36.05, tpep_pickup_datetime=1761957278000)\n", "Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.67, total_amount=13.5, tpep_pickup_datetime=1761956709000)\n", "Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=1.5, total_amount=23.1, tpep_pickup_datetime=1761955608000)\n", "Sent: Ride(PULocationID=41, DOLocationID=238, trip_distance=1.73, total_amount=19.68, tpep_pickup_datetime=1761955822000)\n", "Sent: Ride(PULocationID=163, DOLocationID=140, trip_distance=1.44, total_amount=18.11, tpep_pickup_datetime=1761958756000)\n", "Sent: 
Ride(PULocationID=107, DOLocationID=137, trip_distance=0.49, total_amount=12.55, tpep_pickup_datetime=1761957049000)\n", "Sent: Ride(PULocationID=137, DOLocationID=68, trip_distance=1.48, total_amount=23.05, tpep_pickup_datetime=1761957555000)\n", "Sent: Ride(PULocationID=166, DOLocationID=41, trip_distance=1.1, total_amount=11.1, tpep_pickup_datetime=1761955442000)\n", "Sent: Ride(PULocationID=239, DOLocationID=140, trip_distance=1.8, total_amount=18.85, tpep_pickup_datetime=1761956742000)\n", "Sent: Ride(PULocationID=140, DOLocationID=237, trip_distance=0.7, total_amount=14.31, tpep_pickup_datetime=1761957607000)\n", "Sent: Ride(PULocationID=237, DOLocationID=107, trip_distance=2.8, total_amount=30.65, tpep_pickup_datetime=1761958306000)\n", "Sent: Ride(PULocationID=100, DOLocationID=239, trip_distance=2.94, total_amount=38.22, tpep_pickup_datetime=1761958211000)\n", "Sent: Ride(PULocationID=48, DOLocationID=42, trip_distance=4.7, total_amount=34.85, tpep_pickup_datetime=1761955550000)\n", "Sent: Ride(PULocationID=263, DOLocationID=236, trip_distance=0.9, total_amount=13.8, tpep_pickup_datetime=1761957769000)\n", "Sent: Ride(PULocationID=238, DOLocationID=114, trip_distance=6.12, total_amount=56.35, tpep_pickup_datetime=1761955775000)\n", "Sent: Ride(PULocationID=138, DOLocationID=232, trip_distance=9.37, total_amount=55.4, tpep_pickup_datetime=1761955488000)\n", "Sent: Ride(PULocationID=232, DOLocationID=137, trip_distance=1.93, total_amount=22.05, tpep_pickup_datetime=1761957174000)\n", "Sent: Ride(PULocationID=137, DOLocationID=263, trip_distance=2.72, total_amount=20.65, tpep_pickup_datetime=1761958356000)\n", "Sent: Ride(PULocationID=246, DOLocationID=263, trip_distance=4.6, total_amount=39.9, tpep_pickup_datetime=1761956440000)\n", "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.6, total_amount=11.92, tpep_pickup_datetime=1761958168000)\n", "Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=1.04, total_amount=14.64, 
tpep_pickup_datetime=1761955790000)\n", "Sent: Ride(PULocationID=237, DOLocationID=75, trip_distance=1.93, total_amount=18.7, tpep_pickup_datetime=1761956503000)\n", "Sent: Ride(PULocationID=233, DOLocationID=107, trip_distance=0.76, total_amount=21.42, tpep_pickup_datetime=1761956475000)\n", "Sent: Ride(PULocationID=262, DOLocationID=48, trip_distance=3.12, total_amount=28.14, tpep_pickup_datetime=1761955204000)\n", "Sent: Ride(PULocationID=48, DOLocationID=68, trip_distance=0.45, total_amount=13.02, tpep_pickup_datetime=1761956902000)\n", "Sent: Ride(PULocationID=246, DOLocationID=186, trip_distance=0.7, total_amount=13.95, tpep_pickup_datetime=1761957726000)\n", "Sent: Ride(PULocationID=100, DOLocationID=164, trip_distance=0.15, total_amount=9.45, tpep_pickup_datetime=1761958138000)\n", "Sent: Ride(PULocationID=234, DOLocationID=113, trip_distance=0.43, total_amount=18.06, tpep_pickup_datetime=1761958439000)\n", "Sent: Ride(PULocationID=239, DOLocationID=116, trip_distance=3.3, total_amount=28.35, tpep_pickup_datetime=1761955931000)\n", "Sent: Ride(PULocationID=107, DOLocationID=79, trip_distance=0.9, total_amount=18.9, tpep_pickup_datetime=1761958296000)\n", "took 11.49 seconds\n" ] } ], "source": [ "import time\n", "\n", "t0 = time.time()\n", "\n", "for _, row in df.iterrows():\n", " ride = ride_from_row(row)\n", " producer.send(topic_name, value=ride)\n", " print(f\"Sent: {ride}\")\n", " time.sleep(0.01)\n", "\n", "producer.flush()\n", "\n", "t1 = time.time()\n", "print(f'took {(t1 - t0):.2f} seconds')" ] }, { "cell_type": "code", "execution_count": null, "id": "a1ca66fe", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "streaming-workshop", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" } }, 
"nbformat": 4, "nbformat_minor": 5 }

================================================
FILE: 07-streaming/workshop/live/pyproject.flink.toml
================================================
[project]
name = "pyflink-workshop"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "apache-flink==2.2.0",
]

================================================
FILE: 07-streaming/workshop/live/pyproject.toml
================================================
[project]
name = "streaming-workshop"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "kafka-python>=2.3.0",
    "pandas>=3.0.1",
    "psycopg2-binary>=2.9.11",
    "pyarrow>=23.0.1",
]

[dependency-groups]
dev = [
    "jupyter>=1.1.1",
]

================================================
FILE: 07-streaming/workshop/live/src/job/aggregation_job.py
================================================
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment


def create_events_source_kafka(t_env):
    table_name = "events"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            PULocationID INTEGER,
            DOLocationID INTEGER,
            trip_distance DOUBLE,
            total_amount DOUBLE,
            tpep_pickup_datetime BIGINT,
            event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),
            WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = 'redpanda:29092',
            'topic' = 'rides',
            'scan.startup.mode' = 'earliest-offset',
            'properties.auto.offset.reset' = 'earliest',
            'format' = 'json'
        );
        """
    t_env.execute_sql(source_ddl)
    return table_name


def create_events_aggregated_sink(t_env):
    table_name = 'processed_events_aggregated'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            window_start TIMESTAMP(3),
            PULocationID INT,
            num_trips BIGINT,
            total_revenue DOUBLE,
            PRIMARY KEY (window_start, PULocationID) NOT ENFORCED
        ) WITH (
            'connector' = 'jdbc',
            'url' =
'jdbc:postgresql://postgres:5432/postgres', 'table-name' = '{table_name}', 'username' = 'postgres', 'password' = 'postgres', 'driver' = 'org.postgresql.Driver' ); """ t_env.execute_sql(sink_ddl) return table_name def log_aggregation(): env = StreamExecutionEnvironment.get_execution_environment() env.enable_checkpointing(10 * 1000) env.set_parallelism(3) settings = EnvironmentSettings.new_instance().in_streaming_mode().build() t_env = StreamTableEnvironment.create(env, environment_settings=settings) try: source_table = create_events_source_kafka(t_env) aggregated_table = create_events_aggregated_sink(t_env) t_env.execute_sql(f""" INSERT INTO {aggregated_table} SELECT window_start, PULocationID, COUNT(*) AS num_trips, SUM(total_amount) AS total_revenue FROM TABLE( TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '1' HOUR) ) GROUP BY window_start, PULocationID; """).wait() except Exception as e: print("Writing records from Kafka to JDBC failed:", str(e)) if __name__ == '__main__': log_aggregation() ================================================ FILE: 07-streaming/workshop/live/src/job/pass_through_job.py ================================================ from pyflink.datastream import StreamExecutionEnvironment from pyflink.table import EnvironmentSettings, StreamTableEnvironment def create_events_source_kafka(t_env): table_name = "events" source_ddl = f""" CREATE TABLE {table_name} ( PULocationID INTEGER, DOLocationID INTEGER, trip_distance DOUBLE, total_amount DOUBLE, tpep_pickup_datetime BIGINT ) WITH ( 'connector' = 'kafka', 'properties.bootstrap.servers' = 'redpanda:29092', 'topic' = 'rides', 'scan.startup.mode' = 'latest-offset', 'properties.auto.offset.reset' = 'latest', 'format' = 'json' ); """ t_env.execute_sql(source_ddl) return table_name def create_processed_events_sink_postgres(t_env): table_name = 'processed_events' sink_ddl = f""" CREATE TABLE {table_name} ( PULocationID INTEGER, DOLocationID INTEGER, trip_distance DOUBLE, total_amount 
DOUBLE, pickup_datetime TIMESTAMP ) WITH ( 'connector' = 'jdbc', 'url' = 'jdbc:postgresql://postgres:5432/postgres', 'table-name' = '{table_name}', 'username' = 'postgres', 'password' = 'postgres', 'driver' = 'org.postgresql.Driver' ); """ t_env.execute_sql(sink_ddl) return table_name def log_processing(): env = StreamExecutionEnvironment.get_execution_environment() env.enable_checkpointing(10 * 1000) # checkpoint every 10 seconds settings = EnvironmentSettings.new_instance().in_streaming_mode().build() t_env = StreamTableEnvironment.create(env, environment_settings=settings) source_table = create_events_source_kafka(t_env) postgres_sink = create_processed_events_sink_postgres(t_env) t_env.execute_sql( f""" INSERT INTO {postgres_sink} SELECT PULocationID, DOLocationID, trip_distance, total_amount, TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3) as pickup_datetime FROM {source_table} """ ).wait() if __name__ == '__main__': log_processing() ================================================ FILE: 07-streaming/workshop/live/src/producers/models.py ================================================ import json import dataclasses from dataclasses import dataclass @dataclass class Ride: PULocationID: int DOLocationID: int trip_distance: float total_amount: float tpep_pickup_datetime: int # epoch milliseconds def ride_from_row(row): return Ride( PULocationID=int(row['PULocationID']), DOLocationID=int(row['DOLocationID']), trip_distance=float(row['trip_distance']), total_amount=float(row['total_amount']), tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000), ) def ride_serializer(ride): ride_dict = dataclasses.asdict(ride) ride_json = json.dumps(ride_dict).encode('utf-8') return ride_json def ride_deserializer(data): json_str = data.decode('utf-8') ride_dict = json.loads(json_str) return Ride(**ride_dict) ================================================ FILE: 07-streaming/workshop/live/src/producers/producer_realtime.py 
================================================ import dataclasses import json import random import sys import time from datetime import datetime, timezone from pathlib import Path sys.path.insert(0, str(Path(__file__).parent.parent)) from kafka import KafkaProducer from models import Ride # Top pickup locations from the actual NYC yellow taxi data. # PULocationID is a taxi zone ID (1-263) defined by the NYC TLC. # See https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv PICKUP_LOCATIONS = [ 79, # East Village, Manhattan 107, # Gramercy, Manhattan 48, # Clinton East (Hell's Kitchen), Manhattan 132, # JFK Airport 234, # Union Sq, Manhattan 148, # Lower East Side, Manhattan 249, # West Village, Manhattan 68, # East Chelsea, Manhattan 90, # Flatiron, Manhattan 263, # Yorkville West, Manhattan 138, # LaGuardia Airport 230, # Times Sq/Theatre District, Manhattan 161, # Midtown Center, Manhattan 162, # Midtown East, Manhattan 170, # Murray Hill, Manhattan 237, # Upper East Side South, Manhattan 239, # Upper West Side South, Manhattan 186, # Penn Station/Madison Sq West, Manhattan 164, # Midtown South, Manhattan 236, # Upper East Side North, Manhattan ] DROPOFF_LOCATIONS = PICKUP_LOCATIONS # same pool for simplicity def make_ride(delay_seconds=0): now_ms = int(time.time() * 1000) - delay_seconds * 1000 return Ride( PULocationID=random.choice(PICKUP_LOCATIONS), DOLocationID=random.choice(DROPOFF_LOCATIONS), trip_distance=round(random.uniform(0.5, 20.0), 2), total_amount=round(random.uniform(5.0, 100.0), 2), tpep_pickup_datetime=now_ms, ) def ride_serializer(ride): return json.dumps(dataclasses.asdict(ride)).encode('utf-8') server = 'localhost:9092' producer = KafkaProducer( bootstrap_servers=[server], value_serializer=ride_serializer, ) topic_name = 'rides' count = 0 print("Sending events (Ctrl+C to stop)...") print() try: while True: # ~20% chance of a late event (3-10 seconds old) if random.random() < 0.2: delay = random.randint(3, 10) ride = 
make_ride(delay_seconds=delay) ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc) print(f" LATE ({delay}s) -> PU={ride.PULocationID} ts={ts:%H:%M:%S}") else: ride = make_ride() ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc) print(f" on time -> PU={ride.PULocationID} ts={ts:%H:%M:%S}") producer.send(topic_name, value=ride) count += 1 time.sleep(0.5) except KeyboardInterrupt: producer.flush() print(f"\nSent {count} events") ================================================ FILE: 07-streaming/workshop/pyproject.flink.toml ================================================ [project] name = "pyflink-workshop" version = "0.1.0" requires-python = ">=3.12" dependencies = [ "apache-flink==2.2.0", ] ================================================ FILE: 07-streaming/workshop/pyproject.toml ================================================ [project] name = "workshop" version = "0.1.0" description = "PyFlink Stream Processing Workshop" requires-python = ">=3.12" dependencies = [ "kafka-python>=2.3.0", "pandas>=2.2.0", "psycopg2-binary>=2.9.11", "pyarrow>=19.0.0", ] ================================================ FILE: 07-streaming/workshop/src/consumers/consumer.py ================================================ import sys from datetime import datetime from pathlib import Path sys.path.insert(0, str(Path(__file__).parent.parent)) from kafka import KafkaConsumer from models import ride_deserializer server = 'localhost:9092' topic_name = 'rides' consumer = KafkaConsumer( topic_name, bootstrap_servers=[server], auto_offset_reset='earliest', group_id='rides-console', value_deserializer=ride_deserializer ) print(f"Listening to {topic_name}...") count = 0 for message in consumer: ride = message.value pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000) print(f"Received: PU={ride.PULocationID}, DO={ride.DOLocationID}, " f"distance={ride.trip_distance}, amount=${ride.total_amount:.2f}, " f"pickup={pickup_dt}") 
count += 1 if count >= 10: print(f"\n... received {count} messages so far (stopping after 10 for demo)") break consumer.close() ================================================ FILE: 07-streaming/workshop/src/consumers/consumer_postgres.py ================================================ import sys from datetime import datetime from pathlib import Path sys.path.insert(0, str(Path(__file__).parent.parent)) import psycopg2 from kafka import KafkaConsumer from models import ride_deserializer server = 'localhost:9092' topic_name = 'rides' # Connect to PostgreSQL conn = psycopg2.connect( host='localhost', port=5432, database='postgres', user='postgres', password='postgres' ) conn.autocommit = True cur = conn.cursor() consumer = KafkaConsumer( topic_name, bootstrap_servers=[server], auto_offset_reset='earliest', group_id='rides-to-postgres', value_deserializer=ride_deserializer ) print(f"Listening to {topic_name} and writing to PostgreSQL...") count = 0 for message in consumer: ride = message.value pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000) cur.execute( """INSERT INTO processed_events (PULocationID, DOLocationID, trip_distance, total_amount, pickup_datetime) VALUES (%s, %s, %s, %s, %s)""", (ride.PULocationID, ride.DOLocationID, ride.trip_distance, ride.total_amount, pickup_dt) ) count += 1 if count % 100 == 0: print(f"Inserted {count} rows...") consumer.close() cur.close() conn.close() ================================================ FILE: 07-streaming/workshop/src/job/aggregation_job.py ================================================ from pyflink.datastream import StreamExecutionEnvironment from pyflink.table import EnvironmentSettings, StreamTableEnvironment def create_events_aggregated_sink(t_env): table_name = 'processed_events_aggregated' sink_ddl = f""" CREATE TABLE {table_name} ( window_start TIMESTAMP(3), PULocationID INT, num_trips BIGINT, total_revenue DOUBLE, PRIMARY KEY (window_start, PULocationID) NOT ENFORCED ) WITH ( 'connector' 
= 'jdbc', 'url' = 'jdbc:postgresql://postgres:5432/postgres', 'table-name' = '{table_name}', 'username' = 'postgres', 'password' = 'postgres', 'driver' = 'org.postgresql.Driver' ); """ t_env.execute_sql(sink_ddl) return table_name def create_events_source_kafka(t_env): table_name = "events" source_ddl = f""" CREATE TABLE {table_name} ( PULocationID INTEGER, DOLocationID INTEGER, trip_distance DOUBLE, total_amount DOUBLE, tpep_pickup_datetime BIGINT, event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3), WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND ) WITH ( 'connector' = 'kafka', 'properties.bootstrap.servers' = 'redpanda:29092', 'topic' = 'rides', 'scan.startup.mode' = 'earliest-offset', 'properties.auto.offset.reset' = 'earliest', 'format' = 'json' ); """ t_env.execute_sql(source_ddl) return table_name def log_aggregation(): # Set up the execution environment env = StreamExecutionEnvironment.get_execution_environment() env.enable_checkpointing(10 * 1000) env.set_parallelism(3) # Set up the table environment settings = EnvironmentSettings.new_instance().in_streaming_mode().build() t_env = StreamTableEnvironment.create(env, environment_settings=settings) try: # Create Kafka table source_table = create_events_source_kafka(t_env) aggregated_table = create_events_aggregated_sink(t_env) t_env.execute_sql(f""" INSERT INTO {aggregated_table} SELECT window_start, PULocationID, COUNT(*) AS num_trips, SUM(total_amount) AS total_revenue FROM TABLE( TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '1' HOUR) ) GROUP BY window_start, PULocationID; """).wait() except Exception as e: print("Writing records from Kafka to JDBC failed:", str(e)) if __name__ == '__main__': log_aggregation() ================================================ FILE: 07-streaming/workshop/src/job/aggregation_job_demo.py ================================================ """ Demo aggregation job with 10-second tumbling windows. 
Use with producer_realtime.py to observe watermark behavior: - Watermark = event_timestamp - 5 seconds - Late events (<=5s) arrive before the watermark closes the window -> included - Late events (>5s) may arrive after the watermark closes the window -> dropped """ from pyflink.datastream import StreamExecutionEnvironment from pyflink.table import EnvironmentSettings, StreamTableEnvironment def create_events_source_kafka(t_env): table_name = "events" source_ddl = f""" CREATE TABLE {table_name} ( PULocationID INTEGER, DOLocationID INTEGER, trip_distance DOUBLE, total_amount DOUBLE, tpep_pickup_datetime BIGINT, event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3), WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND ) WITH ( 'connector' = 'kafka', 'properties.bootstrap.servers' = 'redpanda:29092', 'topic' = 'rides', 'scan.startup.mode' = 'latest-offset', 'properties.auto.offset.reset' = 'latest', 'format' = 'json' ); """ t_env.execute_sql(source_ddl) return table_name def create_events_aggregated_sink(t_env): table_name = 'processed_events_aggregated' sink_ddl = f""" CREATE TABLE {table_name} ( window_start TIMESTAMP(3), PULocationID INT, num_trips BIGINT, total_revenue DOUBLE, PRIMARY KEY (window_start, PULocationID) NOT ENFORCED ) WITH ( 'connector' = 'jdbc', 'url' = 'jdbc:postgresql://postgres:5432/postgres', 'table-name' = '{table_name}', 'username' = 'postgres', 'password' = 'postgres', 'driver' = 'org.postgresql.Driver' ); """ t_env.execute_sql(sink_ddl) return table_name def log_aggregation(): env = StreamExecutionEnvironment.get_execution_environment() env.enable_checkpointing(10 * 1000) env.set_parallelism(1) settings = EnvironmentSettings.new_instance().in_streaming_mode().build() t_env = StreamTableEnvironment.create(env, environment_settings=settings) try: source_table = create_events_source_kafka(t_env) aggregated_table = create_events_aggregated_sink(t_env) # 10-second tumbling windows (instead of 1 hour) so we can # observe 
windows closing and late events being dropped t_env.execute_sql(f""" INSERT INTO {aggregated_table} SELECT window_start, PULocationID, COUNT(*) AS num_trips, SUM(total_amount) AS total_revenue FROM TABLE( TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '10' SECOND) ) GROUP BY window_start, PULocationID; """).wait() except Exception as e: print("Writing records from Kafka to JDBC failed:", str(e)) if __name__ == '__main__': log_aggregation() ================================================ FILE: 07-streaming/workshop/src/job/pass_through_job.py ================================================ from pyflink.datastream import StreamExecutionEnvironment from pyflink.table import EnvironmentSettings, StreamTableEnvironment def create_processed_events_sink_postgres(t_env): table_name = 'processed_events' sink_ddl = f""" CREATE TABLE {table_name} ( PULocationID INTEGER, DOLocationID INTEGER, trip_distance DOUBLE, total_amount DOUBLE, pickup_datetime TIMESTAMP ) WITH ( 'connector' = 'jdbc', 'url' = 'jdbc:postgresql://postgres:5432/postgres', 'table-name' = '{table_name}', 'username' = 'postgres', 'password' = 'postgres', 'driver' = 'org.postgresql.Driver' ); """ t_env.execute_sql(sink_ddl) return table_name def create_events_source_kafka(t_env): table_name = "events" source_ddl = f""" CREATE TABLE {table_name} ( PULocationID INTEGER, DOLocationID INTEGER, trip_distance DOUBLE, total_amount DOUBLE, tpep_pickup_datetime BIGINT ) WITH ( 'connector' = 'kafka', 'properties.bootstrap.servers' = 'redpanda:29092', 'topic' = 'rides', 'scan.startup.mode' = 'latest-offset', 'properties.auto.offset.reset' = 'latest', 'format' = 'json' ); """ t_env.execute_sql(source_ddl) return table_name def log_processing(): # Set up the execution environment env = StreamExecutionEnvironment.get_execution_environment() env.enable_checkpointing(10 * 1000) # Set up the table environment settings = EnvironmentSettings.new_instance().in_streaming_mode().build() t_env = 
StreamTableEnvironment.create(env, environment_settings=settings) try: # Create Kafka table source_table = create_events_source_kafka(t_env) postgres_sink = create_processed_events_sink_postgres(t_env) # write records to postgres t_env.execute_sql( f""" INSERT INTO {postgres_sink} SELECT PULocationID, DOLocationID, trip_distance, total_amount, TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3) as pickup_datetime FROM {source_table} """ ).wait() except Exception as e: print("Writing records from Kafka to JDBC failed:", str(e)) if __name__ == '__main__': log_processing() ================================================ FILE: 07-streaming/workshop/src/models.py ================================================ import json from dataclasses import dataclass @dataclass class Ride: PULocationID: int DOLocationID: int trip_distance: float total_amount: float tpep_pickup_datetime: int # epoch milliseconds def ride_from_row(row): return Ride( PULocationID=int(row['PULocationID']), DOLocationID=int(row['DOLocationID']), trip_distance=float(row['trip_distance']), total_amount=float(row['total_amount']), tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000), ) def ride_deserializer(data): json_str = data.decode('utf-8') ride_dict = json.loads(json_str) return Ride(**ride_dict) ================================================ FILE: 07-streaming/workshop/src/producers/producer.py ================================================ import dataclasses import json import sys import time from pathlib import Path sys.path.insert(0, str(Path(__file__).parent.parent)) import pandas as pd from kafka import KafkaProducer from models import Ride, ride_from_row # Download NYC yellow taxi trip data (first 1000 rows) url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet" columns = ['PULocationID', 'DOLocationID', 'trip_distance', 'total_amount', 'tpep_pickup_datetime'] df = pd.read_parquet(url, columns=columns).head(1000) def ride_serializer(ride): 
ride_dict = dataclasses.asdict(ride) json_str = json.dumps(ride_dict) return json_str.encode('utf-8') server = 'localhost:9092' producer = KafkaProducer( bootstrap_servers=[server], value_serializer=ride_serializer ) t0 = time.time() topic_name = 'rides' for _, row in df.iterrows(): ride = ride_from_row(row) producer.send(topic_name, value=ride) print(f"Sent: {ride}") time.sleep(0.01) producer.flush() t1 = time.time() print(f'took {(t1 - t0):.2f} seconds') ================================================ FILE: 07-streaming/workshop/src/producers/producer_realtime.py ================================================ import dataclasses import json import random import sys import time from datetime import datetime, timezone from pathlib import Path sys.path.insert(0, str(Path(__file__).parent.parent)) from kafka import KafkaProducer from models import Ride # Top pickup locations from the actual NYC yellow taxi data. # PULocationID is a taxi zone ID (1-263) defined by the NYC TLC. # See https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv PICKUP_LOCATIONS = [ 79, # East Village, Manhattan 107, # Gramercy, Manhattan 48, # Clinton East (Hell's Kitchen), Manhattan 132, # JFK Airport 234, # Union Sq, Manhattan 148, # Lower East Side, Manhattan 249, # West Village, Manhattan 68, # East Chelsea, Manhattan 90, # Flatiron, Manhattan 263, # Yorkville West, Manhattan 138, # LaGuardia Airport 230, # Times Sq/Theatre District, Manhattan 161, # Midtown Center, Manhattan 162, # Midtown East, Manhattan 170, # Murray Hill, Manhattan 237, # Upper East Side South, Manhattan 239, # Upper West Side South, Manhattan 186, # Penn Station/Madison Sq West, Manhattan 164, # Midtown South, Manhattan 236, # Upper East Side North, Manhattan ] DROPOFF_LOCATIONS = PICKUP_LOCATIONS # same pool for simplicity def make_ride(delay_seconds=0): now_ms = int(time.time() * 1000) - delay_seconds * 1000 return Ride( PULocationID=random.choice(PICKUP_LOCATIONS), 
DOLocationID=random.choice(DROPOFF_LOCATIONS), trip_distance=round(random.uniform(0.5, 20.0), 2), total_amount=round(random.uniform(5.0, 100.0), 2), tpep_pickup_datetime=now_ms, ) def ride_serializer(ride): return json.dumps(dataclasses.asdict(ride)).encode('utf-8') server = 'localhost:9092' producer = KafkaProducer( bootstrap_servers=[server], value_serializer=ride_serializer, ) topic_name = 'rides' count = 0 print("Sending events (Ctrl+C to stop)...") print() try: while True: # ~20% chance of a late event (3-10 seconds old) if random.random() < 0.2: delay = random.randint(3, 10) ride = make_ride(delay_seconds=delay) ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc) print(f" LATE ({delay}s) -> PU={ride.PULocationID} ts={ts:%H:%M:%S}") else: ride = make_ride() ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc) print(f" on time -> PU={ride.PULocationID} ts={ts:%H:%M:%S}") producer.send(topic_name, value=ride) count += 1 time.sleep(0.5) except KeyboardInterrupt: producer.flush() print(f"\nSent {count} events") ================================================ FILE: README.md ================================================

Data Engineering Zoomcamp Overview

Data Engineering Zoomcamp: A Free 9-Week Course on Data Engineering Fundamentals

Master the fundamentals of data engineering by building an end-to-end data pipeline from scratch. Gain hands-on experience with industry-standard tools and best practices.

Join Slack | #course-data-engineering Channel | Telegram Announcements | Course Playlist | FAQ

## How to Enroll ### 2026 Cohort - **Start Date**: 12 January 2026 - **Register Here**: [Sign up](https://airtable.com/shr6oVXeQvSI5HuWD) ### Self-Paced Learning All course materials are freely available for independent study. Follow these steps: 1. Watch the course videos. 2. Join the [Slack community](https://datatalks.club/slack.html). 3. Refer to the [FAQ document](https://datatalks.club/faq/data-engineering-zoomcamp.html) for guidance. ## Syllabus Overview The course consists of structured modules, hands-on workshops, and a final project to reinforce your learning. ### **Prerequisites** To get the most out of this course, you should have: - Basic coding experience - Familiarity with SQL - Experience with Python (helpful but not required) No prior data engineering experience is necessary. ### **Modules** #### [Module 1: Containerization and Infrastructure as Code](01-docker-terraform/) - Introduction to GCP - Docker and Docker Compose - Running PostgreSQL with Docker - Infrastructure setup with Terraform - Homework #### [Module 2: Workflow Orchestration](02-workflow-orchestration/) - Data Lakes and Workflow Orchestration - Workflow orchestration with Kestra - Homework #### [Workshop 1: Data Ingestion](cohorts/2026/workshops/dlt.md) - API reading and pipeline scalability - Data normalization and incremental loading - Homework #### [Module 3: Data Warehousing](03-data-warehouse/) - Introduction to BigQuery - Partitioning, clustering, and best practices - Machine learning in BigQuery #### [Module 4: Analytics Engineering](04-analytics-engineering/) - Analytics Engineering and Data Modeling - dbt (data build tool) with DuckDB & BigQuery - Testing, documentation, and deployment #### [Module 5: Data Platforms](05-data-platforms/) - Building end-to-end data pipelines with Bruin - Data ingestion, transformation, and quality - Deployment to cloud (BigQuery) #### [Module 6: Batch Processing](06-batch/) - Introduction to Apache Spark - DataFrames and SQL - Internals of 
GroupBy and Joins #### [Module 7: Streaming](07-streaming/) - Introduction to Kafka - Kafka Streams and KSQL - Schema management with Avro #### [Final Project](projects/) - Apply all concepts learned in a real-world scenario - Peer review and feedback process ## Testimonials > Thank you for what you do! The Data Engineering Zoomcamp gave me skills that helped me land my first tech job. > > — [Tim Claytor](https://www.linkedin.com/in/claytor/) ([Source](https://www.linkedin.com/feed/update/urn:li:activity:7396882073308938240?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7396882073308938240%2C7396889959711793152%29&dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287396889959711793152%2Curn%3Ali%3Aactivity%3A7396882073308938240%29)) > Three months might seem like a long time, but the growth and learning during this period are truly remarkable. It was a great experience with a lot of learning, connecting with like-minded people from all around the world, and having fun. I must admit, this was really hard. But the feeling of accomplishment and learning made it all worthwhile. And I would do it again! > > — [Nevenka Lukic](https://www.linkedin.com/in/nevenka-lukic/) ([Source](https://www.linkedin.com/posts/nevenka-lukic_data-engineering-zoomcamp-final-project-activity-7181985646033461248-Lc1O?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4)) > One of the significant things I inferred from the Zoomcamp is to prioritize fundamentals and principles over ever-evolving tools and tech stacks. Hugely grateful to Alexey Grigorev for putting together this incredible course and offering it for free. > > — [Siddhartha Gogoi](https://www.linkedin.com/in/siddhartha-gogoi/) ([Source](https://www.linkedin.com/posts/activity-7325692407675604992-XSKI?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4)) > Such a fun deep dive into data engineering, cloud automation, and orchestration. I learned so much along the way. 
Big shoutout to Alexey Grigorev and the DataTalksClub team for the opportunity and guidance throughout the 3 months of the free course. > > — [Assitan NIARE](https://www.linkedin.com/in/assitan-niar%C3%A9-data/) ([Source](https://www.linkedin.com/posts/activity-7317441554023874561-E3wm?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4)) > If you’re serious about breaking into data engineering, start here. The repo’s structure, community, and hands-on focus make it unparalleled. > > — [Wady Osama](https://www.linkedin.com/in/wadyosama/) ([Source](https://www.linkedin.com/posts/wadyosama_dataengineering-zoomcamp-dezoomcamp-activity-7292126824711520258-puJm?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4)) ## Community & Support ### **Getting Help on Slack** Join the [`#course-data-engineering`](https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG) channel on [DataTalks.Club Slack](https://datatalks.club/slack.html) for discussions, troubleshooting, and networking. To keep discussions organized: - Follow [our guidelines](asking-questions.md) when posting questions. - Review the [community guidelines](https://datatalks.club/slack/guidelines.html). 
## Meet the Instructors - [Alexey Grigorev](https://linkedin.com/in/agrigorev) - [Michael Shoemaker](https://www.linkedin.com/in/michaelshoemaker1/) - [Will Russell](https://www.linkedin.com/in/wrussell1999/) - [Anna Geller](https://www.linkedin.com/in/anna-geller-12a86811a/) - [Juan Manuel Perafan](https://www.linkedin.com/in/jmperafan/) - [Arsalan Noorafkan](https://www.linkedin.com/in/arsalan0/) Past instructors: - [Victoria Perez Mola](https://www.linkedin.com/in/victoriaperezmola/) - [Ankush Khanna](https://linkedin.com/in/ankushkhanna2) - [Sejal Vaidya](https://www.linkedin.com/in/vaidyasejal/) - [Irem Erturk](https://www.linkedin.com/in/iremerturk/) - [Luis Oliveira](https://www.linkedin.com/in/lgsoliveira/) - [Zach Wilson](https://www.linkedin.com/in/eczachly) ## Sponsors & Supporters A special thanks to our course sponsors for making this initiative possible!

Interested in supporting our community? Reach out to [alexey@datatalks.club](mailto:alexey@datatalks.club). ## About DataTalks.Club

DataTalks.Club

DataTalks.Club is a global online community of data enthusiasts. It's a place to discuss data, learn, share knowledge, ask and answer questions, and support each other.

Website | Join Slack Community | Newsletter | Upcoming Events | YouTube | GitHub | LinkedIn | Twitter

All the activity at DataTalks.Club mainly happens on [Slack](https://datatalks.club/slack.html). We post updates there and discuss different aspects of data, career questions, and more. At DataTalks.Club, we organize online events, community activities, and free courses. You can learn more about what we do at [DataTalksClub Community Navigation](https://www.notion.so/DataTalksClub-Community-Navigation-bf070ad27ba44bf6bbc9222082f0e5a8?pvs=21). ================================================ FILE: after-sign-up.md ================================================ ## Thank you! Thanks for signing up for the course. The process of adding you to the mailing list is not automated yet, but you will hear from us closer to the course start. To make sure you don't miss any announcements: - Register in [DataTalks.Club's Slack](https://datatalks.club/slack.html) and join the [`#course-data-engineering`](https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG) channel - Join the [course Telegram channel with announcements](https://t.me/dezoomcamp) - Subscribe to [DataTalks.Club's YouTube channel](https://www.youtube.com/c/DataTalksClub) and check [the course playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) See you in January! ================================================ FILE: asking-questions.md ================================================ ## Asking questions If you have any questions, ask them in the [`#course-data-engineering`](https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG) channel in the [DataTalks.Club](https://datatalks.club) Slack. To keep our discussion in Slack more organized, we ask you to follow these suggestions: * First, review [How to troubleshoot issues](#how-to-troubleshoot-issues) below. * Before asking a question, check the [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html). * Before asking a question, review the [Slack guidelines](#asking-in-slack). 
* If somebody helped you with your problem and it's not in the [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html), please add it there. It'll help other students. * Zed Shaw (of the Learn the Hard Way series) has [a great post on how to help others help you](https://learncodethehardway.com/blog/03-how-to-ask-for-help/) * Check the [Stack Overflow guide on asking](https://stackoverflow.com/help/how-to-ask) ### How to troubleshoot issues The first step is to try to solve the issue on your own; get used to solving problems. This is a real-life skill you will need when employed. 1. What does the error say? There will often be a description of the error or instructions on what is needed; sometimes there is even a link to the solution. Does it reference a specific line of your code? 2. Restart the application or server/PC. 3. Google it. You will rarely be the first to have the problem; someone out there has likely posted the issue and the solution. Search using: **technology** **problem statement**. Example: `pgcli error column c.relhasoids does not exist`. * There are often different solutions for the same problem due to variation in environments. 4. Check the tech’s documentation. Use its search if available or use the browser's search function. 5. Try uninstalling (this may remove the bad actor) and reinstalling the application, or re-implementing the action. Don’t forget to restart the server/PC for reinstalls. * Sometimes reinstalling fails to resolve the issue but works if you uninstall first. 6. Ask in Slack. 7. Take a break and come back to it later. You will be amazed at how often you figure out the solution after letting your brain rest. Get some fresh air, work out, play a video game, watch a TV show, whatever allows your brain to not think about it for a little while or even until the next day. 8. 
Remember that technology issues in real life sometimes take days or even weeks to resolve. ### Asking in Slack * Before asking a question, check the [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html). * DO NOT use screenshots, and especially don’t take pictures of your screen with a phone. * DO NOT tag instructors; it may discourage others from helping you. * Copy and paste errors; if an error is long, post it in a reply to your thread. * Use ``` for formatting your code. * Use the same thread for the conversation (that means replying to your own thread). * DO NOT create multiple posts to discuss the issue. * You may create a new post if the issue reemerges down the road. Be sure to describe what has changed in the environment. * In the same thread, describe the steps you have already taken to resolve the issue. ================================================ FILE: awesome-data-engineering.md ================================================ Have you found any cool resources about data engineering? 
Put them here ## Learning Data Engineering ### Courses * [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp) by DataTalks.Club (free) * [Big Data Platforms, Autumn 2022: Introduction to Big Data Processing Frameworks](https://big-data-platforms-22.mooc.fi/) by the University of Helsinki (free) * [Awesome Data Engineering Learning Path](https://awesomedataengineering.com/) ### Books * [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321) * [Big Data: Principles and Best Practices of Scalable Realtime Data Systems by Nathan Marz, James Warren](https://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343) * [Practical DataOps: Delivering Agile Data Science at Scale by Harvinder Atwal](https://www.amazon.com/Practical-DataOps-Delivering-Agile-Science/dp/1484251032) * [Data Pipelines Pocket Reference: Moving and Processing Data for Analytics by James Densmore](https://www.amazon.com/Data-Pipelines-Pocket-Reference-Processing/dp/1492087831) * [Best books for data engineering](https://awesomedataengineering.com/data_engineering_best_books) * [Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis, Matt Housley](https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302) ### Introduction to Data Engineering Terms * [https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html](https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html) ### Data engineering in practice Conference talks from companies, blog posts, etc * [Uber Data Archives](https://eng.uber.com/category/articles/uberdata/) (Uber engineering blog) * [Data Engineering Weekly (DE-focused substack)](https://www.dataengineeringweekly.com/) * [Seattle Data Guy (DE-focused substack)](https://seattledataguy.substack.com/) ## Doing 
Data Engineering

### Coding & Python

* [CS50's Introduction to Computer Science | edX](https://www.edx.org/course/introduction-computer-science-harvardx-cs50x) (course)
* [Python for Everybody Specialization](https://www.coursera.org/specializations/python) (course)
* [Practical Python programming](https://github.com/dabeaz-course/practical-python/blob/master/Notes/Contents.md)

### SQL

* [Intro to SQL: Querying and managing data | Khan Academy](https://www.khanacademy.org/computing/computer-programming/sql)
* [Mode SQL Tutorial](https://mode.com/sql-tutorial/)
* [Use The Index, Luke](https://use-the-index-luke.com/) (SQL Indexing and Tuning e-Book)
* [SQL Performance Explained](https://sql-performance-explained.com/) (book)

### Workflow orchestration

* [What is DAG?](https://youtu.be/1Yh5S-S6wsI) (video)
* [Airflow, Prefect, and Dagster: An Inside Look](https://towardsdatascience.com/airflow-prefect-and-dagster-an-inside-look-6074781c9b77) (blog post)
* [Open-Source Spotlight - Prefect - Kevin Kho](https://www.youtube.com/watch?v=ISLV9JyqF1w) (video)
* [Prefect as a Data Engineering Project Workflow Tool, with Mary Clair Thompson (Duke) - 11/6/2020](https://youtu.be/HuwA4wLQtCM) (video)

### ETL and ELT

* [ETL vs. ELT: What’s the Difference?](https://rivery.io/blog/etl-vs-elt/) (blog post) (print version)

### Data lakes

* [An Introduction to Modern Data Lake Storage Layers (Hudi, Iceberg, Delta Lake)](https://dacort.dev/posts/modern-data-lake-storage-layers/) (blog post)
* [Lake House Architecture @ Halodoc: Data Platform 2.0](https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/amp/) (blog post)

### Data warehousing

* [Guide to Data Warehousing.
Short and comprehensive information… | by Tomas Peluritis](https://medium.com/towards-data-science/guide-to-data-warehousing-6fdcf30b6fbe) (blog post)
* [Snowflake, Redshift, BigQuery, and Others: Cloud Data Warehouse Tools Compared](https://www.altexsoft.com/blog/snowflake-redshift-bigquery-data-warehouse-tools/) (blog post)

### Streaming

* Building Streaming Analytics: The Journey and Learnings - Maxim Lukichev

### DataOps

* [DataOps 101 with Lars Albertsson – DataTalks.Club](https://datatalks.club/podcast/s02e11-dataops.html) (podcast)

### Monitoring and observability

* [Data Observability: The Next Frontier of Data Engineering with Barr Moses](https://datatalks.club/podcast/s03e03-data-observability.html) (podcast)

### Analytics engineering

* [Analytics Engineer: New Role in a Data Team with Victoria Perez Mola](https://datatalks.club/podcast/s03e11-analytics-engineer.html) (podcast)
* [Modern Data Stack for Analytics Engineering - Kyle Shannon](https://www.youtube.com/watch?v=UmIZIkeOfi0) (video)
* [Analytics Engineering vs Data Engineering | RudderStack Blog](https://www.rudderstack.com/blog/analytics-engineering-vs-data-engineering) (blog post)
* [Learn the Fundamentals of Analytics Engineering with dbt](https://courses.getdbt.com/courses/fundamentals) (course)

### Data mesh

* [Data Mesh in Practice - Max Schultze](https://www.youtube.com/watch?v=ekEc8D_D3zY) (video)

### Cloud

* [https://acceldataio.medium.com/data-engineering-best-practices-how-netflix-keeps-its-data-infrastructure-cost-effective-dee310bcc910](https://acceldataio.medium.com/data-engineering-best-practices-how-netflix-keeps-its-data-infrastructure-cost-effective-dee310bcc910)

### Reverse ETL

* TODO: What is reverse ETL?
* [https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html](https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html) * [Open-Source Spotlight - Grouparoo - Brian Leonard](https://www.youtube.com/watch?v=hswlcgQZYuw) (video) * [Open-Source Spotlight - Castled.io (Reverse ETL) - Arun Thulasidharan](https://www.youtube.com/watch?v=iW0XhltAUJ8) (video) ## Career in Data Engineering * [From Data Science to Data Engineering with Ellen König – DataTalks.Club](https://datatalks.club/podcast/s07e08-from-data-science-to-data-engineering.html) (podcast) * [Big Data Engineer vs Data Scientist with Roksolana Diachuk – DataTalks.Club](https://datatalks.club/podcast/s04e03-big-data-engineer-vs-data-scientist.html) (podcast) * [What Skills Do You Need to Become a Data Engineer](https://www.linkedin.com/pulse/what-skills-do-you-need-become-data-engineer-peng-wang/) (blog post) * [The future history of Data Engineering](https://groupby1.substack.com/p/data-engineering?s=r) (blog post) * [What Skills Do Data Engineers Need](https://www.theseattledataguy.com/what-skills-do-data-engineers-need/) (blog post) ### Data Engineering Management * [Becoming a Data Engineering Manager with Rahul Jain – DataTalks.Club](https://datatalks.club/podcast/s07e07-becoming-a-data-engineering-manager.html) (podcast) ## Data engineering projects * [How To Start A Data Engineering Project - With Data Engineering Project Ideas](https://www.youtube.com/watch?v=WpN47Jddo7I) (video) * [Data Engineering Project for Beginners - Batch edition](https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/) (blog post) * [Building a Data Engineering Project in 20 Minutes](https://www.sspaeti.com/blog/data-engineering-project-in-twenty-minutes/) (blog post) * [Automating Nike Run Club Data Analysis with Python, Airflow and Google Data Studio | by Rich Martin | 
Medium](https://medium.com/@rich_23525/automating-nike-run-club-data-analysis-with-python-airflow-and-google-data-studio-3c9556478926) (blog post)

## Data Engineering Resources

### Blogs

* [Start Data Engineering](https://www.startdataengineering.com/)

### Podcasts

* [The Data Engineering Podcast](https://www.dataengineeringpodcast.com/)
* [DataTalks.Club Podcast](https://datatalks.club/podcast.html) (only some episodes are about data engineering)

### Communities

* [DataTalks.Club](https://datatalks.club/)
* [/r/dataengineering](https://www.reddit.com/r/dataengineering)

### Meetups

* [Sydney Data Engineers](https://sydneydataengineers.github.io/)

### People to follow on Twitter and LinkedIn

* TODO

### YouTube channels

* [Karolina Sowinska - YouTube](https://www.youtube.com/channel/UCAxnMry1lETl47xQWABvH7g)
* [Seattle Data Guy - YouTube](https://www.youtube.com/c/SeattleDataGuy)
* [Andreas Kretz - YouTube](https://www.youtube.com/c/andreaskayy)
* [DataTalksClub - YouTube](https://youtube.com/c/datatalksclub) (only some videos are about data engineering)

### Resource aggregators

* [Reading List](https://www.scling.com/reading-list/) by Lars Albertsson
* [GitHub - igorbarinov/awesome-data-engineering](https://github.com/igorbarinov/awesome-data-engineering) (focus is more on tools)
* [GitHub - DataExpert-io/data-engineer-handbook](https://github.com/DataExpert-io/data-engineer-handbook) (contains tools, blogs and more)

## License

This work is licensed under a Creative Commons Attribution 4.0 International License.

CC BY 4.0

================================================
FILE: certificates.md
================================================
## Getting your certificate

Congratulations on finishing the course!

You can find your certificate in your enrollment profile (you need to be logged in):

* For the 2025 edition, it's https://courses.datatalks.club/de-zoomcamp-2025/enrollment

If you can't find a certificate in your profile, it means you didn't pass the project.
If you believe it's a mistake, write in the course channel in Slack.

## Adding to LinkedIn

You can add your certificate to LinkedIn:

* Log in to your LinkedIn account, then go to your profile.
* On the right, in the "Add profile" section dropdown, choose "Background" and then select the drop-down triangle next to "Licenses & Certifications".
* In "Name", enter "Data Engineering Zoomcamp".
* In "Issuing Organization", enter "DataTalksClub".
* (Optional) In "Issue Date", enter the date when the certificate was created.
* (Optional) Select the checkbox "This certification does not expire".
* Enter your certificate ID.
* In "Certification URL", enter the URL for your certificate.

[Adapted from here](https://support.edx.org/hc/en-us/articles/206501938-How-can-I-add-my-certificate-to-my-LinkedIn-profile-)

================================================
FILE: cohorts/2022/README.md
================================================
### 2022 Cohort

* **Start**: 17 January 2022
* **Registration link**: https://airtable.com/shr6oVXeQvSI5HuWD
* [Leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vR9oQiYnAVvzL4dagnhvp0sngqagF0AceD0FGjhS-dnzMTBzNQIal3-hOgkTibVQvfuqbQ69b0fvRnf/pubhtml)

================================================
FILE: cohorts/2022/project.md
================================================
## Course Project

The goal of this project is to apply everything we learned in this course and build an end-to-end data pipeline.

Remember that to pass the project, you must evaluate 3 peers. If you don't do that, your project can't be considered complete.
### Submitting #### Project Cohort #2 Project: * Form: https://forms.gle/JECXB9jYQ1vBXbsw6 * Deadline: 2 May, 22:00 CET Peer reviewing: * Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vShnv8T4iY_5NA8h0nySIS8Wzr-DZGGigEikIW4ZMSi9HlvhaEB4RhwmepVIuIUGaQHS90r5iHR2YXV/pubhtml?gid=964123374&single=true) * Form: https://forms.gle/Pb2fBwYLQ3GGFsaK6 * Deadline: 9 May, 22:00 CET #### Project Cohort #1 Project: * Form: https://forms.gle/6aeVcEVJipqR2BqC8 * Deadline: 4 April, 22:00 CET Peer reviewing: * Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vShnv8T4iY_5NA8h0nySIS8Wzr-DZGGigEikIW4ZMSi9HlvhaEB4RhwmepVIuIUGaQHS90r5iHR2YXV/pubhtml) * Form: https://forms.gle/AZ62bXMp4SGcVUmK7 * Deadline: 11 April, 22:00 CET Project feedback: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vRcVCkO-jes5mbPAcikn9X_s2laJ1KhsO8aibHYQxxKqdCUYMVTEJLJQdM8C5aAUWKFl_0SJW4rme7H/pubhtml) ================================================ FILE: cohorts/2022/week_1_basics_n_setup/homework.md ================================================ ## Week 1 Homework In this homework we'll prepare the environment and practice with terraform and SQL ## Question 1. Google Cloud SDK Install Google Cloud SDK. What's the version you have? To get the version, run `gcloud --version` ## Google Cloud account Create an account in Google Cloud and create a project. ## Question 2. Terraform Now install terraform and go to the terraform directory (`week_1_basics_n_setup/1_terraform_gcp/terraform`) After that, run * `terraform init` * `terraform plan` * `terraform apply` Apply the plan and copy the output (after running `apply`) to the form. It should be the entire output - from the moment you typed `terraform init` to the very end. 
## Prepare Postgres

Run Postgres and load data as shown in the videos.

We'll use the yellow taxi trips from January 2021:

```bash
wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv
```

You will also need the dataset with zones:

```bash
wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
```

Download this data and load it into Postgres.

## Question 3. Count records

How many taxi trips were there on January 15?

Consider only trips that started on January 15.

## Question 4. Largest tip for each day

Find the largest tip for each day. On which day in January was the tip the largest?

Use the pick up time for your calculations.

(note: it's not a typo, it's "tip", not "trip")

## Question 5. Most popular destination

What was the most popular destination for passengers picked up in Central Park on January 14?

Use the pick up time for your calculations.

Enter the zone name (not id). If the zone name is unknown (missing), write "Unknown".

## Question 6. Most expensive locations

What's the pickup-dropoff pair with the largest average price for a ride (calculated based on `total_amount`)?

Enter two zone names separated by a slash.

For example:

"Jamaica Bay / Clinton East"

If any of the zone names are unknown (missing), write "Unknown". For example, "Unknown / Clinton East".

## Submitting the solutions

* Form for submitting: https://forms.gle/yGQrkgRdVbiFs8Vd7
* You can submit your homework multiple times. In this case, only the last submission will be used.

Deadline: 26 January (Wednesday), 22:00 CET

## Solution

Here is the solution to questions 3-6: [video](https://www.youtube.com/watch?v=HxHqH2ARfxM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)

================================================
FILE: cohorts/2022/week_2_data_ingestion/README.md
================================================
## Week 2: Data Ingestion

### Data Lake (GCS)

* What is a Data Lake
* ELT vs. ETL
* Alternatives to components (S3/HDFS, Redshift, Snowflake etc.)
:movie_camera: [Video](https://www.youtube.com/watch?v=W3Zm6rjOq70&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) [Slides](https://docs.google.com/presentation/d/1RkH-YhBz2apIjYZAxUz2Uks4Pt51-fVWVN9CcH9ckyY/edit?usp=sharing) ### Introduction to Workflow orchestration * What is an Orchestration Pipeline? * What is a DAG? * [Video](https://www.youtube.com/watch?v=0yK7LXwYeD0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) ### Setting up Airflow locally * Setting up Airflow with Docker-Compose * [Video](https://www.youtube.com/watch?v=lqDMzReAtrw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) * More information in the [airflow folder](airflow) If you want to run a lighter version of Airflow with fewer services, check this [video](https://www.youtube.com/watch?v=A1p5LQ0zzaQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb). It's optional. ### Ingesting data to GCP with Airflow * Extraction: Download and unpack the data * Pre-processing: Convert this raw data to parquet * Upload the parquet files to GCS * Create an external table in BigQuery * [Video](https://www.youtube.com/watch?v=9ksX9REfL8w&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=19) ### Ingesting data to Local Postgres with Airflow * Converting the ingestion script for loading data to Postgres to Airflow DAG * [Video](https://www.youtube.com/watch?v=s2U8MWJH5xA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) ### Transfer service (AWS -> GCP) Moving files from AWS to GCP. You will need an AWS account for this. This section is optional * [Video 1](https://www.youtube.com/watch?v=rFOFTfD1uGk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) * [Video 2](https://www.youtube.com/watch?v=VhmmbqpIzeI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) ### Homework In the homework, you'll create a few DAGs for processing the NY Taxi data for 2019-2021 More information [here](homework.md) ## Community notes Did you take notes? You can share them here. 
* [Notes from Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/2_data_ingestion.md) * [Notes from Aaron Wright](https://github.com/ABZ-Aaron/DataEngineerZoomCamp/blob/master/week_2_data_ingestion/README.md) * [Notes from Abd](https://itnadigital.notion.site/Week-2-Data-Ingestion-ec2d0d36c0664bc4b8be6a554b2765fd) * [Blog post by Isaac Kargar](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/25/data-engineering-w2.html) * [Blog, notes, walkthroughs by Sandy Behrens](https://learningdataengineering540969211.wordpress.com/2022/01/30/week-2-de-zoomcamp-2-3-2-ingesting-data-to-gcp-with-airflow/) * [Notes from Apurva Hegde](https://github.com/apuhegde/Airflow-LocalExecutor-In-Docker#readme) * [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd) * Add your notes here (above this line) ================================================ FILE: cohorts/2022/week_2_data_ingestion/airflow/.env_example ================================================ # Custom COMPOSE_PROJECT_NAME=dtc-de GOOGLE_APPLICATION_CREDENTIALS=/.google/credentials/google_credentials.json AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json # AIRFLOW_UID= GCP_PROJECT_ID= GCP_GCS_BUCKET= # Postgres POSTGRES_USER=airflow POSTGRES_PASSWORD=airflow POSTGRES_DB=airflow # Airflow AIRFLOW__CORE__EXECUTOR=LocalExecutor AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=10 AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB} AIRFLOW_CONN_METADATA_DB=postgres+psycopg2://airflow:airflow@postgres:5432/airflow AIRFLOW_VAR__METADATA_DB_SCHEMA=airflow _AIRFLOW_WWW_USER_CREATE=True _AIRFLOW_WWW_USER_USERNAME=${_AIRFLOW_WWW_USER_USERNAME:airflow} _AIRFLOW_WWW_USER_PASSWORD=${_AIRFLOW_WWW_USER_PASSWORD:airflow} 
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True AIRFLOW__CORE__LOAD_EXAMPLES=False ================================================ FILE: cohorts/2022/week_2_data_ingestion/airflow/1_setup_official.md ================================================ ## Setup (Official) ### Pre-Reqs 1. For the sake of standardization across this workshop's config, rename your gcp-service-accounts-credentials file to `google_credentials.json` & store it in your `$HOME` directory ``` bash cd ~ && mkdir -p ~/.google/credentials/ mv .json ~/.google/credentials/google_credentials.json ``` 2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to minimum 5GB (ideally 8GB). If enough memory is not allocated, it might lead to airflow-webserver continuously restarting. 3. Python version: 3.7+ ### Airflow Setup 1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in) 2. **Set the Airflow user**: On Linux, the quick-start needs to know your host user-id and needs to have group id set to 0. Otherwise the files created in `dags`, `logs` and `plugins` will be created with root user. You have to make sure to configure them for the docker-compose: ```bash mkdir -p ./dags ./logs ./plugins echo -e "AIRFLOW_UID=$(id -u)" > .env ``` On Windows you will probably also need it. If you use MINGW/GitBash, execute the same command. To get rid of the warning ("AIRFLOW_UID is not set"), you can create `.env` file with this content: ``` AIRFLOW_UID=50000 ``` 3. **Import the official docker setup file** from the latest Airflow version: ```shell curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml' ``` 4. It could be overwhelming to see a lot of services in here. But this is only a quick-start template, and as you proceed you'll figure out which unused services can be removed. Eg. [Here's](docker-compose-nofrills.yml) a no-frills version of that template. 5. 
**Docker Build**: When you want to run Airflow locally, you might want to use an extended image, containing some additional dependencies - for example you might add new python packages, or upgrade airflow providers to a later version.

   Create a `Dockerfile` pointing to the Airflow version you've just downloaded, such as `apache/airflow:2.2.3`, as the base image, and customize this `Dockerfile` by:
   * Adding your custom packages to be installed. The one we'll need the most is `gcloud` to connect with the GCS bucket/Data Lake.
   * Also, integrating `requirements.txt` to install libraries via `pip install`

6. **Docker Compose**: Back in your `docker-compose.yaml`:
   * In `x-airflow-common`:
     * Remove the `image` tag, to replace it with your `build` from your Dockerfile, as shown
     * Mount your `google_credentials` in `volumes` section as read-only
     * Set environment variables: `GCP_PROJECT_ID`, `GCP_GCS_BUCKET`, `GOOGLE_APPLICATION_CREDENTIALS` & `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`, as per your config.
   * Change `AIRFLOW__CORE__LOAD_EXAMPLES` to `false` (optional)

7. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose.yml](./docker-compose.yaml) should look.

## Problems

### `File /.google/credentials/google_credentials.json was not found`

First, make sure you have your credentials in your `$HOME/.google/credentials`. Maybe you missed the step and didn't copy your JSON with credentials there? Also, make sure the file name is `google_credentials.json`.

Second, check that docker-compose can correctly map this directory to the airflow worker.

Execute `docker ps` to see the list of docker containers running on your host machine and find the ID of the airflow worker.

Then execute `bash` on this container:

```bash
docker exec -it bash
```

Now check if the file with credentials is actually there:

```bash
ls -lh /.google/credentials/
```

If it's empty, docker-compose couldn't map the folder with credentials.
In this case, try changing it to the absolute path to this folder: ```yaml volumes: - ./dags:/opt/airflow/dags - ./logs:/opt/airflow/logs - ./plugins:/opt/airflow/plugins # here: ---------------------------- - c:/Users/alexe/.google/credentials/:/.google/credentials:ro # ----------------------------------- ``` ================================================ FILE: cohorts/2022/week_2_data_ingestion/airflow/2_setup_nofrills.md ================================================ ## Setup (No-frills) ### Pre-Reqs 1. For the sake of standardization across this workshop's config, rename your gcp-service-accounts-credentials file to `google_credentials.json` & store it in your `$HOME` directory ``` bash cd ~ && mkdir -p ~/.google/credentials/ mv .json ~/.google/credentials/google_credentials.json ``` 2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to minimum 4GB (ideally 8GB). If enough memory is not allocated, it might lead to airflow-webserver continuously restarting. 3. Python version: 3.7+ ### Airflow Setup 1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in) 2. **Set the Airflow user**: On Linux, the quick-start needs to know your host user-id and needs to have group id set to 0. Otherwise the files created in `dags`, `logs` and `plugins` will be created with root user. You have to make sure to configure them for the docker-compose: ```bash mkdir -p ./dags ./logs ./plugins echo -e "AIRFLOW_UID=$(id -u)" >> .env ``` On Windows you will probably also need it. If you use MINGW/GitBash, execute the same command. To get rid of the warning ("AIRFLOW_UID is not set"), you can create `.env` file with this content: ``` AIRFLOW_UID=50000 ``` 3. 
**Docker Build**: When you want to run Airflow locally, you might want to use an extended image, containing some additional dependencies - for example you might add new python packages, or upgrade airflow providers to a later version. Create a `Dockerfile` pointing to the latest Airflow version such as `apache/airflow:2.2.3`, for the base image, And customize this `Dockerfile` by: * Adding your custom packages to be installed. The one we'll need the most is `gcloud` to connect with the GCS bucket (Data Lake). * Also, integrating `requirements.txt` to install libraries via `pip install` 4. Copy [docker-compose-nofrills.yml](docker-compose-nofrills.yml), [.env_example](.env_example) & [entrypoint.sh](scripts/entrypoint.sh) from this repo. The changes from the official setup are: * Removal of `redis` queue, `worker`, `triggerer`, `flower` & `airflow-init` services, and changing from `CeleryExecutor` (multi-node) mode to `LocalExecutor` (single-node) mode * Inclusion of `.env` for better parametrization & flexibility * Inclusion of simple `entrypoint.sh` to the `webserver` container, responsible to initialize the database and create login-user (admin). * Updated `Dockerfile` to grant permissions on executing `scripts/entrypoint.sh` 5. `.env`: * Rebuild your `.env` file by making a copy of `.env_example` (but make sure your `AIRFLOW_UID` remains): ```shell mv .env_example .env ``` * Set environment variables `AIRFLOW_UID`, `GCP_PROJECT_ID` & `GCP_GCS_BUCKET`, as per your config. * Optionally, if your `google-credentials.json` is stored somewhere else, such as a path like `$HOME/.gc`, modify the env-vars (`GOOGLE_APPLICATION_CREDENTIALS`, `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`) and `volumes` path in `docker-compose-nofrills.yml` 6. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose-nofrills](./docker-compose-nofrills.yml) should look. 
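
Before starting the containers, it can help to confirm that the `.env` file actually sets the variables mentioned in step 5. The following is a minimal sketch and not part of the course repo: the helper names `parse_env` and `missing_vars` are hypothetical, and it assumes a plain `KEY=VALUE` layout with `#` comments, as in `.env_example`. It does not expand `${...}` references the way docker-compose does.

```python
from pathlib import Path

# Variables that step 5 asks you to set; adjust the tuple to your own config.
REQUIRED_VARS = ("AIRFLOW_UID", "GCP_PROJECT_ID", "GCP_GCS_BUCKET")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(text, required=REQUIRED_VARS):
    """Return the required variables that are absent or set to an empty value."""
    values = parse_env(text)
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")  # assumed location: next to the compose file
    if env_path.exists():
        missing = missing_vars(env_path.read_text())
        print("Missing or empty:", ", ".join(missing) if missing else "none")
    else:
        print(".env not found in the current directory")
```

Run against an unmodified copy of `.env_example`, this sketch would report all three variables, since `AIRFLOW_UID` is commented out there and the two GCP values are empty.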
## Problems

### `no-frills setup does not work for me - WSL/Windows user `

If you are running Docker in Windows/WSL/WSL2 and you have encountered some `ModuleNotFoundError` or low performance issues, take a look at this [Airflow & WSL2 gist](https://gist.github.com/nervuzz/d1afe81116cbfa3c834634ebce7f11c5) focused entirely on troubleshooting possible problems.

### `File /.google/credentials/google_credentials.json was not found`

First, make sure you have your credentials in your `$HOME/.google/credentials`. Maybe you missed the step and didn't copy your JSON with credentials there? Also, make sure the file name is `google_credentials.json`.

Second, check that docker-compose can correctly map this directory to the airflow worker.

Execute `docker ps` to see the list of docker containers running on your host machine and find the ID of the airflow worker.

Then execute `bash` on this container:

```bash
docker exec -it bash
```

Now check if the file with credentials is actually there:

```bash
ls -lh /.google/credentials/
```

If it's empty, docker-compose couldn't map the folder with credentials. In this case, try changing it to the absolute path to this folder:

```yaml
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    # here: ----------------------------
    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro
    # -----------------------------------
```

================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/Dockerfile
================================================
# First-time build can take up to 10 mins.

FROM apache/airflow:2.2.3

ENV AIRFLOW_HOME=/opt/airflow

USER root
RUN apt-get update -qq && apt-get install vim -qqq
# git gcc g++ -qqq

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Ref: https://airflow.apache.org/docs/docker-stack/recipes.html

SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]

ARG CLOUD_SDK_VERSION=322.0.0
ENV GCLOUD_HOME=/home/google-cloud-sdk

ENV PATH="${GCLOUD_HOME}/bin/:${PATH}"

RUN DOWNLOAD_URL="https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz" \
    && TMP_DIR="$(mktemp -d)" \
    && curl -fL "${DOWNLOAD_URL}" --output "${TMP_DIR}/google-cloud-sdk.tar.gz" \
    && mkdir -p "${GCLOUD_HOME}" \
    && tar xzf "${TMP_DIR}/google-cloud-sdk.tar.gz" -C "${GCLOUD_HOME}" --strip-components=1 \
    && "${GCLOUD_HOME}/install.sh" \
       --bash-completion=false \
       --path-update=false \
       --usage-reporting=false \
       --quiet \
    && rm -rf "${TMP_DIR}" \
    && gcloud --version

WORKDIR $AIRFLOW_HOME

COPY scripts scripts
RUN chmod +x scripts

USER $AIRFLOW_UID

================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/README.md
================================================
### Concepts

[Airflow Concepts and Architecture](docs/1_concepts.md)

### Workflow

![](docs/gcs_ingestion_dag.png)

### Setup - Official Version

(For the section on the Custom/Lightweight setup, scroll down)

#### Setup

[Airflow Setup with Docker, through official guidelines](1_setup_official.md)

#### Execution

1. Build the image (only the first time, or when there's any change in the `Dockerfile`; takes ~15 mins the first time):
   ```shell
   docker-compose build
   ```

   or (for legacy versions)

   ```shell
   docker build .
   ```

2. Initialize the Airflow scheduler, DB, and other config
   ```shell
   docker-compose up airflow-init
   ```

3. Start all the services from the container:
   ```shell
   docker-compose up
   ```

4. In another terminal, run `docker-compose ps` to see which containers are up & running (there should be 7, matching with the services in your docker-compose file).

5.
Log in to the Airflow web UI on `localhost:8080` with default creds: `airflow/airflow`

6. Run your DAG on the Web Console.

7. On finishing your run or to shut down the container/s:
   ```shell
   docker-compose down
   ```

   To stop and delete containers, delete volumes with database data, and delete downloaded images, run:
   ```
   docker-compose down --volumes --rmi all
   ```

   or
   ```
   docker-compose down --volumes --remove-orphans
   ```

### Setup - Custom No-Frills Version (Lightweight)

This is a quick, simple & less memory-intensive setup of Airflow that works on a LocalExecutor.

#### Setup

[Airflow Setup with Docker, customized](2_setup_nofrills.md)

#### Execution

1. Stop and delete containers, delete volumes with database data, & downloaded images (from the previous setup):
   ```
   docker-compose down --volumes --rmi all
   ```

   or
   ```
   docker-compose down --volumes --remove-orphans
   ```

   Or, if you need to clear your system of any pre-cached Docker issues:
   ```
   docker system prune
   ```

   Also, empty the airflow `logs` directory.

2. Build the image (only the first time, or when there's any change in the `Dockerfile`). Takes ~5-10 mins the first time.
   ```shell
   docker-compose build
   ```

   or (for legacy versions)

   ```shell
   docker build .
   ```

3. Start all the services from the container (no need to specially initialize):
   ```shell
   docker-compose -f docker-compose-nofrills.yml up
   ```

4. In another terminal, run `docker ps` to see which containers are up & running (there should be 3, matching with the services in your docker-compose file).

5. Log in to the Airflow web UI on `localhost:8080` with creds: `admin/admin` (explicit creation of admin user was required)

6. Run your DAG on the Web Console.

7. On finishing your run or to shut down the container/s:
   ```shell
   docker-compose down
   ```

### Setup - Taken from DE Zoomcamp 2.3.4 - Optional: Lightweight Local Setup for Airflow

Use the docker-compose_2.3.4.yaml file (and rename it to docker-compose.yaml).
Don't forget to replace the variables `GCP_PROJECT_ID` and `GCP_GCS_BUCKET`.

### Future Enhancements

* Deploy self-hosted Airflow setup on Kubernetes cluster, or use a Managed Airflow (Cloud Composer) service by GCP

### References

For more info, check out these official docs:

* https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html
* https://airflow.apache.org/docs/docker-stack/build.html
* https://airflow.apache.org/docs/docker-stack/recipes.html

================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/dags/data_ingestion_gcs_dag.py
================================================
import os
import logging

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from google.cloud import storage
from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator
import pyarrow.csv as pv
import pyarrow.parquet as pq

PROJECT_ID = os.environ.get("GCP_PROJECT_ID")
BUCKET = os.environ.get("GCP_GCS_BUCKET")

dataset_file = "yellow_tripdata_2021-01.csv"
dataset_url = f"https://s3.amazonaws.com/nyc-tlc/trip+data/{dataset_file}"
path_to_local_home = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")
parquet_file = dataset_file.replace('.csv', '.parquet')
BIGQUERY_DATASET = os.environ.get("BIGQUERY_DATASET", 'trips_data_all')


def format_to_parquet(src_file):
    if not src_file.endswith('.csv'):
        logging.error("Can only accept source files in CSV format, for the moment")
        return
    table = pv.read_csv(src_file)
    pq.write_table(table, src_file.replace('.csv', '.parquet'))


# NOTE: takes 20 mins, at an upload speed of 800kbps.
Faster if your internet has a better upload speed def upload_to_gcs(bucket, object_name, local_file): """ Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python :param bucket: GCS bucket name :param object_name: target path & file-name :param local_file: source path & file-name :return: """ # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed. # (Ref: https://github.com/googleapis/python-storage/issues/74) storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024 # 5 MB storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024 # 5 MB # End of Workaround client = storage.Client() bucket = client.bucket(bucket) blob = bucket.blob(object_name) blob.upload_from_filename(local_file) default_args = { "owner": "airflow", "start_date": days_ago(1), "depends_on_past": False, "retries": 1, } # NOTE: DAG declaration - using a Context Manager (an implicit way) with DAG( dag_id="data_ingestion_gcs_dag", schedule_interval="@daily", default_args=default_args, catchup=False, max_active_runs=1, tags=['dtc-de'], ) as dag: download_dataset_task = BashOperator( task_id="download_dataset_task", bash_command=f"curl -sSL {dataset_url} > {path_to_local_home}/{dataset_file}" ) format_to_parquet_task = PythonOperator( task_id="format_to_parquet_task", python_callable=format_to_parquet, op_kwargs={ "src_file": f"{path_to_local_home}/{dataset_file}", }, ) # TODO: Homework - research and try XCOM to communicate output values between 2 tasks/operators local_to_gcs_task = PythonOperator( task_id="local_to_gcs_task", python_callable=upload_to_gcs, op_kwargs={ "bucket": BUCKET, "object_name": f"raw/{parquet_file}", "local_file": f"{path_to_local_home}/{parquet_file}", }, ) bigquery_external_table_task = BigQueryCreateExternalTableOperator( task_id="bigquery_external_table_task", table_resource={ "tableReference": { "projectId": PROJECT_ID, "datasetId": BIGQUERY_DATASET, "tableId": "external_table", }, "externalDataConfiguration": { "sourceFormat": 
"PARQUET", "sourceUris": [f"gs://{BUCKET}/raw/{parquet_file}"], }, }, ) download_dataset_task >> format_to_parquet_task >> local_to_gcs_task >> bigquery_external_table_task ================================================ FILE: cohorts/2022/week_2_data_ingestion/airflow/dags_local/data_ingestion_local.py ================================================ import os from datetime import datetime from airflow import DAG from airflow.operators.bash import BashOperator from airflow.operators.python import PythonOperator from ingest_script import ingest_callable AIRFLOW_HOME = os.environ.get("AIRFLOW_HOME", "/opt/airflow/") PG_HOST = os.getenv('PG_HOST') PG_USER = os.getenv('PG_USER') PG_PASSWORD = os.getenv('PG_PASSWORD') PG_PORT = os.getenv('PG_PORT') PG_DATABASE = os.getenv('PG_DATABASE') local_workflow = DAG( "LocalIngestionDag", schedule_interval="0 6 2 * *", start_date=datetime(2021, 1, 1) ) URL_PREFIX = 'https://s3.amazonaws.com/nyc-tlc/trip+data' URL_TEMPLATE = URL_PREFIX + '/yellow_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv' OUTPUT_FILE_TEMPLATE = AIRFLOW_HOME + '/output_{{ execution_date.strftime(\'%Y-%m\') }}.csv' TABLE_NAME_TEMPLATE = 'yellow_taxi_{{ execution_date.strftime(\'%Y_%m\') }}' with local_workflow: wget_task = BashOperator( task_id='wget', bash_command=f'curl -sSL {URL_TEMPLATE} > {OUTPUT_FILE_TEMPLATE}' ) ingest_task = PythonOperator( task_id="ingest", python_callable=ingest_callable, op_kwargs=dict( user=PG_USER, password=PG_PASSWORD, host=PG_HOST, port=PG_PORT, db=PG_DATABASE, table_name=TABLE_NAME_TEMPLATE, csv_file=OUTPUT_FILE_TEMPLATE ), ) wget_task >> ingest_task ================================================ FILE: cohorts/2022/week_2_data_ingestion/airflow/dags_local/ingest_script.py ================================================ import os from time import time import pandas as pd from sqlalchemy import create_engine def ingest_callable(user, password, host, port, db, table_name, csv_file, execution_date): print(table_name, 
csv_file, execution_date) engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}') engine.connect() print('connection established successfully, inserting data...') t_start = time() df_iter = pd.read_csv(csv_file, iterator=True, chunksize=100000) df = next(df_iter) df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime) df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime) df.head(n=0).to_sql(name=table_name, con=engine, if_exists='replace') df.to_sql(name=table_name, con=engine, if_exists='append') t_end = time() print('inserted the first chunk, took %.3f second' % (t_end - t_start)) while True: t_start = time() try: df = next(df_iter) except StopIteration: print("completed") break df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime) df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime) df.to_sql(name=table_name, con=engine, if_exists='append') t_end = time() print('inserted another chunk, took %.3f second' % (t_end - t_start)) ================================================ FILE: cohorts/2022/week_2_data_ingestion/airflow/docker-compose-nofrills.yml ================================================ version: '3' services: postgres: image: postgres:13 env_file: - .env volumes: - postgres-db-volume:/var/lib/postgresql/data healthcheck: test: ["CMD", "pg_isready", "-U", "airflow"] interval: 5s retries: 5 restart: always scheduler: build: . command: scheduler restart: on-failure depends_on: - postgres env_file: - .env volumes: - ./dags:/opt/airflow/dags - ./logs:/opt/airflow/logs - ./plugins:/opt/airflow/plugins - ./scripts:/opt/airflow/scripts - ~/.google/credentials/:/.google/credentials webserver: build: . 
entrypoint: ./scripts/entrypoint.sh restart: on-failure depends_on: - postgres - scheduler env_file: - .env volumes: - ./dags:/opt/airflow/dags - ./logs:/opt/airflow/logs - ./plugins:/opt/airflow/plugins - ~/.google/credentials/:/.google/credentials:ro - ./scripts:/opt/airflow/scripts user: "${AIRFLOW_UID:-50000}:0" ports: - "8080:8080" healthcheck: test: [ "CMD-SHELL", "[ -f /home/airflow/airflow-webserver.pid ]" ] interval: 30s timeout: 30s retries: 3 volumes: postgres-db-volume: ================================================ FILE: cohorts/2022/week_2_data_ingestion/airflow/docker-compose.yaml ================================================ # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. The ASF licenses this file # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY # KIND, either express or implied. See the License for the # specific language governing permissions and limitations # under the License. # # Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL. # # WARNING: This configuration is for local development. Do not use it in a production deployment. # # This configuration supports basic configuration using environment variables or an .env file # The following variables are supported: # # AIRFLOW_IMAGE_NAME - Docker image name used to run Airflow. 
# Default: apache/airflow:2.2.3 # AIRFLOW_UID - User ID in Airflow containers # Default: 50000 # Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode # # _AIRFLOW_WWW_USER_USERNAME - Username for the administrator account (if requested). # Default: airflow # _AIRFLOW_WWW_USER_PASSWORD - Password for the administrator account (if requested). # Default: airflow # _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers. # Default: '' # # Feel free to modify this file to suit your needs. --- version: '3' x-airflow-common: &airflow-common # In order to add custom dependencies or upgrade provider packages you can use your extended image. # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml # and uncomment the "build" line below, Then run `docker-compose build` to build the images. build: context: . dockerfile: ./Dockerfile environment: &airflow-common-env AIRFLOW__CORE__EXECUTOR: CeleryExecutor AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0 AIRFLOW__CORE__FERNET_KEY: '' AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true' AIRFLOW__CORE__LOAD_EXAMPLES: 'false' AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth' _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-} GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json' # TODO: Please change GCP_PROJECT_ID & GCP_GCS_BUCKET, as per your config GCP_PROJECT_ID: 'pivotal-surfer-336713' GCP_GCS_BUCKET: 'dtc_data_lake_pivotal-surfer-336713' volumes: - ./dags:/opt/airflow/dags - ./logs:/opt/airflow/logs - 
./plugins:/opt/airflow/plugins - ~/.google/credentials/:/.google/credentials:ro user: "${AIRFLOW_UID:-50000}:0" depends_on: &airflow-common-depends-on redis: condition: service_healthy postgres: condition: service_healthy services: postgres: image: postgres:13 environment: POSTGRES_USER: airflow POSTGRES_PASSWORD: airflow POSTGRES_DB: airflow volumes: - postgres-db-volume:/var/lib/postgresql/data healthcheck: test: ["CMD", "pg_isready", "-U", "airflow"] interval: 5s retries: 5 restart: always redis: image: redis:latest expose: - 6379 healthcheck: test: ["CMD", "redis-cli", "ping"] interval: 5s timeout: 30s retries: 50 restart: always airflow-webserver: <<: *airflow-common command: webserver ports: - 8080:8080 healthcheck: test: ["CMD", "curl", "--fail", "http://localhost:8080/health"] interval: 10s timeout: 10s retries: 5 restart: always depends_on: <<: *airflow-common-depends-on airflow-init: condition: service_completed_successfully airflow-scheduler: <<: *airflow-common command: scheduler healthcheck: test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"'] interval: 10s timeout: 10s retries: 5 restart: always depends_on: <<: *airflow-common-depends-on airflow-init: condition: service_completed_successfully airflow-worker: <<: *airflow-common command: celery worker healthcheck: test: - "CMD-SHELL" - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"' interval: 10s timeout: 10s retries: 5 environment: <<: *airflow-common-env # Required to handle warm shutdown of the celery workers properly # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation DUMB_INIT_SETSID: "0" restart: always depends_on: <<: *airflow-common-depends-on airflow-init: condition: service_completed_successfully airflow-triggerer: <<: *airflow-common command: triggerer healthcheck: test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"'] interval: 10s timeout: 
10s retries: 5 restart: always depends_on: <<: *airflow-common-depends-on airflow-init: condition: service_completed_successfully airflow-init: <<: *airflow-common entrypoint: /bin/bash # yamllint disable rule:line-length command: - -c - | function ver() { printf "%04d%04d%04d%04d" $${1//./ } } airflow_version=$$(gosu airflow airflow version) airflow_version_comparable=$$(ver $${airflow_version}) min_airflow_version=2.2.0 min_airflow_version_comparable=$$(ver $${min_airflow_version}) if (( airflow_version_comparable < min_airflow_version_comparable )); then echo echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m" echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!" echo exit 1 fi if [[ -z "${AIRFLOW_UID}" ]]; then echo echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m" echo "If you are on Linux, you SHOULD follow the instructions below to set " echo "AIRFLOW_UID environment variable, otherwise files will be owned by root." echo "For other operating systems you can get rid of the warning with manually created .env file:" echo " See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user" echo fi one_meg=1048576 mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg)) cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat) disk_available=$$(df / | tail -1 | awk '{print $$4}') warning_resources="false" if (( mem_available < 4000 )) ; then echo echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m" echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))" echo warning_resources="true" fi if (( cpus_available < 2 )); then echo echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m" echo "At least 2 CPUs recommended. 
You have $${cpus_available}" echo warning_resources="true" fi if (( disk_available < one_meg * 10 )); then echo echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m" echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))" echo warning_resources="true" fi if [[ $${warning_resources} == "true" ]]; then echo echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m" echo "Please follow the instructions to increase amount of resources available:" echo " https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin" echo fi mkdir -p /sources/logs /sources/dags /sources/plugins chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins} exec /entrypoint airflow version # yamllint enable rule:line-length environment: <<: *airflow-common-env _AIRFLOW_DB_UPGRADE: 'true' _AIRFLOW_WWW_USER_CREATE: 'true' _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow} _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow} user: "0:0" volumes: - .:/sources airflow-cli: <<: *airflow-common profiles: - debug environment: <<: *airflow-common-env CONNECTION_CHECK_MAX_COUNT: "0" # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252 command: - bash - -c - airflow flower: <<: *airflow-common command: celery flower ports: - 5555:5555 healthcheck: test: ["CMD", "curl", "--fail", "http://localhost:5555/"] interval: 10s timeout: 10s retries: 5 restart: always depends_on: <<: *airflow-common-depends-on airflow-init: condition: service_completed_successfully volumes: postgres-db-volume: ================================================ FILE: cohorts/2022/week_2_data_ingestion/airflow/docker-compose_2.3.4.yaml ================================================ # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. 
See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. The ASF licenses this file # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY # KIND, either express or implied. See the License for the # specific language governing permissions and limitations # under the License. # # Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL. # # WARNING: This configuration is for local development. Do not use it in a production deployment. # # This configuration supports basic configuration using environment variables or an .env file # The following variables are supported: # # AIRFLOW_IMAGE_NAME - Docker image name used to run Airflow. # Default: apache/airflow:2.2.3 # AIRFLOW_UID - User ID in Airflow containers # Default: 50000 # Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode # # _AIRFLOW_WWW_USER_USERNAME - Username for the administrator account (if requested). # Default: airflow # _AIRFLOW_WWW_USER_PASSWORD - Password for the administrator account (if requested). # Default: airflow # _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers. # Default: '' # # Feel free to modify this file to suit your needs. --- version: '3' x-airflow-common: &airflow-common # In order to add custom dependencies or upgrade provider packages you can use your extended image. # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml # and uncomment the "build" line below, Then run `docker-compose build` to build the images. 
build: context: . dockerfile: ./Dockerfile environment: &airflow-common-env AIRFLOW__CORE__EXECUTOR: LocalExecutor AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow AIRFLOW__CORE__FERNET_KEY: '' AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true' AIRFLOW__CORE__LOAD_EXAMPLES: 'false' AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth' _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-} GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json' GCP_PROJECT_ID: 'abc' GCP_GCS_BUCKET: "abc" volumes: - ./dags:/opt/airflow/dags - ./logs:/opt/airflow/logs - ./plugins:/opt/airflow/plugins - ~/.google/credentials/:/.google/credentials:ro user: "${AIRFLOW_UID:-50000}:0" depends_on: &airflow-common-depends-on postgres: condition: service_healthy services: postgres: image: postgres:13 environment: POSTGRES_USER: airflow POSTGRES_PASSWORD: airflow POSTGRES_DB: airflow volumes: - postgres-db-volume:/var/lib/postgresql/data healthcheck: test: ["CMD", "pg_isready", "-U", "airflow"] interval: 5s retries: 5 restart: always airflow-webserver: <<: *airflow-common command: webserver ports: - 8080:8080 healthcheck: test: ["CMD", "curl", "--fail", "http://localhost:8080/health"] interval: 10s timeout: 10s retries: 5 restart: always depends_on: <<: *airflow-common-depends-on airflow-init: condition: service_completed_successfully airflow-scheduler: <<: *airflow-common command: scheduler healthcheck: test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"'] interval: 10s timeout: 10s retries: 5 restart: always depends_on: <<: *airflow-common-depends-on airflow-init: condition: service_completed_successfully airflow-init: <<: *airflow-common entrypoint: /bin/bash # yamllint disable rule:line-length command: - -c - | function ver() { 
printf "%04d%04d%04d%04d" $${1//./ } } airflow_version=$$(gosu airflow airflow version) airflow_version_comparable=$$(ver $${airflow_version}) min_airflow_version=2.2.0 min_airflow_version_comparable=$$(ver $${min_airflow_version}) if (( airflow_version_comparable < min_airflow_version_comparable )); then echo echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m" echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!" echo exit 1 fi if [[ -z "${AIRFLOW_UID}" ]]; then echo echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m" echo "If you are on Linux, you SHOULD follow the instructions below to set " echo "AIRFLOW_UID environment variable, otherwise files will be owned by root." echo "For other operating systems you can get rid of the warning with manually created .env file:" echo " See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user" echo fi one_meg=1048576 mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg)) cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat) disk_available=$$(df / | tail -1 | awk '{print $$4}') warning_resources="false" if (( mem_available < 4000 )) ; then echo echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m" echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))" echo warning_resources="true" fi if (( cpus_available < 2 )); then echo echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m" echo "At least 2 CPUs recommended. You have $${cpus_available}" echo warning_resources="true" fi if (( disk_available < one_meg * 10 )); then echo echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m" echo "At least 10 GBs recommended. 
You have $$(numfmt --to iec $$((disk_available * 1024 )))" echo warning_resources="true" fi if [[ $${warning_resources} == "true" ]]; then echo echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m" echo "Please follow the instructions to increase amount of resources available:" echo " https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin" echo fi mkdir -p /sources/logs /sources/dags /sources/plugins chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins} exec /entrypoint airflow version # yamllint enable rule:line-length environment: <<: *airflow-common-env _AIRFLOW_DB_UPGRADE: 'true' _AIRFLOW_WWW_USER_CREATE: 'true' _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow} _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow} user: "0:0" volumes: - .:/sources airflow-cli: <<: *airflow-common profiles: - debug environment: <<: *airflow-common-env CONNECTION_CHECK_MAX_COUNT: "0" # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252 command: - bash - -c - airflow volumes: postgres-db-volume: ================================================ FILE: cohorts/2022/week_2_data_ingestion/airflow/docs/1_concepts.md ================================================ ## Airflow concepts ### Airflow architecture ![](arch-diag-airflow.png) Ref: https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html * **Web server**: GUI to inspect, trigger and debug the behaviour of DAGs and tasks. Available at http://localhost:8080. * **Scheduler**: Responsible for scheduling jobs. Handles both triggering & scheduled workflows, submits Tasks to the executor to run, monitors all tasks and DAGs, and then triggers the task instances once their dependencies are complete. * **Worker**: This component executes the tasks given by the scheduler. * **Metadata database (postgres)**: Backend to the Airflow environment. 
Used by the scheduler, executor and webserver to store state.
* **Other components** (seen in docker-compose services):
    * `redis`: Message broker that forwards messages from scheduler to worker.
    * `flower`: The flower app for monitoring the environment. It is available at http://localhost:5555.
    * `airflow-init`: initialization service (customized as per this design)

All these services allow you to run Airflow with CeleryExecutor.
For more information, see [Architecture Overview](https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html).


### Project Structure:

* `./dags` - `DAG_FOLDER` for DAG files (use `./dags_local` for the local ingestion DAG)
* `./logs` - contains logs from task execution and scheduler.
* `./plugins` - for custom plugins


### Workflow components

* `DAG`: Directed acyclic graph. It specifies the dependencies between a set of tasks with an explicit execution order. Because it contains no cycles (hence, “acyclic”), every run has a beginning as well as an end.
    * `DAG Structure`: DAG Definition, Tasks (e.g. Operators), Task Dependencies (control flow: `>>` or `<<` )

* `Task`: a defined unit of work (in Airflow, typically an instantiated operator). The Tasks themselves describe what to do, be it fetching data, running analysis, triggering other systems, or more.
    * Common Types: Operators (used in this workshop), Sensors, TaskFlow decorators
    * Sub-classes of Airflow's BaseOperator

* `DAG Run`: individual execution/run of a DAG
    * scheduled or triggered

* `Task Instance`: an individual run of a single task. Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc.
    * Ideally, a task should flow from `none`, to `scheduled`, to `queued`, to `running`, and finally to `success`.
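
As a rough illustration of the `DAG` and task-dependency ideas above, the sketch below uses only Python's standard-library `graphlib` module (Python 3.9+) to resolve an execution order from upstream dependencies. It is not how Airflow's scheduler is implemented. The task names are borrowed from the workshop's GCS ingestion DAG, and the `execution_order` helper is a hypothetical name introduced here for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Assumption: task names mirror the workshop's GCS ingestion DAG.
# Each key maps a task to the set of tasks it depends on (its upstream tasks),
# which is the same information `a >> b` expresses in an Airflow DAG file.
dependencies = {
    "download_dataset_task": set(),
    "format_to_parquet_task": {"download_dataset_task"},
    "local_to_gcs_task": {"format_to_parquet_task"},
    "bigquery_external_table_task": {"local_to_gcs_task"},
}


def execution_order(deps):
    """Return one valid run order (hypothetical helper, not an Airflow API).

    Raises graphlib.CycleError if the dependencies contain a cycle,
    i.e. if the graph is not actually acyclic.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(" >> ".join(order))

# A cyclic "DAG" has no valid order, so it is rejected.
try:
    execution_order({"task_a": {"task_b"}, "task_b": {"task_a"}})
except CycleError as err:
    print("not a DAG:", err.args[0])
```

Because every task here has exactly one upstream task, there is only one valid order. With branching dependencies several orders can be valid, which is what allows independent tasks to run in parallel.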
### References

https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html

https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/extras/data_ingestion_gcs_dag_ex2.py
================================================
import os

from datetime import datetime

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from google.cloud import storage

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "pivotal-surfer-336713")
BUCKET = os.environ.get("GCP_GCS_BUCKET", "dtc_data_lake_pivotal-surfer-336713")

dataset_file = "yellow_tripdata_2021-01.csv"
dataset_url = f"https://s3.amazonaws.com/nyc-tlc/trip+data/{dataset_file}"
path_to_local_home = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")
path_to_creds = f"{path_to_local_home}/google_credentials.json"

default_args = {
    "owner": "airflow",
    "start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}


# # Takes 15-20 mins to run. Good case for using Spark (distributed processing, in place of chunks)
# def upload_to_gcs(bucket, object_name, local_file):
#     """
#     Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
#     :param bucket: GCS bucket name
#     :param object_name: target path & file-name
#     :param local_file: source path & file-name
#     :return:
#     """
#     # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload link.
#     # (Ref: https://github.com/googleapis/python-storage/issues/74)
#     storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
#     storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB
#
#     client = storage.Client()
#     bucket = client.bucket(bucket)
#
#     blob = bucket.blob(object_name)
#     # blob.chunk_size = 5 * 1024 * 1024
#     blob.upload_from_filename(local_file)


with DAG(
    dag_id="data_ingestion_gcs_dag",
    schedule_interval="@daily",
    default_args=default_args,
    catchup=True,
    max_active_runs=1,
) as dag:

    # Takes ~2 mins, depending upon your internet's download speed
    download_dataset_task = BashOperator(
        task_id="download_dataset_task",
        bash_command=f"curl -sS {dataset_url} > {path_to_local_home}/{dataset_file}"  # "&& unzip {zip_file} && rm {zip_file}"
    )

    # # APPROACH 1: (takes 20 mins, at an upload speed of 800Kbps. Faster if your internet has a better upload speed)
    # upload_to_gcs_task = PythonOperator(
    #     task_id="upload_to_gcs_task",
    #     python_callable=upload_to_gcs,
    #     op_kwargs={
    #         "bucket": BUCKET,
    #         "object_name": f"raw/{dataset_file}",
    #         "local_file": f"{path_to_local_home}/{dataset_file}",
    #
    #     },
    # )

    # OR APPROACH 2: (takes 20 mins, at an upload speed of 800Kbps. Faster if your internet has a better upload speed)
    # Ref: https://cloud.google.com/blog/products/gcp/optimizing-your-cloud-storage-performance-google-cloud-performance-atlas
    upload_to_gcs_task = BashOperator(
        task_id="upload_to_gcs_task",
        bash_command=f"gcloud auth activate-service-account --key-file={path_to_creds} && \
        gsutil -m cp {path_to_local_home}/{dataset_file} gs://{BUCKET}",
    )

    download_dataset_task >> upload_to_gcs_task


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/extras/web_to_gcs.sh
================================================
dataset_url=${dataset_url}
dataset_file=${dataset_file}
path_to_local_file=${path_to_local_file}
path_to_creds=${path_to_creds}

curl -sS "$dataset_url" > $path_to_local_file/$dataset_file
gcloud auth activate-service-account --key-file=$path_to_creds
gsutil -m cp $path_to_local_file/$dataset_file gs://$BUCKET


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/requirements.txt
================================================
apache-airflow-providers-google
pyarrow


================================================
FILE: cohorts/2022/week_2_data_ingestion/airflow/scripts/entrypoint.sh
================================================
#!/usr/bin/env bash
export GOOGLE_APPLICATION_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS}
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=${AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT}

airflow db upgrade

airflow users create -r Admin -u admin -p admin -e admin@example.com -f admin -l airflow
# "$_AIRFLOW_WWW_USER_USERNAME" -p "$_AIRFLOW_WWW_USER_PASSWORD"

airflow webserver


================================================
FILE: cohorts/2022/week_2_data_ingestion/homework/homework.md
================================================
## Week 2 Homework

In this homework, we'll prepare data for the next week.
We'll need to put these datasets into our data lake:

* For the lessons, we'll need the Yellow taxi dataset (years 2019 and 2020)
* For the homework, we'll need FHV Data (for-hire vehicles, for 2019 only)

You can find all the URLs on [the dataset page](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)


In this homework, we will:

* Modify the DAG we created during the lessons for transferring the yellow taxi data
* Create a new DAG for transferring the FHV data
* Create another DAG for the Zones data


If you don't have access to GCP, you can do that locally and ingest data to Postgres
instead. If you have access to GCP, ingesting into local Postgres as well is optional.

Also note that for this homework we don't need the last step - creating a table in GCP.
After putting all the files into the data lake, we'll create the tables in Week 3.


## Question 1: Start date for the Yellow taxi data (1 point)

You'll need to parametrize the DAG for processing the yellow taxi data that
we created in the videos.

What should be the start date for this DAG?

* 2019-01-01
* 2020-01-01
* 2021-01-01
* days_ago(1)


## Question 2: Frequency for the Yellow taxi data (1 point)

How often do we need to run this DAG?

* Daily
* Monthly
* Yearly
* Once


## Re-running the DAGs for past dates

To execute your DAG for past dates, try this:

* First, delete your DAG from the web interface (the bin icon)
* Set the `catchup` parameter to `True`
* Be careful with running a lot of jobs in parallel - your system may not like it. Don't set `max_active_runs` higher than 3: `max_active_runs=3`
* Rename the DAG to something like `data_ingestion_gcs_dag_v02`
* Execute it from the Airflow GUI (the play button)


Also, there's no data for the most recent months, but `curl` will still exit successfully when the file is missing.
To make it fail on a 404, add the `-f` flag:

```bash
curl -sSLf { URL } > { LOCAL_PATH }
```

When you run this for all the data, the temporary files will be saved in Docker and will consume your
disk space.
If it causes problems for you, add another step in your DAG that cleans everything up.
It could be a `BashOperator` that runs this command:

```bash
rm name-of-csv-file.csv name-of-parquet-file.parquet
```


## Question 3: DAG for FHV Data (2 points)

Now create another DAG - for uploading the FHV data.

We will need three steps:

* Download the data
* Parquetize it
* Upload to GCS

If you don't have a GCP account, for local ingestion you'll need two steps:

* Download the data
* Ingest to Postgres

Use the same frequency and start date as for the yellow taxi dataset.

Question: how many DAG runs for the 2019 data are green (successful) once everything has finished?

Note: when processing the data for 2020-01 you will probably get an error. It's up to you to decide what to do with it -
for the Week 3 homework we won't need the 2020 data.


## Question 4: DAG for Zones (2 points)

Create the final DAG - for Zones:

* Download it
* Parquetize
* Upload to GCS

(Or two steps for local ingestion: download -> ingest to Postgres)

How often does it need to run?

* Daily
* Monthly
* Yearly
* Once


## Submitting the solutions

* Form for submitting: https://forms.gle/ViWS8pDf2tZD4zSu5
* You can submit your homework multiple times. In this case, only the last submission will be used.
Deadline: February 7, 17:00 CET


================================================
FILE: cohorts/2022/week_2_data_ingestion/homework/solution.py
================================================
import os
import logging

from datetime import datetime

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

from google.cloud import storage

import pyarrow.csv as pv
import pyarrow.parquet as pq


PROJECT_ID = os.environ.get("GCP_PROJECT_ID")
BUCKET = os.environ.get("GCP_GCS_BUCKET")
AIRFLOW_HOME = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")


def format_to_parquet(src_file, dest_file):
    # Convert a local CSV file to Parquet
    if not src_file.endswith('.csv'):
        logging.error("Can only accept source files in CSV format, for the moment")
        return
    table = pv.read_csv(src_file)
    pq.write_table(table, dest_file)


def upload_to_gcs(bucket, object_name, local_file):
    # Upload a local file to the given GCS bucket under object_name
    client = storage.Client()
    bucket = client.bucket(bucket)
    blob = bucket.blob(object_name)
    blob.upload_from_filename(local_file)


default_args = {
    "owner": "airflow",
    #"start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}


def download_parquetize_upload_dag(
    dag,
    url_template,
    local_csv_path_template,
    local_parquet_path_template,
    gcs_path_template
):
    # Attach the four tasks (download -> parquetize -> upload -> cleanup) to the given DAG
    with dag:
        download_dataset_task = BashOperator(
            task_id="download_dataset_task",
            # -f makes curl fail on HTTP errors such as 404
            bash_command=f"curl -sSLf {url_template} > {local_csv_path_template}"
        )

        format_to_parquet_task = PythonOperator(
            task_id="format_to_parquet_task",
            python_callable=format_to_parquet,
            op_kwargs={
                "src_file": local_csv_path_template,
                "dest_file": local_parquet_path_template
            },
        )

        local_to_gcs_task = PythonOperator(
            task_id="local_to_gcs_task",
            python_callable=upload_to_gcs,
            op_kwargs={
                "bucket": BUCKET,
                "object_name": gcs_path_template,
                "local_file": local_parquet_path_template,
            },
        )

        # Remove the temporary files so they don't fill up the disk
        rm_task = BashOperator(
            task_id="rm_task",
            bash_command=f"rm {local_csv_path_template} {local_parquet_path_template}"
        )

        download_dataset_task >> format_to_parquet_task >> local_to_gcs_task >> rm_task


URL_PREFIX = 'https://s3.amazonaws.com/nyc-tlc/trip+data'

YELLOW_TAXI_URL_TEMPLATE = URL_PREFIX + '/yellow_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
YELLOW_TAXI_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/yellow_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
YELLOW_TAXI_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/yellow_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet'
YELLOW_TAXI_GCS_PATH_TEMPLATE = "raw/yellow_tripdata/{{ execution_date.strftime(\'%Y\') }}/yellow_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet"

# Monthly schedule: 06:00 on the 2nd day of each month
yellow_taxi_data_dag = DAG(
    dag_id="yellow_taxi_data_v2",
    schedule_interval="0 6 2 * *",
    start_date=datetime(2019, 1, 1),
    default_args=default_args,
    catchup=True,
    max_active_runs=3,
    tags=['dtc-de'],
)

download_parquetize_upload_dag(
    dag=yellow_taxi_data_dag,
    url_template=YELLOW_TAXI_URL_TEMPLATE,
    local_csv_path_template=YELLOW_TAXI_CSV_FILE_TEMPLATE,
    local_parquet_path_template=YELLOW_TAXI_PARQUET_FILE_TEMPLATE,
    gcs_path_template=YELLOW_TAXI_GCS_PATH_TEMPLATE
)

# https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2021-01.csv

GREEN_TAXI_URL_TEMPLATE = URL_PREFIX + '/green_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
GREEN_TAXI_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/green_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
GREEN_TAXI_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/green_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet'
GREEN_TAXI_GCS_PATH_TEMPLATE = "raw/green_tripdata/{{ execution_date.strftime(\'%Y\') }}/green_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet"

green_taxi_data_dag = DAG(
    dag_id="green_taxi_data_v1",
    schedule_interval="0 7 2 * *",
    start_date=datetime(2019, 1, 1),
    default_args=default_args,
    catchup=True,
    max_active_runs=3,
    tags=['dtc-de'],
)

download_parquetize_upload_dag(
    dag=green_taxi_data_dag,
    url_template=GREEN_TAXI_URL_TEMPLATE,
    local_csv_path_template=GREEN_TAXI_CSV_FILE_TEMPLATE,
    local_parquet_path_template=GREEN_TAXI_PARQUET_FILE_TEMPLATE,
    gcs_path_template=GREEN_TAXI_GCS_PATH_TEMPLATE
)

# https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-01.csv

FHV_TAXI_URL_TEMPLATE = URL_PREFIX + '/fhv_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
FHV_TAXI_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/fhv_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.csv'
FHV_TAXI_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/fhv_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet'
FHV_TAXI_GCS_PATH_TEMPLATE = "raw/fhv_tripdata/{{ execution_date.strftime(\'%Y\') }}/fhv_tripdata_{{ execution_date.strftime(\'%Y-%m\') }}.parquet"

# end_date limits the FHV runs to the 2019 data
fhv_taxi_data_dag = DAG(
    dag_id="fhv_taxi_data_v1",
    schedule_interval="0 8 2 * *",
    start_date=datetime(2019, 1, 1),
    end_date=datetime(2020, 1, 1),
    default_args=default_args,
    catchup=True,
    max_active_runs=3,
    tags=['dtc-de'],
)

download_parquetize_upload_dag(
    dag=fhv_taxi_data_dag,
    url_template=FHV_TAXI_URL_TEMPLATE,
    local_csv_path_template=FHV_TAXI_CSV_FILE_TEMPLATE,
    local_parquet_path_template=FHV_TAXI_PARQUET_FILE_TEMPLATE,
    gcs_path_template=FHV_TAXI_GCS_PATH_TEMPLATE
)

# https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv

ZONES_URL_TEMPLATE = 'https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv'
ZONES_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/taxi_zone_lookup.csv'
ZONES_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/taxi_zone_lookup.parquet'
ZONES_GCS_PATH_TEMPLATE = "raw/taxi_zone/taxi_zone_lookup.parquet"

# The zones lookup is a single static file, so this DAG runs once
zones_data_dag = DAG(
    dag_id="zones_data_v1",
    schedule_interval="@once",
    start_date=days_ago(1),
    default_args=default_args,
    catchup=True,
    max_active_runs=3,
    tags=['dtc-de'],
)

download_parquetize_upload_dag(
    dag=zones_data_dag,
    url_template=ZONES_URL_TEMPLATE,
    local_csv_path_template=ZONES_CSV_FILE_TEMPLATE,
    local_parquet_path_template=ZONES_PARQUET_FILE_TEMPLATE,
    gcs_path_template=ZONES_GCS_PATH_TEMPLATE
)


================================================
FILE: cohorts/2022/week_2_data_ingestion/transfer_service/README.md
================================================
## Generate AWS Access key
- Log in to your AWS account
- Search for IAM
  ![aws iam](../../images/aws/iam.png)
- Click on `Manage access key`
- Click on `Create New Access Key`
- Download the csv; your access key and secret will be in that csv (please note that once lost, the secret cannot be recovered)

## Transfer service
https://console.cloud.google.com/transfer/cloud/jobs


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/.env_example
================================================
# Custom
COMPOSE_PROJECT_NAME=dtc-de
GOOGLE_APPLICATION_CREDENTIALS=/.google/credentials/google_credentials.json
AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json
# AIRFLOW_UID=
GCP_PROJECT_ID=
GCP_GCS_BUCKET=

# Postgres
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow

# Airflow
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=10

AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}
AIRFLOW_CONN_METADATA_DB=postgres+psycopg2://airflow:airflow@postgres:5432/airflow
AIRFLOW_VAR__METADATA_DB_SCHEMA=airflow

_AIRFLOW_WWW_USER_CREATE=True
_AIRFLOW_WWW_USER_USERNAME=${_AIRFLOW_WWW_USER_USERNAME:airflow}
_AIRFLOW_WWW_USER_PASSWORD=${_AIRFLOW_WWW_USER_PASSWORD:airflow}

AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True
AIRFLOW__CORE__LOAD_EXAMPLES=False


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/1_setup_official.md
================================================
## Setup (Official)

### Pre-Reqs

1. For the sake of standardization across this workshop's config,
    rename your gcp-service-accounts-credentials file to `google_credentials.json` & store it under your `$HOME` directory

    ```bash
    cd ~ && mkdir -p ~/.google/credentials/
    mv .json ~/.google/credentials/google_credentials.json
    ```

2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to a minimum of 5GB
(ideally 8GB). If not enough memory is allocated, airflow-webserver might keep restarting.

3. Python version: 3.7+


### Airflow Setup

1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in)

2. **Set the Airflow user**:

    On Linux, the quick-start needs to know your host user-id and needs to have group id set to 0.
    Otherwise the files created in `dags`, `logs` and `plugins` will be owned by the root user.
    You have to make sure to configure them for the docker-compose:

    ```bash
    mkdir -p ./dags ./logs ./plugins
    echo -e "AIRFLOW_UID=$(id -u)" > .env
    ```

    On Windows you will probably also need it. If you use MINGW/GitBash, execute the same command.

    To get rid of the warning ("AIRFLOW_UID is not set"), you can create a `.env` file with
    this content:

    ```
    AIRFLOW_UID=50000
    ```

3. **Import the official docker setup file** from the latest Airflow version:
   ```shell
   curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'
   ```

4. It could be overwhelming to see a lot of services in here.
   But this is only a quick-start template, and as you proceed you'll figure out which unused services can be removed.
   E.g., [here's](docker-compose-nofrills.yml) a no-frills version of that template.

5. **Docker Build**:

    When you want to run Airflow locally, you might want to use an extended image,
    containing some additional dependencies - for example you might add new python packages,
    or upgrade airflow providers to a later version.

    Create a `Dockerfile` pointing to the Airflow version you've just downloaded,
    such as `apache/airflow:2.2.3`, as the base image,

    and customize this `Dockerfile` by:
    * Adding your custom packages to be installed. The one we'll need the most is `gcloud` to connect with the GCS bucket/Data Lake.
    * Also, integrating `requirements.txt` to install libraries via `pip install`

6. **Docker Compose**:

    Back in your `docker-compose.yaml`:
   * In `x-airflow-common`:
     * Remove the `image` tag, to replace it with your `build` from your Dockerfile, as shown
     * Mount your `google_credentials` in `volumes` section as read-only
     * Set environment variables: `GCP_PROJECT_ID`, `GCP_GCS_BUCKET`, `GOOGLE_APPLICATION_CREDENTIALS` & `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`, as per your config.
   * Change `AIRFLOW__CORE__LOAD_EXAMPLES` to `false` (optional)

7. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose.yaml](./docker-compose.yaml) should look.


## Problems

### `File /.google/credentials/google_credentials.json was not found`

First, make sure you have your credentials in your `$HOME/.google/credentials`.
Maybe you missed the step and didn't copy your JSON with credentials there?
Also, make sure the file name is `google_credentials.json`.

Second, check that docker-compose can correctly map this directory to the airflow worker.

Execute `docker ps` to see the list of docker containers running on your host machine and find the ID of the airflow worker.

Then execute `bash` on this container:

```bash
docker exec -it bash
```

Now check if the file with credentials is actually there:

```bash
ls -lh /.google/credentials/
```

If it's empty, docker-compose couldn't map the folder with credentials.
In this case, try changing it to the absolute path to this folder:

```yaml
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    # here: ----------------------------
    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro
    # -----------------------------------
```


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/2_setup_nofrills.md
================================================
## Setup (No-frills)

### Pre-Reqs

1. For the sake of standardization across this workshop's config,
    rename your gcp-service-accounts-credentials file to `google_credentials.json` & store it under your `$HOME` directory

    ```bash
    cd ~ && mkdir -p ~/.google/credentials/
    mv .json ~/.google/credentials/google_credentials.json
    ```

2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to a minimum of 4GB
(ideally 8GB). If not enough memory is allocated, airflow-webserver might keep restarting.

3. Python version: 3.7+


### Airflow Setup

1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in)

2. **Set the Airflow user**:

    On Linux, the quick-start needs to know your host user-id and needs to have group id set to 0.
    Otherwise the files created in `dags`, `logs` and `plugins` will be owned by the root user.
    You have to make sure to configure them for the docker-compose:

    ```bash
    mkdir -p ./dags ./logs ./plugins
    echo -e "AIRFLOW_UID=$(id -u)" >> .env
    ```

    On Windows you will probably also need it. If you use MINGW/GitBash, execute the same command.

    To get rid of the warning ("AIRFLOW_UID is not set"), you can create a `.env` file with
    this content:

    ```
    AIRFLOW_UID=50000
    ```

3. **Docker Build**:

    When you want to run Airflow locally, you might want to use an extended image,
    containing some additional dependencies - for example you might add new python packages,
    or upgrade airflow providers to a later version.

    Create a `Dockerfile` pointing to the latest Airflow version such as `apache/airflow:2.2.3`, for the base image,

    and customize this `Dockerfile` by:
    * Adding your custom packages to be installed. The one we'll need the most is `gcloud` to connect with the GCS bucket (Data Lake).
    * Also, integrating `requirements.txt` to install libraries via `pip install`

4. Copy [docker-compose-nofrills.yml](docker-compose-nofrills.yml), [.env_example](.env_example) & [entrypoint.sh](scripts/entrypoint.sh) from this repo.
    The changes from the official setup are:
    * Removal of the `redis` queue, `worker`, `triggerer`, `flower` & `airflow-init` services,
    and changing from `CeleryExecutor` (multi-node) mode to `LocalExecutor` (single-node) mode
    * Inclusion of `.env` for better parametrization & flexibility
    * Inclusion of a simple `entrypoint.sh` in the `webserver` container, responsible for initializing the database and creating the login user (admin).
    * Updated `Dockerfile` to grant permissions on executing `scripts/entrypoint.sh`

5. `.env`:
    * Rebuild your `.env` file by making a copy of `.env_example` (but make sure your `AIRFLOW_UID` remains):
        ```shell
        mv .env_example .env
        ```
    * Set environment variables `AIRFLOW_UID`, `GCP_PROJECT_ID` & `GCP_GCS_BUCKET`, as per your config.
    * Optionally, if your `google_credentials.json` is stored somewhere else, such as a path like `$HOME/.gc`,
    modify the env-vars (`GOOGLE_APPLICATION_CREDENTIALS`, `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`) and the `volumes` path in `docker-compose-nofrills.yml`

6. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose-nofrills](./docker-compose-nofrills.yml) should look.

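
Before starting the containers, it can help to confirm that the `.env` file from step 5 actually defines the variables it needs. The sketch below is one hypothetical way to do that in Python; `parse_env`, `missing_env_vars` and `REQUIRED_VARS` are illustrative names that are not part of this repo, and the parser only handles plain `KEY=VALUE` lines (no quoting or variable expansion).

```python
# Hypothetical helper, not part of the course repo: checks that a .env file
# defines the variables named in step 5 with non-empty values.
REQUIRED_VARS = ("AIRFLOW_UID", "GCP_PROJECT_ID", "GCP_GCS_BUCKET")


def parse_env(text):
    """Parse plain KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_env_vars(text, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty."""
    values = parse_env(text)
    return [name for name in required if not values.get(name)]
```

For example, `missing_env_vars(open(".env").read())` returns an empty list when all three variables are set. A commented-out line such as `# AIRFLOW_UID=` counts as missing, which matches the reminder above that `AIRFLOW_UID` must remain in the file.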
## Problems

### `no-frills setup does not work for me - WSL/Windows user`

If you are running Docker in Windows/WSL/WSL2 and you have encountered some `ModuleNotFoundError` or low performance issues,
take a look at this [Airflow & WSL2 gist](https://gist.github.com/nervuzz/d1afe81116cbfa3c834634ebce7f11c5) focused entirely on troubleshooting possible problems.

### `File /.google/credentials/google_credentials.json was not found`

First, make sure you have your credentials in your `$HOME/.google/credentials`.
Maybe you missed the step and didn't copy your JSON with credentials there?
Also, make sure the file name is `google_credentials.json`.

Second, check that docker-compose can correctly map this directory to the airflow worker.

Execute `docker ps` to see the list of docker containers running on your host machine and find the ID of the airflow worker.

Then execute `bash` on this container:

```bash
docker exec -it bash
```

Now check if the file with credentials is actually there:

```bash
ls -lh /.google/credentials/
```

If it's empty, docker-compose couldn't map the folder with credentials.
In this case, try changing it to the absolute path to this folder:

```yaml
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    # here: ----------------------------
    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro
    # -----------------------------------
```


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/README.md
================================================
### Concepts

 [Airflow Concepts and Architecture](../week_2_data_ingestion/airflow/docs/1_concepts.md)

### Workflow

 ![](docs/gcs_2_bq_dag_graph_view.png)

 ![](docs/gcs_2_bq_dag_tree_view.png)

### Setup - Official Version
 (For the section on the Custom/Lightweight setup, scroll down)

 #### Setup
  [Airflow Setup with Docker, through official guidelines](1_setup_official.md)

 #### Execution

  1. Build the image (only the first time, or when there's any change in the `Dockerfile`; takes ~15 mins the first time):
     ```shell
     docker-compose build
     ```

     or (for legacy versions)

     ```shell
     docker build .
     ```

 2. Initialize the Airflow scheduler, DB, and other config
    ```shell
    docker-compose up airflow-init
    ```

 3. Kick up all the services from the container:
    ```shell
    docker-compose up
    ```

 4. In another terminal, run `docker-compose ps` to see which containers are up & running (there should be 7, matching the services in your docker-compose file).

 5. Log in to the Airflow web UI on `localhost:8080` with default creds: `airflow/airflow`

 6. Run your DAG on the Web Console.

 7. On finishing your run or to shut down the container/s:
    ```shell
    docker-compose down
    ```

    To stop and delete containers, delete volumes with database data, and delete downloaded images, run:
    ```
    docker-compose down --volumes --rmi all
    ```

    or
    ```
    docker-compose down --volumes --remove-orphans
    ```

### Setup - Custom No-Frills Version (Lightweight)
This is a quick, simple & less memory-intensive setup of Airflow that works on a LocalExecutor.

  #### Setup
  [Airflow Setup with Docker, customized](2_setup_nofrills.md)

  #### Execution

  1. Stop and delete containers, delete volumes with database data, & downloaded images (from the previous setup):
    ```
    docker-compose down --volumes --rmi all
    ```

   or
    ```
    docker-compose down --volumes --remove-orphans
    ```

   Or, if you need to clear your system of any pre-cached Docker issues:
    ```
    docker system prune
    ```

   Also, empty the airflow `logs` directory.

  2. Build the image (only the first time, or when there's any change in the `Dockerfile`):
  Takes ~5-10 mins the first time
    ```shell
    docker-compose build
    ```
    or (for legacy versions)
    ```shell
    docker build .
    ```

  3. Kick up all the services from the container (no need to specially initialize):
    ```shell
    docker-compose -f docker-compose-nofrills.yml up
    ```

  4. In another terminal, run `docker ps` to see which containers are up & running (there should be 3, matching the services in your docker-compose file).

  5. Log in to the Airflow web UI on `localhost:8080` with creds: `admin/admin` (explicit creation of the admin user was required)

  6. Run your DAG on the Web Console.

  7. On finishing your run or to shut down the container/s:
    ```shell
    docker-compose down
    ```

### Future Enhancements
* Deploy a self-hosted Airflow setup on a Kubernetes cluster, or use a Managed Airflow (Cloud Composer) service by GCP

### References
For more info, check out these official docs:
   * https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html
   * https://airflow.apache.org/docs/docker-stack/build.html
   * https://airflow.apache.org/docs/docker-stack/recipes.html


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/dags/gcs_to_bq_dag.py
================================================
import os
import logging

from airflow import DAG
from airflow.utils.dates import days_ago

from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator, BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator

PROJECT_ID = os.environ.get("GCP_PROJECT_ID")
BUCKET = os.environ.get("GCP_GCS_BUCKET")

path_to_local_home = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")
BIGQUERY_DATASET = os.environ.get("BIGQUERY_DATASET", 'trips_data_all')

DATASET = "tripdata"
# Maps each taxi colour to the pickup timestamp column used for partitioning
COLOUR_RANGE = {'yellow': 'tpep_pickup_datetime', 'green': 'lpep_pickup_datetime'}
INPUT_PART = "raw"
INPUT_FILETYPE = "parquet"

default_args = {
    "owner": "airflow",
    "start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}

# NOTE: DAG declaration - using a Context Manager (an implicit way)
with DAG(
    dag_id="gcs_2_bq_dag",
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
    max_active_runs=1,
    tags=['dtc-de'],
) as dag:

    for colour, ds_col in COLOUR_RANGE.items():
        # Move the raw files into a per-colour folder in the same bucket
        move_files_gcs_task = GCSToGCSOperator(
            task_id=f'move_{colour}_{DATASET}_files_task',
            source_bucket=BUCKET,
            source_object=f'{INPUT_PART}/{colour}_{DATASET}*.{INPUT_FILETYPE}',
            destination_bucket=BUCKET,
            destination_object=f'{colour}/{colour}_{DATASET}',
            move_object=True
        )

        bigquery_external_table_task = BigQueryCreateExternalTableOperator(
            task_id=f"bq_{colour}_{DATASET}_external_table_task",
            table_resource={
                "tableReference": {
                    "projectId": PROJECT_ID,
                    "datasetId": BIGQUERY_DATASET,
                    "tableId": f"{colour}_{DATASET}_external_table",
                },
                "externalDataConfiguration": {
                    "autodetect": "True",
                    "sourceFormat": f"{INPUT_FILETYPE.upper()}",
                    "sourceUris": [f"gs://{BUCKET}/{colour}/*"],
                },
            },
        )

        CREATE_BQ_TBL_QUERY = (
            f"CREATE OR REPLACE TABLE {BIGQUERY_DATASET}.{colour}_{DATASET} \
            PARTITION BY DATE({ds_col}) \
            AS \
            SELECT * FROM {BIGQUERY_DATASET}.{colour}_{DATASET}_external_table;"
        )

        # Create a partitioned table from external table
        bq_create_partitioned_table_job = BigQueryInsertJobOperator(
            task_id=f"bq_create_{colour}_{DATASET}_partitioned_table_task",
            configuration={
                "query": {
                    "query": CREATE_BQ_TBL_QUERY,
                    "useLegacySql": False,
                }
            }
        )

        move_files_gcs_task >> bigquery_external_table_task >> bq_create_partitioned_table_job


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/docker-compose-nofrills.yml
================================================
version: '3'
services:
    postgres:
        image: postgres:13
        env_file:
            - .env
        volumes:
            - postgres-db-volume:/var/lib/postgresql/data
        healthcheck:
            test: ["CMD", "pg_isready", "-U", "airflow"]
            interval: 5s
            retries: 5
        restart: always

    scheduler:
        build: .
        command: scheduler
        restart: on-failure
        depends_on:
            - postgres
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
            - ./plugins:/opt/airflow/plugins
            - ./scripts:/opt/airflow/scripts
            - ~/.google/credentials/:/.google/credentials:ro

    webserver:
        build: .
        entrypoint: ./scripts/entrypoint.sh
        restart: on-failure
        depends_on:
            - postgres
            - scheduler
        env_file:
            - .env
        volumes:
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
            - ./plugins:/opt/airflow/plugins
            - ~/.google/credentials/:/.google/credentials:ro
            - ./scripts:/opt/airflow/scripts

        user: "${AIRFLOW_UID:-50000}:0"
        ports:
            - "8080:8080"
        healthcheck:
            test: [ "CMD-SHELL", "[ -f /home/airflow/airflow-webserver.pid ]" ]
            interval: 30s
            timeout: 30s
            retries: 3

volumes:
  postgres-db-volume:


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/docker-compose.yaml
================================================
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.
#
# WARNING: This configuration is for local development. Do not use it in a production deployment.
#
# This configuration supports basic configuration using environment variables or an .env file
# The following variables are supported:
#
# AIRFLOW_IMAGE_NAME           - Docker image name used to run Airflow.
#                                Default: apache/airflow:2.2.3
# AIRFLOW_UID                  - User ID in Airflow containers
#                                Default: 50000
# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode
#
# _AIRFLOW_WWW_USER_USERNAME   - Username for the administrator account (if requested).
#                                Default: airflow
# _AIRFLOW_WWW_USER_PASSWORD   - Password for the administrator account (if requested).
#                                Default: airflow
# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.
#                                Default: ''
#
# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, Then run `docker-compose build` to build the images.
  build:
    context: .
    dockerfile: ./Dockerfile
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
#    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
#    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json
    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'

    # TODO: Please change GCP_PROJECT_ID & GCP_GCS_BUCKET, as per your config
    GCP_PROJECT_ID: 'pivotal-surfer-336713'
    GCP_GCS_BUCKET: 'dtc_data_lake_pivotal-surfer-336713'

  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
    - ~/.google/credentials/:/.google/credentials:ro

  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
#    redis:
#      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

#  redis:
#    image: redis:latest
#    expose:
#      - 6379
#    healthcheck:
#      test: ["CMD", "redis-cli", "ping"]
#      interval: 5s
#      timeout: 30s
#      retries: 50
#    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"']
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

#  airflow-worker:
#    <<: *airflow-common
#    command: celery worker
#    healthcheck:
#      test:
#        - "CMD-SHELL"
#        - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
#      interval: 10s
#      timeout: 10s
#      retries: 5
#    environment:
#      <<: *airflow-common-env
#      # Required to handle warm shutdown of the celery workers properly
#      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
#      DUMB_INIT_SETSID: "0"
#    restart: always
#    depends_on:
#      <<: *airflow-common-depends-on
#      airflow-init:
#        condition: service_completed_successfully
#
#  airflow-triggerer:
#    <<: *airflow-common
#    command: triggerer
#    healthcheck:
#      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
#      interval: 10s
#      timeout: 10s
#      retries: 5
#    restart: always
#    depends_on:
#      <<: *airflow-common-depends-on
#      airflow-init:
#        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        function ver() {
          printf "%04d%04d%04d%04d" $${1//./ }
        }
        airflow_version=$$(gosu airflow airflow version)
        airflow_version_comparable=$$(ver $${airflow_version})
        min_airflow_version=2.2.0
        min_airflow_version_comparable=$$(ver $${min_airflow_version})
        if (( airflow_version_comparable < min_airflow_version_comparable )); then
          echo
          echo -e "\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\e[0m"
          echo "The minimum Airflow version supported: $${min_airflow_version}. Only use this or higher!"
          echo
          exit 1
        fi
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
    user: "0:0"
    volumes:
      - .:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

#  flower:
#    <<: *airflow-common
#    command: celery flower
#    ports:
#      - 5555:5555
#    healthcheck:
#      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
#      interval: 10s
#      timeout: 10s
#      retries: 5
#    restart: always
#    depends_on:
#      <<: *airflow-common-depends-on
#      airflow-init:
#        condition: service_completed_successfully

volumes:
  postgres-db-volume:


================================================
FILE: cohorts/2022/week_3_data_warehouse/airflow/scripts/entrypoint.sh
================================================
#!/usr/bin/env bash
export GOOGLE_APPLICATION_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS}
export AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=${AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT}

airflow db upgrade

airflow users create -r Admin -u admin -p admin -e admin@example.com -f admin -l airflow
# "$_AIRFLOW_WWW_USER_USERNAME" -p "$_AIRFLOW_WWW_USER_PASSWORD"

airflow webserver


================================================
FILE: cohorts/2022/week_5_batch_processing/homework.md
================================================
## Week 5 Homework

In this homework we'll put what we learned about Spark
into practice.

We'll use the high-volume for-hire vehicles (HVFHV) dataset for that.

## Question 1. Install Spark and PySpark

* Install Spark
* Run PySpark
* Create a local Spark session
* Execute `spark.version`

What's the output?


## Question 2. HVFHV February 2021

Download the HVFHV data for February 2021:

```bash
wget https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-02.csv
```

Read it with Spark using the same schema as we did
in the lessons. We will use this dataset for all
the remaining questions.

Repartition it to 24 partitions and save it to
parquet.

What's the size of the folder with results (in MB)?


## Question 3. Count records

How many taxi trips were there on February 15?

Consider only trips that started on February 15.


## Question 4. Longest trip for each day

Now calculate the duration for each trip.

Trip starting on which day was the longest?


## Question 5. Most frequent `dispatching_base_num`

Now find the most frequently occurring `dispatching_base_num`
in this dataset.

How many stages does this Spark job have?

> Note: the answer may depend on how you write the query,
> so there are multiple correct answers.
> Select the one you have.


## Question 6. Most common locations pair

Find the most common pickup-dropoff pair.

For example:

"Jamaica Bay / Clinton East"

Enter two zone names separated by a slash

If any of the zone names are unknown (missing), use "Unknown". For example, "Unknown / Clinton East".


## Bonus question. Join type

(not graded)

For finding the answer to Q6, you'll need to perform a join.

What type of join is it?

And how many stages does your Spark job have?


## Submitting the solutions

* Form for submitting: https://forms.gle/dBkVK9yT8cSMDwuw7
* You can submit your homework multiple times. In this case, only the last submission will be used.

Deadline: 07 March (Monday), 22:00 CET


================================================
FILE: cohorts/2022/week_6_stream_processing/homework.md
================================================
## Week 6 Homework

[Form](https://forms.gle/mSzfpPCXskWCabeu5)

The homework is mostly theoretical. In the last question you have to provide a link to working code;
please keep in mind that this question is not scored.
Deadline: 14 March, 22:00 CET ================================================ FILE: cohorts/2023/README.md ================================================ ## Data Engineering Zoomcamp 2023 Cohort * [Launch stream with course overview](https://www.youtube.com/watch?v=-zpVha7bw5A) * [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html) * [Public Leaderboard](leaderboard.md) and [Private Leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vTbL00GcdQp0bJt9wf1ROltMq7s3qyxl-NYF7Pvk79Jfxgwfn9dNWmPD_yJHTDq_Wzvps8EIr6cOKWm/pubhtml) * [Course Playlist: Only 2023 Live videos & homeworks](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW) [**Week 1: Introduction & Prerequisites**](week_1_docker_sql/) * [Homework SQL](week_1_docker_sql/homework.md) and [solution](https://www.youtube.com/watch?v=KIh_9tZiroA) * [Homework Terraform](week_1_terraform/homework.md) * [Office hours](https://www.youtube.com/watch?v=RVTryVvSyw4&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW) [**Week 2: Workflow Orchestration**](week_2_workflow_orchestration) * [Homework](week_2_workflow_orchestration/homework.md) * [Office hours part 1](https://www.youtube.com/watch?v=a_nmLHb8hzw&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW) and [part 2](https://www.youtube.com/watch?v=PK8yyMY54Vk&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW&index=7) [**Week 3: Data Warehouse**](week_3_data_warehouse) * [Homework](week_3_data_warehouse/homework.md) * [Office hours](https://www.youtube.com/watch?v=QXfmtJp3bXE&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW) [**Week 4: Analytics Engineering**](week_4_analytics_engineering/) * [Homework](week_4_analytics_engineering/homework.md) * [PipeRider + dbt Workshop](workshops/piperider.md) * [Office hours](https://www.youtube.com/watch?v=ODYg_r72qaE&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW) [**Week 5: Batch processing**](week_5_batch_processing/) * [Homework](week_5_batch_processing/homework.md) * [Office 
hours](https://www.youtube.com/watch?v=5_69yL2PPYI&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW) [**Week 6: Stream Processing**](week_6_stream_processing) * [Homework](week_6_stream_processing/homework.md) [**Week 7, 8 & 9: Project**](project.md) More information [here](project.md) ================================================ FILE: cohorts/2023/leaderboard.md ================================================ ## Leaderboard This is the top [100 leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vTbL00GcdQp0bJt9wf1ROltMq7s3qyxl-NYF7Pvk79Jfxgwfn9dNWmPD_yJHTDq_Wzvps8EIr6cOKWm/pubhtml) of participants of Data Engineering Zoomcamp 2023 edition!
Name Project Social Links and comments
Katharina Eichinger Project
Alia Hamwi Project
Emmanuel Ikpesu Project
Sanya Syed Project
More info Links: > I am excited about the prospect of securing a challenging role as a Data Engineer, where I can utilise my skills and expertise to contribute meaningfully to an organisation's data-driven initiatives.
Aminu Lawal Project
Lisa Reiber Project
More info Links: > always happy to connect with other data enthusiasts over topics like low-budget data engineering solutions for non-profits or AI solutions for non-profits
Vincenzo Galante Project
More info > Thank you for having this course!
Grzegorz Gątkowski Project
Matt Young Project
More info Links: > Experienced Developer | Cloud & Data Enthusiast | Open to Cloud & Data Engineering Roles 🌩️ ➜ C#, SQL, JavaScript, Python | BI, Data Analytics | AWS, Azure, GCP Passionate about data pipelines, storage, and processing. Excited to implement advanced cloud solutions and enable data-driven insights. Seeking Data Engineering opportunities to leverage my extensive SQL/Data Analytics experience and to transition into the world of cloud-based data solutions. Let's connect and collaborate on innovative data projects! #DataEngineering #CloudTechnology
Sam Hatley Project
Evan Hofmeister Project
Barys Kazarkin Project
Joshua Ati Project
Oleg Agapov Project
Mikhail Kuklin Project
Emmanuel Letremble Project
More info Links: > Thanks to the DataTalks.Club for completing my Full Stack & Machine Learning skill sets with some extra DE knowledge.
Victor Kuang Project
Antonis Angelakis Project
Christian Ruiz
Alex Pilugin Project
Ahmad Rizky Project
Juan Francisco Hernandez Hernandez Project
More info > Thanks to Data Talks Club, it was amazing learning for me as a Career changer.
Iurii Chernigin Project
Franklyne Kibet
Federico Zambelli Project
Marilina Orihuela Project
Alejandro R. Mármol Ruiz Project
Daniel Takeshi Project
Xia He-Bleinagel Project
Thorsten Foltz
Danh Vo Project
Joseph Ologunja Project
Roman Zabolotin Project
Aditya Gupta Project
Vladimir Bugaevskii Project
Fozan Talat Project
Alain Boisvert Project
reneboy garcia Project
More info > "Success is not always about the grand achievements; it's about the small victories that accumulate over time." - Unknown
Svetlana Kononova
Dmitrii Nikolaev
Francis Romio Project
Saul Acevedo Project
Alina Li Project
Alexander Eryuzhev Project
Paul Nwosu Project
More info Links:
  • https://medium.com/@nwosupaul141/serverless-deployment-of-a-prefect-data-pipeline-on-google-cloud-run-8c48765f2480
Param mirani Project
Oscar Garcia - ozkary Project
Hector Torres Project
More info Links: > Currently looking for a position as data engineer
Dewi Nurfitri Oktaviani Project
Ryno Marx
Hidir Cem Altun
Francis Mark Cayco Project
Adrian Baumann Project
Vladislav Garist Project
Gerald Ooi
Roman
Aleksandr Krasnov
Jaesung Ryu Project
António Damião Rodrigues Project
Alicia Escontrela Project
Chalermdej Lematavekul Project
More info > Thank you so much for the course. Learn so many thing from here.
Muhammed Jimoh Project
Bartosz Skłodowski Project
Daniel Rigney Project
Daniel Gheorghita Project
Daniel Gheorghita Project
Niel Kemp
Shahmir Project
More info Links: > I've added a bunch of new features since the reviews! Check it out
Matt Bertrand Project
Nikolay Galkov Project
Hiroko Sakai Project
Rohit Joshi Project
Valerii Bazyrov
Juan Pablo Ricapito Project
Ashraf Omara Project
More info > I need to thank all of the data club community for this amazing contribution.
Wasawat Boonyarittikit Project
Fedor Faizov Project
More info > Absolutly amazing course <3
================================================ FILE: cohorts/2023/project.md ================================================ ## Course Project The goal of this project is to apply everything we learned in this course and build an end-to-end data pipeline. You will have two attempts to submit your project. If you don't have time to submit your project by the end of attempt #1 (you started the course late, you have vacation plans, life/work got in the way, etc.) or you fail your first attempt, then you will have a second chance to submit your project as attempt #2. There are only two attempts. Remember that to pass the project, you must evaluate 3 peers. If you don't do that, your project can't be considered complete. To find the projects assigned to you, use the peer review assignments link and find your hash in the first column. You will see three rows: you need to evaluate each of these projects. For each project, you need to submit the form once, so in total, you will make three submissions. 
### Submitting #### Project Attempt #1 Project: * Form: https://forms.gle/zTJiVYSmCgsENj6y8 * Deadline: 10 April, 22:00 CET Peer reviewing: * Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vRYQ0A9C7AkRK-YPSFhqaRMmuPR97QPfl2PjI8n11l5jntc6YMHIJXVVS0GQNqAYIGwzyevyManDB08/pubhtml?gid=0&single=true) ("project-01" sheet) * Form: https://forms.gle/1bxmgR8yPwV359zb7 * Deadline: 17 April, 22:00 CET Project feedback: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vQuMt9m1XlPrCACqnsFTXTV_KGiSnsl9UjL7kdTMsLJ8DLu3jNJlPzoUKG6baxc8APeEQ8RaSP1U2VX/pubhtml?gid=27207346&single=true) ("project-01" sheet) #### Project Attempt #2 Project: * Form: https://forms.gle/gCXUSYBm1KgMKXVm8 * Deadline: 4 May, 22:00 CET Peer reviewing: * Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vRYQ0A9C7AkRK-YPSFhqaRMmuPR97QPfl2PjI8n11l5jntc6YMHIJXVVS0GQNqAYIGwzyevyManDB08/pubhtml?gid=303437788&single=true) ("project-02" sheet) * Form: https://forms.gle/2x5MT4xxczR8isy37 * Deadline: 11 May, 22:00 CET Project feedback: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vQuMt9m1XlPrCACqnsFTXTV_KGiSnsl9UjL7kdTMsLJ8DLu3jNJlPzoUKG6baxc8APeEQ8RaSP1U2VX/pubhtml?gid=246029638&single=true) ### Evaluation criteria See [here](../../week_7_project/README.md) ### Misc To get the hash for your project, use this function to hash your email: ```python from hashlib import sha1 def compute_hash(email): return sha1(email.lower().encode('utf-8')).hexdigest() ``` Or use [this website](http://www.sha1-online.com/). ================================================ FILE: cohorts/2023/week_1_docker_sql/homework.md ================================================ ## Week 1 Homework In this homework we'll prepare the environment and practice with Docker and SQL ## Question 1. Knowing docker tags Run the command to get information on Docker ```docker --help``` Now run the command to get help on the "docker build" command Which tag has the following text? 
- *Write the image ID to the file* - `--imageid string` - `--iidfile string` - `--idimage string` - `--idfile string` ## Question 2. Understanding docker first run Run docker with the python:3.9 image in an interactive mode and the entrypoint of bash. Now check the python modules that are installed ( use pip list). How many python packages/modules are installed? - 1 - 6 - 3 - 7 # Prepare Postgres Run Postgres and load data as shown in the videos We'll use the green taxi trips from January 2019: ```wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz``` You will also need the dataset with zones: ```wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv``` Download this data and put it into Postgres (with jupyter notebooks or with a pipeline) ## Question 3. Count records How many taxi trips were totally made on January 15? Tip: started and finished on 2019-01-15. Remember that `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns are in the format timestamp (date and hour+min+sec) and not in date. - 20689 - 20530 - 17630 - 21090 ## Question 4. Largest trip for each day Which was the day with the largest trip distance Use the pick up time for your calculations. - 2019-01-18 - 2019-01-28 - 2019-01-15 - 2019-01-10 ## Question 5. The number of passengers In 2019-01-01 how many trips had 2 and 3 passengers? - 2: 1282 ; 3: 266 - 2: 1532 ; 3: 126 - 2: 1282 ; 3: 254 - 2: 1282 ; 3: 274 ## Question 6. Largest tip For the passengers picked up in the Astoria Zone which was the drop off zone that had the largest tip? We want the name of the zone, not the id. Note: it's not a typo, it's `tip` , not `trip` - Central Park - Jamaica - South Ozone Park - Long Island City/Queens Plaza ## Submitting the solutions * Form for submitting: [form](https://forms.gle/EjphSkR1b3nsdojv7) * You can submit your homework multiple times. In this case, only the last submission will be used. 
Deadline: 30 January (Monday), 22:00 CET ## Solution See here: https://www.youtube.com/watch?v=KIh_9tZiroA ================================================ FILE: cohorts/2023/week_1_terraform/homework.md ================================================ ## Week 1 Homework In this homework we'll prepare the environment by creating resources in GCP with Terraform. In your VM on GCP install Terraform. Copy the files from the course repo [here](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup/1_terraform_gcp/terraform) to your VM. Modify the files as necessary to create a GCP Bucket and Big Query Dataset. ## Question 1. Creating Resources After updating the main.tf and variable.tf files run: ``` terraform apply ``` Paste the output of this command into the homework submission form. ## Submitting the solutions * Form for submitting: [form](https://forms.gle/S57Xs3HL9nB3YTzj9) * You can submit your homework multiple times. In this case, only the last submission will be used. Deadline: 30 January (Monday), 22:00 CET ================================================ FILE: cohorts/2023/week_2_workflow_orchestration/README.md ================================================ ## Week 2: Workflow Orchestration Python code from videos is linked [below](#code-repository). Also, if you find the commands too small to view in Kalise's videos, here's the [transcript with code for the second Prefect video](https://github.com/discdiver/prefect-zoomcamp/tree/main/flows/01_start) and the [fifth Prefect video](https://github.com/discdiver/prefect-zoomcamp/tree/main/flows/03_deployments). ### Data Lake (GCS) * What is a Data Lake * ELT vs. ETL * Alternatives to components (S3/HDFS, Redshift, Snowflake etc.) * [Video](https://www.youtube.com/watch?v=W3Zm6rjOq70&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) * [Slides](https://docs.google.com/presentation/d/1RkH-YhBz2apIjYZAxUz2Uks4Pt51-fVWVN9CcH9ckyY/edit?usp=sharing) ### 1. 
Introduction to Workflow orchestration * What is orchestration? * Workflow orchestrators vs. other types of orchestrators * Core features of a workflow orchestration tool * Different types of workflow orchestration tools that currently exist :movie_camera: [Video](https://www.youtube.com/watch?v=8oLs6pzHp68&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) ### 2. Introduction to Prefect concepts * What is Prefect? * Installing Prefect * Prefect flow * Creating an ETL * Prefect task * Blocks and collections * Orion UI :movie_camera: [Video](https://www.youtube.com/watch?v=cdtN6dhp708&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) ### 3. ETL with GCP & Prefect * Flow 1: Putting data to Google Cloud Storage :movie_camera: [Video](https://www.youtube.com/watch?v=W-rMz_2GwqQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) ### 4. From Google Cloud Storage to Big Query * Flow 2: From GCS to BigQuery :movie_camera: [Video](https://www.youtube.com/watch?v=Cx5jt-V5sgE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) ### 5. Parametrizing Flow & Deployments * Parametrizing the script from your flow * Parameter validation with Pydantic * Creating a deployment locally * Setting up Prefect Agent * Running the flow * Notifications :movie_camera: [Video](https://www.youtube.com/watch?v=QrDxPjX10iw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) ### 6. Schedules & Docker Storage with Infrastructure * Scheduling a deployment * Flow code storage * Running tasks in Docker :movie_camera: [Video](https://www.youtube.com/watch?v=psNSzqTsi-s&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) ### 7. 
Prefect Cloud and Additional Resources * Using Prefect Cloud instead of local Prefect * Workspaces * Running flows on GCP :movie_camera: [Video](https://www.youtube.com/watch?v=gGC23ZK7lr8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) * [Prefect docs](https://docs.prefect.io/) * [Pefect Discourse](https://discourse.prefect.io/) * [Prefect Cloud](https://app.prefect.cloud/) * [Prefect Slack](https://prefect-community.slack.com) ### Code repository [Code from videos](https://github.com/discdiver/prefect-zoomcamp) (with a few minor enhancements) ### Homework Homework can be found [here](./homework.md). ## Community notes Did you take notes? You can share them here. * [Blog by Marcos Torregrosa (Prefect)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-2/) * [Notes from Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week2) * [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week2.md) * [Notes by Candace Williams](https://github.com/teacherc/de_zoomcamp_candace2023/blob/main/week_2/week2_notes.md) * [Notes from Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/02/week-2-data-engineering-zoomcamp-notes-prefect/) * [Notes from froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_2_workflow_orchestration/notes/notes_week_02.md) * [Notes from Balaji](https://github.com/Balajirvp/DE-Zoomcamp/blob/main/Week%202/Detailed%20Week%202%20Notes.ipynb) * More on [Pandas vs SQL, Prefect capabilities, and testing your data](https://medium.com/@verazabeida/zoomcamp-2023-week-3-7f27bb8c483f), by Vera * Add your notes here (above this line) ================================================ FILE: cohorts/2023/week_2_workflow_orchestration/homework.md ================================================ ## Week 2 Homework The goal of this homework is to familiarise users with workflow orchestration and observation. ## Question 1. 
Load January 2020 data

Using the `etl_web_to_gcs.py` flow that loads taxi data into GCS as a guide, create a flow that loads the green taxi CSV dataset for January 2020 into GCS and run it. Look at the logs to find out how many rows the dataset has.

How many rows does that dataset have?

* 447,770
* 766,792
* 299,234
* 822,132

## Question 2. Scheduling with Cron

Cron is a common scheduling specification for workflows.

Using the flow in `etl_web_to_gcs.py`, create a deployment to run on the first of every month at 5am UTC. What’s the cron schedule for that?

- `0 5 1 * *`
- `0 0 5 1 *`
- `5 * 1 0 *`
- `* * 5 1 0`

## Question 3. Loading data to BigQuery

Using `etl_gcs_to_bq.py` as a starting point, modify the script for extracting data from GCS and loading it into BigQuery. This new script should not fill or remove rows with missing values. (The script is really just doing the E and L parts of ETL).

The main flow should print the total number of rows processed by the script. Set the flow decorator to log the print statement.

Parametrize the entrypoint flow to accept a list of months, a year, and a taxi color.

Make any other necessary changes to the code for it to function as required.

Create a deployment for this flow to run in a local subprocess with local flow code storage (the defaults).

Make sure you have the parquet data files for Yellow taxi data for Feb. 2019 and March 2019 loaded in GCS. Run your deployment to append this data to your BigQuery table. How many rows did your flow code process?

- 14,851,920
- 12,282,990
- 27,235,753
- 11,338,483

## Question 4. GitHub Storage Block

Using the `web_to_gcs` script from the videos as a guide, you want to store your flow code in a GitHub repository for collaboration with your team. Prefect can look in the GitHub repo to find your flow code and read it.
Create a GitHub storage block from the UI or in Python code and use that in your Deployment instead of storing your flow code locally or baking your flow code into a Docker image. Note that you will have to push your code to GitHub, Prefect will not push it for you. Run your deployment in a local subprocess (the default if you don’t specify an infrastructure). Use the Green taxi data for the month of November 2020. How many rows were processed by the script? - 88,019 - 192,297 - 88,605 - 190,225 ## Question 5. Email or Slack notifications Q5. It’s often helpful to be notified when something with your dataflow doesn’t work as planned. Choose one of the options below for creating email or slack notifications. The hosted Prefect Cloud lets you avoid running your own server and has Automations that allow you to get notifications when certain events occur or don’t occur. Create a free forever Prefect Cloud account at app.prefect.cloud and connect your workspace to it following the steps in the UI when you sign up. Set up an Automation that will send yourself an email when a flow run completes. Run the deployment used in Q4 for the Green taxi data for April 2019. Check your email to see the notification. Alternatively, use a Prefect Cloud Automation or a self-hosted Orion server Notification to get notifications in a Slack workspace via an incoming webhook. Join my temporary Slack workspace with [this link](https://join.slack.com/t/temp-notify/shared_invite/zt-1odklt4wh-hH~b89HN8MjMrPGEaOlxIw). 400 people can use this link and it expires in 90 days. In the Prefect Cloud UI create an [Automation](https://docs.prefect.io/ui/automations) or in the Prefect Orion UI create a [Notification](https://docs.prefect.io/ui/notifications/) to send a Slack message when a flow run enters a Completed state. Here is the Webhook URL to use: https://hooks.slack.com/services/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 Test the functionality. 
Alternatively, you can grab the webhook URL from your own Slack workspace and Slack App that you create. How many rows were processed by the script? - `125,268` - `377,922` - `728,390` - `514,392` ## Question 6. Secrets Prefect Secret blocks provide secure, encrypted storage in the database and obfuscation in the UI. Create a secret block in the UI that stores a fake 10-digit password to connect to a third-party service. Once you’ve created your block in the UI, how many characters are shown as asterisks (*) on the next page of the UI? - 5 - 6 - 8 - 10 ## Submitting the solutions * Form for submitting: https://forms.gle/PY8mBEGXJ1RvmTM97 * You can submit your homework multiple times. In this case, only the last submission will be used. Deadline: 8 February (Wednesday), 22:00 CET ## Solution * Video: https://youtu.be/L04lvYqNlc0 * Code: https://github.com/discdiver/prefect-zoomcamp/tree/main/flows/04_homework ================================================ FILE: cohorts/2023/week_3_data_warehouse/homework.md ================================================ ## Week 3 Homework Important Note:

You can load the data however you would like, but keep the files in .GZ format. If you are using an orchestrator such as Airflow or Prefect, do not use it to load the data into Big Query; stop once the files are loaded into a bucket.

NOTE: You can use the CSV option for the GZ files when creating an External Table.

SETUP:

* Create an external table using the fhv 2019 data.
* Create a table in BQ using the fhv 2019 data (do not partition or cluster this table).

Data can be found here: https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv
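
The setup above needs all twelve monthly files for 2019. A minimal sketch of listing them, assuming the release assets follow the `fhv_tripdata_2019-MM.csv.gz` naming and the `releases/download/<tag>/` URL layout that appear elsewhere in these materials. The `BASE_URL` constant and the `fhv_urls` helper are hypothetical names introduced here; check the pattern against the release page before relying on it.

```python
# Hypothetical helper: builds download URLs for the monthly FHV files.
# Assumption: assets are named fhv_tripdata_<year>-<month>.csv.gz under the "fhv" tag.
BASE_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv"


def fhv_urls(year: int = 2019) -> list[str]:
    """Return one URL per month of the given year, keeping the .csv.gz format."""
    return [
        f"{BASE_URL}/fhv_tripdata_{year}-{month:02d}.csv.gz"
        for month in range(1, 13)
    ]


if __name__ == "__main__":
    for url in fhv_urls():
        print(url)
```

The files can then be downloaded and uploaded to a bucket unchanged, which keeps them in GZ format as the note requires.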

## Question 1:
What is the count for fhv vehicle records for year 2019?
- 65,623,481
- 43,244,696
- 22,978,333
- 13,942,414

## Question 2:
Write a query to count the distinct number of affiliated_base_number for the entire dataset on both the tables.
What is the estimated amount of data that will be read when this query is executed on the External Table and the Table?
- 25.2 MB for the External Table and 100.87 MB for the BQ Table
- 225.82 MB for the External Table and 47.60 MB for the BQ Table
- 0 MB for the External Table and 0 MB for the BQ Table
- 0 MB for the External Table and 317.94 MB for the BQ Table

## Question 3:
How many records have both a blank (null) PUlocationID and DOlocationID in the entire dataset?
- 717,748
- 1,215,687
- 5
- 20,332

## Question 4:
What is the best strategy to optimize the table if queries always filter by pickup_datetime and order by affiliated_base_number?
- Cluster on pickup_datetime Cluster on affiliated_base_number
- Partition by pickup_datetime Cluster on affiliated_base_number
- Partition by pickup_datetime Partition by affiliated_base_number
- Partition by affiliated_base_number Cluster on pickup_datetime

## Question 5:
Implement the optimized solution you chose for question 4. Write a query to retrieve the distinct affiliated_base_number between pickup_datetime 2019/03/01 and 2019/03/31 (inclusive).
Use the BQ table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 4 and note the estimated bytes processed. What are these values? Choose the answer which most closely matches. - 12.82 MB for non-partitioned table and 647.87 MB for the partitioned table - 647.87 MB for non-partitioned table and 23.06 MB for the partitioned table - 582.63 MB for non-partitioned table and 0 MB for the partitioned table - 646.25 MB for non-partitioned table and 646.25 MB for the partitioned table ## Question 6: Where is the data stored in the External Table you created? - Big Query - GCP Bucket - Container Registry - Big Table ## Question 7: It is best practice in Big Query to always cluster your data: - True - False ## (Not required) Question 8: A better format to store these files may be parquet. Create a data pipeline to download the gzip files and convert them into parquet. Upload the files to your GCP Bucket and create an External and BQ Table. Note: Column types for all files used in an External Table must have the same datatype. While an External Table may be created and shown in the side panel in Big Query, this will need to be validated by running a count query on the External Table to check if any errors occur. ## Submitting the solutions * Form for submitting: https://forms.gle/rLdvQW2igsAT73HTA * You can submit your homework multiple times. In this case, only the last submission will be used. 
Deadline: 13 February (Monday), 22:00 CET ## Solution Solution: https://www.youtube.com/watch?v=j8r2OigKBWE ================================================ FILE: cohorts/2023/week_4_analytics_engineering/homework.md ================================================ ## Week 4 Homework In this homework, we'll use the models developed during the week 4 videos and enhance the already presented dbt project using the already loaded Taxi data for fhv vehicles for year 2019 in our DWH. This means that in this homework we use the following data [Datasets list](https://github.com/DataTalksClub/nyc-tlc-data/) * Yellow taxi data - Years 2019 and 2020 * Green taxi data - Years 2019 and 2020 * fhv data - Year 2019. We will use the data loaded for: * Building a source table: `stg_fhv_tripdata` * Building a fact table: `fact_fhv_trips` * Create a dashboard If you don't have access to GCP, you can do this locally using the ingested data from your Postgres database instead. If you have access to GCP, you don't need to do it for local Postgres - only if you want to. > **Note**: if your answer doesn't match exactly, select the closest option ### Question 1: **What is the count of records in the model fact_trips after running all models with the test run variable disabled and filtering for 2019 and 2020 data only (pickup datetime)?** You'll need to have completed the ["Build the first dbt models"](https://www.youtube.com/watch?v=UVI30Vxzd6c) video and have been able to run the models via the CLI. You should find the views and models for querying in your DWH. - 41648442 - 51648442 - 61648442 - 71648442 ### Question 2: **What is the distribution between service type filtering by years 2019 and 2020 data as done in the videos?** You will need to complete "Visualising the data" videos, either using [google data studio](https://www.youtube.com/watch?v=39nLTs74A3E) or [metabase](https://www.youtube.com/watch?v=BnLkrA7a6gM). 
- 89.9/10.1 - 94/6 - 76.3/23.7 - 99.1/0.9 ### Question 3: **What is the count of records in the model stg_fhv_tripdata after running all models with the test run variable disabled (:false)?** Create a staging model for the fhv data for 2019 and do not add a deduplication step. Run it via the CLI without limits (is_test_run: false). Filter records with pickup time in year 2019. - 33244696 - 43244696 - 53244696 - 63244696 ### Question 4: **What is the count of records in the model fact_fhv_trips after running all dependencies with the test run variable disabled (:false)?** Create a core model for the stg_fhv_tripdata joining with dim_zones. Similar to what we've done in fact_trips, keep only records with known pickup and dropoff locations entries for pickup and dropoff locations. Run it via the CLI without limits (is_test_run: false) and filter records with pickup time in year 2019. - 12998722 - 22998722 - 32998722 - 42998722 ### Question 5: **What is the month with the biggest amount of rides after building a tile for the fact_fhv_trips table?** Create a dashboard with some tiles that you find interesting to explore the data. One tile should show the amount of trips per month, as done in the videos for fact_trips, based on the fact_fhv_trips table. - March - April - January - December ## Submitting the solutions * Form for submitting: https://forms.gle/6A94GPutZJTuT5Y16 * You can submit your homework multiple times. In this case, only the last submission will be used. 
Deadline: 25 February (Saturday), 22:00 CET ## Solution * Video: https://www.youtube.com/watch?v=I_K0lNu9WQw&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW * Answers: * Question 1: 61648442, * Question 2: 89.9/10.1 * Question 3: 43244696 * Question 4: 22998722 * Question 5: January ================================================ FILE: cohorts/2023/week_5_batch_processing/homework.md ================================================ ## Week 5 Homework In this homework we'll put what we learned about Spark in practice. For this homework we will be using the FHVHV 2021-06 data found here. [FHVHV Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-06.csv.gz ) ### Question 1: **Install Spark and PySpark** - Install Spark - Run PySpark - Create a local spark session - Execute spark.version. What's the output? - 3.3.2 - 2.1.4 - 1.2.3 - 5.4

### Question 2:

**FHVHV June 2021**

Read the data with Spark using the same schema as we did in the lessons.
We will use this dataset for all the remaining questions.
Repartition it to 12 partitions and save it to parquet.
What is the average size (in MB) of the Parquet files (those ending with the .parquet extension) that were created? Select the answer which most closely matches.

- 2MB
- 24MB
- 100MB
- 250MB
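
One way to measure the average file size asked about above is to inspect the output folder from Python. This is a minimal sketch using only the standard library; the function name `average_parquet_size_mb` is a hypothetical helper, and it assumes the Parquet part files sit directly inside the output folder.

```python
from pathlib import Path


def average_parquet_size_mb(folder: str) -> float:
    """Average size in MB of the files ending with .parquet directly inside folder."""
    # Only files whose names end in .parquet are counted, so marker files
    # such as _SUCCESS or checksum files are ignored.
    sizes = [p.stat().st_size for p in Path(folder).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {folder}")
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

Calling it with the path you passed when saving the repartitioned data returns the figure to compare against the options.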

### Question 3:

**Count records**

How many taxi trips were there on June 15?

Consider only trips that started on June 15.

- 308,164
- 12,856
- 452,470
- 50,982

### Question 4:

**Longest trip for each day**

Now calculate the duration for each trip.
How long was the longest trip in hours?

- 66.87 Hours
- 243.44 Hours
- 7.68 Hours
- 3.32 Hours
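
The duration arithmetic in this question can be sketched in plain Python before writing it in Spark. The rows below are made-up examples, not values from the dataset, and `trip_duration_hours` is a hypothetical helper that assumes timestamps formatted as `YYYY-MM-DD HH:MM:SS`.

```python
from datetime import datetime


def trip_duration_hours(pickup: str, dropoff: str) -> float:
    """Duration between two 'YYYY-MM-DD HH:MM:SS' timestamps, in hours."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    return delta.total_seconds() / 3600


# Made-up example rows (not taken from the dataset): (pickup, dropoff)
trips = [
    ("2021-06-15 08:00:00", "2021-06-15 08:30:00"),
    ("2021-06-15 23:00:00", "2021-06-16 02:15:00"),
]
longest = max(trip_duration_hours(p, d) for p, d in trips)
print(longest)  # 3.25
```

The same idea carries over to the full dataset: subtract the pickup time from the dropoff time, convert seconds to hours, and take the maximum.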

### Question 5:

**User Interface**

On which local port does Spark’s User Interface, which shows the application's dashboard, run?

- 80
- 443
- 4040
- 8080

### Question 6:

**Most frequent pickup location zone**

Load the zone lookup data into a temp view in Spark
[Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv)

Using the zone lookup data and the fhvhv June 2021 data, what is the name of the most frequent pickup location zone?

- East Chelsea
- Astoria
- Union Sq
- Crown Heights North
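
The question boils down to counting trips per pickup location ID and then translating the winning ID to a zone name through the lookup table. This plain-Python sketch mirrors the join-and-count that a Spark SQL query would perform; the function name and the dictionary-based lookup are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable


def most_frequent_pickup_zone(
    pickup_location_ids: Iterable[int],
    zone_lookup: Dict[int, str],
) -> str:
    """Return the zone name of the most common pickup location ID."""
    counts = Counter(pickup_location_ids)
    # most_common(1) returns a one-element list of (location_id, count).
    location_id, _ = counts.most_common(1)[0]
    return zone_lookup[location_id]
```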

## Submitting the solutions

* Form for submitting: https://forms.gle/EcSvDs6vp64gcGuD8
* You can submit your homework multiple times. In this case, only the last submission will be used.

Deadline: 06 March (Monday), 22:00 CET

## Solution

* Video: https://www.youtube.com/watch?v=ldoDIT32pJs
* Answers:
  * Question 1: 3.3.2
  * Question 2: 24MB
  * Question 3: 452,470
  * Question 4: 66.87 Hours
  * Question 5: 4040
  * Question 6: Crown Heights North

================================================
FILE: cohorts/2023/week_6_stream_processing/client.properties
================================================
# Required connection configs for Kafka producer, consumer, and admin
bootstrap.servers=:9092
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username=
sasl.password=

# Best practice for higher availability in librdkafka clients prior to 1.7
session.timeout.ms=45000

================================================
FILE: cohorts/2023/week_6_stream_processing/homework.md
================================================
## Week 6 Homework

In this homework, there will be two sections. The first section focuses on theoretical questions related to Kafka and streaming concepts, and the second section asks you to create a small streaming application using your preferred programming language (Python or Java).

### Question 1:

**Please select the statements that are correct**

- Kafka Node is responsible for storing topics [x]
- Zookeeper is removed from Kafka cluster starting from version 4.0 [x]
- Retention configuration ensures the messages do not get lost over a specific period of time. [x]
- Group-Id ensures the messages are distributed to associated consumers [x]

### Question 2:

**Please select the Kafka concepts that support reliability and availability**

- Topic Replication [x]
- Topic Partitioning
- Consumer Group Id
- Ack All [x]

### Question 3:

**Please select the Kafka concepts that support scaling**

- Topic Replication
- Topic Partitioning [x]
- Consumer Group Id [x]
- Ack All

### Question 4:

**Please select the attributes that are good candidates for a partitioning key. Consider the cardinality of the field you have selected and the scaling aspects of your application**

- payment_type [x]
- vendor_id [x]
- passenger_count
- total_amount
- tpep_pickup_datetime
- tpep_dropoff_datetime

### Question 5:

**Which configurations below should be provided for a Kafka Consumer but are not needed for a Kafka Producer**

- Deserializer Configuration [x]
- Topics Subscription [x]
- Bootstrap Server
- Group-Id [x]
- Offset [x]
- Cluster Key and Cluster-Secret

### Question 6:

Please implement a streaming application for finding out the popularity of PUlocationID across the green and fhv trip datasets.
Please use the datasets [fhv_tripdata_2019-01.csv.gz](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv) and [green_tripdata_2019-01.csv.gz](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green)

PS: If you encounter memory-related issues, you can use a smaller portion of these two datasets as well; it is not necessary to find the exact number in the question.

Your code should include the following:

1. A producer that reads the csv files and publishes rides to the corresponding Kafka topics (such as rides_green, rides_fhv)
2. A PySpark streaming application that reads the two Kafka topics, writes both of them to the topic rides_all, and applies aggregations to find the most popular pickup location.

## Submitting the solutions

* Form for submitting: https://forms.gle/rK7268U92mHJBpmW7
* You can submit your homework multiple times. In this case, only the last submission will be used.
Deadline: 13 March (Monday), 22:00 CET

## Solution

We will publish the solution here after the deadline.

For Question 6, ensure the following:

1) Download fhv_tripdata_2019-01.csv and green_tripdata_2019-01.csv under resources/fhv_tripdata and resources/green_tripdata respectively. PS: You need to unzip the compressed files.

2) Update the client.properties settings using your Confluent Cloud API keys and cluster.

3) Create the topics (all_rides, fhv_taxi_rides, green_taxi_rides) in the Confluent Cloud UI.

4) Run the producers for the two datasets
```
python3 producer_confluent.py --type green
python3 producer_confluent.py --type fhv
```

5) Run pyspark streaming
```
./spark-submit.sh streaming_confluent.py
```

================================================
FILE: cohorts/2023/week_6_stream_processing/producer_confluent.py
================================================
from confluent_kafka import Producer

import argparse
import csv
from typing import Dict, Iterable, Tuple
from time import sleep

from settings import CONFLUENT_CLOUD_CONFIG, \
    GREEN_TAXI_TOPIC, FHV_TAXI_TOPIC, \
    GREEN_TRIP_DATA_PATH, FHV_TRIP_DATA_PATH


class RideCSVProducer:
    def __init__(self, probs: Dict, ride_type: str):
        self.producer = Producer(**probs)
        self.ride_type = ride_type

    def parse_row(self, row):
        # Build the message value from the pickup/dropoff location columns
        if self.ride_type == 'green':
            record = f'{row[5]}, {row[6]}'  # PULocationID, DOLocationID
            key = str(row[0])  # vendor_id
        elif self.ride_type == 'fhv':
            record = f'{row[3]}, {row[4]}'  # PULocationID, DOLocationID
            key = str(row[0])  # dispatching_base_num
        return key, record

    def read_records(self, resource_path: str):
        records, ride_keys = [], []
        with open(resource_path, 'r') as f:
            reader = csv.reader(f)
            header = next(reader)  # skip the header
            for row in reader:
                key, record = self.parse_row(row)
                ride_keys.append(key)
                records.append(record)
        return zip(ride_keys, records)

    def publish(self, records: Iterable[Tuple[str, str]], topic: str):
        for key_value in records:
            key, value = key_value
            try:
                self.producer.poll(0)
                self.producer.produce(topic=topic, key=key, value=value)
                print(f"Producing record for <key: {key}, value: {value}>")
            except KeyboardInterrupt:
                break
            except BufferError as bfer:
                # Local queue is full: give the producer time to deliver messages
                self.producer.poll(0.1)
            except Exception as e:
                print(f"Exception while producing record - {value}: {e}")

        self.producer.flush()
        sleep(10)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Kafka Producer')
    parser.add_argument('--type', type=str, default='green')
    args = parser.parse_args()

    if args.type == 'green':
        kafka_topic = GREEN_TAXI_TOPIC
        data_path = GREEN_TRIP_DATA_PATH
    elif args.type == 'fhv':
        kafka_topic = FHV_TAXI_TOPIC
        data_path = FHV_TRIP_DATA_PATH

    producer = RideCSVProducer(ride_type=args.type, probs=CONFLUENT_CLOUD_CONFIG)
    ride_records = producer.read_records(resource_path=data_path)
    producer.publish(records=ride_records, topic=kafka_topic)

================================================
FILE: cohorts/2023/week_6_stream_processing/settings.py
================================================
import pyspark.sql.types as T

GREEN_TRIP_DATA_PATH = './resources/green_tripdata/green_tripdata_2019-01.csv'
FHV_TRIP_DATA_PATH = './resources/fhv_tripdata/fhv_tripdata_2019-01.csv'
BOOTSTRAP_SERVERS = 'localhost:9092'

RIDES_TOPIC = 'all_rides'
FHV_TAXI_TOPIC = 'fhv_taxi_rides'
GREEN_TAXI_TOPIC = 'green_taxi_rides'

ALL_RIDE_SCHEMA = T.StructType(
    [T.StructField("PUlocationID", T.StringType()),
     T.StructField("DOlocationID", T.StringType()),
     ])


def read_ccloud_config(config_file):
    # Parse a key=value properties file, skipping blank lines and comments
    conf = {}
    with open(config_file) as fh:
        for line in fh:
            line = line.strip()
            if len(line) != 0 and line[0] != "#":
                parameter, value = line.strip().split('=', 1)
                conf[parameter] = value.strip()
    return conf


# NOTE: this reads client_original.properties, while the instructions refer to
# client.properties; adjust the file name to match your setup.
CONFLUENT_CLOUD_CONFIG = read_ccloud_config('client_original.properties')

================================================
FILE: cohorts/2023/week_6_stream_processing/spark-submit.sh
================================================
# Submit Python code to SparkMaster

if [ $# -lt 1 ]
then
    echo "Usage: $0 PYTHON_JOB [ executor-memory ]"
    echo "(specify memory in string format such as \"512M\" or \"2G\")"
    exit 1
fi
PYTHON_JOB=$1

if [ -z "$2" ]
then
    EXEC_MEM="1G"
else
    EXEC_MEM=$2
fi
spark-submit --master spark://localhost:7077 --num-executors 2 \
    --executor-memory $EXEC_MEM --executor-cores 1 \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.3.1 \
    $PYTHON_JOB

================================================
FILE: cohorts/2023/week_6_stream_processing/streaming_confluent.py
================================================
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

from settings import CONFLUENT_CLOUD_CONFIG, GREEN_TAXI_TOPIC, FHV_TAXI_TOPIC, RIDES_TOPIC, ALL_RIDE_SCHEMA


def read_from_kafka(consume_topic: str):
    # Spark Streaming DataFrame, connect to Kafka topic served at host in bootstrap.servers option
    df_stream = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", CONFLUENT_CLOUD_CONFIG['bootstrap.servers']) \
        .option("subscribe", consume_topic) \
        .option("startingOffsets", "earliest") \
        .option("checkpointLocation", "checkpoint") \
        .option("kafka.security.protocol", "SASL_SSL") \
        .option("kafka.sasl.mechanism", "PLAIN") \
        .option("kafka.sasl.jaas.config",
                f"""org.apache.kafka.common.security.plain.PlainLoginModule required username="{CONFLUENT_CLOUD_CONFIG['sasl.username']}" password="{CONFLUENT_CLOUD_CONFIG['sasl.password']}";""") \
        .option("failOnDataLoss", False) \
        .load()
    return df_stream


def parse_rides(df, schema):
    """ take a Spark Streaming df and parse value col based on schema, return streaming df cols in schema """
    assert df.isStreaming is True, "DataFrame doesn't receive streaming data"

    df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    # split attributes to nested array in one Column
    col = F.split(df['value'], ', ')

    # expand col to multiple top-level columns
    for idx, field in enumerate(schema):
        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))
    df = df.na.drop()
    df.printSchema()
    return df.select([field.name for field in schema])


def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):
    # NOTE: awaitTermination() blocks and returns None, so `query` is None here
    query = df.writeStream \
        .outputMode(output_mode) \
        .trigger(processingTime=processing_time) \
        .format("console") \
        .option("truncate", False) \
        .start() \
        .awaitTermination()
    return query  # pyspark.sql.streaming.StreamingQuery


def sink_kafka(df, topic, output_mode: str = 'complete'):
    query = df.writeStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "pkc-75m1o.europe-west3.gcp.confluent.cloud:9092") \
        .outputMode(output_mode) \
        .option("topic", topic) \
        .option("checkpointLocation", "checkpoint") \
        .option("kafka.security.protocol", "SASL_SSL") \
        .option("kafka.sasl.mechanism", "PLAIN") \
        .option("kafka.sasl.jaas.config",
                f"""org.apache.kafka.common.security.plain.PlainLoginModule required username="{CONFLUENT_CLOUD_CONFIG['sasl.username']}" password="{CONFLUENT_CLOUD_CONFIG['sasl.password']}";""") \
        .option("failOnDataLoss", False) \
        .start()
    return query


def op_groupby(df, column_names):
    df_aggregation = df.groupBy(column_names).count()
    return df_aggregation


if __name__ == "__main__":
    spark = SparkSession.builder.appName('streaming-homework').getOrCreate()
    spark.sparkContext.setLogLevel('WARN')

    # Step 1: Consume GREEN_TAXI_TOPIC and FHV_TAXI_TOPIC
    df_green_rides = read_from_kafka(consume_topic=GREEN_TAXI_TOPIC)
    df_fhv_rides = read_from_kafka(consume_topic=FHV_TAXI_TOPIC)

    # Step 2: Publish green and fhv rides to RIDES_TOPIC
    kafka_sink_green_query = sink_kafka(df=df_green_rides, topic=RIDES_TOPIC, output_mode='append')
    kafka_sink_fhv_query = sink_kafka(df=df_fhv_rides, topic=RIDES_TOPIC, output_mode='append')

    # Step 3: Read RIDES_TOPIC and parse it in ALL_RIDE_SCHEMA
    df_all_rides = read_from_kafka(consume_topic=RIDES_TOPIC)
    df_all_rides = parse_rides(df_all_rides, ALL_RIDE_SCHEMA)

    # Step 4: Apply Aggregation on the all_rides
    df_pu_location_count = op_groupby(df_all_rides, ['PULocationID'])
    df_pu_location_count = df_pu_location_count.sort(F.col('count').desc())

    # Step 5: Sink Aggregation Streams to Console
    console_sink_pu_location = sink_console(df_pu_location_count, output_mode='complete')

================================================
FILE: cohorts/2023/workshops/piperider.md
================================================
## Workshop: Maximizing Confidence in Your Data Model Changes with dbt and PipeRider

To learn how to use PipeRider together with dbt for detecting changes in model and data, sign up for a workshop

- Video: https://www.youtube.com/watch?v=O-tyUOQccSs
- Repository: https://github.com/InfuseAI/taxi_rides_ny_duckdb

## Homework

The following questions follow on from the original Week 4 homework, and so use the same data as required by those questions:

https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2023/week_4_analytics_engineering/homework.md

Yellow taxi data - Years 2019 and 2020
Green taxi data - Years 2019 and 2020
fhv data - Year 2019.

### Question 1:

What is the distribution between vendor id filtering by years 2019 and 2020 data?

You will need to run PipeRider and check the report

* 70.1/29.6/0.5
* 60.1/39.5/0.4
* 90.2/9.5/0.3
* 80.1/19.7/0.2

### Question 2:

What is the composition of total amount (positive/zero/negative) filtering by years 2019 and 2020 data?

You will need to run PipeRider and check the report

* 51.4M/15K/48.6K
* 21.4M/5K/248.6K
* 61.4M/25K/148.6K
* 81.4M/35K/14.6K

### Question 3:

What are the numeric statistics (average/standard deviation/min/max/sum) of trip distances filtering by years 2019 and 2020 data?

You will need to run PipeRider and check the report

* 1.95/35.43/0/16.3K/151.5M
* 3.95/25.43/23.88/267.3K/281.5M
* 5.95/75.43/-63.88/67.3K/81.5M
* 2.95/35.43/-23.88/167.3K/181.5M

## Submitting the solutions

* Form for submitting: https://forms.gle/WyLQHBu1DNwNTfqe8
* You can submit your homework multiple times. In this case, only the last submission will be used.
Deadline: 20 March, 22:00 CET ## Solution Video: https://www.youtube.com/watch?v=inNrUys7W8U&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW ================================================ FILE: cohorts/2024/01-docker-terraform/homework.md ================================================ ## Module 1 Homework ATTENTION: At the very end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or shell commands), please include these directly in the README file of your repository. ## Docker & SQL In this homework we'll prepare the environment and practice with Docker and SQL ## Question 1. Knowing docker tags Run the command to get information on Docker ```docker --help``` Now run the command to get help on the "docker build" command: ```docker build --help``` Do the same for "docker run". Which tag has the following text? - *Automatically remove the container when it exits* - `--delete` - `--rc` - `--rmc` - `--rm` ## Question 2. Understanding docker first run Run docker with the python:3.9 image in an interactive mode and the entrypoint of bash. Now check the python modules that are installed ( use ```pip list``` ). What is version of the package *wheel* ? - 0.42.0 - 1.0.0 - 23.0.1 - 58.1.0 # Prepare Postgres Run Postgres and load data as shown in the videos We'll use the green taxi trips from September 2019: ```wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-09.csv.gz``` You will also need the dataset with zones: ```wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv``` Download this data and put it into Postgres (with jupyter notebooks or with a pipeline) ## Question 3. Count records How many taxi trips were totally made on September 18th 2019? Tip: started and finished on 2019-09-18. 
Remember that `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns are in the format timestamp (date and hour+min+sec) and not in date. - 15767 - 15612 - 15859 - 89009 ## Question 4. Longest trip for each day Which was the pick up day with the longest trip distance? Use the pick up time for your calculations. Tip: For every trip on a single day, we only care about the trip with the longest distance. - 2019-09-18 - 2019-09-16 - 2019-09-26 - 2019-09-21 ## Question 5. Three biggest pick up Boroughs Consider lpep_pickup_datetime in '2019-09-18' and ignoring Borough has Unknown Which were the 3 pick up Boroughs that had the maximum total_amount? - "Brooklyn" "Manhattan" "Queens" - "Bronx" "Brooklyn" "Manhattan" - "Bronx" "Manhattan" "Queens" - "Brooklyn" "Queens" "Staten Island" ## Question 6. Largest tip For the passengers picked up in September 2019 in the zone name Astoria which was the drop off zone that had the largest tip? We want the name of the zone, not the id. Note: it's not a typo, it's `tip` , not `trip` - Central Park - Jamaica - JFK Airport - Long Island City/Queens Plaza ## Terraform In this section homework we'll prepare the environment by creating resources in GCP with Terraform. In your VM on GCP/Laptop/GitHub Codespace install Terraform. Copy the files from the course repo [here](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/01-docker-terraform/1_terraform_gcp/terraform) to your VM/Laptop/GitHub Codespace. Modify the files as necessary to create a GCP Bucket and Big Query Dataset. ## Question 7. Creating Resources After updating the main.tf and variable.tf files run: ``` terraform apply ``` Paste the output of this command into the homework submission form. ## Submitting the solutions * Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw01 * You can submit your homework multiple times. In this case, only the last submission will be used. 
Deadline: 29 January, 23:00 CET ================================================ FILE: cohorts/2024/01-docker-terraform/solutions.md ================================================ ## Question 1. Knowing docker tags ``` ❯ docker run --help | grep "Automatically remove" --rm Automatically remove ``` - `|` pipe operator redirects the previous command output as an input to the command after the operator - `docker run --help` -----> outputs `|` ---------> inputs to `grep "Automatically remove"` - `grep` allows you to search through text Answer: `--rm` ## Question 2. Understanding docker first run - Run python:3.9 image with `docker run -it python:3.9 bash` - Since you opened with `it` tag, the container will be interactive` - Since the docker command ends with `bash`, the entrypoint into the container will be `bash` ```shell root@root: docker run -it python:3.9 bash root@b67c6949422a:/# pip list Package Version ---------- ------- pip 23.0.1 setuptools 58.1.0 wheel 0.45.1 ``` Since it's been a while since 2024 cohort, your wheel version might differ and may not be in the options provided. Answer: For me it was `0.45.1` ## Question 3. Count records - Trips that started and finished on 2019-09-18 - Format timestamp(date and hour+min+sec) to date. ```sql SELECT COUNT(*) FROM "csv_green_tripdata_2019_09" WHERE DATE("lpep_pickup_datetime") = '2019-09-18' AND DATE("lpep_dropoff_datetime") = '2019-09-18'; ``` ``` +-------+ | count | |-------| | 15612 | +-------+ ``` Answer: `15612` ## Question 4. Longest trip for each day ```sql SELECT DATE("lpep_pickup_datetime") AS "pickup_date", MAX("trip_distance") AS "longest_trip" FROM "csv_green_tripdata_2019_09" GROUP BY DATE("lpep_pickup_datetime") ORDER BY "longest_trip" DESC LIMIT 1; ``` ``` +-------------+--------------+ | pickup_date | longest_trip | |-------------+--------------| | 2019-09-26 | 341.64 | +-------------+--------------+ ``` Answer: `2019-09-26` ## Question 5. 
Three biggest pickup zones ```sql SELECT "zone"."Zone", ROUND(SUM(("total_amount")::NUMERIC), 3) AS "total_amount" FROM "csv_green_tripdata_2019_09" INNER JOIN "zone" ON "csv_green_tripdata_2019_09"."PULocationID" = "zone"."LocationID" WHERE DATE("lpep_pickup_datetime") = '2019-09-18' GROUP BY "zone"."Zone" ORDER BY "total_amount" DESC LIMIT 3; ``` ``` +---------------------+--------------+ | Zone | total_amount | |---------------------+--------------| | East Harlem North | 17893.060 | | East Harlem South | 17152.160 | | Morningside Heights | 11259.680 | +---------------------+--------------+ ``` Answer: `East Harlem North, East Harlem South, Morningside Heights` ## Question 6. Largest tip ```sql SELECT puz."Zone" AS pickup_zone, doz."Zone" AS dropoff_zone, g."tip_amount" FROM "csv_green_tripdata_2019_09" g INNER JOIN "zone" puz ON g."PULocationID" = puz."LocationID" INNER JOIN "zone" doz ON g."DOLocationID" = doz."LocationID" WHERE puz."Zone" = 'Astoria' ORDER BY g."tip_amount" DESC LIMIT 1; ``` ``` +-------------+--------------+------------+ | pickup_zone | dropoff_zone | tip_amount | |-------------+--------------+------------| | Astoria | JFK Airport | 62.31 | +-------------+--------------+------------+ ``` Answer: `JFK Airport` ## Question 7. Terraform Workflow > self-explanatory ================================================ FILE: cohorts/2024/02-workflow-orchestration/README.md ================================================ > [!NOTE] >If you're looking for Airflow videos from the 2022 edition, check the [2022 cohort folder](../cohorts/2022/week_2_data_ingestion/). > >If you're looking for Prefect videos from the 2023 edition, check the [2023 cohort folder](../cohorts/2023/week_2_data_ingestion/). # Week 2: Workflow Orchestration Welcome to Week 2 of the Data Engineering Zoomcamp! 🚀😤 This week, we'll be covering workflow orchestration with Mage. Mage is an open-source, hybrid framework for transforming and integrating data. 
✨ This week, you'll learn how to use the Mage platform to author and share _magical_ data pipelines. This will all be covered in the course, but if you'd like to learn a bit more about Mage, check out our docs [here](https://docs.mage.ai/introduction/overview). * [2.2.1 - 📯 Intro to Orchestration](#221----intro-to-orchestration) * [2.2.2 - 🧙‍♂️ Intro to Mage](#222---%EF%B8%8F-intro-to-mage) * [2.2.3 - 🐘 ETL: API to Postgres](#223----etl-api-to-postgres) * [2.2.4 - 🤓 ETL: API to GCS](#224----etl-api-to-gcs) * [2.2.5 - 🔍 ETL: GCS to BigQuery](#225----etl-gcs-to-bigquery) * [2.2.6 - 👨‍💻 Parameterized Execution](#226----parameterized-execution) * [2.2.7 - 🤖 Deployment (Optional)](#227----deployment-optional) * [2.2.8 - 🗒️ Homework](#228---️-homework) * [2.2.9 - 👣 Next Steps](#229----next-steps) ## 📕 Course Resources ### 2.2.1 - 📯 Intro to Orchestration In this section, we'll cover the basics of workflow orchestration. We'll discuss what it is, why it's important, and how it can be used to build data pipelines. Videos - 2.2.1a - What is Orchestration? [![](https://markdown-videos-api.jorgenkh.no/youtube/Li8-MWHhTbo)](https://youtu.be/Li8-MWHhTbo&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=17) Resources - [Slides](https://docs.google.com/presentation/d/17zSxG5Z-tidmgY-9l7Al1cPmz4Slh4VPK6o2sryFYvw/) ### 2.2.2 - 🧙‍♂️ Intro to Mage In this section, we'll introduce the Mage platform. We'll cover what makes Mage different from other orchestrators, the fundamental concepts behind Mage, and how to get started. To cap it off, we'll spin Mage up via Docker 🐳 and run a simple pipeline. Videos - 2.2.2a - What is Mage? 
[![](https://markdown-videos-api.jorgenkh.no/youtube/AicKRcK3pa4)](https://youtu.be/AicKRcK3pa4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=18) - 2.2.2b - Configuring Mage [![](https://markdown-videos-api.jorgenkh.no/youtube/tNiV7Wp08XE)](https://youtu.be/tNiV7Wp08XE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=19) - 2.2.2c - A Simple Pipeline [![](https://markdown-videos-api.jorgenkh.no/youtube/stI-gg4QBnI)](https://youtu.be/stI-gg4QBnI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=20) Resources - [Getting Started Repo](https://github.com/mage-ai/mage-zoomcamp) - [Slides](https://docs.google.com/presentation/d/1y_5p3sxr6Xh1RqE6N8o2280gUzAdiic2hPhYUUD6l88/) ### 2.2.3 - 🐘 ETL: API to Postgres Hooray! Mage is up and running. Now, let's build a _real_ pipeline. In this section, we'll build a simple ETL pipeline that loads data from an API into a Postgres database. Our database will be built using Docker— it will be running locally, but it's the same as if it were running in the cloud. Videos - 2.2.3a - Configuring Postgres [![](https://markdown-videos-api.jorgenkh.no/youtube/pmhI-ezd3BE)](https://youtu.be/pmhI-ezd3BE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=21) - 2.2.3b - Writing an ETL Pipeline : API to postgres [![](https://markdown-videos-api.jorgenkh.no/youtube/Maidfe7oKLs)](https://youtu.be/Maidfe7oKLs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=22) ### 2.2.4 - 🤓 ETL: API to GCS Ok, so we've written data _locally_ to a database, but what about the cloud? In this tutorial, we'll walk through the process of using Mage to extract, transform, and load data from an API to Google Cloud Storage (GCS). We'll cover both writing _partitioned_ and _unpartitioned_ data to GCS and discuss _why_ you might want to do one over the other. Many data teams start with extracting data from a source and writing it to a data lake _before_ loading it to a structured data source, like a database. 
Videos - 2.2.4a - Configuring GCP [![](https://markdown-videos-api.jorgenkh.no/youtube/00LP360iYvE)](https://youtu.be/00LP360iYvE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=23) - 2.2.4b - Writing an ETL Pipeline : API to GCS [![](https://markdown-videos-api.jorgenkh.no/youtube/w0XmcASRUnc)](https://youtu.be/w0XmcASRUnc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=24) Resources - [DTC Zoomcamp GCP Setup](../01-docker-terraform/1_terraform_gcp/2_gcp_overview.md) ### 2.2.5 - 🔍 ETL: GCS to BigQuery Now that we've written data to GCS, let's load it into BigQuery. In this section, we'll walk through the process of using Mage to load our data from GCS to BigQuery. This closely mirrors a very common data engineering workflow: loading data from a data lake into a data warehouse. Videos - 2.2.5a - Writing an ETL Pipeline : GCS to BigQuery [![](https://markdown-videos-api.jorgenkh.no/youtube/JKp_uzM-XsM)](https://youtu.be/JKp_uzM-XsM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=25) ### 2.2.6 - 👨‍💻 Parameterized Execution By now you're familiar with building pipelines, but what about adding parameters? In this video, we'll discuss some built-in runtime variables that exist in Mage and show you how to define your own! We'll also cover how to use these variables to parameterize your pipelines. Finally, we'll talk about what it means to *backfill* a pipeline and how to do it in Mage. 
Videos - 2.2.6a - Parameterized Execution [![](https://markdown-videos-api.jorgenkh.no/youtube/H0hWjWxB-rg)](https://youtu.be/H0hWjWxB-rg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=26) - 2.2.6b - Backfills [![](https://markdown-videos-api.jorgenkh.no/youtube/ZoeC6Ag5gQc)](https://youtu.be/ZoeC6Ag5gQc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=27) Resources - [Mage Variables Overview](https://docs.mage.ai/development/variables/overview) - [Mage Runtime Variables](https://docs.mage.ai/getting-started/runtime-variable) ### 2.2.7 - 🤖 Deployment (Optional) In this section, we'll cover deploying Mage using Terraform and Google Cloud. This section is optional— it's not *necessary* to learn Mage, but it might be helpful if you're interested in creating a fully deployed project. If you're using Mage in your final project, you'll need to deploy it to the cloud. Videos - 2.2.7a - Deployment Prerequisites [![](https://markdown-videos-api.jorgenkh.no/youtube/zAwAX5sxqsg)](https://youtu.be/zAwAX5sxqsg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=28) - 2.2.7b - Google Cloud Permissions [![](https://markdown-videos-api.jorgenkh.no/youtube/O_H7DCmq2rA)](https://youtu.be/O_H7DCmq2rA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=29) - 2.2.7c - Deploying to Google Cloud - Part 1 [![](https://markdown-videos-api.jorgenkh.no/youtube/9A872B5hb_0)](https://youtu.be/9A872B5hb_0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=30) - 2.2.7d - Deploying to Google Cloud - Part 2 [![](https://markdown-videos-api.jorgenkh.no/youtube/0YExsb2HgLI)](https://youtu.be/0YExsb2HgLI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=31) Resources - [Installing Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) - [Installing `gcloud` CLI](https://cloud.google.com/sdk/docs/install) - [Mage Terraform Templates](https://github.com/mage-ai/mage-ai-terraform-templates) Additional Mage Guides - 
[Terraform](https://docs.mage.ai/production/deploying-to-cloud/using-terraform) - [Deploying to GCP with Terraform](https://docs.mage.ai/production/deploying-to-cloud/gcp/setup) ### 2.2.8 - 🗒️ Homework We've prepared a short exercise to test you on what you've learned this week. You can find the homework [here](../cohorts/2024/02-workflow-orchestration/homework.md). This follows closely from the contents of the course and shouldn't take more than an hour or two to complete. 😄 ### 2.2.9 - 👣 Next Steps Congratulations! You've completed Week 2 of the Data Engineering Zoomcamp. We hope you've enjoyed learning about Mage and that you're excited to use it in your final project. If you have any questions, feel free to reach out to us on Slack. Be sure to check out our "Next Steps" video for some inspiration for the rest of your journey 😄. Videos - 2.2.9 - Next Steps [![](https://markdown-videos-api.jorgenkh.no/youtube/uUtj7N0TleQ)](https://youtu.be/uUtj7N0TleQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=32) Resources - [Slides](https://docs.google.com/presentation/d/1yN-e22VNwezmPfKrZkgXQVrX5owDb285I2HxHWgmAEQ/edit#slide=id.g262fb0d2905_0_12) ### 📑 Additional Resources - [Mage Docs](https://docs.mage.ai/) - [Mage Guides](https://docs.mage.ai/guides) - [Mage Slack](https://www.mage.ai/chat) # Community notes Did you take notes? 
You can share them here: ## 2024 notes * [2024 Videos transcripts week 2](https://drive.google.com/drive/folders/1yxT0uMMYKa6YOxanh91wGqmQUMS7yYW7?usp=sharing) by Maria Fisher * [Notes from Jonah Oliver](https://www.jonahboliver.com/blog/de-zc-w2) * [Notes from Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/2-workflow-orchestration/readme.md) * [Notes from Kirill](https://github.com/kirill505/data-engineering-zoomcamp/blob/main/02-workflow-orchestration/README.md) * [Notes from Zharko](https://www.zharconsulting.com/contents/data/data-engineering-bootcamp-2024/week-2-ingesting-data-with-mage/) * Add your notes above this line ## 2023 notes See [here](../cohorts/2023/week_2_workflow_orchestration#community-notes) ## 2022 notes See [here](../cohorts/2022/week_2_data_ingestion#community-notes) ================================================ FILE: cohorts/2024/02-workflow-orchestration/homework.md ================================================ ## Module 2 Homework ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include these directly in the README file of your repository. > In case you don't get one option exactly, select the closest one For the homework, we'll be working with the _green_ taxi dataset located here: `https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download` To get a `wget`-able link, use this prefix (note that the link itself gives 404): `https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/` ### Assignment The goal will be to construct an ETL pipeline that loads the data, performs some transformations, and writes the data to a database (and Google Cloud!). 
- Create a new pipeline, call it `green_taxi_etl`
- Add a data loader block and use Pandas to read data for the final quarter of 2020 (months `10`, `11`, `12`).
  - You can use the same datatypes and date parsing methods shown in the course.
  - `BONUS`: load the final three months using a for loop and `pd.concat`
- Add a transformer block and perform the following:
  - Remove rows where the passenger count is equal to 0 _and_ the trip distance is equal to zero.
  - Create a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date.
  - Rename columns in Camel Case to Snake Case, e.g. `VendorID` to `vendor_id`.
  - Add three assertions:
    - `vendor_id` is one of the existing values in the column (currently)
    - `passenger_count` is greater than 0
    - `trip_distance` is greater than 0
- Using a Postgres data exporter (SQL or Python), write the dataset to a table called `green_taxi` in a schema `mage`. Replace the table if it already exists.
- Write your data as Parquet files to a bucket in GCP, partitioned by `lpep_pickup_date`. Use the `pyarrow` library!
- Schedule your pipeline to run daily at 5AM UTC.

### Questions

## Question 1. Data Loading

Once the dataset is loaded, what's the shape of the data?

* 266,855 rows x 20 columns
* 544,898 rows x 18 columns
* 544,898 rows x 20 columns
* 133,744 rows x 20 columns

## Question 2. Data Transformation

Upon filtering the dataset where the passenger count is greater than 0 _and_ the trip distance is greater than zero, how many rows are left?

* 544,897 rows
* 266,855 rows
* 139,370 rows
* 266,856 rows

## Question 3. Data Transformation

Which of the following creates a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date?

* `data = data['lpep_pickup_datetime'].date`
* `data('lpep_pickup_date') = data['lpep_pickup_datetime'].date`
* `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt.date`
* `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt().date()`

## Question 4. Data Transformation

What are the existing values of `VendorID` in the dataset?

* 1, 2, or 3
* 1 or 2
* 1, 2, 3, 4
* 1

## Question 5. Data Transformation

How many columns need to be renamed to snake case?

* 3
* 6
* 2
* 4

## Question 6. Data Exporting

Once exported, how many partitions (folders) are present in Google Cloud?

* 96
* 56
* 67
* 108

## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw2
* Check the link above to see the due date

## Solution

Will be added after the due date

================================================
FILE: cohorts/2024/03-data-warehouse/homework.md
================================================
## Module 3 Homework

Solution: https://www.youtube.com/watch?v=8g_lRKaC9ro

ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or shell commands), please include these directly in the README file of your repository.

Important Note:

For this homework we will be using the 2022 Green Taxi Trip Record Parquet Files from the New York City Taxi Data found here:
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
If you are using an orchestrator such as Mage, Airflow or Prefect, do not load the data into Big Query using the orchestrator.
Stop after loading the files into a bucket.

NOTE: You will need to use the PARQUET format option when creating an External Table.
SETUP:
Create an external table using the Green Taxi Trip Records Data for 2022.
Create a table in BQ using the Green Taxi Trip Records for 2022 (do not partition or cluster this table).
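
As a rough sketch of the setup step, the external table can be declared with a DDL statement that sets `format = 'PARQUET'`. The small helper below only builds that statement as a string; the project, dataset, table, and bucket names are placeholders invented for illustration, so replace them with your own before running the resulting SQL in the Big Query console.

```python
def external_table_ddl(table_id: str, uris: list[str]) -> str:
    """Build a Big Query DDL statement for an external table over Parquet files."""
    # Quote each URI and join them into a SQL array literal.
    uri_list = ", ".join(f"'{uri}'" for uri in uris)
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table_id}`\n"
        f"OPTIONS (\n"
        f"  format = 'PARQUET',\n"
        f"  uris = [{uri_list}]\n"
        f");"
    )


# Hypothetical names (assumptions, not from the course material).
ddl = external_table_ddl(
    "my-project.my_dataset.external_green_tripdata",
    ["gs://my-bucket/green_tripdata_2022-*.parquet"],
)
print(ddl)
```

The non-partitioned, non-clustered table can then be created from the external one with a plain `CREATE OR REPLACE TABLE ... AS SELECT * FROM ...` statement.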

## Question 1:
What is the count of records for the 2022 Green Taxi Data?
- 65,623,481
- 840,402
- 1,936,423
- 253,647

## Question 2:
Write a query to count the distinct number of PULocationIDs for the entire dataset on both the tables.
What is the estimated amount of data that will be read when this query is executed on the External Table and the Table?

- 0 MB for the External Table and 6.41MB for the Materialized Table
- 18.82 MB for the External Table and 47.60 MB for the Materialized Table
- 0 MB for the External Table and 0MB for the Materialized Table
- 2.14 MB for the External Table and 0MB for the Materialized Table

## Question 3:
How many records have a fare_amount of 0?
- 12,488
- 128,219
- 112
- 1,622

## Question 4:
What is the best strategy to make an optimized table in Big Query if your query will always order the results by PUlocationID and filter based on lpep_pickup_datetime? (Create a new table with this strategy)
- Cluster on lpep_pickup_datetime Partition by PUlocationID
- Partition by lpep_pickup_datetime Cluster on PUlocationID
- Partition by lpep_pickup_datetime and Partition by PUlocationID
- Cluster on lpep_pickup_datetime and Cluster on PUlocationID

## Question 5:
Write a query to retrieve the distinct PULocationID between lpep_pickup_datetime 06/01/2022 and 06/30/2022 (inclusive)
Use the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 4 and note the estimated bytes processed. What are these values?
Choose the answer which most closely matches.
- 22.82 MB for non-partitioned table and 647.87 MB for the partitioned table
- 12.82 MB for non-partitioned table and 1.12 MB for the partitioned table
- 5.63 MB for non-partitioned table and 0 MB for the partitioned table
- 10.31 MB for non-partitioned table and 10.31 MB for the partitioned table

## Question 6:
Where is the data stored in the External Table you created?

- Big Query
- GCP Bucket
- Big Table
- Container Registry

## Question 7:
It is best practice in Big Query to always cluster your data:
- True
- False

## (Bonus: Not worth points) Question 8:
No Points: Write a `SELECT count(*)` query FROM the materialized table you created. How many bytes does it estimate will be read? Why?

## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw3

================================================
FILE: cohorts/2024/04-analytics-engineering/homework.md
================================================
## Module 4 Homework

In this homework, we'll use the models developed during the week 4 videos and enhance the already presented dbt project using the already loaded Taxi data for fhv vehicles for year 2019 in our DWH.

This means that in this homework we use the following data [Datasets list](https://github.com/DataTalksClub/nyc-tlc-data/)
* Yellow taxi data - Years 2019 and 2020
* Green taxi data - Years 2019 and 2020
* fhv data - Year 2019.

We will use the data loaded for:

* Building a source table: `stg_fhv_tripdata`
* Building a fact table: `fact_fhv_trips`
* Create a dashboard

If you don't have access to GCP, you can do this locally using the ingested data from your Postgres database instead. If you have access to GCP, you don't need to do it for local Postgres - only if you want to.
> **Note**: if your answer doesn't match exactly, select the closest option ### Question 1: **What happens when we execute dbt build --vars '{'is_test_run':'true'}'** You'll need to have completed the ["Build the first dbt models"](https://www.youtube.com/watch?v=UVI30Vxzd6c) video. - It's the same as running *dbt build* - It applies a _limit 100_ to all of our models - It applies a _limit 100_ only to our staging models - Nothing ### Question 2: **What is the code that our CI job will run? Where is this code coming from?** - The code that has been merged into the main branch - The code that is behind the creation object on the dbt_cloud_pr_ schema - The code from any development branch that has been opened based on main - The code from the development branch we are requesting to merge to main ### Question 3 (2 points) **What is the count of records in the model fact_fhv_trips after running all dependencies with the test run variable disabled (:false)?** Create a staging model for the fhv data, similar to the ones made for yellow and green data. Add an additional filter for keeping only records with pickup time in year 2019. Do not add a deduplication step. Run this models without limits (is_test_run: false). Create a core model similar to fact trips, but selecting from stg_fhv_tripdata and joining with dim_zones. Similar to what we've done in fact_trips, keep only records with known pickup and dropoff locations entries for pickup and dropoff locations. Run the dbt model without limits (is_test_run: false). - 12998722 - 22998722 - 32998722 - 42998722 ### Question 4 (2 points) **What is the service that had the most rides during the month of July 2019 month with the biggest amount of rides after building a tile for the fact_fhv_trips table and the fact_trips tile as seen in the videos?** Create a dashboard with some tiles that you find interesting to explore the data. 
One tile should show the amount of trips per month, as done in the videos for fact_trips, including the fact_fhv_trips data. - FHV - Green - Yellow - FHV and Green ## Submitting the solutions * Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw4 Deadline: 22 February (Thursday), 22:00 CET ## Solution (To be published after deadline) * Video: https://youtu.be/3OPggh5Rca8 * Answers: * Question 1: It applies a _limit 100_ only to our staging models * Question 2: The code from the development branch we are requesting to merge to main * Question 3: 22998722 * Question 4: Yellow ================================================ FILE: cohorts/2024/05-batch/homework.md ================================================ ## Module 5 Homework Solution: https://www.youtube.com/watch?v=YtddC7vJOgQ In this homework we'll put what we learned about Spark in practice. For this homework we will be using the FHV 2019-10 data found here. [FHV Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz) ### Question 1: **Install Spark and PySpark** - Install Spark - Run PySpark - Create a local spark session - Execute spark.version. What's the output? > [!NOTE] > To install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/setup/pyspark.md) ### Question 2: **FHV October 2019** Read the October 2019 FHV into a Spark Dataframe with a schema as we did in the lessons. Repartition the Dataframe to 6 partitions and save it to parquet. What is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? Select the answer which most closely matches. - 1MB - 6MB - 25MB - 87MB ### Question 3: **Count records** How many taxi trips were there on the 15th of October? Consider only trips that started on the 15th of October. 
- 108,164 - 12,856 - 452,470 - 62,610 > [!IMPORTANT] > Be aware of columns order when defining schema ### Question 4: **Longest trip for each day** What is the length of the longest trip in the dataset in hours? - 631,152.50 Hours - 243.44 Hours - 7.68 Hours - 3.32 Hours ### Question 5: **User Interface** Spark’s User Interface which shows the application's dashboard runs on which local port? - 80 - 443 - 4040 - 8080 ### Question 6: **Least frequent pickup location zone** Load the zone lookup data into a temp view in Spark
[Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv) Using the zone lookup data and the FHV October 2019 data, what is the name of the LEAST frequent pickup location Zone?
- East Chelsea - Jamaica Bay - Union Sq - Crown Heights North ## Submitting the solutions - Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw5 - Deadline: See the website ================================================ FILE: cohorts/2024/06-streaming/docker-compose.yml ================================================ version: '3.7' services: # Redpanda cluster redpanda-1: image: docker.redpanda.com/vectorized/redpanda:v22.3.5 container_name: redpanda-1 command: - redpanda - start - --smp - '1' - --reserve-memory - 0M - --overprovisioned - --node-id - '1' - --kafka-addr - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092 - --advertise-kafka-addr - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092 - --pandaproxy-addr - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082 - --advertise-pandaproxy-addr - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082 - --rpc-addr - 0.0.0.0:33145 - --advertise-rpc-addr - redpanda-1:33145 ports: # - 8081:8081 - 8082:8082 - 9092:9092 - 28082:28082 - 29092:29092 ================================================ FILE: cohorts/2024/06-streaming/homework.md ================================================ ## Module 6 Homework In this homework, we're going to extend Module 5 Homework and learn about streaming with PySpark. Instead of Kafka, we will use Red Panda, which is a drop-in replacement for Kafka. Ensure you have the following set up (if you had done the previous homework and the module): - Docker (see [module 1](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/01-docker-terraform)) - PySpark (see [module 5](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/05-batch/setup)) For this homework we will be using the files from Module 5 homework: - Green 2019-10 data from [here](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz) ## Start Red Panda Let's start redpanda in a docker container. 
There's a `docker-compose.yml` file in the homework folder (taken from [here](https://github.com/redpanda-data-blog/2023-python-gsg/blob/main/docker-compose.yml))

Copy this file to your homework directory and run

```bash
docker-compose up
```

(Add `-d` if you want to run in detached mode)

## Question 1: Redpanda version

Now let's find out the version of Redpanda.

For that, check the output of the command `rpk help` _inside the container_. The name of the container is `redpanda-1`.

Find out what you need to execute based on the `help` output.

What's the version, based on the output of the command you executed? (copy the entire version)

## Question 2. Creating a topic

Before we can send data to the Redpanda server, we need to create a topic. We also do this with the `rpk` command we used previously for figuring out the version of Redpanda.

Read the output of `help` and based on it, create a topic with name `test-topic`

What's the output of the command for creating a topic? Include the entire output in your answer.

## Question 3. Connecting to the Kafka server

We need to make sure we can connect to the server, so later we can send some data to its topics

First, let's install the kafka connector (up to you if you want to have a separate virtual environment for that)

```bash
pip install kafka-python
```

You can start a jupyter notebook in your solution folder or create a script

Let's try to connect to our server:

```python
import json
import time

from kafka import KafkaProducer


def json_serializer(data):
    return json.dumps(data).encode('utf-8')


server = 'localhost:9092'

producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=json_serializer
)

producer.bootstrap_connected()
```

Provided that you can connect to the server, what's the output of the last command?

## Question 4. Sending data to the stream

Now we're ready to send some test data:

```python
t0 = time.time()

topic_name = 'test-topic'

for i in range(10):
    message = {'number': i}
    producer.send(topic_name, value=message)
    print(f"Sent: {message}")
    time.sleep(0.05)

producer.flush()

t1 = time.time()
print(f'took {(t1 - t0):.2f} seconds')
```

How much time did it take? Where did it spend most of the time?

* Sending the messages
* Flushing
* Both took approximately the same amount of time

(Don't remove `time.sleep` when answering this question)

## Reading data with `rpk`

You can see the messages that you send to the topic with `rpk`:

```bash
rpk topic consume test-topic
```

Run the command above and send the messages one more time to see them

## Sending the taxi data

Now let's send our actual data:

* Read the green csv.gz file
* We will only need these columns:
  * `'lpep_pickup_datetime',`
  * `'lpep_dropoff_datetime',`
  * `'PULocationID',`
  * `'DOLocationID',`
  * `'passenger_count',`
  * `'trip_distance',`
  * `'tip_amount'`

Iterate over the records in the dataframe

```python
for row in df_green.itertuples(index=False):
    row_dict = {col: getattr(row, col) for col in row._fields}
    print(row_dict)
    break

    # TODO implement sending the data here
```

Note: this way of iterating over the records is more efficient compared to `iterrows`

## Question 5: Sending the Trip Data

* Create a topic `green-trips` and send the data there
* How much time in seconds did it take? (You can round it to a whole number)
* Make sure you don't include sleeps in your code

## Creating the PySpark consumer

Now let's read the data with PySpark.
Spark needs a library (jar) to be able to connect to Kafka, so we need to tell PySpark that it needs to use it:

```python
import pyspark
from pyspark.sql import SparkSession

pyspark_version = pyspark.__version__
kafka_jar_package = f"org.apache.spark:spark-sql-kafka-0-10_2.12:{pyspark_version}"

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("GreenTripsConsumer") \
    .config("spark.jars.packages", kafka_jar_package) \
    .getOrCreate()
```

Now we can connect to the stream:

```python
green_stream = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "green-trips") \
    .option("startingOffsets", "earliest") \
    .load()
```

In order to test that we can consume from the stream, let's see what will be the first record there.

In Spark streaming, the stream is represented as a sequence of small batches, each batch being a small RDD (or a small dataframe).

So we can execute a function over each mini-batch. Let's run `take(1)` there to see what we have in the stream:

```python
def peek(mini_batch, batch_id):
    first_row = mini_batch.take(1)

    if first_row:
        print(first_row[0])

query = green_stream.writeStream.foreachBatch(peek).start()
```

You should see a record like this:

```
Row(key=None, value=bytearray(b'{"lpep_pickup_datetime": "2019-10-01 00:26:02", "lpep_dropoff_datetime": "2019-10-01 00:39:58", "PULocationID": 112, "DOLocationID": 196, "passenger_count": 1.0, "trip_distance": 5.88, "tip_amount": 0.0}'), topic='green-trips', partition=0, offset=0, timestamp=datetime.datetime(2024, 3, 12, 22, 42, 9, 411000), timestampType=0)
```

Now let's stop the query, so it doesn't keep consuming messages from the stream

```python
query.stop()
```

## Question 6. Parsing the data

The data is JSON, but currently it's in binary format. We need to parse it and turn it into a streaming dataframe with proper columns.
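
To make the parsing step concrete, here is a minimal plain-Python sketch (no Spark involved) of the same idea: decode the raw bytes to a string, then parse the JSON into named fields. The payload is copied from the sample record shown earlier; `parse_value` is a helper name made up for this illustration and is not part of the homework code.

```python
import json

# Kafka message values arrive as raw bytes; this payload mirrors the sample record.
raw_value = bytearray(
    b'{"lpep_pickup_datetime": "2019-10-01 00:26:02", '
    b'"lpep_dropoff_datetime": "2019-10-01 00:39:58", '
    b'"PULocationID": 112, "DOLocationID": 196, '
    b'"passenger_count": 1.0, "trip_distance": 5.88, "tip_amount": 0.0}'
)


def parse_value(value) -> dict:
    """Decode a bytes-like message value and parse it as JSON (illustrative helper)."""
    text = bytes(value).decode("utf-8")  # roughly what casting to STRING does
    return json.loads(text)              # roughly what from_json does, minus the schema


record = parse_value(raw_value)
print(record["DOLocationID"], record["trip_distance"])
```

Unlike `json.loads`, Spark needs the column names and types declared up front, which is why the streaming version starts by defining a schema.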
Similarly to what we did with PySpark in Module 5, we define the schema:

```python
from pyspark.sql import types

schema = types.StructType() \
    .add("lpep_pickup_datetime", types.StringType()) \
    .add("lpep_dropoff_datetime", types.StringType()) \
    .add("PULocationID", types.IntegerType()) \
    .add("DOLocationID", types.IntegerType()) \
    .add("passenger_count", types.DoubleType()) \
    .add("trip_distance", types.DoubleType()) \
    .add("tip_amount", types.DoubleType())
```

And apply this schema:

```python
from pyspark.sql import functions as F

green_stream = green_stream \
    .select(F.from_json(F.col("value").cast('STRING'), schema).alias("data")) \
    .select("data.*")
```

How does the record look after parsing? Copy the output.

### Question 7: Most popular destination

Now let's finally do some streaming analytics. We will see what's the most popular destination currently based on our stream of data (which ideally we should have sent with delays like we did in workshop 2)

This is how you can do it:

* Add a column "timestamp" using the `current_timestamp` function
* Group by:
  * 5 minutes window based on the timestamp column (`F.window(F.col("timestamp"), "5 minutes")`)
  * `"DOLocationID"`
* Order by count

You can print the output to the console using this code

```python
query = popular_destinations \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", "false") \
    .start()

query.awaitTermination()
```

Write the most popular destination, your answer should be *either* the zone ID or the zone name of this destination. (You will need to re-send the data for this to work)

## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw6

## Solution

We will publish the solution here after deadline.
================================================ FILE: cohorts/2024/README.md ================================================ ## Data Engineering Zoomcamp 2024 Cohort * [Pre-launch Q&A stream](https://www.youtube.com/watch?v=91b8u9GmqB4) * [Launch stream with course overview](https://www.youtube.com/live/AtRhA-NfS24?si=5JzA_E8BmJjiLi8l) * [Deadline calendar](https://docs.google.com/spreadsheets/d/e/2PACX-1vQACMLuutV5rvXg5qICuJGL-yZqIV0FBD84CxPdC5eZHf8TfzB-CJT_3Mo7U7oGVTXmSihPgQxuuoku/pubhtml) * [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html) * Course Playlist: Only 2024 Live videos & homeworks (TODO) * [Public Leaderboard of Top-100 Participants](leaderboard.md) [**Module 1: Introduction & Prerequisites**](01-docker-terraform/) * [Homework](01-docker-terraform/homework.md) [**Module 2: Workflow Orchestration**](02-workflow-orchestration) * [Homework](02-workflow-orchestration/homework.md) * Office hours [**Workshop 1: Data Ingestion**](workshops/dlt.md) * Workshop with dlt * [Homework](workshops/dlt.md) [**Module 3: Data Warehouse**](03-data-warehouse) * [Homework](03-data-warehouse/homework.md) [**Module 4: Analytics Engineering**](04-analytics-engineering/) * [Homework](04-analytics-engineering/homework.md) [**Module 5: Batch processing**](05-batch/) * [Homework](05-batch/homework.md) [**Module 6: Stream Processing**](06-streaming) * [Homework](06-streaming/homework.md) [**Project**](project.md) More information [here](project.md) ================================================ FILE: cohorts/2024/leaderboard.md ================================================ ## Leaderboard This is the top [100 leaderboard](https://courses.datatalks.club/de-zoomcamp-2024/leaderboard) of participants of Data Engineering Zoomcamp 2024 edition!
Name Projects Social Comments
Ashraf Mohammad
comment Really Recommend this bootcamp , if you want to get hands on data engineering experience. My two Capstone project: www.github.com/Ashraf1395/supply_chain_finance, www.github.com/Ashraf1395/customer_retention_analytics
Jorge Vladimir Abrego Arevalo
Purnendu Shekhar Shukla
Krishna Anand
Abhijit Chakraborty
Hekmatullah Sajid
Lottie Jane Pollard
AviAnna
Ketut Garjita
comment I would like to express my thanks and appreciation to the Data Talks Club for organizing this excellent Data Engineering Zoomcamp training. This made me valuable experience in deepening new knowledge for me even though previously I had mostly worked as a Database Administrator for various platform databases. Thank you also to the community (datatalks-club.slack.com), especially slack course-data-engineering, as well as other slack communities such as mageai.slack.com.
Diogo Costa
comment Great course! Check out my YouTube channel: https://www.youtube.com/@TechWithCosta
Francisco Ortiz Tena
comment It is an awesome course!
Nevenka Lukic
comment This DE Zoomcamp was fantastic learning and networking experiences. Many thanks to organizers and big recommendations to anyone!
Mukhammad Sofyan Rizka Akbar
comment Thanks for providing this course, especially for Alexey and other Datatalk hosts and I hope I can join ML, ML Ops, and LLM Zoomcamp. See you soon :)
Mahmoud Mahdy Zaky
Brilliant Pancake
Jobert M. Gutierrez
Olusegun Samson Ayeni
Lily Chau
comment Big thank you to Alexey and all other speakers. This is one of the best online learning platforms I have ever come across.
Aleksandr Kolmakov
Kang Zhi Yong
Eduardo Muñoz Sala
Kirill Bazarov
Shayan Shafiee Moghadam
Landry N.
comment Thanks for the awsome course.
Condescending Austin
Lee Durbin
Loving Einstein
Carlos Vecina Tebar
Abiodun Oki
comment thoroughly enjoyed the course, great work Alexey & course team!
Jimoh
Sleepy Villani
Ella Cinders
Max Lutz
Jessica De Silva
Daniel Okello
Kirill Sitnikov
comment Thank you Alexey and all DTC team! I’m so glad that I knew about your courses and projects!
edumad
Duy Quoc Vo
comment NA
Xiang Li
Sugeng Wahyudi
comment Thanks a lot, this was amazing. Can't miss another course and zoomcamp from datatalks.club
Anatolii Kryvko
David Vanegas
Honey Badger
Abdelrahman Kamal
Jean Paul Rodriguez
Eager Pasteur
Damian Pszczoła
ManPrat
forrest_parnassus
Ramazan Abylkassov
comment Look mom, I am on leaderboard!
Digamber Deshmukh
Andrew Lee
Matt R
Raul Antonio Catacora Grundy
comment I just want to thank everyone, all the instructors, collaborators for creating this amazing set of resources and such a solid community based on sharing and caring. Many many thanks and shout out to you guys
Ranga H.
Salma Gouda
Artsiom Turevich
comment A long time ago in a galaxy far, far away...
Abhirup Ghosh
Sonny Pham
Peter Tran
Ritika Tilwalia
Eager Yalow
Dave Samaniego
comment Thank you DataTalksClub for the course. It was challenging learning many new things, but I had fun along the way too!
Lucid Keldysh
Isaac Ndirangu Muturi
comment Amazing learning experience
Agitated Wing
Hanaa HAMMAD
comment Grateful to this great course
Jonah Oliver
Paul Emilio Arizpe Colorado
comment DataTalksClub brought me the opportunity to learn data engineering. Thanks for all :D
Asma-Chloë FARAH
comment Thank you for this amazing zoomcamp ! It was really fun !
Happy Feistel
Luca Pugliese
comment it has been a crowdlearning experience! starting in thousands of us. 359 graduated in the end. Proud to have classified 59th. Thanks to all.
Jake Maund
Aditya Phulallwar
Dave Wilson
Haitham Hussein Hamad
Alexandre Bergere aka Rocket
TOGBAN COKOUVI Joyce Elvis Mahoutondji
Sad Robinson
Tetiana Omelchenko
Amanda Kershaw
comment This course was incredibly rewarding and absolutely worth the effort.
Kristjan Sert
Murad Arfanyan
Ecstatic Hofstadter
Chung Huu Tin
Zen Mayer
Zhastay Yeltay
comment ;)
AV3NII
Sebastian Alejandro Peralta Casafranca
Relaxed Williams
George Mouratos
comment -
mhmed ahmed rjb
Frosty Jackson
WANJOHI
Ighorr Holstrom
Jesse Delzio
Khalil El Daou
comment Already made a post about the zoomcamp
Juan Rojas
Gonçalo
Muhamad Farikhin
Bold Lederberg
Taras Shalaiko
================================================
FILE: cohorts/2024/project.md
================================================
## Course Project

The goal of this project is to apply everything we learned in this course and build an end-to-end data pipeline.

You will have two attempts to submit your project. If you don't have time to submit your project by the end of attempt #1 (you started the course late, you have vacation plans, life/work got in the way, etc.) or you fail your first attempt, then you will have a second chance to submit your project as attempt #2.

There are only two attempts.

Remember that to pass the project, you must evaluate 3 peers. If you don't do that, your project can't be considered complete.

To find the projects assigned to you, use the peer review assignments link and find your hash in the first column. You will see three rows: you need to evaluate each of these projects. For each project, you need to submit the form once, so in total, you will make three submissions.

### Submitting

#### Project Attempt #1

* Project: https://courses.datatalks.club/de-zoomcamp-2024/project/project1
* Review: https://courses.datatalks.club/de-zoomcamp-2024/project/project1/eval

#### Project Attempt #2

* Project: https://courses.datatalks.club/de-zoomcamp-2024/project/project2
* Review: https://courses.datatalks.club/de-zoomcamp-2024/project/project2/eval

> **Important**: update your "Certificate name" here: https://courses.datatalks.club/de-zoomcamp-2024/enrollment - this is what we will use when generating certificates for you.

### Evaluation criteria

See [here](../../week_7_project/README.md)

================================================
FILE: cohorts/2024/workshops/dlt.md
================================================
# Data ingestion with dlt

In this hands-on workshop, we’ll learn how to build data ingestion pipelines.

We’ll cover the following steps:

* Extracting data from APIs, or files.
* Normalizing and loading data
* Incremental loading

By the end of this workshop, you’ll be able to write data pipelines like a senior data engineer: Quickly, concisely, scalable, and self-maintaining.

Video: https://www.youtube.com/live/oLXhBM7nf2Q

---

# Navigation

* [Workshop content](dlt_resources/data_ingestion_workshop.md)
* [Workshop notebook](dlt_resources/workshop.ipynb)
* [Homework starter notebook](dlt_resources/homework_starter.ipynb)

# Resources

- Website and community: Visit our [docs](https://dlthub.com/docs/intro), discuss on our slack (Link at top of docs).
- Course colab: [Notebook](https://colab.research.google.com/drive/1kLyD3AL-tYf_HqCXYnA3ZLwHGpzbLmoj#scrollTo=5aPjk0O3S_Ag&forceEdit=true&sandboxMode=true).
- dlthub [community Slack](https://dlthub.com/community).

---

# Teacher

Welcome to the DataTalks.Club Data Engineering Zoomcamp, the data ingestion workshop.

- My name is [Adrian](https://www.linkedin.com/in/data-team/), and I have worked in the data field since 2012
  - I built many data warehouses, some lakes, and a few data teams
  - 10 years into my career I started working on dlt “data load tool”, which is an open source library to enable data engineers to build faster and better.
- I started working on dlt because data engineering is one of the few areas of software engineering where we do not have developer tools to do our work.
- Building better pipelines would require more code re-use - we cannot all just build perfect pipelines from scratch every time.
- And so dlt was born, a library that automates the tedious part of data ingestion: Loading, schema management, data type detection, scalability, self healing, scalable extraction… you get the idea - essentially a data engineer’s “one stop shop” for best practice data pipelining.
- Due to its **simplicity** of use, dlt enables **laymen** to
  - Build pipelines 5-10x faster than without it
  - Build self healing, self maintaining pipelines with all the best practices of data engineers. Automating schema changes removes the bulk of maintenance efforts.
  - Govern your pipelines with schema evolution alerts and data contracts.
  - and generally develop pipelines like a senior, commercial data engineer.

---

# Course

You can find the course file [here](./dlt_resources/data_ingestion_workshop.md)

The course has 3 parts

- [Extraction Section](./dlt_resources/data_ingestion_workshop.md#extracting-data): In this section we will learn about scalable extraction
- [Normalisation Section](./dlt_resources/data_ingestion_workshop.md#normalisation): In this section we will learn to prepare data for loading
- [Loading Section](./dlt_resources/data_ingestion_workshop.md#incremental-loading): Here we will learn about incremental loading modes

---

# Homework

The [linked colab notebook](https://colab.research.google.com/drive/1Te-AT0lfh0GpChg1Rbd0ByEKOHYtWXfm#scrollTo=wLF4iXf-NR7t&forceEdit=true&sandboxMode=true) offers a few exercises to practice what you learned today.

#### Question 1: What is the sum of the outputs of the generator for limit = 5?
- **A**: 10.23433234744176
- **B**: 7.892332347441762
- **C**: 8.382332347441762
- **D**: 9.123332347441762

#### Question 2: What is the 13th number yielded by the generator?
- **A**: 4.236551275463989
- **B**: 3.605551275463989
- **C**: 2.345551275463989
- **D**: 5.678551275463989

#### Question 3: Append the 2 generators. After correctly appending the data, calculate the sum of all ages of people.
- **A**: 353
- **B**: 365
- **C**: 378
- **D**: 390

#### Question 4: Merge the 2 generators using the ID column. Calculate the sum of ages of all the people loaded as described above.
- **A**: 215
- **B**: 266
- **C**: 241
- **D**: 258

Submit the solution here: https://courses.datatalks.club/de-zoomcamp-2024/homework/workshop1

---

# Next steps

As you are learning the various concepts of data engineering, consider creating a portfolio project that will further your own knowledge.
By demonstrating the ability to deliver end to end, you will have an easier time finding your first role. This will help regardless of whether your hiring manager reviews your project, largely because you will have a better understanding and will be able to talk the talk. Here are some example projects that others did with dlt: - Serverless dlt-dbt on cloud functions: [Article](https://docs.getdbt.com/blog/serverless-dlt-dbt-stack) - Bird finder: [Part 1](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-i), [Part 2](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-ii) - Event ingestion on GCP: [Article and repo](https://dlthub.com/docs/blog/streaming-pubsub-json-gcp) - Event ingestion on AWS: [Article and repo](https://dlthub.com/docs/blog/dlt-aws-taktile-blog) - Or see one of the many demos created by our working students: [Hacker news](https://dlthub.com/docs/blog/hacker-news-gpt-4-dashboard-demo), [GA4 events](https://dlthub.com/docs/blog/ga4-internal-dashboard-demo), [an E-Commerce](https://dlthub.com/docs/blog/postgresql-bigquery-metabase-demo), [google sheets](https://dlthub.com/docs/blog/google-sheets-to-data-warehouse-pipeline), [Motherduck](https://dlthub.com/docs/blog/dlt-motherduck-demo), [MongoDB + Holistics](https://dlthub.com/docs/blog/MongoDB-dlt-Holistics), [Deepnote](https://dlthub.com/docs/blog/deepnote-women-wellness-violence-tends), [Prefect](https://dlthub.com/docs/blog/dlt-prefect), [PowerBI vs GoodData vs Metabase](https://dlthub.com/docs/blog/semantic-modeling-tools-comparison), [Dagster](https://dlthub.com/docs/blog/dlt-dagster), [Ingesting events via gcp webhooks](https://dlthub.com/docs/blog/dlt-webhooks-on-cloud-functions-for-event-capture), [SAP to snowflake replication](https://dlthub.com/docs/blog/sap-hana-to-snowflake-demo-blog), [Read emails and send sumamry to slack with AI and Kestra](https://dlthub.com/docs/blog/dlt-kestra-demo-blog), [Mode +dlt 
capabilities](https://dlthub.com/docs/blog/dlt-mode-blog), [dbt on cloud functions](https://dlthub.com/docs/blog/dlt-dbt-runner-on-cloud-functions) - If you want to use dlt in your project, [check this list of public APIs](https://dlthub.com/docs/blog/practice-api-sources) If you create a personal project, consider submitting it to our blog - we will be happy to showcase it. Just drop us a line in the dlt slack. **And don't forget, if you like dlt** - **Give us a [GitHub Star!](https://github.com/dlt-hub/dlt)** - **Join our [Slack community](https://dlthub.com/community)** # Notes * Add your notes here ================================================ FILE: cohorts/2024/workshops/dlt_resources/data_ingestion_workshop.md ================================================ # Intro What is data loading, or data ingestion? Data ingestion is the process of extracting data from a producer, transporting it to a convenient environment, and preparing it for usage by normalising it, sometimes cleaning, and adding metadata. ### “A wild dataset magically appears!” In many data science teams, data magically appears - because the engineer loads it. - Sometimes the format in which it appears is structured, and with explicit schema - In that case, they can go straight to using it; Examples: Parquet, Avro, or table in a db, - Sometimes the format is weakly typed and without explicit schema, such as csv, json - in which case some extra normalisation or cleaning might be needed before usage > 💡 **What is a schema?** The schema specifies the expected format and structure of data within a document or data store, defining the allowed keys, their data types, and any constraints or relationships. ### Be the magician! 😎 Since you are here to learn about data engineering, you will be the one making datasets magically appear. 
Here’s what you need to learn to build pipelines - Extracting data - Normalising, cleaning, adding metadata such as schema and types - and Incremental loading, which is vital for fast, cost effective data refreshes. ### What else does a data engineer do? What are we not learning, and what are we learning? - It might seem simplistic, but in fact a data engineer’s main goal is to ensure data flows from source systems to analytical destinations. - So besides building pipelines, running pipelines and fixing pipelines, a data engineer may also focus on optimising data storage, ensuring data quality and integrity, implementing effective data governance practices, and continuously refining data architecture to meet the evolving needs of the organisation. - Ultimately, a data engineer's role extends beyond the mechanical aspects of pipeline development, encompassing the strategic management and enhancement of the entire data lifecycle. - This workshop focuses on building robust, scalable, self maintaining pipelines, with built in governance - in other words, best practices applied. # Extracting data ### The considerations of extracting data In this section we will learn about extracting data from source systems, and what to care about when doing so. Most data is stored behind an API - Sometimes that’s a RESTful api for some business application, returning records of data. - Sometimes the API returns a secure file path to something like a json or parquet file in a bucket that enables you to grab the data in bulk, - Sometimes the API is something else (mongo, sql, other databases or applications) and will generally return records as JSON - the most common interchange format. As an engineer, you will need to build pipelines that “just work”. So here’s what you need to consider on extraction, to prevent the pipelines from breaking, and to keep them running smoothly. - Hardware limits: During this course we will cover how to navigate the challenges of managing memory. 
- Network limits: Sometimes networks can fail. We can’t fix what could go wrong but we can retry network jobs until they succeed. For example, dlt library offers a requests “replacement” that has built in retries. [Docs](https://dlthub.com/docs/reference/performance#using-the-built-in-requests-client). We won’t focus on this during the course but you can read the docs on your own. - Source api limits: Each source might have some limits such as how many requests you can do per second. We would call these “rate limits”. Read each source’s docs carefully to understand how to navigate these obstacles. You can find some examples of how to wait for rate limits in our verified sources repositories - examples: [Zendesk](https://developer.zendesk.com/api-reference/introduction/rate-limits/), [Shopify](https://shopify.dev/docs/api/usage/rate-limits) ### Extracting data without hitting hardware limits What kind of limits could you hit on your machine? In the case of data extraction, the only limits are memory and storage. This refers to the RAM or virtual memory, and the disk, or physical storage. ### **Managing memory.** - Many data pipelines run on serverless functions or on orchestrators that delegate the workloads to clusters of small workers. - These systems have a small memory or share it between multiple workers - so filling the memory is BAAAD: It might lead to not only your pipeline crashing, but crashing the entire container or machine that might be shared with other worker processes, taking them down too. - The same can be said about disk - in most cases your disk is sufficient, but in some cases it’s not. For those cases, mounting an external drive mapped to a storage bucket is the way to go. Airflow for example supports a “data” folder that is used just like a local folder but can be mapped to a bucket for unlimited capacity. ### So how do we avoid filling the memory? 
- We often do not know the volume of data upfront - And we cannot scale dynamically or infinitely on hardware during runtime - So the answer is: Control the max memory you use ### Control the max memory used by streaming the data Streaming here refers to processing the data event by event or chunk by chunk instead of doing bulk operations. Let’s look at some classic examples of streaming where data is transferred chunk by chunk or event by event - Between an audio broadcaster and an in-browser audio player - Between a server and a local video player - Between a smart home device or IoT device and your phone - between google maps and your navigation app - Between instagram live and your followers What do data engineers do? We usually stream the data between buffers, such as - from API to local file - from webhooks to event queues - from event queue (Kafka, SQS) to Bucket ### Streaming in python via generators Let’s focus on how we build most data pipelines: - To process data in a stream in python, we use generators, which are functions that can return multiple times - by allowing multiple returns, the data can be released as it’s produced, as stream, instead of returning it all at once as a batch. Take the following theoretical example: - We search twitter for “cat pictures”. We do not know how many pictures will be returned - maybe 10, maybe 10.000.000. Will they fit in memory? Who knows. - So to grab this data without running out of memory, we would use a python generator. - What’s a generator? In simple words, it’s a function that can return multiple times. Here’s an example of a regular function, and how that function looks if written as a generator. ### Generator examples: Let’s look at a regular returning function, and how we can re-write it as a generator. **Regular function collects data in memory.** Here you can see how data is collected row by row in a list called `data`before it is returned. This will break if we have more data than memory. 
```python
# paginated_get() is a placeholder for a function that requests results page by page
def search_twitter(query):
    data = []
    for row in paginated_get(query):
        data.append(row)
    return data

# Collect all the cat picture data
for row in search_twitter("cat pictures"):
    # Once collected,
    # print row by row
    print(row)
```

When calling `for row in search_twitter("cat pictures"):` all the data must first be downloaded before the first record is returned.

Let’s see how we could rewrite this as a generator.

**Generator for streaming the data.** The memory usage here is minimal.

As you can see, in the modified function, we yield each row as we get the data, without collecting it into memory. We can then run this generator and handle the data item by item.

```python
# paginated_get() is a placeholder for a function that requests results page by page
def search_twitter(query):
    for row in paginated_get(query):
        yield row

# Get one row at a time
for row in search_twitter("cat pictures"):
    # print the row
    print(row)
    # do something with the row such as cleaning it and writing it to a buffer
    # continue requesting and printing data
```

When calling `for row in search_twitter("cat pictures"):` the function only runs until the first data item is yielded, before printing - so we do not need to wait long for the first value. It will then continue until there is no more data to get.

If we wanted to get all the values at once from a generator instead of one by one, we would need to first “run” the generator and collect the data. For example, if we wanted to get all the data in memory we could do `data = list(search_twitter("cat pictures"))` which would run the generator and collect all the data in a list before continuing.

## 3 Extraction examples:

### Example 1: Grabbing data from an api

> 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab or in your local setup.

For these purposes we created an api that can serve the data you are already familiar with, the NYC taxi dataset.
The api documentation is as follows: - There are a limited nr of records behind the api - The data can be requested page by page, each page containing 1000 records - If we request a page with no data, we will get a successful response with no data - so this means that when we get an empty page, we know there is no more data and we can stop requesting pages - this is a common way to paginate but not the only one - each api may be different. - details: - method: get - url: `https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api` - parameters: `page` integer. Represents the page number you are requesting. Defaults to 1. So how do we design our requester? - We need to request page by page until we get no more data. At this point, we do not know how much data is behind the api. - It could be 1000 records or it could be 10GB of records. So let’s grab the data with a generator to avoid having to fit an undetermined amount of data into ram. In this approach to grabbing data from apis, we have pros and cons: - Pros: **Easy memory management** thanks to api returning events/pages - Cons: **Low throughput**, due to the data transfer being constrained via an API. 
```python
import requests

BASE_API_URL = "https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api"

# I call this a paginated getter
# as it's a function that gets data
# and also paginates until there is no more data
# by yielding pages, we "microbatch", which speeds up downstream processing

def paginated_getter():
    page_number = 1

    while True:
        # Set the query parameters
        params = {'page': page_number}

        # Make the GET request to the API
        response = requests.get(BASE_API_URL, params=params)
        response.raise_for_status()  # Raise an HTTPError for bad responses
        page_json = response.json()
        print(f'got page number {page_number} with {len(page_json)} records')

        # if the page has no records, stop iterating
        if page_json:
            yield page_json
            page_number += 1
        else:
            # No more data, break the loop
            break

if __name__ == '__main__':
    # Use the generator to iterate over pages
    for page_data in paginated_getter():
        # Process each page as needed
        print(page_data)
```

### Example 2: Grabbing the same data from file - simple download

> 💡 This part is demonstrative, so you do not need to follow along; just pay attention.

- Why am I showing you this? So that when you do this in the future, you will remember there is a best practice you can apply for scalability.

Some apis respond with files instead of pages of data. The reason for this is simple: throughput and cost. A restful api that returns data has to read the data from storage, process it and return it to you by some logic - if this data is large, this costs time and money and creates a bottleneck.

A better way is to offer the data as files that someone can download from storage directly, without going through the restful api layer. This is common for apis that offer large volumes of data, such as ad impressions data.

In this example, we grab exactly the same data as we did in the API example above, but now we get it from the underlying file instead of going through the API.
- Pros: **High throughput** - Cons: **Memory** is used to hold all the data This is how the code could look. As you can see in this case our `data`and `parsed_data` variables hold the entire file data in memory before returning it. Not great. ```python import requests import json url = "https://storage.googleapis.com/dtc_zoomcamp_api/yellow_tripdata_2009-06.jsonl" def download_and_read_jsonl(url): response = requests.get(url) response.raise_for_status() # Raise an HTTPError for bad responses data = response.text.splitlines() parsed_data = [json.loads(line) for line in data] return parsed_data downloaded_data = download_and_read_jsonl(url) if downloaded_data: # Process or print the downloaded data as needed print(downloaded_data[:5]) # Print the first 5 entries as an example ``` ### Example 3: Same file, streaming download > 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab Ok, downloading files is simple, but what if we want to do a stream download? That’s possible too - in effect giving us the best of both worlds. In this case we prepared a jsonl file which is already split into lines making our code simple. But json (not jsonl) files could also be downloaded in this fashion, for example using the `ijson` library. What are the pros and cons of this method of grabbing data? Pros: **High throughput, easy memory management,** because we are downloading a file Cons: **Difficult to do for columnar file formats**, as entire blocks need to be downloaded before they can be deserialised to rows. Sometimes, the code is complex too. Here’s what the code looks like - in a jsonl file each line is a json document, or a “row” of data, so we yield them as they get downloaded. This allows us to download one row and process it before getting the next row. 
```python
import requests
import json

def download_and_yield_rows(url):
    # stream=True keeps the response body on the wire until we iterate over it
    response = requests.get(url, stream=True)
    response.raise_for_status()  # Raise an HTTPError for bad responses

    for line in response.iter_lines():
        if line:
            yield json.loads(line)

# Replace the URL with your actual URL
url = "https://storage.googleapis.com/dtc_zoomcamp_api/yellow_tripdata_2009-06.jsonl"

# Use the generator to iterate over rows with minimal memory usage
for row in download_and_yield_rows(url):
    # Process each row as needed
    print(row)
```

In the colab notebook you can also find a code snippet to load the data - but we will load some data later in the course and you can explore the colab on your own after the course.

What is worth keeping in mind at this point is that the loader library we will use later, `dlt` (data load tool), will respect the streaming concept of the generator and will process it in an efficient way, keeping memory usage low and using parallelism where possible.

Let’s move over to the Colab notebook and run examples 2 and 3, compare them, and finally load examples 1 and 3 to DuckDB.

# Normalising data

You often hear that data people spend most of their time “cleaning” data. What does this mean?

Let’s look granularly into what people consider data cleaning.

Usually we have 2 parts:

- Normalising data without changing its meaning,
- and filtering data for a use case, which changes its meaning.

### Part of what we often call data cleaning is just metadata work:

- Add types (string to number, string to timestamp, etc)
- Rename columns: Ensure column names follow a supported standard downstream - such as no strange characters in the names.
- Flatten nested dictionaries: Bring nested dictionary values into the top dictionary row
- Unnest lists or arrays into child tables: Arrays or lists cannot be flattened into their parent record, so if we want flat data we need to break them out into separate tables.
- We will look at a practical example next, as these concepts can be difficult to visualise from text. ### **Why prepare data? why not use json as is?** - We do not easily know what is inside a json document due to lack of schema - Types are not enforced between rows of json - we could have one record where age is `25`and another where age is `twenty five` , and another where it’s `25.00`. Or in some systems, you might have a dictionary for a single record, but a list of dicts for multiple records. This could easily lead to applications downstream breaking. - We cannot just use json data easily, for example we would need to convert strings to time if we want to do a daily aggregation. - Reading json loads more data into memory, as the whole document is scanned - while in parquet or databases we can scan a single column of a document. This causes costs and slowness. - Json is not fast to aggregate - columnar formats are. - Json is not fast to search. - Basically json is designed as a "lowest common denominator format" for "interchange" / data transfer and is unsuitable for direct analytical usage. ### Practical example > 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab notebook. In the case of the NY taxi rides data, the dataset is quite clean - so let’s instead use a small example of more complex data. Let’s assume we know some information about passengers and stops. For this example we modified the dataset as follows - We added nested dictionaries ```json "coordinates": { "start": { "lon": -73.787442, "lat": 40.641525 }, ``` - We added nested lists ```json "passengers": [ {"name": "John", "rating": 4.9}, {"name": "Jack", "rating": 3.9} ], ``` - We added a record hash that gives us an unique id for the record, for easy identification ```json "record_hash": "b00361a396177a9cb410ff61f20015ad", ``` We want to load this data to a database. How do we want to clean the data? 
- We want to flatten dictionaries into the base row - We want to flatten lists into a separate table - We want to convert time strings into time type ```python data = [ { "vendor_name": "VTS", "record_hash": "b00361a396177a9cb410ff61f20015ad", "time": { "pickup": "2009-06-14 23:23:00", "dropoff": "2009-06-14 23:48:00" }, "Trip_Distance": 17.52, "coordinates": { "start": { "lon": -73.787442, "lat": 40.641525 }, "end": { "lon": -73.980072, "lat": 40.742963 } }, "Rate_Code": None, "store_and_forward": None, "Payment": { "type": "Credit", "amt": 20.5, "surcharge": 0, "mta_tax": None, "tip": 9, "tolls": 4.15, "status": "booked" }, "Passenger_Count": 2, "passengers": [ {"name": "John", "rating": 4.9}, {"name": "Jack", "rating": 3.9} ], "Stops": [ {"lon": -73.6, "lat": 40.6}, {"lon": -73.5, "lat": 40.5} ] }, ] ``` Now let’s normalise this data. ## Introducing dlt dlt is a python library created for the purpose of assisting data engineers to build simpler, faster and more robust pipelines with minimal effort. You can think of dlt as a loading tool that implements the best practices of data pipelines enabling you to just “use” those best practices in your own pipelines, in a declarative way. This enables you to stop reinventing the flat tyre, and leverage dlt to build pipelines much faster than if you did everything from scratch. dlt automates much of the tedious work a data engineer would do, and does it in a way that is robust. dlt can handle things like: - Schema: Inferring and evolving schema, alerting changes, using schemas as data contracts. - Typing data, flattening structures, renaming columns to fit database standards. In our example we will pass the “data” you can see above and see it normalised. - Processing a stream of events/rows without filling memory. This includes extraction from generators. - Loading to a variety of dbs or file formats. Let’s use it to load our nested json to duckdb: Here’s how you would do that on your local machine. 
I will walk you through before showing you in colab as well.

First, install dlt

```bash
# Make sure you are using Python 3.8-3.11 and have pip installed
# spin up a venv
python -m venv ./env
source ./env/bin/activate
# pip install (the quotes stop shells such as zsh from expanding the brackets)
pip install "dlt[duckdb]"
```

Next, grab your data from above and run this snippet

- here we define a pipeline, which is a connection to a destination
- and we run the pipeline, printing the outcome

```python
import dlt

# `data` is the list of nested records defined above

# define the connection to load to.
# We now use duckdb, but you can switch to Bigquery later
pipeline = dlt.pipeline(pipeline_name="taxi_data",
                        destination='duckdb',
                        dataset_name='taxi_rides')

# run the pipeline with default settings, and capture the outcome
info = pipeline.run(data,
                    table_name="users",
                    write_disposition="replace")

# show the outcome
print(info)
```

If you are running dlt locally you can use the built in streamlit app by running the cli command with the pipeline name we chose above.

```bash
dlt pipeline taxi_data show
```

Or explore the data in the linked colab notebook. I’ll switch to it now to show you the data.

# Incremental loading

Incremental loading means that as we update our datasets with the new data, we would only load the new data, as opposed to making a full copy of a source’s data all over again and replacing the old version.

By loading incrementally, our pipelines run faster and cheaper.

- Incremental loading goes hand in hand with incremental extraction and state, two concepts which we will not delve into during this workshop
- `State` is information that keeps track of what was loaded, to know what else remains to be loaded. dlt stores the state at the destination in a separate table.
- Incremental extraction refers to only requesting the increment of data that we need, and not more. This is tightly connected to the state to determine the exact chunk that needs to be extracted and loaded.
- You can learn more about incremental extraction and state by reading the dlt docs on how to do it.
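
To make the idea of state a little more concrete, here is a minimal sketch in plain Python. It is not how dlt implements state; `SOURCE_ROWS`, `extract_increment` and the `last_pickup` cursor are hypothetical names chosen for illustration, and the state lives in a dictionary instead of a table at the destination.

```python
# Hypothetical source rows; in practice these would come from an API or a database.
SOURCE_ROWS = [
    {"id": 1, "pickup": "2009-06-14 23:23:00"},
    {"id": 2, "pickup": "2009-06-15 08:10:00"},
    {"id": 3, "pickup": "2009-06-16 09:45:00"},
]

def extract_increment(rows, state):
    """Yield only rows newer than the cursor kept in `state`, then advance the cursor."""
    cursor = state.get("last_pickup", "")
    newest = cursor
    for row in rows:
        # Timestamps in this fixed format compare correctly as strings
        if row["pickup"] > cursor:
            newest = max(newest, row["pickup"])
            yield row
    # Runs once the generator is exhausted: remember how far we got
    state["last_pickup"] = newest

state = {}
# First run: the source only has two rows so far
first_run = list(extract_increment(SOURCE_ROWS[:2], state))
# Second run: a third row has arrived, and only that row is extracted
second_run = list(extract_increment(SOURCE_ROWS, state))

print([row["id"] for row in first_run])
print([row["id"] for row in second_run])
```

The cursor is only advanced after the whole increment has been consumed, so a run that stops halfway would request the same rows again next time.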
### dlt currently supports 2 ways of loading incrementally: 1. Append: - We can use this for immutable or stateless events (data that doesn’t change), such as taxi rides - For example, every day there are new rides, and we could load the new ones only instead of the entire history. - We could also use this to load different versions of stateful data, for example for creating a “slowly changing dimension” table for auditing changes. For example, if we load a list of cars and their colors every day, and one day one car changes color, we need both sets of data to be able to discern that a change happened. 2. Merge: - We can use this to update data that changes. - For example, a taxi ride could have a payment status, which is originally “booked” but could later be changed into “paid”, “rejected” or “cancelled” Here is how you can think about which method to use: ![Incremental Loading](./incremental_loading.png) * If you want to keep track of when changes occur in stateful data (slowly changing dimension) then you will need to append the data ### Let’s do a merge example together: > 💡 This is the bread and butter of data engineers pulling data, so follow along. - In our previous example, the payment status changed from "booked" to “cancelled”. Perhaps Jack likes to fraud taxis and that explains his low rating. Besides the ride status change, he also got his rating lowered further. - The merge operation replaces an old record with a new one based on a key. The key could consist of multiple fields or a single unique id. We will use record hash that we created for simplicity. If you do not have a unique key, you could create one deterministically out of several fields, such as by concatenating the data and hashing it. - A merge operation replaces rows, it does not update them. If you want to update only parts of a row, you would have to load the new data by appending it and doing a custom transformation to combine the old and new data. 
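
If your data has no unique id, building a deterministic key and replacing rows by that key can be sketched as follows. This is an illustration with Python's standard library only, not dlt functionality; `make_record_hash`, `merge_rows` and the choice of key fields are assumptions made for the example.

```python
import hashlib

def make_record_hash(row, key_fields):
    """Build a deterministic id by concatenating the chosen fields and hashing them."""
    joined = "|".join(str(row[field]) for field in key_fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def merge_rows(existing, new_rows, key_fields):
    """Replace whole rows that share a key; rows with unseen keys are added."""
    merged = {make_record_hash(row, key_fields): row for row in existing}
    for row in new_rows:
        # The new row replaces the old one entirely, it is not patched field by field
        merged[make_record_hash(row, key_fields)] = row
    return list(merged.values())

key_fields = ["vendor_name", "pickup"]
old = [{"vendor_name": "VTS", "pickup": "2009-06-14 23:23:00", "status": "booked"}]
new = [{"vendor_name": "VTS", "pickup": "2009-06-14 23:23:00", "status": "cancelled"}]

result = merge_rows(old, new, key_fields)
print(result)
```

With dlt you do not write this logic yourself; the merge write disposition handles the replacement at the destination.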
In the merge example, the ratings of the 2 passengers got lowered and the payment status changed, so we need to update the values. We do it by using the merge write disposition, replacing the records identified by `record_hash` present in the new data.

```python
import dlt

data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "Trip_Distance": 17.52,
        "coordinates": {
            "start": {
                "lon": -73.787442,
                "lat": 40.641525
            },
            "end": {
                "lon": -73.980072,
                "lat": 40.742963
            }
        },
        "Rate_Code": None,
        "store_and_forward": None,
        "Payment": {
            "type": "Credit",
            "amt": 20.5,
            "surcharge": 0,
            "mta_tax": None,
            "tip": 9,
            "tolls": 4.15,
            "status": "cancelled"
        },
        "Passenger_Count": 2,
        "passengers": [
            {"name": "John", "rating": 4.4},
            {"name": "Jack", "rating": 3.6}
        ],
        "Stops": [
            {"lon": -73.6, "lat": 40.6},
            {"lon": -73.5, "lat": 40.5}
        ]
    },
]

# define the connection to load to.
# We reuse the pipeline name from the first load so that we merge into the same data.
# We now use duckdb, but you can switch to Bigquery later
pipeline = dlt.pipeline(pipeline_name="taxi_data",
                        destination='duckdb',
                        dataset_name='taxi_rides')

# run the pipeline with the merge disposition, and capture the outcome
# primary_key tells dlt which column identifies the record to replace
info = pipeline.run(data,
                    table_name="users",
                    write_disposition="merge",
                    primary_key="record_hash")

# show the outcome
print(info)
```

As you can see in your notebook, the payment status and Jack’s rating were updated after running the code.

### What’s next?

- You could change the destination to parquet + local file system or storage bucket. See the colab bonus section.
- You could change the destination to BigQuery. Destination & credential setup docs: https://dlthub.com/docs/dlt-ecosystem/destinations/, https://dlthub.com/docs/walkthroughs/add_credentials or see the colab bonus section.
- You could use a decorator to convert the generator into a customised dlt resource: https://dlthub.com/docs/general-usage/resource - You can deep dive into building more complex pipelines by following the guides: - https://dlthub.com/docs/walkthroughs - https://dlthub.com/docs/build-a-pipeline-tutorial - You can join our [Slack community](https://dlthub.com/community) and engage with us there. ================================================ FILE: cohorts/2024/workshops/dlt_resources/homework_solution.ipynb ================================================ { "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# **Homework**: Data talks club data engineering zoomcamp Data loading workshop\n", "\n", "Hello folks, let's practice what we learned - Loading data with the best practices of data engineering.\n", "\n", "Here are the exercises we will do\n", "\n", "\n" ], "metadata": { "id": "mrTFv5nPClXh" } }, { "cell_type": "markdown", "source": [ "# 1. Use a generator\n", "\n", "Remember the concept of generator? Let's practice using them to futher our understanding of how they work.\n", "\n", "Let's define a generator and then run it as practice.\n", "\n", "**Answer the following questions:**\n", "\n", "- **Question 1: What is the sum of the outputs of the generator for limit = 5?**\n", "- **Question 2: What is the 13th number yielded**\n", "\n", "I suggest practicing these questions without GPT as the purpose is to further your learning." 
], "metadata": { "id": "wLF4iXf-NR7t" } }, { "cell_type": "code", "source": [ "def square_root_generator(limit):\n", " n = 1\n", " while n <= limit:\n", " yield n ** 0.5\n", " n += 1\n", "\n", "# Example usage:\n", "limit = 5\n", "generator = square_root_generator(limit)\n", "\n", "for sqrt_value in generator:\n", " print(sqrt_value)\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wLng-bDJN4jf", "outputId": "547683cb-5f56-4815-a903-d0d9578eb1f9" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "1.0\n", "1.4142135623730951\n", "1.7320508075688772\n", "2.0\n", "2.23606797749979\n" ] } ] }, { "cell_type": "markdown", "source": [], "metadata": { "id": "xbe3q55zN43j" } }, { "cell_type": "markdown", "source": [ "# 2. Append a generator to a table with existing data\n", "\n", "\n", "Below you have 2 generators. You will be tasked to load them to duckdb and answer some questions from the data\n", "\n", "1. Load the first generator and calculate the sum of ages of all people. Make sure to only load it once.\n", "2. Append the second generator to the same table as the first.\n", "3. 
**After correctly appending the data, calculate the sum of all ages of people.**\n", "\n", "\n" ], "metadata": { "id": "vjWhILzGJMpK" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2MoaQcdLBEk6", "outputId": "d2b93dc1-d83f-44ea-aeff-fdf51d75f7aa" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'ID': 1, 'Name': 'Person_1', 'Age': 26, 'City': 'City_A'}\n", "{'ID': 2, 'Name': 'Person_2', 'Age': 27, 'City': 'City_A'}\n", "{'ID': 3, 'Name': 'Person_3', 'Age': 28, 'City': 'City_A'}\n", "{'ID': 4, 'Name': 'Person_4', 'Age': 29, 'City': 'City_A'}\n", "{'ID': 5, 'Name': 'Person_5', 'Age': 30, 'City': 'City_A'}\n", "{'ID': 3, 'Name': 'Person_3', 'Age': 33, 'City': 'City_B', 'Occupation': 'Job_3'}\n", "{'ID': 4, 'Name': 'Person_4', 'Age': 34, 'City': 'City_B', 'Occupation': 'Job_4'}\n", "{'ID': 5, 'Name': 'Person_5', 'Age': 35, 'City': 'City_B', 'Occupation': 'Job_5'}\n", "{'ID': 6, 'Name': 'Person_6', 'Age': 36, 'City': 'City_B', 'Occupation': 'Job_6'}\n", "{'ID': 7, 'Name': 'Person_7', 'Age': 37, 'City': 'City_B', 'Occupation': 'Job_7'}\n", "{'ID': 8, 'Name': 'Person_8', 'Age': 38, 'City': 'City_B', 'Occupation': 'Job_8'}\n" ] } ], "source": [ "def people_1():\n", " for i in range(1, 6):\n", " yield {\"ID\": i, \"Name\": f\"Person_{i}\", \"Age\": 25 + i, \"City\": \"City_A\"}\n", "\n", "for person in people_1():\n", " print(person)\n", "\n", "\n", "def people_2():\n", " for i in range(3, 9):\n", " yield {\"ID\": i, \"Name\": f\"Person_{i}\", \"Age\": 30 + i, \"City\": \"City_B\", \"Occupation\": f\"Job_{i}\"}\n", "\n", "\n", "for person in people_2():\n", " print(person)\n" ] }, { "cell_type": "markdown", "source": [], "metadata": { "id": "vtdTIm4fvQCN" } }, { "cell_type": "markdown", "source": [ "# 3. 
Merge a generator\n", "\n", "Re-use the generators from Exercise 2.\n", "\n", "A table's primary key needs to be created from the start, so load your data to a new table with primary key ID.\n", "\n", "Load your first generator first, and then load the second one with merge. Since they have overlapping IDs, some of the records from the first load should be replaced by the ones from the second load.\n", "\n", "After loading, you should have a total of 8 records, and ID 3 should have age 33.\n", "\n", "Question: **Calculate the sum of ages of all the people loaded as described above.**\n" ], "metadata": { "id": "pY4cFAWOSwN1" } }, { "cell_type": "markdown", "source": [ "# Solution: First make sure that the following modules are installed:" ], "metadata": { "id": "kKB2GTB9oVjr" } }, { "cell_type": "code", "source": [ "#Install the dependencies\n", "%%capture\n", "!pip install dlt[duckdb]" ], "metadata": { "id": "xTVvtyqrfVNq" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Solutions\n", "\n", "You can use these solutions to self check your results, or to check how the answer can be obtained if you get stuck." 
], "metadata": { "id": "kUG4DNYGb5dF" } }, { "cell_type": "markdown", "source": [ "\n", "\n", "\n", "\n" ], "metadata": { "id": "ks6Sh_jBJWdh" } }, { "cell_type": "markdown", "source": [ "## Solution 1" ], "metadata": { "id": "U61tgQaYb8Yt" } }, { "cell_type": "code", "source": [ "def sum_of_generator_outputs(generator, limit):\n", " return sum(next(generator) for _ in range(limit))\n", "\n", "# Example usage:\n", "limit_1 = 5\n", "generator_1 = square_root_generator(limit_1)\n", "result_1 = sum_of_generator_outputs(generator_1, limit_1)\n", "print(f\"The sum of the outputs for limit={limit_1} is: {result_1}\")\n", "\n", "\n", "def nth_yielded_number(generator, n):\n", " for _ in range(n - 1):\n", " next(generator)\n", " return next(generator)\n", "\n", "# Example usage:\n", "n = 13\n", "generator_2 = square_root_generator(n)\n", "result_2 = nth_yielded_number(generator_2, n)\n", "print(f\"The {n}th number yielded is: {result_2}\")\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Roc3y_lSTSfn", "outputId": "f03d348e-cdfa-44d0-e5f2-276db6af1cf5" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The sum of the outputs for limit=5 is: 8.382332347441762\n", "The 13th number yielded is: 3.605551275463989\n" ] } ] }, { "cell_type": "markdown", "source": [ "## Solution 2: Append a generator\n", "\n", "Load your first generator first, and then load the second one using the \"append\" operation. 
Since they have overlapping IDs, some records will appear multiple times.\n", "\n", "After loading, you should have a total of 11 records.\n", "\n", "Question: Calculate the sum of ages of all the people loaded as described above" ], "metadata": { "id": "M3PJYca2TIw8" } }, { "cell_type": "code", "source": [ "# Importing the DLT library\n", "import dlt\n", "\n", "# Create a DLT pipeline for the first generator `people_1`\n", "# The pipeline is set to load data into a DuckDB database with the dataset named 'people'\n", "people_1_pipeline = dlt.pipeline(destination='duckdb', dataset_name='people')\n", "\n", "# Run the pipeline for the first generator, creating or replacing the table 'people'\n", "info = people_1_pipeline.run(people_1(),\n", " table_name=\"people\",\n", " write_disposition=\"replace\")\n", "\n", "print(f\"{info}\\n\\n\")\n", "\n", "\n", "# Create a second DLT pipeline for the generator `people_2`, targeting the same DuckDB database and dataset\n", "people_2_pipeline = dlt.pipeline(destination='duckdb', dataset_name='people')\n", "\n", "# Run the second pipeline, appending data from `people_2` to the existing 'people' table\n", "info = people_2_pipeline.run(people_2(),\n", " table_name=\"people\",\n", " write_disposition=\"append\")\n", "\n", "print(f\"{info}\\n\\n\")\n", "\n", "\n", "# Importing the DuckDB library\n", "import duckdb\n", "\n", "# Connect to the DuckDB database created by the first generator\n", "conn = duckdb.connect(f\"{people_1_pipeline.pipeline_name}.duckdb\")\n", "\n", "# Setting the search path to the dataset 'people' and displaying available tables\n", "conn.sql(f\"SET search_path = '{people_1_pipeline.dataset_name}'\")\n", "print('Loaded tables: ')\n", "display(conn.sql(\"show tables\"))\n", "\n", "\n", "# Fetching the appended data from the 'people' table and displaying it\n", "data = conn.sql(\"SELECT * FROM people\").df()\n", "display(data)\n", "\n", "# Calculate the sum of ages from the combined data of `people_1` and 
`people_2` in the 'people' table\n", "sum_of_ages_p1_p2 = conn.execute(\"SELECT SUM(age) FROM people\").fetchone()[0]\n", "print(f\"\\n\\nSum of ages from generators `people_1()` and `people_2()` combined: {sum_of_ages_p1_p2}\")\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 841 }, "id": "0u2mtndkTLpk", "outputId": "d5d253de-4502-42bf-ac89-08e0a7065d85" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Pipeline dlt_colab_kernel_launcher load step completed in 0.59 seconds\n", "1 load package(s) were loaded to destination duckdb and into dataset people\n", "The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\n", "Load package 1706029306.7456656 is LOADED and contains no failed jobs\n", "\n", "\n", "Pipeline dlt_colab_kernel_launcher load step completed in 0.43 seconds\n", "1 load package(s) were loaded to destination duckdb and into dataset people\n", "The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\n", "Load package 1706029307.9851513 is LOADED and contains no failed jobs\n", "\n", "\n", "Loaded tables: \n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "┌─────────────────────┐\n", "│ name │\n", "│ varchar │\n", "├─────────────────────┤\n", "│ _dlt_loads │\n", "│ _dlt_pipeline_state │\n", "│ _dlt_version │\n", "│ people │\n", "└─────────────────────┘" ] }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": [ " id name age city _dlt_load_id _dlt_id occupation\n", "0 1 Person_1 26 City_A 1706029306.7456656 An8WyXL43/J1GQ None\n", "1 2 Person_2 27 City_A 1706029306.7456656 ZGI1S72CddPbJQ None\n", "2 3 Person_3 28 City_A 1706029306.7456656 +z4Pm5oCykL2Vg None\n", "3 4 Person_4 29 City_A 1706029306.7456656 0Vfr36JHZ34OJA None\n", "4 5 Person_5 30 City_A 1706029306.7456656 aA+9WOclw3YWpg None\n", "5 3 Person_3 33 City_B 1706029307.9851513 
mEegoM7n4XujYw Job_3\n", "6 4 Person_4 34 City_B 1706029307.9851513 FPrsrzXgz+E9Fw Job_4\n", "7 5 Person_5 35 City_B 1706029307.9851513 ZaAOBa5EEqXU1Q Job_5\n", "8 6 Person_6 36 City_B 1706029307.9851513 gmcktDnX6y4Fmg Job_6\n", "9 7 Person_7 37 City_B 1706029307.9851513 960gdVKySsa4JA Job_7\n", "10 8 Person_8 38 City_B 1706029307.9851513 +su5IfZQyFEsEw Job_8" ], "text/html": [ "\n", "
\n" ] }, "metadata": {} }, { "output_type": "stream", "name": "stdout", "text": [ "\n", "\n", "Sum of ages from generators `people_1()` and `people_2()` combined: 353\n" ] } ] }, { "cell_type": "markdown", "source": [ "## Solution 3: Merge a generator\n", "\n", "A table's primary key needs to be created from the start, so load your data to a new table with primary key ID.\n", "\n", "Load your first generator first, and then load the second one with merge. Since they have overlapping IDs, some of the records from the first load should be replaced by the ones from the second load.\n", "\n", "After loading, you should have a total of 8 records, and ID 3 should have age 33." ], "metadata": { "id": "G-T-jR9qlzdB" } }, { "cell_type": "code", "source": [ "import dlt\n", "\n", "# Set up a DLT pipeline.\n", "# Currently using DuckDB for local testing, but it can be switched to BigQuery for production.\n", "generators_pipeline = dlt.pipeline(destination='duckdb', dataset_name='people_merge')\n", "\n", "# Load data from the first generator `people_1` into 'people_merge' table.\n", "# This operation will replace any existing data in the table.\n", "# A primary key 'ID' is specified for potential future merge operations.\n", "info = generators_pipeline.run(people_1(),\n", " table_name=\"people_v2\",\n", " write_disposition=\"replace\",\n", " primary_key=\"ID\")\n", "\n", "# Print metadata of the loading process for the first generator.\n", "print(f\"{info}\\n\\n\")\n", "\n", "# Load data from the second generator `people_2` into the same 'people_merge' table.\n", "# This operation will merge the new data with existing data based on the primary key 'ID'.\n", "info = generators_pipeline.run(people_2(),\n", " table_name=\"people_merged\",\n", " write_disposition=\"merge\",\n", " primary_key=\"ID\")\n", "\n", "# Print metadata of the loading process for the second generator.\n", "print(f\"{info}\\n\\n\")\n", "\n", "import duckdb\n", "\n", "# Establish a connection to the DuckDB 
database created by the pipeline.\n", "conn = duckdb.connect(f\"{generators_pipeline.pipeline_name}.duckdb\")\n", "\n", "# Set the search path to the dataset 'people_merge' and display the available tables.\n", "conn.sql(f\"SET search_path = '{generators_pipeline.dataset_name}'\")\n", "print('Loaded tables: ')\n", "display(conn.sql(\"show tables\"))\n", "\n", "# Display the merged data from the 'people_merged' table.\n", "print(\"\\n\\n\\nData from the 'people_merged' table:\")\n", "data = conn.sql(\"SELECT * FROM people_merged\").df()\n", "display(data)\n", "\n", "# Calculate and display the sum of ages from the merged data in 'people_merged' table.\n", "sum_of_ages_p1_p2 = conn.execute(\"SELECT SUM(age) FROM people_merged\").fetchone()[0]\n", "print(f\"\\n\\nSum of ages of people in generator `people_1()` merged with generator `people_2()` is: {sum_of_ages_p1_p2}\")\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 773 }, "id": "rXR-IN85kBtq", "outputId": "c74a7ab7-aa77-4445-c2bc-e782054a7201" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Pipeline dlt_colab_kernel_launcher load step completed in 0.24 seconds\n", "1 load package(s) were loaded to destination duckdb and into dataset people_merge\n", "The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\n", "Load package 1706030294.0522 is LOADED and contains no failed jobs\n", "\n", "\n", "Pipeline dlt_colab_kernel_launcher load step completed in 0.42 seconds\n", "1 load package(s) were loaded to destination duckdb and into dataset people_merge\n", "The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\n", "Load package 1706030294.7037766 is LOADED and contains no failed jobs\n", "\n", "\n", "Loaded tables: \n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "┌─────────────────────┐\n", "│ name │\n", "│ varchar │\n", 
"├─────────────────────┤\n", "│ _dlt_loads │\n", "│ _dlt_pipeline_state │\n", "│ _dlt_version │\n", "│ people_merged │\n", "│ people_v2 │\n", "└─────────────────────┘" ] }, "metadata": {} }, { "output_type": "stream", "name": "stdout", "text": [ "\n", "\n", "\n", "Data from the 'people_merged' table:\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ " id name age city occupation _dlt_load_id _dlt_id\n", "0 8 Person_8 38 City_B Job_8 1706030294.7037766 Q1k+DIAjXLL7cg\n", "1 4 Person_4 34 City_B Job_4 1706030294.7037766 ewlZ3LjULEchiQ\n", "2 5 Person_5 35 City_B Job_5 1706030294.7037766 X+LfQEa/X8GU9w\n", "3 7 Person_7 37 City_B Job_7 1706030294.7037766 lQT0h7IL7E/wxg\n", "4 3 Person_3 33 City_B Job_3 1706030294.7037766 gRBswCo8B/DJmw\n", "5 6 Person_6 36 City_B Job_6 1706030294.7037766 M3IbNKfZZCtbcQ" ], "text/html": [ "\n", "
\n" ] }, "metadata": {} }, { "output_type": "stream", "name": "stdout", "text": [ "\n", "\n", "Sum of ages of people in generator `people_1()` merged with generator `people_2()` is: 213\n" ] } ] }, { "cell_type": "code", "source": [], "metadata": { "id": "TApfkuNKtlt3" }, "execution_count": null, "outputs": [] } ] } ================================================ FILE: cohorts/2024/workshops/dlt_resources/homework_starter.ipynb ================================================ { "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# **Homework**: Data talks club data engineering zoomcamp Data loading workshop\n", "\n", "Hello folks, let's practice what we learned - Loading data with the best practices of data engineering.\n", "\n", "Here are the exercises we will do\n", "\n", "\n" ], "metadata": { "id": "mrTFv5nPClXh" } }, { "cell_type": "markdown", "source": [ "# 1. Use a generator\n", "\n", "Remember the concept of generator? Let's practice using them to futher our understanding of how they work.\n", "\n", "Let's define a generator and then run it as practice.\n", "\n", "**Answer the following questions:**\n", "\n", "- **Question 1: What is the sum of the outputs of the generator for limit = 5?**\n", "- **Question 2: What is the 13th number yielded**\n", "\n", "I suggest practicing these questions without GPT as the purpose is to further your learning." 
], "metadata": { "id": "wLF4iXf-NR7t" } }, { "cell_type": "code", "source": [ "def square_root_generator(limit):\n", " n = 1\n", " while n <= limit:\n", " yield n ** 0.5\n", " n += 1\n", "\n", "# Example usage:\n", "limit = 5\n", "generator = square_root_generator(limit)\n", "\n", "for sqrt_value in generator:\n", " print(sqrt_value)\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wLng-bDJN4jf", "outputId": "547683cb-5f56-4815-a903-d0d9578eb1f9" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "1.0\n", "1.4142135623730951\n", "1.7320508075688772\n", "2.0\n", "2.23606797749979\n" ] } ] }, { "cell_type": "markdown", "source": [], "metadata": { "id": "xbe3q55zN43j" } }, { "cell_type": "markdown", "source": [ "# 2. Append a generator to a table with existing data\n", "\n", "\n", "Below you have 2 generators. You will be tasked to load them to duckdb and answer some questions from the data\n", "\n", "1. Load the first generator and calculate the sum of ages of all people. Make sure to only load it once.\n", "2. Append the second generator to the same table as the first.\n", "3. 
**After correctly appending the data, calculate the sum of all ages of people.**\n", "\n", "\n" ], "metadata": { "id": "vjWhILzGJMpK" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2MoaQcdLBEk6", "outputId": "d2b93dc1-d83f-44ea-aeff-fdf51d75f7aa" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'ID': 1, 'Name': 'Person_1', 'Age': 26, 'City': 'City_A'}\n", "{'ID': 2, 'Name': 'Person_2', 'Age': 27, 'City': 'City_A'}\n", "{'ID': 3, 'Name': 'Person_3', 'Age': 28, 'City': 'City_A'}\n", "{'ID': 4, 'Name': 'Person_4', 'Age': 29, 'City': 'City_A'}\n", "{'ID': 5, 'Name': 'Person_5', 'Age': 30, 'City': 'City_A'}\n", "{'ID': 3, 'Name': 'Person_3', 'Age': 33, 'City': 'City_B', 'Occupation': 'Job_3'}\n", "{'ID': 4, 'Name': 'Person_4', 'Age': 34, 'City': 'City_B', 'Occupation': 'Job_4'}\n", "{'ID': 5, 'Name': 'Person_5', 'Age': 35, 'City': 'City_B', 'Occupation': 'Job_5'}\n", "{'ID': 6, 'Name': 'Person_6', 'Age': 36, 'City': 'City_B', 'Occupation': 'Job_6'}\n", "{'ID': 7, 'Name': 'Person_7', 'Age': 37, 'City': 'City_B', 'Occupation': 'Job_7'}\n", "{'ID': 8, 'Name': 'Person_8', 'Age': 38, 'City': 'City_B', 'Occupation': 'Job_8'}\n" ] } ], "source": [ "def people_1():\n", " for i in range(1, 6):\n", " yield {\"ID\": i, \"Name\": f\"Person_{i}\", \"Age\": 25 + i, \"City\": \"City_A\"}\n", "\n", "for person in people_1():\n", " print(person)\n", "\n", "\n", "def people_2():\n", " for i in range(3, 9):\n", " yield {\"ID\": i, \"Name\": f\"Person_{i}\", \"Age\": 30 + i, \"City\": \"City_B\", \"Occupation\": f\"Job_{i}\"}\n", "\n", "\n", "for person in people_2():\n", " print(person)\n" ] }, { "cell_type": "markdown", "source": [], "metadata": { "id": "vtdTIm4fvQCN" } }, { "cell_type": "markdown", "source": [ "# 3. 
Merge a generator\n", "\n", "Re-use the generators from Exercise 2.\n", "\n", "A table's primary key needs to be created from the start, so load your data to a new table with primary key ID.\n", "\n", "Load your first generator first, and then load the second one with merge. Since they have overlapping IDs, some of the records from the first load should be replaced by the ones from the second load.\n", "\n", "After loading, you should have a total of 8 records, and ID 3 should have age 33.\n", "\n", "Question: **Calculate the sum of ages of all the people loaded as described above.**\n" ], "metadata": { "id": "pY4cFAWOSwN1" } }, { "cell_type": "markdown", "source": [ "# Solution: First make sure that the following modules are installed:" ], "metadata": { "id": "kKB2GTB9oVjr" } }, { "cell_type": "code", "source": [ "#Install the dependencies\n", "%%capture\n", "!pip install dlt[duckdb]" ], "metadata": { "id": "xTVvtyqrfVNq" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# to do: homework :)" ], "metadata": { "id": "a2-PRBAkGC2K" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Questions? difficulties? We are here to help.\n", "- DTC data engineering course channel: https://datatalks-club.slack.com/archives/C01FABYF2RG\n", "- dlt's DTC cohort channel: https://dlthub-community.slack.com/archives/C06GAEX2VNX" ], "metadata": { "id": "PoTJu4kbGG0z" } } ] } ================================================ FILE: cohorts/2024/workshops/dlt_resources/workshop.ipynb ================================================ [File too large to display: 10.7 MB] ================================================ FILE: cohorts/2024/workshops/rising-wave.md ================================================


## Stream processing with RisingWave

In this hands-on workshop, we’ll learn how to process real-time streaming data using SQL in RisingWave. The system we’ll use is [RisingWave](https://github.com/risingwavelabs/risingwave), an open-source SQL database for processing and managing streaming data. RisingWave’s user experience should feel familiar, as it’s fully wire compatible with PostgreSQL.

![RisingWave](https://raw.githubusercontent.com/risingwavelabs/risingwave-docs/main/docs/images/new_archi_grey.png)

We’ll cover the following topics in this workshop:

- Why Stream Processing?
- Stateless computation (Filters, Projections)
- Stateful Computation (Aggregations, Joins)
- Data Ingestion and Delivery

RisingWave in 10 Minutes:
https://tutorials.risingwave.com/docs/intro

Workshop video:

[Project Repository](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04)

## Homework

**Please set up the environment in [Getting Started](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04?tab=readme-ov-file#getting-started) and for the [Homework](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04/blob/main/homework.md#setting-up) first.**

### Question 0

_This question is just a warm-up to introduce the dynamic filter pattern. Please attempt it before viewing its solution._

What are the dropoff taxi zones at the latest dropoff times?

For this part, we will use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/).
Solution

```sql
CREATE MATERIALIZED VIEW latest_dropoff_time AS
    WITH t AS (
        SELECT MAX(tpep_dropoff_datetime) AS latest_dropoff_time
        FROM trip_data
    )
    SELECT taxi_zone.Zone as taxi_zone, latest_dropoff_time
    FROM t,
            trip_data
    JOIN taxi_zone
        ON trip_data.DOLocationID = taxi_zone.location_id
    WHERE trip_data.tpep_dropoff_datetime = t.latest_dropoff_time;

--    taxi_zone    | latest_dropoff_time
-- ----------------+---------------------
--  Midtown Center | 2022-01-03 17:24:54
-- (1 row)
```
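
To see what the dynamic filter computes, a plain-Python sketch of the same logic may help. It assumes a hypothetical in-memory list of trips and a small zone lookup (`TRIPS`, `ZONES`, and `latest_dropoff_zones` are illustrative names, not part of the workshop repository). Unlike the materialized view, which RisingWave keeps up to date as new trips arrive, this function only evaluates once.

```python
from datetime import datetime

# Hypothetical sample rows standing in for the trip_data and taxi_zone tables.
TRIPS = [
    {"DOLocationID": 236, "tpep_dropoff_datetime": datetime(2022, 1, 3, 16, 10, 0)},
    {"DOLocationID": 161, "tpep_dropoff_datetime": datetime(2022, 1, 3, 17, 24, 54)},
    {"DOLocationID": 132, "tpep_dropoff_datetime": datetime(2022, 1, 3, 15, 2, 30)},
]
ZONES = {132: "JFK Airport", 161: "Midtown Center", 236: "Upper East Side North"}


def latest_dropoff_zones(trips, zones):
    """Return (zone, dropoff_time) for every trip that ended at the latest dropoff time."""
    if not trips:
        return []
    # Step 1: the scalar the filter depends on (the CTE `t` in the SQL above).
    latest = max(trip["tpep_dropoff_datetime"] for trip in trips)
    # Step 2: keep only trips matching that scalar and join to the zone name.
    return [
        (zones[trip["DOLocationID"]], latest)
        for trip in trips
        if trip["tpep_dropoff_datetime"] == latest
    ]


print(latest_dropoff_zones(TRIPS, ZONES))
```

The filter value is itself derived from the data, which is why it is called dynamic: when a later dropoff arrives, both the maximum and the set of matching rows change.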
### Question 1

Create a materialized view to compute the average, min and max trip time **between each taxi zone**.

Note that we do not consider `a->b` and `b->a` as the same trip pair. So as an example, you would consider the following trip pairs as different pairs:

```plaintext
Yorkville East -> Steinway
Steinway -> Yorkville East
```

From this MV, find the pair of taxi zones with the highest average trip time. You may need to use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/) for this.

Bonus (no marks): Create an MV which can identify anomalies in the data. For example, if the average trip time between two zones is 1 minute, but the max trip time is 10 minutes and 20 minutes respectively.

Options:

1. Yorkville East, Steinway
2. Murray Hill, Midwood
3. East Flatbush/Farragut, East Harlem North
4. Midtown Center, University Heights/Morris Heights

p.s. The trip time between taxi zones does not take symmetry into account, i.e. `A -> B` and `B -> A` are considered different trips. This applies to subsequent questions as well.

### Question 2

Recreate the MV(s) in question 1, to also find the **number of trips** for the pair of taxi zones with the highest average trip time.

Options:

1. 5
2. 3
3. 10
4. 1

### Question 3

From the latest pickup time to 17 hours before, what are the top 3 busiest zones in terms of number of pickups? For example, if the latest pickup time is 2020-01-01 17:00:00, then the query should return the top 3 busiest zones from 2020-01-01 00:00:00 to 2020-01-01 17:00:00.

HINT: You can use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/) to create a filter condition based on the latest pickup time.

NOTE: For this question `17 hours` was picked to ensure we have enough data to work with.

Options:

1. Clinton East, Upper East Side North, Penn Station
2. LaGuardia Airport, Lincoln Square East, JFK Airport
3. 
Midtown Center, Upper East Side South, Upper East Side North
4. LaGuardia Airport, Midtown Center, Upper East Side North

## Submitting the solutions

- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/workshop2
- Deadline: 11 March (Monday), 23:00 CET

## Rewards 🥳

Everyone who completes the homework will get a pen and a sticker, and 5 lucky winners will receive a T-shirt and other secret surprises!
We encourage you to share your achievements with this workshop on your socials and look forward to your submissions 😁

- Follow us on **LinkedIn**: https://www.linkedin.com/company/risingwave
- Follow us on **GitHub**: https://github.com/risingwavelabs/risingwave
- Join us on **Slack**: https://risingwave-labs.com/slack

See you around!

## Solution

================================================
FILE: cohorts/2025/01-docker-terraform/homework.md
================================================
# Module 1 Homework: Docker & SQL

In this homework we'll prepare the environment and practice Docker and SQL.

When submitting your homework, you will also need to include a link to your GitHub repository or other public code-hosting site.

This repository should contain the code for solving the homework.

When your solution consists of SQL or shell commands rather than code files (e.g. Python files), include them directly in the README file of your repository.

## Question 1. Understanding docker first run

Run docker with the `python:3.12.8` image in an interactive mode, use the entrypoint `bash`.

What's the version of `pip` in the image?

- 24.3.1
- 24.2.1
- 23.3.1
- 23.2.1

## Question 2. Understanding Docker networking and docker-compose

Given the following `docker-compose.yaml`, what are the `hostname` and `port` that **pgadmin** should use to connect to the postgres database?

```yaml
services:
  db:
    container_name: postgres
    image: postgres:17-alpine
    environment:
      POSTGRES_USER: 'postgres'
      POSTGRES_PASSWORD: 'postgres'
      POSTGRES_DB: 'ny_taxi'
    ports:
      - '5433:5432'
    volumes:
      - vol-pgdata:/var/lib/postgresql/data

  pgadmin:
    container_name: pgadmin
    image: dpage/pgadmin4:latest
    environment:
      PGADMIN_DEFAULT_EMAIL: "pgadmin@pgadmin.com"
      PGADMIN_DEFAULT_PASSWORD: "pgadmin"
    ports:
      - "8080:80"
    volumes:
      - vol-pgadmin_data:/var/lib/pgadmin

volumes:
  vol-pgdata:
    name: vol-pgdata
  vol-pgadmin_data:
    name: vol-pgadmin_data
```

- postgres:5433
- localhost:5432
- db:5433
- postgres:5432
- db:5432

If more than one answer is correct, select only one of them.

## Prepare Postgres

Run Postgres and load data as shown in the videos.

We'll use the green taxi trips from October 2019:

```bash
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz
```

You will also need the dataset with zones:

```bash
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv
```

Download this data and put it into Postgres.

You can use the code from the course. It's up to you whether you want to use Jupyter or a Python script.

## Question 3. Trip Segmentation Count

During the period of October 1st 2019 (inclusive) and November 1st 2019 (exclusive), how many trips, **respectively**, happened:

1. Up to 1 mile
2. In between 1 (exclusive) and 3 miles (inclusive),
3. In between 3 (exclusive) and 7 miles (inclusive),
4. In between 7 (exclusive) and 10 miles (inclusive),
5. Over 10 miles

Answers:

- 104,802; 197,670; 110,612; 27,831; 35,281
- 104,802; 198,924; 109,603; 27,678; 35,189
- 104,793; 201,407; 110,612; 27,831; 35,281
- 104,793; 202,661; 109,603; 27,678; 35,189
- 104,838; 199,013; 109,645; 27,688; 35,202

## Question 4. Longest trip for each day

Which was the pick up day with the longest trip distance? Use the pick up time for your calculations.

Tip: For every day, we only care about one single trip with the longest distance.

- 2019-10-11
- 2019-10-24
- 2019-10-26
- 2019-10-31

## Question 5. Three biggest pickup zones

Which were the top pickup locations with over 13,000 in `total_amount` (across all trips) for 2019-10-18?

Consider only `lpep_pickup_datetime` when filtering by date.

- East Harlem North, East Harlem South, Morningside Heights
- East Harlem North, Morningside Heights
- Morningside Heights, Astoria Park, East Harlem South
- Bedford, East Harlem North, Astoria Park

## Question 6. Largest tip

For the passengers picked up in October 2019 in the zone named "East Harlem North", which was the drop off zone that had the largest tip?

Note: it's `tip`, not `trip`

We need the name of the zone, not the ID.

- Yorkville West
- JFK Airport
- East Harlem North
- East Harlem South

## Terraform

In this section of the homework we'll prepare the environment by creating resources in GCP with Terraform.

In your VM on GCP/Laptop/GitHub Codespace install Terraform. Copy the files from the course repo [here](../../../01-docker-terraform/1_terraform_gcp/terraform) to your VM/Laptop/GitHub Codespace.

Modify the files as necessary to create a GCP Bucket and BigQuery Dataset.

## Question 7. Terraform Workflow

Which of the following sequences, **respectively**, describes the workflow for:

1. Downloading the provider plugins and setting up backend,
2. Generating proposed changes and auto-executing the plan
3. 
Remove all resources managed by terraform

Answers:

- terraform import, terraform apply -y, terraform destroy
- teraform init, terraform plan -auto-apply, terraform rm
- terraform init, terraform run -auto-approve, terraform destroy
- terraform init, terraform apply -auto-approve, terraform destroy
- terraform import, terraform apply -y, terraform rm

## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw1

================================================
FILE: cohorts/2025/02-workflow-orchestration/README.md
================================================
# Workflow Orchestration

Welcome to Module 2 of the Data Engineering Zoomcamp! This week, we’ll dive into workflow orchestration using [Kestra](https://go.kestra.io/de-zoomcamp/github).

Kestra is an open-source, event-driven orchestration platform that simplifies building both scheduled and event-driven workflows. By adopting Infrastructure as Code practices for data and process orchestration, Kestra enables you to build reliable workflows with just a few lines of YAML.

> [!NOTE]
>You can find all videos for this week in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist).

---

# Course Structure

## 1. Conceptual Material: Introduction to Orchestration and Kestra

In this section, you’ll learn the foundations of workflow orchestration, its importance, and how Kestra fits into the orchestration landscape.

### Videos

- **2.2.1 - Introduction to Workflow Orchestration**

  [![2.2.1 - Workflow Orchestration Introduction](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FNp6QmmcgLCs)](https://youtu.be/Np6QmmcgLCs)

- **2.2.2 - Learn the Concepts of Kestra**

  [![Learn Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fo79n-EVpics)](https://youtu.be/o79n-EVpics)

### Resources

- [Quickstart Guide](https://go.kestra.io/de-zoomcamp/quickstart)
- [Install Kestra with Docker Compose](https://go.kestra.io/de-zoomcamp/docker-compose)
- [Tutorial](https://go.kestra.io/de-zoomcamp/tutorial)
- [What is an Orchestrator?](https://go.kestra.io/de-zoomcamp/what-is-an-orchestrator)

---

## 2. Hands-On Coding Project: Build Data Pipelines with Kestra

This week, we're gonna build ETL pipelines for Yellow and Green Taxi data from NYC’s Taxi and Limousine Commission (TLC). You will:

1. Extract data from [CSV files](https://github.com/DataTalksClub/nyc-tlc-data/releases).
2. Load it into Postgres or Google Cloud (GCS + BigQuery).
3. Explore scheduling and backfilling workflows.

> [!NOTE]
> If you’re using the PostgreSQL and PgAdmin docker setup from Module 1 for this week’s Kestra Workflow Orchestration exercise, ensure your PostgreSQL image version is 15 or later (preferably the latest). The MERGE statement, introduced in PostgreSQL 15, won’t work on earlier versions and will likely cause syntax errors in your Kestra flows.

### File Structure

The project is organized as follows:

```
.
├── flows/
│   ├── 01_getting_started_data_pipeline.yaml
│   ├── 02_postgres_taxi.yaml
│   ├── 02_postgres_taxi_scheduled.yaml
│   ├── 03_postgres_dbt.yaml
│   ├── 04_gcp_kv.yaml
│   ├── 05_gcp_setup.yaml
│   ├── 06_gcp_taxi.yaml
│   ├── 06_gcp_taxi_scheduled.yaml
│   └── 07_gcp_dbt.yaml
```

### Setup Kestra

We'll set up Kestra using Docker Compose containing one container for the Kestra server and another for the Postgres database:

```bash
cd 02-workflow-orchestration/docker/combined
docker compose up -d
```

Once the container starts, you can access the Kestra UI at [http://localhost:8080](http://localhost:8080).

If you prefer to add flows programmatically using Kestra's API, run the following commands:

```bash
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/01_getting_started_data_pipeline.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/02_postgres_taxi.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/02_postgres_taxi_scheduled.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/03_postgres_dbt.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/04_gcp_kv.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/05_gcp_setup.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/06_gcp_taxi.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/06_gcp_taxi_scheduled.yaml
curl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/07_gcp_dbt.yaml
```

---

## 3. ETL Pipelines in Kestra: Detailed Walkthrough

### Getting Started Pipeline

This introductory flow is added just to demonstrate a simple data pipeline which extracts data via HTTP REST API, transforms that data in Python and then queries it using DuckDB. For this stage, a new separate Postgres database is created for the exercises.

**Note:** Check that `pgAdmin` isn't running on the same ports as Kestra. If so, check out the [FAQ](#troubleshooting-tips) at the bottom of the README. ### Videos - **2.2.3 - Create an ETL Pipeline with Postgres in Kestra** [![Create an ETL Pipeline with Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FOkfLX28Ecjg%3Fsi%3DvKbIyWo1TtjpNnvt)](https://youtu.be/OkfLX28Ecjg?si=vKbIyWo1TtjpNnvt) - **2.2.4 - Manage Scheduling and Backfills using Postgres in Kestra** [![Manage Scheduling and Backfills using Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F_-li_z97zog%3Fsi%3DG6jZbkfJb3GAyqrd)](https://youtu.be/_-li_z97zog?si=G6jZbkfJb3GAyqrd) - **2.2.5 - Transform Data with dbt and Postgres in Kestra** [![Transform Data with dbt and Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FZLp2N6p2JjE%3Fsi%3DtWhcvq5w4lO8v1_p)](https://youtu.be/ZLp2N6p2JjE?si=tWhcvq5w4lO8v1_p) ```mermaid graph LR Extract[Extract Data via HTTP REST API] --> Transform[Transform Data in Python] Transform --> Query[Query Data with DuckDB] ``` Add the flow [`01_getting_started_data_pipeline.yaml`](flows/01_getting_started_data_pipeline.yaml) from the UI if you haven't already and execute it to see the results. Inspect the Gantt and Logs tabs to understand the flow execution. ### Local DB: Load Taxi Data to Postgres Before we start loading data to GCP, we'll first play with the Yellow and Green Taxi data using a local Postgres database running in a Docker container. We'll create a new Postgres database for these examples using this [Docker Compose file](docker/postgres/docker-compose.yml). Download it into a new directory, navigate to it and run the following command to start it: ```bash docker compose up -d ``` The flow will extract CSV data partitioned by year and month, create tables, load data to the monthly table, and finally merge the data to the final destination table. 
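The load-then-merge idea described above can be sketched in a few lines of Python. The flow hashes a handful of columns with `md5` to build a `unique_row_id`, then inserts only rows whose ID is not yet in the final table, so rerunning the same month does not duplicate data. This sketch uses an in-memory dict in place of Postgres, and the function names are hypothetical. The hash values will not match the ones Postgres produces, because the text casting differs. Unlike the SQL `MERGE`, the dict also collapses duplicates inside a single batch.

```python
import hashlib

# Columns the green taxi flow concatenates before hashing.
KEY_COLUMNS = [
    "VendorID",
    "lpep_pickup_datetime",
    "lpep_dropoff_datetime",
    "PULocationID",
    "DOLocationID",
    "fare_amount",
    "trip_distance",
]


def unique_row_id(row):
    """Concatenate the key columns (None becomes "") and hash the result with md5."""
    parts = ["" if row.get(col) is None else str(row[col]) for col in KEY_COLUMNS]
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()


def merge_new_rows(final_table, staging_rows, filename):
    """Insert staging rows whose ID is not in final_table; return how many were inserted."""
    inserted = 0
    for row in staging_rows:
        row_id = unique_row_id(row)
        if row_id not in final_table:
            final_table[row_id] = {**row, "unique_row_id": row_id, "filename": filename}
            inserted += 1
    return inserted


# Hypothetical one-row batch, loaded twice to show that the merge is idempotent.
batch = [
    {
        "VendorID": "2",
        "lpep_pickup_datetime": "2019-01-01 00:10:00",
        "lpep_dropoff_datetime": "2019-01-01 00:20:00",
        "PULocationID": "74",
        "DOLocationID": "75",
        "fare_amount": 8.5,
        "trip_distance": 1.9,
    }
]
final_table = {}
first_run = merge_new_rows(final_table, batch, "green_tripdata_2019-01.csv")
second_run = merge_new_rows(final_table, batch, "green_tripdata_2019-01.csv")
print(first_run, second_run)
```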
```mermaid graph LR Start[Select Year & Month] --> SetLabel[Set Labels] SetLabel --> Extract[Extract CSV Data] Extract -->|Taxi=Yellow| YellowFinalTable[Create Yellow Final Table]:::yellow Extract -->|Taxi=Green| GreenFinalTable[Create Green Final Table]:::green YellowFinalTable --> YellowMonthlyTable[Create Yellow Monthly Table]:::yellow GreenFinalTable --> GreenMonthlyTable[Create Green Monthly Table]:::green YellowMonthlyTable --> YellowCopyIn[Load Data to Monthly Table]:::yellow GreenMonthlyTable --> GreenCopyIn[Load Data to Monthly Table]:::green YellowCopyIn --> YellowMerge[Merge Yellow Data]:::yellow GreenCopyIn --> GreenMerge[Merge Green Data]:::green classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px; classDef green fill:#32CD32,stroke:#000,stroke-width:1px; ``` The flow code: [`02_postgres_taxi.yaml`](flows/02_postgres_taxi.yaml). > [!NOTE] > The NYC Taxi and Limousine Commission (TLC) Trip Record Data provided on the [nyc.gov](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) website is currently available only in a Parquet format, but this is NOT the dataset we're going to use in this course. For the purpose of this course, we'll use the **CSV files** available [here on GitHub](https://github.com/DataTalksClub/nyc-tlc-data/releases). This is because the Parquet format can be challenging to understand by newcomers, and we want to make the course as accessible as possible — the CSV format can be easily introspected using tools like Excel or Google Sheets, or even a simple text editor. ### Local DB: Learn Scheduling and Backfills We can now schedule the same pipeline shown above to run daily at 9 AM UTC. We'll also demonstrate how to backfill the data pipeline to run on historical data. Note: given the large dataset, we'll backfill only data for the green taxi dataset for the year 2019. The flow code: [`02_postgres_taxi_scheduled.yaml`](flows/02_postgres_taxi_scheduled.yaml). 
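To see what a backfill of the green dataset for 2019 would process, the sketch below lists the file names the scheduled flow would derive, assuming one execution per month (the data is partitioned by month) and the flow's `{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv` naming pattern. The function name is hypothetical; in Kestra the backfill itself is started from the UI, not from Python.

```python
from datetime import date


def monthly_files(taxi, start, end):
    """Return one CSV file name per calendar month from start to end, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{taxi}_tripdata_{year:04d}-{month:02d}.csv")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


files = monthly_files("green", date(2019, 1, 1), date(2019, 12, 31))
for name in files:
    print(name)
```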
### Local DB: Orchestrate dbt Models (Optional)

Now that we have raw data ingested into a local Postgres database, we can use dbt to transform it. The flow will sync the dbt models from Git to Kestra and run the `dbt build` command to build the models.

```mermaid
graph LR
  Start[Select dbt command] --> Sync[Sync Namespace Files]
  Sync --> DbtBuild[Run dbt CLI]
```

This is only a quick showcase of dbt inside Kestra; the homework tasks do not depend on it. The course covers dbt in more detail in [Week 4](../04-analytics-engineering).

The flow code: [`03_postgres_dbt.yaml`](flows/03_postgres_dbt.yaml).

### Resources

- [pgAdmin Download](https://www.pgadmin.org/download/)
- [Postgres DB Docker Compose](docker/postgres/docker-compose.yml)

---

## 4. ETL Pipelines in Kestra: Google Cloud Platform

Now that you've learned how to build ETL pipelines locally using Postgres, we are ready to move to the cloud. In this section, we'll load the same Yellow and Green Taxi data to Google Cloud Platform (GCP) using:

1. Google Cloud Storage (GCS) as a data lake
2. BigQuery as a data warehouse

### Videos

- **2.2.6 - Create an ETL Pipeline with GCS and BigQuery in Kestra**
  [![Create an ETL Pipeline with BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FnKqjjLJ7YXs)](https://youtu.be/nKqjjLJ7YXs)
- **2.2.7 - Manage Scheduling and Backfills using BigQuery in Kestra**
  [![Manage Scheduling and Backfills using BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FDoaZ5JWEkH0)](https://youtu.be/DoaZ5JWEkH0)
- **2.2.8 - Transform Data with dbt and BigQuery in Kestra**
  [![Transform Data with dbt and BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FeF_EdV4A1Wk)](https://youtu.be/eF_EdV4A1Wk)

### Setup Google Cloud Platform (GCP)

Before we start loading data to GCP, we need to set up the Google Cloud Platform.
First, adjust the following flow [`04_gcp_kv.yaml`](flows/04_gcp_kv.yaml) to include your service account, GCP project ID, BigQuery dataset and GCS bucket name (_along with their location_) as KV Store values: - GCP_CREDS - GCP_PROJECT_ID - GCP_LOCATION - GCP_BUCKET_NAME - GCP_DATASET. > [!WARNING] > The `GCP_CREDS` service account contains sensitive information. Ensure you keep it secure and do not commit it to Git. Keep it as secure as your passwords. ### Create GCP Resources If you haven't already created the GCS bucket and BigQuery dataset in the first week of the course, you can use this flow to create them: [`05_gcp_setup.yaml`](flows/05_gcp_setup.yaml). ### GCP Workflow: Load Taxi Data to BigQuery ```mermaid graph LR SetLabel[Set Labels] --> Extract[Extract CSV Data] Extract --> UploadToGCS[Upload Data to GCS] UploadToGCS -->|Taxi=Yellow| BQYellowTripdata[Main Yellow Tripdata Table]:::yellow UploadToGCS -->|Taxi=Green| BQGreenTripdata[Main Green Tripdata Table]:::green BQYellowTripdata --> BQYellowTableExt[External Table]:::yellow BQGreenTripdata --> BQGreenTableExt[External Table]:::green BQYellowTableExt --> BQYellowTableTmp[Monthly Table]:::yellow BQGreenTableExt --> BQGreenTableTmp[Monthly Table]:::green BQYellowTableTmp --> BQYellowMerge[Merge to Main Table]:::yellow BQGreenTableTmp --> BQGreenMerge[Merge to Main Table]:::green BQYellowMerge --> PurgeFiles[Purge Files] BQGreenMerge --> PurgeFiles[Purge Files] classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px; classDef green fill:#32CD32,stroke:#000,stroke-width:1px; ``` The flow code: [`06_gcp_taxi.yaml`](flows/06_gcp_taxi.yaml). ### GCP Workflow: Schedule and Backfill Full Dataset We can now schedule the same pipeline shown above to run daily at 9 AM UTC for the green dataset and at 10 AM UTC for the yellow dataset. You can backfill historical data directly from the Kestra UI. 
Since we now process data in a cloud environment whose storage and compute scale far beyond a local machine, we can backfill the entire dataset for both the yellow and green taxi data without the risk of running out of local resources.

The flow code: [`06_gcp_taxi_scheduled.yaml`](flows/06_gcp_taxi_scheduled.yaml).

### GCP Workflow: Orchestrate dbt Models (Optional)

Now that we have raw data ingested into BigQuery, we can use dbt to transform that data. The flow will sync the dbt models from Git to Kestra and run the `dbt build` command to build the models:

```mermaid
graph LR
  Start[Select dbt command] --> Sync[Sync Namespace Files]
  Sync --> Build[Run dbt Build Command]
```

This is only a quick showcase of dbt inside Kestra; the homework tasks do not depend on it. The course covers dbt in more detail in [Week 4](../04-analytics-engineering).

The flow code: [`07_gcp_dbt.yaml`](flows/07_gcp_dbt.yaml).

---

## 5. Bonus: Deploy to the Cloud (Optional)

Now that we've got our ETL pipeline working both locally and in the cloud, we can deploy Kestra to the cloud so it can continue to orchestrate our ETL pipelines monthly with our configured schedules. We'll cover how to install Kestra on Google Cloud in production and how to automatically sync and deploy your workflows from a Git repository.

Note: When committing your workflows to Kestra, make sure your workflow doesn't contain any sensitive information. You can use [Secrets](https://go.kestra.io/de-zoomcamp/secret) and the [KV Store](https://go.kestra.io/de-zoomcamp/kv-store) to keep sensitive data out of your workflow logic.
### Videos

- **2.2.9 - Deploy Workflows to the Cloud with Git**
  [![Deploy Workflows to the Cloud with Git](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fl-wC71tI3co)](https://youtu.be/l-wC71tI3co)

Resources

- [Install Kestra on Google Cloud](https://go.kestra.io/de-zoomcamp/gcp-install)
- [Moving from Development to Production](https://go.kestra.io/de-zoomcamp/dev-to-prod)
- [Using Git in Kestra](https://go.kestra.io/de-zoomcamp/git)
- [Deploy Flows with GitHub Actions](https://go.kestra.io/de-zoomcamp/deploy-github-actions)

## 6. Additional Resources 📚

- Check [Kestra Docs](https://go.kestra.io/de-zoomcamp/docs)
- Explore our [Blueprints](https://go.kestra.io/de-zoomcamp/blueprints) library
- Browse over 600 [plugins](https://go.kestra.io/de-zoomcamp/plugins) available in Kestra
- Give us a star on [GitHub](https://go.kestra.io/de-zoomcamp/github)
- Join our [Slack community](https://go.kestra.io/de-zoomcamp/slack) if you have any questions
- Find all the videos in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist)

### Troubleshooting tips

If you face any issues with Kestra flows in Module 2, make sure to use the following Docker images and ports:

- Use `kestra/kestra:latest`, which is the latest stable release. Avoid `kestra/kestra:develop`, a bleeding-edge development version that might contain bugs.
- Use `postgres:latest`, or any other Postgres image that ships **PostgreSQL 15** or higher.
- If you run `pgAdmin` or something else on port 8080, adjust the Kestra Docker Compose file to use a different port. For example, change the port mapping to 18080 instead of 8080, and then access the Kestra UI at http://localhost:18080/ instead of http://localhost:8080/.

If you're using Linux, you might encounter `Connection Refused` errors when connecting to the Postgres DB from within Kestra. This is because `host.docker.internal` works differently on Linux.
Using the modified Docker Compose file below, you can run Kestra, its dedicated Postgres DB, and the Postgres DB for the exercises all together. Within Kestra, connect to the exercise database by using the container name `postgres_zoomcamp` instead of `host.docker.internal` in `pluginDefaults`. The same applies when connecting from pgAdmin. If you'd prefer to keep the services in separate Docker Compose files, you'll need to set up a Docker network so that the containers can communicate with each other.
Docker Compose Example

This Docker Compose file has the Zoomcamp DB container and the pgAdmin container added to it, so everything is in one file.

Changes include:

- A new `volume` for the Zoomcamp DB container
- The Zoomcamp DB container is added and renamed to prevent clashes with the Kestra DB container
- A `depends_on` condition is added so that the Zoomcamp DB container starts only after Kestra has started
- pgAdmin is added and runs on port 8085 so it doesn't clash with Kestra, which uses 8080 and 8081

```yaml
volumes:
  postgres-data:
    driver: local
  kestra-data:
    driver: local
  zoomcamp-data:
    driver: local

services:
  postgres:
    image: postgres
    volumes:
      - postgres-data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: kestra
      POSTGRES_USER: kestra
      POSTGRES_PASSWORD: k3str4
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -d $${POSTGRES_DB} -U $${POSTGRES_USER}"]
      interval: 30s
      timeout: 10s
      retries: 10

  kestra:
    image: kestra/kestra:latest
    pull_policy: always
    # Note that this setup with a root user is intended for development purpose.
    # Our base image runs without root, but the Docker Compose implementation needs root to access the Docker socket
    # To run Kestra in a rootless mode in production, see: https://kestra.io/docs/installation/podman-compose
    user: "root"
    command: server standalone
    volumes:
      - kestra-data:/app/storage
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp/kestra-wd:/tmp/kestra-wd
    environment:
      KESTRA_CONFIGURATION: |
        datasources:
          postgres:
            url: jdbc:postgresql://postgres:5432/kestra
            driverClassName: org.postgresql.Driver
            username: kestra
            password: k3str4
        kestra:
          server:
            basicAuth:
              enabled: false
              username: "admin@kestra.io" # it must be a valid email address
              password: kestra
          repository:
            type: postgres
          storage:
            type: local
            local:
              basePath: "/app/storage"
          queue:
            type: postgres
          tasks:
            tmpDir:
              path: /tmp/kestra-wd/tmp
          url: http://localhost:8080/
    ports:
      - "8080:8080"
      - "8081:8081"
    depends_on:
      postgres:
        condition: service_started

  postgres_zoomcamp:
    image: postgres
    environment:
      POSTGRES_USER: kestra
      POSTGRES_PASSWORD: k3str4
      POSTGRES_DB: postgres-zoomcamp
    ports:
      - "5432:5432"
    volumes:
      - zoomcamp-data:/var/lib/postgresql/data
    depends_on:
      kestra:
        condition: service_started

  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8085:80"
    depends_on:
      postgres_zoomcamp:
        condition: service_started
```
If you are still facing any issues, stop and remove your existing Kestra + Postgres containers and start them again using `docker-compose up -d`. If this doesn't help, post your question on the DataTalksClub Slack or on Kestra's Slack http://kestra.io/slack.

- **DE Zoomcamp FAQ - PostgresDB Setup and Installing pgAdmin**
  [![DE Zoomcamp FAQ - PostgresDB Setup and Installing pgAdmin](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FywAPYNYFaB4%3Fsi%3D5X9AD0nFAT2WLWgS)](https://youtu.be/ywAPYNYFaB4?si=5X9AD0nFAT2WLWgS)
- **DE Zoomcamp FAQ - Ports and Images**
  [![DE Zoomcamp FAQ - Ports and Images](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fl2M2mW76RIU%3Fsi%3DoqyZ7KUaI27vi90V)](https://youtu.be/l2M2mW76RIU?si=oqyZ7KUaI27vi90V)
- **DE Zoomcamp FAQ - Docker Setup**
  [![DE Zoomcamp FAQ - Docker Setup](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F73g6qJN0HcM)](https://youtu.be/73g6qJN0HcM)

If you encounter errors similar to:

```
BigQueryError{reason=invalid, location=null, message=Error while reading table: kestra-sandbox.zooomcamp.yellow_tripdata_2020_01, error message: CSV table references column position 17, but line contains only 14 columns.; line_number: 2103925 byte_offset_to_start_of_line: 194863028 column_index: 17 column_name: "congestion_surcharge" column_type: NUMERIC File: gs://anna-geller/yellow_tripdata_2020-01.csv}
```

It means that a row in the CSV file you're trying to load has fewer columns than the BigQuery table expects, so the external source table (i.e. the file in GCS) and the destination table in BigQuery no longer match. This can happen when, due to network or transfer issues, the file is not fully downloaded from GitHub or not correctly uploaded to GCS. The error suggests a schema issue, but that's not the case. Simply rerun the entire execution, including redownloading the CSV file and reuploading it to GCS. This should resolve the issue.
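If you want to confirm that a downloaded file is truncated before rerunning, a quick local check is to compare each row's column count with the header. The sketch below is one way to do that with Python's standard `csv` module; the function name and the sample data are hypothetical, and it assumes the first line of the file is the header.

```python
import csv


def find_short_rows(lines, limit=5):
    """Return (header_width, problems); problems holds (line_number, column_count) pairs."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        # Empty input: nothing to compare against.
        return 0, []
    problems = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the number of source lines read so far.
            problems.append((reader.line_num, len(row)))
            if len(problems) >= limit:
                break
    return len(header), problems


# Hypothetical sample: the last row was cut off mid-transfer.
sample_lines = ["a,b,c\n", "1,2,3\n", "4,5\n"]
print(find_short_rows(sample_lines))
```

For a real file, pass an open file object instead of a list, for example `open(path, newline="")`, where `path` points to your local copy of the CSV.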
--- ## Homework See the [2025 cohort folder](../cohorts/2025/02-workflow-orchestration/homework.md) --- # Community notes Did you take notes? You can share them by creating a PR to this file! * [Notes from Manuel Guerra)](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/2_Workflow-Orchestration-(Kestra)/README.md) * [Notes from Horeb Seidou](https://spotted-hardhat-eea.notion.site/Week-2-Workflow-Orchestration-17129780dc4a80148debf61e6453fffe) * [Notes from Livia](https://docs.google.com/document/d/1Y_QMonvEtFPbXIzmdpCSVsKNC1BWAHFBA1mpK9qaZko/edit?usp=sharing) * [2025 Gitbook Notes from Tinker0425](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/module-2/introduction-to-module-2) * [Notes from Mercy Markus: Linux/Fedora Tweaks and Tips](https://mercymarkus.com/posts/2025/series/dtc-dez-jan-2025/dtc-dez-2025-module-2/) * Add your notes above this line --- # Previous Cohorts * 2022: [notes](../cohorts/2022/week_2_data_ingestion#community-notes) and [videos](../cohorts/2022/week_2_data_ingestion) * 2023: [notes](../cohorts/2023/week_2_workflow_orchestration#community-notes) and [videos](../cohorts/2023/week_2_workflow_orchestration) * 2024: [notes](../cohorts/2024/02-workflow-orchestration#community-notes) and [videos](../cohorts/2024/02-workflow-orchestration) ================================================ FILE: cohorts/2025/02-workflow-orchestration/flows/01_getting_started_data_pipeline.yaml ================================================ id: 01_getting_started_data_pipeline namespace: zoomcamp inputs: - id: columns_to_keep type: ARRAY itemType: STRING defaults: - brand - price tasks: - id: extract type: io.kestra.plugin.core.http.Download uri: https://dummyjson.com/products - id: transform type: io.kestra.plugin.scripts.python.Script containerImage: python:3.11-alpine inputFiles: data.json: "{{outputs.extract.uri}}" outputFiles: - "*.json" env: COLUMNS_TO_KEEP: "{{inputs.columns_to_keep}}" script: | import json import 
os columns_to_keep_str = os.getenv("COLUMNS_TO_KEEP") columns_to_keep = json.loads(columns_to_keep_str) with open("data.json", "r") as file: data = json.load(file) filtered_data = [ {column: product.get(column, "N/A") for column in columns_to_keep} for product in data["products"] ] with open("products.json", "w") as file: json.dump(filtered_data, file, indent=4) - id: query type: io.kestra.plugin.jdbc.duckdb.Query inputFiles: products.json: "{{outputs.transform.outputFiles['products.json']}}" sql: | INSTALL json; LOAD json; SELECT brand, round(avg(price), 2) as avg_price FROM read_json_auto('{{workingDir}}/products.json') GROUP BY brand ORDER BY avg_price DESC; fetchType: STORE ================================================ FILE: cohorts/2025/02-workflow-orchestration/flows/02_postgres_taxi.yaml ================================================ id: 02_postgres_taxi namespace: zoomcamp description: | The CSV Data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases inputs: - id: taxi type: SELECT displayName: Select taxi type values: [yellow, green] defaults: yellow - id: year type: SELECT displayName: Select year values: ["2019", "2020"] defaults: "2019" - id: month type: SELECT displayName: Select month values: ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"] defaults: "01" variables: file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv" staging_table: "public.{{inputs.taxi}}_tripdata_staging" table: "public.{{inputs.taxi}}_tripdata" data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}" tasks: - id: set_label type: io.kestra.plugin.core.execution.Labels labels: file: "{{render(vars.file)}}" taxi: "{{inputs.taxi}}" - id: extract type: io.kestra.plugin.scripts.shell.Commands outputFiles: - "*.csv" taskRunner: type: io.kestra.plugin.core.runner.Process commands: - wget -qO- 
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}} - id: if_yellow_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'yellow'}}" then: - id: yellow_create_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.table)}} ( unique_row_id text, filename text, VendorID text, tpep_pickup_datetime timestamp, tpep_dropoff_datetime timestamp, passenger_count integer, trip_distance double precision, RatecodeID text, store_and_fwd_flag text, PULocationID text, DOLocationID text, payment_type integer, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, improvement_surcharge double precision, total_amount double precision, congestion_surcharge double precision ); - id: yellow_create_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} ( unique_row_id text, filename text, VendorID text, tpep_pickup_datetime timestamp, tpep_dropoff_datetime timestamp, passenger_count integer, trip_distance double precision, RatecodeID text, store_and_fwd_flag text, PULocationID text, DOLocationID text, payment_type integer, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, improvement_surcharge double precision, total_amount double precision, congestion_surcharge double precision ); - id: yellow_truncate_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | TRUNCATE TABLE {{render(vars.staging_table)}}; - id: yellow_copy_in_to_staging_table type: io.kestra.plugin.jdbc.postgresql.CopyIn format: CSV from: "{{render(vars.data)}}" table: "{{render(vars.staging_table)}}" header: true columns: 
[VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge] - id: yellow_add_unique_id_and_filename type: io.kestra.plugin.jdbc.postgresql.Queries sql: | UPDATE {{render(vars.staging_table)}} SET unique_row_id = md5( COALESCE(CAST(VendorID AS text), '') || COALESCE(CAST(tpep_pickup_datetime AS text), '') || COALESCE(CAST(tpep_dropoff_datetime AS text), '') || COALESCE(PULocationID, '') || COALESCE(DOLocationID, '') || COALESCE(CAST(fare_amount AS text), '') || COALESCE(CAST(trip_distance AS text), '') ), filename = '{{render(vars.file)}}'; - id: yellow_merge_data type: io.kestra.plugin.jdbc.postgresql.Queries sql: | MERGE INTO {{render(vars.table)}} AS T USING {{render(vars.staging_table)}} AS S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT ( unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge ) VALUES ( S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge ); - id: if_green_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'green'}}" then: - id: green_create_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.table)}} ( unique_row_id text, filename text, VendorID text, lpep_pickup_datetime timestamp, lpep_dropoff_datetime timestamp, store_and_fwd_flag text, 
RatecodeID text, PULocationID text, DOLocationID text, passenger_count integer, trip_distance double precision, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, ehail_fee double precision, improvement_surcharge double precision, total_amount double precision, payment_type integer, trip_type integer, congestion_surcharge double precision ); - id: green_create_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} ( unique_row_id text, filename text, VendorID text, lpep_pickup_datetime timestamp, lpep_dropoff_datetime timestamp, store_and_fwd_flag text, RatecodeID text, PULocationID text, DOLocationID text, passenger_count integer, trip_distance double precision, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, ehail_fee double precision, improvement_surcharge double precision, total_amount double precision, payment_type integer, trip_type integer, congestion_surcharge double precision ); - id: green_truncate_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | TRUNCATE TABLE {{render(vars.staging_table)}}; - id: green_copy_in_to_staging_table type: io.kestra.plugin.jdbc.postgresql.CopyIn format: CSV from: "{{render(vars.data)}}" table: "{{render(vars.staging_table)}}" header: true columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge] - id: green_add_unique_id_and_filename type: io.kestra.plugin.jdbc.postgresql.Queries sql: | UPDATE {{render(vars.staging_table)}} SET unique_row_id = md5( COALESCE(CAST(VendorID AS text), '') || COALESCE(CAST(lpep_pickup_datetime AS text), '') 
|| COALESCE(CAST(lpep_dropoff_datetime AS text), '') || COALESCE(PULocationID, '') || COALESCE(DOLocationID, '') || COALESCE(CAST(fare_amount AS text), '') || COALESCE(CAST(trip_distance AS text), '') ), filename = '{{render(vars.file)}}'; - id: green_merge_data type: io.kestra.plugin.jdbc.postgresql.Queries sql: | MERGE INTO {{render(vars.table)}} AS T USING {{render(vars.staging_table)}} AS S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT ( unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge ) VALUES ( S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge ); - id: purge_files type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles description: This will remove output files. If you'd like to explore Kestra outputs, disable it. pluginDefaults: - type: io.kestra.plugin.jdbc.postgresql values: url: jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp username: kestra password: k3str4 ================================================ FILE: cohorts/2025/02-workflow-orchestration/flows/02_postgres_taxi_scheduled.yaml ================================================ id: 02_postgres_taxi_scheduled namespace: zoomcamp description: | Best to add a label `backfill:true` from the UI to track executions created via a backfill. 
CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases concurrency: limit: 1 inputs: - id: taxi type: SELECT displayName: Select taxi type values: [yellow, green] defaults: yellow variables: file: "{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv" staging_table: "public.{{inputs.taxi}}_tripdata_staging" table: "public.{{inputs.taxi}}_tripdata" data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}" tasks: - id: set_label type: io.kestra.plugin.core.execution.Labels labels: file: "{{render(vars.file)}}" taxi: "{{inputs.taxi}}" - id: extract type: io.kestra.plugin.scripts.shell.Commands outputFiles: - "*.csv" taskRunner: type: io.kestra.plugin.core.runner.Process commands: - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}} - id: if_yellow_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'yellow'}}" then: - id: yellow_create_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.table)}} ( unique_row_id text, filename text, VendorID text, tpep_pickup_datetime timestamp, tpep_dropoff_datetime timestamp, passenger_count integer, trip_distance double precision, RatecodeID text, store_and_fwd_flag text, PULocationID text, DOLocationID text, payment_type integer, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, improvement_surcharge double precision, total_amount double precision, congestion_surcharge double precision ); - id: yellow_create_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} ( unique_row_id text, filename text, VendorID text, tpep_pickup_datetime timestamp, tpep_dropoff_datetime timestamp, passenger_count integer, trip_distance 
double precision, RatecodeID text, store_and_fwd_flag text, PULocationID text, DOLocationID text, payment_type integer, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, improvement_surcharge double precision, total_amount double precision, congestion_surcharge double precision ); - id: yellow_truncate_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | TRUNCATE TABLE {{render(vars.staging_table)}}; - id: yellow_copy_in_to_staging_table type: io.kestra.plugin.jdbc.postgresql.CopyIn format: CSV from: "{{render(vars.data)}}" table: "{{render(vars.staging_table)}}" header: true columns: [VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge] - id: yellow_add_unique_id_and_filename type: io.kestra.plugin.jdbc.postgresql.Queries sql: | UPDATE {{render(vars.staging_table)}} SET unique_row_id = md5( COALESCE(CAST(VendorID AS text), '') || COALESCE(CAST(tpep_pickup_datetime AS text), '') || COALESCE(CAST(tpep_dropoff_datetime AS text), '') || COALESCE(PULocationID, '') || COALESCE(DOLocationID, '') || COALESCE(CAST(fare_amount AS text), '') || COALESCE(CAST(trip_distance AS text), '') ), filename = '{{render(vars.file)}}'; - id: yellow_merge_data type: io.kestra.plugin.jdbc.postgresql.Queries sql: | MERGE INTO {{render(vars.table)}} AS T USING {{render(vars.staging_table)}} AS S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT ( unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge ) VALUES ( S.unique_row_id, 
S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge ); - id: if_green_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'green'}}" then: - id: green_create_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.table)}} ( unique_row_id text, filename text, VendorID text, lpep_pickup_datetime timestamp, lpep_dropoff_datetime timestamp, store_and_fwd_flag text, RatecodeID text, PULocationID text, DOLocationID text, passenger_count integer, trip_distance double precision, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, ehail_fee double precision, improvement_surcharge double precision, total_amount double precision, payment_type integer, trip_type integer, congestion_surcharge double precision ); - id: green_create_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} ( unique_row_id text, filename text, VendorID text, lpep_pickup_datetime timestamp, lpep_dropoff_datetime timestamp, store_and_fwd_flag text, RatecodeID text, PULocationID text, DOLocationID text, passenger_count integer, trip_distance double precision, fare_amount double precision, extra double precision, mta_tax double precision, tip_amount double precision, tolls_amount double precision, ehail_fee double precision, improvement_surcharge double precision, total_amount double precision, payment_type integer, trip_type integer, congestion_surcharge double precision ); - id: green_truncate_staging_table type: io.kestra.plugin.jdbc.postgresql.Queries sql: | TRUNCATE TABLE {{render(vars.staging_table)}}; - id: 
green_copy_in_to_staging_table type: io.kestra.plugin.jdbc.postgresql.CopyIn format: CSV from: "{{render(vars.data)}}" table: "{{render(vars.staging_table)}}" header: true columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge] - id: green_add_unique_id_and_filename type: io.kestra.plugin.jdbc.postgresql.Queries sql: | UPDATE {{render(vars.staging_table)}} SET unique_row_id = md5( COALESCE(CAST(VendorID AS text), '') || COALESCE(CAST(lpep_pickup_datetime AS text), '') || COALESCE(CAST(lpep_dropoff_datetime AS text), '') || COALESCE(PULocationID, '') || COALESCE(DOLocationID, '') || COALESCE(CAST(fare_amount AS text), '') || COALESCE(CAST(trip_distance AS text), '') ), filename = '{{render(vars.file)}}'; - id: green_merge_data type: io.kestra.plugin.jdbc.postgresql.Queries sql: | MERGE INTO {{render(vars.table)}} AS T USING {{render(vars.staging_table)}} AS S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT ( unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge ) VALUES ( S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge ); - id: purge_files type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles description: To avoid cluttering your storage, 
we will remove the downloaded files pluginDefaults: - type: io.kestra.plugin.jdbc.postgresql values: url: jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp username: kestra password: k3str4 triggers: - id: green_schedule type: io.kestra.plugin.core.trigger.Schedule cron: "0 9 1 * *" inputs: taxi: green - id: yellow_schedule type: io.kestra.plugin.core.trigger.Schedule cron: "0 10 1 * *" inputs: taxi: yellow ================================================ FILE: cohorts/2025/02-workflow-orchestration/flows/03_postgres_dbt.yaml ================================================ id: 03_postgres_dbt namespace: zoomcamp inputs: - id: dbt_command type: SELECT allowCustomValue: true defaults: dbt build values: - dbt build - dbt debug # use when running the first time to validate DB connection tasks: - id: sync type: io.kestra.plugin.git.SyncNamespaceFiles url: https://github.com/DataTalksClub/data-engineering-zoomcamp branch: main namespace: "{{ flow.namespace }}" gitDirectory: 04-analytics-engineering/taxi_rides_ny dryRun: false # disabled: true # this Git Sync is needed only when running it the first time, afterwards the task can be disabled - id: dbt-build type: io.kestra.plugin.dbt.cli.DbtCLI env: DBT_DATABASE: postgres-zoomcamp DBT_SCHEMA: public namespaceFiles: enabled: true containerImage: ghcr.io/kestra-io/dbt-postgres:latest taskRunner: type: io.kestra.plugin.scripts.runner.docker.Docker networkMode: host commands: - dbt deps - "{{ inputs.dbt_command }}" storeManifest: key: manifest.json namespace: "{{ flow.namespace }}" profiles: | default: outputs: dev: type: postgres host: host.docker.internal user: kestra password: k3str4 port: 5432 dbname: postgres-zoomcamp schema: public threads: 8 connect_timeout: 10 priority: interactive target: dev description: | Note that you need to adjust the models/staging/schema.yml file to match your database and schema. Select and edit that Namespace File from the UI. Save and run this flow. 
Once https://github.com/DataTalksClub/data-engineering-zoomcamp/pull/565/files is merged, you can ignore this note as it will be dynamically adjusted based on env variables. ```yaml sources: - name: staging database: postgres-zoomcamp schema: public ``` ================================================ FILE: cohorts/2025/02-workflow-orchestration/flows/04_gcp_kv.yaml ================================================ id: 04_gcp_kv namespace: zoomcamp tasks: - id: gcp_project_id type: io.kestra.plugin.core.kv.Set key: GCP_PROJECT_ID kvType: STRING value: kestra-sandbox # TODO replace with your project id - id: gcp_location type: io.kestra.plugin.core.kv.Set key: GCP_LOCATION kvType: STRING value: europe-west2 - id: gcp_bucket_name type: io.kestra.plugin.core.kv.Set key: GCP_BUCKET_NAME kvType: STRING value: your-name-kestra # TODO make sure it's globally unique! - id: gcp_dataset type: io.kestra.plugin.core.kv.Set key: GCP_DATASET kvType: STRING value: zoomcamp ================================================ FILE: cohorts/2025/02-workflow-orchestration/flows/05_gcp_setup.yaml ================================================ id: 05_gcp_setup namespace: zoomcamp tasks: - id: create_gcs_bucket type: io.kestra.plugin.gcp.gcs.CreateBucket ifExists: SKIP storageClass: REGIONAL name: "{{kv('GCP_BUCKET_NAME')}}" # make sure it's globally unique! 
- id: create_bq_dataset type: io.kestra.plugin.gcp.bigquery.CreateDataset name: "{{kv('GCP_DATASET')}}" ifExists: SKIP pluginDefaults: - type: io.kestra.plugin.gcp values: serviceAccount: "{{kv('GCP_CREDS')}}" projectId: "{{kv('GCP_PROJECT_ID')}}" location: "{{kv('GCP_LOCATION')}}" bucket: "{{kv('GCP_BUCKET_NAME')}}" ================================================ FILE: cohorts/2025/02-workflow-orchestration/flows/06_gcp_taxi.yaml ================================================ id: 06_gcp_taxi namespace: zoomcamp description: | The CSV Data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases inputs: - id: taxi type: SELECT displayName: Select taxi type values: [yellow, green] defaults: green - id: year type: SELECT displayName: Select year values: ["2019", "2020"] defaults: "2019" allowCustomValue: true # allows you to type 2021 from the UI for the homework 🤗 - id: month type: SELECT displayName: Select month values: ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"] defaults: "01" variables: file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv" gcs_file: "gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}" table: "{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{inputs.year}}_{{inputs.month}}" data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}" tasks: - id: set_label type: io.kestra.plugin.core.execution.Labels labels: file: "{{render(vars.file)}}" taxi: "{{inputs.taxi}}" - id: extract type: io.kestra.plugin.scripts.shell.Commands outputFiles: - "*.csv" taskRunner: type: io.kestra.plugin.core.runner.Process commands: - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}} - id: upload_to_gcs type: io.kestra.plugin.gcp.gcs.Upload from: "{{render(vars.data)}}" to: "{{render(vars.gcs_file)}}" - id: if_yellow_taxi type: io.kestra.plugin.core.flow.If condition: 
"{{inputs.taxi == 'yellow'}}" then: - id: bq_yellow_tripdata type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` ( unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'), filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'), VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 
1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) PARTITION BY DATE(tpep_pickup_datetime); - id: bq_yellow_table_ext type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext` ( VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. 
This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. 
Does not include cash tips.'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) OPTIONS ( format = 'CSV', uris = ['{{render(vars.gcs_file)}}'], skip_leading_rows = 1, ignore_unknown_values = TRUE ); - id: bq_yellow_table_tmp type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` AS SELECT MD5(CONCAT( COALESCE(CAST(VendorID AS STRING), ""), COALESCE(CAST(tpep_pickup_datetime AS STRING), ""), COALESCE(CAST(tpep_dropoff_datetime AS STRING), ""), COALESCE(CAST(PULocationID AS STRING), ""), COALESCE(CAST(DOLocationID AS STRING), "") )) AS unique_row_id, "{{render(vars.file)}}" AS filename, * FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`; - id: bq_yellow_merge type: io.kestra.plugin.gcp.bigquery.Query sql: | MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge) VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge); - id: if_green_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'green'}}" then: - id: bq_green_tripdata type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` ( unique_row_id BYTES 
OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'), filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'), VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. 
Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), ehail_fee NUMERIC, improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) PARTITION BY DATE(lpep_pickup_datetime); - id: bq_green_table_ext type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext` ( VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 
1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), ehail_fee NUMERIC, improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 
1= Street-hail 2= Dispatch'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) OPTIONS ( format = 'CSV', uris = ['{{render(vars.gcs_file)}}'], skip_leading_rows = 1, ignore_unknown_values = TRUE ); - id: bq_green_table_tmp type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` AS SELECT MD5(CONCAT( COALESCE(CAST(VendorID AS STRING), ""), COALESCE(CAST(lpep_pickup_datetime AS STRING), ""), COALESCE(CAST(lpep_dropoff_datetime AS STRING), ""), COALESCE(CAST(PULocationID AS STRING), ""), COALESCE(CAST(DOLocationID AS STRING), "") )) AS unique_row_id, "{{render(vars.file)}}" AS filename, * FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`; - id: bq_green_merge type: io.kestra.plugin.gcp.bigquery.Query sql: | MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge) VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge); - id: purge_files type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles description: If you'd like to explore Kestra outputs, disable it. 
disabled: false pluginDefaults: - type: io.kestra.plugin.gcp values: serviceAccount: "{{kv('GCP_CREDS')}}" projectId: "{{kv('GCP_PROJECT_ID')}}" location: "{{kv('GCP_LOCATION')}}" bucket: "{{kv('GCP_BUCKET_NAME')}}" ================================================ FILE: cohorts/2025/02-workflow-orchestration/flows/06_gcp_taxi_scheduled.yaml ================================================ id: 06_gcp_taxi_scheduled namespace: zoomcamp description: | Best to add a label `backfill:true` from the UI to track executions created via a backfill. CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases inputs: - id: taxi type: SELECT displayName: Select taxi type values: [yellow, green] defaults: green variables: file: "{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv" gcs_file: "gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}" table: "{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy_MM')}}" data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}" tasks: - id: set_label type: io.kestra.plugin.core.execution.Labels labels: file: "{{render(vars.file)}}" taxi: "{{inputs.taxi}}" - id: extract type: io.kestra.plugin.scripts.shell.Commands outputFiles: - "*.csv" taskRunner: type: io.kestra.plugin.core.runner.Process commands: - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}} - id: upload_to_gcs type: io.kestra.plugin.gcp.gcs.Upload from: "{{render(vars.data)}}" to: "{{render(vars.gcs_file)}}" - id: if_yellow_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'yellow'}}" then: - id: bq_yellow_tripdata type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` ( unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key 
trip attributes.'), filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'), VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. 
Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) PARTITION BY DATE(tpep_pickup_datetime); - id: bq_yellow_table_ext type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext` ( VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 
1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. 
Does not include cash tips.'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) OPTIONS ( format = 'CSV', uris = ['{{render(vars.gcs_file)}}'], skip_leading_rows = 1, ignore_unknown_values = TRUE ); - id: bq_yellow_table_tmp type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` AS SELECT MD5(CONCAT( COALESCE(CAST(VendorID AS STRING), ""), COALESCE(CAST(tpep_pickup_datetime AS STRING), ""), COALESCE(CAST(tpep_dropoff_datetime AS STRING), ""), COALESCE(CAST(PULocationID AS STRING), ""), COALESCE(CAST(DOLocationID AS STRING), "") )) AS unique_row_id, "{{render(vars.file)}}" AS filename, * FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`; - id: bq_yellow_merge type: io.kestra.plugin.gcp.bigquery.Query sql: | MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge) VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge); - id: if_green_taxi type: io.kestra.plugin.core.flow.If condition: "{{inputs.taxi == 'green'}}" then: - id: bq_green_tripdata type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` ( unique_row_id BYTES 
OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'), filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'), VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. 
Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), ehail_fee NUMERIC, improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) PARTITION BY DATE(lpep_pickup_datetime); - id: bq_green_table_ext type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext` ( VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'), lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'), lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'), store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'), RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 
1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'), PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'), DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'), passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'), trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'), fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'), extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'), mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'), tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'), tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'), ehail_fee NUMERIC, improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'), total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'), payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'), trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 
1= Street-hail 2= Dispatch'), congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones') ) OPTIONS ( format = 'CSV', uris = ['{{render(vars.gcs_file)}}'], skip_leading_rows = 1, ignore_unknown_values = TRUE ); - id: bq_green_table_tmp type: io.kestra.plugin.gcp.bigquery.Query sql: | CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` AS SELECT MD5(CONCAT( COALESCE(CAST(VendorID AS STRING), ""), COALESCE(CAST(lpep_pickup_datetime AS STRING), ""), COALESCE(CAST(lpep_dropoff_datetime AS STRING), ""), COALESCE(CAST(PULocationID AS STRING), ""), COALESCE(CAST(DOLocationID AS STRING), "") )) AS unique_row_id, "{{render(vars.file)}}" AS filename, * FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`; - id: bq_green_merge type: io.kestra.plugin.gcp.bigquery.Query sql: | MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S ON T.unique_row_id = S.unique_row_id WHEN NOT MATCHED THEN INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge) VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge); - id: purge_files type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles description: To avoid cluttering your storage, we will remove the downloaded files pluginDefaults: - type: io.kestra.plugin.gcp values: serviceAccount: 
"{{kv('GCP_CREDS')}}" projectId: "{{kv('GCP_PROJECT_ID')}}" location: "{{kv('GCP_LOCATION')}}" bucket: "{{kv('GCP_BUCKET_NAME')}}" triggers: - id: green_schedule type: io.kestra.plugin.core.trigger.Schedule cron: "0 9 1 * *" inputs: taxi: green - id: yellow_schedule type: io.kestra.plugin.core.trigger.Schedule cron: "0 10 1 * *" inputs: taxi: yellow ================================================ FILE: cohorts/2025/02-workflow-orchestration/flows/07_gcp_dbt.yaml ================================================ id: 07_gcp_dbt namespace: zoomcamp inputs: - id: dbt_command type: SELECT allowCustomValue: true defaults: dbt build values: - dbt build - dbt debug # use when running the first time to validate DB connection tasks: - id: sync type: io.kestra.plugin.git.SyncNamespaceFiles url: https://github.com/DataTalksClub/data-engineering-zoomcamp branch: main namespace: "{{flow.namespace}}" gitDirectory: 04-analytics-engineering/taxi_rides_ny dryRun: false # disabled: true # this Git Sync is needed only when running it the first time, afterwards the task can be disabled - id: dbt-build type: io.kestra.plugin.dbt.cli.DbtCLI env: DBT_DATABASE: "{{kv('GCP_PROJECT_ID')}}" DBT_SCHEMA: "{{kv('GCP_DATASET')}}" namespaceFiles: enabled: true containerImage: ghcr.io/kestra-io/dbt-bigquery:latest taskRunner: type: io.kestra.plugin.scripts.runner.docker.Docker inputFiles: sa.json: "{{kv('GCP_CREDS')}}" commands: - dbt deps - "{{ inputs.dbt_command }}" storeManifest: key: manifest.json namespace: "{{ flow.namespace }}" profiles: | default: outputs: dev: type: bigquery dataset: "{{kv('GCP_DATASET')}}" project: "{{kv('GCP_PROJECT_ID')}}" location: "{{kv('GCP_LOCATION')}}" keyfile: sa.json method: service-account priority: interactive threads: 16 timeout_seconds: 300 fixed_retries: 1 target: dev description: | Note that you need to adjust the models/staging/schema.yml file to match your database and schema. Select and edit that Namespace File from the UI. Save and run this flow. 
Once https://github.com/DataTalksClub/data-engineering-zoomcamp/pull/565/files is merged, you can ignore this note as it will be dynamically adjusted based on env variables. ```yaml sources: - name: staging database: kestra-sandbox schema: zoomcamp ``` ================================================ FILE: cohorts/2025/02-workflow-orchestration/homework.md ================================================ ## Module 2 Homework ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include these directly in the README file of your repository. > In case you don't get one option exactly, select the closest one For the homework, we'll be working with the _green_ taxi dataset located here: `https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download` To get a `wget`-able link, use this prefix (note that the link itself gives 404): `https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/` ### Assignment So far in the course, we processed data for the year 2019 and 2020. Your task is to extend the existing flows to include data for the year 2021. ![homework datasets](../../../02-workflow-orchestration/images/homework.png) As a hint, Kestra makes that process really easy: 1. You can leverage the backfill functionality in the [scheduled flow](../../../02-workflow-orchestration/flows/06_gcp_taxi_scheduled.yaml) to backfill the data for the year 2021. Just make sure to select the time period for which data exists i.e. from `2021-01-01` to `2021-07-31`. Also, make sure to do the same for both `yellow` and `green` taxi data (select the right service in the `taxi` input). 2. Alternatively, run the flow manually for each of the seven months of 2021 for both `yellow` and `green` taxi data. 
Challenge for you: find out how to loop over the combination of Year-Month and `taxi`-type using `ForEach` task which triggers the flow for each combination using a `Subflow` task. ### Quiz Questions Complete the Quiz shown below. It’s a set of 6 multiple-choice questions to test your understanding of workflow orchestration, Kestra and ETL pipelines for data lakes and warehouses. 1) Within the execution for `Yellow` Taxi data for the year `2020` and month `12`: what is the uncompressed file size (i.e. the output file `yellow_tripdata_2020-12.csv` of the `extract` task)? - 128.3 MiB - 134.5 MiB - 364.7 MiB - 692.6 MiB 2) What is the rendered value of the variable `file` when the inputs `taxi` is set to `green`, `year` is set to `2020`, and `month` is set to `04` during execution? - `{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv` - `green_tripdata_2020-04.csv` - `green_tripdata_04_2020.csv` - `green_tripdata_2020.csv` 3) How many rows are there for the `Yellow` Taxi data for all CSV files in the year 2020? - 13,537.299 - 24,648,499 - 18,324,219 - 29,430,127 4) How many rows are there for the `Green` Taxi data for all CSV files in the year 2020? - 5,327,301 - 936,199 - 1,734,051 - 1,342,034 5) How many rows are there for the `Yellow` Taxi data for the March 2021 CSV file? - 1,428,092 - 706,911 - 1,925,152 - 2,561,031 6) How would you configure the timezone to New York in a Schedule trigger? 
- Add a `timezone` property set to `EST` in the `Schedule` trigger configuration - Add a `timezone` property set to `America/New_York` in the `Schedule` trigger configuration - Add a `timezone` property set to `UTC-5` in the `Schedule` trigger configuration - Add a `location` property set to `New_York` in the `Schedule` trigger configuration ## Submitting the solutions * Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw2 * Check the link above to see the due date ## Solution Will be added after the due date ================================================ FILE: cohorts/2025/03-data-warehouse/DLT_upload_to_GCP.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": { "id": "aC2QnhmKxpq1" }, "source": [ "**Please set up your credentials JSON as GCP_CREDENTIALS secrets**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "UsUZobVduL7l" }, "outputs": [], "source": [ "import os\n", "from google.colab import userdata\n", "\n", "os.environ[\"DESTINATION__CREDENTIALS\"] = userdata.get('GCP_CREDENTIALS')\n", "os.environ[\"BUCKET_URL\"] = \"gs://your_bucket_url\"" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "mPBzsEgyjsBo" }, "outputs": [], "source": [ "# Install for production\n", "%%capture\n", "!pip install dlt[bigquery, gs]" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "evdUsDNbkCTk" }, "outputs": [], "source": [ "# Install for testing\n", "%%capture\n", "!pip install dlt[duckdb]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "lYh7r1mTf4uo" }, "outputs": [], "source": [ "import dlt\n", "import requests\n", "import pandas as pd\n", "from dlt.destinations import filesystem\n", "from io import BytesIO" ] }, { "cell_type": "markdown", "metadata": { "id": "76zT1PzAgs7A" }, "source": [ "Ingesting parquet files to GCS." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xya0215jsnsb" }, "outputs": [], "source": [ "# Define a dlt source to download and process Parquet files as resources\n", "@dlt.source(name=\"rides\")\n", "def download_parquet():\n", " for month in range(1,7):\n", " file_name = f\"yellow_tripdata_2024-0{month}.parquet\"\n", "\n", " url = f\"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-0{month}.parquet\"\n", " response = requests.get(url)\n", "\n", " df = pd.read_parquet(BytesIO(response.content))\n", "\n", " # Return the dataframe as a dlt resource for ingestion\n", " yield dlt.resource(df, name=file_name)\n", "\n", "# Initialize the pipeline\n", "pipeline = dlt.pipeline(\n", " pipeline_name=\"rides_pipeline\",\n", " destination=filesystem(\n", " layout=\"{schema_name}/{table_name}.{ext}\"\n", " ),\n", " dataset_name=\"rides_dataset\"\n", ")\n", "\n", "# Run the pipeline to load Parquet data into DuckDB\n", "load_info = pipeline.run(\n", " download_parquet(),\n", " loader_file_format=\"parquet\"\n", " )\n", "\n", "# Print the results\n", "print(load_info)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "S0310FT-gy_P" }, "source": [ "Ingesting data to Database" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1_3K97w1c2v2", "outputId": "4b2d26bf-2814-46fa-f80d-7a2e17417a95" }, "outputs": [], "source": [ "# Define a dlt resource to download and process Parquet files as single table\n", "@dlt.resource(name=\"rides\", write_disposition=\"replace\")\n", "def download_parquet():\n", " for month in range(1,7):\n", " url = f\"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-0{month}.parquet\"\n", " response = requests.get(url)\n", "\n", " df = pd.read_parquet(BytesIO(response.content))\n", "\n", " # Return the dataframe as a dlt resource for ingestion\n", " yield df\n", "\n", "# Initialize the pipeline\n", "pipeline = 
dlt.pipeline(\n", " pipeline_name=\"rides_pipeline\",\n", " destination=\"duckdb\", # Use DuckDB for testing\n", " # destination=\"bigquery\", # Use BigQuery for production\n", " dataset_name=\"rides_dataset\"\n", ")\n", "\n", "# Run the pipeline to load Parquet data into DuckDB\n", "info = pipeline.run(download_parquet)\n", "\n", "# Print the results\n", "print(info)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gDcLjzLtooBV", "outputId": "74ff2de7-2f2e-41b9-a681-3dc5887f6eed" }, "outputs": [], "source": [ "import duckdb\n", "conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n", "\n", "# Set search path to the dataset\n", "conn.sql(f\"SET search_path = '{pipeline.dataset_name}'\")\n", "\n", "# Describe the dataset to see loaded tables\n", "res = conn.sql(\"DESCRIBE\").df()\n", "print(res)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VVJy8JoerI2P", "outputId": "3f8c7fee-a9ee-4fd4-ec75-153ca60bd36f" }, "outputs": [], "source": [ "# provide a resource name to query a table of that name\n", "with pipeline.sql_client() as client:\n", " with client.execute_query(f\"SELECT count(1) FROM rides\") as cursor:\n", " data = cursor.df()\n", "print(data)" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 } ================================================ FILE: cohorts/2025/03-data-warehouse/homework.md ================================================ ## Module 3 Homework ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. 
If your solution includes code that is not in file format (such as SQL queries or shell commands), please include these directly in the README file of your repository.

Important Note:

For this homework we will be using the Yellow Taxi Trip Records Parquet files for **January 2024 - June 2024 (NOT the entire year of data)** from the New York City Taxi Data found here:
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
If you are using an orchestrator such as Kestra, Mage, Airflow, or Prefect, do not load the data into Big Query using the orchestrator.
Stop after loading the files into a bucket.

**Load Script:** You can manually download the parquet files and upload them to your GCS Bucket, or you can use the linked script [here](./load_yellow_taxi_data.py).
You will simply need to generate a Service Account with GCS Admin privileges, or be authenticated with the Google SDK, and update the bucket name in the script to the name of your bucket.
Nothing is foolproof, so make sure that all 6 files show in your GCS Bucket before beginning.
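
As a rough aid for that check, the sketch below compares a listing of object names against the six file names the homework expects. The helper names `expected_files` and `missing_files` are hypothetical (they are not part of the linked load script), and how you obtain the listing is left open: it could come from the Cloud Console, from `gsutil ls`, or from `client.list_blobs(...)` in `google-cloud-storage`.

```python
# Hypothetical helpers for illustration; not part of load_yellow_taxi_data.py.
EXPECTED_MONTHS = [f"{m:02d}" for m in range(1, 7)]  # "01" .. "06" (January to June 2024)


def expected_files():
    """Return the six Parquet object names the homework expects in the bucket."""
    return [f"yellow_tripdata_2024-{m}.parquet" for m in EXPECTED_MONTHS]


def missing_files(present_names):
    """Return the expected names that are absent from `present_names`, in month order."""
    present = set(present_names)
    return [name for name in expected_files() if name not in present]


if __name__ == "__main__":
    # Assumption: `listing` holds the object names currently in your bucket.
    listing = ["yellow_tripdata_2024-01.parquet", "yellow_tripdata_2024-02.parquet"]
    print("Missing:", missing_files(listing))
```

An empty result from `missing_files` means all six files are present by name; it does not check that the uploads are complete or uncorrupted.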

NOTE: You will need to use the PARQUET format option when creating an External Table.
BIG QUERY SETUP:
Create an external table using the Yellow Taxi Trip Records.
Create a (regular/materialized) table in BQ using the Yellow Taxi Trip Records (do not partition or cluster this table).
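
A minimal sketch of the two setup statements, written as Python helpers that only build the SQL text so you can paste it into the BigQuery console. Everything named here is an assumption for illustration: the helper names, the table names `yellow_tripdata_2024_ext` and `yellow_tripdata_2024`, and the project, dataset, and bucket values are placeholders you would replace with your own.

```python
# Hypothetical helpers: they only format SQL strings and do not call BigQuery.

def external_table_ddl(project, dataset, bucket, table="yellow_tripdata_2024_ext"):
    """Build DDL for an external table over the Parquet files in a GCS bucket."""
    # A single * wildcard matches the six monthly files uploaded earlier.
    uri = f"gs://{bucket}/yellow_tripdata_2024-*.parquet"
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{uri}']\n"
        ");"
    )


def regular_table_ddl(project, dataset,
                      source_table="yellow_tripdata_2024_ext",
                      table="yellow_tripdata_2024"):
    """Build DDL for a regular table copied from the external table.

    No partitioning or clustering clause is added, as the setup requires.
    """
    return (
        f"CREATE OR REPLACE TABLE `{project}.{dataset}.{table}` AS\n"
        f"SELECT * FROM `{project}.{dataset}.{source_table}`;"
    )


if __name__ == "__main__":
    # Placeholder identifiers; substitute your own project, dataset, and bucket.
    print(external_table_ddl("my-project", "my_dataset", "my-bucket"))
    print()
    print(regular_table_ddl("my-project", "my_dataset"))
```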

## Question 1:
What is the count of records for the 2024 Yellow Taxi Data?
- 65,623
- 840,402
- 20,332,093
- 85,431,289

## Question 2:
Write a query to count the distinct number of PULocationIDs for the entire dataset on both the tables.
What is the **estimated amount** of data that will be read when this query is executed on the External Table and the Table?
- 18.82 MB for the External Table and 47.60 MB for the Materialized Table
- 0 MB for the External Table and 155.12 MB for the Materialized Table
- 2.14 GB for the External Table and 0MB for the Materialized Table
- 0 MB for the External Table and 0MB for the Materialized Table

## Question 3:
Write a query to retrieve the PULocationID from the table (not the external table) in BigQuery. Now write a query to retrieve the PULocationID and DOLocationID on the same table. Why is the estimated number of bytes different?
- BigQuery is a columnar database, and it only scans the specific columns requested in the query. Querying two columns (PULocationID, DOLocationID) requires reading more data than querying one column (PULocationID), leading to a higher estimated number of bytes processed.
- BigQuery duplicates data across multiple storage partitions, so selecting two columns instead of one requires scanning the table twice, doubling the estimated bytes processed.
- BigQuery automatically caches the first queried column, so adding a second column increases processing time but does not affect the estimated bytes scanned.
- When selecting multiple columns, BigQuery performs an implicit join operation between them, increasing the estimated bytes processed

## Question 4:
How many records have a fare_amount of 0?
- 128,210
- 546,578
- 20,188,016
- 8,333

## Question 5:
What is the best strategy to make an optimized table in Big Query if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID (Create a new table with this strategy)
- Partition by tpep_dropoff_datetime and Cluster on VendorID
- Cluster on by tpep_dropoff_datetime and Cluster on VendorID
- Cluster on tpep_dropoff_datetime Partition by VendorID
- Partition by tpep_dropoff_datetime and Partition by VendorID

## Question 6:
Write a query to retrieve the distinct VendorIDs between tpep_dropoff_datetime 2024-03-01 and 2024-03-15 (inclusive)
Use the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 5 and note the estimated bytes processed. What are these values?
Choose the answer which most closely matches.
- 12.47 MB for non-partitioned table and 326.42 MB for the partitioned table
- 310.24 MB for non-partitioned table and 26.84 MB for the partitioned table
- 5.87 MB for non-partitioned table and 0 MB for the partitioned table
- 310.31 MB for non-partitioned table and 285.64 MB for the partitioned table

## Question 7:
Where is the data stored in the External Table you created?
- Big Query
- Container Registry
- GCP Bucket
- Big Table

## Question 8:
It is best practice in Big Query to always cluster your data:
- True
- False

## (Bonus: Not worth points) Question 9:
No Points: Write a `SELECT count(*)` query FROM the materialized table you created. How many bytes does it estimate will be read? Why?

## Submitting the solutions

Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw3

## Solution

Solution: https://www.youtube.com/watch?v=wpLmImIUlPg

================================================
FILE: cohorts/2025/03-data-warehouse/load_yellow_taxi_data.py
================================================
import os
import sys
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage
from google.api_core.exceptions import NotFound, Forbidden
import time


# Change this to your bucket name
BUCKET_NAME = "dezoomcamp_hw3_2025"

# If you authenticated through the GCP SDK you can comment out these two lines
CREDENTIALS_FILE = "gcs.json"
client = storage.Client.from_service_account_json(CREDENTIALS_FILE)
# If commented initialize client with the following
# client = storage.Client(project='zoomcamp-mod3-datawarehouse')


BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-"
MONTHS = [f"{i:02d}" for i in range(1, 7)]
DOWNLOAD_DIR = "."

CHUNK_SIZE = 8 * 1024 * 1024

os.makedirs(DOWNLOAD_DIR, exist_ok=True)

bucket = client.bucket(BUCKET_NAME)


def download_file(month):
    url = f"{BASE_URL}{month}.parquet"
    file_path = os.path.join(DOWNLOAD_DIR, f"yellow_tripdata_2024-{month}.parquet")

    try:
        print(f"Downloading {url}...")
        urllib.request.urlretrieve(url, file_path)
        print(f"Downloaded: {file_path}")
        return file_path
    except Exception as e:
        print(f"Failed to download {url}: {e}")
        return None


def create_bucket(bucket_name):
    try:
        # Get bucket details
        bucket = client.get_bucket(bucket_name)

        # Check if the bucket belongs to the current project
        project_bucket_ids = [bckt.id for bckt in client.list_buckets()]
        if bucket_name in project_bucket_ids:
            print(
                f"Bucket '{bucket_name}' exists and belongs to your project. Proceeding..."
            )
        else:
            print(
                f"A bucket with the name '{bucket_name}' already exists, but it does not belong to your project."
            )
            sys.exit(1)

    except NotFound:
        # If the bucket doesn't exist, create it
        bucket = client.create_bucket(bucket_name)
        print(f"Created bucket '{bucket_name}'")
    except Forbidden:
        # If the request is forbidden, it means the bucket exists but you don't have access to see details
        print(
            f"A bucket with the name '{bucket_name}' exists, but it is not accessible. Bucket name is taken. Please try a different bucket name."
        )
        sys.exit(1)


def verify_gcs_upload(blob_name):
    return storage.Blob(bucket=bucket, name=blob_name).exists(client)


def upload_to_gcs(file_path, max_retries=3):
    blob_name = os.path.basename(file_path)
    blob = bucket.blob(blob_name)
    blob.chunk_size = CHUNK_SIZE

    create_bucket(BUCKET_NAME)

    for attempt in range(max_retries):
        try:
            print(f"Uploading {file_path} to {BUCKET_NAME} (Attempt {attempt + 1})...")
            blob.upload_from_filename(file_path)
            print(f"Uploaded: gs://{BUCKET_NAME}/{blob_name}")

            if verify_gcs_upload(blob_name):
                print(f"Verification successful for {blob_name}")
                return
            else:
                print(f"Verification failed for {blob_name}, retrying...")
        except Exception as e:
            print(f"Failed to upload {file_path} to GCS: {e}")

        time.sleep(5)

    print(f"Giving up on {file_path} after {max_retries} attempts.")


if __name__ == "__main__":
    create_bucket(BUCKET_NAME)

    with ThreadPoolExecutor(max_workers=4) as executor:
        file_paths = list(executor.map(download_file, MONTHS))

    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(upload_to_gcs, filter(None, file_paths))  # Remove None values

    print("All files processed and verified.")

================================================
FILE: cohorts/2025/04-analytics-engineering/homework.md
================================================
## Module 4 Homework

For this homework, you will need the following datasets:
* [Green Taxi dataset (2019 and 2020)](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green)
* [Yellow Taxi dataset (2019 and 2020)](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/yellow)
* [For Hire Vehicle dataset (2019)](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv)

### Before you start

1. Make sure you, **at least**, have them in GCS with an External Table **OR** a Native Table - use whichever method you prefer to accomplish that (Workflow Orchestration with [pandas-gbq](https://cloud.google.com/bigquery/docs/samples/bigquery-pandas-gbq-to-gbq-simple), [dlt for gcs](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem), [dlt for BigQuery](https://dlthub.com/docs/dlt-ecosystem/destinations/bigquery), [gsutil](https://cloud.google.com/storage/docs/gsutil), etc)
2. You should have exactly `7,778,101` records in your Green Taxi table
3. You should have exactly `109,047,518` records in your Yellow Taxi table
4. You should have exactly `43,244,696` records in your FHV table
5. Build the staging models for green/yellow as shown in [here](../../../04-analytics-engineering/taxi_rides_ny/models/staging/)
6. Build the dimension/fact for taxi_trips joining with `dim_zones` as shown in [here](../../../04-analytics-engineering/taxi_rides_ny/models/core/fact_trips.sql)

**Note**: If you don't have access to GCP, you can spin up a local Postgres instance and ingest the datasets above

### Question 1: Understanding dbt model resolution

Provided you've got the following sources.yaml
```yaml
version: 2

sources:
  - name: raw_nyc_tripdata
    database: "{{ env_var('DBT_BIGQUERY_PROJECT', 'dtc_zoomcamp_2025') }}"
    schema: "{{ env_var('DBT_BIGQUERY_SOURCE_DATASET', 'raw_nyc_tripdata') }}"
    tables:
      - name: ext_green_taxi
      - name: ext_yellow_taxi
```

with the following env variables setup where `dbt` runs:
```shell
export DBT_BIGQUERY_PROJECT=myproject
export DBT_BIGQUERY_DATASET=my_nyc_tripdata
```

What does this .sql model compile to?
```sql
select *
from {{ source('raw_nyc_tripdata', 'ext_green_taxi' ) }}
```

- `select * from dtc_zoomcamp_2025.raw_nyc_tripdata.ext_green_taxi`
- `select * from dtc_zoomcamp_2025.my_nyc_tripdata.ext_green_taxi`
- `select * from myproject.raw_nyc_tripdata.ext_green_taxi`
- `select * from myproject.my_nyc_tripdata.ext_green_taxi`
- `select * from dtc_zoomcamp_2025.raw_nyc_tripdata.green_taxi`

### Question 2: dbt Variables & Dynamic Models

Say you have to modify the following dbt_model (`fct_recent_taxi_trips.sql`) to enable Analytics Engineers to dynamically control the date range.

- In development, you want to process only **the last 7 days of trips**
- In production, you need to process **the last 30 days** for analytics

```sql
select *
from {{ ref('fact_taxi_trips') }}
where pickup_datetime >= CURRENT_DATE - INTERVAL '30' DAY
```

What would you change to accomplish that in such a way that command line arguments take precedence over ENV_VARs, which take precedence over the DEFAULT value?
- Add `ORDER BY pickup_datetime DESC` and `LIMIT {{ var("days_back", 30) }}` - Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ var("days_back", 30) }}' DAY` - Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ env_var("DAYS_BACK", "30") }}' DAY` - Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ var("days_back", env_var("DAYS_BACK", "30")) }}' DAY` - Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ env_var("DAYS_BACK", var("days_back", "30")) }}' DAY` ### Question 3: dbt Data Lineage and Execution Considering the data lineage below **and** that taxi_zone_lookup is the **only** materialization build (from a .csv seed file): ![image](./homework_q2.png) Select the option that does **NOT** apply for materializing `fct_taxi_monthly_zone_revenue`: - `dbt run` - `dbt run --select +models/core/dim_taxi_trips.sql+ --target prod` - `dbt run --select +models/core/fct_taxi_monthly_zone_revenue.sql` - `dbt run --select +models/core/` - `dbt run --select models/staging/+` ### Question 4: dbt Macros and Jinja Consider you're dealing with sensitive data (e.g.: [PII](https://en.wikipedia.org/wiki/Personal_data)), that is **only available to your team and very selected few individuals**, in the `raw layer` of your DWH (e.g: a specific BigQuery dataset or PostgreSQL schema), - Among other things, you decide to obfuscate/masquerade that data through your staging models, and make it available in a different schema (a `staging layer`) for other Data/Analytics Engineers to explore - And **optionally**, yet another layer (`service layer`), where you'll build your dimension (`dim_`) and fact (`fct_`) tables (assuming the [Star Schema dimensional modeling](https://www.databricks.com/glossary/star-schema)) for Dashboarding and for Tech Product Owners/Managers You decide to make a macro to wrap a logic around it: ```sql {% macro resolve_schema_for(model_type) -%} {%- set target_env_var = 
'DBT_BIGQUERY_TARGET_DATASET' -%} {%- set stging_env_var = 'DBT_BIGQUERY_STAGING_DATASET' -%} {%- if model_type == 'core' -%} {{- env_var(target_env_var) -}} {%- else -%} {{- env_var(stging_env_var, env_var(target_env_var)) -}} {%- endif -%} {%- endmacro %} ``` And use on your staging, dim_ and fact_ models as: ```sql {{ config( schema=resolve_schema_for('core'), ) }} ``` That all being said, regarding macro above, **select all statements that are true to the models using it**: - Setting a value for `DBT_BIGQUERY_TARGET_DATASET` env var is mandatory, or it'll fail to compile - Setting a value for `DBT_BIGQUERY_STAGING_DATASET` env var is mandatory, or it'll fail to compile - When using `core`, it materializes in the dataset defined in `DBT_BIGQUERY_TARGET_DATASET` - When using `stg`, it materializes in the dataset defined in `DBT_BIGQUERY_STAGING_DATASET`, or defaults to `DBT_BIGQUERY_TARGET_DATASET` - When using `staging`, it materializes in the dataset defined in `DBT_BIGQUERY_STAGING_DATASET`, or defaults to `DBT_BIGQUERY_TARGET_DATASET` ## Serious SQL Alright, in module 1, you had a SQL refresher, so now let's build on top of that with some serious SQL. These are not meant to be easy - but they'll boost your SQL and Analytics skills to the next level. So, without any further do, let's get started... You might want to add some new dimensions `year` (e.g.: 2019, 2020), `quarter` (1, 2, 3, 4), `year_quarter` (e.g.: `2019/Q1`, `2019-Q2`), and `month` (e.g.: 1, 2, ..., 12), **extracted from pickup_datetime**, to your `fct_taxi_trips` OR `dim_taxi_trips.sql` models to facilitate filtering your queries ### Question 5: Taxi Quarterly Revenue Growth 1. Create a new model `fct_taxi_trips_quarterly_revenue.sql` 2. Compute the Quarterly Revenues for each year for based on `total_amount` 3. 
Compute the Quarterly YoY (Year-over-Year) revenue growth
    * e.g.: In 2020/Q1, Green Taxi had -12.34% revenue growth compared to 2019/Q1
    * e.g.: In 2020/Q4, Yellow Taxi had +34.56% revenue growth compared to 2019/Q4

***Important Note: The Year-over-Year (YoY) growth percentages provided in the examples are purely illustrative. You will not be able to reproduce these exact values using the datasets provided for this homework.***

Considering the YoY Growth in 2020, which were the yearly quarters with the best (or least bad) and worst results for green and yellow?

- green: {best: 2020/Q2, worst: 2020/Q1}, yellow: {best: 2020/Q2, worst: 2020/Q1}
- green: {best: 2020/Q2, worst: 2020/Q1}, yellow: {best: 2020/Q3, worst: 2020/Q4}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q2, worst: 2020/Q1}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q1, worst: 2020/Q2}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q3, worst: 2020/Q4}


### Question 6: P97/P95/P90 Taxi Monthly Fare

1. Create a new model `fct_taxi_trips_monthly_fare_p95.sql`
2. Filter out invalid entries, keeping only rows where `fare_amount > 0`, `trip_distance > 0`, and `payment_type_description in ('Cash', 'Credit card')`
3. Compute the **continuous percentile** of `fare_amount` partitioning by service_type, year and month

Now, what are the values of `p97`, `p95`, `p90` for Green Taxi and Yellow Taxi, in April 2020?
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 52.0, p95: 37.0, p90: 25.5}
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 31.5, p95: 25.5, p90: 19.0}
- green: {p97: 40.0, p95: 33.0, p90: 24.5}, yellow: {p97: 52.0, p95: 37.0, p90: 25.5}
- green: {p97: 40.0, p95: 33.0, p90: 24.5}, yellow: {p97: 31.5, p95: 25.5, p90: 19.0}
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 52.0, p95: 25.5, p90: 19.0}


### Question 7: Top #Nth longest P90 travel time Location for FHV

Prerequisites:

* Create a staging model for FHV Data (2019), and **DO NOT** add a deduplication step, just keep the entries where `dispatching_base_num is not null`
* Create a core model for FHV Data (`dim_fhv_trips.sql`) joining with `dim_zones`. Similar to what has been done [here](../../../04-analytics-engineering/taxi_rides_ny/models/core/fact_trips.sql)
* Add some new dimensions `year` (e.g.: 2019) and `month` (e.g.: 1, 2, ..., 12), based on `pickup_datetime`, to the core model to facilitate filtering for your queries

Now...

1. Create a new model `fct_fhv_monthly_zone_traveltime_p90.sql`
2. For each record in `dim_fhv_trips.sql`, compute the [timestamp_diff](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timestamp_diff) in seconds between dropoff_datetime and pickup_datetime - we'll call it `trip_duration` for this exercise
3. Compute the **continuous** `p90` of `trip_duration` partitioning by year, month, pickup_location_id, and dropoff_location_id

For the trips that **respectively** started from `Newark Airport`, `SoHo`, and `Yorkville East`, in November 2019, what are the **dropoff_zones** with the 2nd longest p90 trip_duration?
- LaGuardia Airport, Chinatown, Garment District
- LaGuardia Airport, Park Slope, Clinton East
- LaGuardia Airport, Saint Albans, Howard Beach
- LaGuardia Airport, Rosedale, Bath Beach
- LaGuardia Airport, Yorkville East, Greenpoint


## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw4

## Solution

* To be published after deadline


================================================
FILE: cohorts/2025/05-batch/homework.md
================================================
# Module 5 Homework

In this homework we'll put what we learned about Spark in practice.

For this homework we will be using the Yellow 2024-10 data from the official website:

```bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet
```

## Question 1: Install Spark and PySpark

- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.

What's the output?

> [!NOTE]
> To install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/setup/pyspark.md)

## Question 2: Yellow October 2024

Read the October 2024 Yellow into a Spark Dataframe.

Repartition the Dataframe to 4 partitions and save it to parquet.

What is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? Select the answer which most closely matches.

- 6MB
- 25MB
- 75MB
- 100MB

## Question 3: Count records

How many taxi trips were there on the 15th of October?

Consider only trips that started on the 15th of October.

- 85,567
- 105,567
- 125,567
- 145,567

## Question 4: Longest trip

What is the length of the longest trip in the dataset in hours?

- 122
- 142
- 162
- 182

## Question 5: User Interface

Spark’s User Interface which shows the application's dashboard runs on which local port?
- 80 - 443 - 4040 - 8080 ## Question 6: Least frequent pickup location zone Load the zone lookup data into a temp view in Spark: ```bash wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv ``` Using the zone lookup data and the Yellow October 2024 data, what is the name of the LEAST frequent pickup location Zone? - Governor's Island/Ellis Island/Liberty Island - Arden Heights - Rikers Island - Jamaica Bay ## Submitting the solutions - Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw5 - Deadline: See the website ================================================ FILE: cohorts/2025/06-streaming/homework/homework.ipynb ================================================ { "cells": [ { "cell_type": "code", "execution_count": 1, "id": "a63a4585-8a6b-4446-9b63-8c5d5d0b80fc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "\n", "from kafka import KafkaProducer\n", "\n", "def json_serializer(data):\n", " return json.dumps(data).encode('utf-8')\n", "\n", "server = 'localhost:9092'\n", "\n", "producer = KafkaProducer(\n", " bootstrap_servers=[server],\n", " value_serializer=json_serializer\n", ")\n", "\n", "producer.bootstrap_connected()" ] }, { "cell_type": "code", "execution_count": 2, "id": "78bd28f9-66cb-4532-bf03-bb3fe90655b5", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "--2025-03-07 19:27:06-- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz\n", "Resolving github.com (github.com)... 140.82.121.3\n", "Connecting to github.com (github.com)|140.82.121.3|:443... connected.\n", "HTTP request sent, awaiting response... 
302 Found\n", "Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/ea580e9e-555c-4bd0-ae73-43051d8e7c0b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250307%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250307T182706Z&X-Amz-Expires=300&X-Amz-Signature=6b8f2f603fe86515be24510f3f30bcf93c932b551769e5121fb0cbdf58e9b767&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dgreen_tripdata_2019-10.csv.gz&response-content-type=application%2Foctet-stream [following]\n", "--2025-03-07 19:27:07-- https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/ea580e9e-555c-4bd0-ae73-43051d8e7c0b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250307%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250307T182706Z&X-Amz-Expires=300&X-Amz-Signature=6b8f2f603fe86515be24510f3f30bcf93c932b551769e5121fb0cbdf58e9b767&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dgreen_tripdata_2019-10.csv.gz&response-content-type=application%2Foctet-stream\n", "Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...\n", "Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 8262584 (7.9M) [application/octet-stream]\n", "Saving to: 'green_tripdata_2019-10.csv.gz'\n", "\n", " 0K .......... .......... .......... .......... .......... 0% 1.08M 7s\n", " 50K .......... .......... .......... .......... .......... 1% 2.93M 5s\n", " 100K .......... .......... .......... .......... .......... 1% 3.15M 4s\n", " 150K .......... .......... .......... .......... .......... 2% 6.40M 3s\n", " 200K .......... .......... .......... .......... .......... 3% 5.41M 3s\n", " 250K .......... .......... .......... .......... .......... 
3% 7.09M 3s\n", " ...\n", " 7750K .......... .......... .......... .......... .......... 
96% 133M 0s\n", " 7800K .......... .......... .......... .......... .......... 97% 49.0M 0s\n", " 7850K .......... .......... .......... .......... .......... 97% 314M 0s\n", " 7900K .......... .......... .......... .......... .......... 98% 117M 0s\n", " 7950K .......... .......... .......... .......... .......... 99% 9.48M 0s\n", " 8000K .......... .......... .......... .......... .......... 99% 2.76M 0s\n", " 8050K .......... ........ 100% 223M=0.9s\n", "\n", "2025-03-07 19:27:08 (9.10 MB/s) - 'green_tripdata_2019-10.csv.gz' saved [8262584/8262584]\n", "\n" ] } ], "source": [ "!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz" ] }, { "cell_type": "code", "execution_count": 3, "id": "57fb14bf-f7f2-45a9-b918-d64203e5d802", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 4, "id": "2b8b3ac1-e3fb-4713-9ccb-7c0fbfe4c017", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\alexe\\AppData\\Local\\Temp\\ipykernel_3424\\2667354967.py:1: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.\n", " df = pd.read_csv('green_tripdata_2019-10.csv.gz')\n" ] } ], "source": [ "df = pd.read_csv('green_tripdata_2019-10.csv.gz')" ] }, { "cell_type": "code", "execution_count": 7, "id": "a0e8ab41-1520-46b1-b8fa-a3fedf170896", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag \\\n", "0 2.0 2019-10-01 00:26:02 2019-10-01 00:39:58 N \n", "1 1.0 2019-10-01 00:18:11 2019-10-01 00:22:38 N \n", "2 1.0 2019-10-01 00:09:31 2019-10-01 00:24:47 N \n", "3 1.0 2019-10-01 00:37:40 2019-10-01 00:41:49 N \n", "4 2.0 2019-10-01 00:08:13 2019-10-01 00:17:56 N \n", "\n", " RatecodeID PULocationID DOLocationID passenger_count trip_distance \\\n", "0 1.0 112 196 1.0 5.88 \n", "1 1.0 43 263 1.0 0.80 \n", "2 1.0 255 228 2.0 7.50 \n", "3 1.0 181 181 1.0 0.90 \n", "4 1.0 97 188 1.0 2.52 \n", "\n", " fare_amount extra mta_tax tip_amount tolls_amount ehail_fee \\\n", "0 18.0 0.50 0.5 0.00 0.0 NaN \n", "1 5.0 3.25 0.5 0.00 0.0 NaN \n", "2 21.5 0.50 0.5 0.00 0.0 NaN \n", "3 5.5 0.50 0.5 0.00 0.0 NaN \n", "4 10.0 0.50 0.5 2.26 0.0 NaN \n", "\n", " improvement_surcharge total_amount payment_type trip_type \\\n", "0 0.3 19.30 2.0 1.0 \n", "1 0.3 9.05 2.0 1.0 \n", "2 0.3 22.80 2.0 1.0 \n", "3 0.3 6.80 2.0 1.0 \n", "4 0.3 13.56 1.0 1.0 \n", "\n", " congestion_surcharge \n", "0 0.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 8, "id": "d085b583-1609-41a9-a222-ff6ca495ee27", "metadata": {}, "outputs": [], "source": [ "columns = [\n", " 'lpep_pickup_datetime',\n", " 'lpep_dropoff_datetime',\n", " 'PULocationID',\n", " 'DOLocationID',\n", " 'passenger_count',\n", " 'trip_distance',\n", " 'tip_amount'\n", "]" ] }, { "cell_type": "code", "execution_count": 9, "id": "66e9f47c-9284-4760-8011-3a8f48aaa49f", "metadata": {}, "outputs": [], "source": [ "df = df[columns]" ] }, { "cell_type": "code", "execution_count": 11, "id": "7ae3f843-d428-43d2-9e47-7f9fb43acbad", "metadata": {}, "outputs": [], "source": [ "from time import time" ] }, { "cell_type": "code", "execution_count": 14, "id": "f1ca1ac1-176c-4ccc-aa11-7e1cb5659d39", 
"metadata": {}, "outputs": [], "source": [ "from tqdm.auto import tqdm" ] }, { "cell_type": "code", "execution_count": 12, "id": "0b3da4e1-2f1c-400f-bb67-82734c1193f4", "metadata": {}, "outputs": [], "source": [ "messages = df.to_dict(orient='records')" ] }, { "cell_type": "code", "execution_count": 13, "id": "3bdc95d8-64e1-4819-a885-996813b4bf94", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "476386" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(messages)" ] }, { "cell_type": "code", "execution_count": 15, "id": "d6f15929-e928-464d-afc1-690343f4f780", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4dffdeb2a0064e1d9bd02dff9f9c49f0", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/476386 [00:00 **Important**: update your "Certificate name" here: https://courses.datatalks.club/de-zoomcamp-2025/enrollment - this is what we will use when generating certificates for you. ### Evaluation criteria See [here](../../projects/README.md) ================================================ FILE: cohorts/2025/workshops/dlt/README.md ================================================ # Data ingestion with dlt Homework: [dlt_homework.md](dlt_homework.md) 🎥 **Watch the workshop video** [![Watch the workshop video](https://markdown-videos-api.jorgenkh.no/youtube/pgJWP_xqO1g)](https://www.youtube.com/watch?v=pgJWP_xqO1g "Watch the workshop video") Welcome to this hands-on workshop, where you'll learn to build efficient and scalable data ingestion pipelines. ### **What will you learn in this workshop?** In this workshop, you’ll learn the core skills required to build and manage data pipelines: - **How to build robust, scalable, and self-maintaining pipelines**. - **Best practices**, like built-in data governance, for ensuring clean and reliable data flows. - **Incremental loading techniques** to refresh data quickly and cost-effectively. 
- **How to build a Data Lake** with dlt.

By the end of this workshop, you'll be able to build data pipelines like a senior data engineer: quickly, concisely, and with best practices baked in.

---

## 📂 Navigation & Resources

- Workshop:
    - [Workshop content](data_ingestion_workshop.md).
    - [Workshop Colab Notebook](https://colab.research.google.com/drive/1FiAHNFenM8RyptyTPtDTfqPCi5W6KX_V?usp=sharing).
- Homework:
    - [Homework Markdown](dlt_homework.md).
    - [Homework Colab Notebook](https://colab.research.google.com/drive/1plqdl33K_HkVx0E0nGJrrkEUssStQsW7).
- 🌐 [Official dlt Documentation](https://dlthub.com/docs/intro).
- 💬 Join our [Slack Community](https://dlthub.com/community).

---

## 📖 Course overview

This workshop is structured into three key parts:

1️⃣ **[Extracting Data](data_ingestion_workshop.md#extracting-data)** – Learn scalable data extraction techniques.

2️⃣ **[Normalizing Data](data_ingestion_workshop.md#normalizing-data)** – Clean and structure data before loading.

3️⃣ **[Loading & Incremental Updates](data_ingestion_workshop.md#loading-data)** – Efficiently load and update data.

📌 **Find the full course file here**: [Course File](data_ingestion_workshop.md)

---

## 👩‍🏫 Teacher

Welcome to the data ingestion workshop of the DataTalks.Club Data Engineering Zoomcamp!

I'm Violetta Mishechkina, Solutions Engineer at dltHub. 👋

- I’ve been working in the data field since 2018, with a background in machine learning.
- I started as a Data Scientist, training ML models and neural networks.
- Over time, I realized that in production, achieving the lowest RMSE isn’t as important as model size, infrastructure, and data quality - so I transitioned into MLOps.
- A year ago, I joined dltHub’s Customer Success team and discovered dlt, a Python library that automates 90% of tedious data engineering tasks.
- Now, I work closely with customers and partners to help them integrate and optimize dlt in production.
- I also collaborate with our development team as the voice of the customer, ensuring our product meets real-world data engineering needs.
- My experience across ML, MLOps, and data engineering gives me a practical, hands-on perspective on solving data challenges.

---

## Homework

- [Homework Markdown](dlt_homework.md).
- [Homework Colab Notebook](https://colab.research.google.com/drive/1plqdl33K_HkVx0E0nGJrrkEUssStQsW7).

---

## Next steps

As you are learning the various concepts of data engineering, consider creating a portfolio project that will further your own knowledge.

By demonstrating the ability to deliver end to end, you will have an easier time finding your first role. This will help regardless of whether your hiring manager reviews your project, largely because you will have a better understanding and will be able to talk the talk.

Here are some example projects that others did with dlt:

- Serverless dlt-dbt on cloud functions: [Article](https://docs.getdbt.com/blog/serverless-dlt-dbt-stack)
- Bird finder: [Part 1](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-i), [Part 2](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-ii)
- Event ingestion on GCP: [Article and repo](https://dlthub.com/docs/blog/streaming-pubsub-json-gcp)
- Event ingestion on AWS: [Article and repo](https://dlthub.com/docs/blog/dlt-aws-taktile-blog)
- Or see one of the many demos created by our working students: [Hacker news](https://dlthub.com/docs/blog/hacker-news-gpt-4-dashboard-demo), [GA4 events](https://dlthub.com/docs/blog/ga4-internal-dashboard-demo), [an E-Commerce](https://dlthub.com/docs/blog/postgresql-bigquery-metabase-demo), [Google Sheets](https://dlthub.com/docs/blog/google-sheets-to-data-warehouse-pipeline), [Motherduck](https://dlthub.com/docs/blog/dlt-motherduck-demo), [MongoDB + Holistics](https://dlthub.com/docs/blog/MongoDB-dlt-Holistics), [Deepnote](https://dlthub.com/docs/blog/deepnote-women-wellness-violence-tends),
[Prefect](https://dlthub.com/docs/blog/dlt-prefect), [PowerBI vs GoodData vs Metabase](https://dlthub.com/docs/blog/semantic-modeling-tools-comparison), [Dagster](https://dlthub.com/docs/blog/dlt-dagster), [Ingesting events via gcp webhooks](https://dlthub.com/docs/blog/dlt-webhooks-on-cloud-functions-for-event-capture), [SAP to snowflake replication](https://dlthub.com/docs/blog/sap-hana-to-snowflake-demo-blog), [Read emails and send summary to slack with AI and Kestra](https://dlthub.com/docs/blog/dlt-kestra-demo-blog), [Mode +dlt capabilities](https://dlthub.com/docs/blog/dlt-mode-blog), [dbt on cloud functions](https://dlthub.com/docs/blog/dlt-dbt-runner-on-cloud-functions)
- If you want to use dlt in your project, [check this list of public APIs](https://dlthub.com/docs/blog/practice-api-sources)

If you create a personal project, consider submitting it to our blog - we will be happy to showcase it. Just drop us a line in the dlt Slack.

## **💛 If you enjoy dlt, support us!**

* ⭐ **Give us a [GitHub Star](https://github.com/dlt-hub/dlt)!**
* 💬 **Join our [Slack Community](https://dlthub.com/community)!**
* 🚀 **Let’s build great data pipelines together!**

---

# Community notes

Did you take notes? You can share them by creating a PR to this file!

* [Ingest Data to GCS by dlt from peatwan](https://github.com/peatwan/de-zoomcamp/tree/main/workshop/dlt/homework/load_to_gcs)
* Add your notes above this line


================================================
FILE: cohorts/2025/workshops/dlt/data_ingestion_workshop.md
================================================
# Data ingestion with dlt

* Sign up: https://lu.ma/quyfn4q8 (optional)
* Homework: [dlt_homework.md](dlt_homework.md)

## **What is data ingestion?**

Data ingestion is the process of **extracting** data from a source, transporting it to a suitable environment, and preparing it for use. This often includes **normalizing**, **cleaning**, and **adding metadata**.
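
As a rough illustration of normalizing, cleaning, and adding metadata, the sketch below processes a single raw record in plain Python. The function name `ingest_record` and the metadata keys `_source` and `_loaded_at` are assumptions made up for this example; they are not part of dlt or any other library.

```python
from datetime import datetime, timezone


def ingest_record(raw: dict, source: str) -> dict:
    """Normalize, clean, and annotate one raw record (illustrative only)."""
    record = {}
    for key, value in raw.items():
        # Normalize: consistent lowercase field names without stray spaces.
        name = key.strip().lower()

        # Clean: trim strings, map empty strings to None, parse numbers.
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None
            else:
                try:
                    value = float(value)
                except ValueError:
                    pass  # not numeric, keep the trimmed string
        record[name] = value

    # Add metadata (hypothetical field names): origin and load time.
    record["_source"] = source
    record["_loaded_at"] = datetime.now(timezone.utc).isoformat()
    return record


raw = {" Trip_Distance ": " 5.88", "Store_Flag": "", "Zone": "SoHo"}
print(ingest_record(raw, source="demo_api"))
```

The printed record has lowercase keys, a numeric `trip_distance`, `None` for the empty flag, and the two extra metadata fields.
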
---

### **“A wild dataset Magically appears!”**

In many data science teams, data seems to appear out of nowhere — because an engineer loads it.

For example, the well-known **NYC Taxi dataset** looks well-structured and ready to use, making it easy to query and analyze. However, not all datasets arrive in such a clean format.

- **Well-structured data** (with an explicit schema) can be used immediately.
  - Examples: Parquet, Avro, or database tables where data types and structures are predefined.
- **Unstructured or weakly typed data** (without a defined schema) often needs cleaning and formatting first.
  - Examples: CSV, JSON, where fields might be inconsistent, nested or missing key details.

💡 **What is a schema?** A schema defines the expected format and structure of data, including field names, data types, and relationships.

---

### **Be the Magician! 😎**

Since you're here to learn data engineering, **you** will be the one making datasets magically appear!

To build effective pipelines, you need to master:

✅ **Extracting** data from various sources (APIs, databases, files).

✅ **Normalizing** data by transforming, cleaning, and defining schemas.

✅ **Loading** data where it can be used (data warehouse, lake, or database).

---

### **Why are data pipelines so amazing?**

Data pipelines are the backbone of modern data-driven organizations, transforming raw, scattered data into actionable insights. They ensure data flows seamlessly from its source to its final destination, where it can drive decision-making, analytics, and innovation. But pipelines don’t just move data — they enable an entire ecosystem of functionality that makes them indispensable.

![pipes](img/pipes.jpg)

### **What makes data pipelines so essential?**

1. **Collect**: Data pipelines gather information from a variety of sources, such as databases, data streams, and applications. This ensures no data is overlooked.
- Example: Retrieving sales data from an online store or capturing user activity logs from an app. 2. **Ingest**: The collected data flows into an event queue, where it’s organized and prepared for the next steps. - **Structured data** (like Parquet files or database tables) can be processed immediately. - **Unstructured data** (like CSV or JSON files) often needs cleaning and normalization. - Example: Cleaning a JSON response by standardizing its fields or formatting dates in a CSV file. 3. **Store**: Pipelines send the processed data to **data lakes**, **data warehouses**, or **data lakehouses** for efficient storage and easy access. - Example: Storing marketing campaign data in a data warehouse to analyze its performance. 4. **Compute**: Data is processed either in **batches** (large chunks) or as **streams** (real-time updates) to make it ready for analysis. - Example: Calculating monthly revenue or processing live stock market data. 5. **Consume**: Finally, the prepared data is delivered to users in forms they can act on: - **Dashboards** for executives and analysts. - **Self-service analytics tools** for teams exploring trends. - **Machine learning models** for predictions and automation. --- ### **Why are data engineers so important in this process?** Data engineers are the architects behind these pipelines. They don’t just build pipelines—they make sure they’re reliable, efficient, and scalable. Beyond pipeline development, data engineers: - **Optimize data storage** to keep costs low and performance high. - **Ensure data quality and integrity**, addressing duplicates, inconsistencies, and missing values. - **Implement governance** for secure, compliant, and well-managed data. - **Adapt data architectures** to meet the changing needs of the organization. Ultimately, their role is to strategically manage the entire **data lifecycle**, from collection to consumption. 
--- ### **What will you learn in this workshop?** In this workshop, you’ll learn the core skills required to build and manage data pipelines: - **How to build robust, scalable, and self-maintaining pipelines**. - **Best practices**, like built-in data governance, for ensuring clean and reliable data flows. - **Incremental loading techniques** to refresh data quickly and cost-effectively. - **How to build a Data Lake** with dlt. By the end, you’ll not only understand why data pipelines are amazing, but you’ll also know how to create them with best practices to power your organization’s data-driven success.🚀 --- ## **Extracting data** Most of the data you’ll work with is stored behind an **API**, which is like a doorway to the data. Here are the most common types: - **RESTful APIs**: Provide records of data from business applications. - Example: Getting a list of customers from a CRM system. - **File-based APIs**: Return secure file paths to bulk data like JSON or Parquet files stored in buckets. - Example: Downloading monthly sales reports. - **Database APIs**: Connect to databases like MongoDB or SQL, often returning data as JSON, the most common interchange format. As an engineer, you will need to build pipelines that “just work”. So here’s what you need to consider on extraction, to prevent the pipelines from breaking, and to keep them running smoothly: 1. **Hardware limits**: Be mindful of memory (RAM) and storage (disk space). Overloading these can crash your system. 2. **Network reliability**: Networks can fail! Always account for retries to make your pipelines more robust. - Tip: Use libraries like `dlt` that have built-in retry mechanisms. 3. **API rate limits**: APIs often restrict the number of requests you can make in a given time. - Tip: Check the API documentation to understand its limits (e.g., [Zendesk](https://developer.zendesk.com/api-reference/introduction/rate-limits/), [Shopify](https://shopify.dev/docs/api/usage/rate-limits)). 
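The retry and rate-limit points above can be sketched in plain Python. This is only an illustration, not how `dlt` implements retries: the helper names `wait_seconds` and `call_with_retries` are hypothetical, and the request is modelled as any callable returning `(status, headers, body)` so the logic can be tried without a real API.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape for this sketch: (status code, headers, body)
Response = Tuple[int, Mapping[str, str], object]


def wait_seconds(headers: Mapping[str, str], attempt: int, base_delay: float = 1.0) -> float:
    """Decide how long to pause before the next attempt."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            # The server told us how many seconds to wait
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # header holds an HTTP date instead; fall back to backoff
    # Exponential backoff: 1s, 2s, 4s, ... for attempt 0, 1, 2, ...
    return base_delay * (2 ** attempt)


def call_with_retries(
    make_request: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Retry on rate limiting (429) and server errors (5xx); fail fast otherwise."""
    for attempt in range(max_attempts):
        status, headers, body = make_request()
        if 200 <= status < 300:
            return body
        retryable = status == 429 or status >= 500
        if not retryable or attempt == max_attempts - 1:
            raise RuntimeError(f"request failed with status {status}")
        sleep(wait_seconds(headers, attempt))
    raise RuntimeError("max_attempts must be at least 1")


# Demo with canned responses instead of a real API (no actual sleeping)
canned = iter([(429, {"Retry-After": "2"}, None), (200, {}, {"rows": 3})])
print(call_with_retries(lambda: next(canned), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter is a design choice that keeps the sketch testable; in a real pipeline you would leave the default `time.sleep`.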
There are even more challenges to consider when working with APIs — such as **pagination and authentication**. Let’s explore how to handle these effectively when working with **REST APIs**. ### **Working with REST APIs** REST APIs (Representational State Transfer APIs) are one of the most common ways to extract data. They allow you to retrieve structured data using simple HTTP requests. However, working with APIs comes with its own challenges. #### **Common Challenges** ![rest_api](img/Rest_API.png) #### **1. Rate limits** Many APIs **limit the number of requests** you can make within a certain time frame to prevent overloading their servers. If you exceed this limit, the API may **reject your requests** temporarily or even block you for a period. To avoid hitting these limits, we can: - **Monitor API rate limits** – Some APIs provide headers that tell you how many requests you have left. - **Pause requests when needed** – If we're close to the limit, we wait before making more requests. - **Implement automatic retries** – If a request fails due to rate limiting, we can wait and retry after some time. 💡Some APIs provide a **retry-after** header, which tells you how long to wait before making another request. Always check the API documentation for best practices! --- #### **2. Authentication** Many APIs require an **API key or token** to access data securely. Without authentication, requests may be limited or denied. 🔐 **Types of Authentication in APIs:** - **API Keys** – A simple token included in the request header or URL. - **OAuth Tokens** – A more secure authentication method requiring user authorization. - **Basic Authentication** – Using a username and password (less common today). 💡 Never share your API token publicly! Store it in environment variables or use a secure secrets manager. ---- #### **3. Pagination** Many APIs return data in **chunks (or pages)** rather than sending everything at once. 
This prevents **overloading the server** and improves performance, especially for large datasets. To retrieve **all the data**, we need to make multiple requests and keep track of pages until we reach the last one. 📌 Example: >In this example, we’ll request data from an API that serves the **NYC taxi dataset**. For these purposes we created an API that can serve the data you are already familiar with. The API returns **1,000 records per page**, and we must request multiple pages to retrieve the full dataset. ```py import requests BASE_API_URL = "https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api" page_number = 1 while True: params = {'page': page_number} response = requests.get(BASE_API_URL, params=params) page_data = response.json() if not page_data: break print(page_data) page_number += 1 # limit the number of pages for testing if page_number > 2: break ``` What happens here: - Starts at page 1 and makes a GET request to the API. - Retrieves JSON data and checks if the page contains records. - If data exists, prints it and moves to the next page. - If the page is empty, stops requesting more data. 💡 Different APIs handle pagination differently (some use offsets, cursors, or tokens instead of page numbers). Always check the API documentation for the correct method! --- #### **4. Avoiding memory issues during extraction** To prevent your pipeline from crashing, you need to control memory usage. #### **Challenges with memory** - Many pipelines run on systems with limited memory, like serverless functions or shared clusters. - If you try to load all the data into memory at once, it can crash the entire system. - Even disk space can become an issue if you’re storing large amounts of data. #### **The solution: streaming data** **Streaming** means processing data in small chunks or events, rather than loading everything at once. This keeps memory usage low and ensures your pipeline remains efficient. 
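A minimal sketch of the streaming idea, assuming the source hands over data one page at a time: each page is written to a local JSON Lines file and then dropped, so only one page is held in memory. The `fake_pages` generator is a hypothetical stand-in for a paginated API, and `stream_to_jsonl` is an illustrative helper name.

```python
import json
from typing import Iterable, Iterator, List


def fake_pages() -> Iterator[List[dict]]:
    """Hypothetical stand-in for a paginated API: yields one page at a time."""
    for page_number in range(1, 4):
        yield [{"page": page_number, "row": i} for i in range(2)]


def stream_to_jsonl(pages: Iterable[List[dict]], path: str) -> int:
    """Write records to disk page by page; return how many were written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for page in pages:          # only the current page lives in memory
            for record in page:
                f.write(json.dumps(record) + "\n")
                written += 1
    return written
```

Calling `stream_to_jsonl(fake_pages(), "rides.jsonl")` writes the file incrementally; memory use depends on the page size, not on the total size of the dataset.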
As a data engineer, you’ll use streaming to transfer data between buffers, such as: - from APIs to local files; - from Webhooks to event queues; - from Event queues (like Kafka) to storage buckets. --- ### **Example of extracting data: Grabbing data from an API** In this example, we’ll request data from an API that serves the **NYC taxi dataset**. For these purposes we created an API that can serve the data you are already familiar with. #### **API documentation**: - **Data**: Comes in pages of 1,000 records. - **Pagination**: When there’s no more data, the API returns an empty page. - **Details**: - **Method**: GET - **URL**: `https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api` - **Parameters**: - `page`: Integer (page number), defaults to 1. Here’s how we design our requester: 1. **Request page by page** until we hit an empty page. Since we don’t know how much data is behind the API, we must assume it could be as little as 1,000 records or as much as 10GB. 2. **Use a generator** to handle this efficiently and avoid loading all data into memory. ```py import requests BASE_API_URL = "https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api" def paginated_getter(): page_number = 1 while True: params = {'page': page_number} response = requests.get(BASE_API_URL, params=params) response.raise_for_status() page_json = response.json() print(f'Got page {page_number} with {len(page_json)} records') if page_json: yield page_json page_number += 1 else: break for page_data in paginated_getter(): print(page_data) ``` In this approach to grabbing data from APIs, there are both pros and cons: ✅ Pros: **Easy memory management** since the API returns data in small pages or events. ❌ Cons: **Low throughput** because data transfer is limited by API constraints (rate limits, response time). 
To simplify data extraction, use specialized tools that follow best practices like streaming — for example, [dlt (data load tool)](https://dlthub.com). It efficiently processes data while **keeping memory usage low** and **leveraging parallelism** for better performance. ### **Extracting data with dlt** Extracting data from APIs manually requires handling - **pagination**, - **rate limits**, - **authentication**, - **errors**. Instead of writing custom scripts, **[dlt](https://dlthub.com/)** simplifies the process with a built-in **[REST API Client](https://dlthub.com/docs/general-usage/http/rest-client)**, making extraction **efficient, scalable, and reliable**. --- ### **Why use dlt for extraction?** ✅ **Built-in REST API support** – Extract data from APIs with minimal code. ✅ **Automatic pagination handling** – No need to loop through pages manually. ✅ **Manages Rate Limits & Retries** – Prevents exceeding API limits and handles failures. ✅ **Streaming support** – Extracts and processes data without loading everything into memory. ✅ **Seamless integration** – Works with **normalization and loading** in a single pipeline. ![dlt](img/dlt.png) ### **Install dlt** [Install](https://dlthub.com/docs/reference/installation) dlt with DuckDB as destination: ```shell pip install dlt[duckdb] ``` ### **Example of extracting data with dlt** Instead of manually writing pagination logic, let’s use **dlt’s [`RESTClient` helper](https://dlthub.com/docs/general-usage/http/rest-client)** to extract NYC taxi ride data: ```py import dlt from dlt.sources.helpers.rest_client import RESTClient from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator def paginated_getter(): client = RESTClient( base_url="https://us-central1-dlthub-analytics.cloudfunctions.net", # Define pagination strategy - page-based pagination paginator=PageNumberPaginator( # <--- Pages are numbered (1, 2, 3, ...) 
base_page=1, # <--- Start from page 1 total_path=None # <--- No total count of pages provided by API, pagination should stop when a page contains no result items ) ) for page in client.paginate("data_engineering_zoomcamp_api"): # <--- API endpoint for retrieving taxi ride data yield page # remember about memory management and yield data for page_data in paginated_getter(): print(page_data) ``` **How dlt simplifies API extraction:** 🔹 **No manual pagination** – dlt **automatically** fetches **all pages** of data. 🔹 **Low memory usage** – Streams data **chunk by chunk**, avoiding RAM overflows. 🔹 **Handles rate limits & retries** – Ensures requests are sent efficiently **without failures**. 🔹 **Flexible destination support** – Load extracted data into **databases, warehouses, or data lakes**. --- Well, you’ve successfully **extracted** the data — great! 🎉 But raw data isn’t always ready to use. Now, you need to **process**, **clean**, and **structure** it before it can be loaded into a data lake or data warehouse. ## **Normalizing data** You often hear that data professionals spend most of their time **“cleaning” data** — but what does that actually mean? Data cleaning typically involves two key steps: 1. **Normalizing data** – Structuring and standardizing data **without changing its meaning**. 2. **Filtering data for a specific use case** – Selecting or modifying data **in a way that changes its meaning** to fit the analysis. ### **Data cleaning: more than just fixing errors** A big part of **data cleaning** is actually **metadata work** — ensuring data is structured and standardized so it can be used effectively. #### **Metadata tasks in data cleaning:** ✅ **Add types** – Convert strings to numbers, timestamps, etc. ✅ **Rename columns** – Ensure names follow a standard format (e.g., no special characters). ✅ **Flatten nested dictionaries** – Bring values from nested dictionaries into the top-level row. 
✅ **Unnest lists/arrays** – Convert lists into **child tables** since they can’t be stored directly in a flat format. 👉 **We’ll look at a practical example next, as these concepts are easier to understand with real data.** --- ### **Why prepare data? Why not use JSON directly?** While JSON is a great format for **data transfer**, it’s not ideal for analysis. Here’s why: ❌ **No enforced schema** – We don’t always know what fields exist in a JSON document. ❌ **Inconsistent data types** – A field like `age` might appear as `25`, `"twenty five"`, or `25.00`, which can break downstream applications. ❌ **Hard to process** – If we need to group data by day, we must manually convert date strings to timestamps. ❌ **Memory-heavy** – JSON requires reading the entire file into memory, unlike databases or columnar formats that allow scanning just the necessary fields. ❌ **Slow for aggregation and search** – JSON is not optimized for quick lookups or aggregations like columnar formats (e.g., Parquet). JSON is great for **data exchange** but **not for direct analytical use**. To make data useful, we need to **normalize it** — flattening, typing, and structuring it for efficiency. 
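To make "flattening" and "typing" concrete, here is a hand-written sketch of both steps. It is an assumption-laden illustration rather than dlt's implementation: the function names `flatten` and `add_types`, the `__` separator, and the sample record are chosen only for this example. Lists are left untouched, since they would go to a child table.

```python
from datetime import datetime
from typing import Iterable


def flatten(record: dict, parent_key: str = "", sep: str = "__") -> dict:
    """Bring values from nested dictionaries into a single flat row."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value  # scalars and lists are kept as they are
    return flat


def add_types(row: dict, timestamp_columns: Iterable[str]) -> dict:
    """Convert the named string columns into datetime objects."""
    typed = dict(row)
    for column in timestamp_columns:
        typed[column] = datetime.strptime(typed[column], "%Y-%m-%d %H:%M:%S")
    return typed


raw = {
    "vendor_name": "VTS",
    "time": {"pickup": "2009-06-14 23:23:00"},
    "coordinates": {"start": {"lat": 40.641525}},
    "passengers": [{"name": "John"}],
}
row = add_types(flatten(raw), ["time__pickup"])
print(row)
```

Every new source shape would need its own list of timestamp columns and its own handling of lists, which is the kind of manual bookkeeping a normalization tool removes.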
--- ### **Normalization example** To understand what we’re working with, let’s look at a sample record from our API: ```py item = page_data[0] item ``` Output: ```json {'End_Lat': 40.742963, 'End_Lon': -73.980072, 'Fare_Amt': 45.0, 'Passenger_Count': 1, 'Payment_Type': 'Credit', 'Rate_Code': None, 'Start_Lat': 40.641525, 'Start_Lon': -73.787442, 'Tip_Amt': 9.0, 'Tolls_Amt': 4.15, 'Total_Amt': 58.15, 'Trip_Distance': 17.52, 'Trip_Dropoff_DateTime': '2009-06-14 23:48:00', 'Trip_Pickup_DateTime': '2009-06-14 23:23:00', 'mta_tax': None, 'store_and_forward': None, 'surcharge': 0.0, 'vendor_name': 'VTS'} ``` The data we retrieved from the API has **already been processed and unnested**, meaning that any **nested structures** (like dictionaries and lists) have been flattened, making it easier to store and query in a database or a dataframe. However, let’s imagine we originally received the **raw data** in a more complex format. --- ### **How was this data processed?** Before reaching this format, the raw data likely contained **nested structures** that had to be **flattened and transformed**. 1️⃣ **Flattened nested coordinates:** - Originally, the latitude and longitude values might have been nested like this: ```json "coordinates": { "start": {"lat": 40.641525, "lon": -73.787442}, "end": {"lat": 40.742963, "lon": -73.980072} } ``` - These were **flattened** into `Start_Lat`, `Start_Lon`, `End_Lat`, and `End_Lon`. 
2️⃣ **Converted timestamps:** - Originally, timestamps might have been stored as Unix timestamps or separate date/time fields: ```json "Trip_Pickup": {"date": "2009-06-14", "time": "23:23:00"} ``` - Now, they are **formatted as ISO datetime strings**: ```json "Trip_Pickup_DateTime": "2009-06-14 23:23:00" ``` 3️⃣ **Unnested passenger & payment information:** - The original structure might have included a nested list for passengers: ```json "passengers": [ {"name": "John", "rating": 4.9}, {"name": "Jack", "rating": 3.9} ] ``` - Since lists **cannot be stored directly in a database table**, they were likely **moved to a separate table**. 💡 **However, real-world data is rarely this clean!** We often receive raw, nested, and inconsistent data. This is why the **normalization process** is so important—it **prepares** the data for efficient storage and analysis. **[dlt (data load tool)](https://dlthub.com/docs/intro)** simplifies the **normalization process**, automatically transforming raw data into a **structured, clean format** that is ready for storage and analysis. --- ### **Normalizing data with dlt** **Why use dlt for normalization?** ✅ **Automatically detects schema** – No need to define column types manually. ✅ **Flattens nested JSON** – Converts complex structures into table-ready formats. ✅ **Handles data type conversion** – Converts dates, numbers, and booleans correctly. ✅ **Splits lists into child tables** – Ensures relational integrity for better analysis. ✅ **Schema evolution support** – Adapts to changes in data structure over time. 
--- ### **Example** Let's assume we extracted the following raw NYC taxi ride data, which contains **nested dictionaries** and **lists**: ```py data = [ { "vendor_name": "VTS", "record_hash": "b00361a396177a9cb410ff61f20015ad", "time": { "pickup": "2009-06-14 23:23:00", "dropoff": "2009-06-14 23:48:00" }, "coordinates": { "start": {"lon": -73.787442, "lat": 40.641525}, "end": {"lon": -73.980072, "lat": 40.742963} }, "passengers": [ {"name": "John", "rating": 4.9}, {"name": "Jack", "rating": 3.9} ] } ] ``` ### **How dlt normalizes this data automatically** Instead of manually flattening fields and extracting nested lists, we can **load it directly into dlt**: ```py import dlt # Define a dlt pipeline with automatic normalization pipeline = dlt.pipeline( pipeline_name="ny_taxi_data", destination="duckdb", dataset_name="taxi_rides", ) # Run the pipeline with raw nested data info = pipeline.run(data, table_name="rides", write_disposition="replace") # Print the load summary print(info) print(pipeline.last_trace) ``` --- ### **What happens behind the scenes?** After running this pipeline, dlt automatically **transforms the data** into the following **normalized structure**: **Main table: `rides`** ```py pipeline.dataset(dataset_type="default").rides.df() ``` | vendor_name | record_hash | time__pickup | time__dropoff | coordinates__start__lon | coordinates__start__lat | coordinates__end__lon | coordinates__end__lat | _dlt_load_id | _dlt_id | |-------------|------------------------------------|---------------------------|---------------------------|-------------------------|-------------------------|-----------------------|-----------------------|-------------------|---------------| | VTS | b00361a396177a9cb410ff61f20015ad | 2009-06-14 23:23:00+00:00 | 2009-06-14 23:48:00+00:00 | -73.787442 | 40.641525 | -73.980072 | 40.742963 | 1738604244.2625916 | k+bnoLuti245ag | This table **displays structured taxi ride data**, including **vendor details, timestamps, coordinates, and 
dlt metadata**.

**Child Table: `rides__passengers`**

```py
pipeline.dataset(dataset_type="default").rides__passengers.df()
```

| name  | rating | _dlt_parent_id   | _dlt_list_idx | _dlt_id        |
|-------|--------|------------------|---------------|----------------|
| John  | 4.9    | k+bnoLuti245ag   | 0             | 8ppDh+8gQ7SSHg |
| Jack  | 3.9    | k+bnoLuti245ag   | 1             | oQnWuvkgHhxlaA |

✅ **Nested structures were flattened** into separate columns.

✅ **Lists were extracted into child tables**, preserving relationships.

✅ **Timestamps were converted to the correct format.**

---

### **Why dlt makes normalization easy**

🔹 **No manual transformations needed** – Just load the raw data, and dlt does the rest!

🔹 **Database-ready format** – Ensures clean, structured tables for easy querying.

🔹 **Handles schema evolution** – Adapts to new fields automatically.

🔹 **Scales effortlessly** – Works for small datasets and enterprise-scale pipelines.

💡 With dlt, normalization happens automatically, so you can focus on insights instead of data wrangling.

---

## **Loading data**

Now that we’ve covered **extracting** and **normalizing** data, the final step is **loading** the data **into a destination**. This is where the processed data is stored, making it ready for querying, analysis, or further transformations.

### **How data loading happens without dlt**

Before dlt, data engineers had to manually handle **schema validation, batch processing, error handling, and retries** for every destination. This process becomes especially complex when loading data into **data warehouses and data lakes**, where performance optimization, partitioning, and incremental updates are critical.

### **Example: Loading data into database without dlt**

A basic pipeline requires:

1. Setting up a database connection.
2. Creating tables and defining schemas.
3. Handling schema changes manually.
4. Writing queries to insert/update data.

```py
import duckdb

# 1. Create a connection to a file-backed DuckDB database
conn = duckdb.connect("ny_taxi_manual.db")

# 2. Create the rides Table
# Since our dataset has nested structures, we must manually flatten it before inserting data.
conn.execute("""
CREATE TABLE IF NOT EXISTS rides (
    record_hash TEXT PRIMARY KEY,
    vendor_name TEXT,
    pickup_time TIMESTAMP,
    dropoff_time TIMESTAMP,
    start_lon DOUBLE,
    start_lat DOUBLE,
    end_lon DOUBLE,
    end_lat DOUBLE
);
""")

# 3. Insert Data Manually
# Since JSON data has nested fields, we need to extract and transform them before inserting them into DuckDB.
data = [
    {
        "vendor_name": "VTS",
        "record_hash": "b00361a396177a9cb410ff61f20015ad",
        "time": {
            "pickup": "2009-06-14 23:23:00",
            "dropoff": "2009-06-14 23:48:00"
        },
        "coordinates": {
            "start": {"lon": -73.787442, "lat": 40.641525},
            "end": {"lon": -73.980072, "lat": 40.742963}
        }
    }
]

# Prepare data for insertion
flattened_data = [
    (
        ride["record_hash"],
        ride["vendor_name"],
        ride["time"]["pickup"],
        ride["time"]["dropoff"],
        ride["coordinates"]["start"]["lon"],
        ride["coordinates"]["start"]["lat"],
        ride["coordinates"]["end"]["lon"],
        ride["coordinates"]["end"]["lat"]
    )
    for ride in data
]

# Insert into DuckDB
conn.executemany("""
INSERT INTO rides (record_hash, vendor_name, pickup_time, dropoff_time, start_lon, start_lat, end_lon, end_lat)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", flattened_data)

print("Data successfully loaded into DuckDB!")

# 4. Query Data in DuckDB
# Now that the data is loaded, we can query it using DuckDB’s SQL engine.
df = conn.execute("SELECT * FROM rides").df()

conn.close()
```

Problems without dlt:

❌ **Schema management is manual** – If the schema changes, you need to update table structures manually.

❌ **No automatic retries** – If the network fails, data may be lost.

❌ **No incremental loading** – Every run reloads everything, making it slow and expensive.

❌ **More code to maintain** – A simple pipeline quickly becomes complex.
---

### **How dlt handles the load step automatically**

With dlt, loading data **requires just a few lines of code** — schema inference, error handling, and incremental updates are all handled automatically!

### **Why use dlt for loading?**

✅ **Supports multiple destinations** – Load data into **BigQuery, Redshift, Snowflake, Postgres, DuckDB, Parquet (S3, GCS)** and more.

✅ **Optimized for performance** – Uses **batch loading, parallelism, and streaming** for fast and scalable data transfer.

✅ **Schema-aware** – Ensures that **column names, data types, and structures match** the destination’s requirements.

✅ **Incremental loading** – Avoids unnecessary reloading by **only inserting new or updated records**.

✅ **Resilience & retries** – Automatically handles failures, ensuring data is loaded **without missing records**.

![dlt](img/dlt.png)

### **Example: Loading data into database with dlt**

To use the full power of dlt, it is better to wrap our API client in the `@dlt.resource` decorator, which denotes a logical grouping of data within a data source, typically holding data of similar structure and origin:

```py
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator


# Define the API resource for NYC taxi data
@dlt.resource(name="rides")   # <--- The name of the resource (will be used as the table name)
def ny_taxi():
    client = RESTClient(
        base_url="https://us-central1-dlthub-analytics.cloudfunctions.net",
        paginator=PageNumberPaginator(
            base_page=1,
            total_path=None
        )
    )

    for page in client.paginate("data_engineering_zoomcamp_api"):    # <--- API endpoint for retrieving taxi ride data
        yield page   # <--- yield data to manage memory


# define new dlt pipeline
pipeline = dlt.pipeline(destination="duckdb")

# run the pipeline with the new resource
load_info = pipeline.run(ny_taxi, write_disposition="replace")
print(load_info)

# explore loaded data
pipeline.dataset(dataset_type="default").rides.df()
```
**Done!** The data is now stored in **DuckDB**, with schema managed automatically! --- ### **Incremental Loading** Incremental loading allows us to update datasets by **loading only new or changed data**, instead of replacing the entire dataset. This makes pipelines **faster and more cost-effective** by reducing redundant data processing. ### **How does incremental loading work?** Incremental loading works alongside two key concepts: - **Incremental extraction** – Only extracts the new or modified data rather than retrieving everything again. - **State tracking** – Keeps track of what has already been loaded, ensuring that only new data is processed. In dlt, **state** is stored in a **separate table** at the destination, allowing pipelines to track what has been processed. 🔹 **Want to learn more?** You can read about incremental extraction and state management in the [dlt documentation](https://dlthub.com/docs). --- ### **Incremental loading methods in dlt** dlt provides two ways to load data incrementally: #### **1. Append (adding new records)** - Best for **immutable or stateless data**, such as taxi ride records. - Each run **adds new records** without modifying previous data. - Can also be used to create a **history of changes** (slowly changing dimensions). **Example:** - If taxi ride data is loaded daily, only **new rides** are added, rather than reloading the full history. - If tracking changes in a list of vehicles, **each version** is stored as a new row for auditing. --- #### **2. Merge (updating existing records)** - Best for **updating existing records** (stateful data). - Replaces old records with updated ones based on a **unique key**. - Useful for tracking **status changes**, such as payment updates. **Example:** - A taxi ride's **payment status** could change from `"booked"` to `"cancelled"`, requiring an update. - A **customer profile** might be updated with a new email or phone number. 
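The difference between the two methods can be shown with a small sketch that treats a table as a list of rows. This only illustrates the semantics and is not how dlt executes loads at a destination; the function names, the `ride_id` key, and the sample rows are assumptions made for the example.

```python
from typing import List


def append(table: List[dict], new_rows: List[dict]) -> List[dict]:
    """Append: keep everything already loaded and add the new rows."""
    return table + new_rows


def merge(table: List[dict], new_rows: List[dict], key: str) -> List[dict]:
    """Merge: a new row replaces the existing row that has the same key."""
    by_key = {row[key]: row for row in table}
    for row in new_rows:
        by_key[row[key]] = row
    return list(by_key.values())


loaded = [{"ride_id": 1, "status": "booked"}]
update = [
    {"ride_id": 1, "status": "cancelled"},  # changed record
    {"ride_id": 2, "status": "booked"},     # brand-new record
]

print(append(loaded, update))            # ride 1 appears twice: full history
print(merge(loaded, update, "ride_id"))  # ride 1 appears once: latest state
```

In dlt these two behaviours are selected with `write_disposition="append"` or `write_disposition="merge"` on a resource, and merge additionally needs a key such as `primary_key` so that matching records can be found.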
---

### **Choosing between Append and Merge**

| **Scenario**                                              | **Use Append** | **Use Merge** |
|-----------------------------------------------------------|----------------|---------------|
| Immutable records (e.g., ride history)                    | ✅ Yes         | ❌ No         |
| Tracking historical changes (slowly changing dimensions)  | ✅ Yes         | ❌ No         |
| Updating existing records (e.g., payment status)          | ❌ No          | ✅ Yes        |
| Keeping full change history                               | ✅ Yes         | ❌ No         |

### **Example: Incremental loading with dlt**

**The goal**: download only trips made after June 15, 2009, skipping the old ones.

Using `dlt`, we set up an [incremental filter](https://dlthub.com/docs/general-usage/incremental-loading#incremental-loading-with-a-cursor-field) to only fetch trips made after a certain date:

```python
cursor_date = dlt.sources.incremental("Trip_Dropoff_DateTime", initial_value="2009-06-15")
```

This tells `dlt`:

- **Start date**: June 15, 2009 (`initial_value`).
- **Field to track**: `Trip_Dropoff_DateTime` (our timestamp).

As you run the pipeline repeatedly, `dlt` will keep track of the latest `Trip_Dropoff_DateTime` value processed. It will skip records older than this date in future runs.
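Conceptually, a cursor-based filter combines a comparison with a remembered "last value". The sketch below shows that idea in plain Python; it is not dlt's implementation, the helper name `filter_new` is hypothetical, and the strict greater-than comparison is a simplification (dlt's handling of the boundary value and of duplicates may differ, so check its documentation).

```python
from typing import List, Tuple


def filter_new(records: List[dict], cursor_field: str, last_value: str) -> Tuple[List[dict], str]:
    """Keep records beyond the cursor and return the advanced cursor value."""
    new_records = [r for r in records if r[cursor_field] > last_value]
    # If nothing is new, the cursor stays where it was
    new_last = max((r[cursor_field] for r in new_records), default=last_value)
    return new_records, new_last


# Timestamps in "YYYY-MM-DD HH:MM:SS" form sort chronologically as plain strings
records = [
    {"id": 1, "Trip_Dropoff_DateTime": "2009-06-14 23:48:00"},
    {"id": 2, "Trip_Dropoff_DateTime": "2009-06-15 08:10:00"},
    {"id": 3, "Trip_Dropoff_DateTime": "2009-06-16 12:00:00"},
]

first_run, state = filter_new(records, "Trip_Dropoff_DateTime", "2009-06-15")
second_run, state = filter_new(records, "Trip_Dropoff_DateTime", state)
print(len(first_run), len(second_run), state)
```

The first call keeps only the rides after the start date and advances the state; the second call, fed the same records, finds nothing newer than the stored state.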
Let's make the data resource incremental using `dlt.sources.incremental`:

```py
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator


@dlt.resource(name="rides", write_disposition="append")
def ny_taxi(
    cursor_date=dlt.sources.incremental(
        "Trip_Dropoff_DateTime",   # <--- field to track, our timestamp
        initial_value="2009-06-15",   # <--- start date June 15, 2009
    )
):
    client = RESTClient(
        base_url="https://us-central1-dlthub-analytics.cloudfunctions.net",
        paginator=PageNumberPaginator(
            base_page=1,
            total_path=None
        )
    )

    for page in client.paginate("data_engineering_zoomcamp_api"):
        yield page
```

Finally, we run our pipeline and load the fresh taxi rides data:

```py
# define new dlt pipeline
pipeline = dlt.pipeline(pipeline_name="ny_taxi", destination="duckdb", dataset_name="ny_taxi_data")

# run the pipeline with the new resource
load_info = pipeline.run(ny_taxi)
print(pipeline.last_trace)
```

Only 5325 rows passed the incremental filter and were loaded into the `duckdb` destination. Let's take a look at the earliest date in the loaded data:

```py
with pipeline.sql_client() as client:
    res = client.execute_sql(
        """
        SELECT
        MIN(trip_dropoff_date_time)
        FROM rides;
        """
    )
    print(res)
```

Run the same pipeline again.

```py
# define new dlt pipeline
pipeline = dlt.pipeline(pipeline_name="ny_taxi", destination="duckdb", dataset_name="ny_taxi_data")

# run the pipeline with the new resource
load_info = pipeline.run(ny_taxi)
print(pipeline.last_trace)
```

The pipeline will detect that there are **no new records** based on the `Trip_Dropoff_DateTime` field and the incremental cursor.
As a result, **no new data will be loaded** into the destination: >0 load package(s) were loaded 💡 **With dlt, incremental loading is simple, scalable, and automatic!** --- ### **Example: Loading data into a Data Warehouse (BigQuery)** First, install the dependencies, define the source, then change the destination name and run the pipeline. ```shell pip install dlt[bigquery] ``` Let's use our NY Taxi API and load data from the source into destination. ```py import dlt from dlt.sources.helpers.rest_client import RESTClient from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator @dlt.resource(name="rides", write_disposition="replace") def ny_taxi(): client = RESTClient( base_url="https://us-central1-dlthub-analytics.cloudfunctions.net", paginator=PageNumberPaginator( base_page=1, total_path=None ) ) for page in client.paginate("data_engineering_zoomcamp_api"): yield page ``` **Choosing a destination** Switching between **data warehouses (BigQuery, Snowflake, Redshift)** or **data lakes (S3, Google Cloud Storage, Parquet files)** in dlt is incredibly straightforward — simply modify the `destination` parameter in your pipeline configuration. For example: ```py pipeline = dlt.pipeline( pipeline_name='taxi_data', destination='duckdb', # <--- to test pipeline locally dataset_name='taxi_rides', ) pipeline = dlt.pipeline( pipeline_name='taxi_data', destination='bigquery', # <--- to run pipeline in production dataset_name='taxi_rides', ) ``` This flexibility allows you to easily transition from local development to production-grade environments. > 💡 No need to rewrite your pipeline — dlt adapts automatically! **Set Credentials** The next logical step is to [set credentials](https://dlthub.com/docs/general-usage/credentials/) using **dlt's TOML providers** or **environment variables (ENVs)**. 
```py import os from google.colab import userdata os.environ["DESTINATION__BIGQUERY__CREDENTIALS"] = userdata.get('BIGQUERY_CREDENTIALS') ``` Run the pipeline: ```py pipeline = dlt.pipeline( pipeline_name="taxi_data", destination="bigquery", dataset_name="taxi_rides", dev_mode=True, ) info = pipeline.run(ny_taxi) print(info) ``` 💡 **What’s different?** - **dlt automatically adapts the schema** to fit BigQuery. - **Partitioning & clustering** can be applied for performance optimization. - **Efficient batch loading** ensures scalability. --- ### **Example: Loading data into a Data Lake (Parquet on Local FS or S3)** **Why use a Data Lake?** - **Cost-effective storage** – Cheaper than traditional databases. - **Optimized for big data processing** – Works seamlessly with Spark, Databricks, and Presto. - **Easy scalability** – Store petabytes of data efficiently. The `filesystem` destination enables you to load data into **files stored locally** or in **cloud storage** solutions, making it an excellent choice for lightweight testing, prototyping, or file-based workflows. Below is an **example** demonstrating how to use the `filesystem` destination to load data in **Parquet** format: * Step 1: Set up a local bucket or cloud directory for storing files ```py import os os.environ["BUCKET_URL"] = "/content" ``` * Step 2: Define the data source (above) * Step 3: Run the pipeline ```py import dlt pipeline = dlt.pipeline( pipeline_name='fs_pipeline', destination='filesystem', # <--- change destination to 'filesystem' dataset_name='fs_data', ) load_info = pipeline.run(ny_taxi, loader_file_format="parquet") # <--- choose a file format: parquet, csv or jsonl print(load_info) ``` Look at the files: ```shell ! 
ls fs_data/rides ``` Look at the loaded data: ```py # explore loaded data pipeline.dataset(dataset_type="default").rides.df() ``` #### **Table formats: [Delta tables & Iceberg](https://dlthub.com/docs/dlt-ecosystem/destinations/delta-iceberg)** dlt supports writing **Delta** and **Iceberg** tables when using the `filesystem` destination. **How it works:** dlt uses the `deltalake` and `pyiceberg` libraries to write Delta and Iceberg tables, respectively. One or multiple Parquet files are prepared during the extract and normalize steps. In the load step, these Parquet files are exposed as an Arrow data structure and fed into `deltalake` or `pyiceberg`. ```shell !pip install "dlt[pyiceberg]" ``` ```py pipeline = dlt.pipeline( pipeline_name='fs_pipeline', destination='filesystem', # <--- change destination to 'filesystem' dataset_name='fs_iceberg_data', ) load_info = pipeline.run( ny_taxi, loader_file_format="parquet", table_format="iceberg", # <--- choose a table format: delta or iceberg ) print(load_info) ``` 💡**Note:** Open source version of dlt supports basic functionality for **iceberg**, but the dltHub team is currently working on an **extended** and **more powerful** integration with iceberg. [Join the waiting list to learn more about dlt+ and Iceberg.](https://info.dlthub.com/waiting-list) --- ## **What’s Next?** - **Try loading data into different [destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/)** – Test Postgres, Snowflake, or Parquet. - **Experiment with [incremental loading](https://dlthub.com/docs/general-usage/incremental-loading)** – Load only new records for better efficiency. - **Explore dlt’s [schema evolution](https://dlthub.com/docs/general-usage/schema-evolution)** – Automatically adjust to data structure changes. - **Join our [Slack community](https://dlthub.com/community)** to share your progress! 
With **dlt’s automated load step**, you get **effortless, scalable, and resilient data loading**—so you can focus on insights instead of pipeline maintenance. 🚀 --- ### Extra homework 💻 * [Data ingestion with DLT to Bigquery from Sara Sabater](https://github.com/saraisab/Data_Engineer/blob/main/courses/DE_zoomcamp/Homework/DLT-Workshop/extra_homework/Data_ingestion_with_DLT_to_bigquery.ipynb). ================================================ FILE: cohorts/2025/workshops/dlt/dlt_homework.md ================================================ Original file is located at https://colab.research.google.com/drive/1plqdl33K_HkVx0E0nGJrrkEUssStQsW7 # **Workshop "Data Ingestion with dlt": Homework** --- ## **Dataset & API** We’ll use **NYC Taxi data** via the same custom API from the workshop: 🔹 **Base API URL:** ``` https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api ``` 🔹 **Data format:** Paginated JSON (1,000 records per page). 🔹 **API Pagination:** Stop when an empty page is returned. ## **Question 1: dlt Version** 1. **Install dlt**: ``` !pip install dlt[duckdb] ``` > Or choose a different bracket—`bigquery`, `redshift`, etc.—if you prefer another primary destination. For this assignment, we’ll still do a quick test with DuckDB. 2. **Check** the version: ``` !dlt --version ``` or: ```py import dlt print("dlt version:", dlt.__version__) ``` Provide the **version** you see in the output. ## **Question 2: Define & Run the Pipeline (NYC Taxi API)** Use dlt to extract all pages of data from the API. Steps: 1️⃣ Use the `@dlt.resource` decorator to define the API source. 2️⃣ Implement automatic pagination using dlt's built-in REST client. 3️⃣ Load the extracted data into DuckDB for querying. 
```py
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator


# your code is here


pipeline = dlt.pipeline(
    pipeline_name="ny_taxi_pipeline",
    destination="duckdb",
    dataset_name="ny_taxi_data"
)
```

Load the data into DuckDB to test:

```py
load_info = pipeline.run(ny_taxi)
print(load_info)
```

Start a connection to your database using a native `duckdb` connection and look at what tables were generated:

```py
import duckdb
from google.colab import data_table
data_table.enable_dataframe_formatter()

# A database '<pipeline_name>.duckdb' was created in working directory so just connect to it

# Connect to the DuckDB database
conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")

# Set search path to the dataset
conn.sql(f"SET search_path = '{pipeline.dataset_name}'")

# Describe the dataset
conn.sql("DESCRIBE").df()
```

How many tables were created?

* 2
* 4
* 6
* 8

## **Question 3: Explore the loaded data**

Inspect the table `rides`:

```py
df = pipeline.dataset(dataset_type="default").rides.df()
df
```

What is the total number of records extracted?

* 2500
* 5000
* 7500
* 10000

## **Question 4: Trip Duration Analysis**

Run the SQL query below to:

* Calculate the average trip duration in minutes.

```py
with pipeline.sql_client() as client:
    res = client.execute_sql(
            """
            SELECT
            AVG(date_diff('minute', trip_pickup_date_time, trip_dropoff_date_time))
            FROM rides;
            """
        )
    # Prints column values of the first row
    print(res)
```

What is the average trip duration?

* 12.3049
* 22.3049
* 32.3049
* 42.3049

## **Submitting the solutions**

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/workshop1

## **Solution**

We will publish the solution here after the deadline.
================================================
FILE: cohorts/2025/workshops/dynamic_load_dlt.py
================================================
import json
import os
import toml
import requests
import dlt
from dlt.sources.filesystem import filesystem, read_parquet
from google.cloud import storage
import io
import pyarrow.parquet as pq

# Load the TOML file
# the TOML file should follow below format:
#[credentials]
#project_id = "your project id"
#private_key = "your service account key"
#client_email = "email"

config = toml.load("./.dlt/secrets.toml")

# Set environment variables
os.environ["CREDENTIALS__PROJECT_ID"] = config["credentials"]["project_id"]
os.environ["CREDENTIALS__PRIVATE_KEY"] = config["credentials"]["private_key"]
os.environ["CREDENTIALS__CLIENT_EMAIL"] = config["credentials"]["client_email"]


# Function to generate URLs based on user input for the date range and trip color
def generate_urls(color, start_year, end_year, start_month, end_month):
    base_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    urls = []

    # Generate the list of URLs based on the specified date range and color
    # Note: the same month range is applied to every year in the year range
    for year in range(start_year, end_year + 1):
        for month in range(start_month, end_month + 1):
            # Format the month to ensure two digits
            month_str = f"{month:02d}"
            url = f"{base_url}{color}_tripdata_{year}-{month_str}.parquet"
            #https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2020-01.parquet
            urls.append(url)

    return urls


# User input for time range and trip color
color = input("Enter color (green, yellow): ").lower()
start_year = int(input("Enter the start year (e.g., 2019): "))
end_year = int(input("Enter the end year (e.g., 2022): "))
start_month = int(input("Enter the start month (1-12): "))
end_month = int(input("Enter the end month (1-12): "))

# Generate URLs based on user input
urls = generate_urls(color, start_year, end_year, start_month, end_month)

# Debug: Print generated URLs
print("Generated URLs:")
for url in urls:
    print(url)

dlt_method = input("Choose loading method: 1 for GCS -> Bigquery, 2 for Direct Web -> Bigquery: ")

if dlt_method == "1":
    # Initialize GCS client
    storage_client = storage.Client.from_service_account_json("gcs.json")
    bucket_name = input("Enter the GCS bucket name: ")  # Replace with your GCS bucket name
    bucket = storage_client.bucket(bucket_name)

    # Download files and upload them to GCS
    gcs_files = []
    for url in urls:
        file_name = url.split("/")[-1]  # Extract the file name from the URL
        gcs_blob = bucket.blob(file_name)
        print(f"Downloading {url} and uploading to GCS as {file_name}")
        response = requests.get(url)
        response.raise_for_status()  # Do not upload an HTTP error page as a Parquet file
        gcs_blob.upload_from_string(response.content)
        gcs_files.append(f"gs://{bucket_name}/{file_name}")

    @dlt.resource(name="rides", write_disposition="replace")
    def parquet_source():
        # Use filesystem to load files from GCS and apply read_parquet transformation
        files = filesystem(bucket_url=f"gs://{bucket_name}/", file_glob="*.parquet")
        reader = (files | read_parquet()).with_name("tripdata")

        # Iterate through the rows from the reader and yield them
        row_count = 0
        for row in reader:
            row_count += 1
            yield row
        print(f"Total rows yielded: {row_count}")

elif dlt_method == "2":
    # Alternative method: Streaming Parquet files directly from the web
    @dlt.resource(name="ny_taxi_dlt", write_disposition="replace")
    def paginated_getter():
        for url in urls:
            try:
                with requests.get(url, stream=True) as response:
                    response.raise_for_status()
                    buffer = io.BytesIO()
                    for chunk in response.iter_content(chunk_size=1024 * 1024):  # 1MB chunks
                        buffer.write(chunk)
                    buffer.seek(0)
                    table = pq.read_table(buffer)
                    print(f'Got data from {url} with {table.num_rows} records')
                    if table.num_rows > 0:
                        yield table
            except Exception as e:
                print(f"Failed to fetch data from {url}: {e}")

else:
    # Reject an invalid choice before asking for the dataset name
    raise SystemExit("Invalid selection")

# Create the pipeline
pipeline = dlt.pipeline(
    pipeline_name="test_taxi",
    dataset_name=input("Enter the dataset name: "),
    destination="bigquery"
    # dev_mode=True
)

# Run the pipeline with either method
if dlt_method == "1":
    info = pipeline.run(parquet_source())
else:
    info = pipeline.run(paginated_getter())

print(info)

================================================
FILE: cohorts/2026/01-docker-terraform/homework.md
================================================
# Module 1 Homework: Docker & SQL

In this homework we'll prepare the environment and practice
Docker and SQL.

When submitting your homework, you will also need to include
a link to your GitHub repository or other public code-hosting
site.

This repository should contain the code for solving the homework.

When your solution consists of SQL or shell commands rather than code
files (e.g. Python files), include them directly in
the README file of your repository.


## Question 1. Understanding Docker images

Run docker with the `python:3.13` image. Use an entrypoint `bash` to interact with the container.

What's the version of `pip` in the image?

- 25.3
- 24.3.1
- 24.2.1
- 23.3.1


## Question 2. Understanding Docker networking and docker-compose

Given the following `docker-compose.yaml`, what are the `hostname` and `port` that pgadmin should use to connect to the postgres database?
```yaml services: db: container_name: postgres image: postgres:17-alpine environment: POSTGRES_USER: 'postgres' POSTGRES_PASSWORD: 'postgres' POSTGRES_DB: 'ny_taxi' ports: - '5433:5432' volumes: - vol-pgdata:/var/lib/postgresql/data pgadmin: container_name: pgadmin image: dpage/pgadmin4:latest environment: PGADMIN_DEFAULT_EMAIL: "pgadmin@pgadmin.com" PGADMIN_DEFAULT_PASSWORD: "pgadmin" ports: - "8080:80" volumes: - vol-pgadmin_data:/var/lib/pgadmin volumes: vol-pgdata: name: vol-pgdata vol-pgadmin_data: name: vol-pgadmin_data ``` - postgres:5433 - localhost:5432 - db:5433 - postgres:5432 - db:5432 If multiple answers are correct, select any ## Prepare the Data Download the green taxi trips data for November 2025: ```bash wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet ``` You will also need the dataset with zones: ```bash wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv ``` ## Question 3. Counting short trips For the trips in November 2025 (lpep_pickup_datetime between '2025-11-01' and '2025-12-01', exclusive of the upper bound), how many trips had a `trip_distance` of less than or equal to 1 mile? - 7,853 - 8,007 - 8,254 - 8,421 ## Question 4. Longest trip for each day Which was the pick up day with the longest trip distance? Only consider trips with `trip_distance` less than 100 miles (to exclude data errors). Use the pick up time for your calculations. - 2025-11-14 - 2025-11-20 - 2025-11-23 - 2025-11-25 ## Question 5. Biggest pickup zone Which was the pickup zone with the largest `total_amount` (sum of all trips) on November 18th, 2025? - East Harlem North - East Harlem South - Morningside Heights - Forest Hills ## Question 6. Largest tip For the passengers picked up in the zone named "East Harlem North" in November 2025, which was the drop off zone that had the largest tip? Note: it's `tip` , not `trip`. We need the name of the zone, not the ID. 
- JFK Airport
- Yorkville West
- East Harlem North
- LaGuardia Airport


## Terraform

In this section of the homework, we'll prepare the environment by creating resources in GCP with Terraform.

In your VM on GCP/Laptop/GitHub Codespace install Terraform.
Copy the files from the course repo
[here](../../../01-docker-terraform/terraform/terraform) to your VM/Laptop/GitHub Codespace.

Modify the files as necessary to create a GCP Bucket and Big Query Dataset.


## Question 7. Terraform Workflow

Which of the following sequences, respectively, describes the workflow for:
1. Downloading the provider plugins and setting up the backend,
2. Generating proposed changes and auto-executing the plan,
3. Removing all resources managed by Terraform

Answers:
- terraform import, terraform apply -y, terraform destroy
- teraform init, terraform plan -auto-apply, terraform rm
- terraform init, terraform run -auto-approve, terraform destroy
- terraform init, terraform apply -auto-approve, terraform destroy
- terraform import, terraform apply -y, terraform rm


## Submitting the solutions

* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw1


## Learning in Public

We encourage everyone to share what they learned. This is called "learning in public".

### Why learn in public?

- Accountability: Sharing your progress creates commitment and motivation to continue
- Feedback: The community can provide valuable suggestions and corrections
- Networking: You'll connect with like-minded people and potential collaborators
- Documentation: Your posts become a learning journal you can reference later
- Opportunities: Employers and clients often discover talent through public learning

You can read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

Don't worry about being perfect. Everyone starts somewhere, and people love following genuine learning journeys!
### Example post for LinkedIn ``` 🚀 Week 1 of Data Engineering Zoomcamp by @DataTalksClub complete! Just finished Module 1 - Docker & Terraform. Learned how to: ✅ Containerize applications with Docker and Docker Compose ✅ Set up PostgreSQL databases and write SQL queries ✅ Build data pipelines to ingest NYC taxi data ✅ Provision cloud infrastructure with Terraform Here's my homework solution: Following along with this amazing free course - who else is learning data engineering? You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ### Example post for Twitter/X ``` 🐳 Module 1 of Data Engineering Zoomcamp done! - Docker containers - Postgres & SQL - Terraform & GCP - NYC taxi data pipeline My solution: Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ================================================ FILE: cohorts/2026/02-workflow-orchestration/homework.md ================================================ ## Module 2 Homework ATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include these directly in the README file of your repository. > In case you don't get one option exactly, select the closest one For the homework, we'll be working with the _green_ taxi dataset located here: `https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download` To get a `wget`-able link, use this prefix (note that the link itself gives 404): `https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/` ### Assignment So far in the course, we processed data for the year 2019 and 2020. Your task is to extend the existing flows to include data for the year 2021. 

![homework datasets](../../../02-workflow-orchestration/images/homework.png)

As a hint, Kestra makes that process really easy:
1. You can leverage the backfill functionality in the [scheduled flow](../../../02-workflow-orchestration/flows/09_gcp_taxi_scheduled.yaml) to backfill the data for the year 2021. Just make sure to select the time period for which data exists, i.e. from `2021-01-01` to `2021-07-31`. Also, make sure to do the same for both `yellow` and `green` taxi data (select the right service in the `taxi` input).
2. Alternatively, run the flow manually for each of the seven months of 2021 for both `yellow` and `green` taxi data.

Challenge for you: find out how to loop over the combination of Year-Month and `taxi`-type using a `ForEach` task that triggers the flow for each combination using a `Subflow` task.

### Quiz Questions

Complete the quiz shown below. It's a set of 6 multiple-choice questions to test your understanding of workflow orchestration, Kestra, and ETL pipelines.

1) Within the execution for `Yellow` Taxi data for the year `2020` and month `12`: what is the uncompressed file size (i.e. the output file `yellow_tripdata_2020-12.csv` of the `extract` task)?
- 128.3 MiB
- 134.5 MiB
- 364.7 MiB
- 692.6 MiB

2) What is the rendered value of the variable `file` when the input `taxi` is set to `green`, `year` is set to `2020`, and `month` is set to `04` during execution?
- `{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv`
- `green_tripdata_2020-04.csv`
- `green_tripdata_04_2020.csv`
- `green_tripdata_2020.csv`

3) How many rows are there for the `Yellow` Taxi data for all CSV files in the year 2020?
- 13,537,299
- 24,648,499
- 18,324,219
- 29,430,127

4) How many rows are there for the `Green` Taxi data for all CSV files in the year 2020?
- 5,327,301
- 936,199
- 1,734,051
- 1,342,034

5) How many rows are there for the `Yellow` Taxi data for the March 2021 CSV file?
- 1,428,092 - 706,911 - 1,925,152 - 2,561,031 6) How would you configure the timezone to New York in a Schedule trigger? - Add a `timezone` property set to `EST` in the `Schedule` trigger configuration - Add a `timezone` property set to `America/New_York` in the `Schedule` trigger configuration - Add a `timezone` property set to `UTC-5` in the `Schedule` trigger configuration - Add a `location` property set to `New_York` in the `Schedule` trigger configuration ## Submitting the solutions * Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw2 * Check the link above to see the due date ## Solution Will be added after the due date ## Learning in Public We encourage everyone to share what they learned. This is called "learning in public". Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and). ### Example post for LinkedIn ``` 🚀 Week 2 of Data Engineering Zoomcamp by @DataTalksClub and @Will Russell complete! Just finished Module 2 - Workflow Orchestration with @Kestra. Learned how to: ✅ Orchestrate data pipelines with Kestra flows ✅ Use variables and expressions for dynamic workflows ✅ Implement backfill for historical data ✅ Schedule workflows with timezone support ✅ Process NYC taxi data (Yellow & Green) for 2019-2021 Built ETL pipelines that extract, transform, and load taxi trip data automatically! Thanks to the @Kestra team for the great orchestration tool! Here's my homework solution: Following along with this amazing free course - who else is learning data engineering? You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ### Example post for Twitter/X ``` Module 2 of DE Zoomcamp by @DataTalksClub @wrussell1999 done! 
- @kestra_io workflow orchestration
- ETL pipelines for taxi data
- Backfill & scheduling
- Variables & dynamic flows

My solution:

Join me here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

================================================
FILE: cohorts/2026/03-data-warehouse/DLT_upload_to_GCP.ipynb
================================================
{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "aC2QnhmKxpq1" }, "source": [ "**Please set up your credentials JSON as GCP_CREDENTIALS secrets**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UsUZobVduL7l" }, "outputs": [], "source": [ "import os\n", "from google.colab import userdata\n", "\n", "os.environ[\"DESTINATION__CREDENTIALS\"] = userdata.get(\"GCP_CREDENTIALS\")\n", "os.environ[\"BUCKET_URL\"] = \"gs://your_bucket_url\"" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "mPBzsEgyjsBo" }, "outputs": [], "source": [ "%%capture\n", "# Install for production (the cell magic must be the first line of the cell)\n", "!pip install dlt[bigquery,gs]" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "evdUsDNbkCTk" }, "outputs": [], "source": [ "%%capture\n", "# Install for testing (the cell magic must be the first line of the cell)\n", "!pip install dlt[duckdb]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "lYh7r1mTf4uo" }, "outputs": [], "source": [ "import dlt\n", "import requests\n", "import pandas as pd\n", "from dlt.destinations import filesystem\n", "from io import BytesIO" ] }, { "cell_type": "markdown", "metadata": { "id": "76zT1PzAgs7A" }, "source": [ "Ingesting parquet files to GCS."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xya0215jsnsb" }, "outputs": [], "source": [ "# Define a dlt source to download and process Parquet files as resources\n", "@dlt.source(name=\"rides\")\n", "def download_parquet():\n", "    prefix = \"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata\"\n", "    for month in range(1, 7):\n", "        file_name = f\"yellow_tripdata_2024-0{month}.parquet\"\n", "        url = f\"{prefix}_2024-0{month}.parquet\"\n", "        response = requests.get(url)\n", "\n", "        df = pd.read_parquet(BytesIO(response.content))\n", "\n", "        # Return the dataframe as a dlt resource for ingestion\n", "        yield dlt.resource(df, name=file_name)\n", "\n", "\n", "# Initialize the pipeline\n", "pipeline = dlt.pipeline(\n", "    pipeline_name=\"rides_pipeline\",\n", "    destination=filesystem(layout=\"{schema_name}/{table_name}.{ext}\"),\n", "    dataset_name=\"rides_dataset\",\n", ")\n", "\n", "# Run the pipeline to load Parquet data into the filesystem destination (BUCKET_URL)\n", "load_info = pipeline.run(download_parquet(), loader_file_format=\"parquet\")\n", "\n", "# Print the results\n", "print(load_info)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "S0310FT-gy_P" }, "source": [ "Ingesting data to Database" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1_3K97w1c2v2", "outputId": "4b2d26bf-2814-46fa-f80d-7a2e17417a95" }, "outputs": [], "source": [ "# Define a dlt resource to download and process Parquet files as single table\n", "@dlt.resource(name=\"rides\", write_disposition=\"replace\")\n", "def download_parquet():\n", "    prefix = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata'\n", "\n", "    for month in range(1, 7):\n", "        url = f\"{prefix}_2024-0{month}.parquet\"\n", "        response = requests.get(url)\n", "\n", "        df = pd.read_parquet(BytesIO(response.content))\n", "\n", "        yield df\n", "\n", "\n", "# Initialize the pipeline\n", "pipeline = dlt.pipeline(\n", "    pipeline_name=\"rides_pipeline\",\n", "    destination=\"duckdb\", # Use DuckDB for testing\n", "    # destination=\"bigquery\", # Use BigQuery for production\n", "    dataset_name=\"rides_dataset\",\n", ")\n", "\n", "# Run the pipeline to load Parquet data into DuckDB\n", "info = pipeline.run(download_parquet)\n", "\n", "# Print the results\n", "print(info)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gDcLjzLtooBV", "outputId": "74ff2de7-2f2e-41b9-a681-3dc5887f6eed" }, "outputs": [], "source": [ "import duckdb\n", "\n", "conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n", "\n", "# Set search path to the dataset\n", "conn.sql(f\"SET search_path = '{pipeline.dataset_name}'\")\n", "\n", "# Describe the dataset to see loaded tables\n", "res = conn.sql(\"DESCRIBE\").df()\n", "print(res)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VVJy8JoerI2P", "outputId": "3f8c7fee-a9ee-4fd4-ec75-153ca60bd36f" }, "outputs": [], "source": [ "# provide a resource name to query a table of that name\n", "with pipeline.sql_client() as client:\n", "    with client.execute_query(f\"SELECT count(1) FROM rides\") as cursor:\n", "        data = cursor.df()\n", "print(data)" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }

================================================
FILE: cohorts/2026/03-data-warehouse/homework.md
================================================
# Module 3 Homework: Data Warehousing & BigQuery

In this homework we'll practice working with BigQuery and Google Cloud Storage.

When submitting your homework, you will also need to include
a link to your GitHub repository or other public code-hosting
site.

This repository should contain the code for solving the homework.
When your solution has SQL or shell commands and not code (e.g. python files) file format, include them directly in the README file of your repository. ## Data For this homework we will be using the Yellow Taxi Trip Records for January 2024 - June 2024 (not the entire year of data). Parquet Files are available from the New York City Taxi Data found here: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page ## Loading the data You can use the following scripts to load the data into your GCS bucket: - Python script: [load_yellow_taxi_data.py](./load_yellow_taxi_data.py) - Jupyter notebook with DLT: [DLT_upload_to_GCP.ipynb](./DLT_upload_to_GCP.ipynb) You will need to generate a Service Account with GCS Admin privileges or be authenticated with the Google SDK, and update the bucket name in the script. If you are using orchestration tools such as Kestra, Mage, Airflow, or Prefect, do not load the data into BigQuery using the orchestrator. Make sure that all 6 files show in your GCS bucket before beginning. Note: You will need to use the PARQUET option when creating an external table. ## BigQuery Setup Create an external table using the Yellow Taxi Trip Records. Create a (regular/materialized) table in BQ using the Yellow Taxi Trip Records (do not partition or cluster this table). ## Question 1. Counting records What is count of records for the 2024 Yellow Taxi Data? - 65,623 - 840,402 - 20,332,093 - 85,431,289 ## Question 2. Data read estimation Write a query to count the distinct number of PULocationIDs for the entire dataset on both the tables. What is the **estimated amount** of data that will be read when this query is executed on the External Table and the Table? - 18.82 MB for the External Table and 47.60 MB for the Materialized Table - 0 MB for the External Table and 155.12 MB for the Materialized Table - 2.14 GB for the External Table and 0MB for the Materialized Table - 0 MB for the External Table and 0MB for the Materialized Table ## Question 3. 
Understanding columnar storage Write a query to retrieve the PULocationID from the table (not the external table) in BigQuery. Now write a query to retrieve the PULocationID and DOLocationID on the same table. Why are the estimated number of Bytes different? - BigQuery is a columnar database, and it only scans the specific columns requested in the query. Querying two columns (PULocationID, DOLocationID) requires reading more data than querying one column (PULocationID), leading to a higher estimated number of bytes processed. - BigQuery duplicates data across multiple storage partitions, so selecting two columns instead of one requires scanning the table twice, doubling the estimated bytes processed. - BigQuery automatically caches the first queried column, so adding a second column increases processing time but does not affect the estimated bytes scanned. - When selecting multiple columns, BigQuery performs an implicit join operation between them, increasing the estimated bytes processed ## Question 4. Counting zero fare trips How many records have a fare_amount of 0? - 128,210 - 546,578 - 20,188,016 - 8,333 ## Question 5. Partitioning and clustering What is the best strategy to make an optimized table in Big Query if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID (Create a new table with this strategy) - Partition by tpep_dropoff_datetime and Cluster on VendorID - Cluster on by tpep_dropoff_datetime and Cluster on VendorID - Cluster on tpep_dropoff_datetime Partition by VendorID - Partition by tpep_dropoff_datetime and Partition by VendorID ## Question 6. Partition benefits Write a query to retrieve the distinct VendorIDs between tpep_dropoff_datetime 2024-03-01 and 2024-03-15 (inclusive) Use the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 5 and note the estimated bytes processed. 
What are these values? Choose the answer which most closely matches. - 12.47 MB for non-partitioned table and 326.42 MB for the partitioned table - 310.24 MB for non-partitioned table and 26.84 MB for the partitioned table - 5.87 MB for non-partitioned table and 0 MB for the partitioned table - 310.31 MB for non-partitioned table and 285.64 MB for the partitioned table ## Question 7. External table storage Where is the data stored in the External Table you created? - Big Query - Container Registry - GCP Bucket - Big Table ## Question 8. Clustering best practices It is best practice in Big Query to always cluster your data: - True - False ## Question 9. Understanding table scans No Points: Write a `SELECT count(*)` query FROM the materialized table you created. How many bytes does it estimate will be read? Why? ## Submitting the solutions Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw3 ## Learning in Public We encourage everyone to share what they learned. This is called "learning in public". Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and). ### Example post for LinkedIn ``` 🚀 Week 3 of Data Engineering Zoomcamp by @DataTalksClub complete! Just finished Module 3 - Data Warehousing with BigQuery. Learned how to: ✅ Create external tables from GCS bucket data ✅ Build materialized tables in BigQuery ✅ Partition and cluster tables for performance ✅ Understand columnar storage and query optimization ✅ Analyze NYC taxi data at scale Working with 20M+ records and learning how partitioning reduces query costs! Here's my homework solution: Following along with this amazing free course - who else is learning data engineering? You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ### Example post for Twitter/X ``` 📊 Module 3 of Data Engineering Zoomcamp done! 

- BigQuery & GCS
- External vs materialized tables
- Partitioning & clustering
- Query optimization

My solution:

Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

================================================
FILE: cohorts/2026/03-data-warehouse/load_yellow_taxi_data.py
================================================
import os
import sys
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage
from google.api_core.exceptions import NotFound, Forbidden
import time


# Change this to your bucket name
BUCKET_NAME = "dezoomcamp_hw3_2025"

# If you authenticated through the GCP SDK you can comment out these two lines
CREDENTIALS_FILE = "gcs.json"
client = storage.Client.from_service_account_json(CREDENTIALS_FILE)
# If you commented them out, initialize the client with the following instead
# client = storage.Client(project='zoomcamp-mod3-datawarehouse')


BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-"
MONTHS = [f"{i:02d}" for i in range(1, 7)]  # "01" .. "06"
DOWNLOAD_DIR = "."

CHUNK_SIZE = 8 * 1024 * 1024  # upload chunk size in bytes (8 MiB)

os.makedirs(DOWNLOAD_DIR, exist_ok=True)

bucket = client.bucket(BUCKET_NAME)


def download_file(month):
    """Download one month of Yellow Taxi data; return the local path or None on failure."""
    url = f"{BASE_URL}{month}.parquet"
    file_path = os.path.join(DOWNLOAD_DIR, f"yellow_tripdata_2024-{month}.parquet")

    try:
        print(f"Downloading {url}...")
        urllib.request.urlretrieve(url, file_path)
        print(f"Downloaded: {file_path}")
        return file_path
    except Exception as e:
        print(f"Failed to download {url}: {e}")
        return None


def create_bucket(bucket_name):
    """Make sure the bucket exists and belongs to the current project, creating it if needed."""
    try:
        # Get bucket details
        bucket = client.get_bucket(bucket_name)

        # Check if the bucket belongs to the current project
        project_bucket_ids = [bckt.id for bckt in client.list_buckets()]
        if bucket_name in project_bucket_ids:
            print(
                f"Bucket '{bucket_name}' exists and belongs to your project. Proceeding..."
            )
        else:
            print(
                f"A bucket with the name '{bucket_name}' already exists, but it does not belong to your project."
            )
            sys.exit(1)

    except NotFound:
        # If the bucket doesn't exist, create it
        bucket = client.create_bucket(bucket_name)
        print(f"Created bucket '{bucket_name}'")

    except Forbidden:
        # If the request is forbidden, it means the bucket exists but you don't have access to see details
        print(
            f"A bucket with the name '{bucket_name}' exists, but it is not accessible. Bucket name is taken. Please try a different bucket name."
        )
        sys.exit(1)


def verify_gcs_upload(blob_name):
    """Return True if the blob is present in the bucket."""
    return storage.Blob(bucket=bucket, name=blob_name).exists(client)


def upload_to_gcs(file_path, max_retries=3):
    """Upload a local file to the bucket, retrying until the upload is verified."""
    blob_name = os.path.basename(file_path)
    blob = bucket.blob(blob_name)
    blob.chunk_size = CHUNK_SIZE

    create_bucket(BUCKET_NAME)

    for attempt in range(max_retries):
        try:
            print(f"Uploading {file_path} to {BUCKET_NAME} (Attempt {attempt + 1})...")
            blob.upload_from_filename(file_path)
            print(f"Uploaded: gs://{BUCKET_NAME}/{blob_name}")

            if verify_gcs_upload(blob_name):
                print(f"Verification successful for {blob_name}")
                return
            else:
                print(f"Verification failed for {blob_name}, retrying...")
        except Exception as e:
            print(f"Failed to upload {file_path} to GCS: {e}")

        time.sleep(5)

    print(f"Giving up on {file_path} after {max_retries} attempts.")


if __name__ == "__main__":
    create_bucket(BUCKET_NAME)

    with ThreadPoolExecutor(max_workers=4) as executor:
        file_paths = list(executor.map(download_file, MONTHS))

    with ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(upload_to_gcs, filter(None, file_paths))  # Remove None values

    print("All files processed and verified.")


================================================
FILE: cohorts/2026/04-analytics-engineering/homework.md
================================================
# Module 4 Homework: Analytics Engineering with dbt

In this homework, we'll use the dbt project in `04-analytics-engineering/taxi_rides_ny/` to transform NYC taxi data and answer questions by querying the models.

## Setup

1. Set up your dbt project following the [setup guide](../../../04-analytics-engineering/setup/)
2. Load the Green and Yellow taxi data for 2019-2020 and FHV trip data for 2019 into your warehouse (use the static tables from the [DataTalksClub GitHub repository](https://github.com/DataTalksClub/nyc-tlc-data/); don't use the official tables from TLC, because some values change from time to time)
3. Run `dbt build --target prod` to create all models and run tests

> **Note:** By default, dbt uses the `dev` target. You must use `--target prod` to build the models in the production dataset, which is required for the homework queries below.

After a successful build, you should have models like `fct_trips`, `dim_zones`, and `fct_monthly_zone_revenue` in your warehouse.

---

### Question 1. dbt Lineage and Execution

Given a dbt project with the following structure:

```
models/
├── staging/
│   ├── stg_green_tripdata.sql
│   └── stg_yellow_tripdata.sql
└── intermediate/
    └── int_trips_unioned.sql (depends on stg_green_tripdata & stg_yellow_tripdata)
```

If you run `dbt run --select int_trips_unioned`, what models will be built?

- `stg_green_tripdata`, `stg_yellow_tripdata`, and `int_trips_unioned` (upstream dependencies)
- Any model with upstream and downstream dependencies to `int_trips_unioned`
- `int_trips_unioned` only
- `int_trips_unioned`, `int_trips`, and `fct_trips` (downstream dependencies)

---

### Question 2. dbt Tests

You've configured a generic test like this in your `schema.yml`:

```yaml
columns:
  - name: payment_type
    data_tests:
      - accepted_values:
          arguments:
            values: [1, 2, 3, 4, 5]
            quote: false
```

Your model `fct_trips` has been running successfully for months. A new value `6` now appears in the source data.

What happens when you run `dbt test --select fct_trips`?

- dbt will skip the test because the model didn't change
- dbt will fail the test, returning a non-zero exit code
- dbt will pass the test with a warning about the new value
- dbt will update the configuration to include the new value

---

### Question 3. Counting Records in `fct_monthly_zone_revenue`

After running your dbt project, query the `fct_monthly_zone_revenue` model.

What is the count of records in the `fct_monthly_zone_revenue` model?

- 12,998
- 14,120
- 12,184
- 15,421

---

### Question 4. Best Performing Zone for Green Taxis (2020)

Using the `fct_monthly_zone_revenue` table, find the pickup zone with the **highest total revenue** (`revenue_monthly_total_amount`) for **Green** taxi trips in 2020.

Which zone had the highest revenue?

- East Harlem North
- Morningside Heights
- East Harlem South
- Washington Heights South

---

### Question 5. Green Taxi Trip Counts (October 2019)

Using the `fct_monthly_zone_revenue` table, what is the **total number of trips** (`total_monthly_trips`) for Green taxis in October 2019?

- 500,234
- 350,891
- 384,624
- 421,509

---

### Question 6. Build a Staging Model for FHV Data

Create a staging model for the **For-Hire Vehicle (FHV)** trip data for 2019.

1. Load the [FHV trip data for 2019](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv) into your data warehouse
2. Create a staging model `stg_fhv_tripdata` with these requirements:
   - Filter out records where `dispatching_base_num IS NULL`
   - Rename fields to match your project's naming conventions (e.g., `PUlocationID` → `pickup_location_id`)

What is the count of records in `stg_fhv_tripdata`?

- 42,084,899
- 43,244,693
- 22,998,722
- 44,112,187

---

## Submitting the solutions

- Form for submitting:

## Learning in Public

We encourage everyone to share what they learned. This is called "learning in public".

Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

### Example post for LinkedIn ``` 🚀 Week 4 of Data Engineering Zoomcamp by @DataTalksClub complete! Just finished Module 4 - Analytics Engineering with dbt. Learned how to: ✅ Build transformation models with dbt ✅ Create staging, intermediate, and fact tables ✅ Write tests to ensure data quality ✅ Understand lineage and model dependencies ✅ Analyze revenue patterns across NYC zones Transforming raw data into analytics-ready models - the T in ELT! Here's my homework solution: Following along with this amazing free course - who else is learning data engineering? You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ### Example post for Twitter/X ``` 📈 Module 4 of Data Engineering Zoomcamp done! - Analytics Engineering with dbt - Transformation models & tests - Data lineage & dependencies - NYC taxi revenue analysis My solution: Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ================================================ FILE: cohorts/2026/05-data-platforms/homework.md ================================================ # Module 5 Homework: Data Platforms with Bruin In this homework, we'll use Bruin to build a complete data pipeline, from ingestion to reporting. ## Setup 1. Install Bruin CLI: `curl -LsSf https://getbruin.com/install/cli | sh` 2. Initialize the zoomcamp template: `bruin init zoomcamp my-pipeline` 3. Configure your `.bruin.yml` with a DuckDB connection 4. Follow the tutorial in the [main module README](../../../05-data-platforms/) After completing the setup, you should have a working NYC taxi data pipeline. --- ### Question 1. Bruin Pipeline Structure In a Bruin project, what are the required files/directories? - `bruin.yml` and `assets/` - `.bruin.yml` and `pipeline.yml` (assets can be anywhere) - `.bruin.yml` and `pipeline/` with `pipeline.yml` and `assets/` - `pipeline.yml` and `assets/` only --- ### Question 2. 
Materialization Strategies You're building a pipeline that processes NYC taxi data organized by month based on `pickup_datetime`. Which incremental strategy is best for processing a specific interval period by deleting and inserting data for that time period? - `append` - always add new rows - `replace` - truncate and rebuild entirely - `time_interval` - incremental based on a time column - `view` - create a virtual table only --- ### Question 3. Pipeline Variables You have the following variable defined in `pipeline.yml`: ```yaml variables: taxi_types: type: array items: type: string default: ["yellow", "green"] ``` How do you override this when running the pipeline to only process yellow taxis? - `bruin run --taxi-types yellow` - `bruin run --var taxi_types=yellow` - `bruin run --var 'taxi_types=["yellow"]'` - `bruin run --set taxi_types=["yellow"]` --- ### Question 4. Running with Dependencies You've modified the `ingestion/trips.py` asset and want to run it plus all downstream assets. Which command should you use? - `bruin run ingestion.trips --all` - `bruin run ingestion/trips.py --downstream` - `bruin run pipeline/trips.py --recursive` - `bruin run --select ingestion.trips+` --- ### Question 5. Quality Checks You want to ensure the `pickup_datetime` column in your trips table never has NULL values. Which quality check should you add to your asset definition? - `name: unique` - `name: not_null` - `name: positive` - `name: accepted_values, value: [not_null]` --- ### Question 6. Lineage and Dependencies After building your pipeline, you want to visualize the dependency graph between assets. Which Bruin command should you use? - `bruin graph` - `bruin dependencies` - `bruin lineage` - `bruin show` --- ### Question 7. First-Time Run You're running a Bruin pipeline for the first time on a new DuckDB database. What flag should you use to ensure tables are created from scratch? 

- `--create`
- `--init`
- `--full-refresh`
- `--truncate`

---

## Submitting the solutions

- Form for submitting:

## Learning in Public

We encourage everyone to share what they learned. This is called "learning in public".

Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).

### Example post for LinkedIn

```
🚀 Week 5 of Data Engineering Zoomcamp by @DataTalksClub complete!

Just finished Module 5 - Data Platforms with Bruin. Learned how to:

✅ Build end-to-end ELT pipelines with Bruin
✅ Configure environments and connections
✅ Use materialization strategies for incremental processing
✅ Add data quality checks to ensure data integrity
✅ Deploy pipelines from local to cloud (BigQuery)

Modern data platforms in a single CLI tool - no vendor lock-in!

Here's my homework solution:

Following along with this amazing free course - who else is learning data engineering?

You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```

### Example post for Twitter/X

```
📊 Module 5 of Data Engineering Zoomcamp done!

- Data Platforms with Bruin
- End-to-end ELT pipelines
- Data quality & lineage
- Deployment to BigQuery

My solution:

Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/
```


================================================
FILE: cohorts/2026/06-batch/homework.md
================================================
# Module 6 Homework

In this homework we'll put what we learned about Spark into practice.

For this homework we will be using the Yellow 2025-11 data from the official website:

```bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet
```

## Question 1: Install Spark and PySpark

- Install Spark
- Run PySpark
- Create a local Spark session
- Execute `spark.version`.

What's the output?

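
The last two steps of Question 1 can be sketched as follows. This is a minimal sketch, assuming PySpark is already installed and a Java runtime is available; the helper name `create_local_session`, the app name, and the `local[*]` master are illustrative choices, not requirements of the homework.

```python
def create_local_session(app_name: str = "module-6-homework"):
    """Create (or reuse) a Spark session that runs on the local machine."""
    # Imported inside the function so this sketch can be loaded even on a
    # machine where PySpark is not installed yet (an assumption of this sketch).
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # run locally on all available cores
        .appName(app_name)   # hypothetical name, shown in the Spark UI
        .getOrCreate()
    )
```

Calling `spark = create_local_session()` and then `print(spark.version)` prints the version of the installed Spark distribution, which is the value the question asks for.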
> [!NOTE] > To install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/06-batch/setup/) ## Question 2: Yellow November 2025 Read the November 2025 Yellow into a Spark Dataframe. Repartition the Dataframe to 4 partitions and save it to parquet. What is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? Select the answer which most closely matches. - 6MB - 25MB - 75MB - 100MB ## Question 3: Count records How many taxi trips were there on the 15th of November? Consider only trips that started on the 15th of November. - 62,610 - 102,340 - 162,604 - 225,768 ## Question 4: Longest trip What is the length of the longest trip in the dataset in hours? - 22.7 - 58.2 - 90.6 - 134.5 ## Question 5: User Interface Spark's User Interface which shows the application's dashboard runs on which local port? - 80 - 443 - 4040 - 8080 ## Question 6: Least frequent pickup location zone Load the zone lookup data into a temp view in Spark: ```bash wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv ``` Using the zone lookup data and the Yellow November 2025 data, what is the name of the LEAST frequent pickup location Zone? - Governor's Island/Ellis Island/Liberty Island - Arden Heights - Rikers Island - Jamaica Bay If multiple answers are correct, select any ## Submitting the solutions - Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw6 - Deadline: See the website ## Learning in Public We encourage everyone to share what they learned. This is called "learning in public". Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and). ### Example post for LinkedIn ``` 🚀 Week 6 of Data Engineering Zoomcamp by @DataTalksClub complete! Just finished Module 6 - Batch Processing with Spark. 
Learned how to: ✅ Set up PySpark and create Spark sessions ✅ Read and process Parquet files at scale ✅ Repartition data for optimal performance ✅ Analyze millions of taxi trips with DataFrames ✅ Use Spark UI for monitoring jobs Processing 4M+ taxi trips with Spark - distributed computing is powerful! 💪 Here's my homework solution: Following along with this amazing free course - who else is learning data engineering? You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ### Example post for Twitter/X ``` ⚡ Module 6 of Data Engineering Zoomcamp done! - Batch processing with Spark 🔥 - PySpark & DataFrames - Parquet file optimization - Spark UI on port 4040 My solution: Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ================================================ FILE: cohorts/2026/07-streaming/homework.md ================================================ # Homework In this homework, we'll practice streaming with Kafka (Redpanda) and PyFlink. We use Redpanda, a drop-in replacement for Kafka. It implements the same protocol, so any Kafka client library works with it unchanged. For this homework we will be using Green Taxi Trip data from October 2025: - [green_tripdata_2025-10.parquet](https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-10.parquet) ## Setup We'll use the same infrastructure from the [workshop](../../../07-streaming/workshop/). 
Follow the setup instructions: build the Docker image and start the services:

```bash
cd 07-streaming/workshop/
docker compose build
docker compose up -d
```

This gives us:

- Redpanda (Kafka-compatible broker) on `localhost:9092`
- Flink Job Manager at http://localhost:8081
- Flink Task Manager
- PostgreSQL on `localhost:5432` (user: `postgres`, password: `postgres`)

If you previously ran the workshop and have old containers/volumes, do a clean start:

```bash
docker compose down -v
docker compose build
docker compose up -d
```

Note: the container names (like `workshop-redpanda-1`) assume the directory is called `workshop`. If you renamed it, adjust accordingly.

## Question 1. Redpanda version

Run `rpk version` inside the Redpanda container:

```bash
docker exec -it workshop-redpanda-1 rpk version
```

What version of Redpanda are you running?

## Question 2. Sending data to Redpanda

Create a topic called `green-trips`:

```bash
docker exec -it workshop-redpanda-1 rpk topic create green-trips
```

Now write a producer to send the green taxi data to this topic.

Read the parquet file and keep only these columns:

- `lpep_pickup_datetime`
- `lpep_dropoff_datetime`
- `PULocationID`
- `DOLocationID`
- `passenger_count`
- `trip_distance`
- `tip_amount`
- `total_amount`

Convert each row to a dictionary and send it to the `green-trips` topic. You'll need to handle the datetime columns: convert them to strings before serializing to JSON.

Measure the time it takes to send the entire dataset and flush:

```python
from time import time

t0 = time()

# send all rows ...

# `producer` is the Kafka producer you created for this question
producer.flush()

t1 = time()
print(f'took {(t1 - t0):.2f} seconds')
```

How long did it take to send the data?

- 10 seconds
- 60 seconds
- 120 seconds
- 300 seconds

## Question 3. Consumer - trip distance

Write a Kafka consumer that reads all messages from the `green-trips` topic (set `auto_offset_reset='earliest'`).

Count how many trips have a `trip_distance` greater than 5.0 (the TLC data reports this field in miles).

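
The counting step can be sketched independently of any Kafka client. This is a minimal sketch under stated assumptions: each message value is UTF-8 encoded JSON, as produced in Question 2, and the helper name `count_long_trips` and the sample messages are hypothetical, made up for illustration only.

```python
import json
from typing import Iterable


def count_long_trips(raw_values: Iterable[bytes], threshold: float = 5.0) -> int:
    """Count messages whose trip_distance is strictly greater than threshold."""
    count = 0
    for raw in raw_values:
        trip = json.loads(raw.decode("utf-8"))
        distance = trip.get("trip_distance")
        # Skip rows where the distance is missing instead of failing on None.
        if distance is not None and distance > threshold:
            count += 1
    return count


# Made-up sample messages, standing in for the values read from the topic.
sample = [
    json.dumps({"trip_distance": 1.2}).encode("utf-8"),
    json.dumps({"trip_distance": 7.5}).encode("utf-8"),
    json.dumps({"trip_distance": None}).encode("utf-8"),
    json.dumps({"trip_distance": 5.0}).encode("utf-8"),  # not strictly greater
]
print(count_long_trips(sample))  # prints 1
```

In a real consumer you would pass the raw value bytes of each consumed message instead of `sample`, and stop reading once the topic has been drained.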
How many trips have `trip_distance` > 5? - 6506 - 7506 - 8506 - 9506 ## Part 2: PyFlink (Questions 4-6) For the PyFlink questions, you'll adapt the workshop code to work with the green taxi data. The key differences from the workshop: - Topic name: `green-trips` (instead of `rides`) - Datetime columns use `lpep_` prefix (instead of `tpep_`) - You'll need to handle timestamps as strings (not epoch milliseconds) You can convert string timestamps to Flink timestamps in your source DDL: ```sql lpep_pickup_datetime VARCHAR, event_timestamp AS TO_TIMESTAMP(lpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss'), WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '5' SECOND ``` Before running the Flink jobs, create the necessary PostgreSQL tables for your results. Important notes for the Flink jobs: - Place your job files in `workshop/src/job/` - this directory is mounted into the Flink containers at `/opt/src/job/` - Submit jobs with: `docker exec -it workshop-jobmanager-1 flink run -py /opt/src/job/your_job.py` - The `green-trips` topic has 1 partition, so set parallelism to 1 in your Flink jobs (`env.set_parallelism(1)`). With higher parallelism, idle consumer subtasks prevent the watermark from advancing. - Flink streaming jobs run continuously. Let the job run for a minute or two until results appear in PostgreSQL, then query the results. You can cancel the job from the Flink UI at http://localhost:8081 - If you sent data to the topic multiple times, delete and recreate the topic to avoid duplicates: `docker exec -it workshop-redpanda-1 rpk topic delete green-trips` ## Question 4. Tumbling window - pickup location Create a Flink job that reads from `green-trips` and uses a 5-minute tumbling window to count trips per `PULocationID`. Write the results to a PostgreSQL table with columns: `window_start`, `PULocationID`, `num_trips`. 
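
To make the windowing semantics concrete, here is a minimal pure-Python sketch of what a 5-minute tumbling window computes. It is not Flink code and ignores watermarks and late events; the function names and the sample events are hypothetical, and the timestamp format is assumed to match the `'yyyy-MM-dd HH:mm:ss'` pattern used in the source DDL above.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime, size: timedelta = WINDOW) -> datetime:
    """Floor a timestamp to the start of its epoch-aligned tumbling window."""
    return ts - ((ts - EPOCH) % size)


def count_per_window(events):
    """Count events per (window_start, PULocationID) pair."""
    counts = Counter()
    for ts_str, pu_location_id in events:
        ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")
        counts[(window_start(ts), pu_location_id)] += 1
    return counts


# Made-up events: (pickup timestamp as string, pickup location id).
events = [
    ("2025-10-01 08:01:10", 7),
    ("2025-10-01 08:04:59", 7),
    ("2025-10-01 08:05:00", 7),  # falls into the next window
    ("2025-10-01 08:02:30", 9),
]
counts = count_per_window(events)
for (start, location), num_trips in sorted(counts.items()):
    print(start, location, num_trips)
```

Each event belongs to exactly one window, which is what distinguishes a tumbling window from a sliding or session window. The Flink job produces the same kind of `(window_start, PULocationID, num_trips)` rows, but incrementally and driven by the watermark.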
After the job processes all data, query the results: ```sql SELECT PULocationID, num_trips FROM ORDER BY num_trips DESC LIMIT 3; ``` Which `PULocationID` had the most trips in a single 5-minute window? - 42 - 74 - 75 - 166 ## Question 5. Session window - longest streak Create another Flink job that uses a session window with a 5-minute gap on `PULocationID`, using `lpep_pickup_datetime` as the event time with a 5-second watermark tolerance. A session window groups events that arrive within 5 minutes of each other. When there's a gap of more than 5 minutes, the window closes. Write the results to a PostgreSQL table and find the `PULocationID` with the longest session (most trips in a single session). How many trips were in the longest session? - 12 - 31 - 51 - 81 ## Question 6. Tumbling window - largest tip Create a Flink job that uses a 1-hour tumbling window to compute the total `tip_amount` per hour (across all locations). Which hour had the highest total tip amount? - 2025-10-01 18:00:00 - 2025-10-16 18:00:00 - 2025-10-22 08:00:00 - 2025-10-30 16:00:00 ## Submitting the solutions - Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw7 ## Learning in public We encourage everyone to share what they learned. Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and). ## Example post for LinkedIn ``` Week 7 of Data Engineering Zoomcamp by @DataTalksClub complete! Just finished Module 7 - Streaming with PyFlink. Learned how to: - Set up Redpanda as a Kafka replacement - Build Kafka producers and consumers in Python - Create tumbling and session windows in Flink - Analyze real-time taxi trip data with stream processing Here's my homework solution: You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ## Example post for Twitter/X ``` Module 7 of Data Engineering Zoomcamp done! 
- Kafka producers and consumers - PyFlink tumbling and session windows - Real-time taxi data analysis - Redpanda as Kafka replacement My solution: Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ================================================ FILE: cohorts/2026/README.md ================================================ ## Data Engineering Zoomcamp 2026 Cohort * [Pre-launch Q&A stream](https://www.youtube.com/watch?v=WB6b1lcguaA) * [Launch stream with course overview](https://www.youtube.com/watch?v=JgspdlKXS-w) * [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html) * [Course Playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb) [**Module 1: Introduction & Prerequisites**](01-docker-terraform/) * [Homework](01-docker-terraform/homework.md) [**Module 2: Workflow Orchestration**](02-workflow-orchestration) * [Homework](02-workflow-orchestration/homework.md) * Office hours [**Workshop 1: Data Ingestion**](workshops/dlt/README.md) * Workshop with dlt * [Homework](workshops/dlt/README.md) [**Workshop 2: AI-Assisted Data Ingestion with dlt**](workshops/dlt.md) * [Workshop details and registration](workshops/dlt.md) [**Module 3: Data Warehouse**](03-data-warehouse) * [Homework](03-data-warehouse/homework.md) [**Module 4: Analytics Engineering**](04-analytics-engineering/) * [Homework](04-analytics-engineering/homework.md) [**Module 5: Data Platforms**](05-data-platforms/) * [Homework](05-data-platforms/homework.md) [**Module 6: Batch processing**](06-batch/) * [Homework](06-batch/homework.md) [**Module 7: Stream Processing**](07-streaming) * [Homework](07-streaming/homework.md) [**Project**](project.md) More information [here](project.md) ================================================ FILE: cohorts/2026/project.md ================================================ ## Course Project The goal of this project is to apply everything we learned in this course and build an end-to-end data 
pipeline. You will have two attempts to submit your project. If you don't have time to submit your project by the end of attempt #1 (you started the course late, you have vacation plans, life/work got in the way, etc.) or you fail your first attempt, then you will have a second chance to submit your project as attempt #2. There are only two attempts. Remember that to pass the project, you must evaluate 3 peers. If you don't do that, your project can't be considered complete. To find the projects assigned to you, use the peer review assignments link and find your hash in the first column. You will see three rows: you need to evaluate each of these projects. For each project, you need to submit the form once, so in total, you will make three submissions. ### Submitting #### Project Attempt #1 * Project: https://courses.datatalks.club/de-zoomcamp-2026/project/project1 * Review: https://courses.datatalks.club/de-zoomcamp-2026/project/project1/eval #### Project Attempt #2 * Project: https://courses.datatalks.club/de-zoomcamp-2026/project/project2 * Review: https://courses.datatalks.club/de-zoomcamp-2026/project/project2/eval > **Important**: update your "Certificate name" here: https://courses.datatalks.club/de-zoomcamp-2026/enrollment - this is what we will use when generating certificates for you. ### Evaluation criteria See [here](../../projects/README.md) ================================================ FILE: cohorts/2026/workshops/dlt/README.md ================================================ # From APIs to Warehouses: AI-Assisted Data Ingestion with dlt Welcome to the **Data Engineering Zoomcamp 2026** workshop! In this workshop, you'll use an AI-powered IDE to build a complete data pipeline. Using simple prompts, you can go from an API to a local data warehouse with [dlt](https://dlthub.com/docs) (data load tool). The AI handles the code generation. You focus on the results. ## What You'll Build By the end of this workshop, you will have: 1. 
A working dlt pipeline that extracts data from the [Open Library API](https://openlibrary.org/developers/api)
2. Normalized relational tables stored in DuckDB
3. The ability to query, inspect, and visualize your data
4. Experience using AI-assisted development for data engineering

**No API key required!** The Open Library API is completely open and doesn't require authentication. You can start building immediately.

---

## Prerequisites

Before the workshop, make sure you have the following set up:

### 1. Understand What dlt Does (Recommended for Beginners)

If you're unfamiliar with dlt and what the library does, we recommend reading through the included Jupyter notebook before the workshop.

**[Open the notebook in Google Colab](https://colab.research.google.com/github/anair123/data-engineering-zoomcamp/blob/workshop/dlt_2026/cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb)**

It walks through dlt step by step:

- What a dlt source and pipeline are
- How data moves through Extract, Normalize, and Load
- How to inspect the loaded data

Understanding these concepts will help you know what the agent-generated code is actually doing.

> You do not need to clone the repo to follow the workshop. The `dlt init` command scaffolds everything you need.

### 2. An Agentic IDE

You'll need an AI-powered code editor that can understand context and generate code from natural language. We recommend:

| IDE | Description |
|-----|-------------|
| [**Cursor**](https://cursor.sh) | VS Code fork with built-in AI assistance (recommended) |
| [Windsurf](https://codeium.com/windsurf) | Alternative agentic IDE |
| [VS Code + GitHub Copilot](https://github.com/features/copilot) | Works, but less integrated |

### 3. Python 3.11+

```bash
python --version  # Should be 3.11 or higher
```

### 4. uv (Recommended) or pip

We use [uv](https://docs.astral.sh/uv/) for fast dependency management:

```bash
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
```

---

## Workshop Instructions

### Step 1: Create a New Project Folder

Create a fresh folder for your pipeline and open it in Cursor (or your preferred agentic IDE):

```bash
mkdir my-dlt-pipeline
cd my-dlt-pipeline
```

### Step 2: Add the dlt MCP Server Config

Choose the setup for your IDE:

Cursor - go to **Settings → Tools & MCP → New MCP Server** and add:

```json
{
  "mcpServers": {
    "dlt": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "dlt[duckdb]",
        "--with",
        "dlt-mcp[search]",
        "python",
        "-m",
        "dlt_mcp"
      ]
    }
  }
}
```

VS Code (Copilot) - create `.vscode/mcp.json` in your project folder:

```json
{
  "servers": {
    "dlt": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "dlt[duckdb]",
        "--with",
        "dlt-mcp[search]",
        "python",
        "-m",
        "dlt_mcp"
      ]
    }
  }
}
```

Claude Code - run in your terminal:

```bash
claude mcp add dlt -- uv run --with "dlt[duckdb]" --with "dlt-mcp[search]" python -m dlt_mcp
```

This enables the dlt MCP server, which gives the AI access to dlt documentation, code examples, and your pipeline metadata.

### Step 3: Install dlt Workspace

```bash
pip install "dlt[workspace]"
```

### Step 4: Initialize the dlt Project

```bash
dlt init dlthub:open_library duckdb
```

This scaffolds the pipeline files and configuration for Open Library. You now have everything you need to start prompting.

> 📖 **Reference:** [Open Library Workspace Instructions](https://dlthub.com/workspace/source/open-library)

### Step 5: Prompt the Agent to Build and Run the Pipeline

This is where the magic happens. The `dlt init` command scaffolds sample prompts you can use. Here's an example to get started:

```
Please generate a REST API Source for Open Library API, as specified in @open_library-docs.yaml
Start with endpoint(s) books and skip incremental loading for now.
Place the code in open_library_pipeline.py and name the pipeline open_library_pipeline.
If the file exists, use it as a starting point.
Do not add or modify any other files.
Use @dlt rest api as a tutorial.
After adding the endpoints, allow the user to run the pipeline with python open_library_pipeline.py and await further instructions.
```

Feel free to tweak the prompt based on your objective.

The agent will:

1. Generate the pipeline code
2. Run the pipeline
3. Load data into your local DuckDB database

All from a single prompt.

### Step 6: Debug with the Agent

If there are any errors, paste them into the chat and let the AI resolve them. This is the power of AI-assisted development: you iterate quickly without getting stuck.

### Step 7: Inspect Pipeline Data with the dlt Dashboard

Once your pipeline runs successfully, launch the dashboard to inspect your data and metadata:

```bash
dlt pipeline open_library_pipeline show
```

This opens a web app where you can:

- View pipeline state and run history
- Explore schemas, tables, and columns
- Query the loaded data
- Debug any issues

> 📖 **Reference:** [dlt Dashboard Documentation](https://dlthub.com/docs/general-usage/dashboard)

### Step 8: Inspect the Pipeline via Chat

With the dlt MCP server configured, you can ask the AI about your pipeline directly:

> "What tables were created in the pipeline?"
> "Show me the schema for the books table."
> "How many rows were loaded?"

The agent has access to your pipeline metadata and can answer these questions.

### Step 9 (Bonus): Build Visualizations with marimo + ibis

Take your analysis further by creating interactive reports with [marimo](https://marimo.io/) notebooks and [ibis](https://ibis-project.org/).

Prompt the agent to build a visualization:

> "Create a marimo notebook that visualizes the top 10 authors by book count. Use ibis for data access. Reference: https://dlthub.com/docs/general-usage/dataset-access/marimo"

By providing the docs link, the agent will use the correct stack.

Run your notebook:

```bash
# Edit mode (for development)
marimo edit your_notebook.py

# Run mode (view the report)
marimo run your_notebook.py
```

> 📖 **Reference:** [Explore Data with marimo](https://dlthub.com/docs/general-usage/dataset-access/marimo)

---

## Homework

You've seen me do it, now it's your turn! See [dlt_homework.md](dlt_homework.md) for instructions.

---

## Resources

| Resource | Link |
|----------|------|
| dlt Documentation | [dlthub.com/docs](https://dlthub.com/docs) |
| Open Library Workspace Guide | [dlthub.com/workspace/source/open-library](https://dlthub.com/workspace/source/open-library) |
| dlt Dashboard Docs | [dlthub.com/docs/general-usage/dashboard](https://dlthub.com/docs/general-usage/dashboard) |
| marimo + dlt Guide | [dlthub.com/docs/general-usage/dataset-access/marimo](https://dlthub.com/docs/general-usage/dataset-access/marimo) |
| Open Library API | [openlibrary.org/developers/api](https://openlibrary.org/developers/api) |

---

*Workshop by [dltHub](https://dlthub.com) for the Data Engineering Zoomcamp 2026*

================================================
FILE: cohorts/2026/workshops/dlt/analysis.py
================================================
import marimo

__generated_with = "0.19.9"
app = marimo.App(width="medium")


@app.cell
def _():
    import marimo as mo
    import dlt
    import ibis
    import altair as alt
    from dlt.helpers.marimo import render, load_package_viewer
    return alt, dlt, ibis, load_package_viewer, mo, render


@app.cell
def _(mo):
    mo.md(r"""
    # 📚 Open Library Harry Potter Books Analysis

    This notebook analyzes Harry Potter-related books from the Open Library API using dlt's dataset interface.
    """)
    return


@app.cell
def _(dlt):
    # Access the pipeline and dataset using dlt's native interface
    pipeline = dlt.attach("open_library_pipeline")
    dataset = pipeline.dataset()

    # Get ibis connection for rich data exploration
    ibis_con = dataset.ibis()
    return (ibis_con,)


@app.cell
async def _(load_package_viewer, render):
    # Display the dlt package viewer widget
    await render(load_package_viewer)
    return


@app.cell
def _(mo):
    mo.md(r"""
    ## 📊 Books by Author
    """)
    return


@app.cell
def _(alt, ibis, ibis_con):
    # Query for books by author (top 15) using ibis
    author_table = ibis_con.table("books__author_name")
    author_query = (
        author_table
        .group_by("value")
        .agg(book_count=author_table.value.count())
        .order_by(ibis.desc("book_count"))
        .limit(15)
    )
    author_df = author_query.to_pandas()
    author_df = author_df.rename(columns={"value": "author"})

    # Bar chart for authors
    author_chart = alt.Chart(author_df).mark_bar(color="#6366f1").encode(
        x=alt.X("book_count:Q", title="Number of Books"),
        y=alt.Y("author:N", sort="-x", title="Author"),
        tooltip=["author", "book_count"]
    ).properties(
        title="Top 15 Authors by Number of Books",
        width=600,
        height=400
    )

    author_chart
    return


@app.cell
def _(mo):
    mo.md(r"""
    ## 📈 Books Published Per Year
    """)
    return


@app.cell
def _(alt, ibis_con):
    # Query for books by year using ibis
    books_table = ibis_con.table("books")
    year_query = (
        books_table
        .filter((books_table.first_publish_year >= 1997) & (books_table.first_publish_year <= 2025))
        .group_by("first_publish_year")
        .agg(books=books_table.first_publish_year.count())
        .order_by("first_publish_year")
    )
    year_df = year_query.to_pandas()
    year_df = year_df.rename(columns={"first_publish_year": "year"})

    # Line chart for publication years
    year_chart = alt.Chart(year_df).mark_line(
        point=True,
        color="#10b981"
    ).encode(
        x=alt.X("year:O", title="Year"),
        y=alt.Y("books:Q", title="Number of Books"),
        tooltip=["year", "books"]
    ).properties(
        title="Harry Potter-Related Books Published Per Year (1997-2025)",
        width=700,
        height=350
    )

    year_chart
    return


@app.cell
def _(mo):
    mo.md(r"""
    ## 🌍 Books by Language
    """)
    return


@app.cell
def _(alt, ibis, ibis_con):
    # Query for books by language using ibis
    lang_table = ibis_con.table("books__language")
    lang_query = (
        lang_table
        .group_by("value")
        .agg(count=lang_table.value.count())
        .order_by(ibis.desc("count"))
        .limit(10)
    )
    language_df = lang_query.to_pandas()

    # Map language codes to full names
    lang_map = {
        'eng': 'English', 'ger': 'German', 'fre': 'French', 'spa': 'Spanish',
        'ita': 'Italian', 'chi': 'Chinese', 'por': 'Portuguese', 'rus': 'Russian',
        'kor': 'Korean', 'pol': 'Polish'
    }
    language_df["language"] = language_df["value"].map(lambda x: lang_map.get(x, x))

    # Pie chart for languages
    language_chart = alt.Chart(language_df).mark_arc(innerRadius=50).encode(
        theta=alt.Theta("count:Q", title="Count"),
        color=alt.Color("language:N", title="Language", scale=alt.Scale(scheme="tableau10")),
        tooltip=["language", "count"]
    ).properties(
        title="Proportion of Books by Language (Top 10)",
        width=400,
        height=400
    )

    language_chart
    return


@app.cell
def _(mo):
    mo.md(r"""
    ## 📋 Summary Statistics

    Key insights from the Open Library Harry Potter books dataset.
    """)
    return


@app.cell
def _(ibis_con, mo):
    # Get summary stats using ibis
    total_books = ibis_con.table("books").count().to_pandas()
    total_authors = ibis_con.table("books__author_name").value.nunique().to_pandas()
    total_languages = ibis_con.table("books__language").value.nunique().to_pandas()

    mo.md(f"""
    | Metric | Value |
    |--------|-------|
    | **Total Books** | {total_books:,} |
    | **Unique Authors** | {total_authors:,} |
    | **Languages** | {total_languages} |
    """)
    return


@app.cell
def _():
    return


@app.cell
def _():
    return


if __name__ == "__main__":
    app.run()

================================================
FILE: cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb
================================================
{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "bPVVve29bu6Z" }, "source": [ "# Building a Data Pipeline with dlt\n", "\n", "In this notebook, we will build a complete data pipeline from scratch using **dlt**.\n", "\n", "Our goal is simple:\n", "\n", "→ Fetch real data from an API \n", "→ Turn it into clean relational tables \n", "→ Load it into a database \n", "→ Explore and analyze it \n", "\n", "We will use the **Open Library API** as our data source and **DuckDB** as our database.\n", "\n", "Along the way, you will learn:\n", "\n", "- What a dlt source is \n", "- What a dlt pipeline does \n", "- How data moves through Extract → Normalize → Load \n", "- How to inspect and explore the final dataset \n", "\n", "By the end, you will understand not just how to run a pipeline, but what happens at each stage.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "u9eCv60qV5PS" }, "source": [ "## 📦 Step 0: Install Dependencies\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "Arp4d7KZNRTS" }, "outputs": [], "source": [ "# install dependencies first (quote the extra so shells like zsh do not glob the brackets)\n", "!pip -q install \"dlt[duckdb]\"" ] }, { "cell_type": "markdown", "metadata":
{ "id": "x7VGYS5hWNKQ" }, "source": [ "

In this notebook we will use:\n",
    "\n",
    "- dlt to extract, normalize, and load data\n",
    "- DuckDB as the destination database (runs locally inside Colab)\n",
    "\n",
    "DuckDB is great for beginners because it requires no setup and no credentials.\n", "

" ] }, { "cell_type": "markdown", "metadata": { "id": "aQTSvnvnHWBd" }, "source": [ "## 📚 Step 1: Import Libraries" ] }, { "cell_type": "markdown", "metadata": { "id": "YFQGLTECWkpn" }, "source": [ "\n", "

In this cell we import the libraries we will use throughout the notebook:\n",
    "\n",
    "- `dlt` is the main library for building and running the pipeline\n",
    "- `rest_api_source` helps us define an API source using a simple configuration\n",
    "- `islice` (from `itertools`) is a small Python helper for previewing only a few records\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "Lm8AbbHBImjI" }, "outputs": [], "source": [ "import dlt\n", "from itertools import islice\n", "from dlt.sources.rest_api import rest_api_source" ] }, { "cell_type": "markdown", "metadata": { "id": "UFoBTwDVhzRL" }, "source": [ "## 🔗 Step 2: Define the API Source (Open Library)" ] }, { "cell_type": "markdown", "metadata": { "id": "VdKrEM-VXEY2" }, "source": [ "In dlt, a source is the part of your pipeline that knows how to fetch data from somewhere.\n",
    "In this notebook, our source fetches data from the Open Library Search API.\n",
    "\n",
    "We define the source using `rest_api_source`, which lets us describe an API in a simple\n",
    "Python dictionary instead of writing lots of request code.\n",
    "\n",
    "📖 Open Library Search API docs: https://openlibrary.org/dev/docs/api/search\n", "

" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "hOxkEKy4Kaj4" }, "outputs": [], "source": [ "def openlibrary_source(query: str = \"harry potter\"):\n", "\n", " return rest_api_source({\n", " \"client\": {\n", " \"base_url\": \"https://openlibrary.org\",\n", " },\n", " \"resource_defaults\": {\n", " \"primary_key\": \"key\",\n", " \"write_disposition\": \"replace\",\n", " },\n", " \"resources\": [\n", " {\n", " \"name\": \"books\",\n", " \"endpoint\": {\n", " \"path\": \"search.json\",\n", " \"params\": {\n", " \"q\": query,\n", " \"limit\": 100,\n", " },\n", " \"data_selector\": \"docs\",\n", " \"paginator\": {\n", " \"type\": \"offset\",\n", " \"limit\": 100,\n", " \"offset_param\": \"offset\",\n", " \"limit_param\": \"limit\",\n", " \"total_path\": \"numFound\",\n", " },\n", " },\n", " },\n", " ],\n", " })\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ntKAVaEGYFgw" }, "source": [ "## 🔧 Step 3: Create the dlt Pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bxpFEetGh3lS" }, "outputs": [], "source": [ "pipeline = dlt.pipeline(\n", " pipeline_name=\"ol_demo\",\n", " destination=\"duckdb\",\n", " dataset_name=\"ol_data\",\n", " progress=\"log\" # logs the pipeline run (Optiona)\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "y7CJ9A2HXsFb" }, "source": [ "## 🔍 Understanding the Pipeline\n", "\n", "At this point we have defined two key building blocks:\n", "\n", "- **The source** describes where the data comes from and how to fetch it from the API. \n", "- **The pipeline** describes where the data should go (DuckDB) and keeps track of tables, schemas, and run history. \n", "\n", "---\n", "\n", "Instead of running everything at once, we will now run the pipeline in three separate phases so you can clearly see what happens at each stage:\n", "\n", "1. **Extract**: download raw data from the API \n", "2. **Normalize**: turn nested JSON into relational tables \n", "3. 
**Load**: write those tables into DuckDB \n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![ETL Diagram](./images/etl_diagram.png)" ] }, { "cell_type": "markdown", "metadata": { "id": "pAYgUUJIw-c4" }, "source": [ "Once these steps make sense, we will run the full workflow again using one command:\n", "\n", "```python\n", "pipeline.run(source)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "JsfcBA7McJMo" }, "source": [ "## ⬇️ Step 4: Extract\n", "\n", "Now we run the first stage of the pipeline: **Extract**.\n", "\n", "Extract means:\n", "\n", "- dlt sends requests to the Open Library API\n", "- the raw JSON responses are downloaded\n", "- the results are stored in dlt’s local working folder\n", "\n", "At this stage, the data is **not** in DuckDB yet. We are just confirming that we successfully pulled data from the API." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "yifCIPxSKJZ4" }, "outputs": [], "source": [ "extract_info = pipeline.extract(openlibrary_source())" ] }, { "cell_type": "markdown", "metadata": { "id": "NLRRVLnLcNgl" }, "source": [ "---\n", "\n", "### What we will print\n", "\n", "After extraction, we will print a small summary showing:\n", "\n", "- which **resources** were extracted\n", "- which **tables** will be created later\n", "- how many rows were extracted per resource\n", "\n", "This helps confirm that the pipeline is working before we move on to normalization." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wtDasHRNNNN0", "outputId": "51c71eeb-5435-40a1-8728-ea48c59bfd58" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Resources: ['books']\n", "Tables: ['books']\n", "Load ID: 1770907406.962898\n", "\n", "Resource: books\n", "rows extracted: 3756\n", "\n" ] } ], "source": [ "load_id = extract_info.loads_ids[-1]\n", "m = extract_info.metrics[load_id][0]\n", "\n", "print(\"Resources:\", list(m[\"resource_metrics\"].keys()))\n", "print(\"Tables:\", list(m[\"table_metrics\"].keys()))\n", "print(\"Load ID:\", load_id)\n", "print()\n", "\n", "for resource, rm in m[\"resource_metrics\"].items():\n", "    print(f\"Resource: {resource}\")\n", "    print(f\"rows extracted: {rm.items_count}\")\n", "    print()" ] }, { "cell_type": "markdown", "metadata": { "id": "f6MwYtznc3UX" }, "source": [ "### What you should see after Extract\n", "\n", "In our case, Extract shows only **one resource and one table**:\n", "\n", "- **Resources:** `['books']` \n", "- **Tables:** `['books']`\n", "\n", "That is expected.\n", "\n", "The `search` endpoint returns a list of book results, so dlt stores those rows in a single table called `books`. The interesting part comes next, because many fields inside each row are lists or nested objects. Those will turn into additional tables during **Normalize**.\n", "\n", "Example output:\n", "\n", "- **3756 rows extracted** means we pulled 3,756 search results (books) \n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": { "id": "lQVLZMcyXWkm" }, "source": [ "## 🔄 Step 5: Normalize\n", "\n", "Now we run **Normalize**. This is where dlt transforms raw JSON into a clean relational structure.\n", "\n", "During normalization, dlt does three key things:\n", "\n", "### 1. 
Adds Tracking Columns to the Main Table\n", "\n", "dlt adds special columns to every table:\n", "- `_dlt_id`: A unique identifier for each row\n", "- `_dlt_load_id`: Links each row to the load job that created it\n", "\n", "### 2. Flattens Nested Data into Child Tables\n", "\n", "APIs often return nested JSON. For example, a book can have multiple authors (a list), multiple editions, and multiple identifiers.\n", "\n", "dlt flattens these nested structures into separate **child tables** with names like:\n", "- `books__author_name`\n", "- `books__author_key`\n", "- `books__language`\n", "\n", "Each child table has a `_dlt_parent_id` column that references `_dlt_id` in the parent table. This is how dlt maintains relationships.\n", "\n", "### 3. Creates Metadata Tables\n", "\n", "dlt also creates internal tables to track pipeline state:\n", "- `_dlt_loads`: Tracks load history (when data was loaded, status)\n", "- `_dlt_pipeline_state`: Stores pipeline state for incremental loading\n", "- `_dlt_version`: Tracks schema versions\n", "\n", "In the next cell, we will print a summary showing which tables were created.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "LCmiiG3tXXwh" }, "outputs": [], "source": [ "normalize_info = pipeline.normalize()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-kNiY112Xvuk", "outputId": "502bff6b-edb2-4bd8-a9e9-1f1b88f20c48" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Load ID: 1770907406.962898\n", "\n", "Tables created/updated:\n", " - books: 3756 rows\n", " - books__author_key: 4600 rows\n", " - books__author_name: 4600 rows\n", " - books__ia: 3422 rows\n", " - books__ia_collection: 2724 rows\n", " - books__language: 3748 rows\n", " - books__id_standard_ebooks: 12 rows\n", " - books__id_librivox: 60 rows\n", " - books__id_project_gutenberg: 54 rows\n" ] } ], "source": [ "load_id = 
normalize_info.loads_ids[-1]\n", "m = normalize_info.metrics[load_id][0]\n", "\n", "print(\"Load ID:\", load_id)\n", "print()\n", "\n", "print(\"Tables created/updated:\")\n", "for table_name, tm in m[\"table_metrics\"].items():\n", "    # skip dlt internal tables to keep it beginner-friendly\n", "    if table_name.startswith(\"_dlt\"):\n", "        continue\n", "    print(f\" - {table_name}: {tm.items_count} rows\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ctHuJ0yEdNaq" }, "source": [ "### What happened during Normalize?\n", "\n", "After running `pipeline.normalize()`, we now see multiple tables instead of just one.\n", "\n", "Tables created/updated:\n", "\n", "- `books`\n", "- `books__author_key`\n", "- `books__author_name`\n", "- `books__editions__docs`\n", "- `books__editions__docs__language`\n", "- `books__ia`\n", "\n", "---\n", "\n", "### What does this mean?\n", "\n", "We started with **N book search results** in the `books` table.\n", "\n", "During normalization:\n", "\n", "- Each book may have **more than one author**, so those were split into:\n", "  - `books__author_name`\n", "  - `books__author_key`\n", "\n", "- Each book may contain **edition information**, which became:\n", "  - `books__editions__docs`\n", "\n", "- Some editions contain **language information**, which became:\n", "  - `books__editions__docs__language`\n", "\n", "- The `ia` field (Internet Archive IDs) is a list, so it became:\n", "  - `books__ia`\n", "\n", "This is the key moment in the pipeline.\n", "\n", "The data has been transformed from nested JSON into a **relational structure** with multiple linked tables. This makes it much easier to query and analyze.\n", "\n", "---\n", "\n", "### Schema Visualization\n", "\n", "dlt can render the schema as a visual diagram. Run the next cell to see the parent-child table relationships:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display schema \n", "pipeline.default_schema" ] }, { "cell_type": "markdown", "metadata": { "id": "lJ5QzSnYdidK" }, "source": [ "## 📤 Step 6: Load\n", "\n", "Now we run the final stage of the pipeline: **Load**.\n", "\n", "Load means:\n", "\n", "- dlt creates tables in DuckDB (if they do not already exist)\n", "- the normalized rows are inserted into those tables\n", "- the pipeline records the load in its internal tracking tables\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "d9Xb67c5XfL5" }, "outputs": [], "source": [ "load_info = pipeline.load()" ] }, { "cell_type": "markdown", "metadata": { "id": "ehkz8lESGGdm" }, "source": [ "\n", "After this step, the data is fully stored in the database and ready to query.\n", "\n", "At this point:\n", "\n", "- The `books` table contains our books\n", "- The related tables (such as `books__author_name` and `books__editions__docs`) contain the exploded nested data\n", "- Everything is now queryable using `pipeline.dataset()` or SQL\n", "\n", "This is the moment where the data officially moves from “pipeline processing” into a database you can explore." ] }, { "cell_type": "markdown", "metadata": { "id": "jBznxM00eCOF" }, "source": [ "## 🚀 Step 7: Run the Full Pipeline\n", "\n", "Now that we have walked through each step individually, we can run the entire workflow using a single command:\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "YQLigkh-f7Ey" }, "outputs": [], "source": [ "load_info = pipeline.run(openlibrary_source())" ] }, { "cell_type": "markdown", "metadata": { "id": "SbLkA8W7eNPb" }, "source": [ "

### What does `pipeline.run()` do?\n",
    "\n",
    "`pipeline.run()` simply combines the three steps we already executed manually:\n",
    "\n",
    "1. Extract – fetch data from the Open Library API\n",
    "2. Normalize – convert nested JSON into relational tables\n",
    "3. Load – write those tables into DuckDB\n",
    "\n",
    "In other words, this:\n",
    "\n",
    "    pipeline.run(source)\n",
    "\n",
    "is equivalent to:\n",
    "\n",
    "    pipeline.extract(source)\n",
    "    pipeline.normalize()\n",
    "    pipeline.load()\n",
    "\n",
    "There is no hidden magic. It just runs the full ELT process in order.\n", "

\n" ] }, { "cell_type": "markdown", "metadata": { "id": "7ViMq6gIfJj_" }, "source": [ "## 🔎 Step 8: Inspect the Loaded Data\n", "\n", "Now that the data is loaded into DuckDB, we can inspect it using `pipeline.dataset()`.\n", "\n", "This gives us a convenient Python interface for exploring the tables that dlt created, without writing SQL.\n", "\n", "---\n", "\n", "### List available tables\n", "\n", "First, let’s see what tables exist in the dataset:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "bmnrK1aVZXPO" }, "outputs": [], "source": [ "ds = pipeline.dataset()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "SV6J6AtBf0xq", "outputId": "19ad26bf-f34a-4f8e-c30c-5acd3342c3c5" }, "outputs": [ { "data": { "text/plain": [ "['books',\n", " 'books__author_key',\n", " 'books__author_name',\n", " 'books__ia',\n", " 'books__ia_collection',\n", " 'books__language',\n", " 'books__id_standard_ebooks',\n", " 'books__id_librivox',\n", " 'books__id_project_gutenberg',\n", " '_dlt_version',\n", " '_dlt_loads',\n", " '_dlt_pipeline_state']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.tables" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 315 }, "id": "WLa4yN7lf1TF", "outputId": "d2da841b-a8bf-461f-a011-eb1db644656f" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "summary": "{\n \"name\": \"df\",\n \"rows\": 3756,\n \"fields\": [\n {\n \"column\": \"cover_edition_key\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1192,\n \"samples\": [\n \"OL24951484M\",\n \"OL9131663M\",\n \"OL47198575M\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cover_i\",\n \"properties\": {\n \"dtype\": \"Int64\",\n \"num_unique_values\": 1288,\n \"samples\": [\n 842156,\n 
10365881,\n 3341732\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"ebook_access\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"printdisabled\",\n \"unclassified\",\n \"no_ebook\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"edition_count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 108,\n \"min\": 0,\n \"max\": 3546,\n \"num_unique_values\": 62,\n \"samples\": [\n 44,\n 92,\n 396\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"first_publish_year\",\n \"properties\": {\n \"dtype\": \"Int64\",\n \"num_unique_values\": 127,\n \"samples\": [\n 2008,\n 1622,\n 1962\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"has_fulltext\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n false,\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"key\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3756,\n \"samples\": [\n \"/works/OL34662215W\",\n \"/works/OL39702699W\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"lending_edition_s\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 281,\n \"samples\": [\n \"OL45637056M\",\n \"OL26064272M\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"lending_identifier_s\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 281,\n \"samples\": [\n \"alicesadventures0000unse_v7d2\",\n \"harrypottermagic0000unse_n5w6\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"public_scan_b\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": 
\"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2984,\n \"samples\": [\n \"1000 Facts and Trivia about Marvel Cinematic Universe, Game of Thrones, Disney, Star Wars, Harry Potter 1\",\n \"The Unofficial Harry Potter Insults Handbook\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"_dlt_load_id\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"1770819876.9353185\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"_dlt_id\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3756,\n \"samples\": [\n \"ZN3UfCkWBXFxSw\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"subtitle\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 59,\n \"samples\": [\n \"Hogwarts Through the Years\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", "type": "dataframe", "variable_name": "df" }, "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | cover_edition_key | cover_i | ebook_access | edition_count | first_publish_year | has_fulltext | key | lending_edition_s | lending_identifier_s | public_scan_b | title | _dlt_load_id | _dlt_id | subtitle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OL61027601M | 15155833 | borrowable | 396 | 1997 | True | /works/OL82563W | OL38565767M | harrypotterylapi0000rowl_q5r6 | False | Harry Potter and the Philosopher's Stone | 1770819876.9353185 | lGJrV2BS8Z9qJQ | None |
| 1 | OL26378158M | 15158660 | printdisabled | 144 | 2007 | True | /works/OL82586W | None | None | False | Harry Potter and the Deathly Hallows | 1770819876.9353185 | F9W0WQlLwgvsFw | None |
| 2 | OL26234270M | 10580435 | borrowable | 278 | 1999 | True | /works/OL82536W | OL48101764M | bdrc-W8LS66814 | False | Harry Potter and the Prisoner of Azkaban | 1770819876.9353185 | kSdfO1XbBVAjmQ | None |
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", "
\n" ], "text/plain": [ " cover_edition_key cover_i ebook_access edition_count \\\n", "0 OL61027601M 15155833 borrowable 396 \n", "1 OL26378158M 15158660 printdisabled 144 \n", "2 OL26234270M 10580435 borrowable 278 \n", "\n", " first_publish_year has_fulltext key lending_edition_s \\\n", "0 1997 True /works/OL82563W OL38565767M \n", "1 2007 True /works/OL82586W None \n", "2 1999 True /works/OL82536W OL48101764M \n", "\n", " lending_identifier_s public_scan_b \\\n", "0 harrypotterylapi0000rowl_q5r6 False \n", "1 None False \n", "2 bdrc-W8LS66814 False \n", "\n", " title _dlt_load_id \\\n", "0 Harry Potter and the Philosopher's Stone 1770819876.9353185 \n", "1 Harry Potter and the Deathly Hallows 1770819876.9353185 \n", "2 Harry Potter and the Prisoner of Azkaban 1770819876.9353185 \n", "\n", " _dlt_id subtitle \n", "0 lGJrV2BS8Z9qJQ None \n", "1 F9W0WQlLwgvsFw None \n", "2 kSdfO1XbBVAjmQ None " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = ds.books.df() # main table\n", "df.head(3)" ] }, { "cell_type": "markdown", "metadata": { "id": "OWFqaH2wgCWR" }, "source": [ "## 💡 Conclusion\n", "\n", "### What dlt handled for us\n", "\n", "✔ API requests \n", "✔ JSON normalization \n", "✔ Table creation \n", "✔ Database loading \n", "✔ Simple dataset inspection \n", "\n", "---\n", "\n", "### But there are still friction points\n", "\n", "• Getting the REST API config exactly right \n", "• Remembering paginator syntax \n", "• Remembering how to inspect tables \n", "• Debugging schema or pagination issues \n", "• Writing Python or SQL to get insights \n", "\n", "It works... 
but it still takes effort.\n", "\n", "---\n", "\n", "## 🚀 Next Up: LLM-Powered Workflows\n", "\n", "dlt now integrates LLMs directly into the workflow to make:\n", "\n", "• Pipeline runs easier \n", "• Debugging faster \n", "• Schema inspection simpler \n", "• Data analysis more natural \n", "\n", "Instead of writing glue code, you can use natural language.\n", "\n", "In the workshop, we will see what that looks like.\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "BweSVO3igErN" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 4 }

================================================
FILE: cohorts/2026/workshops/dlt/dlt_homework.md
================================================
# Homework: Build Your Own dlt Pipeline

You've seen how to build a pipeline with a scaffolded source. Now it's your turn to do it from scratch with a **custom API**.

## Workshop Content

* [Workshop README](README.md)
* [dlt Pipeline Overview Notebook (Google Colab)](https://colab.research.google.com/github/anair123/data-engineering-zoomcamp/blob/workshop/dlt_2026/cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb)
* [Workshop registration page](https://luma.com/hzis1yzp)

## The Challenge

For this homework, build a dlt pipeline that loads NYC taxi trip data from a custom API into DuckDB and then answer some questions using the loaded data.

## Data Source

You'll be working with **NYC Yellow Taxi trip data** from a custom API (not available as a dlt scaffold). This dataset contains records of individual taxi trips in New York City.
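
The API is paginated and, as the table below notes, signals the end of the data with an empty page. A minimal plain-Python sketch of that stopping rule follows. It is an illustration only, not the homework solution: `iter_records` and `fake_fetch_page` are hypothetical names, the `trip_id` field is made up, and the fake fetcher stands in for a real HTTP call so the loop runs without network access.

```python
from typing import Callable, Iterator

Record = dict
PageFetcher = Callable[[int], list[Record]]


def iter_records(fetch_page: PageFetcher, start_page: int = 1) -> Iterator[Record]:
    """Yield records page by page until the fetcher returns an empty page."""
    page = start_page
    while True:
        records = fetch_page(page)
        if not records:  # an empty page marks the end of the data
            break
        yield from records
        page += 1


def fake_fetch_page(page: int) -> list[Record]:
    """Hypothetical stand-in for an HTTP request: 2,500 rows served 1,000 per page."""
    total, page_size = 2500, 1000
    start = (page - 1) * page_size
    return [{"trip_id": i} for i in range(start, min(start + page_size, total))]


rows = list(iter_records(fake_fetch_page))
print(len(rows))
```

In a real pipeline the agent would normally express this rule through dlt's REST API paginator configuration rather than a hand-written loop; the sketch only shows what "stop on an empty page" means.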
| Property | Value | |----------|-------| | Base URL | `https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api` | | Format | Paginated JSON | | Page Size | 1,000 records per page | | Pagination | Stop when an empty page is returned | ## Setup Instructions Since this API is custom (not one of the scaffolds in dlt workspace), the setup is slightly different. ### Step 1: Create a New Project (or Reuse Your Demo Project) If you already created a project folder while following along with the workshop demo, you can reuse that folder. Otherwise, create a new one: ```bash mkdir taxi-pipeline cd taxi-pipeline ``` Open this folder in Cursor (or your preferred agentic IDE). ### Step 2: Set Up the dlt MCP Server (If Not Already Done) Choose the setup for your IDE: Cursor - go to **Settings → Tools & MCP → New MCP Server** and add: ```json { "mcpServers": { "dlt": { "command": "uv", "args": [ "run", "--with", "dlt[duckdb]", "--with", "dlt-mcp[search]", "python", "-m", "dlt_mcp" ] } } } ``` VS Code (Copilot) - create `.vscode/mcp.json` in your project folder: ```json { "servers": { "dlt": { "command": "uv", "args": [ "run", "--with", "dlt[duckdb]", "--with", "dlt-mcp[search]", "python", "-m", "dlt_mcp" ] } } } ``` Claude Code - run in your terminal: ```bash claude mcp add dlt -- uv run --with "dlt[duckdb]" --with "dlt-mcp[search]" python -m dlt_mcp ``` This enables the dlt MCP server, giving the AI access to dlt documentation, code examples, and your pipeline metadata. ### Step 3: Install dlt ```bash pip install "dlt[workspace]" ``` ### Step 4: Initialize the Project ```bash dlt init dlthub:taxi_pipeline duckdb ``` You can name the project whatever you like. Since this API has no scaffold, the command will create: - The dlt project files - Cursor rules for AI assistance **But no YAML file with API metadata.** You will need to provide the API information yourself. ### Step 5: Prompt the Agent Now use your AI assistant to build the pipeline. 
You'll need to provide the API details in your prompt since there's no scaffold. Here's an example to get you started: ``` Build a REST API source for NYC taxi data. API details: - Base URL: https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api - Data format: Paginated JSON (1,000 records per page) - Pagination: Stop when an empty page is returned Place the code in taxi_pipeline.py and name the pipeline taxi_pipeline. Use @dlt rest api as a tutorial. ``` ### Step 6: Run and Debug Run your pipeline and iterate with the agent until it works: ```bash python taxi_pipeline.py ``` --- ## Questions Once your pipeline has run successfully, use the methods covered in the workshop to investigate the following: - **dlt Dashboard**: `dlt pipeline taxi_pipeline show` - **dlt MCP Server**: Ask the agent questions about your pipeline - **Marimo Notebook**: Build visualizations and run queries We challenge you to try out the different methods explored in the workshop when answering these questions to see what works best for you. Feel free to share your thoughts on what worked (or didn't) in your submission! ### Question 1: What is the start date and end date of the dataset? - 2009-01-01 to 2009-01-31 - 2009-06-01 to 2009-07-01 - 2024-01-01 to 2024-02-01 - 2024-06-01 to 2024-07-01 ### Question 2: What proportion of trips are paid with credit card? - 16.66% - 26.66% - 36.66% - 46.66% ### Question 3: What is the total amount of money generated in tips? 
- $4,063.41 - $6,063.41 - $8,063.41 - $10,063.41 ### Resources | Resource | Link | |----------|------| | dlt Dashboard Docs | [dlthub.com/docs/general-usage/dashboard](https://dlthub.com/docs/general-usage/dashboard) | | marimo + dlt Guide | [dlthub.com/docs/general-usage/dataset-access/marimo](https://dlthub.com/docs/general-usage/dataset-access/marimo) | | dlt Documentation | [dlthub.com/docs](https://dlthub.com/docs) | --- ## Submitting the solutions - Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/dlt - Deadline: See the website ## Tips - The API returns paginated data. Make sure your pipeline handles pagination correctly. - If the agent gets stuck, paste the error into the chat and let it debug. - Use the dlt MCP server to ask questions about your pipeline metadata. ## Learning in Public We encourage everyone to share what they learned. This is called "learning in public". Read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and). ### Example post for LinkedIn ``` 🚀 dlt Workshop of Data Engineering Zoomcamp by @DataTalksClub complete! Just finished the Data Ingestion workshop with @dltHub. Learned how to: ✅ Build REST API data pipelines with dlt ✅ Use AI-assisted development with dlt MCP Server ✅ Load paginated API data into DuckDB ✅ Inspect pipeline data with dlt Dashboard and marimo notebooks Built a full NYC taxi data pipeline from a custom API - AI-assisted data engineering is the future! Here's my homework solution: Following along with this amazing free course - who else is learning data engineering? You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ### Example post for Twitter/X ``` 🔄 dlt Workshop of Data Engineering Zoomcamp done! 
- REST API pipelines with @dltHub - AI-assisted pipeline building - DuckDB as local data warehouse - dlt Dashboard & marimo notebooks My solution: Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/ ``` ================================================ FILE: cohorts/2026/workshops/dlt/open_library_pipeline.py ================================================ """Pipeline to ingest data from the Open Library Search API.""" import dlt from dlt.sources.rest_api import rest_api_source def open_library_source(query: str = "harry potter"): """ Create a dlt source for the Open Library Search API. Args: query: Search query string (default: "harry potter") """ return rest_api_source({ "client": { "base_url": "https://openlibrary.org", }, "resource_defaults": { "primary_key": "key", "write_disposition": "replace", }, "resources": [ { "name": "books", "endpoint": { "path": "search.json", "params": { "q": query, "limit": 100, }, "data_selector": "docs", "paginator": { "type": "offset", "limit": 100, "offset_param": "offset", "limit_param": "limit", "total_path": "numFound", }, }, }, ], }) if __name__ == "__main__": pipeline = dlt.pipeline( pipeline_name="open_library_pipeline", destination="duckdb", dataset_name="open_library_data", progress="log", ) # Load Harry Potter books from Open Library load_info = pipeline.run(open_library_source(query="harry potter")) print(load_info) ================================================ FILE: cohorts/2026/workshops/dlt/pyproject.toml ================================================ [project] name = "zoomcamp-workshop-prep" version = "0.1.0" description = "Add your description here" readme = "README.md" requires-python = ">=3.13" dependencies = [ "altair>=6.0.0", "dlt[workspace]>=1.21.0", "ibis-framework[duckdb]>=12.0.0", "jupyterlab>=4.5.4", "marimo>=0.19.9", ] ================================================ FILE: cohorts/2026/workshops/dlt.md ================================================ # From 
APIs to Warehouses: AI-Assisted Data Ingestion with dlt [Video](https://www.youtube.com/watch?v=5eMytPBgmVs) This hands-on workshop focuses on building reliable data ingestion pipelines to data warehouses (for example, Snowflake) using dlt (data load tool), enhanced with LLMs, the dlt dashboard, and dlt MCP. ## What you'll learn You'll work through the key building blocks of a production-ready ingestion setup, including: - Extracting data from APIs, files, and databases - Normalizing data into consistent schemas - Writing data to a data warehouse (e.g. Snowflake) - Using LLMs to accelerate dlt pipeline development - Validating data and schema changes using the dlt dashboard and dlt MCP The session is fully practical and code-driven. By the end of the workshop, you'll understand how to design maintainable, scalable ingestion pipelines and use AI and validation tools to build them faster and with confidence. ## Materials * [Workshop instructions](dlt/README.md) * [dlt Pipeline Overview Notebook (Google Colab)](https://colab.research.google.com/github/anair123/data-engineering-zoomcamp/blob/workshop/dlt_2026/cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb) * [Homework](dlt/dlt_homework.md) * [Homework submission form](https://courses.datatalks.club/de-zoomcamp-2026/homework/dlt) ## About the Speaker **Aashish Nair** is a Data Engineer at dltHub and the creator of the famous _dlt deployment_ course, where he teaches best practices for running dlt pipelines in production. ================================================ FILE: learning-in-public.md ================================================ # Learning in public Most people learn in private: they consume content but don't tell anyone about it. There's nothing wrong with it. But we want to encourage you to document your progress and share it publicly on social media. 
It helps you get noticed and will lead to: * Expanding your network: meeting new people and making new friends * Being invited to meetups, conferences and podcasts * Landing a job or getting clients * Many other good things Here's a more comprehensive read on why you'd want to do it: https://github.com/readme/guides/publishing-your-work ## Learning in Public for Zoomcamps When you submit your homework or project, you can also submit learning in public posts: You can watch this video to see what your learning in public posts may look like: ## Daily Documentation - **Post Daily Diaries**: Document what you learn each day, including the challenges faced and the methods used to overcome them. - **Create Quick Videos**: Make short videos showcasing your work and upload them to GitHub. Send a PR if you want to suggest improvements for this document ================================================ FILE: projects/README.md ================================================ ## Course Project [🎥 Projects how-to (watch it!)](https://www.youtube.com/watch?v=BL0E8xO8OnE) ### Objective The goal of this project is to apply everything we have learned in this course to build an end-to-end data pipeline. ### Problem statement Develop a dashboard with two tiles by: * Selecting a dataset of interest (see [Datasets](#datasets)) * Creating a pipeline for processing this dataset and putting it in a data lake * Creating a pipeline for moving the data from the lake to a data warehouse * Transforming the data in the data warehouse: prepare it for the dashboard * Building a dashboard to visualize the data ## Data Pipeline The pipeline could be **stream** or **batch**: this is the first thing you'll need to decide * **Stream**: If you want to consume data in real time and put it in a data lake * **Batch**: If you want to run things periodically (e.g. hourly/daily) ## Technologies You don't have to limit yourself to technologies covered in the course. 
You can use alternatives as well: * **Cloud**: AWS, GCP, Azure, ... * **Infrastructure as code (IaC)**: Terraform, Pulumi, CloudFormation, ... * **Workflow orchestration**: Airflow, Prefect, Luigi, ... * **Data Warehouse**: BigQuery, Snowflake, Redshift, ... * **Batch processing**: Spark, Flink, AWS Batch, ... * **Stream processing**: Kafka, Pulsar, Kinesis, ... If you use a tool that wasn't covered in the course, be sure to explain what that tool does. If you're not certain about some tools, ask in Slack. ## Dashboard You can use any of the tools shown in the course (Looker Studio or Streamlit) or any other BI tool of your choice to build a dashboard. If you do use another tool, please specify which one and make sure that the dashboard is accessible to your peers. Your dashboard should contain at least two tiles. We suggest you include: - 1 graph that shows the distribution of some categorical data - 1 graph that shows the distribution of the data over time Ensure that your graphs are easy to understand by adding references and titles. Example dashboard: ![image](https://user-images.githubusercontent.com/4315804/159771458-b924d0c1-91d5-4a8a-8c34-f36c25c31a3c.png) ## Peer reviewing > [!IMPORTANT] > To evaluate the projects, we'll use peer reviewing. This is a great opportunity for you to learn from each other. 
> * To get points for your project, you need to evaluate 3 projects of your peers > * You get 3 extra points for each evaluation ## Evaluation Criteria * Problem description * 0 points: Problem is not described * 2 points: Problem is described only briefly or not clearly * 4 points: Problem is well described and it's clear what problem the project solves * Cloud * 0 points: Cloud is not used, things run only locally * 2 points: The project is developed in the cloud * 4 points: The project is developed in the cloud and IaC tools are used * Data ingestion (choose either batch or stream) * Batch / Workflow orchestration * 0 points: No workflow orchestration * 2 points: Partial workflow orchestration: some steps are orchestrated, some run manually * 4 points: End-to-end pipeline: multiple steps in the DAG, uploading data to data lake * Stream * 0 points: No streaming system (like Kafka, Pulsar, etc) * 2 points: A simple pipeline with one consumer and one producer * 4 points: Using consumer/producers and streaming technologies (like Kafka streaming, Spark streaming, Flink, etc) * Data warehouse * 0 points: No DWH is used * 2 points: Tables are created in DWH, but not optimized * 4 points: Tables are partitioned and clustered in a way that makes sense for the upstream queries (with explanation) * Transformations (dbt, spark, etc) * 0 points: No transformations * 2 points: Simple SQL transformation (no dbt or similar tools) * 4 points: Transformations are defined with dbt, Spark or similar technologies * Dashboard * 0 points: No dashboard * 2 points: A dashboard with 1 tile * 4 points: A dashboard with 2 tiles * Reproducibility * 0 points: No instructions on how to run the code at all * 2 points: Some instructions are there, but they are not complete * 4 points: Instructions are clear, it's easy to run the code, and the code works > [!NOTE] > It's highly recommended to create a new repository for your project (not inside an existing repo) with a meaningful title, such as > 
"Quake Analytics Dashboard" or "Bike Data Insights" and include as many details as possible in the README file. ChatGPT can assist you with this. Doing so will not only make it easier to showcase your project for potential job opportunities but also allow it to be featured on the [Projects Gallery App](#projects-gallery). > If you leave the README file empty or with minimal details, there may be point deductions as per the [Evaluation Criteria](#evaluation-criteria). ## Going the extra mile (Optional) > [!NOTE] > The following things are not covered in the course, are entirely optional and will not be graded. However, implementing these could significantly enhance the quality of your project: * Add tests * Use make * Add CI/CD pipeline If you intend to include this project in your portfolio, adding these additional features will definitely help you stand out from others. ## Cheating and plagiarism Plagiarism in any form is not allowed. Examples of plagiarism: * Taking somebody else's notebooks and projects (in full or in part) and using them for the capstone project * Re-using your own projects (in full or in part) from other courses and bootcamps * Re-using your midterm project from ML Zoomcamp in capstone * Re-using your ML Zoomcamp project from previous iterations of the course Violating any of these rules will result in 0 points for this project. ## Resources ### Datasets Refer to the provided [datasets](datasets.md) for possible selection. 
### Helpful Links * [Unit Tests + CI for Airflow](https://www.astronomer.io/events/recaps/testing-airflow-to-bulletproof-your-code/) * [CI/CD for Airflow (with Gitlab & GCP state file)](https://engineering.ripple.com/building-ci-cd-with-airflow-gitlab-and-terraform-in-gcp) * [CI/CD for Airflow (with GitHub and S3 state file)](https://programmaticponderings.com/2021/12/14/devops-for-dataops-building-a-ci-cd-pipeline-for-apache-airflow-dags/) * [CD for Terraform](https://medium.com/towards-data-science/git-actions-terraform-for-data-engineers-scientists-gcp-aws-azure-448dc7c60fcc) * [Spark + Airflow](https://medium.com/doubtnut/github-actions-airflow-for-automating-your-spark-pipeline-c9dff32686b) ### Projects Gallery Explore a collection of projects completed by members of our community. The projects cover a wide range of topics and utilize different tools and techniques. Feel free to delve into any project and see how others have tackled real-world problems with data, structured their code, and presented their findings. It's a great resource to learn and get ideas for your own projects. 
[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://datatalksclub-projects.streamlit.app/) ### DE Zoomcamp 2023 * [2023 Projects](../cohorts/2023/project.md) ### DE Zoomcamp 2022 * [2022 Projects](../cohorts/2022/project.md) ================================================ FILE: projects/datasets.md ================================================ ## Datasets Here are some datasets that you could use for the project: * [Kaggle](https://www.kaggle.com/datasets) * [AWS datasets](https://registry.opendata.aws/) * [UK government open data](https://data.gov.uk/) * [Github archive](https://www.gharchive.org) * [Awesome public datasets](https://github.com/awesomedata/awesome-public-datasets) * [Million songs dataset](http://millionsongdataset.com) * [Some random datasets](https://components.one/datasets/) * [COVID Datasets](https://www.reddit.com/r/datasets/comments/n3ph2d/coronavirus_datsets/) * [Datasets from Azure](https://docs.microsoft.com/en-us/azure/azure-sql/public-data-sets) * [Datasets from BigQuery](https://cloud.google.com/bigquery/public-data/) * [Dataset search engine from Google](https://datasetsearch.research.google.com/) * [Public datasets offered by different GCP services](https://cloud.google.com/solutions/datasets) * [European statistics datasets](https://ec.europa.eu/eurostat/data/database) * [Datasets for streaming](https://github.com/ColinEberhardt/awesome-public-streaming-datasets) * [Dataset for Santander bicycle rentals in London](https://cycling.data.tfl.gov.uk/) * [Common crawl data](https://commoncrawl.org/) (copy of the internet) * [NASA's EarthData](https://search.earthdata.nasa.gov/search) (May require introductory geospatial analysis) * Collection Of Data Repositories * [part 1](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-1.html) (from agriculture and finance to government) * [part 
2](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-2.html) (from healthcare to transportation) * [Data For Good by Meta](https://dataforgood.facebook.com/dfg/tools) PRs with more datasets are welcome! It's not mandatory that you use a dataset from this list. You can use any dataset you want. ================================================ FILE: workshop-best-practices.md ================================================ # Workshop Best Practices Preferences and patterns learned from building the PyFlink streaming workshop. ## Structure and Pacing - Introduce services one at a time, not all at once. Start with one container (e.g., Redpanda), explain it, use it. Then add the next (PostgreSQL), etc. - Start with the simplest version that works (plain Python consumer), then motivate the more complex tool (Flink) by showing what's missing. - Use `docker compose up -d` to start services selectively during the gradual buildup. `docker compose up --build -d` only when everything is ready. ## Data - Use real datasets, not fake test data. NYC taxi data (`yellow_tripdata_YYYY-MM.parquet`) is a good go-to. - Limit to manageable sizes (e.g., first 1000 rows) for workshop speed. ## Project Setup - Assume starting from scratch: `uv init -p 3.12` + `uv add `. - Add dependencies gradually as they're needed in the narrative (e.g., `uv add kafka-python pandas pyarrow` first, `uv add psycopg2-binary` later when PostgreSQL is introduced). - Always note "if you cloned the repo, run `uv sync` instead" as a blockquote. ## Code Delivery - Break large code blocks into small, focused blocks. Each block should do one thing. Don't dump a full script in one block. - Pattern for code blocks: short intro line (what it does), then the code, then the explanation of how it works below. Don't put detailed explanations before the code - let the reader see the code first. - Keep imports local to each block - don't introduce all imports upfront. 
Each block should only import what it uses. - Introduce functions and utilities where they're first used, not earlier. For example, show `dataclasses.asdict()` in the block that calls it, not in the block that defines the dataclass. - When introducing a function, show a test with sample data before using it in the real code. For example, create a test binary string to verify a deserializer, then pass it to the consumer. - Prefer named functions over inline lambdas. A named function is reusable, testable, and easier to explain step by step. For example, `value_deserializer=ride_deserializer` instead of `value_deserializer=lambda m: json.loads(m.decode('utf-8'))`. - Extract repetitive logic into named functions. For example, row-to-object conversion that appears in multiple places should be a function like `ride_from_row(row)`. - Split one-liner functions into multiple lines. Each step (decode, parse, construct) on its own line is easier to follow and explain. - Show the simple approach first, then improve it. For example, show a generic `json_serializer` with manual `dataclasses.asdict()` calls, then introduce a specialized `ride_serializer` that handles the conversion internally. Let the student feel the friction before showing the fix. - Extract shared code (dataclasses, serializers, deserializers, converters) into shared modules (e.g., `models.py`) so multiple scripts can import from one place. - Reference the complete script at the end (e.g., "> The complete script is in `src/producers/producer.py`."). - For infrastructure files that are long or complex (Dockerfile, YAML configs), link to the file on GitHub and provide a short summary list of what it does. Use `wget` to download from the GitHub repo instead of asking students to type them. - Mention that students can run Python code in Jupyter notebooks (`uv add jupyter`, `uv run jupyter lab`) as an alternative to .py scripts. The small-block style maps naturally to notebook cells. 
- Flink jobs must remain as .py files (they're submitted to the cluster via `docker compose exec`). Add a note explaining this distinction. ## Formatting - No bold formatting (`**text**`) in README files. Use plain text. - No em dashes. Use hyphens with spaces (` - `) instead. - Use `python` not `python3`. - Use `docker compose` not `docker-compose`. - Use `uvx pgcli` not just `pgcli`. - Use `uv run python` not `python` for running scripts. ## Naming - Use meaningful names that reflect purpose, not generic placeholders. For example, `group_id='rides-console'` or `group_id='rides-to-postgres'`, not `group_id='test-consumer-group'`. ## Explanations - For complex configurations (like Redpanda's docker-compose command), explain every parameter in a table or list. - Explain the "why" not just the "what" (why two Kafka addresses? why checkpointing every 10 seconds? why watermarks?). - Use tables for parameter explanations and comparisons. - Include sample output for every command students will run. - Use `>` blockquotes for tips, notes about the repo, and common mistakes from original workshops/streams. - For complex concepts (watermarks, task slots, parallelism), pull the explanation out of bullet lists into its own multi-paragraph section. State the value or syntax in the bullet, then explain the concept below in separate paragraphs for easier reading. - Use lists for multi-point summaries instead of packing everything into one long sentence. - When showing a development shortcut (like mounting local files into Docker), add a note explaining how it works in production. Students benefit from understanding real-world deployment patterns alongside the workshop setup. ## Code Organization - Define the source (where you read from) before the sink (where you write to) when presenting code blocks. Set up the consumer/reader first, then the database connection or output destination. 
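Several of the Code Delivery guidelines above (named functions instead of lambdas, one step per line, testing a function with sample data before using it) can be sketched together. This is a minimal illustration, not the workshop's actual code: the `Ride` fields are hypothetical, and only `ride_serializer`, `ride_deserializer` and `dataclasses.asdict()` are names taken from the guidelines.

```python
import dataclasses
import json
from dataclasses import dataclass


@dataclass
class Ride:
    # Hypothetical fields, chosen only for illustration.
    pickup_location_id: int
    dropoff_location_id: int
    trip_distance: float


def ride_serializer(ride: Ride) -> bytes:
    # Step 1: convert the dataclass to a plain dict.
    ride_dict = dataclasses.asdict(ride)
    # Step 2: encode the dict as a JSON string.
    json_str = json.dumps(ride_dict)
    # Step 3: encode the string as UTF-8 bytes.
    return json_str.encode("utf-8")


def ride_deserializer(data: bytes) -> Ride:
    # Step 1: decode the bytes into a string.
    json_str = data.decode("utf-8")
    # Step 2: parse the JSON string into a dict.
    ride_dict = json.loads(json_str)
    # Step 3: construct the dataclass from the dict.
    return Ride(**ride_dict)


# Verify the deserializer with a sample binary string before wiring it into a consumer.
sample_bytes = b'{"pickup_location_id": 186, "dropoff_location_id": 79, "trip_distance": 1.72}'
sample_ride = ride_deserializer(sample_bytes)
print(sample_ride)

# Round trip: serializing and deserializing should give back the same ride.
round_trip = ride_deserializer(ride_serializer(sample_ride))
print(round_trip == sample_ride)
```

Once the test passes, the named function can be handed to the consumer as `value_deserializer=ride_deserializer`, and the shared pieces (the dataclass and both functions) are natural candidates for a shared module such as `models.py`.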
## Docker Compose - Don't use `container_name` or `hostname` - Docker Compose handles naming automatically. - Don't use `extra_hosts` unless specifically needed. - Service names are automatically resolvable as hostnames within the Docker network. - Prefer short service names (e.g., `redpanda` not `redpanda-1`). - Keep `restart: on-failure` only for services that need it (like databases). ## Dependencies and Versions - Always use the latest stable versions of images and libraries. - Pin exact versions for Flink and its connectors (they must match). - Use `uv` for everything Python-related (package management, running scripts, even installing Python itself inside Docker). - Prefer `COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/` in Dockerfiles instead of `apt-get install`. ## Workshop Header - Credit the original stream/video at the top with a link. - If the new video is not yet available, put "TBA" with a sign-up link (e.g., Luma). - Brief description of what we'll build and prerequisites. ## Workshop Flow Template 1. Introduce the first component (message broker, database, etc.) 2. Set up with docker-compose (explain parameters) 3. Create a simple producer/writer 4. Create a simple consumer/reader 5. Add a database, save data 6. Show limitations of the simple approach 7. Introduce the framework (Flink, Spark, etc.) 8. Reproduce the simple case with the framework 9. Do something the simple approach can't (aggregation, windowing) 10. Explain advanced concepts (window types, offsets, etc.) 11. Cleanup 12. Q&A - questions and answers from the original stream. Include production deployment topics here rather than as standalone sections.