[
  {
    "path": ".github/FUNDING.yml",
    "content": "github: alexeygrigorev\n"
  },
  {
    "path": ".gitignore",
    "content": "\n.DS_Store\n.idea\n*.tfstate\n*.tfstate.*\n**.terraform\n**.terraform.lock.*\n**google_credentials.json\n**logs/\n**.env\n**__pycache__/\n.history\n**/ny_taxi_postgres_data/*\nserving_dir\n.ipynb_checkpoints/\n!week_6_stream_processing/avro_example/data/rides.csv\n*.parquet\n*.csv\n*.duckdb\n"
  },
  {
    "path": "01-docker-terraform/README.md",
    "content": "# Introduction\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/JgspdlKXS-w)](https://www.youtube.com/watch?v=JgspdlKXS-w)\n\n\nWe suggest watching videos in the same order as in this document.\n\n\n# Docker + Postgres\n\n## Workshop\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/lP8xXebHmuE)](https://youtu.be/lP8xXebHmuE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10)\n\n* Video: https://www.youtube.com/watch?v=lP8xXebHmuE\n* Follow the instructions here: [docker-sql/](docker-sql/)\n\n## :movie_camera: SQL refresher\n\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/QEcps_iskgg)](https://youtu.be/QEcps_iskgg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10)\n\n* Video: https://www.youtube.com/watch?v=QEcps_iskgg\n* SQL queries: [10-sql-refresher.md](docker-sql/10-sql-refresher.md)\n\n\n# GCP\n\n## :movie_camera: Introduction to GCP (Google Cloud Platform)\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/18jIzE41fJ4)](https://youtu.be/18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=3)\n\n# Terraform\n\n[Code and notes](terraform/)\n\n## :movie_camera: Introduction Terraform: Concepts and Overview, a primer\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/s2bOYDCKl_M)](https://youtu.be/s2bOYDCKl_M&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=11)\n\n## :movie_camera: Terraform Basics: Simple one file Terraform Deployment\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/Y2ux7gq3Z0o)](https://youtu.be/Y2ux7gq3Z0o&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=12)\n\n## :movie_camera: Deployment with a Variables File\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/PBi0hHjLftk)](https://youtu.be/PBi0hHjLftk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=13)\n\n## Configuring terraform and GCP SDK on Windows\n\n* [Instructions](terraform/windows.md)\n\n\n\n# Homework\n\n* [Homework](../cohorts/2026/01-docker-terraform/homework.md)\n\n\n# Community notes\n\n<details>\n<summary>Did you take 
notes? You can share them here</summary>\n\n* [Notes from Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/1_intro.md)\n* [Notes from Abd](https://itnadigital.notion.site/Week-1-Introduction-f18de7e69eb4453594175d0b1334b2f4)\n* [Notes from Aaron](https://github.com/ABZ-Aaron/DataEngineerZoomCamp/blob/master/week_1_basics_n_setup/README.md)\n* [Notes from Faisal](https://github.com/FaisalMohd/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/Notes/DE%20Zoomcamp%20Week-1.pdf)\n* [Michael Harty's Notes](https://github.com/mharty3/data_engineering_zoomcamp_2022/tree/main/week01)\n* [Blog post from Isaac Kargar](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/18/data-engineering-w1.html)\n* [Handwritten Notes By Mahmoud Zaher](https://github.com/zaherweb/DataEngineering/blob/master/week%201.pdf)\n* [Notes from Candace Williams](https://teacherc.github.io/data-engineering/2023/01/18/zoomcamp1.html)\n* [Notes from Marcos Torregrosa](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-1/)\n* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)\n* [Notes from Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week1)\n* [Notes from froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_1_basics_n_setup/notes/notes_week_01.md)\n* [Notes from adamiaonr](https://github.com/adamiaonr/data-engineering-zoomcamp/blob/main/week_1_basics_n_setup/2_docker_sql/NOTES.md)\n* [Notes from Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/01/week-1-data-engineering-zoomcamp-notes/)\n* [Notes from Balaji](https://github.com/Balajirvp/DE-Zoomcamp/blob/main/Week%201/Detailed%20Week%201%20Notes.ipynb)\n* [Notes from Erik](https://twitter.com/ehub96/status/1621351266281730049)\n* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week1.md)\n* Notes on [Docker, Docker Compose, and setting up a 
proper Python environment](https://medium.com/@verazabeida/zoomcamp-2023-week-1-f4f94cb360ae), by Vera\n* [Setting up the development environment on Google Virtual Machine](https://itsadityagupta.hashnode.dev/setting-up-the-development-environment-on-google-virtual-machine), blog post by Aditya Gupta\n* [Notes from Zharko Cekovski](https://www.zharconsulting.com/contents/data/data-engineering-bootcamp-2024/week-1-postgres-docker-and-ingestion-scripts/)\n* [2024 Module-01 Walkthough video by ellacharmed on youtube](https://youtu.be/VUZshlVAnk4)\n* [2024 Companion Module Walkthough slides by ellacharmed](https://github.com/ellacharmed/data-engineering-zoomcamp/blob/ella2024/cohorts/2024/01-docker-terraform/walkthrough-01.pdf)\n* [2024 Module-01 Environment setup video by ellacharmed on youtube](https://youtu.be/Zce_Hd37NGs)\n* [Docker Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/1a-docker_sql/readme.md) • [Terraform Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/1b-terraform_gcp/readme.md)\n* [Notes from Hammad Tariq](https://github.com/hamad-tariq/HammadTariq-ZoomCamp2024/blob/9c8b4908416eb8cade3d7ec220e7664c003e9b11/week_1_basics_n_setup/README.md)\n* [Hung's Notes](https://hung.bearblog.dev/docker/) & [Docker Cheatsheet](https://github.com/HangenYuu/docker-cheatsheet)\n* [Kemal's Notes](https://github.com/kemaldahha/data-engineering-course/blob/main/week_1_notes.md)\n* [Notes from Manuel Guerra (Windows+WSL2 Environment)](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/1_Containerization-and-Infrastructure-as-Code/README.md)\n* [Notes from Horeb SEIDOU](https://spotted-hardhat-eea.notion.site/Week-1-Containerization-and-Infrastructure-as-Code-15729780dc4a80a08288e497ba937a37)\n* [2025 Gitbook Notes from Tinker0425](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/introduction/introduction-and-set-up)\n* [Alex's Docker 
Notes](https://github.com/alexg9010/2025_data_engineering_zoomcamp/blob/master/01_docker/README.md) | [Alex's Terraform Notes](https://github.com/alexg9010/2025_data_engineering_zoomcamp/blob/master/01_3_terraform/README.md)\n* [2025 SQL Refresher - Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/01_docker_postgress/0_sql_refresh.ipynb)\n* [2025 Setting up the Environment - Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/01_docker_postgress/_setting_up.md)\n* [Notes from Mercy Markus: Linux/Fedora Tweaks and Tips](https://mercymarkus.com/posts/2025/series/dtc-dez-jan-2025/dtc-dez-2025-module-1/)\n* [[2026 tutorial video - Khanh Nguyen] Setting up the environment for homework-w1](https://youtu.be/_iqCWi_UoOc)\n* Add your notes above this line\n\n</details>\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/01-introduction.md",
    "content": "# Introduction to Docker\n\n**[↑ Up](README.md)** | **[← Previous](README.md)** | **[Next →](02-virtual-environment.md)**\n\nDocker is a _containerization software_ that allows us to isolate software in a similar way to virtual machines but in a much leaner way.\n\nA Docker image is a _snapshot_ of a container that we can define to run our software, or in this case our data pipelines. By exporting our Docker images to Cloud providers such as Amazon Web Services or Google Cloud Platform we can run our containers there.\n\n## Why Docker?\n\nDocker provides the following advantages:\n\n- Reproducibility: Same environment everywhere\n- Isolation: Applications run independently\n- Portability: Run anywhere Docker is installed\n\nThey are used in many situations:\n\n- Integration tests: CI/CD pipelines\n- Running pipelines on the cloud: AWS Batch, Kubernetes jobs\n- Spark: Analytics engine for large-scale data processing\n- Serverless: AWS Lambda, Google Functions\n\n## Basic Docker Commands\n\nCheck Docker version:\n\n```bash\ndocker --version\n```\n\nRun a simple container:\n\n```bash\ndocker run hello-world\n```\n\nRun something more complex:\n\n```bash\ndocker run ubuntu\n```\n\nNothing happens. Need to run it in `-it` mode:\n\n```bash\ndocker run -it ubuntu\n```\n\nWe don't have `python` there so let's install it:\n\n```bash\napt update && apt install python3\npython3 -V\n```\n\n## Stateless Containers\n\nImportant: Docker containers are stateless - any changes done inside a container will NOT be saved when the container is killed and started again.\n\nWhen you exit the container and use it again, the changes are gone:\n\n```bash\ndocker run -it ubuntu\npython3 -V\n```\n\nThis is good, because it doesn't affect your host system. 
Let's say you do something crazy like this:\n\n```bash\ndocker run -it ubuntu\nrm -rf / # don't run it on your computer!\n```\n\nNext time we run it, all the files are back.\n\n## Managing Containers\n\nHowever, saying that containers are stateless is not _completely_ correct. The state of each stopped container is still saved on disk. We can see stopped containers:\n\n```bash\ndocker ps -a\n```\n\nWe could restart one of them, but we won't, because it's not good practice. Stopped containers take up space, so let's delete them:\n\n```bash\ndocker rm $(docker ps -aq)\n```\n\nFrom now on, we add `--rm` so that the container is removed automatically when it exits:\n\n```bash\ndocker run -it --rm ubuntu\n```\n\n## Different Base Images\n\nThere are other base images besides `hello-world` and `ubuntu`. For example, Python:\n\n```bash\ndocker run -it --rm python:3.9.16\n# add -slim to get a smaller version\n```\n\nThis one starts the `python` interpreter. If we want bash instead, we need to override the entrypoint:\n\n```bash\ndocker run -it \\\n    --rm \\\n    --entrypoint=bash \\\n    python:3.9.16-slim\n```\n\n## Volumes\n\nSo, we know that with Docker we can restore any container to its initial state in a reproducible manner. But what about data? 
A common way to make data from the host available inside a container is with _volumes_.\n\nLet's create some data in `test`:\n\n```bash\nmkdir test\ncd test\ntouch file1.txt file2.txt file3.txt\necho \"Hello from host\" > file1.txt\ncd ..\n```\n\nNow let's create a simple script `test/list_files.py` that shows the files in the folder:\n\n```python\nfrom pathlib import Path\n\ncurrent_dir = Path.cwd()\ncurrent_file = Path(__file__).name\n\nprint(f\"Files in {current_dir}:\")\n\nfor filepath in current_dir.iterdir():\n    if filepath.name == current_file:\n        continue\n\n    print(f\"  - {filepath.name}\")\n\n    if filepath.is_file():\n        content = filepath.read_text(encoding='utf-8')\n        print(f\"    Content: {content}\")\n```\n\nNow let's mount this folder into a Python container with `-v`:\n\n```bash\ndocker run -it \\\n    --rm \\\n    -v $(pwd)/test:/app/test \\\n    --entrypoint=bash \\\n    python:3.9.16-slim\n```\n\nInside the container, run:\n\n```bash\ncd /app/test\nls -la\ncat file1.txt\npython list_files.py\n```\n\nYou'll see that the files from your host machine are accessible in the container!\n\n**[↑ Up](README.md)** | **[← Previous](README.md)** | **[Next →](02-virtual-environment.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/02-virtual-environment.md",
    "content": "# Virtual Environments and Data Pipelines\n\n**[↑ Up](README.md)** | **[← Previous](01-introduction.md)** | **[Next →](03-dockerizing-pipeline.md)**\n\nA **data pipeline** is a service that receives data as input and outputs more data. For example, reading a CSV file, transforming the data somehow and storing it as a table in a PostgreSQL database.\n\n```mermaid\ngraph LR\n    A[CSV File] --> B[Data Pipeline]\n    B --> C[Parquet File]\n    B --> D[PostgreSQL Database]\n    B --> E[Data Warehouse]\n    style B fill:#4CAF50,stroke:#333,stroke-width:2px,color:#fff\n```\n\nIn this workshop, we'll build pipelines that:\n- Download CSV data from the web\n- Transform and clean the data with pandas\n- Load it into PostgreSQL for querying\n- Process data in chunks to handle large files\n\n## Creating a Simple Pipeline\n\nLet's create an example pipeline. First, create a directory `pipeline` and inside, create a file  `pipeline.py`:\n\n```python\nimport sys\nprint(\"arguments\", sys.argv)\n\nday = int(sys.argv[1])\nprint(f\"Running pipeline for day {day}\")\n```\n\nNow let's add pandas:\n\n```python\nimport pandas as pd\n\ndf = pd.DataFrame({\"A\": [1, 2], \"B\": [3, 4]})\nprint(df.head())\n\ndf.to_parquet(f\"output_day_{sys.argv[1]}.parquet\")\n```\n\n## Why Virtual Environments?\n\nWe need pandas, but we don't have it. We want to test it before we run things in a container.\n\nWe can install it with `pip`:\n\n```bash\npip install pandas pyarrow\n```\n\nBut this installs it globally on your system. This can cause conflicts if different projects need different versions of the same package.\n\nInstead, we want to use a **virtual environment** - an isolated Python environment that keeps dependencies for this project separate from other projects and from your system Python.\n\n## Using uv - Modern Python Package Manager\n\nWe'll use `uv` - a modern, fast Python package and project manager written in Rust. 
It's much faster than pip and handles virtual environments automatically.\n\n```bash\npip install uv\n```\n\nNow initialize a Python project with uv:\n\n```bash\nuv init --python=3.13\n```\n\nThis creates a `pyproject.toml` file for managing dependencies and a `.python-version` file.\n\n### Comparing Python Versions\n\n```bash\nuv run which python  # Python in the virtual environment\nuv run python -V\n\nwhich python        # System Python\npython -V\n```\n\nYou'll see they're different - `uv run` uses the isolated environment.\n\n### Adding Dependencies\n\nNow let's add pandas and pyarrow:\n\n```bash\nuv add pandas pyarrow\n```\n\nThis adds pandas and pyarrow to your `pyproject.toml` and installs them in the virtual environment.\n\n### Running the Pipeline\n\nNow we can execute the file:\n\n```bash\nuv run python pipeline.py 10\n```\n\nWe will see:\n\n* `arguments ['pipeline.py', '10']`\n* `Running pipeline for day 10`\n* the first rows of the DataFrame\n\nThe script also writes `output_day_10.parquet` to the current directory.\n\n## Git Configuration\n\nThis script produces a binary (parquet) file, so let's make sure we don't accidentally commit it to git by adding the parquet extension to `.gitignore`:\n\n```\n*.parquet\n```\n\n**[↑ Up](README.md)** | **[← Previous](01-introduction.md)** | **[Next →](03-dockerizing-pipeline.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/03-dockerizing-pipeline.md",
    "content": "# Dockerizing the Pipeline\n\n**[↑ Up](README.md)** | **[← Previous](02-virtual-environment.md)** | **[Next →](04-postgres-docker.md)**\n\nNow let's containerize the script. Create the following `Dockerfile` file:\n\n## Simple Dockerfile with pip\n\n```dockerfile\n# base Docker image that we will build on\nFROM python:3.13.11-slim\n\n# set up our image by installing prerequisites; pandas in this case\nRUN pip install pandas pyarrow\n\n# set up the working directory inside the container\nWORKDIR /app\n# copy the script to the container. 1st name is source file, 2nd is destination\nCOPY pipeline.py pipeline.py\n\n# define what to do first when the container runs\n# in this example, we will just run the script\nENTRYPOINT [\"python\", \"pipeline.py\"]\n```\n\n**Explanation:**\n\n- `FROM`: Base image (Python 3.13)\n- `RUN`: Execute commands during build\n- `WORKDIR`: Set working directory\n- `COPY`: Copy files into the image\n- `ENTRYPOINT`: Default command to run\n\n### Build and Run\n\nLet's build the image:\n\n```bash\ndocker build -t test:pandas .\n```\n\n* The image name will be `test` and its tag will be `pandas`. If the tag isn't specified it will default to `latest`.\n\nWe can now run the container and pass an argument to it, so that our pipeline will receive it:\n\n```bash\ndocker run -it test:pandas some_number\n```\n\nYou should get the same output you did when you ran the pipeline script by itself.\n\n> Note: these instructions assume that `pipeline.py` and `Dockerfile` are in the same directory. The Docker commands should also be run from the same directory as these files.\n\n## Dockerfile with uv\n\nWhat about uv? 
Let's use it instead of pip:\n\n```dockerfile\n# Start with slim Python 3.13 image\nFROM python:3.13.11-slim\n\n# Copy uv binary from official uv image (multi-stage build pattern)\nCOPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/\n\n# Set working directory\nWORKDIR /app\n\n# Add virtual environment to PATH so we can use installed packages\nENV PATH=\"/app/.venv/bin:$PATH\"\n\n# Copy dependency files first (better layer caching)\nCOPY \"pyproject.toml\" \"uv.lock\" \".python-version\" ./\n# Install dependencies from lock file (ensures reproducible builds)\nRUN uv sync --locked\n\n# Copy application code\nCOPY pipeline.py pipeline.py\n\n# Set entry point\nENTRYPOINT [\"uv\", \"run\", \"python\", \"pipeline.py\"]\n```\n\n**[↑ Up](README.md)** | **[← Previous](02-virtual-environment.md)** | **[Next →](04-postgres-docker.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/04-postgres-docker.md",
    "content": "# Running PostgreSQL with Docker\n\n**[↑ Up](README.md)** | **[← Previous](03-dockerizing-pipeline.md)** | **[Next →](05-data-ingestion.md)**\n\nNow we want to do real data engineering. Let's use a Postgres database for that.\n\nYou can run a containerized version of Postgres that doesn't require any installation steps. You only need to provide a few _environment variables_ to it as well as a _volume_ for storing data.\n\n## Running PostgreSQL in a Container\n\nCreate a folder anywhere you'd like for Postgres to store data in. We will use the example folder `ny_taxi_postgres_data`. Here's how to run the container:\n\n```bash\ndocker run -it --rm \\\n  -e POSTGRES_USER=\"root\" \\\n  -e POSTGRES_PASSWORD=\"root\" \\\n  -e POSTGRES_DB=\"ny_taxi\" \\\n  -v ny_taxi_postgres_data:/var/lib/postgresql \\\n  -p 5432:5432 \\\n  postgres:18\n```\n\n### Explanation of Parameters\n\n* `-e` sets environment variables (user, password, database name)\n* `-v ny_taxi_postgres_data:/var/lib/postgresql` creates a **named volume**\n  * Docker manages this volume automatically\n  * Data persists even after container is removed\n  * Volume is stored in Docker's internal storage\n* `-p 5432:5432` maps port 5432 from container to host\n* `postgres:18` uses PostgreSQL version 18 (latest as of Dec 2025)\n\n### Alternative Approach - Bind Mount\n\nFirst create the directory, then map it:\n\n```bash\nmkdir ny_taxi_postgres_data\n\ndocker run -it \\\n  -e POSTGRES_USER=\"root\" \\\n  -e POSTGRES_PASSWORD=\"root\" \\\n  -e POSTGRES_DB=\"ny_taxi\" \\\n  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql \\\n  -p 5432:5432 \\\n  postgres:18\n```\n\n### Named Volume vs Bind Mount\n\n* **Named volume** (`name:/path`): Managed by Docker, easier\n* **Bind mount** (`/host/path:/container/path`): Direct mapping to host filesystem, more control\n\n## Connecting to PostgreSQL\n\nOnce the container is running, we can log into our database with [pgcli](https://www.pgcli.com/).\n\nInstall 
pgcli:\n\n```bash\nuv add --dev pgcli\n```\n\nThe `--dev` flag marks this as a development dependency (not needed in production). It will be added to the `[dependency-groups]` section of `pyproject.toml` instead of the main `dependencies` section.\n\nNow use it to connect to Postgres:\n\n```bash\nuv run pgcli -h localhost -p 5432 -u root -d ny_taxi\n```\n\n* `uv run` executes a command in the context of the virtual environment\n* `-h` is the host. Since we're running locally we can use `localhost`.\n* `-p` is the port.\n* `-u` is the username.\n* `-d` is the database name.\n* The password is not provided; it will be requested after running the command.\n\nWhen prompted, enter the password: `root`\n\n## Basic SQL Commands\n\nTry some SQL commands:\n\n```sql\n-- List tables\n\\dt\n\n-- Create a test table\nCREATE TABLE test (id INTEGER, name VARCHAR(50));\n\n-- Insert data\nINSERT INTO test VALUES (1, 'Hello Docker');\n\n-- Query data\nSELECT * FROM test;\n\n-- Exit\n\\q\n```\n\n**[↑ Up](README.md)** | **[← Previous](03-dockerizing-pipeline.md)** | **[Next →](05-data-ingestion.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/05-data-ingestion.md",
    "content": "# NY Taxi Dataset and Data Ingestion\n\n**[↑ Up](README.md)** | **[← Previous](04-postgres-docker.md)** | **[Next →](06-ingestion-script.md)**\n\nWe will now create a Jupyter Notebook `notebook.ipynb` file which we will use to read a CSV file and export it to Postgres.\n\n## Setting up Jupyter\n\nInstall Jupyter:\n\n```bash\nuv add --dev jupyter\n```\n\nLet's create a Jupyter notebook to explore the data:\n\n```bash\nuv run jupyter notebook\n```\n\n## The NYC Taxi Dataset\n\nWe will use data from the [NYC TLC Trip Record Data website](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).\n\nSpecifically, we will use the [Yellow taxi trip records CSV file for January 2021](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz).\n\nThis data used to be csv, but later they switched to parquet. We want to keep using CSV because we need to do a bit of extra pre-processing (for the purposes of learning it).\n\nA dictionary to understand each field is available [here](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).\n\n> Note: The CSV data is stored as gzipped files. Pandas can read them directly.\n\n## Explore the Data\n\nCreate a new notebook and run:\n\n```python\nimport pandas as pd\n\n# Read a sample of the data\nprefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/'\ndf = pd.read_csv(prefix + 'yellow_tripdata_2021-01.csv.gz', nrows=100)\n\n# Display first rows\ndf.head()\n\n# Check data types\ndf.dtypes\n\n# Check data shape\ndf.shape\n```\n\n### Handling Data Types\n\nWe have a warning: (Note that this warning might pop up later for some users, so it's best to follow the instructions below)\n\n```\n/tmp/ipykernel_25483/2933316018.py:1: DtypeWarning: Columns (6) have mixed types. 
Specify dtype option on import or set low_memory=False.\n```\n\nSo we need to specify the types:\n\n```python\ndtype = {\n    \"VendorID\": \"Int64\",\n    \"passenger_count\": \"Int64\",\n    \"trip_distance\": \"float64\",\n    \"RatecodeID\": \"Int64\",\n    \"store_and_fwd_flag\": \"string\",\n    \"PULocationID\": \"Int64\",\n    \"DOLocationID\": \"Int64\",\n    \"payment_type\": \"Int64\",\n    \"fare_amount\": \"float64\",\n    \"extra\": \"float64\",\n    \"mta_tax\": \"float64\",\n    \"tip_amount\": \"float64\",\n    \"tolls_amount\": \"float64\",\n    \"improvement_surcharge\": \"float64\",\n    \"total_amount\": \"float64\",\n    \"congestion_surcharge\": \"float64\"\n}\n\nparse_dates = [\n    \"tpep_pickup_datetime\",\n    \"tpep_dropoff_datetime\"\n]\n\ndf = pd.read_csv(\n    prefix + 'yellow_tripdata_2021-01.csv.gz',\n    nrows=100,\n    dtype=dtype,\n    parse_dates=parse_dates\n)\n```\n\n## Ingesting Data into Postgres\n\nIn the Jupyter notebook, we create code to:\n\n1. Download the CSV file\n2. Read it in chunks with pandas\n3. Convert datetime columns\n4. 
Insert data into PostgreSQL using SQLAlchemy\n\n### Install SQLAlchemy\n\n```bash\nuv add sqlalchemy \"psycopg[binary,pool]\"\n```\n\n### Create Database Connection\n\n```python\nfrom sqlalchemy import create_engine\nengine = create_engine('postgresql+psycopg://root:root@localhost:5432/ny_taxi')\n```\n\n### Get DDL Schema\n\n```python\nprint(pd.io.sql.get_schema(df, name='yellow_taxi_data', con=engine))\n```\n\nOutput:\n\n```sql\nCREATE TABLE yellow_taxi_data (\n    \"VendorID\" BIGINT,\n    tpep_pickup_datetime TIMESTAMP WITHOUT TIME ZONE,\n    tpep_dropoff_datetime TIMESTAMP WITHOUT TIME ZONE,\n    passenger_count BIGINT,\n    trip_distance FLOAT(53),\n    \"RatecodeID\" BIGINT,\n    store_and_fwd_flag TEXT,\n    \"PULocationID\" BIGINT,\n    \"DOLocationID\" BIGINT,\n    payment_type BIGINT,\n    fare_amount FLOAT(53),\n    extra FLOAT(53),\n    mta_tax FLOAT(53),\n    tip_amount FLOAT(53),\n    tolls_amount FLOAT(53),\n    improvement_surcharge FLOAT(53),\n    total_amount FLOAT(53),\n    congestion_surcharge FLOAT(53)\n)\n```\n\n### Create the Table\n\n```python\ndf.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')\n```\n\n`head(n=0)` selects zero rows, so this creates the table without adding any data yet.\n\n## Ingesting Data in Chunks\n\nWe don't want to insert all the data at once. 
Let's do it in batches and use an iterator for that:\n\n```python\ndf_iter = pd.read_csv(\n    prefix + 'yellow_tripdata_2021-01.csv.gz',\n    dtype=dtype,\n    parse_dates=parse_dates,\n    iterator=True,\n    chunksize=100000\n)\n```\n\n### Iterate Over Chunks\n\n```python\nfor df_chunk in df_iter:\n    print(len(df_chunk))\n```\n\n### Inserting Data\n\nEach chunk is appended to the table with `to_sql`:\n\n```python\ndf_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')\n```\n\n### Complete Ingestion Loop\n\nThe iterator can only be consumed once, so recreate `df_iter` before running this loop if you already iterated over it above:\n\n```python\nfirst = True\n\nfor df_chunk in df_iter:\n\n    if first:\n        # Create table schema (no data)\n        df_chunk.head(0).to_sql(\n            name=\"yellow_taxi_data\",\n            con=engine,\n            if_exists=\"replace\"\n        )\n        first = False\n        print(\"Table created\")\n\n    # Insert chunk\n    df_chunk.to_sql(\n        name=\"yellow_taxi_data\",\n        con=engine,\n        if_exists=\"append\"\n    )\n\n    print(\"Inserted:\", len(df_chunk))\n```\n\n### Alternative Approach (Without First Flag)\n\n```python\nfirst_chunk = next(df_iter)\n\nfirst_chunk.head(0).to_sql(\n    name=\"yellow_taxi_data\",\n    con=engine,\n    if_exists=\"replace\"\n)\n\nprint(\"Table created\")\n\nfirst_chunk.to_sql(\n    name=\"yellow_taxi_data\",\n    con=engine,\n    if_exists=\"append\"\n)\n\nprint(\"Inserted first chunk:\", len(first_chunk))\n\nfor df_chunk in df_iter:\n    df_chunk.to_sql(\n        name=\"yellow_taxi_data\",\n        con=engine,\n        if_exists=\"append\"\n    )\n    print(\"Inserted chunk:\", len(df_chunk))\n```\n\n## Adding Progress Bar\n\nAdd `tqdm` to see progress:\n\n```bash\nuv add tqdm\n```\n\nPut it around the iterable:\n\n```python\nfrom tqdm.auto import tqdm\n\nfor df_chunk in tqdm(df_iter):\n    ...\n```\n\nTo see progress in terms of total chunks, you would have to pass the `total` argument to `tqdm`. 
In our scenario, the pragmatic way is to hardcode a value: the number of rows in the file divided by the chunk size, rounded up.\n\n## Verify the Data\n\nConnect to the database using pgcli:\n\n```bash\nuv run pgcli -h localhost -p 5432 -u root -d ny_taxi\n```\n\nThen explore the data.\n\n**[↑ Up](README.md)** | **[← Previous](04-postgres-docker.md)** | **[Next →](06-ingestion-script.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/06-ingestion-script.md",
    "content": "# Creating the Data Ingestion Script\n\n**[↑ Up](README.md)** | **[← Previous](05-data-ingestion.md)** | **[Next →](07-pgadmin.md)**\n\nNow let's convert the notebook to a Python script.\n\n## Convert Notebook to Script\n\n```bash\nuv run jupyter nbconvert --to=script notebook.ipynb\nmv notebook.py ingest_data.py\n```\n\n## The Complete Ingestion Script\n\nSee the `pipeline/` directory for the complete script with click integration. Here's the core structure:\n\n```python\nimport pandas as pd\nfrom sqlalchemy import create_engine\nfrom tqdm.auto import tqdm\n\ndtype = {\n    \"VendorID\": \"Int64\",\n    \"passenger_count\": \"Int64\",\n    \"trip_distance\": \"float64\",\n    \"RatecodeID\": \"Int64\",\n    \"store_and_fwd_flag\": \"string\",\n    \"PULocationID\": \"Int64\",\n    \"DOLocationID\": \"Int64\",\n    \"payment_type\": \"Int64\",\n    \"fare_amount\": \"float64\",\n    \"extra\": \"float64\",\n    \"mta_tax\": \"float64\",\n    \"tip_amount\": \"float64\",\n    \"tolls_amount\": \"float64\",\n    \"improvement_surcharge\": \"float64\",\n    \"total_amount\": \"float64\",\n    \"congestion_surcharge\": \"float64\"\n}\n\nparse_dates = [\n    \"tpep_pickup_datetime\",\n    \"tpep_dropoff_datetime\"\n]\n```\n\n## Click Integration\n\nThe script uses `click` for command-line argument parsing:\n\n```python\nimport click\n\n@click.command()\n@click.option('--pg-user', default='root', help='PostgreSQL user')\n@click.option('--pg-pass', default='root', help='PostgreSQL password')\n@click.option('--pg-host', default='localhost', help='PostgreSQL host')\n@click.option('--pg-port', default=5432, type=int, help='PostgreSQL port')\n@click.option('--pg-db', default='ny_taxi', help='PostgreSQL database name')\n@click.option('--target-table', default='yellow_taxi_data', help='Target table name')\ndef run(pg_user, pg_pass, pg_host, pg_port, pg_db, target_table):\n    # Ingestion logic here\n    pass\n```\n\n## Running the Script\n\nThe script reads data 
in chunks (100,000 rows at a time) to handle large files efficiently without running out of memory.\n\nExample usage:\n\n```bash\nuv run python ingest_data.py \\\n  --pg-user=root \\\n  --pg-pass=root \\\n  --pg-host=localhost \\\n  --pg-port=5432 \\\n  --pg-db=ny_taxi \\\n  --target-table=yellow_taxi_trips\n```\n\n**[↑ Up](README.md)** | **[← Previous](05-data-ingestion.md)** | **[Next →](07-pgadmin.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/07-pgadmin.md",
    "content": "# pgAdmin - Database Management Tool\n\n**[↑ Up](README.md)** | **[← Previous](06-ingestion-script.md)** | **[Next →](08-dockerizing-ingestion.md)**\n\n`pgcli` is a handy tool but it's cumbersome to use for complex queries and database management. [`pgAdmin` is a web-based tool](https://www.pgadmin.org/) that makes it more convenient to access and manage our databases.\n\nIt's possible to run pgAdmin as a container along with the Postgres container, but both containers will have to be in the same _virtual network_ so that they can find each other.\n\n## Run pgAdmin Container\n\n```bash\ndocker run -it \\\n  -e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\n  -e PGADMIN_DEFAULT_PASSWORD=\"root\" \\\n  -v pgadmin_data:/var/lib/pgadmin \\\n  -p 8085:80 \\\n  dpage/pgadmin4\n```\n\nThe `-v pgadmin_data:/var/lib/pgadmin` volume mapping saves pgAdmin settings (server connections, preferences) so you don't have to reconfigure it every time you restart the container.\n\n### Parameters Explained\n\n* The container needs 2 environment variables: a login email and a password. We use `admin@admin.com` and `root` in this example.\n* pgAdmin is a web app and its default port is 80; we map it to 8085 in our localhost to avoid any possible conflicts.\n* The actual image name is `dpage/pgadmin4`.\n\n**Note:** This won't work yet because pgAdmin can't see the PostgreSQL container. They need to be on the same Docker network!\n\n## Docker Networks\n\nLet's create a virtual Docker network called `pg-network`:\n\n```bash\ndocker network create pg-network\n```\n\n> You can remove the network later with the command `docker network rm pg-network`. 
You can look at the existing networks with `docker network ls`.\n\n### Run Containers on the Same Network\n\nStop both containers and re-run them with the network configuration:\n\n```bash\n# Run PostgreSQL on the network\ndocker run -it \\\n  -e POSTGRES_USER=\"root\" \\\n  -e POSTGRES_PASSWORD=\"root\" \\\n  -e POSTGRES_DB=\"ny_taxi\" \\\n  -v ny_taxi_postgres_data:/var/lib/postgresql \\\n  -p 5432:5432 \\\n  --network=pg-network \\\n  --name pgdatabase \\\n  postgres:18\n\n# In another terminal, run pgAdmin on the same network\ndocker run -it \\\n  -e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\n  -e PGADMIN_DEFAULT_PASSWORD=\"root\" \\\n  -v pgadmin_data:/var/lib/pgadmin \\\n  -p 8085:80 \\\n  --network=pg-network \\\n  --name pgadmin \\\n  dpage/pgadmin4\n```\n\n* Just like with the Postgres container, we specify a network and a name for pgAdmin.\n* The container names (`pgdatabase` and `pgadmin`) allow the containers to find each other within the network.\n\n## Connect pgAdmin to PostgreSQL\n\nYou should now be able to load pgAdmin on a web browser by browsing to `http://localhost:8085`. Use the same email and password you used for running the container to log in.\n\n1. Open browser and go to `http://localhost:8085`\n2. Login with email: `admin@admin.com`, password: `root`\n3. Right-click \"Servers\" → Register → Server\n4. Configure:\n   - **General tab**: Name: `Local Docker`\n   - **Connection tab**:\n     - Host: `pgdatabase` (the container name)\n     - Port: `5432`\n     - Username: `root`\n     - Password: `root`\n5. Save\n\nNow you can explore the database using the pgAdmin interface!\n\n**[↑ Up](README.md)** | **[← Previous](06-ingestion-script.md)** | **[Next →](08-dockerizing-ingestion.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/08-dockerizing-ingestion.md",
    "content": "# Dockerizing the Ingestion Script\n\n**[↑ Up](README.md)** | **[← Previous](07-pgadmin.md)** | **[Next →](09-docker-compose.md)**\n\nNow let's containerize the ingestion script so we can run it in Docker.\n\n## The Dockerfile\n\nThe `pipeline/Dockerfile` shows how to containerize the ingestion script:\n\n```dockerfile\nFROM python:3.13.11-slim\nCOPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/\n\nWORKDIR /code\nENV PATH=\"/code/.venv/bin:$PATH\"\n\nCOPY pyproject.toml .python-version uv.lock ./\nRUN uv sync --locked\n\nCOPY ingest_data.py .\n\nENTRYPOINT [\"uv\", \"run\", \"python\", \"ingest_data.py\"]\n```\n\n### Explanation\n\n- `FROM python:3.13.11-slim`: Start with slim Python 3.13 image for smaller size\n- `COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/`: Copy uv binary from official uv image\n- `WORKDIR /code`: Set working directory inside container\n- `ENV PATH=\"/code/.venv/bin:$PATH\"`: Add virtual environment to PATH\n- `COPY pyproject.toml .python-version uv.lock ./`: Copy dependency files first (better caching)\n- `RUN uv sync --locked`: Install all dependencies from lock file (ensures reproducible builds)\n- `COPY ingest_data.py .`: Copy ingestion script\n- `ENTRYPOINT [\"uv\", \"run\", \"python\", \"ingest_data.py\"]`: Set entry point to run the ingestion script\n\n## Build the Docker Image\n\n```bash\ncd pipeline\ndocker build -t taxi_ingest:v001 .\n```\n\n## Run the Containerized Ingestion\n\n```bash\ndocker run -it \\\n  --network=pg-network \\\n  taxi_ingest:v001 \\\n    --pg-user=root \\\n    --pg-pass=root \\\n    --pg-host=pgdatabase \\\n    --pg-port=5432 \\\n    --pg-db=ny_taxi \\\n    --target-table=yellow_taxi_trips\n```\n\n### Important Notes\n\n* We need to provide the network for Docker to find the Postgres container. 
It goes before the name of the image.\n* Since Postgres is running on a separate container, the host argument will have to point to the container name of Postgres (`pgdatabase`).\n* You can drop the table in pgAdmin beforehand if you want, but the script will automatically replace the pre-existing table.\n\n**[↑ Up](README.md)** | **[← Previous](07-pgadmin.md)** | **[Next →](09-docker-compose.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/09-docker-compose.md",
    "content": "# Docker Compose\n\n**[↑ Up](README.md)** | **[← Previous](08-dockerizing-ingestion.md)** | **[Next →](10-sql-refresher.md)**\n\n`docker-compose` allows us to launch multiple containers using a single configuration file, so that we don't have to run multiple complex `docker run` commands separately.\n\nDocker compose makes use of YAML files. Here's the `docker-compose.yaml` file:\n\n```yaml\nservices:\n  pgdatabase:\n    image: postgres:18\n    environment:\n      POSTGRES_USER: \"root\"\n      POSTGRES_PASSWORD: \"root\"\n      POSTGRES_DB: \"ny_taxi\"\n    volumes:\n      - \"ny_taxi_postgres_data:/var/lib/postgresql\"\n    ports:\n      - \"5432:5432\"\n\n  pgadmin:\n    image: dpage/pgadmin4\n    environment:\n      PGADMIN_DEFAULT_EMAIL: \"admin@admin.com\"\n      PGADMIN_DEFAULT_PASSWORD: \"root\"\n    volumes:\n      - \"pgadmin_data:/var/lib/pgadmin\"\n    ports:\n      - \"8085:80\"\n\n\n\nvolumes:\n  ny_taxi_postgres_data:\n  pgadmin_data:\n```\n\n### Explanation\n\n* We don't have to specify a network because `docker compose` takes care of it: every single container (or \"service\", as the file states) will run within the same network and will be able to find each other according to their names (`pgdatabase` and `pgadmin` in this example).\n* All other details from the `docker run` commands (environment variables, volumes and ports) are mentioned accordingly in the file following YAML syntax.\n\n## Start Services with Docker Compose\n\nWe can now run Docker compose by running the following command from the same directory where `docker-compose.yaml` is found. 
Make sure that all previous containers aren't running anymore:\n\n```bash\ndocker-compose up\n```\n\n### Detached Mode\n\nIf you want to run the containers in the background rather than in the foreground (thus freeing up your terminal), you can run them in detached mode:\n\n```bash\ndocker-compose up -d\n```\n\n## Stop Services\n\nIn foreground mode, pressing `Ctrl+C` stops the containers. To stop them and also remove the containers and the network that Docker Compose created, run:\n\n```bash\ndocker-compose down\n```\n\n## Other Useful Commands\n\n```bash\n# View logs\ndocker-compose logs\n\n# Stop the services and remove their volumes\ndocker-compose down -v\n```\n\n## Benefits of Docker Compose\n\n- Single command to start all services\n- Automatic network creation\n- Easy configuration management\n- Declarative infrastructure\n\n## Running the Ingestion Script with Docker Compose\n\nIf you want to re-run the dockerized ingest script while Postgres and pgAdmin are running under `docker compose`, you will have to find the name of the virtual network that Docker Compose created for the containers.\n\n```bash\n# list the networks to find the one Compose created:\ndocker network ls\n\n# it's pipeline_default (or similar based on directory name)\n# now run the script:\ndocker run -it --rm \\\n  --network=pipeline_default \\\n  taxi_ingest:v001 \\\n    --pg-user=root \\\n    --pg-pass=root \\\n    --pg-host=pgdatabase \\\n    --pg-port=5432 \\\n    --pg-db=ny_taxi \\\n    --target-table=yellow_taxi_trips\n```\n\n**[↑ Up](README.md)** | **[← Previous](08-dockerizing-ingestion.md)** | **[Next →](10-sql-refresher.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/10-sql-refresher.md",
    "content": "# SQL Refresher\n\n**[↑ Up](README.md)** | **[← Previous](09-docker-compose.md)** | **[Next →](11-cleanup.md)**\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/QEcps_iskgg)](https://youtu.be/QEcps_iskgg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=10)\n\nPre-Requisites: If you followed the course in the given order, Docker Compose should already be running with pgdatabase and pgAdmin.\n\nOnce done, you can go to http://localhost:8085/browser/ to access pgAdmin.\nDon't forget to Right Click on the server or database to refresh it in case you don't see the new table.\n\nNow start querying!\n\n## Inner Joins\n\n### Implicit INNER JOIN\n\nJoining Yellow Taxi table with Zones Lookup table (implicit INNER JOIN):\n\n```sql\nSELECT\n    tpep_pickup_datetime,\n    tpep_dropoff_datetime,\n    total_amount,\n    CONCAT(zpu.\"Borough\", ' | ', zpu.\"Zone\") AS \"pickup_loc\",\n    CONCAT(zdo.\"Borough\", ' | ', zdo.\"Zone\") AS \"dropoff_loc\"\nFROM\n    yellow_taxi_trips t,\n    zones zpu,\n    zones zdo\nWHERE\n    t.\"PULocationID\" = zpu.\"LocationID\"\n    AND t.\"DOLocationID\" = zdo.\"LocationID\"\nLIMIT 100;\n```\n\n### Explicit INNER JOIN\n\n```sql\nSELECT\n    tpep_pickup_datetime,\n    tpep_dropoff_datetime,\n    total_amount,\n    CONCAT(zpu.\"Borough\", ' | ', zpu.\"Zone\") AS \"pickup_loc\",\n    CONCAT(zdo.\"Borough\", ' | ', zdo.\"Zone\") AS \"dropoff_loc\"\nFROM\n    yellow_taxi_trips t\nJOIN\n-- or INNER JOIN but it's less used, when writing JOIN, postgreSQL understands implicitly that we want to use an INNER JOIN\n    zones zpu ON t.\"PULocationID\" = zpu.\"LocationID\"\nJOIN\n    zones zdo ON t.\"DOLocationID\" = zdo.\"LocationID\"\nLIMIT 100;\n```\n\n## Data Quality Checks\n\n### Checking for NULL Location IDs\n\n```sql\nSELECT\n    tpep_pickup_datetime,\n    tpep_dropoff_datetime,\n    total_amount,\n    \"PULocationID\",\n    \"DOLocationID\"\nFROM\n    yellow_taxi_trips\nWHERE\n    \"PULocationID\" IS NULL\n    OR 
\"DOLocationID\" IS NULL\nLIMIT 100;\n```\n\n### Checking for Location IDs NOT IN Zones Table\n\n```sql\nSELECT\n    tpep_pickup_datetime,\n    tpep_dropoff_datetime,\n    total_amount,\n    \"PULocationID\",\n    \"DOLocationID\"\nFROM\n    yellow_taxi_trips\nWHERE\n    \"DOLocationID\" NOT IN (SELECT \"LocationID\" from zones)\n    OR \"PULocationID\" NOT IN (SELECT \"LocationID\" from zones)\nLIMIT 100;\n```\n\n## LEFT, RIGHT, and OUTER JOINS\n\nUsing LEFT, RIGHT, and OUTER JOINS when some Location IDs are not in either Tables:\n\n```sql\nDELETE FROM zones WHERE \"LocationID\" = 142;\n\nSELECT\n    tpep_pickup_datetime,\n    tpep_dropoff_datetime,\n    total_amount,\n    CONCAT(zpu.\"Borough\", ' | ', zpu.\"Zone\") AS \"pickup_loc\",\n    CONCAT(zdo.\"Borough\", ' | ', zdo.\"Zone\") AS \"dropoff_loc\"\nFROM\n    yellow_taxi_trips t\nLEFT JOIN\n    zones zpu ON t.\"PULocationID\" = zpu.\"LocationID\"\nJOIN\n    zones zdo ON t.\"DOLocationID\" = zdo.\"LocationID\"\nLIMIT 100;\n```\n\n```sql\nSELECT\n    tpep_pickup_datetime,\n    tpep_dropoff_datetime,\n    total_amount,\n    CONCAT(zpu.\"Borough\", ' | ', zpu.\"Zone\") AS \"pickup_loc\",\n    CONCAT(zdo.\"Borough\", ' | ', zdo.\"Zone\") AS \"dropoff_loc\"\nFROM\n    yellow_taxi_trips t\nRIGHT JOIN\n    zones zpu ON t.\"PULocationID\" = zpu.\"LocationID\"\nJOIN\n    zones zdo ON t.\"DOLocationID\" = zdo.\"LocationID\"\nLIMIT 100;\n```\n\n```sql\nSELECT\n    tpep_pickup_datetime,\n    tpep_dropoff_datetime,\n    total_amount,\n    CONCAT(zpu.\"Borough\", ' | ', zpu.\"Zone\") AS \"pickup_loc\",\n    CONCAT(zdo.\"Borough\", ' | ', zdo.\"Zone\") AS \"dropoff_loc\"\nFROM\n    yellow_taxi_trips t\nOUTER JOIN\n    zones zpu ON t.\"PULocationID\" = zpu.\"LocationID\"\nJOIN\n    zones zdo ON t.\"DOLocationID\" = zdo.\"LocationID\"\nLIMIT 100;\n```\n\n## GROUP BY\n\n### Calculate Number of Trips Per Day\n\n```sql\nSELECT\n    CAST(tpep_dropoff_datetime AS DATE) AS \"day\",\n    COUNT(1)\nFROM\n    yellow_taxi_trips\nGROUP 
BY\n    CAST(tpep_dropoff_datetime AS DATE)\nLIMIT 100;\n```\n\n## ORDER BY\n\n### Ordering by Day\n\n```sql\nSELECT\n    CAST(tpep_dropoff_datetime AS DATE) AS \"day\",\n    COUNT(1)\nFROM\n    yellow_taxi_trips\nGROUP BY\n    CAST(tpep_dropoff_datetime AS DATE)\nORDER BY\n    \"day\" ASC\nLIMIT 100;\n```\n\n### Ordering by Count\n\n```sql\nSELECT\n    CAST(tpep_dropoff_datetime AS DATE) AS \"day\",\n    COUNT(1) AS \"count\"\nFROM\n    yellow_taxi_trips\nGROUP BY\n    CAST(tpep_dropoff_datetime AS DATE)\nORDER BY\n    \"count\" DESC\nLIMIT 100;\n```\n\n## Other Aggregations\n\n```sql\nSELECT\n    CAST(tpep_dropoff_datetime AS DATE) AS \"day\",\n    COUNT(1) AS \"count\",\n    MAX(total_amount) AS \"total_amount\",\n    MAX(passenger_count) AS \"passenger_count\"\nFROM\n    yellow_taxi_trips\nGROUP BY\n    CAST(tpep_dropoff_datetime AS DATE)\nORDER BY\n    \"count\" DESC\nLIMIT 100;\n```\n\n## Grouping by Multiple Fields\n\n```sql\nSELECT\n    CAST(tpep_dropoff_datetime AS DATE) AS \"day\",\n    \"DOLocationID\",\n    COUNT(1) AS \"count\",\n    MAX(total_amount) AS \"total_amount\",\n    MAX(passenger_count) AS \"passenger_count\"\nFROM\n    yellow_taxi_trips\nGROUP BY\n    1, 2\nORDER BY\n    \"day\" ASC,\n    \"DOLocationID\" ASC\nLIMIT 100;\n```\n\n**[↑ Up](README.md)** | **[← Previous](09-docker-compose.md)** | **[Next →](11-cleanup.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/11-cleanup.md",
    "content": "# Cleanup\n\n**[↑ Up](README.md)** | **[← Previous](10-sql-refresher.md)** | **[Next →](../README.md)**\n\nWhen you're done with the workshop, clean up Docker resources to free up disk space.\n\n## Stop All Running Containers\n\n```bash\ndocker-compose down\n```\n\n## Remove Specific Containers\n\n```bash\n# List all containers\ndocker ps -a\n\n# Remove specific container\ndocker rm <container_id>\n\n# Remove all stopped containers\ndocker container prune\n```\n\n## Remove Docker Images\n\n```bash\n# List all images\ndocker images\n\n# Remove specific image\ndocker rmi taxi_ingest:v001\n\n# Remove all unused images\ndocker image prune -a\n```\n\n## Remove Docker Volumes\n\n```bash\n# List volumes\ndocker volume ls\n\n# Remove specific volumes\ndocker volume rm ny_taxi_postgres_data\ndocker volume rm pgadmin_data\n\n# Remove all unused volumes\ndocker volume prune\n```\n\n## Remove Docker Networks\n\n```bash\n# List networks\ndocker network ls\n\n# Remove specific network\ndocker network rm pg-network\n\n# Remove all unused networks\ndocker network prune\n```\n\n## Complete Cleanup\n\nRemoves ALL Docker resources - use with caution!\n\n```bash\n# ⚠️ Warning: This removes ALL Docker resources!\ndocker system prune -a --volumes\n```\n\n## Clean Up Local Files\n\n```bash\n# Remove parquet files\nrm *.parquet\n\n# Remove Python cache\nrm -rf __pycache__ .pytest_cache\n\n# Remove virtual environment (if using venv)\nrm -rf .venv\n```\n\n---\n\nThat's all for today. Happy learning! 🐳📊\n\n**[↑ Up](README.md)** | **[← Previous](10-sql-refresher.md)** | **[Next →](../README.md)**\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/README.md",
    "content": "# Docker and PostgreSQL: Data Engineering Workshop\n\n* Video: [link](https://www.youtube.com/watch?v=lP8xXebHmuE)\n* Slides: [link](https://docs.google.com/presentation/d/19pXcInDwBnlvKWCukP5sDoCAb69SPqgIoxJ_0Bikr00/edit?usp=sharing)\n* Code: [pipeline/](pipeline/)\n\nIn this workshop, we will explore Docker fundamentals and data engineering workflows using Docker containers. This workshop is part of Module 1 of the [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp).\n\n**Data Engineering** is the design and development of systems for collecting, storing and analyzing data at scale.\n\n## Prerequisites\n\n- Basic understanding of Python\n- Basic SQL knowledge (helpful but not required)\n- Docker and Python installed on your machine\n- Git (optional)\n\n## Workshop Contents\n\n1. [Introduction to Docker](01-introduction.md) - What is Docker, why use it, basic commands\n2. [Virtual Environments and Data Pipelines](02-virtual-environment.md) - Setting up Python environments with uv\n3. [Dockerizing the Pipeline](03-dockerizing-pipeline.md) - Creating a Dockerfile for a simple pipeline\n4. [Running PostgreSQL with Docker](04-postgres-docker.md) - Dockerizing PostgreSQL database\n5. [NY Taxi Dataset and Data Ingestion](05-data-ingestion.md) - Working with real data, pandas, SQLAlchemy\n6. [Creating the Data Ingestion Script](06-ingestion-script.md) - Converting notebook to Python script\n7. [pgAdmin - Database Management Tool](07-pgadmin.md) - Web-based database management\n8. [Dockerizing the Ingestion Script](08-dockerizing-ingestion.md) - Containerizing the pipeline\n9. [Docker Compose](09-docker-compose.md) - Multi-container orchestration\n10. [SQL Refresher](10-sql-refresher.md) - SQL joins, aggregations, and queries\n11. [Cleanup](11-cleanup.md) - Cleaning up Docker resources\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/pipeline/.python-version",
    "content": "3.13\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/pipeline/Dockerfile",
    "content": "FROM python:3.13.11-slim\nCOPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/\n\nWORKDIR /code\nENV PATH=\"/code/.venv/bin:$PATH\"\n\nCOPY pyproject.toml .python-version uv.lock ./\nRUN uv sync --locked\n\nCOPY ingest_data.py .\n\nENTRYPOINT [\"python\", \"ingest_data.py\"]"
  },
  {
    "path": "01-docker-terraform/docker-sql/pipeline/docker-compose.yaml",
    "content": "services:\n  pgdatabase:\n    image: postgres:18\n    environment:\n      POSTGRES_USER: \"root\"\n      POSTGRES_PASSWORD: \"root\"\n      POSTGRES_DB: \"ny_taxi\"\n    volumes:\n      - ny_taxi_postgres_data:/var/lib/postgresql\n    ports:\n      - \"5432:5432\"\n\n  pgadmin:\n    image: dpage/pgadmin4\n    environment:\n      PGADMIN_DEFAULT_EMAIL: \"admin@admin.com\"\n      PGADMIN_DEFAULT_PASSWORD: \"root\"\n    volumes:\n      - pgadmin_data:/var/lib/pgadmin\n    ports:\n      - \"8085:80\"\n\n\n\nvolumes:\n  ny_taxi_postgres_data:\n  pgadmin_data:\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-ingest.sh",
    "content": "#!/usr/bin/env bash\n\n## bash script to run the ingestion container\necho \"Running data ingestion for January 2021...\"\n\ndocker run -it --rm \\\n  --network=pg-network \\\n  taxi_ingest:v001 \\\n  --year=2021 \\\n  --month=1 \\\n  --pg-user=root \\\n  --pg-pass=root \\\n  --pg-host=pgdatabase \\\n  --pg-port=5432 \\\n  --pg-db=ny_taxi \\\n  --chunksize=100000 \\\n  --target-table=yellow_taxi_trips"
  },
  {
    "path": "01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-pgadmin.sh",
    "content": "#!/usr/bin/env bash\n\n## bash script to start pgadmin\necho \"Starting pgAdmin container...\"\nmkdir -p ../pgadmin_data\n\ndocker run -it \\\n  -e PGADMIN_DEFAULT_EMAIL=\"admin@admin.com\" \\\n  -e PGADMIN_DEFAULT_PASSWORD=\"root\" \\\n  -v ../pgadmin_data:/var/lib/pgadmin \\\n  -p 8085:80 \\\n  --network=pg-network \\\n  --name pgadmin \\\n  dpage/pgadmin4"
  },
  {
    "path": "01-docker-terraform/docker-sql/pipeline/docker-helper-scripts/docker-postgres.sh",
    "content": "#!/usr/bin/env bash\n\n## bash script to start the Postgres container\nmkdir -p ../ny_taxi_postgres_data\n\necho \"Starting PostgreSQL container...\"\n\ndocker run -it \\\n  -e POSTGRES_USER=\"root\" \\\n  -e POSTGRES_PASSWORD=\"root\" \\\n  -e POSTGRES_DB=\"ny_taxi\" \\\n  -v ../ny_taxi_postgres_data:/var/lib/postgresql \\\n  -p 5432:5432 \\\n  --network=pg-network \\\n  --name pgdatabase \\\n  postgres:18\n\n# to use the pgcli\n# pgcli -h localhost -p 5432 -u root -d ny_taxi"
  },
  {
    "path": "01-docker-terraform/docker-sql/pipeline/ingest_data.py",
    "content": "#!/usr/bin/env python\n# coding: utf-8\n\nimport click\nimport pandas as pd\nfrom sqlalchemy import create_engine\nfrom tqdm.auto import tqdm\n\ndtype = {\n    \"VendorID\": \"Int64\",\n    \"passenger_count\": \"Int64\",\n    \"trip_distance\": \"float64\",\n    \"RatecodeID\": \"Int64\",\n    \"store_and_fwd_flag\": \"string\",\n    \"PULocationID\": \"Int64\",\n    \"DOLocationID\": \"Int64\",\n    \"payment_type\": \"Int64\",\n    \"fare_amount\": \"float64\",\n    \"extra\": \"float64\",\n    \"mta_tax\": \"float64\",\n    \"tip_amount\": \"float64\",\n    \"tolls_amount\": \"float64\",\n    \"improvement_surcharge\": \"float64\",\n    \"total_amount\": \"float64\",\n    \"congestion_surcharge\": \"float64\"\n}\n\nparse_dates = [\n    \"tpep_pickup_datetime\",\n    \"tpep_dropoff_datetime\"\n]\n\n\n@click.command()\n@click.option('--pg-user', default='root', help='PostgreSQL user')\n@click.option('--pg-pass', default='root', help='PostgreSQL password')\n@click.option('--pg-host', default='localhost', help='PostgreSQL host')\n@click.option('--pg-port', default=5432, type=int, help='PostgreSQL port')\n@click.option('--pg-db', default='ny_taxi', help='PostgreSQL database name')\n@click.option('--year', default=2021, type=int, help='Year of the data')\n@click.option('--month', default=1, type=int, help='Month of the data')\n@click.option('--target-table', default='yellow_taxi_data', help='Target table name')\n@click.option('--chunksize', default=100000, type=int, help='Chunk size for reading CSV')\ndef run(pg_user, pg_pass, pg_host, pg_port, pg_db, year, month, target_table, chunksize):\n    \"\"\"Ingest NYC taxi data into PostgreSQL database.\"\"\"\n    prefix = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow'\n    url = f'{prefix}/yellow_tripdata_{year}-{month:02d}.csv.gz'\n\n    engine = create_engine(f'postgresql+psycopg://{pg_user}:{pg_pass}@{pg_host}:{pg_port}/{pg_db}')\n\n    df_iter = pd.read_csv(\n        url,\n     
   dtype=dtype,\n        parse_dates=parse_dates,\n        iterator=True,\n        chunksize=chunksize,\n    )\n\n    first = True\n\n    for df_chunk in tqdm(df_iter):\n        if first:\n            df_chunk.head(0).to_sql(\n                name=target_table,\n                con=engine,\n                if_exists='replace'\n            )\n            first = False\n\n        df_chunk.to_sql(\n            name=target_table,\n            con=engine,\n            if_exists='append'\n        )\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "01-docker-terraform/docker-sql/pipeline/pyproject.toml",
    "content": "[project]\nname = \"pipeline\"\nversion = \"0.1.0\"\ndescription = \"Add your description here\"\nreadme = \"README.md\"\nrequires-python = \">=3.13\"\ndependencies = [\n    \"click>=8.3.1\",\n    \"pandas>=2.3.3\",\n    \"psycopg2-binary>=2.9.11\",\n    \"pyarrow>=22.0.0\",\n    \"sqlalchemy>=2.0.44\",\n    \"tqdm>=4.67.1\",\n]\n\n[dependency-groups]\ndev = [\n    \"jupyter>=1.1.1\",\n    \"pgcli>=4.3.0\",\n]\n"
  },
  {
    "path": "01-docker-terraform/terraform/1_terraform_overview.md",
    "content": "## Terraform Overview\n\n[Video](https://www.youtube.com/watch?v=18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=2)\n\n### Concepts\n\n#### Introduction\n\n1. What is [Terraform](https://www.terraform.io)?\n   * open-source tool by [HashiCorp](https://www.hashicorp.com), used for provisioning infrastructure resources\n   * supports DevOps best practices for change management\n   * Managing configuration files in source control to maintain an ideal provisioning state \n     for testing and production environments\n2. What is IaC?\n   * Infrastructure-as-Code\n   * build, change, and manage your infrastructure in a safe, consistent, and repeatable way \n     by defining resource configurations that you can version, reuse, and share.\n3. Some advantages\n   * Infrastructure lifecycle management\n   * Version control commits\n   * Very useful for stack-based deployments, and with cloud providers such as AWS, GCP, Azure, K8S…\n   * State-based approach to track resource changes throughout deployments\n\n\n#### Files\n\n* `main.tf`\n* `variables.tf`\n* Optional: `resources.tf`, `output.tf`\n* `.tfstate`\n\n#### Declarations\n* `terraform`: configure basic Terraform settings to provision your infrastructure\n   * `required_version`: minimum Terraform version to apply to your configuration\n   * `backend`: stores Terraform's \"state\" snapshots, to map real-world resources to your configuration.\n      * `local`: stores state file locally as `terraform.tfstate`\n   * `required_providers`: specifies the providers required by the current module\n* `provider`:\n   * adds a set of resource types and/or data sources that Terraform can manage\n   * The Terraform Registry is the main directory of publicly available providers from most major infrastructure platforms.\n* `resource`\n  * blocks to define components of your infrastructure\n  * Project modules/resources: google_storage_bucket, google_bigquery_dataset, google_bigquery_table\n* `variable` & 
`locals`\n  * runtime arguments and constants\n\n\n#### Execution steps\n1. `terraform init`:\n    * Initializes and configures the backend, installs plugins/providers, and checks out an existing configuration from version control\n2. `terraform plan`:\n    * Compares local changes against the remote state and proposes an execution plan.\n3. `terraform apply`:\n    * Asks for approval of the proposed plan and applies the changes to the cloud\n4. `terraform destroy`\n    * Removes your stack from the cloud\n\n\n### Terraform Workshop to create GCP Infra\nContinue [here](./terraform): `week_1_basics_n_setup/1_terraform_gcp/terraform`\n\n\n### References\nhttps://learn.hashicorp.com/collections/terraform/gcp-get-started\n"
  },
  {
    "path": "01-docker-terraform/terraform/2_gcp_overview.md",
    "content": "## GCP Overview\n\n[Video](https://www.youtube.com/watch?v=18jIzE41fJ4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=2)\n\n\n### Project infrastructure modules in GCP:\n* Google Cloud Storage (GCS): Data Lake\n* BigQuery: Data Warehouse\n\n(Concepts explained in Week 2 - Data Ingestion)\n\n### Initial Setup\n\nFor this course, we'll use a free version (upto EUR 300 credits). \n\n1. Create an account with your Google email ID \n2. Setup your first [project](https://console.cloud.google.com/) if you haven't already\n    * eg. \"DTC DE Course\", and note down the \"Project ID\" (we'll use this later when deploying infra with TF)\n3. Setup [service account & authentication](https://cloud.google.com/docs/authentication/getting-started) for this project\n    * Grant `Viewer` role to begin with.\n    * Download service-account-keys (.json) for auth.\n4. Download [SDK](https://cloud.google.com/sdk/docs/quickstart) for local setup\n5. Set environment variable to point to your downloaded GCP keys:\n   ```shell\n   export GOOGLE_APPLICATION_CREDENTIALS=\"<path/to/your/service-account-authkeys>.json\"\n   \n   # Refresh token/session, and verify authentication\n   gcloud auth application-default login\n   ```\n   \n### Setup for Access\n \n1. [IAM Roles](https://cloud.google.com/storage/docs/access-control/iam-roles) for Service account:\n   * Go to the *IAM* section of *IAM & Admin* https://console.cloud.google.com/iam-admin/iam\n   * Click the *Edit principal* icon for your service account.\n   * Add these roles in addition to *Viewer* : **Storage Admin** + **Storage Object Admin** + **BigQuery Admin**\n   \n2. Enable these APIs for your project:\n   * https://console.cloud.google.com/apis/library/iam.googleapis.com\n   * https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com\n   \n3. 
Please ensure `GOOGLE_APPLICATION_CREDENTIALS` env-var is set.\n   ```shell\n   export GOOGLE_APPLICATION_CREDENTIALS=\"<path/to/your/service-account-authkeys>.json\"\n   ```\n \n### Terraform Workshop to create GCP Infra\nContinue [here](./terraform): `week_1_basics_n_setup/1_terraform_gcp/terraform`\n"
  },
  {
    "path": "01-docker-terraform/terraform/README.md",
    "content": "## Local Setup for Terraform and GCP\n\n### Pre-Requisites\n1. Terraform client installation: https://www.terraform.io/downloads\n2. Cloud Provider account: https://console.cloud.google.com/ \n\n### Terraform Concepts\n[Terraform Overview](1_terraform_overview.md)\n\n### GCP setup\n\n1. [Setup for First-time](2_gcp_overview.md#initial-setup)\n    * [Only for Windows](windows.md) - Steps 4 & 5\n2. [IAM / Access specific to this course](2_gcp_overview.md#setup-for-access)\n\n### Terraform Workshop for GCP Infra\nYour setup is ready!\nNow head to the [terraform](terraform) directory, and perform the execution steps to create your infrastructure.\n"
  },
  {
    "path": "01-docker-terraform/terraform/terraform/README.md",
    "content": "### Concepts\n* [Terraform_overview](../1_terraform_overview.md)\n* If you were unable to generate a service account keyfile due to organizational policies, refer to the instructions [below](#fallback)\n\n### Execution\n\n```shell\n# Refresh service-account's auth-token for this session\ngcloud auth application-default login\n\n# Initialize state file (.tfstate)\nterraform init\n\n# Check changes to new infra plan\nterraform plan -var=\"project=<your-gcp-project-id>\"\n```\n\n```shell\n# Create new infra\nterraform apply -var=\"project=<your-gcp-project-id>\"\n```\n\n```shell\n# Delete infra after your work, to avoid costs on any running services\nterraform destroy\n```\n\n### Warning\nRemember to use a [proper gitignore](https://github.com/github/gitignore/blob/main/Terraform.gitignore) file before publishing your code on GitHub\n\n### Fallback\n1. Give yourself the token creator role on the pertinent service account\n    ```bash\n    gcloud iam service-accounts add-iam-policy-binding \\\n        <SERVICE_ACCOUNT_EMAIL> \\\n        --member=\"user:YOUR_EMAIL@gmail.com\" \\\n        --role=\"roles/iam.serviceAccountTokenCreator\"\n    ```\n2. 
Add the sections below the first block to your main terraform configuration\n   ```terraform\n    # Connect to gcp using ADC (identity verification)\n    provider \"google\" {\n      project = var.project\n      region  = var.region\n      zone    = var.zone\n    }\n\n    /* add these data blocks */\n    \n    # This data source gets a temporary token for the service account\n    data \"google_service_account_access_token\" \"default\" {\n      provider               = google\n      target_service_account = \"<SERVICE_ACCOUNT_EMAIL>\"\n      scopes                 = [\"https://www.googleapis.com/auth/cloud-platform\"]\n      lifetime               = \"3600s\"\n    }\n    \n    # This second provider block uses that temporary token and does the real work\n    provider \"google\" {\n      alias        = \"impersonated\"\n      access_token = data.google_service_account_access_token.default.access_token\n      project      = var.project\n      region       = var.region\n      zone         = var.zone\n    }\n   ```\n\n3. Now, you can follow the instructions [above](#execution)\n"
  },
  {
    "path": "01-docker-terraform/terraform/terraform/terraform_basic/main.tf",
    "content": "terraform {\n  required_providers {\n    google = {\n      source  = \"hashicorp/google\"\n      version = \"4.51.0\"\n    }\n  }\n}\n\nprovider \"google\" {\n# Credentials only needs to be set if you do not have the GOOGLE_APPLICATION_CREDENTIALS set\n#  credentials = \n  project = \"<Your Project ID>\"\n  region  = \"us-central1\"\n}\n\n\n\nresource \"google_storage_bucket\" \"data-lake-bucket\" {\n  name          = \"<Your Unique Bucket Name>\"\n  location      = \"US\"\n\n  # Optional, but recommended settings:\n  storage_class = \"STANDARD\"\n  uniform_bucket_level_access = true\n\n  versioning {\n    enabled     = true\n  }\n\n  lifecycle_rule {\n    action {\n      type = \"Delete\"\n    }\n    condition {\n      age = 30  // days\n    }\n  }\n\n  force_destroy = true\n}\n\n\nresource \"google_bigquery_dataset\" \"dataset\" {\n  dataset_id = \"<The Dataset Name You Want to Use>\"\n  project    = \"<Your Project ID>\"\n  location   = \"US\"\n}"
  },
  {
    "path": "01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/README.md",
    "content": "# AWS Terraform Data Lake (GCP Equivalent)\n\n## 📌 Overview\n\nThis repository contains an **AWS-based Terraform implementation** that mirrors the **Google Cloud Platform (GCP)** infrastructure used in the Data Engineering course (e.g. GCS + BigQuery), but implemented using **AWS services**.\n\nThe goal is to help learners who:\n- Are enrolled in a **GCP-focused Data Engineering course**\n- Prefer or need to work with **AWS**\n- Want to understand **cloud-agnostic data engineering concepts**\n\nThis setup focuses on building a **basic data lake foundation** using:\n- **Amazon S3** (equivalent to GCS)\n- **AWS Glue Data Catalog** (equivalent to BigQuery datasets / metadata layer)\n- **Terraform** as Infrastructure as Code (IaC)\n\n---\n\n## 🏗️ Architecture Mapping (GCP → AWS)\n\n| GCP Service | AWS Equivalent | Purpose |\n|------------|---------------|---------|\n| Google Cloud Storage (GCS) | Amazon S3 | Data Lake storage |\n| Uniform Bucket Level Access | S3 Public Access Block | Secure bucket access |\n| Object Lifecycle Rules | S3 Lifecycle Configuration | Automatic data expiration |\n| BigQuery Dataset | AWS Glue Catalog Database | Metadata & query layer |\n| Terraform (GCP provider) | Terraform (AWS provider) | Infrastructure as Code |\n\n---\n\n## 📁 Project Structure\n\n```text\n.\n├── main.tf            # Core infrastructure resources\n├── variables.tf       # Input variable definitions\n├── terraform.tfvars   # Environment-specific values\n└── README.md          # Project documentation\n"
  },
  {
    "path": "01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/main.tf",
    "content": "terraform {\n    required_providers {\n        aws = {\n            source  = \"hashicorp/aws\"\n            version = \"~> 5.0\"\n        }\n    }\n}\n\nprovider \"aws\" {\n    region = var.aws_region\n}\n\n#S3 Bucket to store data equivalent to GCS Bucket in GCP\nresource \"aws_s3_bucket\" \"data_lake_bucket\" {\n  bucket        = var.bucket_name\n  force_destroy = true\n}\n\n#Bucket verisioning\nresource \"aws_s3_bucket_versioning\" \"versioning\" {\n  bucket = aws_s3_bucket.data_lake_bucket.id # Reference the S3 bucket created above\n\n  versioning_configuration {\n    status = \"Enabled\" # Enable versioning\n  }\n}\n\n# \"Uniform bucket level access\" ~ control prin policy/ACL; recomandat: block public access\nresource \"aws_s3_bucket_public_access_block\" \"block_public_access\" {\n  bucket = aws_s3_bucket.data_lake_bucket.id\n\n  block_public_acls       = true\n  block_public_policy     = true\n  ignore_public_acls      = true\n  restrict_public_buckets = true\n}\n\n# Lifecycle: delete objects older than 30 days (echivalent lifecycle_rule age=30)\nresource \"aws_s3_bucket_lifecycle_configuration\" \"lifecycle_rules\" {\n  bucket = aws_s3_bucket.data_lake_bucket.id\n\n  rule {\n    id     = \"Delete_old_older_than_30_days\"\n    status = \"Enabled\"\n\n    expiration {\n      days = 30\n    }\n    filter {\n      prefix = \"\" # Apply to all objects in the bucket\n    }\n  }\n}\n\nresource \"aws_glue_catalog_database\" \"dataset\" {\n  name = var.dataset_name\n}\n"
  },
  {
    "path": "01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/terraform.tfvars",
    "content": "bucket_name  = \"my-unique-data-lake-bucket-12345\"\ndataset_name = \"ny_taxi_dataset\"\n"
  },
  {
    "path": "01-docker-terraform/terraform/terraform/terraform_with_variable_AWS/variables.tf",
    "content": "# Specifies the geographic location for AWS resource deployment.\n# Defaulting to Stockholm (eu-north-1) to keep latency low for European users.\nvariable \"aws_region\" {\n  description = \"AWS region to deploy resources in\"\n  type        = string\n  default     = \"eu-north-1\"\n}\n\n# The unique identifier for the S3 bucket where raw data will be stored.\n# S3 bucket names must be globally unique across all AWS accounts.\nvariable \"bucket_name\" {\n  description = \"Name of the S3 bucket\"\n  type        = string\n  default     = \"data-engineering-zoomcamp-1568692036\"\n}\n\n# Defines the logical grouping for metadata in the AWS Glue Catalog.\n# This allows tools like Athena to query the S3 data using SQL.\nvariable \"dataset_name\" {\n  description = \"Glue Catalog database name (logical dataset for Athena/Glue)\"\n  type        = string\n  default     = \"ny_taxi_database\"\n}\n"
  },
  {
    "path": "01-docker-terraform/terraform/terraform/terraform_with_variables/main.tf",
    "content": "terraform {\n  required_providers {\n    google = {\n      source  = \"hashicorp/google\"\n      version = \"5.6.0\"\n    }\n  }\n}\n\nprovider \"google\" {\n  credentials = file(var.credentials)\n  project     = var.project\n  region      = var.region\n}\n\n\nresource \"google_storage_bucket\" \"demo-bucket\" {\n  name          = var.gcs_bucket_name\n  location      = var.location\n  force_destroy = true\n\n\n  lifecycle_rule {\n    condition {\n      age = 1\n    }\n    action {\n      type = \"AbortIncompleteMultipartUpload\"\n    }\n  }\n}\n\n\n\nresource \"google_bigquery_dataset\" \"demo_dataset\" {\n  dataset_id = var.bq_dataset_name\n  location   = var.location\n}"
  },
  {
    "path": "01-docker-terraform/terraform/terraform/terraform_with_variables/variables.tf",
    "content": "variable \"credentials\" {\n  description = \"My Credentials\"\n  default     = \"<Path to your Service Account json file>\"\n  #ex: if you have a directory where this file is called keys with your service account json file\n  #saved there as my-creds.json you could use default = \"./keys/my-creds.json\"\n}\n\n\nvariable \"project\" {\n  description = \"Project\"\n  default     = \"<Your Project ID>\"\n}\n\nvariable \"region\" {\n  description = \"Region\"\n  #Update the below to your desired region\n  default     = \"us-central1\"\n}\n\nvariable \"location\" {\n  description = \"Project Location\"\n  #Update the below to your desired location\n  default     = \"US\"\n}\n\nvariable \"bq_dataset_name\" {\n  description = \"My BigQuery Dataset Name\"\n  #Update the below to what you want your dataset to be called\n  default     = \"demo_dataset\"\n}\n\nvariable \"gcs_bucket_name\" {\n  description = \"My Storage Bucket Name\"\n  #Update the below to a unique bucket name\n  default     = \"terraform-demo-terra-bucket\"\n}\n\nvariable \"gcs_storage_class\" {\n  description = \"Bucket Storage Class\"\n  default     = \"STANDARD\"\n}"
  },
  {
    "path": "01-docker-terraform/terraform/windows.md",
    "content": "## GCP and Terraform on Windows\n\nYou don't need these instructions if you use WSL. They are only for \"plain Windows\".\n\n### Google Cloud SDK\n\n* For this tutorial, you'll need a Linux-like environment, e.g. [GitBash](https://gitforwindows.org/), [MinGW](https://www.mingw-w64.org/) or [cygwin](https://www.cygwin.com/)\n  * PowerShell should also work, but will require adjustments\n* Download the SDK as a zip: https://dl.google.com/dl/cloudsdk/channels/rapid/google-cloud-sdk.zip\n  * source: https://cloud.google.com/sdk/docs/downloads-interactive\n* Unzip it and run the `install.sh` script\n\nWhen installing it, you might see something like this:\n\n```\nThe installer is unable to automatically update your system PATH. Please add\n  C:\\tools\\google-cloud-sdk\\bin\n```\n\n* To fix that, adjust your `.bashrc` to include this in `PATH` ([instructions](https://unix.stackexchange.com/questions/26047/how-to-correctly-add-a-path-to-path))\n* You can also do it system-wide ([instructions](https://gist.github.com/nex3/c395b2f8fd4b02068be37c961301caa7))\n\nNow we need to point it to the correct Python installation. Assuming you use [Anaconda](https://www.anaconda.com/products/individual):\n\n```bash\nexport CLOUDSDK_PYTHON=~/Anaconda3/python\n```\n\nNow let's check that it works:\n\n```bash\n$ gcloud version\nGoogle Cloud SDK 367.0.0\nbq 2.0.72\ncore 2021.12.10\ngsutil 5.5\n```\n\n### Google Cloud SDK Authentication \n\n* Now create a service account and generate keys as shown in the videos\n* Download the key and put it in some location, e.g. 
`~/.gc/ny-rides.json`\n* Set `GOOGLE_APPLICATION_CREDENTIALS` to point to the file\n\n```bash\nexport GOOGLE_APPLICATION_CREDENTIALS=~/.gc/ny-rides.json\n```\n\nNow authenticate: \n\n```bash\ngcloud auth activate-service-account --key-file \"$GOOGLE_APPLICATION_CREDENTIALS\"\n```\n\nAlternatively, you can authenticate using OAuth as shown in the video:\n\n```bash\ngcloud auth application-default login\n```\n\nIf you get a `quota exceeded` warning like this:\n\n> WARNING:\n> Cannot find a quota project to add to ADC. You might receive a \"quota exceeded\" or \"API not enabled\" error. \n> Run `$ gcloud auth application-default set-quota-project` to add a quota project.\n\nThen run this:\n\n```bash\nPROJECT_NAME=\"ny-rides-alexey\"\ngcloud auth application-default set-quota-project ${PROJECT_NAME}\n```\n\n\n### Terraform \n\n* [Download Terraform](https://www.terraform.io/downloads)\n* Put it in a folder that is on your [PATH](https://gist.github.com/nex3/c395b2f8fd4b02068be37c961301caa7)\n* Go to the location with the Terraform files and initialize it\n\n```bash\nterraform init\n```\n\nOptionally, you can configure your Terraform files (`variables.tf`) to include your project ID:\n\n```hcl\nvariable \"project\" {\n  description = \"Your GCP Project ID\"\n  default = \"ny-rides-alexey\"\n  type = string\n}\n```\n\n* Now [follow the instructions](1_terraform_overview.md#execution-steps)\n  * Run `terraform plan`\n  * Next, run `terraform apply`\n\nIf you get an error like this:\n\n> Error: googleapi: Error 403: terraform@ny-rides-alexey.iam.gserviceaccount.com does not have\n> storage.buckets.create access to the Google Cloud project., forbidden\n\n\nThen you need to grant your service account the required permissions. Make sure you follow the instructions in the videos.\n\n* You can also use [this file](https://docs.google.com/document/d/e/2PACX-1vSZapy7gIj0TP-EFzub2OpAlAkuifGEVJ4XpkA1RvxZ45NjiQi29b6OhLuetdXXHWAn2lbbKxnbzMdd/pub), but it doesn't list all the required permissions\n"
  },
  {
    "path": "02-workflow-orchestration/README.md",
    "content": "# Workflow Orchestration\n\nWelcome to Module 2 of the Data Engineering Zoomcamp! This week, we’ll dive into workflow orchestration using [Kestra](https://go.kestra.io/de-zoomcamp/github). \n\nKestra is an open-source, event-driven orchestration platform that simplifies building both scheduled and event-driven workflows. By adopting Infrastructure as Code practices for data and process orchestration, Kestra enables you to build reliable workflows with just a few lines of YAML.\n\n> [!NOTE]  \n>You can find all videos for this week in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist).\n\n---\n\n## Course Structure\n\n- [2.1 - Introduction to Workflow Orchestration](#21-introduction-to-workflow-orchestration)\n- [2.2 - Getting Started With Kestra](#22-getting-started-with-kestra)\n- [2.3 - Hands-On Coding Project: Build ETL Data Pipelines with Kestra](#23-hands-on-coding-project-build-data-pipelines-with-kestra)\n- [2.4 - ELT Pipelines in Kestra: Google Cloud Platform](#24-elt-pipelines-in-kestra-google-cloud-platform)\n- [2.5 - Using AI for Data Engineering in Kestra](#25-using-ai-for-data-engineering-in-kestra)\n- [2.6 - Bonus](#26-bonus-deploy-to-the-cloud-optional)\n\n\n## 2.1 Introduction to Workflow Orchestration\n\nIn this section, you’ll learn the foundations of workflow orchestration, its importance, and how Kestra fits into the orchestration landscape.\n\n### 2.1.1 - What is Workflow Orchestration?\n  \nThink of a music orchestra. There's a variety of different instruments. Some more than others, all with different roles when it comes to playing music. To make sure they all come together at the right time, they follow a conductor who helps the orchestra to play together. \n\nNow replace the instruments with tools and the conductor with an orchestrator. We often have multiple tools and platforms that we need to work together. Sometimes on a routine schedule, other times based on events that happen. 
That's where the orchestrator comes in to help all of these tools work together.\n\nA workflow orchestrator might do the following tasks:\n- Run workflows which contain a number of predefined steps\n- Monitor and log errors, as well as taking a number of extra steps when they occur\n- Automatically run workflows based on schedules and events\n\nIn data engineering, you often need to move data from one place, to another, sometimes with some modifications made to the data in the middle. This is where a workflow orchestrator can help out by managing these steps, while giving us visibility into it at the same time. \n\nIn this module, we're going to build our own data pipeline using ETL (Extract, Transform Load) with Kestra at the core of the operation, but first we need to understand a bit more about how Kestra works before we can get building! \n\n#### Videos\n- **2.1.1 - What is Workflow Orchestration?**  \n  [![2.1.1 - What is Workflow Orchestration?](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F-JLnp-iLins)](https://youtu.be/-JLnp-iLins)\n\n\n### 2.1.2 - What is Kestra?\n\nKestra is an open-source, infinitely-scalable orchestration platform that enables all engineers to manage business-critical workflows. 
\n\nKestra is a great choice for workflow orchestration:\n- Build with Flow code (YAML), No-code or with the AI Copilot - flexibility in how you build your workflows\n- 1000+ Plugins - integrate with all the tools you use\n- Support for any programming language - pick the right tool for the job\n- Schedule or Event Based Triggers - have your workflows respond to data\n\n#### Videos\n\n- **2.1.2 - What is Kestra?**  \n  [![2.1.2 - What is Kestra?](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FZvVN_NmB_1s)](https://youtu.be/ZvVN_NmB_1s)\n\n#### Resources\n- [Quickstart Guide](https://go.kestra.io/de-zoomcamp/quickstart)\n- [What is an Orchestrator?](https://go.kestra.io/de-zoomcamp/what-is-an-orchestrator)\n\n---\n\n## 2.2 Getting Started with Kestra\n\nIn this section, you'll learn how to install Kestra, as well as the key concepts required to build your first workflow. Once our first workflow is built, we can extend this further by executing a Python script inside of a workflow. \n\nYou will:\n1. Install Kestra using Docker Compose\n2. Learn the concepts of Kestra to build your first workflow\n3. Execute a Python script inside of a Kestra Flow\n\n### 2.2.1 - Installing Kestra\n\nTo install Kestra, we are going to use Docker Compose. We already have a Postgres database set up, along with pgAdmin from Module 1. We can continue to use these with Kestra but we'll need to make a few modifications to our Docker Compose file.\n\nUse [this example Docker Compose file](docker-compose.yml) to add the 2 new services and set up the volumes correctly.\n\nWe'll run Kestra with Docker Compose, using one container for the Kestra server and another for the Postgres database:\n\n```bash\ncd 02-workflow-orchestration\ndocker compose up -d\n```\n\n**Note:** Check that `pgAdmin` isn't running on the same ports as Kestra. 
If so, check out the [FAQ](#troubleshooting-tips) at the bottom of the README.\n\nOnce the container starts, you can access the Kestra UI at [http://localhost:8080](http://localhost:8080).\n\nTo shut down Kestra, go to the same directory and run the following command:\n\n```bash\ndocker compose down\n```\n#### Add Flows to Kestra\n\nFlows can be added to Kestra by copying and pasting the YAML directly into the editor, or by adding via Kestra's API. See below for adding programmatically.\n\n<details>\n<summary>Add Flows to Kestra programmatically</summary>\n\nIf you prefer to add flows programmatically using Kestra's API, run the following commands:\n\n```bash\n# Import all flows: assuming username admin@kestra.io and password Admin1234! (adjust to match your username and password)\ncurl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/01_hello_world.yaml\ncurl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/02_python.yaml\ncurl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/03_getting_started_data_pipeline.yaml\ncurl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/04_postgres_taxi.yaml\ncurl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/05_postgres_taxi_scheduled.yaml\ncurl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/06_gcp_kv.yaml\ncurl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/07_gcp_setup.yaml\ncurl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/08_gcp_taxi.yaml\ncurl -X POST -u 'admin@kestra.io:Admin1234!' 
http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/09_gcp_taxi_scheduled.yaml\ncurl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/10_chat_without_rag.yaml\ncurl -X POST -u 'admin@kestra.io:Admin1234!' http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/11_chat_with_rag.yaml\n```\n</details>\n\n#### Videos\n\n- **2.2.1 - Installing Kestra**  \n  [![2.2.1 - Installing Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FwgPxC4UjoLM)](https://youtu.be/wgPxC4UjoLM)\n\n#### Resources\n- [Install Kestra with Docker Compose](https://go.kestra.io/de-zoomcamp/docker-compose)\n\n\n### 2.2.2 - Kestra Concepts\n\nTo start building workflows in Kestra, we need to understand a number of concepts.\n- [Flow](https://go.kestra.io/de-zoomcamp/flow) - a container for tasks and their orchestration logic. \n- [Tasks](https://go.kestra.io/de-zoomcamp/tasks) - the steps within a flow.\n- [Inputs](https://go.kestra.io/de-zoomcamp/inputs) - dynamic values passed to the flow at runtime.\n- [Outputs](https://go.kestra.io/de-zoomcamp/outputs) - pass data between tasks and flows.\n- [Triggers](https://go.kestra.io/de-zoomcamp/triggers) - mechanism that automatically starts the execution of a flow.\n- [Execution](https://go.kestra.io/de-zoomcamp/execution) - a single run of a flow with a specific state.\n- [Variables](https://go.kestra.io/de-zoomcamp/variables) - key–value pairs that let you reuse values across tasks.\n- [Plugin Defaults](https://go.kestra.io/de-zoomcamp/plugin-defaults) - default values applied to every task of a given type within one or more flows.\n- [Concurrency](https://go.kestra.io/de-zoomcamp/concurrency) - control how many executions of a flow can run at the same time.\n\nWhile there are more concepts used for building powerful workflows, these are the ones we're going to use to build our data pipelines.\n\nThe flow 
[`01_hello_world.yaml`](flows/01_hello_world.yaml) showcases all of these concepts inside of one workflow:\n- The flow has 5 tasks: 3 log tasks, a return task, and a sleep task.\n- The flow takes an input called `name`.\n- There is a variable that takes the `name` input to generate a full welcome message.\n- An output is generated from the return task and is logged in a later log task.\n- There is a trigger to execute this flow every day at 10am.\n- Plugin Defaults are used to make the log tasks send their messages as `ERROR` level.\n- We have a concurrency limit of 2 executions. Any further ones made while 2 are running will fail.\n\n#### Videos\n- **2.2.2 - Kestra Concepts**  \n  [![2.2.2 - Kestra Concepts](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FMNOKVx8780E)](https://youtu.be/MNOKVx8780E)\n\n#### Resources\n- [Tutorial](https://go.kestra.io/de-zoomcamp/tutorial)\n- [Workflow Components Documentation](https://go.kestra.io/de-zoomcamp/workflow-components)\n\n### 2.2.3 - Orchestrate Python Code\n\nNow that we've built our first workflow, we can take it a step further by adding Python code into our flow. In Kestra, we can run Python code from a dedicated file or write it directly inside of our workflow.\n\nWhile Kestra has a huge variety of plugins available for building your workflows, you also have the option to write your own code and have Kestra execute that based on schedules or events. This means you can pick the right tools for your pipelines, rather than being limited to a fixed set. \n\nIn our example Python workflow, [`02_python.yaml`](flows/02_python.yaml), our code fetches the number of Docker image pulls from DockerHub and returns it as an output to Kestra. 
This is useful as we can access this output with other tasks, even though it was generated inside of our Python script.\n\n#### Videos\n- **2.2.3 - Orchestrate Python Code**  \n  [![2.2.3 - Orchestrate Python Code](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FVAHm0R_XjqI)](https://youtu.be/VAHm0R_XjqI)\n\n#### Resources\n- [How-to Guide: Python](https://go.kestra.io/de-zoomcamp/python)\n\n\n## 2.3 Hands-On Coding Project: Build Data Pipelines with Kestra\n\nNext, we're gonna build ETL pipelines for Yellow and Green Taxi data from NYC’s Taxi and Limousine Commission (TLC). You will:\n1. Extract data from [CSV files](https://github.com/DataTalksClub/nyc-tlc-data/releases).\n2. Load it into Postgres or Google Cloud (GCS + BigQuery).\n3. Explore scheduling and backfilling workflows.\n\n### 2.3.1 Getting Started Pipeline\n\nThis introductory flow is added just to demonstrate a simple data pipeline which extracts data via HTTP REST API, transforms that data in Python and then queries it using DuckDB. For this stage, a new separate Postgres database is created for the exercises. \n\n\n```mermaid\ngraph LR\n  Extract[Extract Data via HTTP REST API] --> Transform[Transform Data in Python]\n  Transform --> Query[Query Data with DuckDB]\n```\n\nAdd the flow [`03_getting_started_data_pipeline.yaml`](flows/03_getting_started_data_pipeline.yaml) from the UI if you haven't already and execute it to see the results. 
Inspect the Gantt and Logs tabs to understand the flow execution.\n\n#### Videos\n\n- **2.3.1 - Getting Started Pipeline**   \n  [![Create an ETL Pipeline with Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F-KmwrCqRhic)](https://youtu.be/-KmwrCqRhic)\n\n#### Resources\n- [ETL Tutorial Video](https://go.kestra.io/de-zoomcamp/etl-tutorial)\n- [ETL in 3 Minutes](https://go.kestra.io/de-zoomcamp/etl-get-started)\n\n### 2.3.2 Local DB: Load Taxi Data to Postgres\n\nBefore we start loading data to GCP, we'll first play with the Yellow and Green Taxi data using a local Postgres database running in a Docker container. We will use the same database from Module 1 which should be in the same Docker Compose file as Kestra.\n\nThe flow will extract CSV data partitioned by year and month, create tables, load data to the monthly table, and finally merge the data to the final destination table.\n\n```mermaid\ngraph LR\n  Start[Select Year & Month] --> SetLabel[Set Labels]\n  SetLabel --> Extract[Extract CSV Data]\n  Extract -->|Taxi=Yellow| YellowFinalTable[Create Yellow Final Table]:::yellow\n  Extract -->|Taxi=Green| GreenFinalTable[Create Green Final Table]:::green\n  YellowFinalTable --> YellowMonthlyTable[Create Yellow Monthly Table]:::yellow\n  GreenFinalTable --> GreenMonthlyTable[Create Green Monthly Table]:::green\n  YellowMonthlyTable --> YellowCopyIn[Load Data to Monthly Table]:::yellow\n  GreenMonthlyTable --> GreenCopyIn[Load Data to Monthly Table]:::green\n  YellowCopyIn --> YellowMerge[Merge Yellow Data]:::yellow\n  GreenCopyIn --> GreenMerge[Merge Green Data]:::green\n\n  classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px,color:#000;\n  classDef green fill:#32CD32,stroke:#000,stroke-width:1px,color:#000;\n\n```\n\nThe flow code: [`04_postgres_taxi.yaml`](flows/04_postgres_taxi.yaml).\n\n\n> [!NOTE]  \n> The NYC Taxi and Limousine Commission (TLC) Trip Record Data provided on the 
[nyc.gov](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) website is currently available only in a Parquet format, but this is NOT the dataset we're going to use in this course. For the purpose of this course, we'll use the **CSV files** available [here on GitHub](https://github.com/DataTalksClub/nyc-tlc-data/releases). This is because the Parquet format can be challenging to understand by newcomers, and we want to make the course as accessible as possible — the CSV format can be easily introspected using tools like Excel or Google Sheets, or even a simple text editor.\n\n#### Videos\n\n- **2.3.2 - Local DB: Load Taxi Data to Postgres**   \n  [![Local DB: Load Taxi Data to Postgres](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FZ9ZmmwtXDcU)](https://youtu.be/Z9ZmmwtXDcU)\n\n#### Resources\n- [Docker Compose with Kestra, Postgres and pgAdmin](docker-compose.yml)\n\n### 2.3.3 Local DB: Learn Scheduling and Backfills\n\nWe can now schedule the same pipeline shown above to run daily at 9 AM UTC. We'll also demonstrate how to backfill the data pipeline to run on historical data.\n\nNote: given the large dataset, we'll backfill only data for the green taxi dataset for the year 2019.\n\nThe flow code: [`05_postgres_taxi_scheduled.yaml`](flows/05_postgres_taxi_scheduled.yaml).\n\n#### Videos\n\n- **2.3.3 - Scheduling and Backfills**  \n  [![Scheduling and Backfills](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F1pu_C_oOAMA)](https://youtu.be/1pu_C_oOAMA)\n---\n\n## 2.4 ELT Pipelines in Kestra: Google Cloud Platform\n\nNow that you've learned how to build ETL pipelines locally using Postgres, we are ready to move to the cloud. In this section, we'll load the same Yellow and Green Taxi data to Google Cloud Platform (GCP) using: \n1. Google Cloud Storage (GCS) as a data lake  \n2. 
BigQuery as a data warehouse.\n\n### 2.4.1 - ETL vs ELT\n\nIn 2.3, we made an ETL pipeline inside of Kestra:\n- **Extract:** Firstly, we extract the dataset from GitHub\n- **Transform:** Next, we transform it with Python\n- **Load:** Finally, we load it into our Postgres database\n\nWhile this is very standard across the industry, sometimes it makes sense to change the order when working with the cloud. If you're working with a large dataset, like the Yellow Taxi data, there can be benefits to extracting and loading straight into a data warehouse, and then performing transformations directly in the data warehouse. When working with BigQuery, we will use ELT:\n- **Extract:** Firstly, we extract the dataset from GitHub\n- **Load:** Next, we load this dataset (in this case, a CSV file) into a data lake (Google Cloud Storage)\n- **Transform:** Finally, we can create a table inside of our data warehouse (BigQuery) which uses the data from our data lake to perform our transformations.\n\nLoading into the data warehouse before transforming lets us use the cloud's compute to transform large datasets. What might take much longer on a local machine can take a fraction of the time in the cloud.\n\nOver the next few videos, we'll look at setting up BigQuery and transforming the Yellow Taxi dataset.\n\n#### Videos\n\n- **2.4.1 - ETL vs ELT**  \n  [![ETL vs ELT](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FE04yurp1tSU)](https://youtu.be/E04yurp1tSU)\n\n#### Resources\n- [ETL vs ELT Video](https://go.kestra.io/de-zoomcamp/etl-vs-elt)\n- [Data Warehouse 101 Video](https://go.kestra.io/de-zoomcamp/data-warehouse-101)\n- [Data Lakes 101 Video](https://go.kestra.io/de-zoomcamp/data-lakes-101)\n\n### 2.4.2 Setup Google Cloud Platform (GCP)\n\nBefore we start loading data to GCP, we need to set up the Google Cloud Platform. 
\n\nFirst, adjust the following flow [`06_gcp_kv.yaml`](flows/06_gcp_kv.yaml) to include your service account, GCP project ID, BigQuery dataset and GCS bucket name (_along with their location_) as KV Store values:\n- GCP_PROJECT_ID\n- GCP_LOCATION\n- GCP_BUCKET_NAME\n- GCP_DATASET.\n\n#### Create GCP Resources\n\nIf you haven't already created the GCS bucket and BigQuery dataset in the first week of the course, you can use this flow to create them: [`07_gcp_setup.yaml`](flows/07_gcp_setup.yaml).\n\n> [!WARNING]  \n> The `GCP_CREDS` service account contains sensitive information. Ensure you keep it secure and do not commit it to Git. Keep it as secure as your passwords.\n\n\n#### Videos\n\n- **2.4.2 - Setup Google Cloud Platform**  \n  [![Setup Google Cloud Platform](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FTLGFAOHpOYM)](https://youtu.be/TLGFAOHpOYM)\n\n#### Resources\n- [Set up Google Cloud Service Account in Kestra](https://go.kestra.io/de-zoomcamp/google-sa)\n\n### 2.4.3 GCP Workflow: Load Taxi Data to BigQuery\n\nNow that Google Cloud is set up with a storage bucket, we can start the ELT process.\n\n```mermaid\ngraph LR\n  SetLabel[Set Labels] --> Extract[Extract CSV Data]\n  Extract --> UploadToGCS[Upload Data to GCS]\n  UploadToGCS -->|Taxi=Yellow| BQYellowTripdata[Main Yellow Tripdata Table]:::yellow\n  UploadToGCS -->|Taxi=Green| BQGreenTripdata[Main Green Tripdata Table]:::green\n  BQYellowTripdata --> BQYellowTableExt[External Table]:::yellow\n  BQGreenTripdata --> BQGreenTableExt[External Table]:::green\n  BQYellowTableExt --> BQYellowTableTmp[Monthly Table]:::yellow\n  BQGreenTableExt --> BQGreenTableTmp[Monthly Table]:::green\n  BQYellowTableTmp --> BQYellowMerge[Merge to Main Table]:::yellow\n  BQGreenTableTmp --> BQGreenMerge[Merge to Main Table]:::green\n  BQYellowMerge --> PurgeFiles[Purge Files]\n  BQGreenMerge --> PurgeFiles[Purge Files]\n\n  classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px,color:#000\n  
classDef green fill:#32CD32,stroke:#000,stroke-width:1px,color:#000\n```\n\nThe flow code: [`08_gcp_taxi.yaml`](flows/08_gcp_taxi.yaml).\n\n#### Videos\n\n- **2.4.3 - Create an ETL Pipeline with GCS and BigQuery in Kestra**  \n  [![Create an ETL Pipeline with GCS and BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F52u9X_bfTAo)](https://youtu.be/52u9X_bfTAo)\n\n### 2.4.4 GCP Workflow: Schedule and Backfill Full Dataset\n\nWe can now schedule the same pipeline shown above to run daily at 9 AM UTC for the green dataset and at 10 AM UTC for the yellow dataset. You can backfill historical data directly from the Kestra UI.\n\nSince we now process data in a cloud environment with infinitely scalable storage and compute, we can backfill the entire dataset for both the yellow and green taxi data without the risk of running out of resources on our local machine.\n\nThe flow code: [`09_gcp_taxi_scheduled.yaml`](flows/09_gcp_taxi_scheduled.yaml).\n\n#### Videos\n\n- **2.4.4 - GCP Workflow: Schedule and Backfills**  \n  [![GCP Workflow: Schedule and Backfills](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fb-6KhfWfk2M)](https://youtu.be/b-6KhfWfk2M)\n\n---\n\n## 2.5 Using AI for Data Engineering in Kestra\n\nThis section builds on what you learned earlier in Module 2 to show you how AI can speed up workflow development.\n\nBy the end of this section, you will:\n- Understand why context engineering matters when collaborating with LLMs\n- Use AI Copilot to build Kestra flows faster\n- Use Retrieval Augmented Generation (RAG) in data pipelines\n\n### Prerequisites\n\n- Completion of earlier sections in Module 2 (Workflow Orchestration with Kestra)\n- Kestra running locally\n- Google Cloud account with access to Gemini API (there's a generous free tier!)\n\n---\n\n### 2.5.1 Introduction: Why AI for Workflows?\n\nAs data engineers, we spend significant time writing boilerplate code, searching documentation, and 
structuring data pipelines. AI tools can help us:\n\n- **Generate workflows faster**: Describe what you want to accomplish in natural language instead of writing YAML from scratch\n- **Avoid errors**: Get syntax-correct, up-to-date workflow code that follows best practices\n\nHowever, AI is only as good as the context we provide. This section teaches you how to engineer that context for reliable, production-ready data workflows.\n\n#### Videos\n\n- **2.5.1 - Using AI for Data Engineering**  \n  [![Using AI for Data Engineering](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FGHPtRDAv044)](https://youtu.be/GHPtRDAv044)\n\n---\n\n### 2.5.2 Context Engineering with ChatGPT\n\nLet's start by seeing what happens when AI lacks proper context.\n\n#### Experiment: ChatGPT Without Context\n\n1. **Open ChatGPT in a private browser window** (to avoid any existing chat context): https://chatgpt.com\n\n2. **Enter this prompt:**\n   ```\n   Create a Kestra flow that loads NYC taxi data from a CSV file to BigQuery. The flow should extract data, upload to GCS, and load to BigQuery.\n   ```\n\n3. **Observe the results:**\n   - ChatGPT will generate a Kestra flow, but it likely contains:\n     - **Outdated plugin syntax** e.g., old task types that have been renamed\n     - **Incorrect property names** e.g., properties that don't exist in current versions\n     - **Hallucinated features** e.g., tasks, triggers or properties that never existed\n\n#### Why Does This Happen?\n\nLarge Language Models (LLMs) like GPT models from OpenAI are trained on data up to a specific point in time (knowledge cutoff). 
They don't automatically know about:\n- Software updates and new releases\n- Renamed plugins or changed APIs\n\nThis is the fundamental challenge of using AI: **the model can only work with information it has access to.**\n\n#### Key Learning: Context is Everything\n\nWithout proper context:\n- ❌ Generic AI assistants hallucinate outdated or incorrect code\n- ❌ You can't trust the output for production use\n\nWith proper context:\n- ✅ AI generates accurate, current, production-ready code\n- ✅ You can iterate faster by letting AI generate boilerplate workflow code\n\nIn the next section, we'll see how Kestra's AI Copilot solves this problem.\n\n#### Videos\n\n- **2.5.2 - Context Engineering with ChatGPT**  \n  [![Context Engineering with ChatGPT](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FLmnfjGKwnVU)](https://youtu.be/LmnfjGKwnVU)\n\n---\n\n### 2.5.3 AI Copilot in Kestra\n\nKestra's AI Copilot is specifically designed to generate and modify Kestra flows with full context about the latest plugins, workflow syntax, and best practices.\n\n#### Setup AI Copilot\n\nBefore using AI Copilot, you need to configure Gemini API access in your Kestra instance.\n\n**Step 1: Get Your Gemini API Key**\n\n1. Visit Google AI Studio: https://aistudio.google.com/app/apikey\n2. Sign in with your Google account\n3. Click \"Create API Key\"\n4. Copy the generated key (keep it secure!)\n\n> [!WARNING]  \n> Never commit API keys to Git. Always use environment variables or Kestra's KV Store.\n\n**Step 2: Configure Kestra AI Copilot**\n\nAdd the following to your Kestra configuration. 
You can do this by modifying your `docker-compose.yml` file from 2.2:\n\n```yaml\nservices:\n  kestra:\n    environment:\n      KESTRA_CONFIGURATION: |\n        kestra:\n          ai:\n            type: gemini\n            gemini:\n              model-name: gemini-2.5-flash\n              api-key: ${GEMINI_API_KEY}\n```\n\nThen restart Kestra:\n```bash\ncd 02-workflow-orchestration/docker\nexport GEMINI_API_KEY=\"your-api-key-here\"\ndocker compose up -d\n```\n\n#### Exercise: ChatGPT vs AI Copilot Comparison\n\n**Objective:** Learn why context engineering matters.\n\n1. **Open Kestra UI** at http://localhost:8080\n2. **Create a new flow** and open the Code editor panel\n3. **Click the AI Copilot button** (sparkle icon ✨) in the top-right corner\n4. **Enter the same exact prompt** we used with ChatGPT:\n   ```\n   Create a Kestra flow that loads NYC taxi data from a CSV file to BigQuery. The flow should extract data, upload to GCS, and load to BigQuery.\n   ```\n5. **Compare the outputs:**\n   - ✅ Copilot generates executable, working YAML\n   - ✅ Copilot uses correct plugin types and properties\n   - ✅ Copilot follows current Kestra best practices\n\n**Key Learning:** Context matters! AI Copilot has access to current Kestra documentation, generating Kestra flows better than a generic ChatGPT assistant.\n\n#### Videos\n\n- **2.5.3 - AI Copilot in Kestra**  \n  [![AI Copilot in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F3IbjHfC8bMg)](https://youtu.be/3IbjHfC8bMg)\n\n\n### 2.5.4 Bonus: Retrieval Augmented Generation (RAG)\n\nTo further learn how to provide context to your prompts, this bonus section demonstrates how to use RAG.\n\n#### What is RAG?\n\n**RAG (Retrieval Augmented Generation)** is a technique that:\n1. **Retrieves** relevant information from your data sources\n2. **Augments** the AI prompt with this context\n3. 
**Generates** a response grounded in real data\n\nThis mitigates the hallucination problem by giving the AI access to current, accurate information at query time.\n\n#### How RAG Works in Kestra\n\n```mermaid\ngraph LR\n    A[Ask AI] --> B[Fetch Docs]\n    B --> C[Create Embeddings]\n    C --> D[Find Similar Content]\n    D --> E[Add Context to Prompt]\n    E --> F[LLM Answer]\n```\n\n**The Process:**\n1. **Ingest documents**: Load documentation, release notes, or other data sources\n2. **Create embeddings**: Convert text into vector representations using an embedding model\n3. **Store embeddings**: Save vectors in Kestra's KV Store (or a vector database)\n4. **Query with context**: When you ask a question, retrieve the chunks whose embeddings are most similar to it and include their text in the prompt\n5. **Generate response**: The LLM answers using the retrieved context, which makes the answer more accurate\n\n#### Exercise: Retrieval With vs Without Context\n\n**Objective:** Understand how RAG reduces hallucinations by grounding LLM responses in real data.\n\n**Part A: Without RAG**\n1. Navigate to the [`10_chat_without_rag.yaml`](flows/10_chat_without_rag.yaml) flow in your Kestra UI\n2. Click **Execute**\n3. Wait for the execution to complete\n4. Open the **Logs** tab\n5. Read the output - notice how the response about \"Kestra 1.1 features\" is:\n   - Vague or generic\n   - Potentially incorrect\n   - Missing specific details\n   - Based only on the model's training data (which may be outdated)\n\n**Part B: With RAG**\n1. Navigate to the [`11_chat_with_rag.yaml`](flows/11_chat_with_rag.yaml) flow\n2. Click **Execute**\n3. Watch the execution:\n   - First task: **Ingests** Kestra 1.1 release documentation, creates **embeddings** and stores them\n   - Second task: **Prompts LLM** with context retrieved from stored embeddings\n4. Open the **Logs** tab\n5. 
Compare this output with the previous one - notice how it's:\n   - ✅ Specific and detailed\n   - ✅ Accurate with real features from the release\n   - ✅ Grounded in actual documentation\n\n**Key Learning:** RAG (Retrieval Augmented Generation) grounds AI responses in current documentation, reducing hallucinations and producing more accurate, context-aware answers.\n\n#### RAG Best Practices\n\n1. **Keep documents updated**: Regularly re-ingest to ensure current information\n2. **Chunk appropriately**: Break large documents into meaningful chunks\n3. **Test retrieval quality**: Verify that the right documents are retrieved\n\n#### Additional AI Resources\n\nKestra Documentation:\n- [AI Tools Overview](https://go.kestra.io/de-zoomcamp/ai-tools)\n- [AI Copilot](https://go.kestra.io/de-zoomcamp/ai-copilot)\n- [RAG Workflows](https://go.kestra.io/de-zoomcamp/rag-workflows)\n- [AI Workflows](https://go.kestra.io/de-zoomcamp/ai-workflows)\n- [Kestra Blueprints](https://go.kestra.io/de-zoomcamp/blueprints) - Pre-built workflow examples\n\nKestra Plugin Documentation:\n- [AI Plugin](https://go.kestra.io/de-zoomcamp/ai-plugin)\n- [RAG Tasks](https://go.kestra.io/de-zoomcamp/ai-rag-task)\n\nExternal Documentation:\n- [Google Gemini](https://go.kestra.io/de-zoomcamp/gemini-docs)\n- [Google AI Studio](https://go.kestra.io/de-zoomcamp/ai-studio)\n\n#### Videos\n\n- **2.5.4 (Bonus) - Retrieval Augmented Generation**  \n  [![Retrieval Augmented Generation](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FXuPDQ1UcNyI)](https://youtu.be/XuPDQ1UcNyI)\n\n## 2.6 Bonus: Deploy to the Cloud (Optional)\n\nNow that we've got all our pipelines working and we know how to quickly create new flows with Kestra's AI Copilot, we can deploy Kestra to the cloud so it can continue to orchestrate our scheduled pipelines. 
\n\nIn this bonus section, we'll cover how you can deploy Kestra on Google Cloud and automatically sync your workflows from a Git repository.\n\nNote: When committing your workflows to a Git repository, make sure they don't contain any sensitive information. You can use [Secrets](https://go.kestra.io/de-zoomcamp/secret) and the [KV Store](https://go.kestra.io/de-zoomcamp/kv-store) to keep sensitive data out of your workflow logic.\n\n#### Resources\n\n- [Install Kestra on Google Cloud](https://go.kestra.io/de-zoomcamp/gcp-install)\n- [Moving from Development to Production](https://go.kestra.io/de-zoomcamp/dev-to-prod)\n- [Using Git in Kestra](https://go.kestra.io/de-zoomcamp/git)\n- [Deploy Flows with GitHub Actions](https://go.kestra.io/de-zoomcamp/deploy-github-actions)\n\n## 2.7 Additional Resources 📚\n\n- Check [Kestra Docs](https://go.kestra.io/de-zoomcamp/docs)\n- Explore our [Blueprints](https://go.kestra.io/de-zoomcamp/blueprints) library\n- Browse over 600 [plugins](https://go.kestra.io/de-zoomcamp/plugins) available in Kestra\n- Give us a star on [GitHub](https://go.kestra.io/de-zoomcamp/github)\n- Join our [Slack community](https://go.kestra.io/de-zoomcamp/slack) if you have any questions\n- Find all the videos in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist)\n\n\n### Troubleshooting tips\n\nIf you face any issues with Kestra flows in Module 2, make sure to use the following Docker images/ports:\n- `image: kestra/kestra:v1.1` - pin your Kestra Docker image to this version so we can ensure reproducibility; do NOT use `kestra/kestra:develop` as this is a bleeding-edge development version that might contain bugs\n- `postgres:18` - make sure to pin your Postgres image to version 18\n- If you run `pgAdmin` or something else on port 8080, you can adjust Kestra `docker-compose` to use a different port, e.g. 
change port mapping to 18080 instead of 8080, and then access Kestra UI in your browser from http://localhost:18080/ instead of from http://localhost:8080/\n\nIf you are still facing any issues, stop and remove your existing Kestra + Postgres containers and start them again using `docker compose up -d`. If this doesn't help, post your question on the DataTalksClub Slack or on Kestra's Slack http://kestra.io/slack.\n\nIf you encounter an error similar to:\n```\nBigQueryError{reason=invalid, location=null, \nmessage=Error while reading table: kestra-sandbox.zooomcamp.yellow_tripdata_2020_01, \nerror message: CSV table references column position 17, but line contains only 14 columns.; \nline_number: 2103925 byte_offset_to_start_of_line: 194863028 \ncolumn_index: 17 column_name: \"congestion_surcharge\" column_type: NUMERIC \nFile: gs://anna-geller/yellow_tripdata_2020-01.csv}\n```\n\nIt means that a row in the CSV file you're trying to load into BigQuery has fewer columns than the destination table expects, i.e. the external source table (the file in GCS) does not match the destination table in BigQuery. This can happen when, due to network/transfer issues, the file is not fully downloaded from GitHub or not correctly uploaded to GCS. The error suggests a schema issue, but that's not the case. Simply rerun the entire execution, including redownloading the CSV file and reuploading it to GCS. This should resolve the issue.\n\n---\n\n## Homework \n\nSee the [2026 cohort folder](../cohorts/2026/02-workflow-orchestration/homework.md)\n\n---\n\n# Community notes\n\nDid you take notes? You can share them by creating a PR to this file! 
\n\n* Add your notes above this line\n\n---\n\n# Previous Cohorts\n\n* 2022: [notes](../cohorts/2022/week_2_data_ingestion#community-notes) and [videos](../cohorts/2022/week_2_data_ingestion)\n* 2023: [notes](../cohorts/2023/week_2_workflow_orchestration#community-notes) and [videos](../cohorts/2023/week_2_workflow_orchestration)\n* 2024: [notes](../cohorts/2024/02-workflow-orchestration#community-notes) and [videos](../cohorts/2024/02-workflow-orchestration)\n* 2025: [notes](../cohorts/2025/02-workflow-orchestration/README.md#community-notes) and [videos](../cohorts/2025/02-workflow-orchestration)\n"
  },
  {
    "path": "02-workflow-orchestration/docker-compose.yml",
    "content": "volumes:\n  ny_taxi_postgres_data:\n    driver: local\n  kestra_postgres_data:\n    driver: local\n  kestra_data:\n    driver: local\n  kestra_tmp:\n    driver: local\n\nservices:\n  pgdatabase:\n    image: postgres:18\n    environment:\n      POSTGRES_USER: root\n      POSTGRES_PASSWORD: root\n      POSTGRES_DB: ny_taxi\n    ports:\n      - \"5432:5432\"\n    volumes:\n      - ny_taxi_postgres_data:/var/lib/postgresql\n    depends_on:\n      kestra:\n        condition: service_started\n\n  pgadmin:\n    image: dpage/pgadmin4\n    environment:\n      - PGADMIN_DEFAULT_EMAIL=admin@admin.com\n      - PGADMIN_DEFAULT_PASSWORD=root\n    ports:\n      - \"8085:80\"\n    depends_on:\n      pgdatabase:\n        condition: service_started\n\n  kestra_postgres:\n    image: postgres:18\n    volumes:\n      - kestra_postgres_data:/var/lib/postgresql\n    environment:\n      POSTGRES_DB: kestra\n      POSTGRES_USER: kestra\n      POSTGRES_PASSWORD: k3str4\n    healthcheck:\n      test: [\"CMD-SHELL\", \"pg_isready -d $${POSTGRES_DB} -U $${POSTGRES_USER}\"]\n      interval: 30s\n      timeout: 10s\n      retries: 10\n\n  kestra:\n    image: kestra/kestra:v1.1\n    pull_policy: always\n    # Note that this setup with a root user is intended for development purpose.\n    # Our base image runs without root, but the Docker Compose implementation needs root to access the Docker socket\n    # To run Kestra in a rootless mode in production, see: https://kestra.io/docs/installation/podman-compose\n    user: \"root\"\n    command: server standalone\n    volumes:\n      - kestra_data:/app/storage\n      - /var/run/docker.sock:/var/run/docker.sock\n      - kestra_tmp:/tmp/kestra-wd\n    environment:\n      KESTRA_CONFIGURATION: |\n        datasources:\n          postgres:\n            url: jdbc:postgresql://kestra_postgres:5432/kestra\n            driverClassName: org.postgresql.Driver\n            username: kestra\n            password: k3str4\n        kestra:\n          
server:\n            basicAuth:\n              username: \"admin@kestra.io\" # it must be a valid email address\n              password: Admin1234!\n          repository:\n            type: postgres\n          storage:\n            type: local\n            local:\n              basePath: \"/app/storage\"\n          queue:\n            type: postgres\n          tasks:\n            tmpDir:\n              path: /tmp/kestra-wd/tmp\n          url: http://localhost:8080/\n    ports:\n      - \"8080:8080\"\n      - \"8081:8081\"\n    depends_on:\n      kestra_postgres:\n        condition: service_started\n    "
  },
  {
    "path": "02-workflow-orchestration/flows/01_hello_world.yaml",
    "content": "id: 01_hello_world\nnamespace: zoomcamp\n\ninputs:\n  - id: name\n    type: STRING\n    defaults: Will\n\nconcurrency:\n  behavior: FAIL\n  limit: 2\n\nvariables:\n  welcome_message: \"Hello, {{ inputs.name }}!\"\n  \ntasks:\n  - id: hello_message\n    type: io.kestra.plugin.core.log.Log\n    message: \"{{ render(vars.welcome_message) }}\"\n  \n  - id: generate_output\n    type: io.kestra.plugin.core.debug.Return\n    format: I was generated during this workflow.\n\n  - id: sleep\n    type: io.kestra.plugin.core.flow.Sleep\n    duration: PT15S\n\n  - id: log_output\n    type: io.kestra.plugin.core.log.Log\n    message: \"This is an output: {{ outputs.generate_output.value }}\"\n\n  - id: goodbye_message\n    type: io.kestra.plugin.core.log.Log\n    message: \"Goodbye, {{ inputs.name }}!\"\n\npluginDefaults:\n  - type: io.kestra.plugin.core.log.Log\n    values:\n      level: ERROR\n\ntriggers:\n  - id: schedule\n    type: io.kestra.plugin.core.trigger.Schedule\n    cron: \"0 10 * * *\"\n    inputs:\n      name: Sarah\n    disabled: true\n"
  },
  {
    "path": "02-workflow-orchestration/flows/02_python.yaml",
    "content": "id: 02_python\nnamespace: zoomcamp\n\ndescription: This flow will install the pip package in a Docker container, and use kestra's Python library to generate outputs (number of downloads of the Kestra Docker image) and metrics (duration of the script).\n\ntasks:\n  - id: collect_stats\n    type: io.kestra.plugin.scripts.python.Script\n    taskRunner:\n      type: io.kestra.plugin.scripts.runner.docker.Docker\n    containerImage: python:slim\n    dependencies:\n      - requests\n      - kestra\n    script: |\n      from kestra import Kestra\n      import requests\n      def get_docker_image_downloads(image_name: str = \"kestra/kestra\"):\n          \"\"\"Queries the Docker Hub API to get the number of downloads for a specific Docker image.\"\"\"\n          url = f\"https://hub.docker.com/v2/repositories/{image_name}/\"\n          response = requests.get(url)\n          data = response.json()\n          downloads = data.get('pull_count', 'Not available')\n          return downloads\n      downloads = get_docker_image_downloads()\n      outputs = {\n          'downloads': downloads\n      }\n      Kestra.outputs(outputs)"
  },
  {
    "path": "02-workflow-orchestration/flows/03_getting_started_data_pipeline.yaml",
    "content": "id: 03_getting_started_data_pipeline\nnamespace: zoomcamp\n\ninputs:\n  - id: columns_to_keep\n    type: ARRAY\n    itemType: STRING\n    defaults:\n      - brand\n      - price\n\ntasks:\n  - id: extract\n    type: io.kestra.plugin.core.http.Download\n    uri: https://dummyjson.com/products\n\n  - id: transform\n    type: io.kestra.plugin.scripts.python.Script\n    containerImage: python:3.11-alpine\n    inputFiles:\n      data.json: \"{{outputs.extract.uri}}\"\n    outputFiles:\n      - \"*.json\"\n    env:\n      COLUMNS_TO_KEEP: \"{{inputs.columns_to_keep}}\"\n    script: |\n      import json\n      import os\n\n      columns_to_keep_str = os.getenv(\"COLUMNS_TO_KEEP\")\n      columns_to_keep = json.loads(columns_to_keep_str)\n\n      with open(\"data.json\", \"r\") as file:\n          data = json.load(file)\n\n      filtered_data = [\n          {column: product.get(column, \"N/A\") for column in columns_to_keep}\n          for product in data[\"products\"]\n      ]\n\n      with open(\"products.json\", \"w\") as file:\n          json.dump(filtered_data, file, indent=4)\n\n  - id: query\n    type: io.kestra.plugin.jdbc.duckdb.Queries\n    inputFiles:\n      products.json: \"{{outputs.transform.outputFiles['products.json']}}\"\n    sql: |\n      INSTALL json;\n      LOAD json;\n      SELECT brand, round(avg(price), 2) as avg_price\n      FROM read_json_auto('{{workingDir}}/products.json')\n      GROUP BY brand\n      ORDER BY avg_price DESC;\n    fetchType: STORE\n"
  },
  {
    "path": "02-workflow-orchestration/flows/04_postgres_taxi.yaml",
    "content": "id: 04_postgres_taxi\nnamespace: zoomcamp\ndescription: |\n  The CSV Data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases\n\ninputs:\n  - id: taxi\n    type: SELECT\n    displayName: Select taxi type\n    values: [yellow, green]\n    defaults: yellow\n\n  - id: year\n    type: SELECT\n    displayName: Select year\n    values: [\"2019\", \"2020\"]\n    defaults: \"2019\"\n\n  - id: month\n    type: SELECT\n    displayName: Select month\n    values: [\"01\", \"02\", \"03\", \"04\", \"05\", \"06\", \"07\", \"08\", \"09\", \"10\", \"11\", \"12\"]\n    defaults: \"01\"\n\nvariables:\n  file: \"{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv\"\n  staging_table: \"public.{{inputs.taxi}}_tripdata_staging\"\n  table: \"public.{{inputs.taxi}}_tripdata\"\n  data: \"{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}\"\n\ntasks:\n  - id: set_label\n    type: io.kestra.plugin.core.execution.Labels\n    labels:\n      file: \"{{render(vars.file)}}\"\n      taxi: \"{{inputs.taxi}}\"\n\n  - id: extract\n    type: io.kestra.plugin.scripts.shell.Commands\n    outputFiles:\n      - \"*.csv\"\n    taskRunner:\n      type: io.kestra.plugin.core.runner.Process\n    commands:\n      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}\n\n  - id: if_yellow_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'yellow'}}\"\n    then:\n      - id: yellow_create_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              tpep_pickup_datetime   timestamp,\n              tpep_dropoff_datetime  timestamp,\n              passenger_count        
integer,\n              trip_distance          double precision,\n              RatecodeID             text,\n              store_and_fwd_flag     text,\n              PULocationID           text,\n              DOLocationID           text,\n              payment_type           integer,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              congestion_surcharge   double precision\n          );\n\n      - id: yellow_create_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              tpep_pickup_datetime   timestamp,\n              tpep_dropoff_datetime  timestamp,\n              passenger_count        integer,\n              trip_distance          double precision,\n              RatecodeID             text,\n              store_and_fwd_flag     text,\n              PULocationID           text,\n              DOLocationID           text,\n              payment_type           integer,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              congestion_surcharge   double precision\n          );\n\n      - id: yellow_truncate_staging_table\n  
      type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          TRUNCATE TABLE {{render(vars.staging_table)}};\n\n      - id: yellow_copy_in_to_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.CopyIn\n        format: CSV\n        from: \"{{render(vars.data)}}\"\n        table: \"{{render(vars.staging_table)}}\"\n        header: true\n        columns: [VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge]\n\n      - id: yellow_add_unique_id_and_filename\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          UPDATE {{render(vars.staging_table)}}\n          SET \n            unique_row_id = md5(\n              COALESCE(CAST(VendorID AS text), '') ||\n              COALESCE(CAST(tpep_pickup_datetime AS text), '') || \n              COALESCE(CAST(tpep_dropoff_datetime AS text), '') || \n              COALESCE(PULocationID, '') || \n              COALESCE(DOLocationID, '') || \n              COALESCE(CAST(fare_amount AS text), '') || \n              COALESCE(CAST(trip_distance AS text), '')      \n            ),\n            filename = '{{render(vars.file)}}';\n\n      - id: yellow_merge_data\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          MERGE INTO {{render(vars.table)}} AS T\n          USING {{render(vars.staging_table)}} AS S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (\n              unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,\n              passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID,\n              DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,\n              improvement_surcharge, total_amount, 
congestion_surcharge\n            )\n            VALUES (\n              S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime,\n              S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID,\n              S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount,\n              S.improvement_surcharge, S.total_amount, S.congestion_surcharge\n            );\n\n  - id: if_green_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'green'}}\"\n    then:\n      - id: green_create_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              lpep_pickup_datetime   timestamp,\n              lpep_dropoff_datetime  timestamp,\n              store_and_fwd_flag     text,\n              RatecodeID             text,\n              PULocationID           text,\n              DOLocationID           text,\n              passenger_count        integer,\n              trip_distance          double precision,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              ehail_fee              double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              payment_type           integer,\n              trip_type              integer,\n              congestion_surcharge   double precision\n          );\n\n      - id: green_create_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n 
         CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              lpep_pickup_datetime   timestamp,\n              lpep_dropoff_datetime  timestamp,\n              store_and_fwd_flag     text,\n              RatecodeID             text,\n              PULocationID           text,\n              DOLocationID           text,\n              passenger_count        integer,\n              trip_distance          double precision,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              ehail_fee              double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              payment_type           integer,\n              trip_type              integer,\n              congestion_surcharge   double precision\n          );\n\n      - id: green_truncate_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          TRUNCATE TABLE {{render(vars.staging_table)}};\n\n      - id: green_copy_in_to_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.CopyIn\n        format: CSV\n        from: \"{{render(vars.data)}}\"\n        table: \"{{render(vars.staging_table)}}\"\n        header: true\n        columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge]\n\n      - id: green_add_unique_id_and_filename\n        type: 
io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          UPDATE {{render(vars.staging_table)}}\n          SET \n            unique_row_id = md5(\n              COALESCE(CAST(VendorID AS text), '') ||\n              COALESCE(CAST(lpep_pickup_datetime AS text), '') || \n              COALESCE(CAST(lpep_dropoff_datetime AS text), '') || \n              COALESCE(PULocationID, '') || \n              COALESCE(DOLocationID, '') || \n              COALESCE(CAST(fare_amount AS text), '') || \n              COALESCE(CAST(trip_distance AS text), '')      \n            ),\n            filename = '{{render(vars.file)}}';\n\n      - id: green_merge_data\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          MERGE INTO {{render(vars.table)}} AS T\n          USING {{render(vars.staging_table)}} AS S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (\n              unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime,\n              store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count,\n              trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee,\n              improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge\n            )\n            VALUES (\n              S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime,\n              S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count,\n              S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee,\n              S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge\n            );\n  \n  - id: purge_files\n    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles\n    description: This will remove output files. 
If you'd like to explore Kestra outputs, disable it.\n\npluginDefaults:\n  - type: io.kestra.plugin.jdbc.postgresql\n    values:\n      url: jdbc:postgresql://pgdatabase:5432/ny_taxi\n      username: root\n      password: root\n"
  },
  {
    "path": "02-workflow-orchestration/flows/05_postgres_taxi_scheduled.yaml",
    "content": "id: 05_postgres_taxi_scheduled\nnamespace: zoomcamp\ndescription: |\n  Best to add a label `backfill:true` from the UI to track executions created via a backfill.\n  CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases\n\nconcurrency:\n  limit: 1\n\ninputs:\n  - id: taxi\n    type: SELECT\n    displayName: Select taxi type\n    values: [yellow, green]\n    defaults: yellow\n\nvariables:\n  file: \"{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv\"\n  staging_table: \"public.{{inputs.taxi}}_tripdata_staging\"\n  table: \"public.{{inputs.taxi}}_tripdata\"\n  data: \"{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}\"\n\ntasks:\n  - id: set_label\n    type: io.kestra.plugin.core.execution.Labels\n    labels:\n      file: \"{{render(vars.file)}}\"\n      taxi: \"{{inputs.taxi}}\"\n\n  - id: extract\n    type: io.kestra.plugin.scripts.shell.Commands\n    outputFiles:\n      - \"*.csv\"\n    taskRunner:\n      type: io.kestra.plugin.core.runner.Process\n    commands:\n      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}\n\n  - id: if_yellow_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'yellow'}}\"\n    then:\n      - id: yellow_create_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              tpep_pickup_datetime   timestamp,\n              tpep_dropoff_datetime  timestamp,\n              passenger_count        integer,\n              trip_distance          double precision,\n              RatecodeID             text,\n              store_and_fwd_flag     text,\n              PULocationID       
    text,\n              DOLocationID           text,\n              payment_type           integer,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              congestion_surcharge   double precision\n          );\n\n      - id: yellow_create_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              tpep_pickup_datetime   timestamp,\n              tpep_dropoff_datetime  timestamp,\n              passenger_count        integer,\n              trip_distance          double precision,\n              RatecodeID             text,\n              store_and_fwd_flag     text,\n              PULocationID           text,\n              DOLocationID           text,\n              payment_type           integer,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              congestion_surcharge   double precision\n          );\n\n      - id: yellow_truncate_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          TRUNCATE TABLE {{render(vars.staging_table)}};\n\n      - id: yellow_copy_in_to_staging_table\n        
type: io.kestra.plugin.jdbc.postgresql.CopyIn\n        format: CSV\n        from: \"{{render(vars.data)}}\"\n        table: \"{{render(vars.staging_table)}}\"\n        header: true\n        columns: [VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge]\n\n      - id: yellow_add_unique_id_and_filename\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          UPDATE {{render(vars.staging_table)}}\n          SET \n            unique_row_id = md5(\n              COALESCE(CAST(VendorID AS text), '') ||\n              COALESCE(CAST(tpep_pickup_datetime AS text), '') || \n              COALESCE(CAST(tpep_dropoff_datetime AS text), '') || \n              COALESCE(PULocationID, '') || \n              COALESCE(DOLocationID, '') || \n              COALESCE(CAST(fare_amount AS text), '') || \n              COALESCE(CAST(trip_distance AS text), '')      \n            ),\n            filename = '{{render(vars.file)}}';\n\n      - id: yellow_merge_data\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          MERGE INTO {{render(vars.table)}} AS T\n          USING {{render(vars.staging_table)}} AS S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (\n              unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,\n              passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID,\n              DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,\n              improvement_surcharge, total_amount, congestion_surcharge\n            )\n            VALUES (\n              S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime,\n              S.passenger_count, 
S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID,\n              S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount,\n              S.improvement_surcharge, S.total_amount, S.congestion_surcharge\n            );\n\n  - id: if_green_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'green'}}\"\n    then:\n      - id: green_create_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              lpep_pickup_datetime   timestamp,\n              lpep_dropoff_datetime  timestamp,\n              store_and_fwd_flag     text,\n              RatecodeID             text,\n              PULocationID           text,\n              DOLocationID           text,\n              passenger_count        integer,\n              trip_distance          double precision,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              ehail_fee              double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              payment_type           integer,\n              trip_type              integer,\n              congestion_surcharge   double precision\n          );\n\n      - id: green_create_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               
text,\n              lpep_pickup_datetime   timestamp,\n              lpep_dropoff_datetime  timestamp,\n              store_and_fwd_flag     text,\n              RatecodeID             text,\n              PULocationID           text,\n              DOLocationID           text,\n              passenger_count        integer,\n              trip_distance          double precision,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              ehail_fee              double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              payment_type           integer,\n              trip_type              integer,\n              congestion_surcharge   double precision\n          );\n\n      - id: green_truncate_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          TRUNCATE TABLE {{render(vars.staging_table)}};\n\n      - id: green_copy_in_to_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.CopyIn\n        format: CSV\n        from: \"{{render(vars.data)}}\"\n        table: \"{{render(vars.staging_table)}}\"\n        header: true\n        columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge]\n\n      - id: green_add_unique_id_and_filename\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          UPDATE {{render(vars.staging_table)}}\n          SET \n            unique_row_id = md5(\n              COALESCE(CAST(VendorID AS text), '') ||\n              
COALESCE(CAST(lpep_pickup_datetime AS text), '') || \n              COALESCE(CAST(lpep_dropoff_datetime AS text), '') || \n              COALESCE(PULocationID, '') || \n              COALESCE(DOLocationID, '') || \n              COALESCE(CAST(fare_amount AS text), '') || \n              COALESCE(CAST(trip_distance AS text), '')      \n            ),\n            filename = '{{render(vars.file)}}';\n\n      - id: green_merge_data\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          MERGE INTO {{render(vars.table)}} AS T\n          USING {{render(vars.staging_table)}} AS S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (\n              unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime,\n              store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count,\n              trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee,\n              improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge\n            )\n            VALUES (\n              S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime,\n              S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count,\n              S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee,\n              S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge\n            );\n  \n  - id: purge_files\n    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles\n    description: To avoid cluttering your storage, we will remove the downloaded files\n\npluginDefaults:\n  - type: io.kestra.plugin.jdbc.postgresql\n    values:\n      url: jdbc:postgresql://pgdatabase:5432/ny_taxi\n      username: root\n      password: root\n\ntriggers:\n  - id: green_schedule\n    type: 
io.kestra.plugin.core.trigger.Schedule\n    cron: \"0 9 1 * *\"\n    inputs:\n      taxi: green\n\n  - id: yellow_schedule\n    type: io.kestra.plugin.core.trigger.Schedule\n    cron: \"0 10 1 * *\"\n    inputs:\n      taxi: yellow\n"
  },
  {
    "path": "02-workflow-orchestration/flows/06_gcp_kv.yaml",
    "content": "id: 06_gcp_kv\nnamespace: zoomcamp\n\ntasks:\n  - id: gcp_project_id\n    type: io.kestra.plugin.core.kv.Set\n    key: GCP_PROJECT_ID\n    kvType: STRING\n    value: kestra-sandbox # TODO replace with your project id\n\n  - id: gcp_location\n    type: io.kestra.plugin.core.kv.Set\n    key: GCP_LOCATION\n    kvType: STRING\n    value: europe-west2\n\n  - id: gcp_bucket_name\n    type: io.kestra.plugin.core.kv.Set\n    key: GCP_BUCKET_NAME\n    kvType: STRING\n    value: your-name-kestra # TODO make sure it's globally unique!\n\n  - id: gcp_dataset\n    type: io.kestra.plugin.core.kv.Set\n    key: GCP_DATASET\n    kvType: STRING\n    value: zoomcamp\n\n"
  },
  {
    "path": "02-workflow-orchestration/flows/07_gcp_setup.yaml",
    "content": "id: 07_gcp_setup\nnamespace: zoomcamp\n\ntasks:\n  - id: create_gcs_bucket\n    type: io.kestra.plugin.gcp.gcs.CreateBucket\n    ifExists: SKIP\n    storageClass: REGIONAL\n    name: \"{{kv('GCP_BUCKET_NAME')}}\" # make sure it's globally unique!\n\n  - id: create_bq_dataset\n    type: io.kestra.plugin.gcp.bigquery.CreateDataset\n    name: \"{{kv('GCP_DATASET')}}\"\n    ifExists: SKIP\n\npluginDefaults:\n  - type: io.kestra.plugin.gcp\n    values:\n      serviceAccount: \"{{secret('GCP_CREDS')}}\"\n      projectId: \"{{kv('GCP_PROJECT_ID')}}\"\n      location: \"{{kv('GCP_LOCATION')}}\"\n      bucket: \"{{kv('GCP_BUCKET_NAME')}}\"\n\n"
  },
  {
    "path": "02-workflow-orchestration/flows/08_gcp_taxi.yaml",
    "content": "id: 08_gcp_taxi\nnamespace: zoomcamp\ndescription: |\n  The CSV Data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases\n\ninputs:\n  - id: taxi\n    type: SELECT\n    displayName: Select taxi type\n    values: [yellow, green]\n    defaults: green\n\n  - id: year\n    type: SELECT\n    displayName: Select year\n    values: [\"2019\", \"2020\"]\n    defaults: \"2019\"\n    allowCustomValue: true # allows you to type 2021 from the UI for the homework 🤗\n\n  - id: month\n    type: SELECT\n    displayName: Select month\n    values: [\"01\", \"02\", \"03\", \"04\", \"05\", \"06\", \"07\", \"08\", \"09\", \"10\", \"11\", \"12\"]\n    defaults: \"01\"\n\nvariables:\n  file: \"{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv\"\n  gcs_file: \"gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}\"\n  table: \"{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{inputs.year}}_{{inputs.month}}\"\n  data: \"{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}\"\n\ntasks:\n  - id: set_label\n    type: io.kestra.plugin.core.execution.Labels\n    labels:\n      file: \"{{render(vars.file)}}\"\n      taxi: \"{{inputs.taxi}}\"\n\n  - id: extract\n    type: io.kestra.plugin.scripts.shell.Commands\n    outputFiles:\n      - \"*.csv\"\n    taskRunner:\n      type: io.kestra.plugin.core.runner.Process\n    commands:\n      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}\n\n  - id: upload_to_gcs\n    type: io.kestra.plugin.gcp.gcs.Upload\n    from: \"{{render(vars.data)}}\"\n    to: \"{{render(vars.gcs_file)}}\"\n\n  - id: if_yellow_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'yellow'}}\"\n    then:\n      - id: bq_yellow_tripdata\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE TABLE IF NOT EXISTS 
`{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata`\n          (\n              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),\n              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),      \n              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 
1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          PARTITION BY DATE(tpep_pickup_datetime);\n\n      - id: bq_yellow_table_ext\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`\n          (\n              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 
1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. 
Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          OPTIONS (\n              format = 'CSV',\n              uris = ['{{render(vars.gcs_file)}}'],\n              skip_leading_rows = 1,\n              ignore_unknown_values = TRUE\n          );\n\n      - id: bq_yellow_table_tmp\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`\n          AS\n          SELECT\n            MD5(CONCAT(\n              COALESCE(CAST(VendorID AS STRING), \"\"),\n              COALESCE(CAST(tpep_pickup_datetime AS STRING), \"\"),\n              COALESCE(CAST(tpep_dropoff_datetime AS STRING), \"\"),\n              COALESCE(CAST(PULocationID AS STRING), \"\"),\n              COALESCE(CAST(DOLocationID AS STRING), \"\")\n            )) AS unique_row_id,\n            \"{{render(vars.file)}}\" AS filename,\n            *\n          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;\n\n      - id: bq_yellow_merge\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          MERGE 
INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T\n          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge)\n            VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge);\n\n  - id: if_green_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'green'}}\"\n    then:\n      - id: bq_green_tripdata\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata`\n          (\n              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),\n              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),      \n              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 
1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. 
Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              ehail_fee NUMERIC,\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          PARTITION BY DATE(lpep_pickup_datetime);\n\n      - id: bq_green_table_ext\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`\n          (\n              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 
1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. 
Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              ehail_fee NUMERIC,\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          OPTIONS (\n              format = 'CSV',\n              uris = ['{{render(vars.gcs_file)}}'],\n              skip_leading_rows = 1,\n              ignore_unknown_values = TRUE\n          );\n\n      - id: bq_green_table_tmp\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`\n          AS\n          SELECT\n            MD5(CONCAT(\n              COALESCE(CAST(VendorID AS STRING), \"\"),\n              COALESCE(CAST(lpep_pickup_datetime AS STRING), \"\"),\n              COALESCE(CAST(lpep_dropoff_datetime AS STRING), \"\"),\n              COALESCE(CAST(PULocationID AS STRING), \"\"),\n              COALESCE(CAST(DOLocationID AS STRING), \"\")\n            )) AS unique_row_id,\n            \"{{render(vars.file)}}\" AS filename,\n            *\n          FROM 
`{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;\n\n      - id: bq_green_merge\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T\n          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge)\n            VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge);\n\n  - id: purge_files\n    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles\n    description: If you'd like to explore Kestra outputs, disable it.\n    disabled: false\n\npluginDefaults:\n  - type: io.kestra.plugin.gcp\n    values:\n      serviceAccount: \"{{secret('GCP_CREDS')}}\"\n      projectId: \"{{kv('GCP_PROJECT_ID')}}\"\n      location: \"{{kv('GCP_LOCATION')}}\"\n      bucket: \"{{kv('GCP_BUCKET_NAME')}}\"\n\n"
  },
  {
    "path": "02-workflow-orchestration/flows/09_gcp_taxi_scheduled.yaml",
    "content": "\nid: 09_gcp_taxi_scheduled\nnamespace: zoomcamp\ndescription: |\n  Best to add a label `backfill:true` from the UI to track executions created via a backfill.\n  CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases\n\ninputs:\n  - id: taxi\n    type: SELECT\n    displayName: Select taxi type\n    values: [yellow, green]\n    defaults: green\n\nvariables:\n  file: \"{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv\"\n  gcs_file: \"gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}\"\n  table: \"{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy_MM')}}\"\n  data: \"{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}\"\n\ntasks:\n  - id: set_label\n    type: io.kestra.plugin.core.execution.Labels\n    labels:\n      file: \"{{render(vars.file)}}\"\n      taxi: \"{{inputs.taxi}}\"\n\n  - id: extract\n    type: io.kestra.plugin.scripts.shell.Commands\n    outputFiles:\n      - \"*.csv\"\n    taskRunner:\n      type: io.kestra.plugin.core.runner.Process\n    commands:\n      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}\n\n  - id: upload_to_gcs\n    type: io.kestra.plugin.gcp.gcs.Upload\n    from: \"{{render(vars.data)}}\"\n    to: \"{{render(vars.gcs_file)}}\"\n\n  - id: if_yellow_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'yellow'}}\"\n    then:\n      - id: bq_yellow_tripdata\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata`\n          (\n              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),\n              filename STRING OPTIONS (description = 'The source filename from which the trip data was 
loaded.'),      \n              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. 
Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          PARTITION BY DATE(tpep_pickup_datetime);\n\n      - id: bq_yellow_table_ext\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`\n          (\n              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. 
This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. 
The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          OPTIONS (\n              format = 'CSV',\n              uris = ['{{render(vars.gcs_file)}}'],\n              skip_leading_rows = 1,\n              ignore_unknown_values = TRUE\n          );\n\n      - id: bq_yellow_table_tmp\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`\n          AS\n          SELECT\n            MD5(CONCAT(\n              COALESCE(CAST(VendorID AS STRING), \"\"),\n              COALESCE(CAST(tpep_pickup_datetime AS STRING), \"\"),\n              COALESCE(CAST(tpep_dropoff_datetime AS STRING), \"\"),\n              COALESCE(CAST(PULocationID AS STRING), \"\"),\n              COALESCE(CAST(DOLocationID AS STRING), \"\")\n            )) AS unique_row_id,\n            \"{{render(vars.file)}}\" AS filename,\n            *\n          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;\n\n      - id: bq_yellow_merge\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T\n          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge)\n            VALUES (S.unique_row_id, S.filename, S.VendorID, 
S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge);\n\n  - id: if_green_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'green'}}\"\n    then:\n      - id: bq_green_tripdata\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata`\n          (\n              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),\n              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),      \n              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 
1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              ehail_fee NUMERIC,\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 
1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          PARTITION BY DATE(lpep_pickup_datetime);\n\n      - id: bq_green_table_ext\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`\n          (\n              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. 
This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              ehail_fee NUMERIC,\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 
1= Street-hail 2= Dispatch'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          OPTIONS (\n              format = 'CSV',\n              uris = ['{{render(vars.gcs_file)}}'],\n              skip_leading_rows = 1,\n              ignore_unknown_values = TRUE\n          );\n\n      - id: bq_green_table_tmp\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`\n          AS\n          SELECT\n            MD5(CONCAT(\n              COALESCE(CAST(VendorID AS STRING), \"\"),\n              COALESCE(CAST(lpep_pickup_datetime AS STRING), \"\"),\n              COALESCE(CAST(lpep_dropoff_datetime AS STRING), \"\"),\n              COALESCE(CAST(PULocationID AS STRING), \"\"),\n              COALESCE(CAST(DOLocationID AS STRING), \"\")\n            )) AS unique_row_id,\n            \"{{render(vars.file)}}\" AS filename,\n            *\n          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;\n\n      - id: bq_green_merge\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T\n          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge)\n            VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, 
S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge);\n\n  - id: purge_files\n    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles\n    description: To avoid cluttering your storage, we will remove the downloaded files\n\npluginDefaults:\n  - type: io.kestra.plugin.gcp\n    values:\n      serviceAccount: \"{{secret('GCP_CREDS')}}\"\n      projectId: \"{{kv('GCP_PROJECT_ID')}}\"\n      location: \"{{kv('GCP_LOCATION')}}\"\n      bucket: \"{{kv('GCP_BUCKET_NAME')}}\"\n\ntriggers:\n  - id: green_schedule\n    type: io.kestra.plugin.core.trigger.Schedule\n    cron: \"0 9 1 * *\"\n    inputs:\n      taxi: green\n\n  - id: yellow_schedule\n    type: io.kestra.plugin.core.trigger.Schedule\n    cron: \"0 10 1 * *\"\n    inputs:\n      taxi: yellow\n\n"
  },
  {
    "path": "02-workflow-orchestration/flows/10_chat_without_rag.yaml",
    "content": "id: 10_chat_without_rag\nnamespace: zoomcamp\n\ndescription: |\n  This flow demonstrates what happens when you query an LLM WITHOUT RAG.\n  The model can only rely on its training data, which may be outdated or incomplete.\n  \n  After running this, check out 11_chat_with_rag.yaml to see how RAG fixes these issues.\n\ntasks:\n  - id: chat_without_rag\n    type: io.kestra.plugin.ai.completion.ChatCompletion\n    description: Query about Kestra 1.1 features WITHOUT RAG\n    provider:\n      type: io.kestra.plugin.ai.provider.GoogleGemini\n      modelName: gemini-2.5-flash\n      apiKey: \"{{ kv('GEMINI_API_KEY') }}\"\n    messages:\n      - type: USER\n        content: |\n          Which features were released in Kestra 1.1? \n          Please list at least 5 major features with brief descriptions.\n\n  - id: log_results\n    type: io.kestra.plugin.core.log.Log\n    message: |\n      ❌ Response WITHOUT RAG (no retrieved context):\n      {{ outputs.chat_without_rag.textOutput }}\n      \n      🤔 Did you notice that this response seems to be:\n      - Incorrect\n      - Vague/generic\n      - Listing features that haven't been added in exactly this version but rather a long time ago\n      \n      👉 This is why context matters. Run `11_chat_with_rag.yaml` to see the accurate, context-grounded response.\n\n\n"
  },
  {
    "path": "02-workflow-orchestration/flows/11_chat_with_rag.yaml",
    "content": "id: 11_chat_with_rag\nnamespace: zoomcamp\n\ndescription: |\n  This flow demonstrates RAG (Retrieval Augmented Generation) by ingesting Kestra release documentation and using it to answer questions accurately.\n  \n  Compare this with 10_chat_without_rag.yaml to see the difference RAG makes.\n\ntasks:\n  - id: ingest_release_notes\n    type: io.kestra.plugin.ai.rag.IngestDocument\n    description: Ingest Kestra 1.1 release notes to create embeddings\n    provider:\n      type: io.kestra.plugin.ai.provider.GoogleGemini\n      modelName: gemini-embedding-001\n      apiKey: \"{{ kv('GEMINI_API_KEY') }}\"\n    embeddings:\n      type: io.kestra.plugin.ai.embeddings.KestraKVStore\n    drop: true\n    fromExternalURLs:\n      - https://raw.githubusercontent.com/kestra-io/docs/refs/heads/main/src/contents/blogs/release-1-1/index.md\n\n  - id: chat_with_rag\n    type: io.kestra.plugin.ai.rag.ChatCompletion\n    description: Query about Kestra 1.1 features with RAG context\n    chatProvider:\n      type: io.kestra.plugin.ai.provider.GoogleGemini\n      modelName: gemini-2.5-flash\n      apiKey: \"{{ kv('GEMINI_API_KEY') }}\"\n    embeddingProvider:\n      type: io.kestra.plugin.ai.provider.GoogleGemini\n      modelName: gemini-embedding-001\n      apiKey: \"{{ kv('GEMINI_API_KEY') }}\"\n    embeddings:\n      type: io.kestra.plugin.ai.embeddings.KestraKVStore\n    systemMessage: |\n      You are a helpful assistant that answers questions about Kestra.\n      Use the provided documentation to give accurate, specific answers.\n      If you don't find the information in the context, say so.\n    prompt: |\n      Which features were released in Kestra 1.1? 
\n      Please list at least 5 major features with brief descriptions.\n\n  - id: log_results\n    type: io.kestra.plugin.core.log.Log\n    message: |\n      ✅ RAG Response (with retrieved context):\n      {{ outputs.chat_with_rag.textOutput }}\n      \n      Note that this response is detailed, accurate, and grounded in the actual release documentation. Compare this with the output from 10_chat_without_rag.yaml.\n\n\n"
  },
  {
    "path": "03-data-warehouse/README.md",
    "content": "# Data Warehouse and BigQuery\n\n- [Slides](https://docs.google.com/presentation/d/1a3ZoBAXFk8-EhUsd7rAZd-5p_HpltkzSeujjRGB2TAI/edit?usp=sharing)  \n- [Big Query basic SQL](big_query.sql)\n\n# Videos\n\n## Data Warehouse\n\n- Data Warehouse and BigQuery\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/jrHljAoD6nM)](https://youtu.be/jrHljAoD6nM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=34)\n\n## :movie_camera: Partitioning and clustering\n\n- Partitioning vs Clustering\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/-CqXf7vhhDs)](https://youtu.be/-CqXf7vhhDs?si=p1sYQCAs8dAa7jIm&t=193&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=35)\n\n## :movie_camera: Best practices\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/k81mLJVX08w)](https://youtu.be/k81mLJVX08w&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=36)\n\n## :movie_camera: Internals of BigQuery\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/eduHi1inM4s)](https://youtu.be/eduHi1inM4s&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=37)\n\n## Advanced topics\n\n### :movie_camera: Machine Learning in Big Query\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/B-WtpB0PuG4)](https://youtu.be/B-WtpB0PuG4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=34)\n\n* [SQL for ML in BigQuery](big_query_ml.sql)\n\n**Important links**\n\n- [BigQuery ML Tutorials](https://cloud.google.com/bigquery-ml/docs/tutorials)\n- [BigQuery ML Reference Parameter](https://cloud.google.com/bigquery-ml/docs/analytics-reference-patterns)\n- [Hyper Parameter tuning](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm)\n- [Feature preprocessing](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-preprocess-overview)\n\n### :movie_camera: Deploying Machine Learning model from 
BigQuery\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/BjARzEWaznU)](https://youtu.be/BjARzEWaznU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=39)\n\n- [Steps to extract and deploy model with docker](extract_model.md)  \n\n\n\n# Homework\n\n* [2026 Homework](../cohorts/2026/03-data-warehouse/homework.md)\n\n\n# Community notes\n\n<details>\n<summary>Did you take notes? You can share them here</summary>\n\n* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/3_data_warehouse.md)\n* [Isaac Kargar's blog post](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/30/data-engineering-w3.html)\n* [Marcos Torregrosa's blog post](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-3/) \n* [Notes by Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week3)\n* [Notes from Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/02/week-3-data-engineering-zoomcamp-notes-data-warehouse-and-bigquery/)\n* [Bigger picture summary on Data Lakes, Data Warehouses, and tooling](https://medium.com/@verazabeida/zoomcamp-week-4-b8bde661bf98), by Vera\n* [Notes by froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_3_data_warehouse/notes/notes_week_03.md)\n* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week3.md)\n* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)\n* [2024 videos transcript week3](https://drive.google.com/drive/folders/1quIiwWO-tJCruqvtlqe_Olw8nvYSmmDJ?usp=sharing) by Maria Fisher \n* [Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/3a-data-warehouse/readme.md)\n* [Jonah Oliver's blog post](https://www.jonahboliver.com/blog/de-zc-w3)\n* [2024 - steps to send data from Mage to GCS + creating external table](https://drive.google.com/file/d/1GIi6xnS4070a8MUlIg-ozITt485_-ePB/view?usp=drive_link) by Maria Fisher\n* [2024 - mage 
dataloader script to load the parquet files from a remote URL and push it to Google bucket as parquet file](https://github.com/amohan601/dataengineering-zoomcamp2024/blob/main/week_3_data_warehouse/mage_scripts/green_taxi_2022_v2.py) by Anju Mohan\n* [Notes by HongWei](https://github.com/hwchua0209/data-engineering-zoomcamp-submission/blob/main/03-data-warehouse/README.md)\n* [2025 Notes by Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/3_Data-Warehouse/README.md)\n* [Notes from Horeb SEIDOU](https://spotted-hardhat-eea.notion.site/Week-3-Data-Warehouse-and-BigQuery-17c29780dc4a80c8a226f372543ae388)\n* [2025 - Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/03_data_warehouse/00_notes.md)\n* [2025 Gitbook Notes Tinker0425](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/module-3/introduction-to-module-3)\n* [2025 Notes from Daniel Lachner](https://drive.google.com/file/d/105zjtLFi0sRqqFFgdMSCTzfcLPx2rfv4/view?usp=sharing)\n* [2026 Notes from Catherine Frost](https://docs.google.com/document/d/1j3jeNnBI2fw1nq7JwEauPx2G8FybDfTqmMk7eRu0vSo/edit?tab=t.0)\n* Add your notes here (above this line)\n\n</details>\n"
  },
  {
    "path": "03-data-warehouse/big_query.sql",
    "content": "-- Query public available table\nSELECT station_id, name FROM\n    bigquery-public-data.new_york_citibike.citibike_stations\nLIMIT 100;\n\n\n-- Creating external table referring to gcs path\nCREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.external_yellow_tripdata`\nOPTIONS (\n  format = 'CSV',\n  uris = ['gs://nyc-tl-data/trip data/yellow_tripdata_2019-*.csv', 'gs://nyc-tl-data/trip data/yellow_tripdata_2020-*.csv']\n);\n\n-- Check yellow trip data\nSELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata limit 10;\n\n-- Create a non partitioned table from external table\nCREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_non_partitioned AS\nSELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;\n\n\n-- Create a partitioned table from external table\nCREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitioned\nPARTITION BY\n  DATE(tpep_pickup_datetime) AS\nSELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;\n\n-- Impact of partition\n-- Scanning 1.6GB of data\nSELECT DISTINCT(VendorID)\nFROM taxi-rides-ny.nytaxi.yellow_tripdata_non_partitioned\nWHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';\n\n-- Scanning ~106 MB of DATA\nSELECT DISTINCT(VendorID)\nFROM taxi-rides-ny.nytaxi.yellow_tripdata_partitioned\nWHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';\n\n-- Let's look into the partitions\nSELECT table_name, partition_id, total_rows\nFROM `nytaxi.INFORMATION_SCHEMA.PARTITIONS`\nWHERE table_name = 'yellow_tripdata_partitioned'\nORDER BY total_rows DESC;\n\n-- Creating a partition and cluster table\nCREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered\nPARTITION BY DATE(tpep_pickup_datetime)\nCLUSTER BY VendorID AS\nSELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;\n\n-- Query scans 1.1 GB\nSELECT count(*) as trips\nFROM taxi-rides-ny.nytaxi.yellow_tripdata_partitioned\nWHERE DATE(tpep_pickup_datetime) BETWEEN 
'2019-06-01' AND '2020-12-31'\n  AND VendorID=1;\n\n-- Query scans 864.5 MB\nSELECT count(*) as trips\nFROM taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered\nWHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'\n  AND VendorID=1;\n\n"
  },
  {
    "path": "03-data-warehouse/big_query_hw.sql",
    "content": "CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.fhv_tripdata`\nOPTIONS (\n  format = 'CSV',\n  uris = ['gs://nyc-tl-data/trip data/fhv_tripdata_2019-*.csv']\n);\n\n\nSELECT count(*) FROM `taxi-rides-ny.nytaxi.fhv_tripdata`;\n\n\nSELECT COUNT(DISTINCT(dispatching_base_num)) FROM `taxi-rides-ny.nytaxi.fhv_tripdata`;\n\n\nCREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.fhv_nonpartitioned_tripdata`\nAS SELECT * FROM `taxi-rides-ny.nytaxi.fhv_tripdata`;\n\nCREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.fhv_partitioned_tripdata`\nPARTITION BY DATE(dropoff_datetime)\nCLUSTER BY dispatching_base_num AS (\n  SELECT * FROM `taxi-rides-ny.nytaxi.fhv_tripdata`\n);\n\nSELECT count(*) FROM  `taxi-rides-ny.nytaxi.fhv_nonpartitioned_tripdata`\nWHERE DATE(dropoff_datetime) BETWEEN '2019-01-01' AND '2019-03-31'\n  AND dispatching_base_num IN ('B00987', 'B02279', 'B02060');\n\n\nSELECT count(*) FROM `taxi-rides-ny.nytaxi.fhv_partitioned_tripdata`\nWHERE DATE(dropoff_datetime) BETWEEN '2019-01-01' AND '2019-03-31'\n  AND dispatching_base_num IN ('B00987', 'B02279', 'B02060');\n"
  },
  {
    "path": "03-data-warehouse/big_query_ml.sql",
    "content": "-- SELECT THE COLUMNS INTERESTED FOR YOU\nSELECT passenger_count, trip_distance, PULocationID, DOLocationID, payment_type, fare_amount, tolls_amount, tip_amount\nFROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned` WHERE fare_amount != 0;\n\n-- CREATE A ML TABLE WITH APPROPRIATE TYPE\nCREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_ml` (\n`passenger_count` INTEGER,\n`trip_distance` FLOAT64,\n`PULocationID` STRING,\n`DOLocationID` STRING,\n`payment_type` STRING,\n`fare_amount` FLOAT64,\n`tolls_amount` FLOAT64,\n`tip_amount` FLOAT64\n) AS (\nSELECT passenger_count, trip_distance, cast(PULocationID AS STRING), CAST(DOLocationID AS STRING),\nCAST(payment_type AS STRING), fare_amount, tolls_amount, tip_amount\nFROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned` WHERE fare_amount != 0\n);\n\n-- CREATE MODEL WITH DEFAULT SETTING\nCREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_model`\nOPTIONS\n(model_type='linear_reg',\ninput_label_cols=['tip_amount'],\nDATA_SPLIT_METHOD='AUTO_SPLIT') AS\nSELECT\n*\nFROM\n`taxi-rides-ny.nytaxi.yellow_tripdata_ml`\nWHERE\ntip_amount IS NOT NULL;\n\n-- CHECK FEATURES\nSELECT * FROM ML.FEATURE_INFO(MODEL `taxi-rides-ny.nytaxi.tip_model`);\n\n-- EVALUATE THE MODEL\nSELECT\n*\nFROM\nML.EVALUATE(MODEL `taxi-rides-ny.nytaxi.tip_model`,\n(\nSELECT\n*\nFROM\n`taxi-rides-ny.nytaxi.yellow_tripdata_ml`\nWHERE\ntip_amount IS NOT NULL\n));\n\n-- PREDICT THE MODEL\nSELECT\n*\nFROM\nML.PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,\n(\nSELECT\n*\nFROM\n`taxi-rides-ny.nytaxi.yellow_tripdata_ml`\nWHERE\ntip_amount IS NOT NULL\n));\n\n-- PREDICT AND EXPLAIN\nSELECT\n*\nFROM\nML.EXPLAIN_PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,\n(\nSELECT\n*\nFROM\n`taxi-rides-ny.nytaxi.yellow_tripdata_ml`\nWHERE\ntip_amount IS NOT NULL\n), STRUCT(3 as top_k_features));\n\n-- HYPER PARAM TUNNING\nCREATE OR REPLACE MODEL 
`taxi-rides-ny.nytaxi.tip_hyperparam_model`\nOPTIONS\n(model_type='linear_reg',\ninput_label_cols=['tip_amount'],\nDATA_SPLIT_METHOD='AUTO_SPLIT',\nnum_trials=5,\nmax_parallel_trials=2,\nl1_reg=hparam_range(0, 20),\nl2_reg=hparam_candidates([0, 0.1, 1, 10])) AS\nSELECT\n*\nFROM\n`taxi-rides-ny.nytaxi.yellow_tripdata_ml`\nWHERE\ntip_amount IS NOT NULL;\n\n"
  },
  {
    "path": "03-data-warehouse/extract_model.md",
    "content": "## Model deployment\n[Tutorial](https://cloud.google.com/bigquery-ml/docs/export-model-tutorial)\n### Steps\n- gcloud auth login\n- bq --project_id taxi-rides-ny extract -m nytaxi.tip_model gs://taxi_ml_model/tip_model\n- mkdir /tmp/model\n- gsutil cp -r gs://taxi_ml_model/tip_model /tmp/model\n- mkdir -p serving_dir/tip_model/1\n- cp -r /tmp/model/tip_model/* serving_dir/tip_model/1\n- docker pull tensorflow/serving\n- docker run -p 8501:8501 --mount type=bind,source=`pwd`/serving_dir/tip_model,target=\n  /models/tip_model -e MODEL_NAME=tip_model -t tensorflow/serving &\n- curl -d '{\"instances\": [{\"passenger_count\":1, \"trip_distance\":12.2, \"PULocationID\":\"193\", \"DOLocationID\":\"264\", \"payment_type\":\"2\",\"fare_amount\":20.4,\"tolls_amount\":0.0}]}' -X POST http://localhost:8501/v1/models/tip_model:predict\n- http://localhost:8501/v1/models/tip_model"
  },
  {
    "path": "03-data-warehouse/extras/.env-example",
    "content": "GCP_GCS_BUCKET=\"your_bucket_name\"\nGOOGLE_APPLICATION_CREDENTIALS=Path/to/key/GCP_service_account_key.json"
  },
  {
    "path": "03-data-warehouse/extras/.gitignore",
    "content": "*.env\n*.parquet\n*.csv*"
  },
  {
    "path": "03-data-warehouse/extras/README.md",
    "content": "Quick hack to load files directly to GCS, without Airflow. Downloads csv files from https://nyc-tlc.s3.amazonaws.com/trip+data/ and uploads them to your Cloud Storage Account as parquet files.\n\n1. Install pre-reqs with `uv sync` \n2. Run: `uv run python web_to_gcs_with_progress_bar.py`\n2. or Run: `uv run python web_to_gcs.py` for less verbose (if you have fast internet connection in upload)\n"
  },
  {
    "path": "03-data-warehouse/extras/pyproject.toml",
    "content": "[project]\nname = \"extras\"\nversion = \"0.1.0\"\ndescription = \"Add your description here\"\nreadme = \"README.md\"\nrequires-python = \">=3.14\"\ndependencies = [\n    \"google-cloud-storage>=3.8.0\",\n    \"pandas>=3.0.0\",\n    \"pyarrow>=23.0.0\",\n    \"python-dotenv>=1.2.1\",\n    \"requests>=2.32.5\",\n    \"tqdm>=4.67.1\",\n]\n"
  },
  {
    "path": "03-data-warehouse/extras/web_to_gcs.py",
    "content": "import os\nimport requests\nimport pandas as pd\nfrom google.cloud import storage\nfrom dotenv import load_dotenv\n\n\n\"\"\"\nPre-reqs: \n1. run `uv sync` from this 'extras' folder (create venv and install dependencies from pyproject.toml)\n2. rename .env-example to .env (not committed thanks to .gitignore)\n3. in .env, \n    - set GCP_GCS_BUCKET as your bucket or change default value of BUCKET\n    - Set GOOGLE_APPLICATION_CREDENTIALS to your project/service-account json key \n    (or don't set it if you use google ADC)\n\"\"\"\n# load env vars from .env\nload_dotenv()\n\n# services = ['fhv','green','yellow']\ninit_url = \"https://github.com/DataTalksClub/nyc-tlc-data/releases/download/\"\n# if not done in .env, switch out the default bucketname\nBUCKET = os.environ.get(\"GCP_GCS_BUCKET\", \"dtc-data-lake-bucketname\")\n\n\ndef upload_to_gcs(bucket, object_name, local_file):\n    \"\"\"\n    Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python\n    \"\"\"\n    # # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.\n    # # (Ref: https://github.com/googleapis/python-storage/issues/74)\n    # storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB\n    # storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB\n\n    client = storage.Client()\n    bucket = client.bucket(bucket)\n    blob = bucket.blob(object_name)\n    blob.upload_from_filename(local_file)\n\n\ndef web_to_gcs(year, service):\n    for i in range(12):\n        # zero-padded month part of the file_name string\n        month = f\"{i + 1:02d}\"\n\n        # csv file_name\n        file_name = f\"{service}_tripdata_{year}-{month}.csv.gz\"\n\n        # download it with requests and save it locally\n        request_url = f\"{init_url}{service}/{file_name}\"\n        r = requests.get(request_url)\n        r.raise_for_status()  # stop on a missing file instead of saving an error page\n        with open(file_name, \"wb\") as f:\n            f.write(r.content)\n        print(f\"Local: {file_name}\")\n\n        # read it 
back into a parquet file\n        # enforce types so parquet columns will directly have good types\n        # (as we did in module 1 in ingest.py script)\n        dtypes = {\n            \"VendorID\": \"Int64\",\n            \"RatecodeID\": \"Int64\",\n            \"PULocationID\": \"Int64\",\n            \"DOLocationID\": \"Int64\",\n            \"passenger_count\": \"Int64\",\n            \"payment_type\": \"Int64\",\n            \"trip_type\": \"Int64\",  # only in green but ignored if missing column\n            \"store_and_fwd_flag\": \"string\",\n            \"trip_distance\": \"float64\",\n            \"fare_amount\": \"float64\",\n            \"extra\": \"float64\",\n            \"mta_tax\": \"float64\",\n            \"tip_amount\": \"float64\",\n            \"tolls_amount\": \"float64\",\n            \"ehailfee\": \"float64\",  # only in green but ignored if missing column\n            \"improvement_surcharge\": \"float64\",\n            \"total_amount\": \"float64\",\n            \"congestion_surcharge\": \"float64\",\n        }\n\n        if service == \"yellow\":\n            parse_dates = [\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"]\n        else:\n            parse_dates = [\"lpep_pickup_datetime\", \"lpep_dropoff_datetime\"]\n\n        df = pd.read_csv(\n            file_name, dtype=dtypes, parse_dates=parse_dates, compression=\"gzip\"\n        )\n        file_name = file_name.replace(\".csv.gz\", \".parquet\")\n        df.to_parquet(file_name, engine=\"pyarrow\")\n        print(f\"Parquet: {file_name}\")\n\n        # upload it to gcs\n        upload_to_gcs(BUCKET, f\"{service}/{file_name}\", file_name)\n        print(f\"GCS: {service}/{file_name}\")\n\n\nweb_to_gcs(\"2019\", \"green\")\nweb_to_gcs(\"2020\", \"green\")\nweb_to_gcs(\"2021\", \"green\")  # fail when reach 08 (normal, file not in github :)\n# web_to_gcs(\"2019\", \"yellow\")\n# web_to_gcs(\"2020\", \"yellow\")\n# web_to_gcs(\"2021\", \"yellow\") # fail when reach 08 (normal, 
file not in github :)\n"
  },
  {
    "path": "03-data-warehouse/extras/web_to_gcs_with_progress_bar.py",
    "content": "import os\nimport requests\nimport pandas as pd\nfrom google.cloud import storage\nfrom dotenv import load_dotenv\nfrom tqdm import tqdm\nimport gzip\nimport pyarrow as pa\nimport pyarrow.parquet as pq\n\n\n\"\"\"\nPre-reqs: \n1. run `uv sync` from this 'extras' folder (create venv and install dependencies from pyproject.toml)\n2. rename .env-example to .env (not committed thanks to .gitignore)\n3. in .env, \n    - set GCP_GCS_BUCKET as your bucket or change default value of BUCKET\n    - Set GOOGLE_APPLICATION_CREDENTIALS to your project/service-account json key \n    (or don't set it if you use google ADC)\n\"\"\"\n# load env vars from .env\nload_dotenv()\n\n# services = ['fhv','green','yellow']\ninit_url = \"https://github.com/DataTalksClub/nyc-tlc-data/releases/download/\"\n# if not done in .env, switch out the default bucketname\nBUCKET = os.environ.get(\"GCP_GCS_BUCKET\", \"dtc-data-lake-bucketname\")\n\n\ndef download_with_progress(url: str, local_path: str, desc: str = \"Downloading\"):\n    with requests.get(url, stream=True) as r:\n        r.raise_for_status()\n        total = int(r.headers.get(\"content-length\", 0))\n        # Configure tqdm for bytes\n        with (\n            open(local_path, \"wb\") as f,\n            tqdm(\n                total=total,\n                unit=\"B\",\n                unit_scale=True,\n                unit_divisor=1024,\n                desc=desc,\n            ) as bar,\n        ):\n            for chunk in r.iter_content(chunk_size=1024 * 1024):  # 1 MB\n                if not chunk:\n                    continue\n                size = f.write(chunk)\n                bar.update(size)\n\n\ndef csv_to_parquet_with_progress(\n    csv_path: str, parquet_path: str, service_color: str, chunksize: int = 100_000\n):\n    # 1) Count rows (gzip-aware)\n    with gzip.open(csv_path, mode=\"rt\") as f:\n        total_rows = sum(1 for _ in f) - 1  # minus header\n    if total_rows <= 0:\n        raise 
ValueError(\"CSV appears to be empty\")\n\n    # 2) Read in chunks with fixed dtypes so parquet columns will directly have good types\n    # (as we did in module 1 in ingest.py script)\n    dtypes = {\n        \"VendorID\": \"Int64\",\n        \"RatecodeID\": \"Int64\",\n        \"PULocationID\": \"Int64\",\n        \"DOLocationID\": \"Int64\",\n        \"passenger_count\": \"Int64\",\n        \"payment_type\": \"Int64\",\n        \"trip_type\": \"Int64\",  # only in green but ignored if missing column\n        \"store_and_fwd_flag\": \"string\",\n        \"trip_distance\": \"float64\",\n        \"fare_amount\": \"float64\",\n        \"extra\": \"float64\",\n        \"mta_tax\": \"float64\",\n        \"tip_amount\": \"float64\",\n        \"tolls_amount\": \"float64\",\n        \"ehailfee\": \"float64\",  # only in green but ignored if missing column\n        \"improvement_surcharge\": \"float64\",\n        \"total_amount\": \"float64\",\n        \"congestion_surcharge\": \"float64\",\n    }\n\n    if service_color == \"yellow\":\n        parse_dates = [\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"]\n    else:\n        parse_dates = [\"lpep_pickup_datetime\", \"lpep_dropoff_datetime\"]\n\n    reader = pd.read_csv(\n        csv_path,\n        dtype=dtypes,\n        parse_dates=parse_dates,\n        compression=\"gzip\",\n        chunksize=chunksize,\n        low_memory=False,\n    )\n\n    writer = None\n\n    with tqdm(total=total_rows, unit=\"rows\", desc=f\"Parquet {csv_path}\") as bar:\n        for chunk in reader:\n            table = pa.Table.from_pandas(chunk)\n            if writer is None:\n                writer = pq.ParquetWriter(parquet_path, table.schema)\n            else:\n                # Optional safety: align to first schema\n                table = table.cast(writer.schema)\n            writer.write_table(table)\n            bar.update(len(chunk))\n\n    if writer is not None:\n        writer.close()\n\n\ndef 
upload_to_gcs_with_progress(bucket: str, object_name: str, local_file: str):\n    # # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.\n    # # (Ref: https://github.com/googleapis/python-storage/issues/74)\n    # Optional: tune chunk size (must be multiple of 256 KiB)\n    storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB\n    storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB\n\n    client = storage.Client()\n    bucket_obj = client.bucket(bucket)\n    blob = bucket_obj.blob(object_name)\n\n    if blob.exists(client):\n        print(f\"Skipping upload, already in GCS: gs://{bucket}/{object_name}\")\n        return\n\n    file_size = os.path.getsize(local_file)\n\n    with open(local_file, \"rb\") as f:\n        with tqdm.wrapattr(\n            f,\n            \"read\",\n            total=file_size,\n            miniters=1,\n            unit=\"B\",\n            unit_scale=True,\n            unit_divisor=1024,\n            desc=f\"Uploading {os.path.basename(local_file)}\",\n        ) as wrapped_file:\n            blob.upload_from_file(\n                wrapped_file,\n                size=file_size,  # important so the library knows total bytes\n            )\n\n    print(f\"Uploaded to GCS: gs://{bucket}/{object_name}\")\n\n\ndef web_to_gcs(year, service):\n    client = storage.Client()\n    bucket_obj = client.bucket(BUCKET)\n\n    for i in tqdm(range(12), desc=f\"{service} {year}\", unit=\"month\"):\n        month = f\"{i + 1:02d}\"\n\n        csv_file_name = f\"{service}_tripdata_{year}-{month}.csv.gz\"\n        parquet_file_name = csv_file_name.replace(\".csv.gz\", \".parquet\")\n        object_name = f\"{service}/{parquet_file_name}\"\n\n        # 1) Check if parquet already in GCS\n        blob = bucket_obj.blob(object_name)\n        if blob.exists(client):\n            print(f\"Already in GCS, skipping: gs://{BUCKET}/{object_name}\")\n            continue\n\n        # 2) Check if CSV already downloaded locally\n        
if os.path.exists(csv_file_name):\n            print(f\"CSV already exists locally, skipping download: {csv_file_name}\")\n        else:\n            request_url = f\"{init_url}{service}/{csv_file_name}\"\n            download_with_progress(\n                request_url, csv_file_name, desc=f\"Downloading {csv_file_name}\"\n            )\n\n        # 3) Check if Parquet already exists locally\n        if os.path.exists(parquet_file_name):\n            print(\n                f\"Parquet already exists locally, skipping conversion: {parquet_file_name}\"\n            )\n        else:\n            csv_to_parquet_with_progress(csv_file_name, parquet_file_name, service)\n            print(f\"Parquet: {parquet_file_name}\")\n\n        # 4) Upload with per-byte progress bar\n        upload_to_gcs_with_progress(BUCKET, object_name, parquet_file_name)\n\n\nweb_to_gcs(\"2019\", \"green\")\nweb_to_gcs(\"2020\", \"green\")\nweb_to_gcs(\n    \"2021\", \"green\"\n)  # will fail when reaching 08 (normal, file does not exist in github :)\n# web_to_gcs(\"2019\", \"yellow\")\n# web_to_gcs(\"2020\", \"yellow\")\n# web_to_gcs(\"2021\", \"yellow\") # will fail when reaching 08 (normal, file does not exist in github :)\n"
  },
  {
    "path": "04-analytics-engineering/README.md",
    "content": "# Module 4: Analytics Engineering\n\nGoal: Transforming the data loaded in DWH into Analytical Views developing a [dbt project](taxi_rides_ny/README.md).\n\n### Prerequisites\n\nThe prerequisites depend on which setup path you choose:\n\n**For Cloud Setup (BigQuery):**\n\n- Completed [Module 3: Data Warehouse](../03-data-warehouse/) with:\n  - A GCP project with BigQuery enabled\n  - Service account with BigQuery permissions\n  - NYC taxi data loaded into BigQuery (yellow and green taxi data for 2019-2020)\n\n**For Local Setup (DuckDB):**\n\n- No prerequisites! The local setup guide will walk you through downloading and loading the data.\n\n> [!NOTE]\n> This module focuses on **yellow and green taxi data** (2019-2020). While Module 3 may have included FHV data, it is not used in this dbt project.\n\n## Setting up your environment\n\nChoose your setup path:\n\n### 🏠 [Local Setup](setup/local_setup.md)\n\n- **Stack**: DuckDB + dbt Core\n- **Cost**: Free\n- [→ Get Started](setup/local_setup.md)\n\n### ☁️ [Cloud Setup](setup/cloud_setup.md)\n\n- **Stack**: BigQuery + dbt Cloud\n- **Cost**: Free tier available (dbt Cloud Developer), BigQuery costs vary\n- **Requires**: Completed Module 3 with BigQuery data\n- [→ Get Started](setup/cloud_setup.md)\n\n## Content\n\n### Introduction to Analytics Engineering\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/HxMIsPrIyGQ)](https://www.youtube.com/watch?v=HxMIsPrIyGQ)\n\n### Introduction to data modeling\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/uF76d5EmdtU)](https://www.youtube.com/watch?v=uF76d5EmdtU&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=40)\n\n### What is dbt?\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/gsKuETFJr54)](https://www.youtube.com/watch?v=gsKuETFJr54&list=PLaNLNpjZpzwgneiI-Gl8df8GCsPYp_6Bs&index=5)\n\n### Differences between dbt Core and dbt 
Cloud\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/auzcdLRyEIk)](https://www.youtube.com/watch?v=auzcdLRyEIk)\n\n### Project Setup\n\n| Alternative A  | Alternative B   |\n|-----------------------------|--------------------------------|\n| BigQuery + dbt Platform | DuckDB + dbt core |\n| [![](https://markdown-videos-api.jorgenkh.no/youtube/GFbwlrt6f54)](https://www.youtube.com/watch?v=GFbwlrt6f54) | [![](https://markdown-videos-api.jorgenkh.no/youtube/GoFAbJYfvlw)](https://www.youtube.com/watch?v=GoFAbJYfvlw) |\n\n### dbt Course\n\n| dbt Project Structure | dbt Sources | dbt Models | Seeds and Macros |\n|-----------------------|-------------|------------|------------------|\n| [![](https://markdown-videos-api.jorgenkh.no/youtube/2dYDS4OQbT0)](https://www.youtube.com/watch?v=2dYDS4OQbT0) | [![](https://markdown-videos-api.jorgenkh.no/youtube/7CrrXazV_8k)](https://www.youtube.com/watch?v=7CrrXazV_8k) | [![](https://markdown-videos-api.jorgenkh.no/youtube/JQYz-8sl1aQ)](https://www.youtube.com/watch?v=JQYz-8sl1aQ) | [![](https://markdown-videos-api.jorgenkh.no/youtube/lT4fmTDEqVk)](https://www.youtube.com/watch?v=lT4fmTDEqVk) |\n\n| dbt Tests | Documentation | dbt Packages | dbt Commands |\n|-----------|---------------|----------------------|---------------|\n| [![](https://markdown-videos-api.jorgenkh.no/youtube/bvZ-rJm7uMU)](https://www.youtube.com/watch?v=bvZ-rJm7uMU) | [![](https://markdown-videos-api.jorgenkh.no/youtube/UqoWyMjcqrA)](https://www.youtube.com/watch?v=UqoWyMjcqrA) | [![](https://markdown-videos-api.jorgenkh.no/youtube/KfhUA9Kfp8Y)](https://www.youtube.com/watch?v=KfhUA9Kfp8Y) | [![](https://markdown-videos-api.jorgenkh.no/youtube/t4OeWHW3SsA)](https://www.youtube.com/watch?v=t4OeWHW3SsA) |\n\n## Troubleshooting\n\n- [DuckDB Troubleshooting Guide](setup/duckdb_troubleshooting.md) — If you're getting OOM errors during `dbt build` with DuckDB\n\n## Extra resources\n\n> [!NOTE]\n> If you find the videos above overwhelming, we recommend 
completing the [dbt Fundamentals](https://learn.getdbt.com/courses/dbt-fundamentals) course and then rewatching the module. It provides a solid foundation for all the key concepts you need in this module.\n\n## SQL refresher\n\nThe homework for this module focuses heavily on window functions and CTEs. If you need a refresher on these topics, you can refer to these notes.\n\n* [SQL refresher](refreshers/SQL.md)\n\n## Homework\n\n* [2026 Homework](../cohorts/2026/04-analytics-engineering/homework.md)\n\n# Community notes\n\n<details>\n<summary>Did you take notes? You can share them here</summary>\n\n* [Slides used in previous years](https://docs.google.com/presentation/d/1xSll_jv0T8JF4rYZvLHfkJXYqUjPtThA/edit?usp=sharing&ouid=114544032874539580154&rtpof=true&sd=true)\n* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/4_analytics.md)\n* [Sandy's DE learning blog](https://learningdataengineering540969211.wordpress.com/2022/02/17/week-4-setting-up-dbt-cloud-with-bigquery/)\n* [Notes by Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week4)\n* [Marcos Torregrosa's blog (spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-4/)\n* [Notes by froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_4_analytics_engineering/notes/notes_week_04.md)\n* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week4.md)\n* [Setting up Prefect with dbt by Vera](https://medium.com/@verazabeida/zoomcamp-week-5-5b6a9d53a3a0)\n* [Blog by Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/02/week-4-data-engineering-zoomcamp-notes-analytics-engineering-and-dbt/)\n* [Setting up DBT with BigQuery by Tofag](https://medium.com/@fagbuyit/setting-up-your-dbt-cloud-dej-9-d18e5b7c96ba)\n* [Blog post by Dewi Oktaviani](https://medium.com/@oktavianidewi/de-zoomcamp-2023-learning-week-4-analytics-engineering-with-dbt-53f781803d3e)\n* [Notes from Vincenzo 
Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)\n* [Notes from Balaji](https://github.com/Balajirvp/DE-Zoomcamp/blob/main/Week%204/Data%20Engineering%20Zoomcamp%20Week%204.ipynb)\n* [Notes by Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/4-analytics-engineering/readme.md)\n* [2024 - Videos transcript week4](https://drive.google.com/drive/folders/1V2sHWOotPEMQTdMT4IMki1fbMPTn3jOP?usp=drive)\n* [Blog Post](https://www.jonahboliver.com/blog/de-zc-w4) by Jonah Oliver\n* [2025 Notes by Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/4_Analytics-Engineering/README.md)\n* [2025 Notes by Horeb SEIDOU](https://spotted-hardhat-eea.notion.site/Week-4-Analytics-Engineering-18929780dc4a808692e4e0ee488bf49c?pvs=74)\n* [2025 Notes by Daniel Lachner](https://github.com/mossdet/dlp_data_eng/blob/main/Notes/04_01_Analytics_Engineering.pdf)\n* [2026 Notes by Sharad K. Gupta](https://github.com/sharadgupta27/data-engineering/blob/main/Notes/dbt_commands.md)\n* [Analytical Engineering overview](https://github.com/khanhnguyen7802/DataEngineer101/tree/main/week4-analytics-engineering#readme) \n* [2026 Notes about dbt](https://github.com/khanhnguyen7802/DataEngineer101/blob/main/week4-analytics-engineering/dbt_installation.md) | [dbt + Duckdb setup using Docker](https://github.com/khanhnguyen7802/DataEngineer101/blob/main/week4-analytics-engineering/dbt_installation.md) by Khanh Nguyen\n* Add your notes here (above this line)\n\n</details>\n"
  },
  {
    "path": "04-analytics-engineering/class_notes/4_1_1_analytics_engineering_basics.md",
    "content": "# DE Zoomcamp 4.1.1 — Analytics Engineering Basics\n\n> 📄 Video: [Analytics Engineering Basics](https://www.youtube.com/watch?v=uF76d5EmdtU)  \n> 📄 Further reading: [What is Analytics Engineering?](https://docs.getdbt.com/docs/introduction)  \n> 📄 Kimball's Dimensional Modeling: *The Data Warehouse Toolkit* (Ralph Kimball & Margy Ross)\n\nThis is the kickoff video for Module 4. No hands-on coding here — it's all about setting the stage. Why does analytics engineering exist, what does it actually do, and what are the data modeling concepts we'll be leaning on for the rest of the module. Worth sitting with before diving into the dbt stuff.\n\n---\n\n## Why analytics engineering exists\n\nA few shifts in the data world created a gap that nobody was filling:\n\n- **Cloud data warehouses** (BigQuery, Snowflake, Redshift) made storage and compute cheap. You no longer have to be surgical about what data you load.\n- **EL tools** like Fivetran and Stitch made getting data into the warehouse almost trivial — the extract and load steps are basically automated now.\n- **SQL-first BI tools** like Looker brought version control into the data workflow. And tools like Mode enabled self-service analytics for business users.\n- **Data governance** became a bigger conversation as more people started touching data.\n\nAll of this changed how data teams work and how stakeholders consume data. But it left a gap between the people building the infrastructure and the people using the data.\n\n### The traditional data team\n\nIn the old model you had three roles and a pretty clean split:\n\n- **Data Engineer** — builds and maintains the infrastructure. Great software engineer, but not necessarily close to how the business actually uses the data.\n- **Data Analyst** — uses the data to answer questions and solve business problems. Understands the business well, but not trained as a software engineer.\n- **Data Scientist** — similar story to the analyst. 
Writing more and more code these days, but software engineering best practices weren't part of the training.\n\n### The gap\n\nAnalysts and scientists are writing more code, but they weren't trained for it. Engineers are great at building systems, but they don't always know how the data gets consumed downstream. Nobody was bridging that gap.\n\n### Analytics Engineer\n\nThe analytics engineer is the bridge. They bring software engineering best practices — version control, testing, documentation, modularity — into the work that analysts and scientists are already doing. It's a role that sits at the intersection of the data engineer and the data analyst.\n\nIn terms of the toolchain, an analytics engineer might touch:\n\n- **Data loading** — tools like Fivetran, Stitch (the EL layer)\n- **Data storing** — cloud data warehouses, shared territory with data engineers\n- **Data modeling** — this is the core of it. Tools like dbt or Dataform. This is where most of Module 4 lives.\n- **Data presentation** — BI tools like Google Looker Studio. The end product that business users actually see.\n\nThe focus this week is on modeling and presentation — everything in between \"data is in the warehouse\" and \"business user sees a dashboard.\"\n\n---\n\n## ETL vs ELT — a quick recap\n\nTwo philosophies for getting data transformed and ready:\n\n**ETL (Extract → Transform → Load)** — you transform the data *before* it hits the warehouse. Takes longer to set up because the transformation logic has to be built first, but the data in the warehouse is clean and stable from day one.\n\n**ELT (Extract → Load → Transform)** — you load the raw data first, then transform it *inside* the warehouse. Faster and more flexible. This is the approach that cloud warehouses made possible — storage is cheap, so just load everything and figure out the transformations later.\n\nELT is the dominant approach now, and it's the one we'll be working with. 
dbt fits squarely into the \"T\" of ELT — it runs transformations inside the warehouse using SQL.\n\n---\n\n## Dimensional Modeling — the key concepts\n\nThis is Kimball's framework, and it's the main mental model for how we'll structure our data this week. The goal is twofold: make the data **understandable to business users**, and make **queries fast**.\n\nNote: unlike third normal form (3NF), dimensional modeling deliberately allows some data redundancy. The priority is usability and performance, not eliminating duplication.\n\n### Fact tables vs Dimension tables (Star Schema)\n\nThe two building blocks:\n\n- **Fact tables** — measurements, metrics, business events. Think of them as **verbs**. \"A sale happened.\" \"An order was placed.\" They correspond to a business process.\n- **Dimension tables** — the context around those facts. Think of them as **nouns**. \"Who bought it? What product? When?\" They correspond to a business entity like a customer or a product.\n\nTogether they form a **star schema** — the fact table in the center, dimension tables radiating out around it. It's the classic layout you'll see in most data warehouses.\n\n### The Kitchen Analogy\n\nKimball's book uses a restaurant analogy to describe how data flows through a warehouse. It maps pretty cleanly onto what we'll be doing in the project:\n\n- **Staging area (the pantry)** — raw data lands here. Not meant for business users. Only people who know what they're doing should be poking around in it.\n- **Processing area (the kitchen)** — this is where raw data gets transformed into proper data models. Again, limited to the people doing the cooking — the data engineers and analytics engineers. The focus here is on efficiency and following standards.\n- **Presentation area (the dining hall)** — the final, polished output. This is what business stakeholders actually see and interact with. 
Clean, structured, ready to consume.\n\nWe'll be building exactly this layered structure in our dbt project throughout the module."
  },
  {
    "path": "04-analytics-engineering/class_notes/4_1_2_what_is_dbt.md",
    "content": "# DE Zoomcamp 4.1.2 — What is dbt?\n\n> 📄 Video: [What is dbt?](https://www.youtube.com/watch?v=gsKuETFJr54)  \n> 📄 Official docs: [Introduction to dbt](https://docs.getdbt.com/docs/introduction)  \n> 📄 dbt Cloud vs Core: [Choose your dbt](https://docs.getdbt.com/docs/cloud/about-cloud/dbt-cloud-features)\n\nThis is the big-picture overview of dbt before we start building anything. What it is, what problems it solves, and how we'll be using it in the course. No hands-on work yet — just the framing.\n\n---\n\n## What is dbt?\n\ndbt is a transformation workflow tool. It sits on top of your data warehouse and helps you turn raw data into something useful for downstream consumers (analysts, BI tools, ML pipelines, whatever needs clean, structured data).\n\nYou write SQL (or Python) to define your transformations, and dbt handles the rest: compiling it, running it against the warehouse, managing dependencies, and persisting the results as tables or views.\n\nIn a real company setup, you'd have data flowing in from all over the place — backend systems, frontend apps, third-party APIs like weather data. All of that gets loaded into your warehouse (BigQuery, Snowflake, Databricks, whatever), and dbt is the layer that transforms that raw data into something the business can actually consume.\n\n---\n\n## What problems it solves\n\nThe transformation step has always existed. What dbt brings to the table is **software engineering best practices for analytics code**. 
Things that software engineers have been doing for years but didn't have a clear path into the analytics world:\n\n- **Version control** — your transformations live in git, just like any other code\n- **Modularity** — break complex logic into reusable pieces instead of massive spaghetti queries\n- **Testing** — automated data quality checks that run with every deployment\n- **Documentation** — generated from your code, not a separate wiki that gets out of date\n- **Environments** — separate dev and prod. Each developer gets their own sandbox to work in without stepping on each other's toes\n- **CI/CD** — automated deployments with validation and rollback\n\nThe result is higher-quality pipelines that are easier to maintain and less prone to breaking in production.\n\n---\n\n## How it works — the mechanics\n\nYou write a SQL file. It looks like a normal `SELECT` statement. dbt takes that file, figures out where it should go in the warehouse (which schema, which dataset, what environment), wraps it in the necessary DDL/DML, compiles it with any Jinja templating you've used, and runs it.\n\nWhen you run `dbt run`, it:\n1. Compiles your SQL (resolves `ref()` calls, `source()` calls, Jinja macros, everything)\n2. Sends the compiled SQL to your warehouse\n3. Materializes the result as a table, view, incremental table, or ephemeral CTE — whatever you configured\n\nYou don't write `CREATE TABLE` statements yourself. You just write the `SELECT`, and dbt handles the rest.\n\n---\n\n## dbt Core vs dbt Cloud\n\nThere are two ways to use dbt, and it's worth understanding the difference:\n\n### dbt Core\n\nOpen source. Free. You install it locally on your machine (or wherever) and run commands from the terminal. You're responsible for:\n\n- Setting up your dev environment\n- Orchestrating production runs (Airflow, cron jobs, whatever you want)\n- Hosting documentation if you want it accessible\n- Managing logs and metadata\n\nIt's the raw engine. 
You get full control, but you also have to build the surrounding infrastructure yourself.\n\n### dbt Cloud\n\nSaaS product that runs dbt Core under the hood. It gives you:\n\n- A web-based IDE for writing transformations (or you can use a Cloud CLI if you prefer local development)\n- Environment management — dev/staging/prod, all handled for you\n- Built-in orchestration (job scheduling, triggers, dependencies)\n- Hosted documentation (automatically generated and served)\n- Logging and observability\n- APIs for administration and metadata access\n- A semantic layer for metrics (if you need it)\n\nThere's a free Developer plan that works for small teams or individual learning. For anything bigger, it's a paid product.\n\n---\n\n## The course setup — two paths\n\nThe Zoomcamp gives you two options, and the videos will alternate between them (version A and version B):\n\n### Option A: BigQuery + dbt Cloud (recommended)\n\n- Data warehouse: BigQuery (assuming you set this up in previous weeks)\n- dbt: dbt Cloud Developer plan (free account, web IDE)\n- No local installation needed\n\nThis is the path most of the videos will follow. It's the fastest way to get started and closest to how teams actually use dbt in production.\n\n### Option B: DuckDB + dbt Core\n\n- Data warehouse: DuckDB (local or however you've got it set up)\n- dbt: dbt Core installed locally\n- Dev environment: your own IDE (VS Code, etc.)\n- Orchestration: you'll need to handle this separately (Airflow, Prefect, whatever)\n\nThis path gives you more hands-on control but requires more setup.\n\n---\n\n## The project flow\n\nBy the time we get to the end of the module, here's what we'll have built:\n\n1. Raw data sitting in the warehouse — trip data from previous weeks, plus a lookup table to demonstrate joining multiple sources\n2. dbt transformations that turn that raw data into properly modeled tables following the dimensional modeling concepts from 4.1.1\n3. 
Dashboards that consume the final output and make it useful for business stakeholders\n\nThe next videos will walk through actually setting this up and building it out step by step."
  },
  {
    "path": "04-analytics-engineering/class_notes/4_2_1_dbt_core_vs_dbt_cloud.md",
    "content": "# DE Zoomcamp 4.2.1 — dbt Core vs dbt Cloud\n\n> 📄 Official feature comparison: [dbt Core vs dbt Cloud](https://www.getdbt.com/product/dbt-core-vs-dbt-cloud)\n\n## dbt Core\n- Born in **2016** as a fully **open-source, command-line tool**\n- 100% free, runs locally on your own machine\n- All code is available on GitHub (can fork, modify, etc.)\n\n## dbt Cloud\n- Introduced **two years after dbt Core** (~2018) by dbt Labs (originally called Fishtown Analytics)\n- Sold as a **paid SaaS platform** — no need to manage infrastructure yourself\n- Handles the heavy lifting:\n  - Hosting dbt documentation\n  - Orchestration\n  - Environment setup\n  - Backups of dbt artifacts (e.g. for Slim CI)\n- Comes with **collaboration and security features** useful for teams/companies\n\n## How They Were Used Together (Hybrid Approach)\n- Common pattern: more technical users worked with dbt Core; less technical users used dbt Cloud\n- The two were designed to be **compatible** — e.g. developers could work locally with dbt Core while production runs were executed through dbt Cloud\n- dbt Labs published an article in **October 2024** outlining how both products were meant to coexist side by side → [How we think about dbt Core and dbt Cloud](https://www.getdbt.com/blog/how-we-think-about-dbt-core-and-dbt-cloud)\n\n## dbt Fusion — The Future\n- In **May 2025**, dbt Labs announced a **full rewrite of the code base** using a new engine called **Fusion**\n- Key improvements:\n  - **Faster compilation** of dbt code (up to 30x faster in some cases)\n  - **Better developer experience** — catches many errors *before* running/building, saving time and money\n- dbt Core will continue to be maintained, but **Fusion is the future direction** for both Core and Cloud\n\n### Fusion Limitations\n- **Not supported by all adapters** — as of early 2026, Fusion supports major adapters like Snowflake, Databricks, Postgres (and derivatives), BigQuery, and Redshift\n- Notably **does not 
support DuckDB** (yet) or many community-maintained adapters\n- If you use a less common adapter, dbt Fusion and the newest versions of dbt Cloud may not work for you\n- Adapter support is being actively expanded — check the official docs for the current list\n\n> 📄 Fusion upgrade guide: [Upgrading to the dbt Fusion engine](https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-fusion)  \n> 📄 Full adapter support list: [Supported features](https://docs.getdbt.com/docs/fusion/supported-features)\n\n## New Vision: Unified License\n- Instead of splitting users between Core and Cloud, Fusion envisions **everyone having a dbt license**\n- Users can choose to work in:\n  - The **dbt Cloud IDE**, or\n  - **VS Code** using the official dbt Labs extension\n- Both options are backed by the same Fusion engine\n\n## Course Decisions & Recommendations\n- This course uses **DuckDB + dbt Core** (local, via VS Code) because:\n  - It forces learners to understand what's actually happening under the hood\n  - dbt Cloud abstracts a lot away — understanding Core first makes Cloud easier to pick up later\n- If you follow along with dbt Cloud + BigQuery, the concepts transfer well\n- dbt Labs' own documentation and courses are excellent resources for learning dbt Cloud specifically → [dbt Developer Hub](https://docs.getdbt.com)\n- **Bottom line:** It doesn't matter much which one you learn first — especially as a consultant, you'll likely use both. Focus on the shared fundamentals.\n\n---\n\n*Note: This document was last updated February 2026. For the latest information on dbt Fusion and adapter support, always consult the official dbt documentation.*"
  },
  {
    "path": "04-analytics-engineering/class_notes/4_3_1_dbt_project_structure.md",
    "content": "# DE Zoomcamp 4.3.1 — dbt Project Structure\n\n> 📄 Video: [dbt Project Structure](https://www.youtube.com/watch?v=2dYDS4OQbT0)  \n> 📄 Official docs: [About dbt projects](https://docs.getdbt.com/docs/build/projects)  \n> 📄 Best practices: [How we structure our dbt projects](https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview)\n\nWhen you run `dbt init`, dbt automatically creates a set of files and folders. This video walks through each one and explains its purpose. The structure below applies to both dbt Core and dbt Cloud (the DuckDB database file and `data/` folder are local-only artifacts and can be ignored here).\n\n---\n\n## Top-Level Files & Folders\n\n### `analysis/`\n- A place for **ad-hoc SQL scripts** that you don't necessarily want to share with stakeholders\n- Not heavily used by everyone, but handy for things like **data quality reports** or **administrative checks**\n- Think of it as a scratchpad — if you want to investigate how bad a data quality issue is, drop a SQL script here\n\n### `dbt_project.yml`\n- **The most important file in a dbt project**\n- Every time you run a dbt command, dbt looks for this file first — if it's missing, the command fails\n- Key things it contains:\n  - Project name\n  - Profile name (must match your `profiles.yml` — critical for dbt Core users)\n  - Default materializations\n  - Variables\n- Also a place to set project-wide defaults and configuration\n\n> 📄 [dbt_project.yml reference](https://docs.getdbt.com/reference/dbt_project.yml)\n\n### `macros/`\n- Macros behave like **reusable functions** (similar to Python functions or UDFs)\n- Use them when you find yourself **repeating the same SQL logic** in multiple places, or when you want to **encapsulate a piece of logic** in one place\n- Benefits:\n  - Easier to test (you're testing a small, isolated chunk)\n  - If a definition changes, you only update it in one place\n- Common use cases:\n  - **Calendar conversions** (e.g. 
converting standard dates to a company's fiscal calendar)\n  - **Tax rates or regulatory definitions** that might change over time\n  - Any reusable business logic that shouldn't be duplicated across models\n\n> 📄 [Jinja and macros](https://docs.getdbt.com/docs/build/jinja-macros)\n\n### `models/`\n- The **most important directory** — this is where all your SQL transformation logic lives\n- dbt suggests breaking it into **three subfolders** (see below)\n\n### `README.md`\n- Standard project documentation — the first thing someone sees when they open your project\n- dbt creates a default one, but most teams customize it\n- Good things to include:\n  - How to run the project\n  - Whether you need credentials or onboarding\n  - Contact information\n  - Installation/setup guides\n\n### `seeds/`\n- A place to **upload CSV or flat files** and ingest them as dbt models in your database\n- Considered a **quick-and-dirty** approach — if you have the option, it's better to load data properly at the source\n- Useful for:\n  - **Lookup tables**\n  - Quick experiments or prototypes\n  - Showing a stakeholder something before fully committing to a data load\n- Use when you don't have the right permissions, or the data is expected to change frequently during experimentation\n\n> 📄 [Seeds](https://docs.getdbt.com/docs/build/seeds)\n\n### `snapshots/`\n- Solves a specific problem: a source table has a column that **overwrites itself**, but you need to **keep the history**\n- Example: an `orders` table with a `current_status` column that only ever shows the latest status. For analytics, you want to know *when* each status changed\n- How it works: a snapshot takes a **\"picture\" of a table at a point in time**. Each time you run it, if a value has changed, a new row is recorded with a timestamp — without overwriting the previous value\n- Like seeds, this is a **workaround** — ideally you'd solve this at the source. 
But if you don't control the source, snapshots work well\n\n> 📄 [Snapshots](https://docs.getdbt.com/docs/build/snapshots)\n\n### `tests/`\n- A place for **singular tests** written as SQL assertions\n- The logic is simple: **if the query returns more than zero rows, the dbt build fails**\n- Example from the course: a client needed to ensure that vehicle timestamps always covered exactly 24 hours per day. A test query checked for any day where the total hours deviated from 24 — catching logic errors like accidental filters or bad joins early\n- This is one of several ways to test in dbt, but singular tests are especially good for **custom business rules** that don't fit standard schema tests\n\n> 📄 [Data tests (singular & generic)](https://docs.getdbt.com/docs/build/data-tests)\n\n---\n\n## The `models/` Subfolders\n\ndbt suggests organizing models into three layers:\n\n### `staging/`\n- Contains two things:\n  - **Source definitions** — telling dbt where your raw data lives in the database\n  - **Staging models** — a **1:1 copy** of each source table with only **minimal cleaning** applied\n- Minimal cleaning means things like:\n  - Fixing data types\n  - Renaming columns\n  - Filtering out clearly empty rows\n  - Removing unnecessary columns\n  - Standardizing values\n- Keep it **1:1** — same number of rows and columns as the raw source. 
Breaking this rule is occasionally convenient but should be the exception\n\n### `intermediate/`\n- Everything that is **not raw** and **not ready to expose** to end users\n- A catch-all for:\n  - Complex joins\n  - Heavy-duty cleaning or standardization\n  - Data quality processing\n- No strict guidelines on what goes here — if it doesn't fit neatly into staging or marts, it belongs in intermediate\n\n### `marts/`\n- Where all the **final, consumption-ready** tables live\n- If it's in marts, it's **ready for end users**\n- In a well-governed dbt project, **only marts tables should be exposed** to BI tools, analysts, and business stakeholders — nothing else\n- Typically contains:\n  - Tables ready for dashboards\n  - Properly modeled, clean tables\n  - Often star schemas, but not necessarily\n\n---\n\n## A Note on Conventions\n\nThe `staging → intermediate → marts` structure is dbt's recommendation, but it's not mandatory. The instructor has seen teams use:\n- **Medallion architecture** naming: `bronze`, `silver`, `gold`\n- Numbered layers: `first`, `second`, `third`, `last`\n- Other custom conventions\n\nIf your organization already has a convention, follow it. Otherwise, stick with dbt's default structure — it's well thought out and what this course uses."
  },
  {
    "path": "04-analytics-engineering/class_notes/4_3_2_dbt_sources.md",
    "content": "# DE Zoomcamp 4.3.2 — dbt Sources\n\n> 📄 Video: [dbt Sources](https://www.youtube.com/watch?v=7CrrXazV_8k)  \n> 📄 Official docs: [Sources](https://docs.getdbt.com/docs/build/sources)  \n> 📄 Best practices: [How we structure our dbt projects — Staging](https://docs.getdbt.com/best-practices/how-we-structure/03-staging)\n\nThis video is about telling dbt where your raw data actually lives. Sources are how dbt knows which tables to pull from before any transformation happens. Everything in this video takes place inside the `models/staging/` folder that we set up in 4.3.1.\n\n---\n\n## Defining Sources\n\n### `sources.yml`\n- A **YAML file** inside `models/staging/` that tells dbt where your raw data is\n- The **name** of the file is arbitrary — common choices are `sources.yml`, `_sources.yml` (underscore so it sorts to the top), or something named after the origin like `bigquery_sources.yml`\n- You give your source a **name** — this is arbitrary too. Think of it as a label: `raw`, `raw_data`, or something more descriptive like `google_analytics_data` or `finance_data`\n- Then you provide three fields that are **not** arbitrary — they must exactly match your warehouse:\n  - **database** — the database name or GCP project\n  - **schema** — the schema inside that database or BigQuery dataset\n  - **tables** — the individual tables you want to reference\n\n```yaml\nsources:\n  - name: staging # Label you pass to source() in your models\n    database: taxi_rides_ny # Or name of your GCP project\n    schema: prod # Or name of your BigQuery dataset\n    \n    tables:\n      - name: green_tripdata\n      - name: yellow_tripdata\n```\n\n> 📄 [Sources — full reference](https://docs.getdbt.com/docs/build/sources)\n\n### Local (DuckDB) vs BigQuery — what goes where\n\nThe meaning of database, schema, and tables changes depending on your setup:\n\n| Field | Local (DuckDB) | BigQuery |\n|---|---|---|\n| **database** | `taxi_rides_ny` | Your GCP Project ID |\n| **schema** | `main` | Your BigQuery Dataset name 
(e.g. `trips_data_all`) |\n| **tables** | `green_tripdata`, `yellow_tripdata` | Same table names |\n\n- If you followed the default local setup, these names should be exactly right out of the box\n- If you're on BigQuery, just double-check that your table names match what you actually have in your dataset\n\n---\n\n## Using Sources in Your Models\n\n### The `source()` function\n- Instead of hard-coding the full path to your table (e.g. `FROM production.trips_data_all.green_tripdata`), you use the **`source()`** function\n- It's a **Jinja macro** — you'll recognize it by the double curly brackets `{{ }}`\n- It takes two arguments:\n  - The **source name** — the one you defined in your YAML (e.g. `staging`)\n  - The **table name** — must match exactly what you put under `tables` in the YAML\n- As long as there's a YAML file somewhere in your project with a matching source declaration, this will resolve correctly at compile time\n\n```sql\nselect * from {{ source('staging', 'green_tripdata') }}\n```\n\n- Run a preview and you should see the raw table data come back\n- If it works, that's the foundation — everything else builds on this\n\n---\n\n## Building a Proper Staging Model\n\n### Naming convention\n- Prefix your staging model files with **`stg_`** to make it clear what layer they belong to\n- So `green_tripdata.sql` becomes `stg_green_tripdata.sql`\n- Other common prefixes: `int_` for intermediate, and sometimes nothing at all for final mart models\n\n### Rename and reorder columns\n- List out every column explicitly and give them **cleaner aliases**\n- Be purposeful about the **order** — it should follow a logical grouping:\n  - **Identifiers first** — `vendor_id`, `trip_id`, anything that's an ID\n  - **Timestamps next** — `pickup_datetime`, `dropoff_datetime`\n  - **Trip details** — `passenger_count`, `trip_distance`, `trip_type`\n  - **Payment info last** — `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`, `total_amount`, `payment_type`\n\n### 
Cast data types explicitly\n- Don't rely on whatever the source gave you — cast everything to the type you actually want:\n  - IDs → `integer`\n  - Timestamps → `timestamp`\n  - Counts → `integer`\n  - Monetary values → `numeric` or `float` (depends on your platform)\n\n```sql\nwith tripdata as (\n  select *\n  from {{ source('staging','green_tripdata') }}\n  where vendorid is not null \n),\n\nrenamed as (\n  select\n      -- identifiers\n      cast(vendorid as integer) as vendorid,\n      cast(ratecodeid as integer) as ratecodeid,\n      cast(pulocationid as integer) as pickup_locationid,\n      cast(dolocationid as integer) as dropoff_locationid,\n      \n      -- timestamps\n      cast(lpep_pickup_datetime as timestamp) as pickup_datetime,\n      cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,\n      \n      -- trip info\n      store_and_fwd_flag,\n      cast(passenger_count as integer) as passenger_count,\n      cast(trip_distance as numeric) as trip_distance,\n      cast(trip_type as integer) as trip_type,\n      \n      -- payment info\n      cast(fare_amount as numeric) as fare_amount,\n      cast(extra as numeric) as extra,\n      cast(mta_tax as numeric) as mta_tax,\n      cast(tip_amount as numeric) as tip_amount,\n      cast(tolls_amount as numeric) as tolls_amount,\n      cast(ehail_fee as numeric) as ehail_fee,\n      cast(improvement_surcharge as numeric) as improvement_surcharge,\n      cast(total_amount as numeric) as total_amount,\n      cast(payment_type as integer) as payment_type\n  from tripdata\n)\n\nselect * from renamed\n```\n\n---\n\n## A Note on Filtering\n\n- The general recommendation is to keep staging models as **1:1 copies** of the source — same number of rows, same number of columns, just cleaned up\n- That said, this dataset has some data quality issues (we'll cover those later), so it makes sense to filter out rows where 
**`vendorid IS NULL`** right here in staging\n- It's a deviation from convention, but a practical one for this project\n\n---\n\n## Your Exercise\n\nDo the same thing for the **yellow tripdata** table. The columns are almost identical to green, so it shouldn't be too painful. By the end you should have:\n- A `sources.yml` that declares both tables\n- A `stg_green_tripdata.sql` staging model\n- A `stg_yellow_tripdata.sql` staging model"
  },
  {
    "path": "04-analytics-engineering/class_notes/4_4_1_dbt_models.md",
    "content": "# DE Zoomcamp 4.4.1 — dbt Models\n\n> 📄 Video: [dbt Models](https://www.youtube.com/watch?v=JQYz-8sl1aQ)  \n> 📄 Official docs: [SQL models](https://docs.getdbt.com/docs/build/sql-models)  \n> 📄 ref() function: [About ref](https://docs.getdbt.com/reference/dbt-jinja-functions/ref)\n\nStaging is done. From here on out it's not just typing SQL behind a computer — you need to actually **explore the data**, understand what's in it, and get some **business context**. In a real org that means querying exhaustively until you understand the common data quality issues, what a normal row looks like, and talking to people about what the codes mean and when rows trigger. All of that understanding eventually gets encoded as SQL.\n\n---\n\n## What are we building?\n\nBefore writing any code, it helps to think about what the end result should look like. There are generally two things you want in your marts:\n\n### Reports and dashboards\n- If there's an important dashboard or data application out there — especially one that requires a lot of manual work or spreadsheet maintenance — that's a sign it should become a dbt model\n- Example: imagine there's a dashboard with a dataset called **monthly revenue per location**. That's something we want to build and version-control properly\n\n### A dimensional model\n- Beyond reports, you want a proper **star schema** — the kind of structure you see in data warehouses\n- Two key table types to know:\n  - **Fact tables** — one row per event/process. One row per trip, one row per sale, one row per order. Named with a `fct_` prefix (e.g. `fct_trips`)\n  - **Dimension tables** — attributes of an entity. Named with a `dim_` prefix (e.g. `dim_zones`, `dim_vendors` is not shown here)\n- The power of a good star schema: answering \"how many?\" questions becomes trivial. *How many zones do we have?* → `COUNT(*)` on `dim_zones`. *How many trips?* → `COUNT(*)` on `fct_trips`. 
Simple, focused tables that you join when you need something more complex\n\n### What we're building in this course\n- `dim_zones` — zone/location attributes  \n- `fct_trips` — one row per trip (yellow + green combined)\n- A report model for monthly revenue per zone (inside a `models/core/` folder)\n\n---\n\n## source() vs ref() — the key distinction\n\nThis is an important moment in the course. Up until now we've been using `{{ source() }}` to pull in raw data. But that's **only** for things declared in your sources YAML — i.e. raw tables that live outside of dbt.\n\nIf the input to your model is **another dbt model**, you use `{{ ref() }}` instead.\n\n- `{{ source('name', 'table') }}` → raw data defined in your YAML\n- `{{ ref('model_name') }}` → another dbt model\n\n> 📄 [ref() — full reference](https://docs.getdbt.com/reference/dbt-jinja-functions/ref)\n\nThis distinction matters because `ref()` also does something useful under the hood: it automatically builds the **dependency graph**. dbt knows that if model B refs model A, then A has to run first. You never have to manage run order yourself.\n\n---\n\n## The intermediate layer — why it exists\n\nWe want `fct_trips` to be a union of yellow and green trip data. But doing that union directly inside the fact model would make it messy. So we put it in an **intermediate model** instead — something that's not raw, and not ready to expose to end users.\n\n- Convention: prefix intermediate models with `int_`  \n- In this case: `int_trips_unioned.sql`\n- The idea is to keep intermediate work out of marts. 
Marts should only contain things that are consumption-ready\n\n```sql\nwith green_data as (\n    select *, \n        'Green' as service_type \n    from {{ ref('stg_green_tripdata') }}\n), \n\nyellow_data as (\n    select *, \n        'Yellow' as service_type\n    from {{ ref('stg_yellow_tripdata') }}\n), \n\ntrips_unioned as (\n    select * from green_data\n    union all\n    select * from yellow_data\n)\n\nselect * from trips_unioned\n```\n\n---\n\n## The union problem — yellow and green aren't identical\n\nWhen you try to union the two staging models, it fails. The error: *set operation can only be applied with expressions with the same number of columns*. Turns out green has **two extra columns** that yellow doesn't:\n\n### `trip_type`\n- Values are `1` or `2`\n- `1` = street hail (you flag down the taxi)\n- `2` = booked via phone or app\n- Yellow taxis **don't have this column** because by law you can only get a yellow taxi by hailing it on the street — it's always type 1\n- Fix: add `trip_type` to the yellow staging model and hard-code it as `1` (street hail)\n\n### `ehail_fee` (e-hail fee)\n- An extra fee that can apply when you request a taxi through an app\n- In practice, most of this data is null — the feature isn't consistently implemented across vendors\n- Yellow taxis by definition **never** have an e-hail fee\n- Fix: add `ehail_fee` to the yellow staging model and hard-code it as `0`\n\n```sql\n-- Updated stg_yellow_tripdata.sql to match green schema\nwith tripdata as (\n  select *\n  from {{ source('staging','yellow_tripdata') }}\n  where vendorid is not null \n),\n\nrenamed as (\n    select\n        -- identifiers\n        cast(vendorid as integer) as vendor_id,\n        cast(ratecodeid as integer) as ratecode_id,\n        cast(pulocationid as integer) as pickup_location_id,\n        cast(dolocationid as integer) as dropoff_location_id,\n        \n        -- timestamps\n        cast(tpep_pickup_datetime as timestamp) as pickup_datetime,\n        
cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,\n        \n        -- trip info\n        store_and_fwd_flag,\n        cast(passenger_count as integer) as passenger_count,\n        cast(trip_distance as numeric) as trip_distance,\n        cast(1 as integer) as trip_type,  -- Yellow only does street-hail\n        \n        -- payment info\n        cast(fare_amount as numeric) as fare_amount,\n        cast(extra as numeric) as extra,\n        cast(mta_tax as numeric) as mta_tax,\n        cast(tip_amount as numeric) as tip_amount,\n        cast(tolls_amount as numeric) as tolls_amount,\n        cast(0 as numeric) as ehail_fee,  -- Yellow doesn't have ehail\n        cast(improvement_surcharge as numeric) as improvement_surcharge,\n        cast(total_amount as numeric) as total_amount,\n        cast(payment_type as integer) as payment_type\n    from tripdata\n)\n\nselect * from renamed\n```\n\nA note: adding these columns directly in staging is technically a break from the \"1:1 copy\" rule. It's done here to keep things simple, but in a stricter project you'd handle this in the intermediate layer.\n\n**Updated union after schema alignment:**\n\n```sql\n-- models/intermediate/int_trips_unioned.sql\nwith green_data as (\n    select *, \n        'Green' as service_type \n    from {{ ref('stg_green_tripdata') }}\n), \n\nyellow_data as (\n    select *, \n        'Yellow' as service_type\n    from {{ ref('stg_yellow_tripdata') }}\n), \n\ntrips_unioned as (\n    select * from green_data\n    union all\n    select * from yellow_data\n)\n\nselect * from trips_unioned\n```\n\n---\n\n## Why the business context matters\n\nThe column discrepancy between yellow and green isn't just a technical problem — it's a **business story**. Yellow and green taxis exist because of how NYC taxi licensing works: yellow cabs work mostly in Manhattan, while green cabs were created so people in the outer boroughs could get rides too. 
Understanding that context is what lets you make the right call on how to handle `trip_type` and `ehail_fee` — not just technically, but semantically.\n\nThis is the part of analytics engineering where you stop just writing SQL and start understanding what the data actually represents."
  },
  {
    "path": "04-analytics-engineering/class_notes/4_4_2_dbt_seeds_and_macros.md",
    "content": "# DE Zoomcamp 4.4.2 — dbt Seeds and Macros\n\n> 📄 Video: [dbt Seeds and Macros](https://www.youtube.com/watch?v=lT4fmTDEqVk)  \n> 📄 Seeds docs: [Seeds](https://docs.getdbt.com/docs/build/seeds)  \n> 📄 Macros docs: [Jinja and macros](https://docs.getdbt.com/docs/build/jinja-macros)\n\nThe union model is done, but right now vendor IDs and location IDs are just numbers — meaningless codes. This video is about enriching that data. Two dbt features come in: **seeds** for bringing in lookup data, and **macros** for turning reusable SQL logic into something you don't have to copy-paste everywhere.\n\n---\n\n## The problem — codes everywhere\n\nIf you query `vendor_id`, you get values: 1 and 2. Those map to real companies:\n- **1** → Creative Mobile Technologies\n- **2** → VeriFone Inc.\n\nSame story with locations — 265 location IDs that could have names, boroughs, coordinates, and more. The raw data just doesn't have any of that. So how do we add it?\n\n---\n\n## Seeds — bringing in lookup data\n\n### What seeds are\n- A way to **upload a CSV file** and make it available as a dbt model\n- You drop the CSV into the `seeds/` directory, run `dbt seed`, and it becomes queryable just like any other model\n- You reference it with `{{ ref('filename') }}` — same as any other model\n\n### When to use them\n- **Lookup tables** that don't exist anywhere in your warehouse yet\n- Cases where you don't have write permissions to load data properly\n- Quick experiments or local testing before committing to a proper data load\n- Small, static datasets\n\n### When NOT to use them\n- **Never commit confidential data** — seeds go into your git repo\n- Keep the data **small** — large CSVs in git will slow down pulls and pushes\n- If you have the option to load the data properly at the source, do that instead. 
Seeds are a quick-and-dirty workaround\n\n> 📄 [Seeds — full reference](https://docs.getdbt.com/docs/build/seeds)\n\n---\n\n## dim_zones — using a seed in practice\n\nThe taxi zone lookup CSV has exactly what we need: location ID, borough, zone name, and service area. Drop it into `seeds/`, run `dbt seed`, and it's live.\n\nNow we build `dim_zones`. The model simply selects from the seed and renames columns to something cleaner.\n\n```sql\nselect\n    locationid as location_id,\n    borough,\n    zone,\n    service_zone\nfrom {{ ref('taxi_zone_lookup') }}\n```\n\nThat's it — first dimension table done. The seed did the heavy lifting.\n\n---\n\n## dim_vendors — the CASE WHEN problem (not implemented in this project, but shown for learning)\n\nFor vendors, we could pull distinct `vendor_id` from the intermediate union model using `ref()`. Easy enough. But we want to enrich it with vendor **names**.\n\n### The naive approach: CASE WHEN\nYou could just write it inline:\n\n```sql\nwith vendors as (\n    select distinct vendorid\n    from {{ ref('stg_green_tripdata') }}\n)\n\nselect\n    vendorid,\n    case \n        when vendorid = 1 then 'Creative Mobile Technologies, LLC'\n        when vendorid = 2 then 'VeriFone Inc.'\n        else 'Unknown'\n    end as vendor_name\nfrom vendors\n```\n\nThis works. But it has a real problem: **what happens when a new vendor appears, or a vendor changes its name?** You have to open this file, find the CASE block, and add another line. And if you need the same mapping somewhere else in the project, you copy-paste the whole thing. Eventually someone forgets to update one of the copies.\n\n### The better approach: macros\n\nMacros are dbt's answer to this. 
Think of them as **reusable SQL functions** — same idea as a Python function, but for SQL snippets.\n\n> 📄 [Jinja and macros — full reference](https://docs.getdbt.com/docs/build/jinja-macros)\n\n### How macros work\n- Defined in `.sql` files inside the `macros/` directory\n- You wrap your SQL logic in `{% macro macro_name(argument) %}` ... `{% endmacro %}`\n- The argument works just like a function parameter — you pass in a value when you call it\n- You call it in your models with `{{ macro_name(argument) }}`\n- dbt compiles it down — the final SQL looks exactly like you typed the CASE block inline, but your source code stays clean\n\n```sql\n{% macro get_vendor_data(vendor_id_column) %}\n\n{% set vendors = {\n    1: 'Creative Mobile Technologies',\n    2: 'VeriFone Inc.',\n    4: 'Unknown/Other'\n} %}\n\ncase {{ vendor_id_column }}\n    {% for vendor_id, vendor_name in vendors.items() %}\n    when {{ vendor_id }} then '{{ vendor_name }}'\n    {% endfor %}\nend\n\n{% endmacro %}\n\n```\n\n**Using the macro in a model:**\n\n```sql\nwith trips as (\n    select * from {{ ref('fct_trips') }}\n),\n\nvendors as (\n    select distinct\n        vendor_id,\n        {{ get_vendor_data('vendor_id') }} as vendor_name\n    from trips\n)\n\nselect * from vendors\n```\n\n### Why this is better\n- **Reusable** — need the same vendor mapping somewhere else? Just call the macro again\n- **Single source of truth** — a vendor changes its name or a new one appears? Update the macro in one place and it's fixed everywhere\n- **Testable** — the logic is isolated in its own file, easier to reason about\n\n---\n\n## Homework preview — fct_trips\n\nThe fact trips model is left as an exercise. Here's what's expected:\n\n- **One row per trip** — yellow and green combined (the union is already done in the intermediate model)\n- **Add a primary key** (`trip_id`) — it has to be **unique**\n- **Find and fix duplicates** — there are quite a few in this dataset. 
Some come from the source, some get introduced during the union. Find them, understand why they happen, and fix them\n- **Enrich `payment_type`** (there is a seed for this in the repo)."
  },
  {
    "path": "04-analytics-engineering/class_notes/4_5_1_documentation.md",
    "content": "# DE Zoomcamp 4.5.1 — Documentation\n\n> 📄 Video: [Documentation](https://www.youtube.com/watch?v=UqoWyMjcqrA)  \n> 📄 Official docs: [Documentation](https://docs.getdbt.com/docs/build/documentation)  \n> 📄 Model properties: [Model properties](https://docs.getdbt.com/reference/model-properties)\n\nThe models are built. Now it's time to make sure other people can actually understand what they do. This video covers how dbt's documentation system works — what you write, where you write it, and what dbt does with it.\n\n---\n\n## Where documentation lives — YAML files\n\nYou've already seen YAML files in the context of sources. But they do more than just declare where raw data lives — they're also the **primary place to document your entire project**.\n\nThe most common convention is to have a single file called `schema.yml` per directory. Some teams prefer **one YAML file per model** — that's fine too, it keeps things from getting unwieldy when projects get large. For this course we stick with `schema.yml`.\n\n> 📄 [Model properties — full reference](https://docs.getdbt.com/reference/model-properties)\n\n---\n\n## What you can document\n\nAlmost everything in dbt can be documented. The structure is the same pattern regardless of what you're documenting:\n\n### Sources\n\nYou already have a `sources.yml` — you can add descriptions to the source itself and to each table inside it.\n\n```yaml\nversion: 2\n\nsources:\n  - name: staging\n    description: >\n      Raw NYC taxi trip data loaded from BigQuery external tables.\n      Contains both yellow and green taxi trip records for 2019-2020.\n    database: production\n    schema: trips_data_all\n    \n    tables:\n      - name: green_tripdata\n        description: >\n          Green taxi trip records. 
Green taxis operate primarily in\n          outer boroughs (outside Manhattan).\n          \n      - name: yellow_tripdata\n        description: Yellow taxi trips, primarily from Manhattan\n```\n\n### Models\n\nIn `schema.yml`, you switch from `sources:` to `models:`. Same idea — give each model a name and a description, then drill down into columns.\n\n```yaml\nversion: 2\n\nmodels:\n  - name: dim_zones\n    description: >\n      Zone lookup table containing LocationID, borough, zone name and service zone.\n      One row per taxi zone in NYC.\n    columns:\n      - name: locationid\n        description: Primary key for taxi zones\n        tests:\n          - unique\n          - not_null\n      \n      - name: borough\n        description: NYC borough name (Manhattan, Queens, Brooklyn, Bronx, Staten Island, EWR)\n      \n      - name: zone\n        description: Taxi zone name/neighborhood\n      \n      - name: service_zone\n        description: Service zone type (Yellow, Green, or Airports)\n```\n\n### Columns\nUnder each model, you can list every column with:\n- **name** — must match the actual column name\n- **description** — what it means\n- **data_type** — what type it should be (informational, not enforced)\n- **tests** — we'll cover these in the next video, but the slot is here\n- **meta** — custom key-value tags (more on this below)\n\n### Macros and seeds\nYou can document these too, using the same YAML pattern. Same `version: 2` header, just different top-level keys.\n\n---\n\n## Multi-line descriptions\n\nIf you need more than one line for a description, use the YAML **pipe operator** (`|`) or **greater-than operator** (`>`). Everything indented under it becomes part of the description. 
The `>` folds newlines into spaces, while `|` preserves them.\n\n```yaml\nversion: 2\n\nmodels:\n  - name: fct_trips\n    description: |\n      Fact table containing all taxi trips from both yellow and green taxis.\n      \n      This is the core analytical table for trip-level analysis.\n      Each row represents a single trip with:\n      - Trip identifiers and service type\n      - Pickup and dropoff locations and timestamps\n      - Trip details (distance, passenger count, etc.)\n      - Payment information and amounts\n      \n      Data is filtered for 2019-2020 only and excludes records\n      with unknown pickup or dropoff locations.\n```\n\n---\n\n## Meta tags — custom metadata\n\nThe `meta` field lets you attach arbitrary key-value pairs to any column or model. There's no predefined set — you and your team decide what matters. Common examples:\n\n- **PII** — flag columns that contain personally identifiable information\n- **owner** — who's responsible for this data asset, who to contact if something breaks\n- **importance** — mark which columns or models are critical vs. informational\n\nThese don't affect how dbt runs anything. They're purely for governance, discoverability, and helping your team navigate the project.\n\n---\n\n## Generating and viewing the docs\n\nTwo commands, run them in order:\n\n### `dbt docs generate`\n- Compiles everything — your YAML descriptions, your model code, and metadata from the warehouse (like actual column types and table sizes) — into a JSON file\n- In **dbt Cloud**, this happens automatically. 
There's even a checkbox for it\n- In **dbt Core**, you have to run it yourself\n\n### `dbt docs serve`\n- Takes the generated JSON and spins up a local website (defaults to `localhost:8080`)\n- Only needed if you're on **dbt Core** — dbt Cloud hosts the docs for you\n- If you want other people to see it, you'll need to host it somewhere (S3, Netlify, etc.)\n\n### What the docs site shows you\n- **Model code** — both the Jinja version you wrote and the compiled SQL that actually hits the database\n- **Column info** — types, descriptions, anything you added\n- **Lineage graph** — a visual DAG showing sources in green, all the way through to your final mart models. You can see exactly what depends on what, and whether a change might break something downstream\n- **Project structure** — toggle between a folder view and a database view\n\nIt's more of a **technical documentation** tool than a pretty data catalog. It's not going to replace something like Looker or Confluent's data catalog for non-technical stakeholders. But for the people building the models, it's genuinely useful — you can see at a glance what data assets exist, how they connect, and how they work."
  },
  {
    "path": "04-analytics-engineering/class_notes/4_5_2_dbt_tests.md",
    "content": "# DE Zoomcamp 4.5.2 — dbt Tests\n\n> 📄 Video: [dbt Tests](https://www.youtube.com/watch?v=bvZ-rJm7uMU)  \n> 📄 Official docs: [Data tests](https://docs.getdbt.com/docs/build/data-tests) | [Unit tests](https://docs.getdbt.com/docs/build/unit-tests) | [Model contracts](https://docs.getdbt.com/docs/mesh/govern/model-contracts)\n\nWrong KPIs in dashboards, bad numbers in reports — there are really only two causes: the underlying data wasn't what you expected, or you messed up the SQL. As an analytics engineer, if you can't tell which one it is, both are technically your fault. Tests are how you stay on top of this proactively. dbt ships with a pretty large suite of testing options, and this video walks through all of them.\n\n---\n\n## 1. Singular tests\n\nThe simplest kind of test. You write a plain SQL query, stick it in the `tests/` directory, and that's it — it's now a test.\n\nThe logic is straightforward: **if the query returns any rows, the test fails.** You're writing a query that selects for the \"bad\" cases. Zero rows back means everything checks out.\n\n```sql\n-- tests/assert_positive_fare_amount.sql\n-- Fare amounts should always be positive\n\nselect\n    tripid,\n    fare_amount\nfrom {{ ref('fct_trips') }}\nwhere fare_amount <= 0\n```\n\nThese are great for one-off business rules that are very specific to your organization — the kind of thing no generic test is going to cover out of the box.\n\n> 📄 [Singular data tests — docs](https://docs.getdbt.com/docs/build/data-tests#singular-data-tests)\n\n---\n\n## 2. Source freshness tests\n\nThese live in your source YAML, not in a separate file. You add a `freshness` block to a source and tell dbt which column indicates when data was last loaded. 
Then you run `dbt source freshness` and dbt checks whether that timestamp is recent enough.\n\nYou can set both `warn_after` and `error_after` thresholds — one to flag it, one to actually fail.\n\n```yaml\nversion: 2\n\nsources:\n  - name: staging\n    database: production\n    schema: trips_data_all\n    tables:\n      - name: green_tripdata\n        loaded_at_field: lpep_pickup_datetime\n        freshness:\n          warn_after: {count: 6, period: hour}\n          error_after: {count: 12, period: hour}\n      \n      - name: yellow_tripdata\n        loaded_at_field: tpep_pickup_datetime\n        freshness:\n          warn_after: {count: 6, period: hour}\n          error_after: {count: 12, period: hour}\n```\n\nNot something you see everywhere, but for pipelines where stale data would cause real problems it's a lifesaver.\n\n> 📄 [Source freshness — docs](https://docs.getdbt.com/reference/resource-properties/freshness)\n\n---\n\n## 3. Generic tests\n\nThis is the big one — the most common type of test you'll see in dbt projects. Generic tests are defined in your YAML right alongside your column descriptions. 
They're parameterized and reusable, so you write the logic once and apply it across as many columns and models as you need.\n\n### The four built-in generic tests\n\ndbt ships with exactly four:\n\n- **unique** — no duplicate values in this column\n- **not_null** — no nulls allowed\n- **accepted_values** — column values must be within a defined list\n- **relationships** — every value in this column must exist in another model (referential integrity)\n\n```yaml\nversion: 2\n\nmodels:\n  - name: stg_green_tripdata\n    description: Staged green taxi data\n    columns:\n      - name: tripid\n        description: Primary key for trips\n        tests:\n          - unique\n          - not_null\n      \n      - name: vendorid\n        tests:\n          - not_null\n      \n      - name: payment_type\n        description: Payment method code\n        tests:\n          - accepted_values:\n              values: [1, 2, 3, 4, 5, 6]\n      \n      - name: pickup_locationid\n        description: Taxi zone where trip started\n        tests:\n          - relationships:\n              to: ref('taxi_zone_lookup')\n              field: locationid\n```\n\n> 📄 [Generic data tests — docs](https://docs.getdbt.com/docs/build/data-tests#generic-data-tests)\n\n### Writing your own custom generic tests\n\nFour tests won't cover everything. You can write your own — they're SQL files that live in `tests/generic/`. 
The syntax uses Jinja test blocks, and dbt will pick them up and make them available just like the built-ins.\n\n```sql\n-- tests/generic/test_positive_values.sql\n-- Returns (and therefore fails on) every row where the column is zero or negative\n{% test positive_values(model, column_name) %}\n\nselect *\nfrom {{ model }}\nwhere {{ column_name }} <= 0\n\n{% endtest %}\n```\n\n**Usage in schema.yml:**\n```yaml\nmodels:\n  - name: fct_trips\n    columns:\n      - name: fare_amount\n        tests:\n          - positive_values\n      \n      - name: trip_distance\n        tests:\n          - positive_values\n```\n\nAnd here's the thing — you probably don't need to write as many custom tests as you'd expect. The dbt community has already built a ton of them in open-source packages (dbt-utils, dbt-expectations, etc.). Worth checking those before rolling your own.\n\n> 📄 [Writing custom generic tests — docs](https://docs.getdbt.com/best-practices/writing-custom-generic-tests)\n\n---\n\n## 4. Unit tests\n\nAvailable from dbt v1.8 onwards (released in mid-2024). Unit tests let you test your SQL logic in isolation, without hitting the warehouse with real data.\n\nThe idea: you define a small set of mock input rows and the expected output rows. dbt runs your model's SQL against those mocks and checks whether the output matches what you said it should be. 
This is especially handy for complex logic — rolling windows, regex, edge cases — because you can test for scenarios that haven't even shown up in your real data yet.\n\n```yaml\nversion: 2\n\nunit_tests:\n  - name: test_payment_type_mapping\n    description: Test that payment type codes map to correct descriptions\n    model: stg_green_tripdata\n    given:\n      - input: source('staging', 'green_tripdata')\n        rows:\n          - {tripid: '1', payment_type: 1}\n          - {tripid: '2', payment_type: 2}\n          - {tripid: '3', payment_type: 5}\n    expect:\n      rows:\n        - {tripid: '1', payment_type_description: 'Credit card'}\n        - {tripid: '2', payment_type_description: 'Cash'}\n        - {tripid: '3', payment_type_description: 'Unknown'}\n```\n\nUnit tests are defined in YAML in your `models/` directory, and currently only support SQL models. Since the inputs are static, there's no reason to run them in production — use them in development and CI.\n\nAs of early 2026, unit tests have been available for about 18 months and are seeing increasing adoption, especially for teams with complex transformation logic or strict data quality requirements. They're particularly useful in CI/CD pipelines where you want to catch logic errors before they hit production data.\n\n> 📄 [Unit tests — docs](https://docs.getdbt.com/docs/build/unit-tests)\n\n---\n\n## 5. Model contracts\n\nThe last type covered in this video, and a bit different from the others. Model contracts aren't about catching bad data after the fact — they're about **preventing your model from building at all** if it doesn't match a defined shape.\n\nYou define the expected columns, data types, and optionally constraints in your YAML. Then you flip on `contract: enforced: true` in the model's config. 
From that point on, if your model's output doesn't match — wrong column name, wrong type, missing column — dbt will error out before anything gets materialized.\n\n```yaml\nversion: 2\n\nmodels:\n  - name: fct_trips\n    config:\n      contract:\n        enforced: true\n    columns:\n      - name: tripid\n        data_type: string\n        constraints:\n          - type: not_null\n          - type: unique\n      \n      - name: pickup_datetime\n        data_type: timestamp\n        constraints:\n          - type: not_null\n      \n      - name: service_type\n        data_type: string\n      \n      - name: total_amount\n        data_type: numeric\n```\n\nThe idea behind this comes from the concept of **data contracts** — you sit down with your stakeholder, agree on what the output dataset should look like (column names, types, freshness expectations), and the contract enforces that agreement automatically. If someone changes the model in a way that breaks it, they'll know immediately.\n\n> 📄 [Model contracts — docs](https://docs.getdbt.com/docs/mesh/govern/model-contracts)"
  },
  {
    "path": "04-analytics-engineering/class_notes/4_5_3_dbt_packages.md",
    "content": "# DE Zoomcamp 4.5.3 — dbt Packages\n\n> 📄 Video: [dbt Packages](https://www.youtube.com/watch?v=KfhUA9Kfp8Y)  \n> 📄 Official docs: [Packages](https://docs.getdbt.com/docs/build/packages)  \n> 📄 Package Hub: [hub.getdbt.com](https://hub.getdbt.com)\n\nOne of the things that makes dbt's community so strong is packages. A dbt package is basically a self-contained dbt project — it has its own macros, tests, models, sources — but instead of using it yourself, you distribute it so other people can drop it into their own projects. Think Python libraries, but for dbt. This video covers the most useful packages out there and how to actually install and use them.\n\n---\n\n## Packages worth knowing about\n\n### dbt-utils\n\nThe big one. Maintained by dbt Labs, so it's well-kept and safe to use. It bundles a ton of common SQL utilities as macros — things like generating surrogate keys, deduplicating, pivoting, safe division, extracting URL parameters. Stuff most of us have written ourselves at some point.\n\nThe real kicker is **cross-database compatibility**. dbt-utils macros compile down to the correct SQL dialect depending on your warehouse. So the same macro works on BigQuery, DuckDB, Snowflake, etc. — no need to maintain separate versions of your code.\n\n### dbt-codegen\n\nA massive time-saver for the YAML grind. Codegen does two things:\n\n- **YAML from SQL** — point it at a model or source and it auto-generates the `schema.yml` with all the columns listed out. No more manually typing hundreds of column names.\n- **SQL from YAML** — the reverse. Give it a YAML spec and it generates a staging model SQL file following dbt conventions (single CTE for renaming, proper file naming, etc.).\n\n### dbt-project-evaluator\n\nScores your dbt project against best practices. Good for teams that want a quick sanity check on whether they're following conventions.\n\n### dbt-audit-helper\n\nHandy when you're refactoring. 
It compares an old model against a new one and validates that they produce the same results — same columns, same row counts, same values. Takes the anxiety out of rewriting existing SQL.\n\n### dbt-expectations\n\nThis is the one that makes custom tests almost unnecessary. It's a massive library of pre-built generic tests covering almost every assertion you can think of — row counts, value ranges, consistent casing, regex matching, approximate equality, and way more. In practice, if you need to test something, there's a very good chance dbt-expectations already has it.\n\n> 📄 [dbt-expectations on the Package Hub](https://hub.getdbt.com/calogica/dbt_expectations/latest/)\n\n### Warehouse-specific packages\n\nThe hub has plenty of packages tailored to specific platforms — Snowflake, BigQuery, etc. These typically come with models or macros for monitoring spend, evaluating best practices, applying constraints, or working with platform-specific features like semantic views.\n\n---\n\n## A note on trust\n\nPackages on the dbt Hub have gone through a vetting process by dbt Labs — they're generally safe to use. Packages you find floating around on GitHub that aren't on the Hub? Take a closer look at what they actually do before dropping them into your project.\n\n---\n\n## How to install a package — the demo\n\nThe video walks through installing dbt-utils and using it to generate surrogate keys. Here's the workflow:\n\n### 1. Create packages.yml\n\nAt the root of your dbt project (same level as `dbt_project.yml`), create a file called `packages.yml`. Declare the package and pin the version.\n\n```yaml\npackages:\n  - package: dbt-labs/dbt_utils\n    version: 1.1.1\n```\n\n### 2. Run `dbt deps`\n\nThis downloads and installs the package. After it runs, two things appear:\n\n- A `package-lock.yml` file — contains a hash of exactly what was installed. 
Commit this to version control so everyone on your team gets the same versions.\n- A `dbt_packages/` directory — this is where the installed package code lives. It's git-ignored by default (you don't want to commit other people's source code into your repo), but you can browse it if you're curious how the macros work.\n\n### 3. Use it\n\nOnce installed, the package's macros are immediately available. You call them with the standard Jinja syntax, prefixing with the package name.\n\n**Before (manual surrogate key):**\n```sql\nselect\n    -- Manual concatenation approach\n    concat(\n        cast(vendorid as string), '-',\n        cast(lpep_pickup_datetime as string)\n    ) as tripid,\n    vendorid,\n    -- the raw source column is lpep_pickup_datetime, so rename it here\n    lpep_pickup_datetime as pickup_datetime\nfrom {{ source('staging', 'green_tripdata') }}\n```\n\n**After (using dbt_utils.generate_surrogate_key):**\n```sql\nselect\n    -- Clean, cross-database macro\n    {{ dbt_utils.generate_surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,\n    vendorid,\n    -- the raw source column is lpep_pickup_datetime, so rename it here\n    lpep_pickup_datetime as pickup_datetime\nfrom {{ source('staging', 'green_tripdata') }}\n```\n\nThat's it. The macro handles the rest: it casts and concatenates the listed columns, hashes the result (MD5), and compiles to the right SQL syntax for whatever warehouse you're targeting.\n\n> 📄 [dbt deps command — docs](https://docs.getdbt.com/reference/commands/deps)"
  },
  {
    "path": "04-analytics-engineering/class_notes/4_6_1_dbt_commands.md",
    "content": "# DE Zoomcamp 4.6.1 — dbt Commands\n\n> 📄 Video: [dbt Commands](https://www.youtube.com/watch?v=t4OeWHW3SsA)  \n> 📄 Official docs: [dbt command reference](https://docs.getdbt.com/reference/dbt-commands)  \n> 📄 Selection syntax: [Node selection syntax](https://docs.getdbt.com/reference/node-selection/syntax)\n\nWe've been using dbt commands throughout the series without really stopping to talk about all of them. This video is the full tour — every command you'll actually use, plus the flags that make them powerful. Good one to bookmark.\n\n---\n\n## The setup commands — run these once (or when needed)\n\n### dbt init\n\nCreates your dbt project from scratch. Generates the full directory structure — `models/`, `seeds/`, `snapshots/`, `tests/`, `analysis/`, all of it. You only ever run this once, at the very start.\n\n### dbt debug\n\nChecks that your `profiles.yml` is valid and that dbt can actually connect to your warehouse. Run this whenever you're setting up a new environment or something feels off with your connection.\n\n### dbt deps\n\nInstalls packages from your `packages.yml`. We covered this in 4.5.3 — just know it lives here in the command lineup too.\n\n### dbt clean\n\nDeletes the directories listed under `clean-targets` in your `dbt_project.yml`. By default that's `target/` and `dbt_packages/`. Useful for a fresh start, but remember you'll need to run `dbt deps` again after cleaning if you deleted `dbt_packages/`. You can add other directories to `clean-targets` if you want.\n\n> 📄 [dbt clean — docs](https://docs.getdbt.com/reference/commands/clean)\n\n---\n\n## The feature-specific commands\n\nThese are tied to specific dbt features rather than being general-purpose.\n\n### dbt seed\n\nLoads all the CSVs in your `seeds/` directory into the warehouse. Quick and simple — great for reference data or small lookup tables.\n\n### dbt snapshot\n\nRuns any snapshots you've defined in your project. 
Snapshots are dbt's way of tracking how source data changes over time (think SCD Type 2). Not something you use every day, but it's there when you need it.\n\n### dbt source freshness\n\nChecks whether your source data is stale. If you've defined `freshness` blocks in your source YAML (we covered this in 4.5.2), this is the command that actually runs the check.\n\n### dbt docs generate / dbt docs serve\n\n`dbt docs generate` compiles your YAML documentation, model code, and warehouse metadata into a `catalog.json` artifact in `target/`. `dbt docs serve` spins up a local website (localhost:8080) so you can browse it. On dbt Cloud, `docs serve` isn't needed — it's handled automatically. For dbt Core users, finding a scalable way to host that docs site is something you'll need to sort out yourself.\n\n> 📄 [dbt docs commands — docs](https://docs.getdbt.com/reference/commands/cmd-docs)\n\n---\n\n## The big four — these are your daily drivers\n\n### dbt compile\n\nLooks like it's doing nothing, but it's actually super useful. Takes all your models — with their Jinja, `ref()`, `source()` calls and everything — and outputs the fully resolved SQL into `target/compiled/`. No data moves, nothing hits the warehouse. It's just pure SQL sitting there for you to inspect.\n\nWhy bother? Two reasons. First, it's the fastest way to catch Jinja errors — way quicker than waiting for a full `dbt run`. Second, it's completely free — no compute, no warehouse cost. Good habit to run after making changes.\n\n> 📄 [dbt compile — docs](https://docs.getdbt.com/reference/commands/compile)\n\n### dbt run\n\nMaterializes every model in your project. Views become views, tables become tables, incremental models get incremental logic applied — whatever you configured. 
Models run in dependency order, so dbt figures out the sequence for you.\n\nThis is your go-to during active development when you just want to see your models built.\n\n> 📄 [dbt run — docs](https://docs.getdbt.com/reference/commands/run)\n\n### dbt test\n\nRuns all the tests in your project — generic tests, singular tests, unit tests, all of it. Reports pass/fail at the end. Nothing gets built here, it just validates what's already in the warehouse.\n\n> 📄 [dbt test — docs](https://docs.getdbt.com/reference/commands/test)\n\n### dbt build ⭐\n\nThe most important command. It's a smart combination of `dbt run` + `dbt test` + `dbt seed` + `dbt snapshot`, all in one. But it's not just running them sequentially — it's DAG-aware. It knows the right order, and if something fails along the way, it skips everything downstream of that failure rather than wasting compute on models that are going to break anyway.\n\nThis is what you want for CI, production runs, or any time you need confidence that your whole project is solid.\n\n> 📄 [dbt build — docs](https://docs.getdbt.com/reference/commands/build)\n\n### dbt retry\n\nIf a `dbt build` or `dbt run` fails partway through, don't just re-run the whole thing from scratch. `dbt retry` re-executes from the point of failure by reading the `run_results.json` file from the previous run. It automatically identifies which nodes failed and re-runs those nodes plus everything downstream of them.\n\nHow it works:\n- dbt looks at `target/run_results.json` from the last command\n- It identifies failed nodes and skipped nodes (anything downstream of a failure)\n- It re-runs only those nodes, reusing the same selection criteria from the original command\n- If the previous command completed successfully, `dbt retry` finishes as a no-op\n\nSaves a lot of time on big projects, especially when a single model fails deep in the DAG.\n\n---\n\n## Flags — the important ones\n\n### --help / -h\n\nWorks on any command. 
`dbt --help` gives you the full list, `dbt run --help` gives you flags specific to `run`. Standard stuff, but worth knowing it's there.\n\n### --version / -V\n\nTells you which version of dbt you have installed. Also lets you know if there's an update available.\n\n### --full-refresh / -f\n\nUsed with `dbt run` or `dbt build`. When you have an incremental model, it normally just appends new rows. `--full-refresh` drops the whole thing and rebuilds from scratch. Handy when historical data has changed, you've got duplicates, or you just want to make sure everything is clean. Most teams do this on a regular schedule — maybe once a month — just to keep things tidy.\n\n```bash\ndbt run --full-refresh\n```\n\n### --fail-fast / -x\n\nStops the run at the first failure. Normally dbt keeps going after a model or test fails and only skips the nodes downstream of it. With `--fail-fast`, dbt exits as soon as any node fails. Good for CI or any time you want quick feedback instead of waiting for the rest of the run. Treating warnings as errors is a separate flag, `--warn-error`.\n\n### --target / -t\n\nControls which profile target dbt runs against. By default dbt uses the `target:` set in your `profiles.yml`, which is usually `dev`. But you can override it:\n\n```bash\ndbt run --target prod\n```\n\nWorks with `dbt run`, `dbt build`, `dbt test`, `dbt snapshot` — basically any command that touches the warehouse. Best practice: developers work in `dev`, production runs use `--target prod`.\n\n### --select / -s\n\nThis is the big one. Lets you run only specific parts of your project instead of everything. There are a few ways to use it:\n\n**By model name** — just give it the model name (no `.sql` needed):\n\n```bash\ndbt run --select stg_green_tripdata\n```\n\n**By directory path** — everything in a folder:\n\n```bash\ndbt run --select models/staging\n```\n\n**By tag:**\n\n```bash\ndbt run --select tag:nightly\n```\n\n**With graph operators (the + sign)** — this is where it gets really useful. 
The `+` lets you pull in upstream or downstream dependencies:\n\n```bash\n# Run stg_green_tripdata and all upstream dependencies\ndbt run --select +stg_green_tripdata\n\n# Run fct_trips and all downstream dependencies\ndbt run --select fct_trips+\n\n# Run dim_zones plus everything upstream AND downstream\ndbt run --select +dim_zones+\n```\n\n- `+my_model` — builds `my_model` and everything upstream of it (all its ancestors)\n- `my_model+` — builds `my_model` and everything downstream of it (all its descendants)\n- `+my_model+` — both directions. Everything upstream, the model itself, and everything downstream\n\n> 📄 [Graph operators — docs](https://docs.getdbt.com/reference/node-selection/graph-operators)\n\n**With state selectors** — instead of guessing what changed, let dbt figure it out:\n\n```bash\ndbt build --select state:modified+ --state ./prod-artifacts\n```\n\n- `state:new` — only files you just created\n- `state:modified` — anything that's changed since the last run\n- Add `+` after to include downstream dependencies of modified models\n\nHow state comparison works:\n- You need artifacts from a **previous run** stored somewhere persistent (not the same `target/` directory you're currently writing to)\n- On **dbt Cloud**, this is handled automatically — production artifacts are stored and accessible for comparison\n- On **dbt Core**, you need to manually store artifacts (especially `manifest.json`) somewhere — a cloud bucket, a separate directory, version control, etc.\n- Point `--state` to where those previous artifacts live\n- dbt compares your current code against those artifacts to determine what's new or modified\n\nThe key is that you're comparing against a *different environment's artifacts* (usually production) or a *previous point in time* — not against the directory you're currently building into. 
This lets you run only what's changed since your last production deployment, which is incredibly useful for CI/CD workflows.\n\nStoring those JSON artifacts persistently is also just good practice in general — you can use them to analyze how your project evolves over time.\n\n> 📄 [Node selection syntax — docs](https://docs.getdbt.com/reference/node-selection/syntax)"
  },
  {
    "path": "04-analytics-engineering/refreshers/SQL.md",
    "content": "# SQL Refresher\n\n### Table of contents\n\n\n- [Window Functions](#window-funtions)\n    - [Row Number](#row-number)\n    - [Rank and Dense Rank](#rank-and-dense-rank)    \n    - [Lag and Lead](#lag-and-lead)   \n    - [Percentile Cont](#percentile-cont)         \n- [Common Table Expression](#common-table-expression)\n- [dbt models and CTEs](#dbt-models-and-ctes)\n\n\n\n## Window Functions    \n\nA window function performs a calculation across a set of table rows that are related to the current row within a specific \"window\" or subset of data. This is comparable to the type of calculation that can be done with an aggregate function  (such as SUM(), AVG(), COUNT(), etc.).\n\nBut unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.\n\n\n**Syntax:**\n\n```sql\nFUNCTION() OVER (PARTITION BY column_name ORDER BY column_name)\n```\n\nA window function always has two components. 
This second part here defines your window:\n\n```sql\nOVER (PARTITION BY column_name ORDER BY column_name)\n```\n\nYour window here is how you want to be viewing your data when you're applying your function\n\n- PARTITION BY: divides the result set into groups (optional).\n\n- ORDER BY: defines the order of processing rows within the partition.\n\n\n**Common Window Functions:**\n\nRanking Functions:\n\n- ROW_NUMBER(): Assigns a unique row number within a partition.\n- RANK(): Similar to ROW_NUMBER(), but assigns the same rank to duplicate values, skipping numbers.\n- DENSE_RANK(): Like RANK(), but without gaps in numbering.\n\nAggregate Functions as Window Functions:\n\n- SUM() OVER(): Computes a running total.\n- AVG() OVER(): Computes a moving average.\n\nLag and Lead Functions:\n\n- LAG(): Retrieves the value from a previous row.\n- LEAD(): Retrieves the value from the next row.\n\n### Row Number\n\nROW_NUMBER() does just what it sounds like—displays the number of a given row. It starts at 1 and numbers the rows according to the ORDER BY part of the window statement. 
Using the PARTITION BY clause restarts the count at 1 in each partition.\n\n**Syntax:**\n\n```sql\nROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name)\n```\n\n**Common Uses:**\n\n- Removing Duplicates: You can use ROW_NUMBER() to identify duplicate rows and keep only one by filtering out rows with a row number greater than 1.\n\n- Ranking Data: Used when ranking rows based on specific criteria but requiring unique row numbers.\n\n- Selecting the Latest Record: Helps in selecting the most recent entry per category when combined with PARTITION BY.\n\n**Example 1:**\n\n```sql\n\nSELECT \n  total_amount,\n  ROW_NUMBER() OVER (ORDER BY total_amount DESC) AS ranking\n\nFROM `greentaxi_trips` \n-- Without this ORDER BY, LIMIT may return any 10 rows, not the top 10\nORDER BY ranking\nLIMIT 10;\n\n```\n\nThe query returns the 10 highest total_amount values from the table, along with a row number indicating their ranking.\n\n\n| total_amount | ranking |\n|--------|--------|\n| 4012.3 | 1      |\n| 2878.3 | 2      |\n| 2438.8 | 3      |\n| 2156.3 | 4      |\n| 2109.8 | 5      |\n| 2017.3 | 6      |\n| 1971.05| 7      |\n| 1958.8 | 8      |\n| 1762.8 | 9      |\n| 1600.8 | 10     |\n\nThe column generated with ROW_NUMBER() is temporary and does not modify the original table. 
It is just a calculation applied to the data in the query result.\n\n**Example 2:**\n\nLet's modify the previous query to add a partition by pickup location ID:\n\n```sql\n\nSELECT \n\n  total_amount,\n  PULocationID,\n  ROW_NUMBER() OVER (PARTITION BY PULocationID ORDER BY total_amount DESC) AS ranking\n\nFROM `greentaxi_trips` \nLIMIT 10;\n\n```\n\nThis SQL query assigns a ranking to each row based on total_amount in descending order within each \nPULocationID group:\n\n| total_amount | PULocationID | ranking |\n|-----------|-----------|-----------|\n| 8.51      | 224       | 432       |\n| 8.3       | 224       | 433       |\n| 8.3       | 224       | 434       |\n| 7.3       | 224       | 435       |\n| 3.3       | 224       | 436       |\n| 86.42     | 234       | 1         |\n| 73.5      | 234       | 2         |\n| 62.7      | 234       | 3         |\n| 61.94     | 234       | 4         |\n| 61.94     | 234       | 5         |\n\nThe count restarts at 1 when PULocationID changes from 224 to 234, because each partition is numbered separately. Tied values (8.3, 61.94) still get distinct row numbers.\n\n### Rank and Dense Rank\n\nROW_NUMBER(), RANK(), and DENSE_RANK() are window functions used to assign a ranking to rows based on a specified order. However, they behave differently when there are duplicate values in the ranking column.\n\nRANK() assigns a ranking, but skips numbers if there are ties. DENSE_RANK() is similar to RANK(), but does not skip numbers when there are ties.\n\nFor example:\n\n| Score | ROW_NUMBER() | RANK() | DENSE_RANK() |\n|-------|--------------|--------|--------------|\n| 95    | 1            | 1      | 1            |\n| 90    | 2            | 2      | 2            |\n| 90    | 3            | 2      | 2            |\n| 85    | 4            | 4      | 3            |\n\n\n### Lag and Lead\n\nIt can often be useful to compare rows to preceding or following rows. You can use LAG or LEAD to create columns that pull values from other rows without the need for a self-join. 
All you need to do is enter which column to pull from and how many rows away you'd like to do the pull. LAG pulls from previous rows and LEAD pulls from following rows\n\n\n**Syntax:**\n\n```sql\n\nLAG(expression) OVER (PARTITION BY partition_expression ORDER BY order_expression)\n```\n\n- expression: The column whose value you want to retrieve from the previous row\n- offset (optional): The number of rows back from the current row to look. The default is 1, meaning it looks at the immediate previous row.\n- PARTITION BY (optional): Divides the result set into partitions to apply the function to each partition separately.\n- ORDER BY: Specifies the order in which the rows are processed.\n\n**Example:**\n\n```sql\n\nSELECT \n\nlpep_pickup_datetime,\ntotal_amount,\nLAG(total_amount) OVER (ORDER BY lpep_pickup_datetime) as prev_total_amount,\nLEAD(total_amount) OVER (ORDER BY lpep_pickup_datetime) as next_total_amount\n\nFROM `greentaxi_trips` \nORDER BY lpep_pickup_datetime\n\n```\n\nThe query retrieves the lpep_pickup_datetime, total_amount, the previous trip's total_amount, and the next trip's total_amount.\n\n| lpep_pickup_datetime      | total_amount | prev_total_amount | next_total_amount |\n|---------------------------|--------------|-------------------|-------------------|\n| 2008-12-31 23:33:38 UTC   | 7.3          | 6.3               | 5.3               |\n| 2008-12-31 23:42:31 UTC   | 5.3          | 7.3               | 14.55             |\n| 2008-12-31 23:47:51 UTC   | 14.55        | 5.3               | 19.55             |\n| 2008-12-31 23:57:46 UTC   | 19.55        | 14.55             | 9.8               |\n| 2009-01-01 00:00:00 UTC   | 9.8          | 19.55             | 81.3              |\n\n\n### Percentile Cont\n\nComputes the specified percentile value for the value_expression, with linear interpolation.\n\n**Syntax:**\n\n```sql\n\nPERCENTILE_CONT(value_expression, percentile ) OVER (PARTITION BY partition_expression)\n```\n\n**Example:**\n\nLet's 
calculate the 90th percentile of total_amount for each unique pickup location (PULocationID):\n\n```sql\n\nSELECT \n  PULocationID,\n  total_amount,\n  PERCENTILE_CONT(total_amount, 0.9) OVER (PARTITION BY PULocationID) AS p90\n\nFROM `greentaxi_trips` \n\n```\n\n- PERCENTILE_CONT(total_amount, 0.9): Calculates the 90th percentile (p90) of total_amount.\n- PARTITION BY PULocationID: Groups the calculations by PULocationID, so the 90th percentile is computed separately for each location.\n\n\nThe query result looks like this:\n\n| PULocationID | total_amount | p90  |\n|------|-------|-------|\n| 224  | 17.3  | 51.9  |\n| 224  | 20.67 | 51.9  |\n| 224  | 21    | 51.9  |\n| 224  | 26.06 | 51.9  |\n| 224  | 27.13 | 51.9  |\n| 224  | 40.14 | 51.9  |\n| 224  | 55.46 | 51.9  |\n| 224  | 25.74 | 51.9  |\n| 224  | 27.02 | 51.9  |\n| 224  | 37    | 51.9  |\n\n\nThe P90 value is essentially the amount below which 90% of the values fall. In this table, the P90 \nis constant at 51.9, which means that for location \"224\", 90% of the total amounts are below 51.9.\n\n\n## Common Table Expression\n\nA CTE, short for Common Table Expression, is like a query within a query. With the WITH clause, you can define temporary named result sets, making complex queries more readable and maintainable. These temporary result sets exist only for the duration of the main query.\n\nCTEs and subqueries are both powerful tools and can be used to achieve similar goals, but they have different use cases and advantages. The main differences are that a CTE can be referenced several times within the same query and is usually easier to read.\n\nBy declaring CTEs at the beginning of the query, you enhance code readability, enabling a clearer grasp of your analysis logic. 
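\n\nAs a minimal sketch of that readability benefit, assuming the same `greentaxi_trips` table used in the examples above, several CTEs can be declared in one WITH clause, and a later CTE can build on an earlier one. The CTE names and the threshold of 1000 trips are illustrative assumptions, not part of the course material:\n\n```sql\n\nWITH trips_by_location AS (\n\n  -- Hypothetical first step: aggregate trips per pickup location\n  SELECT\n    PULocationID,\n    COUNT(*) AS trip_count,\n    AVG(total_amount) AS avg_total_amount\n\n  FROM `greentaxi_trips`\n  GROUP BY PULocationID\n\n),\n\nbusy_locations AS (\n\n  -- Hypothetical second step: reference the first CTE by name (1000 is an arbitrary threshold)\n  SELECT *\n  FROM trips_by_location\n  WHERE trip_count >= 1000\n\n)\n\nSELECT PULocationID, trip_count, avg_total_amount\nFROM busy_locations\nORDER BY avg_total_amount DESC;\n\n```\n\nEach step has a name that describes what it produces, so the final SELECT reads as a summary of the logic instead of a stack of nested subqueries.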
\n\n**Syntax:**\n\n```sql\n\nWITH cte_name AS (\n    SELECT column1, column2\n    FROM some_table\n    WHERE condition\n)\nSELECT * FROM cte_name;\n```\n\n**Example: Let's find the trip with the second largest total_amount**\n\n```sql\n\nWITH cte AS(\n\n  SELECT\n  lpep_pickup_datetime,\n  total_amount,\n  RANK() OVER (ORDER BY total_amount DESC) AS rank\n\n  FROM `greentaxi_trips` \n\n)\n\n\nSELECT * FROM cte WHERE rank = 2;\n\n```\n\nThe query starts with a Common Table Expression (CTE) named cte. We use the RANK() window function to \nassign a ranking (rank) to each row based on total_amount in descending order (from highest to lowest).\n\nNow, we use the CTE in the main query: ```SELECT * FROM cte WHERE rank = 2;```\n\nResult of the query:\n\n\n| lpep_pickup_datetime      | total_amount | rank | \n|---------------------------|--------------|-------------------|\n| 2019-10-10 15:22:49 UTC  | 2878.3        | 2             | \n\n## dbt models and CTEs\n\nCTEs and window functions will be used a lot in module 4 on dbt. Let's see an example of application in dbt models\n\n**Example:**\n\nSuppose we start from the FHV dataset and we want to create a dbt model that enriches the data by calculating the trip duration and the 90th percentile.\n\n```sql\n\nWITH trip_duration_calculated AS (\n\n    SELECT\n        *,\n        timestamp_diff(dropOff_datetime, pickup_datetime, second) as trip_duration\n\n    FROM `fhv_trips`\n)\n\nSELECT \n\n    PUlocationID,\n    trip_duration,\n    PERCENTILE_CONT(trip_duration, 0.90) OVER (PARTITION BY PUlocationID) AS trip_duration_p90\n\n\nFROM trip_duration_calculated\n\n\n```\n\n**Step 1: Understanding the CTE**\n\nThe WITH clause creates a CTE named trip_duration_calculated. This CTE acts as a temporary table that \ncontains all columns from the fhv_trips table. 
Additionally, it calculates the trip duration in seconds for each ride.\n\n**Step 2: Main Query using the CTE and Window Function**\n\nThis query computes the 90th percentile of trip duration for each PUlocationID using a window function.\n\nThe PARTITION BY PUlocationID clause ensures that the percentile calculation is performed separately \nfor each unique PUlocationID.\n\nThe 90th percentile means that 90% of the trips have a duration equal to or below this value.\n\n**The query result looks like this:**\n\n| PUlocationID | trip_duration | trip_duration_p90 |\n|-------------|---------------|--------------------|\n| 190         | 451           | 2170.0            |\n| 190         | 1373          | 2170.0            |\n| 190         | 817           | 2170.0            |\n| 190         | 589           | 2170.0            |\n| 190         | 1648          | 2170.0            |\n| 32          | 546           | 1988.0            |\n| 32          | 151           | 1988.0            |\n| 32          | 1752          | 1988.0            |\n| 32          | 2426          | 1988.0            |\n| 32          | 888           | 1988.0            |\n\n\n- For PUlocationID = 190, 90% of trips have a duration ≤ 2170.0 seconds.\n- For PUlocationID = 32, 90% of trips have a duration ≤ 1988.0 seconds.\n"
  },
  {
    "path": "04-analytics-engineering/setup/cloud_setup.md",
    "content": "# Cloud Setup Guide\n\nThis guide walks you through setting up dbt to work with the BigQuery data warehouse you created in Module 3.\n\n<div align=\"center\">\n\n[![dbt](https://img.shields.io/badge/dbt-FF694B?style=for-the-badge&logo=dbt&logoColor=white)](https://www.getdbt.com/)\n[![BigQuery](https://img.shields.io/badge/BigQuery-4285F4?style=for-the-badge&logo=google-cloud&logoColor=white)](https://cloud.google.com/bigquery)\n\n</div>\n\n> [!NOTE]\n> This guide assumes you've completed [Module 3: Data Warehouse](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/03-data-warehouse) where you:\n> - Created a GCP project and enabled the BigQuery API\n> - Created a service account with BigQuery permissions\n> - Learned how to load data into BigQuery (in the `nytaxi` dataset)\n>\n> Module 4 uses **different data** than Module 3 (green and yellow taxi data for 2019-2020 instead of yellow-only 2024). You'll load the new data in [Step 1](#load-the-taxi-data) below.\n\n## Step 1: Verify Your BigQuery Setup\n\nBefore setting up dbt Cloud, confirm you have the required data and credentials from Module 3.\n\n### Check Your Service Account\n\nYou should already have a service account JSON key file from Module 3. Make sure it has these permissions:\n\n- **BigQuery Data Editor**\n- **BigQuery Job User**\n- **BigQuery User**\n\nIf you need to create a new service account or download a new key, follow the instructions below.\n\n### How to Download Service Account JSON Key\n\nIf you don't have the JSON key file or need to download a new one:\n\n1. Go to [Google Cloud Console](https://console.cloud.google.com/)\n\n2. Navigate to **IAM & Admin** > **Service Accounts**\n   - Or use the search bar and type \"Service Accounts\"\n\n3. 
Find your service account in the list\n   - It should look like: `service-account-name@project-id.iam.gserviceaccount.com`\n   - If you don't have a service account yet, click **+ CREATE SERVICE ACCOUNT** and:\n     - Enter a name (e.g., `dbt-bigquery-service-account`)\n     - Click **CREATE AND CONTINUE**\n     - Add these roles:\n       - **BigQuery Admin** (or at minimum: BigQuery Data Editor, BigQuery Job User, BigQuery User)\n     - Click **CONTINUE** > **DONE**\n\n4. Click on your service account name to open its details\n\n5. Go to the **KEYS** tab\n\n6. Click **ADD KEY** > **Create new key**\n\n7. Select **JSON** as the key type\n\n8. Click **CREATE**\n\n9. The JSON key file will automatically download to your computer\n   - Save it in a secure location\n   - **Never commit this file to Git or share it publicly** - it contains credentials to access your GCP resources\n\nThe downloaded JSON file will look something like this:\n\n```json\n{\n  \"type\": \"service_account\",\n  \"project_id\": \"your-project-id\",\n  \"private_key_id\": \"...\",\n  \"private_key\": \"-----BEGIN PRIVATE KEY-----\\n...\\n-----END PRIVATE KEY-----\\n\",\n  \"client_email\": \"service-account-name@project-id.iam.gserviceaccount.com\",\n  ...\n}\n```\n\nYou'll use this JSON file in Step 4 to connect dbt Cloud to BigQuery.\n\n### Load the Taxi Data\n\nThis module uses **yellow and green taxi data for 2019-2020**, which is different from the data you loaded in Module 3. Using the same approach you learned in Module 3, load the following data into your BigQuery `nytaxi` dataset:\n\n- **Yellow taxi trip records** for all months of 2019 and 2020\n- **Green taxi trip records** for all months of 2019 and 2020\n\n> [!IMPORTANT]\n> Download the data from the [DataTalksClub NYC TLC Data repository](https://github.com/DataTalksClub/nyc-tlc-data/releases), **not** from the official NYC TLC website. 
The official site has been retroactively updated over the years, so its data differs from what the homework answers are based on.\n\nAfter loading, verify your data:\n\n1. Go to [BigQuery Console](https://console.cloud.google.com/bigquery)\n2. In the Explorer panel on the left, expand your project\n3. You should see the `nytaxi` dataset\n4. Expand the `nytaxi` dataset - you should see tables:\n   - `green_tripdata`\n   - `yellow_tripdata`\n\n### Note Your Dataset Location\n\nWhen you created your BigQuery datasets in Module 3, you chose a location (e.g., `US`, `EU`, `us-central1`). You'll need to use the same location when configuring dbt.\n\n**To check your dataset location:**\n1. In BigQuery Console, click on the `nytaxi` dataset\n2. Look for **Data location** in the dataset details\n\n## Step 2: Sign Up for dbt Platform\n\ndbt Platform is dbt's cloud-based development environment with a web IDE, scheduler, and collaboration features. dbt offers a **free Developer plan**. This should be more than enough to learn dbt and follow the course.\n\n## Step 3: Create a New dbt Project\n\nNow you'll create a fresh dbt project from scratch in dbt Cloud.\n\n1. Navigate to **Account settings** (gear icon in the top-right corner) and click **+ New Project**\n\n2. Enter a project name:\n   - Project name: `taxi_rides_ny`\n\n3. Click **Continue**\n\n## Step 4: Configure BigQuery Connection\n\nAfter clicking **Continue** in the previous step, dbt Cloud will prompt you to configure your data warehouse connection.\n\n> [!TIP]\n> If you're not automatically taken to the connection setup, you can also configure it from **Account settings** > **Projects** > **taxi_rides_ny** > **Connection**.\n\n### Upload Service Account JSON\n\n1. For the connection type, select **BigQuery**\n\n2. Click **Upload a Service Account JSON file**\n\n3. Select the service account JSON key file from Module 3\n\n4. 
dbt will automatically extract:\n   - Your GCP project ID\n   - Authentication credentials\n\n### Configure Connection Settings\n\n1. **Dataset**: Enter `dbt_prod`\n   - This is the base schema name where dbt will create datasets\n   - dbt will organize your models into schemas like:\n     - `dbt_prod_staging` - for staging models\n     - `dbt_prod_intermediate` - for intermediate models\n     - `dbt_prod_marts` - for final analytics tables\n\n2. **Location**: Select the same location as your `nytaxi` dataset from Module 3\n   - Example: `US`, `EU`, or `us-central1`\n   - **This must match your nytaxi dataset location**\n   - You can find this under **Optional Settings** or **Advanced Settings** depending on your UI version\n\n3. **Timeout**: `300` seconds\n\n4. **Maximum Bytes Billed**: (optional)\n   - Leave blank for unlimited, OR\n   - Set a limit like `1000000000` (1 GB) to prevent runaway queries\n\n### Test the Connection\n\n1. Click **Test Connection**\n\n2. You should see a success message: \"Connection test succeeded\"\n\n3. Click **Continue**\n\n## Step 5: Set Up Your Repository\n\ndbt Cloud needs a Git repository to store your project code. 
You have two options:\n\n- Let dbt Manage the Repository (Recommended for Beginners)\n- Connect Your Own GitHub Repository (Recommended for Production)\n\nIt doesn't matter which one you prefer for this course.\n\n## Step 6: Verify Your Development Environment\n\n### What Are Environments in dbt?\n\nIn dbt, **environments** define different contexts where your data transformations run:\n\n- **Development Environment**: Your personal workspace for building and testing models\n  - Uses your personal credentials\n  - Creates temporary schemas with your name (e.g., `dbt_<your_name>`)\n  - Changes only affect your work, not production\n  - Used when working in the dbt Cloud IDE\n\n- **Deployment Environment**: The production workspace where final models run on schedule\n  - Uses service account credentials\n  - Creates production schemas (e.g., `dbt_prod_staging`, `dbt_prod_marts`)\n  - Used by scheduled jobs that keep your data warehouse updated\n\nThink of it like having a draft folder (development) and a published folder (deployment) for your analytics code.\n\n### Check Your Development Environment\n\ndbt Cloud **automatically creates a development environment** when you set up a project. You don't need to create one manually.\n\nTo verify it was created:\n\n1. Navigate to **Deploy** > **Environments** in the top navigation bar\n2. You should see a **Development** environment already listed\n\n### Customize Your Development Credentials (Optional)\n\nIf you need to change how dbt connects to BigQuery during development, or adjust your development schema:\n\n1. Click your profile icon (bottom-left corner) > **Your Profile** > **Credentials**\n2. Select the credential linked to your project\n3. 
From here you can update:\n   - **Development Schema**: Where your personal development models will be created\n     - dbt automatically suggests: `dbt_<your_name>` (e.g., `dbt_john_smith`)\n     - This schema is separate from production (`dbt_prod`)\n   - **Target Name**: Leave as `dev` (default)\n\n## Step 7: Start Developing\n\nOnce your project, connection, and repository are configured, you're ready to start building dbt models.\n\n1. Click **Start developing in the Studio IDE**\n   - If you don't see this option, navigate to **Develop** in the top navigation bar\n\n2. dbt Cloud will initialize your workspace (this may take a minute)\n\n3. Once the IDE loads, you'll have a fresh project ready for development!\n\n## Additional Resources\n\n* [BigQuery Documentation](https://cloud.google.com/bigquery/docs)\n* [dbt Documentation](https://docs.getdbt.com/docs/cloud/about-cloud/dbt-cloud-features)\n* [BigQuery Best Practices](https://cloud.google.com/bigquery/docs/best-practices)\n* [NYC Taxi Data Dictionary](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)\n"
  },
  {
    "path": "04-analytics-engineering/setup/duckdb_troubleshooting.md",
    "content": "# Troubleshooting DuckDB Out of Memory Errors\n\nIf you're getting `Out of Memory` errors while running dbt build commands, don't panic. This is a common issue, especially on machines with limited RAM. This guide explains why it happens and what you can do about it.\n\n## Why does this happen?\n\nDuckDB is an **in-process database**, which means it runs inside your computer's memory (RAM) rather than on a remote server. The NYC taxi dataset we use in this project contains **tens of millions of rows** across 24 months of yellow and green taxi data. When dbt builds models, DuckDB needs to load, transform, and write this data (all using your local RAM).\n\nSome operations are more memory-intensive than others:\n\n| Operation | Why it's expensive | Where it happens |\n|---|---|---|\n| `QUALIFY` with window functions | Requires sorting and partitioning the entire dataset in memory | `int_trips.sql` (deduplication) |\n| `UNION ALL` on large tables | Combines two large datasets into one | `int_trips_unioned.sql` |\n| Surrogate key generation (`generate_surrogate_key`) | Computes hashes across the full dataset | `int_trips.sql` |\n| `JOIN` on large fact tables | Expands memory footprint when enriching trips with zones | `fct_trips.sql` |\n\n## Check your available RAM\n\nBefore troubleshooting, know what you're working with. You can generally find this in your settings menu.\n\nAs a rule of thumb:\n\n- **4 GB RAM**: You will very likely hit OOM. Consider using GitHub Codespaces or the Cloud Setup instead.\n- **8 GB RAM**: You might hit OOM on some models. Adjust memory settings or use GitHub Codespaces.\n- **16+ GB RAM**: You should be fine with default settings.\n\n## Option A: Use GitHub Codespaces or Cloud Setup\n\nIf your local machine doesn't have enough RAM, the easiest solution is to avoid running DuckDB locally altogether.\n\n### GitHub Codespaces\n\nRun the project in a **GitHub Codespace**. 
The free tier includes machines with **4 cores / 8 GB RAM**, and **8 cores / 16 GB RAM** is available within the free monthly quota for personal accounts. A 16 GB machine can comfortably run this entire project without any of the workarounds below.\n\nTo get started:\n\n1. Go to the [course repository on GitHub](https://github.com/DataTalksClub/data-engineering-zoomcamp).\n2. Click **Code** > **Codespaces** > **Create codespace on main**.\n3. Select the **8-core** machine type for the best experience.\n\nCodespaces come with Python, pip, and git pre-installed, so setup is minimal.\n\n### Cloud Setup (BigQuery)\n\nAlternatively, use the **Cloud Setup (BigQuery)** path. BigQuery runs on Google's servers, so your local RAM doesn't matter. See the [Cloud Setup Guide](cloud_setup.md).\n\n## Option B: Make it work on your local machine\n\nIf you prefer to run the project locally, follow the steps below to reduce memory usage.\n\n### Step 1: Adjust DuckDB memory settings in `profiles.yml`\n\nYour `~/.dbt/profiles.yml` controls how much memory DuckDB can use. Here's what you can tune:\n\n- **`memory_limit`**: By default, DuckDB will try to use up to 80% of your system's RAM. That sounds reasonable, but your operating system, browser, IDE, and other apps also need memory. If DuckDB claims too much, the OS may kill the process — that's your OOM error. Setting an explicit limit (roughly **50% of your total RAM**) leaves enough room for everything else. So if you have 8 GB, try `'4GB'`.\n- **`threads`**: This controls how many **dbt models** are built in parallel. Lowering `threads` to `1` means fewer concurrent models, which reduces overall memory pressure.\n- **`preserve_insertion_order: false`**: Tells DuckDB it doesn't need to maintain row order, which saves memory.\n\n### Step 2: Use `dbt retry` after a failure\n\nIf your `dbt build` fails partway through, you **don't need to rebuild everything from scratch**. 
Use:\n\n```bash\ndbt retry\n```\n\nThis command picks up where the last run left off, only running the models that failed or were skipped. This is very useful when an OOM error kills a single model — fix the issue, then retry without re-running the models that already succeeded.\n\n### Step 3: Build models selectively with `--select`\n\nInstead of building the entire project at once, build one model at a time to reduce peak memory usage:\n\n```bash\ndbt build --select stg_yellow_tripdata --target prod\ndbt build --select stg_green_tripdata --target prod\ndbt build --select int_trips_unioned --target prod\ndbt build --select int_trips --target prod\ndbt build --select fct_trips --target prod\n```\n\nThis way, DuckDB only needs to handle one model at a time.\n\n### Step 4: Leverage incremental models\n\nThe `fct_trips` model in this project is already configured as **incremental**. This means that after the first full build, subsequent runs only process **new records** instead of reprocessing the entire dataset.\n\nIf your first full build fails due to OOM but some models succeeded, use `dbt retry` (Step 2). Once `fct_trips` is built for the first time, future runs will be much lighter on memory.\n\n## DuckDB performance best practices\n\nThese tips come from [DuckDB's official performance guide](https://duckdb.org/docs/guides/performance/environment.html):\n\n1. **Close other applications**: Browsers, IDEs, and other apps compete for RAM. Close what you don't need before running `dbt build`.\n2. **Use an SSD**: DuckDB spills to disk when it runs out of memory. An SSD makes this spill-to-disk process much faster than an HDD.\n3. **Avoid running inside Docker** (if possible): Docker containers have memory limits that may be lower than your system's total RAM. 
If you must use Docker, increase the container's memory limit.\n\n## Still stuck?\n\nIf you've tried everything above and still can't build the project, ask for help in the [course Slack channel](https://datatalks-club.slack.com/). Include your RAM, OS, and the exact error message.\n"
  },
  {
    "path": "04-analytics-engineering/setup/local_setup.md",
    "content": "# Local Setup Guide\n\nThis guide walks you through setting up a local analytics engineering environment using DuckDB and dbt.\n\n<div align=\"center\">\n\n[![dbt Core](https://img.shields.io/badge/dbt-FF694B?style=for-the-badge&logo=dbt&logoColor=white)](https://www.getdbt.com/)\n[![DuckDB](https://img.shields.io/badge/DuckDB-FFF000?style=for-the-badge&logo=duckdb&logoColor=black)](https://duckdb.org/)\n\n</div>\n\n>[!NOTE]\n>*This guide will explain how to do the setup manually. If you want an additional challenge, try to run this setup using Docker Compose or a Python virtual environment.*\n\n**Important**: All dbt commands must be run from inside the `taxi_rides_ny/` directory. The setup steps below will guide you through:\n\n1. Installing the necessary tools\n2. Configuring your connection to DuckDB\n3. Loading the NYC taxi data\n4. Verifying everything works\n\n## Step 1: Install DuckDB\n\nDuckDB is a fast, in-process SQL database that works great for local analytics workloads. To install DuckDB, follow the instruction on the [official site](https://duckdb.org/docs/installation) for your specific operating system.\n\n> [!TIP]\n> *You can install DuckDB in two ways. You can install the CLI or install the client API for your favorite programming language (in the case of Python, you can use `pip install duckdb`). I personally prefer installing the CLI, but either way is fine.*\n\n## Step 2: Install dbt\n\n```bash\npip install dbt-duckdb\n```\n\nThis installs:\n\n* `dbt-core`: The core dbt framework\n* `dbt-duckdb`: The DuckDB adapter for dbt\n\n## Step 3: Configure dbt Profile\n\nSince this repository already contains a dbt project (`taxi_rides_ny/`), you don't need to run `dbt init`. Instead, you need to configure your dbt profile to connect to DuckDB.\n\n### Create or Update `~/.dbt/profiles.yml`\n\nThe dbt profile tells dbt how to connect to your database. 
Create or update the file `~/.dbt/profiles.yml` with the following content:\n\n```yaml\ntaxi_rides_ny:\n  target: dev\n  outputs:\n    # DuckDB Development profile\n    dev:\n      type: duckdb\n      path: taxi_rides_ny.duckdb\n      schema: dev\n      threads: 1\n      extensions:\n        - parquet\n      settings:\n        memory_limit: '2GB'\n        preserve_insertion_order: false\n\n    # DuckDB Production profile\n    prod:\n      type: duckdb\n      path: taxi_rides_ny.duckdb\n      schema: prod\n      threads: 1\n      extensions:\n        - parquet\n      settings:\n        memory_limit: '2GB'\n        preserve_insertion_order: false\n\n# Troubleshooting:\n# - If you have less than 4GB RAM, try setting memory_limit to '1GB'\n# - If you have 16GB+ RAM, you can increase to '4GB' for faster builds\n# - Expected build time: 5-10 minutes on most systems\n```\n\n## Step 4: Download and Ingest Data\n\nNow that your dbt profile is configured, let's load the taxi data into DuckDB. Navigate to the dbt project directory and run the ingestion script\n\n```python\nimport duckdb\nimport requests\nfrom pathlib import Path\n\nBASE_URL = \"https://github.com/DataTalksClub/nyc-tlc-data/releases/download\"\n\ndef download_and_convert_files(taxi_type):\n    data_dir = Path(\"data\") / taxi_type\n    data_dir.mkdir(exist_ok=True, parents=True)\n\n    for year in [2019, 2020]:\n        for month in range(1, 13):\n            parquet_filename = f\"{taxi_type}_tripdata_{year}-{month:02d}.parquet\"\n            parquet_filepath = data_dir / parquet_filename\n\n            if parquet_filepath.exists():\n                print(f\"Skipping {parquet_filename} (already exists)\")\n                continue\n\n            # Download CSV.gz file\n            csv_gz_filename = f\"{taxi_type}_tripdata_{year}-{month:02d}.csv.gz\"\n            csv_gz_filepath = data_dir / csv_gz_filename\n\n            response = requests.get(f\"{BASE_URL}/{taxi_type}/{csv_gz_filename}\", stream=True)\n      
      response.raise_for_status()\n\n            with open(csv_gz_filepath, 'wb') as f:\n                for chunk in response.iter_content(chunk_size=8192):\n                    f.write(chunk)\n\n            print(f\"Converting {csv_gz_filename} to Parquet...\")\n            con = duckdb.connect()\n            con.execute(f\"\"\"\n                COPY (SELECT * FROM read_csv_auto('{csv_gz_filepath}'))\n                TO '{parquet_filepath}' (FORMAT PARQUET)\n            \"\"\")\n            con.close()\n\n            # Remove the CSV.gz file to save space\n            csv_gz_filepath.unlink()\n            print(f\"Completed {parquet_filename}\")\n\ndef update_gitignore():\n    gitignore_path = Path(\".gitignore\")\n\n    # Read existing content or start with empty string\n    content = gitignore_path.read_text() if gitignore_path.exists() else \"\"\n\n    # Add data/ if not already present\n    if 'data/' not in content:\n        with open(gitignore_path, 'a') as f:\n            f.write('\\n# Data directory\\ndata/\\n' if content else '# Data directory\\ndata/\\n')\n\nif __name__ == \"__main__\":\n    # Update .gitignore to exclude data directory\n    update_gitignore()\n\n    for taxi_type in [\"yellow\", \"green\"]:\n        download_and_convert_files(taxi_type)\n\n    con = duckdb.connect(\"taxi_rides_ny.duckdb\")\n    con.execute(\"CREATE SCHEMA IF NOT EXISTS prod\")\n\n    for taxi_type in [\"yellow\", \"green\"]:\n        con.execute(f\"\"\"\n            CREATE OR REPLACE TABLE prod.{taxi_type}_tripdata AS\n            SELECT * FROM read_parquet('data/{taxi_type}/*.parquet', union_by_name=true)\n        \"\"\")\n\n    con.close()\n```\n\nThis script downloads yellow and green taxi data from 2019-2020, creates the `prod` schema, and loads the raw data into DuckDB. 
The download may take several minutes depending on your internet connection.\n\n## Step 5: Test the dbt Connection\n\nVerify dbt can connect to your DuckDB database:\n\n```bash\ndbt debug\n```\n\n## Step 6: Install dbt Power User Extension (VS Code Users)\n\nIf you're using Visual Studio Code, install the **dbt Power User** extension to enhance your dbt development experience.\n\n### What is dbt Power User?\n\ndbt Power User is a VS Code extension that provides:\n\n* SQL syntax highlighting and formatting for dbt models\n* Inline column-level lineage visualization\n* Auto-completion for dbt models, sources, and macros\n* Interactive documentation preview\n* Model compilation and execution directly from the editor\n\n### Why Not Use the Official dbt Extension?\n\ndbt Labs released an official VS Code extension called [dbt Extension](https://marketplace.visualstudio.com/items?itemName=dbtLabsInc.dbt) powered by the new dbt Fusion engine. However, this extension **requires dbt Fusion** and does not support dbt Core.\n\nSince we're using **dbt Core** with DuckDB for local development, we need the community-maintained **dbt Power User by AltimateAI** extension instead. This extension:\n\n* Works seamlessly with dbt Core (not just dbt Cloud)\n* Supports all dbt adapters, including DuckDB\n* Is actively maintained and open source\n* Provides a rich feature set for local development\n\n### Installation\n\n1. Open VS Code\n2. Go to Extensions (Ctrl+Shift+X / Cmd+Shift+X)\n3. Search for \"dbt Power User\"\n4. Install **dbt Power User by AltimateAI** (not the dbt Labs version)\n\nAlternatively, install it from the [VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=innoverio.vscode-dbt-power-user).\n\n> [!NOTE]\n> At this point, your local dbt environment is fully configured and ready to use. 
The next steps (running models, tests, and building documentation) will be covered in the tutorial videos.\n\n## Additional Resources\n\n* [DuckDB Documentation](https://duckdb.org/docs/)\n* [dbt Documentation](https://docs.getdbt.com/)\n* [dbt-duckdb Adapter](https://github.com/duckdb/dbt-duckdb)\n* [NYC Taxi Data Dictionary](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf)\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/.gitignore",
    "content": "# you shouldn't commit these into source control\n# these are the default directory names, adjust/add to fit your needs\ntarget/\ndbt_packages/\nlogs/\nprofiles.yml\n.user.yml\n\n# Data files for DuckDB\ndata/green_tripdata/\ndata/yellow_tripdata/\ndata/\n*.duckdb\n*.duckdb.wal\n.duckdb_temp/\n\n# Parquet data files\n*.parquet\n\n# Python artifacts\n__pycache__/\n*.py[cod]\n*$py.class\n*.so\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\npip-wheel-metadata/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# Virtual environments\nvenv/\nenv/\nENV/\nenv.bak/\nvenv.bak/\n.venv/\n\n# PyCharm\n.idea/\n\n# VS Code\n.vscode/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n*.ipynb\n\n# pyenv\n.python-version\n\n# pytest\n.pytest_cache/\n.coverage\nhtmlcov/\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# GCP credentials and service account keys\n*-key.json\n*-keys.json\n*key*.json\n*credential*.json\n*service-account*.json\n*serviceaccount*.json\nservice-account.json\nserviceaccount.json\ngcp-*.json\ngoogle-*.json\n\n# Environment variables\n.env\n.env.local\n.env.*.local\n*.env\ndbt_internal_packages/\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/dbt_project.yml",
    "content": "name: 'taxi_rides_ny'\nversion: '1.0.0'\n\n# Require a specific dbt version for reproducibility\nrequire-dbt-version: [\">=1.7.0\", \"<3.0.0\"]\n\n# This setting configures which \"profile\" dbt uses for this project.\nprofile: 'taxi_rides_ny'\n\n# These configurations specify where dbt should look for different types of files.\nmodel-paths: [\"models\"]\nanalysis-paths: [\"analyses\"]\ntest-paths: [\"tests\"]\nseed-paths: [\"seeds\"]\nmacro-paths: [\"macros\"]\nsnapshot-paths: [\"snapshots\"]\n\nclean-targets:\n  - \"target\"\n  - \"dbt_packages\"\n\n# Project-level variables\nvars:\n  # Date range for dev environment sampling\n  dev_start_date: '2019-01-01'\n  dev_end_date: '2019-02-01'\n\n# Configuring models\n# Full documentation: https://docs.getdbt.com/docs/configuring-models\nmodels:\n  taxi_rides_ny:\n    staging:\n      +materialized: view\n    intermediate:\n      +materialized: table\n    marts:\n      +materialized: table\nflags:\n  require_generic_test_arguments_property: true"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/macros/get_trip_duration_minutes.sql",
    "content": "{#\n    Calculate trip duration in minutes from pickup and dropoff timestamps.\n\n    Uses dbts built-in cross-database datediff macro.\n    This works seamlessly across DuckDB, BigQuery, Snowflake, Redshift, PostgreSQL, etc.\n\n    Returns: Trip duration as a numeric value in minutes\n#}\n\n{% macro get_trip_duration_minutes(pickup_datetime, dropoff_datetime) %}\n    {{ dbt.datediff(pickup_datetime, dropoff_datetime, 'minute') }}\n{% endmacro %}\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/macros/get_vendor_data.sql",
    "content": "{#\n    Macro to generate vendor_name column using Jinja dictionary.\n\n    This approach works seamlessly across BigQuery, DuckDB, Snowflake, etc.\n    by generating a CASE statement at compile time.\n\n    Usage: {{ get_vendor_data('vendor_id') }}\n    Returns: SQL CASE expression that maps vendor_id to vendor_name\n#}\n\n{% macro get_vendor_data(vendor_id_column) %}\n\n{% set vendors = {\n    1: 'Creative Mobile Technologies',\n    2: 'VeriFone Inc.',\n    4: 'Unknown/Other'\n} %}\n\ncase {{ vendor_id_column }}\n    {% for vendor_id, vendor_name in vendors.items() %}\n    when {{ vendor_id }} then '{{ vendor_name }}'\n    {% endfor %}\nend\n\n{% endmacro %}\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/macros/macros_properties.yml",
    "content": "macros:\n  - name: get_trip_duration_minutes\n    description: >\n      Calculates trip duration in minutes from pickup and dropoff timestamps.\n      This macro is cross-database compatible, supporting both DuckDB and BigQuery.\n      Returns a numeric value representing the duration in minutes.\n    arguments:\n      - name: pickup_datetime\n        type: timestamp\n        description: The pickup timestamp\n      - name: dropoff_datetime\n        type: timestamp\n        description: The dropoff timestamp\n\n  - name: get_vendor_data\n    description: >\n      Generates a CASE statement that maps vendor_id to vendor_name.\n      This macro is cross-database compatible and generates SQL at compile time using a Jinja dictionary.\n      Supports vendor IDs: 1 (Creative Mobile Technologies), 2 (VeriFone Inc.), 4 (Unknown/Other).\n    arguments:\n      - name: vendor_id_column\n        type: integer\n        description: The column name containing the vendor ID\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/macros/safe_cast.sql",
    "content": "{% macro safe_cast(column, data_type) %}\n    {% if target.type == 'bigquery' %}\n        safe_cast({{ column }} as {{ data_type }})\n    {% else %}\n        cast({{ column }} as {{ data_type }})\n    {% endif %}\n{% endmacro %}\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/intermediate/int_trips.sql",
    "content": "-- Enrich and deduplicate trip data\n-- Demonstrates enrichment and surrogate key generation\n-- Note: Data quality analysis available in analyses/trips_data_quality.sql\n\nwith unioned as (\n    select * from {{ ref('int_trips_unioned') }}\n),\n\npayment_types as (\n    select * from {{ ref('payment_type_lookup') }}\n),\n\ncleaned_and_enriched as (\n    select\n        -- Generate unique trip identifier (surrogate key pattern)\n        {{ dbt_utils.generate_surrogate_key(['u.vendor_id', 'u.pickup_datetime', 'u.pickup_location_id', 'u.service_type']) }} as trip_id,\n\n        -- Identifiers\n        u.vendor_id,\n        u.service_type,\n        u.rate_code_id,\n\n        -- Location IDs\n        u.pickup_location_id,\n        u.dropoff_location_id,\n\n        -- Timestamps\n        u.pickup_datetime,\n        u.dropoff_datetime,\n\n        -- Trip details\n        u.store_and_fwd_flag,\n        u.passenger_count,\n        u.trip_distance,\n        u.trip_type,\n\n        -- Payment breakdown\n        u.fare_amount,\n        u.extra,\n        u.mta_tax,\n        u.tip_amount,\n        u.tolls_amount,\n        u.ehail_fee,\n        u.improvement_surcharge,\n        u.total_amount,\n\n        -- Enrich with payment type description\n        coalesce(u.payment_type, 0) as payment_type,\n        coalesce(pt.description, 'Unknown') as payment_type_description\n\n    from unioned u\n    left join payment_types pt\n        on coalesce(u.payment_type, 0) = pt.payment_type\n)\n\nselect * from cleaned_and_enriched\n\n-- Deduplicate: if multiple trips match (same vendor, second, location, service), keep first\nqualify row_number() over(\n    partition by vendor_id, pickup_datetime, pickup_location_id, service_type\n    order by dropoff_datetime\n) = 1\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/intermediate/int_trips_unioned.sql",
    "content": "-- Union green and yellow taxi data into a single dataset\n-- Demonstrates how to combine data from multiple sources with slightly different schemas\n\nwith green_trips as (\n    select\n        vendor_id,\n        rate_code_id,\n        pickup_location_id,\n        dropoff_location_id,\n        pickup_datetime,\n        dropoff_datetime,\n        store_and_fwd_flag,\n        passenger_count,\n        trip_distance,\n        trip_type,\n        fare_amount,\n        extra,\n        mta_tax,\n        tip_amount,\n        tolls_amount,\n        ehail_fee,\n        improvement_surcharge,\n        total_amount,\n        payment_type,\n        'Green' as service_type\n    from {{ ref('stg_green_tripdata') }}\n),\n\nyellow_trips as (\n    select\n        vendor_id,\n        rate_code_id,\n        pickup_location_id,\n        dropoff_location_id,\n        pickup_datetime,\n        dropoff_datetime,\n        store_and_fwd_flag,\n        passenger_count,\n        trip_distance,\n        cast(1 as integer) as trip_type,  -- Yellow taxis only do street-hail (code 1)\n        fare_amount,\n        extra,\n        mta_tax,\n        tip_amount,\n        tolls_amount,\n        cast(0 as numeric) as ehail_fee,  -- Yellow taxis don't have ehail_fee\n        improvement_surcharge,\n        total_amount,\n        payment_type,\n        'Yellow' as service_type\n    from {{ ref('stg_yellow_tripdata') }}\n)\n\nselect * from green_trips\nunion all\nselect * from yellow_trips\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/intermediate/schema.yml",
    "content": "models:\n  - name: int_trips_unioned\n    description: Union of green and yellow taxi trip data with normalized schema\n    columns:\n      - name: vendor_id\n        description: Taxi technology provider ID\n      - name: rate_code_id\n        description: Rate code at end of trip (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)\n      - name: pickup_location_id\n        description: TLC Taxi Zone where trip started\n      - name: dropoff_location_id\n        description: TLC Taxi Zone where trip ended\n      - name: pickup_datetime\n        description: Timestamp when meter was engaged\n      - name: dropoff_datetime\n        description: Timestamp when meter was disengaged\n      - name: store_and_fwd_flag\n        description: Trip record stored in vehicle memory (Y/N)\n      - name: passenger_count\n        description: Number of passengers in the vehicle\n      - name: trip_distance\n        description: Trip distance in miles\n      - name: trip_type\n        description: Trip type (1=Street-hail, 2=Dispatch)\n      - name: fare_amount\n        description: Time and distance fare\n      - name: extra\n        description: Miscellaneous extras and surcharges\n      - name: mta_tax\n        description: MTA tax\n      - name: tip_amount\n        description: Tip amount (credit card only)\n      - name: tolls_amount\n        description: Total tolls paid\n      - name: ehail_fee\n        description: E-hail service fee\n      - name: improvement_surcharge\n        description: Improvement surcharge\n      - name: total_amount\n        description: Total amount charged to passenger\n      - name: payment_type\n        description: Payment method code\n      - name: service_type\n        description: Type of taxi service (Green or Yellow)\n\n  - name: int_trips\n    description: Cleaned, enriched, and deduplicated trip data ready for marts\n    columns:\n      - name: trip_id\n        description: Unique trip identifier 
(surrogate key)\n        data_tests:\n          - unique\n          - not_null\n      - name: vendor_id\n        description: Taxi technology provider ID\n        data_tests:\n          - not_null\n      - name: service_type\n        description: Type of taxi service (Green or Yellow)\n        data_tests:\n          - not_null\n          - accepted_values:\n              arguments:\n                values: ['Green', 'Yellow']\n      - name: rate_code_id\n        description: Rate code at end of trip\n      - name: pickup_location_id\n        description: TLC Taxi Zone where trip started\n      - name: dropoff_location_id\n        description: TLC Taxi Zone where trip ended\n      - name: pickup_datetime\n        description: Timestamp when meter was engaged\n        data_tests:\n          - not_null\n      - name: dropoff_datetime\n        description: Timestamp when meter was disengaged\n      - name: store_and_fwd_flag\n        description: Trip record stored in vehicle memory (Y/N)\n      - name: passenger_count\n        description: Number of passengers in the vehicle\n      - name: trip_distance\n        description: Trip distance in miles\n      - name: trip_type\n        description: Trip type (1=Street-hail, 2=Dispatch)\n      - name: fare_amount\n        description: Time and distance fare\n      - name: extra\n        description: Miscellaneous extras and surcharges\n      - name: mta_tax\n        description: MTA tax\n      - name: tip_amount\n        description: Tip amount (credit card only)\n      - name: tolls_amount\n        description: Total tolls paid\n      - name: ehail_fee\n        description: E-hail service fee\n      - name: improvement_surcharge\n        description: Improvement surcharge\n      - name: total_amount\n        description: Total amount charged to passenger\n        data_tests:\n          - not_null\n      - name: payment_type\n        description: Payment method code\n      - name: payment_type_description\n        
description: Human-readable payment method description\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/marts/dim_vendors.sql",
    "content": "-- Dimension table for taxi technology vendors\n-- Small static dimension defining vendor codes and their company names\n\nwith trips as (\n    select * from {{ ref('fct_trips') }}\n),\n\nvendors as (\n    select distinct\n        vendor_id,\n        {{ get_vendor_data('vendor_id') }} as vendor_name\n    from trips\n)\n\nselect * from vendors\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/marts/dim_zones.sql",
    "content": "-- Dimension table for NYC taxi zones\n-- This is a simple pass-through from the seed file, but having it as a model\n-- allows for future enhancements (e.g., adding calculated fields, filtering)\n\nselect\n    locationid as location_id,\n    borough,\n    zone,\n    service_zone\nfrom {{ ref('taxi_zone_lookup') }}"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/marts/fct_trips.sql",
    "content": "{{\n  config(\n    materialized='incremental',\n    unique_key='trip_id',\n    incremental_strategy='merge',\n    on_schema_change='append_new_columns'\n  )\n}}\n\n-- Fact table containing all taxi trips enriched with zone information\n-- This is a classic star schema design: fact table (trips) joined to dimension table (zones)\n-- Materialized incrementally to handle large datasets efficiently\n\nselect\n    -- Trip identifiers\n    trips.trip_id,\n    trips.vendor_id,\n    trips.service_type,\n    trips.rate_code_id,\n\n    -- Location details (enriched with human-readable zone names from dimension)\n    trips.pickup_location_id,\n    pz.borough as pickup_borough,\n    pz.zone as pickup_zone,\n    trips.dropoff_location_id,\n    dz.borough as dropoff_borough,\n    dz.zone as dropoff_zone,\n\n    -- Trip timing\n    trips.pickup_datetime,\n    trips.dropoff_datetime,\n    trips.store_and_fwd_flag,\n\n    -- Trip metrics\n    trips.passenger_count,\n    trips.trip_distance,\n    trips.trip_type,\n    {{ get_trip_duration_minutes('trips.pickup_datetime', 'trips.dropoff_datetime') }} as trip_duration_minutes,\n\n    -- Payment breakdown\n    trips.fare_amount,\n    trips.extra,\n    trips.mta_tax,\n    trips.tip_amount,\n    trips.tolls_amount,\n    trips.ehail_fee,\n    trips.improvement_surcharge,\n    trips.total_amount,\n    trips.payment_type,\n    trips.payment_type_description\n\nfrom {{ ref('int_trips') }} as trips\n-- LEFT JOIN preserves all trips even if zone information is missing or unknown\nleft join {{ ref('dim_zones') }} as pz\n    on trips.pickup_location_id = pz.location_id\nleft join {{ ref('dim_zones') }} as dz\n    on trips.dropoff_location_id = dz.location_id\n\n{% if is_incremental() %}\n  -- Only process new trips based on pickup datetime\n  where trips.pickup_datetime > (select max(pickup_datetime) from {{ this }})\n{% endif %}\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/marts/reporting/fct_monthly_zone_revenue.sql",
    "content": "-- Data mart for monthly revenue analysis by pickup zone and service type\n-- This aggregation is optimized for business reporting and dashboards\n-- Enables analysis of revenue trends across different zones and taxi types\n\nselect\n    -- Grouping dimensions\n    coalesce(pickup_zone, 'Unknown Zone') as pickup_zone,\n    {% if target.type == 'bigquery' %}cast(date_trunc(pickup_datetime, month) as date)\n    {% elif target.type == 'duckdb' %}date_trunc('month', pickup_datetime)\n    {% endif %} as revenue_month,\n    service_type,\n\n    -- Revenue breakdown (summed by zone, month, and service type)\n    sum(fare_amount) as revenue_monthly_fare,\n    sum(extra) as revenue_monthly_extra,\n    sum(mta_tax) as revenue_monthly_mta_tax,\n    sum(tip_amount) as revenue_monthly_tip_amount,\n    sum(tolls_amount) as revenue_monthly_tolls_amount,\n    sum(ehail_fee) as revenue_monthly_ehail_fee,\n    sum(improvement_surcharge) as revenue_monthly_improvement_surcharge,\n    sum(total_amount) as revenue_monthly_total_amount,\n\n    -- Additional metrics for operational analysis\n    count(trip_id) as total_monthly_trips,\n    avg(passenger_count) as avg_monthly_passenger_count,\n    avg(trip_distance) as avg_monthly_trip_distance\n\nfrom {{ ref('fct_trips') }}\ngroup by pickup_zone, revenue_month, service_type"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/marts/reporting/schema.yml",
    "content": "models:\n  - name: fct_monthly_zone_revenue\n    description: Monthly revenue aggregation by pickup zone and service type for business reporting\n    data_tests:\n      - dbt_utils.unique_combination_of_columns:\n          arguments:\n            combination_of_columns:\n              - pickup_zone\n              - revenue_month\n              - service_type\n    columns:\n      - name: pickup_zone\n        description: Pickup zone where revenue was generated\n        data_tests:\n          - not_null\n      - name: revenue_month\n        description: Month for revenue aggregation\n        data_tests:\n          - not_null\n      - name: service_type\n        description: Service type (Green or Yellow)\n        data_tests:\n          - not_null\n          - accepted_values:\n              arguments:\n                values: ['Green', 'Yellow']\n      - name: revenue_monthly_total_amount\n        description: Monthly sum of total fares\n        data_tests:\n          - not_null\n      - name: total_monthly_trips\n        description: Count of trips in the month\n        data_tests:\n          - not_null\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/marts/schema.yml",
    "content": "models:\n  - name: dim_zones\n    description: Taxi zone dimension table with location details\n    columns:\n      - name: location_id\n        description: Unique identifier for each taxi zone\n        data_tests:\n          - unique\n          - not_null\n      - name: borough\n        description: NYC borough name\n      - name: zone\n        description: Specific zone name within the borough\n      - name: service_zone\n        description: Service zone classification\n\n  - name: dim_vendors\n    description: Taxi technology vendor dimension table\n    columns:\n      - name: vendor_id\n        description: Unique vendor identifier\n        data_tests:\n          - unique\n          - not_null\n      - name: vendor_name\n        description: Company name of the vendor\n\n  - name: fct_trips\n    description: Fact table with all taxi trips including trip and payment details\n    config:\n      contract:\n        enforced: true\n    columns:\n      - name: trip_id\n        description: Unique trip identifier\n        data_type: string\n        data_tests:\n          - unique\n          - not_null\n      - name: vendor_id\n        description: Taxi technology provider\n        data_type: integer\n        data_tests:\n          - not_null\n      - name: service_type\n        description: Type of taxi service (Green or Yellow)\n        data_type: string\n        data_tests:\n          - accepted_values:\n              arguments:\n                values: ['Green', 'Yellow']\n          - not_null\n      - name: rate_code_id\n        description: Final rate code\n        data_type: integer\n      - name: pickup_location_id\n        description: TLC Taxi Zone where trip started\n        data_type: integer\n        data_tests:\n          - relationships:\n              arguments:\n                to: ref('dim_zones')\n                field: location_id\n      - name: pickup_borough\n        description: NYC borough where trip started\n        data_type: 
string\n      - name: pickup_zone\n        description: Specific zone where trip started\n        data_type: string\n      - name: dropoff_location_id\n        description: TLC Taxi Zone where trip ended\n        data_type: integer\n        data_tests:\n          - relationships:\n              arguments:\n                to: ref('dim_zones')\n                field: location_id\n      - name: dropoff_borough\n        description: NYC borough where trip ended\n        data_type: string\n      - name: dropoff_zone\n        description: Specific zone where trip ended\n        data_type: string\n      - name: pickup_datetime\n        description: Timestamp when meter was engaged\n        data_type: timestamp\n        data_tests:\n          - not_null\n      - name: dropoff_datetime\n        description: Timestamp when meter was disengaged\n        data_type: timestamp\n      - name: store_and_fwd_flag\n        description: Trip record stored in vehicle memory (Y/N)\n        data_type: string\n      - name: passenger_count\n        description: Number of passengers\n        data_type: integer\n      - name: trip_distance\n        description: Trip distance in miles\n        data_type: numeric\n      - name: trip_type\n        description: Trip type (1=Street-hail, 2=Dispatch)\n        data_type: integer\n      - name: trip_duration_minutes\n        description: Trip duration in minutes (calculated using cross-database macro)\n        data_type: bigint\n      - name: fare_amount\n        description: Time and distance fare\n        data_type: numeric\n      - name: extra\n        description: Miscellaneous extras and surcharges\n        data_type: numeric\n      - name: mta_tax\n        description: MTA tax\n        data_type: numeric\n      - name: tip_amount\n        description: Tip amount (credit card only)\n        data_type: numeric\n      - name: tolls_amount\n        description: Total tolls paid\n        data_type: numeric\n      - name: ehail_fee\n        
description: E-hail service fee\n        data_type: numeric\n      - name: improvement_surcharge\n        description: Improvement surcharge\n        data_type: numeric\n      - name: total_amount\n        description: Total amount charged\n        data_type: numeric\n        data_tests:\n          - not_null\n      - name: payment_type\n        description: Payment method code\n        data_type: integer\n      - name: payment_type_description\n        description: Human-readable payment method description\n        data_type: string"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/staging/schema.yml",
    "content": "models:\n  - name: stg_green_tripdata\n    description: >\n      Staging model for green taxi trip data. This model standardizes column names\n      and data types from the raw green_tripdata source, providing a clean foundation\n      for downstream transformations.\n    columns:\n      - name: vendor_id\n        description: Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)\n        data_tests:\n          - not_null\n      - name: rate_code_id\n        description: Rate code at end of trip (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)\n      - name: pickup_location_id\n        description: TLC Taxi Zone where the meter was engaged\n      - name: dropoff_location_id\n        description: TLC Taxi Zone where the meter was disengaged\n      - name: pickup_datetime\n        description: Date and time when the meter was engaged\n        data_tests:\n          - not_null\n      - name: dropoff_datetime\n        description: Date and time when the meter was disengaged\n      - name: store_and_fwd_flag\n        description: Flag indicating if trip record was held in vehicle memory (Y/N)\n      - name: passenger_count\n        description: Number of passengers in the vehicle (driver-entered value)\n      - name: trip_distance\n        description: Trip distance in miles reported by the taximeter\n      - name: trip_type\n        description: Code for trip type (1=Street-hail, 2=Dispatch)\n      - name: fare_amount\n        description: Time and distance fare calculated by the meter\n      - name: extra\n        description: Miscellaneous extras and surcharges (rush hour, overnight)\n      - name: mta_tax\n        description: $0.50 MTA tax automatically triggered based on meter rate\n      - name: tip_amount\n        description: Tip amount (credit card tips only, cash tips not included)\n      - name: tolls_amount\n        description: Total amount of all tolls paid during the trip\n      - name: 
ehail_fee\n        description: E-hail service fee\n      - name: improvement_surcharge\n        description: Improvement surcharge assessed on hailed trips\n      - name: total_amount\n        description: Total amount charged to passengers (does not include cash tips)\n      - name: payment_type\n        description: Payment method code (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)\n\n  - name: stg_yellow_tripdata\n    description: >\n      Staging model for yellow taxi trip data. This model standardizes column names\n      and data types from the raw yellow_tripdata source, providing a clean foundation\n      for downstream transformations.\n    columns:\n      - name: vendor_id\n        description: Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.)\n        data_tests:\n          - not_null\n      - name: rate_code_id\n        description: Rate code at end of trip (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)\n      - name: pickup_location_id\n        description: TLC Taxi Zone where the meter was engaged\n      - name: dropoff_location_id\n        description: TLC Taxi Zone where the meter was disengaged\n      - name: pickup_datetime\n        description: Date and time when the meter was engaged\n        data_tests:\n          - not_null\n      - name: dropoff_datetime\n        description: Date and time when the meter was disengaged\n      - name: store_and_fwd_flag\n        description: Flag indicating if trip record was held in vehicle memory (Y/N)\n      - name: passenger_count\n        description: Number of passengers in the vehicle (driver-entered value)\n      - name: trip_distance\n        description: Trip distance in miles reported by the taximeter\n      - name: fare_amount\n        description: Time and distance fare calculated by the meter\n      - name: extra\n        description: Miscellaneous extras and surcharges (rush hour, overnight)\n      - name: mta_tax\n   
     description: $0.50 MTA tax automatically triggered based on meter rate\n      - name: tip_amount\n        description: Tip amount (credit card tips only, cash tips not included)\n      - name: tolls_amount\n        description: Total amount of all tolls paid during the trip\n      - name: improvement_surcharge\n        description: Improvement surcharge assessed on hailed trips\n      - name: total_amount\n        description: Total amount charged to passengers (does not include cash tips)\n      - name: payment_type\n        description: Payment method code (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/staging/sources.yml",
    "content": "sources:\n  - name: raw\n    description: Raw taxi trip data from NYC TLC\n    database: |\n      {%- if target.type == 'bigquery' -%}\n        {{ env_var('GCP_PROJECT_ID', 'please-add-your-gcp-project-id-here') }}\n      {%- else -%}\n        taxi_rides_ny\n      {%- endif -%}\n    schema: |\n      {%- if target.type == 'bigquery' -%}\n        nytaxi\n      {%- else -%}\n        prod\n      {%- endif -%}\n    tables:\n      - name: green_tripdata\n        description: Raw green taxi trip records\n        columns:\n          - name: vendorid\n            description: \"Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.) - Note: Raw data may contain nulls, filtered in staging\"\n          - name: lpep_pickup_datetime\n            description: Date and time when the meter was engaged\n          - name: lpep_dropoff_datetime\n            description: Date and time when the meter was disengaged\n          - name: passenger_count\n            description: Number of passengers in the vehicle\n          - name: trip_distance\n            description: Trip distance in miles\n          - name: pulocationid\n            description: TLC Taxi Zone where the meter was engaged\n          - name: dolocationid\n            description: TLC Taxi Zone where the meter was disengaged\n          - name: ratecodeid\n            description: Rate code (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)\n          - name: store_and_fwd_flag\n            description: Trip record held in vehicle memory (Y/N)\n          - name: payment_type\n            description: Payment method (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)\n          - name: fare_amount\n            description: Time and distance fare\n          - name: extra\n            description: Miscellaneous extras and surcharges\n          - name: mta_tax\n            description: MTA tax\n          - name: tip_amount\n            
description: Tip amount (credit card only)\n          - name: tolls_amount\n            description: Total tolls paid\n          - name: improvement_surcharge\n            description: Improvement surcharge\n          - name: total_amount\n            description: Total amount charged\n          - name: trip_type\n            description: Trip type (1=Street-hail, 2=Dispatch)\n          - name: ehail_fee\n            description: E-hail fee\n\n        config:\n          loaded_at_field: lpep_pickup_datetime\n      - name: yellow_tripdata\n        description: Raw yellow taxi trip records\n        columns:\n          - name: vendorid\n            description: \"Taxi technology provider (1 = Creative Mobile Technologies, 2 = VeriFone Inc.) - Note: Raw data may contain nulls, filtered in staging\"\n          - name: tpep_pickup_datetime\n            description: Date and time when the meter was engaged\n          - name: tpep_dropoff_datetime\n            description: Date and time when the meter was disengaged\n          - name: passenger_count\n            description: Number of passengers in the vehicle\n          - name: trip_distance\n            description: Trip distance in miles\n          - name: pulocationid\n            description: TLC Taxi Zone where the meter was engaged\n          - name: dolocationid\n            description: TLC Taxi Zone where the meter was disengaged\n          - name: ratecodeid\n            description: Rate code (1=Standard, 2=JFK, 3=Newark, 4=Nassau/Westchester, 5=Negotiated, 6=Group)\n          - name: store_and_fwd_flag\n            description: Trip record held in vehicle memory (Y/N)\n          - name: payment_type\n            description: Payment method (1=Credit card, 2=Cash, 3=No charge, 4=Dispute, 5=Unknown, 6=Voided)\n          - name: fare_amount\n            description: Time and distance fare\n          - name: extra\n            description: Miscellaneous extras and surcharges\n          - name: mta_tax\n           
 description: MTA tax\n          - name: tip_amount\n            description: Tip amount (credit card only)\n          - name: tolls_amount\n            description: Total tolls paid\n          - name: improvement_surcharge\n            description: Improvement surcharge\n          - name: total_amount\n            description: Total amount charged\n        config:\n          loaded_at_field: tpep_pickup_datetime\n    config:\n      freshness:\n        warn_after: {count: 24, period: hour}\n        error_after: {count: 48, period: hour}\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/staging/stg_green_tripdata.sql",
    "content": "with source as (\n    select * from {{ source('raw', 'green_tripdata') }}\n),\n\nrenamed as (\n    select\n        -- identifiers\n        cast(vendorid as integer) as vendor_id,\n        {{ safe_cast('ratecodeid', 'integer') }} as rate_code_id,\n        cast(pulocationid as integer) as pickup_location_id,\n        cast(dolocationid as integer) as dropoff_location_id,\n\n        -- timestamps\n        cast(lpep_pickup_datetime as timestamp) as pickup_datetime,  -- lpep = Livery Passenger Enhancement Program (green taxis)\n        cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,\n\n        -- trip info\n        cast(store_and_fwd_flag as string) as store_and_fwd_flag,\n        cast(passenger_count as integer) as passenger_count,\n        cast(trip_distance as numeric) as trip_distance,\n        {{ safe_cast('trip_type', 'integer') }} as trip_type,\n\n        -- payment info\n        cast(fare_amount as numeric) as fare_amount,\n        cast(extra as numeric) as extra,\n        cast(mta_tax as numeric) as mta_tax,\n        cast(tip_amount as numeric) as tip_amount,\n        cast(tolls_amount as numeric) as tolls_amount,\n        cast(ehail_fee as numeric) as ehail_fee,\n        cast(improvement_surcharge as numeric) as improvement_surcharge,\n        cast(total_amount as numeric) as total_amount,\n        {{ safe_cast('payment_type', 'integer') }} as payment_type\n    from source\n    -- Filter out records with null vendor_id (data quality requirement)\n    where vendorid is not null\n)\n\nselect * from renamed\n\n-- Sample records for dev environment using deterministic date filter\n{% if target.name == 'dev' %}\nwhere pickup_datetime >= '2019-01-01' and pickup_datetime < '2019-02-01'\n{% endif %}\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/models/staging/stg_yellow_tripdata.sql",
    "content": "with source as (\n    select * from {{ source('raw', 'yellow_tripdata') }}\n),\n\nrenamed as (\n    select\n        -- identifiers (standardized naming for consistency across yellow/green)\n        cast(vendorid as integer) as vendor_id,\n        cast(ratecodeid as integer) as rate_code_id,\n        cast(pulocationid as integer) as pickup_location_id,\n        cast(dolocationid as integer) as dropoff_location_id,\n\n        -- timestamps (standardized naming)\n        cast(tpep_pickup_datetime as timestamp) as pickup_datetime,  -- tpep = Taxicab Passenger Enhancement Program (yellow taxis)\n        cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,\n\n        -- trip info\n        cast(store_and_fwd_flag as string) as store_and_fwd_flag,\n        cast(passenger_count as integer) as passenger_count,\n        cast(trip_distance as numeric) as trip_distance,\n\n        -- payment info\n        cast(fare_amount as numeric) as fare_amount,\n        cast(extra as numeric) as extra,\n        cast(mta_tax as numeric) as mta_tax,\n        cast(tip_amount as numeric) as tip_amount,\n        cast(tolls_amount as numeric) as tolls_amount,\n        cast(improvement_surcharge as numeric) as improvement_surcharge,\n        cast(total_amount as numeric) as total_amount,\n        cast(payment_type as integer) as payment_type\n\n    from source\n    -- Filter out records with null vendor_id (data quality requirement)\n    where vendorid is not null\n)\n\nselect * from renamed\n\n-- Sample records for dev environment using deterministic date filter\n{% if target.name == 'dev' %}\nwhere pickup_datetime >= '2019-01-01' and pickup_datetime < '2019-02-01'\n{% endif %}\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/package-lock.yml",
    "content": "packages:\n  - name: dbt_utils\n    package: dbt-labs/dbt_utils\n    version: 1.3.3\n  - name: codegen\n    package: dbt-labs/codegen\n    version: 0.14.0\nsha1_hash: 01f31e0d658d76121f50e62b998342ebf138df11\n"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/packages.yml",
    "content": "packages:\n  - package: dbt-labs/dbt_utils\n    version: [\">=1.3.0\", \"<2.0.0\"]\n  - package: dbt-labs/codegen\n    version: [\">=0.14.0\", \"<1.0.0\"]"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/seeds/seeds_properties.yml",
    "content": "seeds:\n  - name: taxi_zone_lookup\n    description: >\n      Taxi Zones roughly based on NYC Department of City Planning's Neighborhood\n      Tabulation Areas (NTAs) and are meant to approximate neighborhoods, so you can see which\n      neighborhood a passenger was picked up in, and which neighborhood they were dropped off in.\n      Includes associated service_zone (EWR, Boro Zone, Yellow Zone)\n\n  - name: payment_type_lookup\n    description: >\n      Payment type reference data mapping payment type codes to their descriptions.\n      Used as a dimension table for payment method analysis.\n    columns:\n      - name: payment_type\n        description: Numeric code for payment type\n        data_tests:\n          - unique\n          - not_null\n      - name: description\n        description: Human-readable description of payment method"
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/snapshots/.gitkeep",
    "content": ""
  },
  {
    "path": "04-analytics-engineering/taxi_rides_ny/tests/.gitkeep",
    "content": ""
  },
  {
    "path": "05-data-platforms/README.md",
    "content": "# Module 5: Data Platforms\n\n## Overview\n\nIn this module, you'll learn about data platforms - tools that help you manage the entire data lifecycle from ingestion to analytics.\n\nWe'll use [Bruin](https://getbruin.com/) as an example of a data platform. Bruin puts multiple tools under one platform:\n\n- Data ingestion (extract from sources to your warehouse)\n- Data transformation (cleaning, modeling, aggregating)\n- Data orchestration (scheduling and dependency management)\n- Data quality (built-in checks and validation)\n- Metadata management (lineage, documentation)\n\n## Tutorial\n\nFollow the complete hands-on tutorial at:\n\n[Bruin Data Engineering Zoomcamp Template](https://github.com/bruin-data/bruin/tree/main/templates/zoomcamp)\n\nThe template is a TODO-based learning exercise — run `bruin init zoomcamp my-taxi-pipeline` and fill in the configuration and code guided by inline comments. The [notes](notes/) contain completed reference implementations.\n\n## Videos\n\n### :movie_camera: 5.1 - Introduction to Bruin\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/f6vg7lGqZx0)](https://youtu.be/f6vg7lGqZx0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=1)\n\nIntroduction to the Bruin data platform: what it is, what a modern data stack looks like (ETL/ELT, orchestration, data quality), and how Bruin brings all of these together into a single project.\n\n- [Notes](notes/01-introduction.md)\n\n\n### :movie_camera: 5.2 - Getting Started with Bruin\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/JJwHKSidX_c)](https://youtu.be/JJwHKSidX_c&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=2)\n\nInstall Bruin, set up the VS Code/Cursor extension and Bruin MCP, and create a first project using `bruin init`. 
Walk through environments, connections (DuckDB, Chess.com), pipeline YAML configuration, and running Python, YAML ingestor, and SQL assets.\n\n- [Notes](notes/02-getting-started.md)\n\n\n### :movie_camera: 5.3 - Building an End-to-End Pipeline with NYC Taxi Data\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/q0k_iz9kWsI)](https://youtu.be/q0k_iz9kWsI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=3)\n\nBuild a full pipeline with a three-layered architecture (ingestion, staging, reports) using NYC taxi data and DuckDB.\n\n- [Notes](notes/03-nyc-taxi-pipeline.md)\n\n\n### :movie_camera: 5.4 - Using Bruin MCP with AI Agents\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/224xH7h8OaQ)](https://youtu.be/224xH7h8OaQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=4)\n\nInstall the Bruin MCP in Cursor/VS Code and use an AI agent to build the entire NYC taxi pipeline end to end. Query data conversationally, ask questions about pipeline logic, and troubleshoot issues — all through natural language.\n\n- [Notes](notes/04-bruin-mcp.md)\n\n\n### :movie_camera: 5.5 - Deploying to Bruin Cloud\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/uBqjLEwF8rc)](https://youtu.be/uBqjLEwF8rc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=5)\n\nRegister for Bruin Cloud, connect your GitHub repository, set up data warehouse connections, deploy and monitor your pipelines with a fully managed infrastructure.\n\n- [Notes](notes/05-bruin-cloud.md)\n\n\n## Bruin Core Concepts\n\nShort videos covering the fundamental concepts of Bruin: projects, pipelines, assets, variables, and commands.\n\n### :movie_camera: Projects\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/YWDjnSxbBtY)](https://www.youtube.com/watch?v=YWDjnSxbBtY)\n\nThe root directory where you create your Bruin data pipeline. 
Learn about project initialization, the `.bruin.yml` configuration file, environments, and connections.\n\n- [Notes](notes/06-core-01-projects.md)\n\n\n### :movie_camera: Pipelines\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/uzp_DiR4Sok)](https://www.youtube.com/watch?v=uzp_DiR4Sok)\n\nA grouping mechanism for organizing assets based on their execution schedule. Each pipeline has a single schedule and its own configuration file.\n\n- [Notes](notes/06-core-02-pipelines.md)\n\n\n### :movie_camera: Assets\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/ZElY5SoqrwI)](https://www.youtube.com/watch?v=ZElY5SoqrwI)\n\nSingle files that perform specific tasks, creating or updating tables/views in your database. Covers SQL, Python, and YAML asset types with examples.\n\n- [Notes](notes/06-core-03-assets.md)\n\n\n### :movie_camera: Variables\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/XCx0nDmhhxA)](https://www.youtube.com/watch?v=XCx0nDmhhxA)\n\nDynamic values initialized at each pipeline run. Learn about built-in variables (start_date, end_date) and custom variables for parameterizing your pipelines.\n\n- [Notes](notes/06-core-04-variables.md)\n\n\n### :movie_camera: Commands\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/3nykPEs_V7E)](https://www.youtube.com/watch?v=3nykPEs_V7E)\n\nCLI commands for interacting with your Bruin project: `bruin run`, `bruin validate`, `bruin lineage`, and more with practical examples.\n\n- [Notes](notes/06-core-05-commands.md)\n\n\n## Resources\n\n- [Bruin Documentation](https://getbruin.com/docs)\n- [Bruin GitHub Repository](https://github.com/bruin-data/bruin)\n- [Bruin MCP (AI Integration)](https://getbruin.com/docs/bruin/getting-started/bruin-mcp)\n- [Bruin Cloud](https://getbruin.com/) — managed deployment and monitoring\n\n# Homework\n\n* [2026 Homework](../cohorts/2026/05-data-platforms/homework.md)\n\n# Community notes\n\n<details>\n<summary>Did you take notes? 
You can share them here</summary>\n\n* Add your notes here (above this line)\n\n</details>\n"
  },
  {
    "path": "05-data-platforms/notes/01-introduction.md",
    "content": "# 5.1 - Introduction to Bruin\n\n## What is Bruin?\n\nBruin is an end-to-end data platform that combines ingestion, transformations, orchestration, data quality checks, metadata, and lineage into a single tool.\n\nInstead of using five or six different tools configured separately, Bruin lets you have your code logic, configurations, dependencies, and quality checks all in the same place.\n\n## The modern data stack\n\nA typical data stack involves several components:\n\n- Extract/ingest data from third-party sources or databases into a data warehouse or data lake\n- Run transformations: clean data, create reports, push results to a warehouse, lake, or third-party application\n- Orchestrate: tell different scripts and services when to run, how to run, and how to communicate with each other\n- Data quality and governance: ensure accuracy, completeness, and consistency of data before delivering it to consumers\n\nBruin brings all of these together so you don't need to be a DevOps person, data infrastructure engineer, and data architect just to build a pipeline.\n\n## Learning goals for the tutorial series\n\n- Bruin project structure\n- What is a pipeline and what are assets\n- How to configure pipelines\n- Materialization strategies supported by Bruin\n- Lineage and how to build dependencies between assets\n- Metadata created automatically and manually\n- Parameterizing pipelines with custom variables\n"
  },
  {
    "path": "05-data-platforms/notes/02-getting-started.md",
    "content": "# 5.2 - Getting Started with Bruin\n\n## Installation\n\nInstall Bruin CLI:\n\n```bash\ncurl -LsSf https://getbruin.com/install/cli | sh\nbruin version\n```\n\nInstall the Bruin extension for VS Code or Cursor. This adds a Bruin render panel that lets you run assets and pipelines directly from the IDE.\n\n## Bruin MCP\n\nBruin provides an MCP (Model Context Protocol) server that you can add to your IDE (Cursor, VS Code) to use AI agents for creating pipelines. Add the Bruin MCP under your IDE settings > Tools and MCP.\n\n### Bruin MCP Integration for VS Code\n\n Create a new file `mcp.json` in your Repository Root:\nIn the root directory of your project (the same level as your `.git` folder or `package.json`), create a new file named `mcp.json`.\n\nAdd the Configuration:\nOpen the `mcp.json` file and paste the following JSON configuration into it:\n\n```json\n{\n  \"servers\": {\n    \"bruin\": {\n      \"type\": \"stdio\",\n      \"command\": \"bruin\",\n      \"args\": [\n        \"mcp\"\n      ]\n    }\n  },\n  \"inputs\": []\n}\n```\n\nThis configuration instructs VS Code to launch the `bruin mcp` command, establishing a standard input/output connection with the Bruin MCP server.\n\n## Initializing a project\n\n```bash\nbruin init default my-first-pipeline\ncd my-first-pipeline\n```\n\nThis creates a project from a template, initializes git, adds a `.gitignore`, and creates the `bruin.yaml` file.\n\nBruin requires the project to be git-initialized. 
The `bruin init` command handles this automatically.\n\n## Project structure\n\n```text\nmy-first-pipeline/\n├── .bruin.yml              # Environment and connection configuration\n├── pipeline.yml            # Pipeline name, schedule, default connections\n└── assets/\n    ├── players.asset.yml   # Ingestr asset (data ingestion)\n    ├── player_stats.sql    # SQL asset with quality checks\n    └── my_python_asset.py  # Python asset\n```\n\n### .bruin.yml\n\n- Stays local only (auto-added to `.gitignore`)\n- Never push this to your repo — it contains database connections and secrets\n- Defines environments (default, production, staging, etc.)\n- Under each environment, define connections (e.g. DuckDB, Chess.com, custom secrets)\n\n```yaml\ndefault_environment: default\n\nenvironments:\n  default:\n    connections:\n      duckdb:\n        - name: duckdb-default\n          path: duckdb.db\n```\n\n### pipeline.yml\n\nConfigures the pipeline: name, schedule, default connection, start date.\n\n```yaml\nname: my-pipeline\nschedule: daily\nstart_date: \"2022-01-01\"\ndefault_connections:\n  duckdb: duckdb-default\n```\n\n## Asset types\n\n### Python asset\n\nSimplest form: a Python script with a name that prints or processes data. Run from the Bruin panel in your IDE.\n\n### YAML ingestor asset\n\nUses Bruin's built-in ingestor. Define source connection, destination, and table. Supports many built-in sources and destinations: Redshift, MySQL, Postgres, Motherduck, BigQuery, etc. Automatically creates the destination database/table if it doesn't exist.\n\n### SQL asset\n\nRuns SQL queries against your database. 
Define dependencies to other assets — when a dependency finishes, this asset runs automatically.\n\n## Intervals and incremental ingestion\n\n- Set `start_date` and `end_date` parameters to ingest data for a specific time range\n- Bruin provides these as variables you can inject into your code\n- Built-in ingestion assets automatically use the start/end dates\n\n## Dependencies and lineage\n\n- Define dependencies between assets so they run in the correct order\n- When the first asset completes, it automatically triggers the next dependent asset\n- Bruin builds a lineage graph from these dependencies\n\n## Key CLI commands\n\n| Command | Purpose |\n|---------|---------|\n| `bruin validate <path>` | Check syntax and dependencies without running |\n| `bruin run <path>` | Execute pipeline or individual asset |\n| `bruin run --downstream` | Run asset and all downstream dependencies |\n| `bruin run --full-refresh` | Truncate and rebuild tables from scratch |\n| `bruin lineage <path>` | View asset dependencies |\n| `bruin query --connection <conn> --query \"...\"` | Execute ad-hoc SQL queries |\n"
  },
  {
    "path": "05-data-platforms/notes/03-nyc-taxi-pipeline.md",
    "content": "# 5.3 - Building an End-to-End Pipeline with NYC Taxi Data\n\n## Architecture\n\nThree-layered pipeline using DuckDB as a locally hosted database:\n\n1. Ingestion layer: extract data and store in raw format\n2. Staging layer: pre-process, clean, transform, join with lookup tables\n3. Reports layer: aggregate data and run calculations\n\nAll assets have dependencies that create the data lineage Bruin uses for orchestration.\n\n## Project setup\n\nInitialize from the zoomcamp template:\n\n```bash\nbruin init zoomcamp my-taxi-pipeline\ncd my-taxi-pipeline\n```\n\nProject structure:\n\n```text\nzoomcamp/\n├── .bruin.yml\n├── README.md\n└── pipeline/\n    ├── pipeline.yml\n    └── assets/\n        ├── ingestion/\n        │   ├── trips.py\n        │   ├── requirements.txt\n        │   ├── payment_lookup.asset.yml\n        │   └── payment_lookup.csv\n        ├── staging/\n        │   └── trips.sql\n        └── reports/\n            └── trips_report.sql\n```\n\n### .bruin.yml\n\n```yaml\ndefault_environment: default\n\nenvironments:\n  default:\n    connections:\n      duckdb:\n        - name: duckdb-default\n          path: duckdb.db\n```\n\n### pipeline.yml\n\n```yaml\nname: nyc_taxi\nschedule: daily\nstart_date: \"2022-01-01\"\ndefault_connections:\n  duckdb: duckdb-default\nvariables:\n  taxi_types:\n    type: array\n    items:\n      type: string\n    default: [\"yellow\"]\n```\n\n- `start_date`: when running a full refresh, process data starting from this date\n- Custom variables: `taxi_types` lets you control which taxi types to ingest (yellow, green, or both)\n- Variables can be overridden at runtime with `--var`\n\n## Ingestion layer\n\n### Python asset: trips.py\n\nThe Python asset connects to the NYC taxi API and extracts data.\n\n```python\n\"\"\"@bruin\nname: ingestion.trips\ntype: python\nimage: python:3.11\n\nmaterialization:\n  type: table\n  strategy: append\n\ncolumns:\n  - name: pickup_datetime\n    type: timestamp\n    description: 
\"When the meter was engaged\"\n  - name: dropoff_datetime\n    type: timestamp\n    description: \"When the meter was disengaged\"\n@bruin\"\"\"\n\nimport os\nimport json\nimport pandas as pd\n\ndef materialize():\n    start_date = os.environ[\"BRUIN_START_DATE\"]\n    end_date = os.environ[\"BRUIN_END_DATE\"]\n    taxi_types = json.loads(os.environ[\"BRUIN_VARS\"]).get(\"taxi_types\", [\"yellow\"])\n\n    # Generate list of months between start and end dates\n    # Fetch parquet files from:\n    # https://d37ci6vzurychx.cloudfront.net/trip-data/{taxi_type}_tripdata_{year}-{month}.parquet\n\n    return final_dataframe\n```\n\n- `materialize()` returns a DataFrame; Bruin handles inserting it into the destination\n- `append` strategy: each run inserts data without touching existing rows\n- Uses `BRUIN_START_DATE` / `BRUIN_END_DATE` environment variables for the time window\n- Uses `BRUIN_VARS` to read the `taxi_types` pipeline variable\n\n### Seed file: payment_lookup.asset.yml\n\nSeed files ingest data from local CSV files into the database.\n\n```yaml\nname: ingestion.payment_lookup\ntype: duckdb.seed\nparameters:\n  path: payment_lookup.csv\ncolumns:\n  - name: payment_type_id\n    type: integer\n    description: \"Numeric code for payment type\"\n    primary_key: true\n    checks:\n      - name: not_null\n      - name: unique\n  - name: payment_type_name\n    type: string\n    description: \"Human-readable payment type\"\n    checks:\n      - name: not_null\n```\n\npayment_lookup.csv:\n\n```csv\npayment_type_id,payment_type_name\n0,flex_fare\n1,credit_card\n2,cash\n3,no_charge\n4,dispute\n5,unknown\n6,voided_trip\n```\n\nQuality checks (`not_null`, `unique`) run automatically after the asset finishes.\n\n### requirements.txt\n\n```\npandas\nrequests\npyarrow\npython-dateutil\n```\n\nBruin handles the environment and installs dependencies locally within the pipeline.\n\n## Staging layer\n\n### SQL asset: staging/trips.sql\n\n```sql\n/* @bruin\nname: 
staging.trips\ntype: duckdb.sql\n\ndepends:\n  - ingestion.trips\n  - ingestion.payment_lookup\n\nmaterialization:\n  type: table\n  strategy: time_interval\n  incremental_key: pickup_datetime\n  time_granularity: timestamp\n\ncolumns:\n  - name: pickup_datetime\n    type: timestamp\n    primary_key: true\n    checks:\n      - name: not_null\n\ncustom_checks:\n  - name: row_count_greater_than_zero\n    query: |\n      SELECT CASE WHEN COUNT(*) > 0 THEN 1 ELSE 0 END\n      FROM staging.trips\n    value: 1\n@bruin */\n\nSELECT\n    t.pickup_datetime,\n    t.dropoff_datetime,\n    t.pickup_location_id,\n    t.dropoff_location_id,\n    t.fare_amount,\n    t.taxi_type,\n    p.payment_type_name\nFROM ingestion.trips t\nLEFT JOIN ingestion.payment_lookup p\n    ON t.payment_type = p.payment_type_id\nWHERE t.pickup_datetime >= '{{ start_datetime }}'\n  AND t.pickup_datetime < '{{ end_datetime }}'\nQUALIFY ROW_NUMBER() OVER (\n    PARTITION BY t.pickup_datetime, t.dropoff_datetime,\n                 t.pickup_location_id, t.dropoff_location_id, t.fare_amount\n    ORDER BY t.pickup_datetime\n) = 1\n```\n\n- `time_interval` strategy: deletes rows in the time window, then inserts the query result\n- The `WHERE` clause must filter to the same time window to avoid duplicates\n- `QUALIFY ROW_NUMBER()` deduplicates using a composite key\n- Dependencies on both `ingestion.trips` and `ingestion.payment_lookup` ensure this runs after ingestion\n\n## Reports layer\n\n### SQL asset: reports/trips_report.sql\n\n```sql\n/* @bruin\nname: reports.trips_report\ntype: duckdb.sql\n\ndepends:\n  - staging.trips\n\nmaterialization:\n  type: table\n  strategy: time_interval\n  incremental_key: trip_date\n  time_granularity: date\n\ncolumns:\n  - name: trip_date\n    type: date\n    primary_key: true\n  - name: taxi_type\n    type: string\n    primary_key: true\n  - name: payment_type\n    type: string\n    primary_key: true\n  - name: trip_count\n    type: bigint\n    checks:\n      - name: 
non_negative\n@bruin */\n\nSELECT\n    CAST(pickup_datetime AS DATE) AS trip_date,\n    taxi_type,\n    payment_type_name AS payment_type,\n    COUNT(*) AS trip_count,\n    SUM(fare_amount) AS total_fare,\n    AVG(fare_amount) AS avg_fare\nFROM staging.trips\nWHERE pickup_datetime >= '{{ start_datetime }}'\n  AND pickup_datetime < '{{ end_datetime }}'\nGROUP BY 1, 2, 3\n```\n\n## Running the full pipeline\n\n```bash\n# Validate structure and definitions\nbruin validate ./pipeline/pipeline.yml\n\n# Run with a small date range for testing\nbruin run ./pipeline/pipeline.yml --start-date 2022-01-01 --end-date 2022-02-01\n\n# Full refresh\nbruin run ./pipeline/pipeline.yml --full-refresh\n\n# Query results\nbruin query --connection duckdb-default --query \"SELECT COUNT(*) FROM ingestion.trips\"\n```\n\nOpen the pipeline YAML file in the Bruin panel and view the lineage tab to see all assets and their dependencies. Execution order:\n\n1. Ingestion assets run first (trips + lookup, in parallel)\n2. Staging asset runs after both ingestion assets complete\n3. Report asset runs after staging completes\n\n## Materialization strategies summary\n\n| Strategy | Behavior |\n|----------|----------|\n| `table` | Drop and recreate the table each time |\n| `append` | Insert new data without touching existing rows |\n| `merge` | Upsert based on key columns |\n| `time_interval` | Delete rows in date range, then re-insert |\n| `delete+insert` | Delete matching rows, then insert |\n| `create+replace` | Create or replace the table |\n"
  },
  {
    "path": "05-data-platforms/notes/04-bruin-mcp.md",
    "content": "# 5.4 - Using Bruin MCP with AI Agents\n\n## What is Bruin MCP?\n\nMCP stands for **Model Context Protocol**. Bruin MCP is a way for AI agents (in Cursor, VS Code, Claude, etc.) to communicate with Bruin — querying documentation, running commands on your behalf, going through your code, troubleshooting, and analyzing data.\n\nWith the Bruin MCP and an AI agent, you can:\n\n- Write pipeline code and asset configurations\n- Write documentation and metadata\n- Troubleshoot errors and debug issues\n- Run queries and analyze data using natural language\n- Ask questions about your pipeline logic and structure\n\n## Installing Bruin MCP\n\nMake sure you have [Bruin CLI installed](https://getbruin.com/docs/bruin/getting-started/introduction/installation) first.\n\n### Cursor\n\nGo to **Settings → Tools & MCP → New MCP Server** and add:\n\n```json\n{\n  \"mcpServers\": {\n    \"bruin\": {\n      \"command\": \"bruin\",\n      \"args\": [\"mcp\"]\n    }\n  }\n}\n```\n\nIf it shows a failure/error, close and reopen your IDE — you should see \"Bruin enabled\".\n\n### VS Code (Copilot)\n\nCreate `.vscode/mcp.json` in your project folder:\n\n```json\n{\n  \"servers\": {\n    \"bruin\": {\n      \"command\": \"bruin\",\n      \"args\": [\"mcp\"]\n    }\n  }\n}\n```\n\n### Claude Code\n\n```bash\nclaude mcp add bruin -- bruin mcp\n```\n\nSee the full [Bruin MCP documentation](https://getbruin.com/docs/bruin/getting-started/bruin-mcp) for other agents and troubleshooting.\n\n## Building a pipeline with MCP\n\n### Using the template prompt\n\nThe zoomcamp template includes an example prompt in its README that you can give to the AI agent to create the entire pipeline end-to-end:\n\n```bash\nbruin init zoomcamp my-taxi-pipeline\n```\n\nOpen the generated `README.md` — it contains a prompt you can paste into the agent to scaffold the entire pipeline automatically.\n\n### What the agent does\n\nWhen given the pipeline prompt, the agent will:\n\n1. 
Create all pipeline assets (ingestion, staging, reports)\n2. Configure materialization strategies and dependencies\n3. Set up quality checks and column metadata\n4. Validate the pipeline with `bruin validate`\n5. Run the pipeline with a test date range\n6. Run custom checks to validate query logic\n7. Execute verification queries using `bruin query`\n\n### Working incrementally\n\nIn practice, you may prefer working asset by asset rather than generating everything at once. This lets you be involved in every design choice:\n\n- Create and test the ingestion asset first\n- Then build the staging layer\n- Then add the reports layer\n- Review and adjust quality checks at each step\n\n## Querying data with the agent\n\nOnce your pipeline has run, you can use the agent conversationally to query your data:\n\n**Example queries:**\n- \"Query the staging table and tell me how many days of data we have\"\n- \"Which day had the highest number of trips and total fare?\"\n- \"In which asset are we aggregating data?\"\n\nThe agent understands the context of your pipeline — it knows the table structures, can write SQL queries, and can explain the logic behind each asset. This is useful for:\n\n- Ad hoc analysis without writing SQL manually\n- Understanding unfamiliar pipeline logic\n- Data validation and troubleshooting\n- Onboarding new team members to an existing pipeline\n"
  },
  {
    "path": "05-data-platforms/notes/05-bruin-cloud.md",
    "content": "# 5.5 - Deploying to Bruin Cloud\n\n## What is Bruin Cloud?\n\nBruin Cloud is a fully managed infrastructure for your data pipelines. It is powered by the same open-source CLI tool you use locally for development. Everything lives in the same place:\n\n- Ingestions and transformations\n- Quality checks and monitoring\n- Lineage and metadata\n- Data governance\n- AI-powered features (automatic metadata generation, conversational data analysis)\n\n## Registration\n\n1. Go to [Bruin Cloud](https://getbruin.com/) and sign up\n2. Fill out your name, email, and set a password\n3. Verify your email by clicking the link in the verification email\n4. Choose to join an existing team or create a new organization\n5. Give your organization a name\n\n## Connecting your GitHub repository\n\nYou have two options:\n\n1. **Direct GitHub connection** (recommended) — connect your GitHub account directly and select your repo from a dropdown\n2. **Personal Access Token** — provide a GitHub personal access token and your repo link manually\n\n## Setting up connections\n\nAfter connecting your repo, set up your data warehouse connections. These are the same connections you configure locally in `.bruin.yml`, but stored securely in the cloud.\n\n1. Go to the connections page\n2. Select your connection type (MotherDuck, BigQuery, Redshift, etc.)\n3. Give it the same connection name you use locally\n4. Provide the required credentials (e.g., service token, database name)\n5. The connection will be validated and tested automatically\n\nRead the Bruin documentation for details on how secrets are stored securely.\n\n## Deploying pipelines\n\n1. Navigate to the **Pipelines** page to see the list of pipelines from your repository\n2. Bruin will validate every asset and ensure lineage and connections work (this takes a moment)\n3. Once ready, **enable** the pipeline\n\nWhen you enable a pipeline with a schedule, Bruin automatically creates a run for the last interval. 
For example, a monthly pipeline will immediately process the previous month's data.\n\n## Monitoring\n\nAfter a pipeline runs:\n\n- Check the status of each asset (success/failure)\n- Review quality check results\n- View lineage across all assets\n- Use AI-powered features to analyze data or ask questions about your pipelines\n\n## Getting help\n\n- Join the [Bruin Slack community](https://getbruin.com/) for questions and feature requests\n- Submit issues on [GitHub](https://github.com/bruin-data/bruin)\n"
  },
  {
    "path": "05-data-platforms/notes/06-core-01-projects.md",
    "content": "# 5.6 - Core Concepts: Projects\n\n🎥 [Bruin Core Concepts | Projects](https://www.youtube.com/watch?v=YWDjnSxbBtY) (3:03)\n\n## What is a Project?\n\nA **Project** is the root directory where you create your entire Bruin data pipeline. It serves as the foundation for organizing all your data assets, configurations, and connections.\n\n## Project Initialization\n\nThe project must be initialized with `bruin init` so the CLI tool can understand the directory structure and navigate files correctly.\n\n```bash\nbruin init zoomcamp my-pipeline\ncd my-pipeline\n```\n\n## The `.bruin.yml` File\n\nLocated at the root of your project, this file defines environments, connections, and secrets.\n\n**Important:** This file is always added to `.gitignore` to protect secrets. It stays local only and should never be pushed to your repo.\n\n### Environments\n\nDefine different environments for various stages:\n\n```yaml\ndefault_environment: default\n\nenvironments:\n  default:\n    connections:\n      duckdb:\n        - name: duckdb-default\n          path: duckdb.db\n      motherduck:\n        - name: motherduck\n          token: <your-token>\n\n  production:\n    connections:\n      bigquery:\n        - name: bq-prod\n          project: my-project\n          dataset: production\n```\n\n**Benefits:**\n- Run pipelines locally or on servers without exposing production credentials\n- Different teams can have different connection access\n- Default to `dev` environment to prevent accidental production runs\n\n### Connection Types\n\nBuilt-in connections include:\n- DuckDB, MotherDuck\n- PostgreSQL, MySQL\n- BigQuery, Redshift, Snowflake\n- Custom connections (for API keys, secrets, etc.)\n\n### Default Environment\n\nSet which environment is used by default:\n\n```yaml\ndefault_environment: dev\n```\n\nThis ensures pipelines run on development unless explicitly told to use production.\n\n## Quick Reference\n\n```bash\n# Initialize a new project\nbruin init zoomcamp 
my-pipeline\n\n# Navigate to your project\ncd my-pipeline\n\n# Check project is valid\nbruin validate .\n```\n\n## Further Reading\n\n- [Bruin Documentation - Projects](https://getbruin.com/docs/bruin/core-concepts/project.html)\n- [Bruin GitHub - Templates](https://github.com/bruin-data/bruin/tree/main/templates)\n"
  },
  {
    "path": "05-data-platforms/notes/06-core-02-pipelines.md",
    "content": "# 5.6 - Core Concepts: Pipelines\n\n🎥 [Bruin Core Concepts | Pipelines](https://www.youtube.com/watch?v=uzp_DiR4Sok) (3:13)\n\n## What is a Pipeline?\n\nA **Pipeline** is a grouping mechanism for organizing assets based on their execution schedule and configuration requirements. Within a project, you can have multiple pipelines.\n\n## Key Characteristics\n\n### Single Schedule\n\nEach pipeline has **one schedule** - this is the primary reason to group assets together:\n- Assets with the same schedule belong in the same pipeline\n- Common schedules: `hourly`, `daily`, `monthly`, or cron expressions\n\n### Pipeline Structure\n\nEach pipeline has its own folder containing a `pipeline.yml` file:\n\n```text\nproject/\n├── .bruin.yml\n├── pipelines/\n│   ├── nyc-taxi/\n│   │   ├── pipeline.yml\n│   │   └── assets/\n│   └── another-pipeline/\n│       ├── pipeline.yml\n│       └── assets/\n```\n\n## The `pipeline.yml` File\n\n```yaml\nname: nyc_taxi\nschedule: monthly\nstart_date: \"2019-01-01\"\ndefault_connections:\n  duckdb: duckdb-default\n```\n\n### Configuration Options\n\n| Setting | Description |\n|---------|-------------|\n| `name` | Pipeline identifier |\n| `schedule` | When to run (cron, daily, monthly, etc.) 
|\n| `start_date` | When the pipeline starts being active |\n| `default_connections` | Which connections to use |\n| `variables` | Custom variables for the pipeline |\n\n### Connection Scoping\n\nEven though connections are defined at the project level (`.bruin.yml`), each pipeline specifies **which connections it uses**.\n\n**Why this matters:**\n- In large organizations, different teams may need different credentials\n- Prevents unnecessary exposure of secrets\n- Only initializes connections needed for the specific pipeline run\n- Security isolation between departments\n\n## Quick Reference\n\n```bash\n# Validate a pipeline\nbruin validate ./pipelines/nyc-taxi/pipeline.yml\n\n# View pipeline lineage\nbruin lineage ./pipelines/nyc-taxi/pipeline.yml\n\n# Run the entire pipeline\nbruin run ./pipelines/nyc-taxi/pipeline.yml\n```\n\n## Further Reading\n\n- [Bruin Documentation - Pipelines](https://getbruin.com/docs/bruin/pipelines/definition.html)\n- [Pipeline Configuration Reference](https://getbruin.com/docs/bruin/pipelines/definition.html)\n"
  },
  {
    "path": "05-data-platforms/notes/06-core-03-assets.md",
    "content": "# 5.6 - Core Concepts: Assets\n\n🎥 [Bruin Core Concepts | Assets](https://www.youtube.com/watch?v=ZElY5SoqrwI) (6:11)\n\n## What is an Asset?\n\nAn **Asset** is a single file that performs a specific task, almost always related to creating or updating a table or view in the destination database.\n\nEach asset file contains two parts:\n\n1. **Definition** (Configuration) - Metadata, name, type, connection\n2. **Content** (Code) - The actual SQL, Python, or R code to execute\n\n## Asset Types\n\n| Type | Description | Use Case |\n|------|-------------|----------|\n| **Python** | Python scripts | Ingestion, data processing, ML models |\n| **SQL** | SQL queries | Transformations, aggregations |\n| **YAML/Seed** | File-based tables | Reference data, static lookups |\n| **R** | R scripts | Statistical analysis, R-specific workflows |\n\n## Asset Naming\n\nThe asset name can be:\n1. **Explicitly defined** in the decorator\n2. **Inferred from file path** (default behavior)\n\n**Convention:** Group assets by schema/dataset:\n- `assets/raw/trips_raw.py` → Creates table `raw.trips_raw`\n- `assets/staging/trips_summary.sql` → Creates table `staging.trips_summary`\n\n## SQL Asset Example\n\n```sql\n@bruin.asset(\n    name=\"staging.trips_summary\",\n    type=\"sql\",\n    connection=\"duckdb-default\",\n    materialization=\"table\"\n)\n\nSELECT\n    pickup_date,\n    COUNT(*) as trip_count,\n    SUM(fare_amount) as total_fare\nFROM raw.trips_raw\nWHERE pickup_date >= '{{ start_date }}'\n  AND pickup_date < '{{ end_date }}'\nGROUP BY pickup_date\n```\n\n### Materialization Strategies\n\n| Strategy | Behavior |\n|----------|----------|\n| `table` | Recreates the table on each run |\n| `view` | Creates a view (no data stored) |\n| `insert` | Appends new data to existing table |\n| `incremental` | Smart merge based on key columns |\n\n## Python Asset Example (Ingestion)\n\n```python\n@bruin.asset(\n    name=\"raw.trips_raw\",\n    type=\"python\",\n    
connection=\"duckdb-default\"\n)\ndef ingest_trips():\n    import requests\n    import pandas as pd\n\n    # Connect to API, fetch data\n    response = requests.get(\"https://api.example.com/trips\")\n    data = response.json()\n\n    # Return pandas DataFrame\n    # Bruin handles materialization to database\n    return pd.DataFrame(data)\n```\n\n## YAML/Seed Asset Example\n\n```yaml\n@bruin.asset(\n    name=\"lookup.taxi_types\",\n    type=\"seed\",\n    connection=\"duckdb-default\"\n)\n\npath: reference_data/taxi_types.csv\n```\n\nSimply loads a local CSV file and creates a table in the destination database.\n\n## Lineage & Dependencies\n\nAssets automatically define dependencies based on what they read:\n\n- If Asset B reads from Asset A's table, **B depends on A**\n- Visualized in VS Code extension\n- Used for execution ordering during runs\n\n```sql\n-- This asset depends on raw.trips_raw\n@bruin.asset(name=\"staging.trips_summary\", type=\"sql\")\nSELECT * FROM raw.trips_raw  -- Creates dependency\n```\n\n## Quick Reference\n\n```bash\n# Run a specific asset\nbruin run ./pipeline.yml --asset raw.trips_raw\n\n# Run asset with all downstream dependencies\nbruin run ./pipeline.yml --asset raw.trips_raw --downstream\n\n# Run asset with all upstream dependencies\nbruin run ./pipeline.yml --asset staging.trips_summary --upstream\n\n# View lineage for an asset\nbruin lineage ./pipeline.yml --asset raw.trips_raw\n```\n\n## Further Reading\n\n- [Bruin Documentation - Assets](https://getbruin.com/docs/bruin/assets/definition-schema.html)\n- [Materialization Strategies](https://getbruin.com/docs/bruin/assets/materialization.html)\n"
  },
  {
    "path": "05-data-platforms/notes/06-core-04-variables.md",
    "content": "# 5.6 - Core Concepts: Variables\n\n🎥 [Bruin Core Concepts | Variables](https://www.youtube.com/watch?v=XCx0nDmhhxA) (6:03)\n\n## What are Variables?\n\n**Variables** are dynamically initialized each time a pipeline run is created. They allow you to parameterize your pipelines and pass dynamic values at runtime.\n\n## Variable Types\n\n### 1. Built-in Variables\n\nAlways provided by Bruin automatically:\n\n| Variable | Description |\n|----------|-------------|\n| `start_date` | Beginning of the scheduled interval |\n| `end_date` | End of the scheduled interval |\n\nThese dates are determined by the pipeline's schedule:\n\n| Schedule | Start Date | End Date |\n|----------|------------|----------|\n| **Monthly** | First day of month | Last day of month |\n| **Daily** | Start of day | End of day |\n| **Hourly** | Start of hour | End of hour |\n\n#### SQL Assets - Jinja Format\n\nIn SQL, variables are injected using Jinja templating:\n\n```sql\n@bruin.asset(name=\"staging.monthly_trips\", type=\"sql\")\nSELECT *\nFROM raw.trips\nWHERE pickup_date >= '{{ start_date }}'\n  AND pickup_date < '{{ end_date }}'\n```\n\nUse the **Bruin Render panel** in VS Code to see the compiled query with actual values.\n\n#### Python Assets - Environment Variables\n\nIn Python, variables are accessed via environment variables:\n\n```python\nimport os\nfrom datetime import datetime\n\n@bruin.asset(name=\"raw.monthly_data\", type=\"python\")\ndef ingest_monthly_data():\n    start_date = os.environ['BRUIN_VAR_START_DATE']\n    end_date = os.environ['BRUIN_VAR_END_DATE']\n\n    # Parse and use dates to fetch data for specific period\n    start = datetime.fromisoformat(start_date)\n    end = datetime.fromisoformat(end_date)\n\n    # Loop through months in range\n    # ...\n```\n\n### 2. 
Custom Variables\n\nUser-defined variables set at the pipeline level.\n\n#### Definition in `pipeline.yml`\n\n```yaml\nvariables:\n  - name: taxi_types\n    type: array\n    default:\n      - \"yellow\"\n```\n\n#### Override at Runtime\n\nChange default values when creating a run. Wrap the argument in single quotes so the shell passes the JSON array through intact instead of stripping the double quotes or expanding the brackets:\n\n```bash\nbruin run ./pipeline.yml --var 'taxi_types=[\"green\",\"fhv\"]'\n```\n\n#### Accessing Custom Variables in Python\n\n```python\nimport os\nimport json\n\n@bruin.asset(name=\"example.asset\", type=\"python\")\ndef example_asset():\n    # Custom variables are prefixed with BRUIN_VAR_\n    taxi_types_json = os.environ['BRUIN_VAR_TAXI_TYPES']\n    taxi_types = json.loads(taxi_types_json)\n\n    # Use the variable in your code\n    for taxi_type in taxi_types:\n        # Process each taxi type\n        pass\n```\n\n## VS Code Extension Panel\n\nFrom the Bruin panel in VS Code/Cursor:\n\n1. **Variable Override** - Set custom variable values before running\n2. **Bruin Render** - See how Jinja templates are compiled with actual values\n3. 
**Run Configuration** - Set dates, environment, and variables\n\n## Practical Use Cases\n\n| Use Case | Description |\n|----------|-------------|\n| **Date-based partitioning** | Extract data for specific time periods |\n| **Multi-tenant processing** | Run the same pipeline for different customers |\n| **Parameterized transformations** | Change logic based on variables |\n| **A/B testing** | Test different configurations without code changes |\n\n## Quick Reference\n\n```bash\n# Run with custom dates\nbruin run ./pipeline.yml --start-date 2020-01-01 --end-date 2020-01-31\n\n# Run with variable override (array; single-quoted so the shell keeps the JSON intact)\nbruin run ./pipeline.yml --var 'taxi_types=[\"green\",\"fhv\"]'\n\n# Run with variable override (string)\nbruin run ./pipeline.yml --var customer_id=12345\n\n# Run with full refresh (affects materialization)\nbruin run ./pipeline.yml --full-refresh\n\n# Set end date as exclusive\nbruin run ./pipeline.yml --exclusive-end-date\n```\n\n## Further Reading\n\n- [Bruin Documentation - Variables](https://getbruin.com/docs/bruin/core-concepts/variables.html)\n- [Pipeline Runtime Options](https://getbruin.com/docs/bruin/commands/run.html)\n"
  },
  {
    "path": "05-data-platforms/notes/06-core-05-commands.md",
    "content": "# 5.6 - Core Concepts: Commands\n\n🎥 [Bruin Core Concepts | Commands](https://www.youtube.com/watch?v=3nykPEs_V7E) (6:46)\n\n## Bruin CLI Commands\n\nCommands are how you interact with your Bruin project - running pipelines, validating configurations, querying data, and more.\n\n## `bruin run` - Execute a Pipeline\n\nCreates a **single execution instance** (a \"run\") of your pipeline.\n\n### Basic Usage\n\n```bash\nbruin run ./pipelines/nyc-taxi/pipeline.yml\n```\n\n### Run Scope Options\n\n| Option | Description |\n|--------|-------------|\n| Entire pipeline | Runs all assets in dependency order |\n| Single asset | `--asset staging.trips_summary` |\n| With upstream | `--asset X --upstream` - Runs X plus all dependencies |\n| With downstream | `--asset X --downstream` - Runs X plus all dependents |\n\n### Common Run Flags\n\n| Flag | Description |\n|------|-------------|\n| `--start-date DATE` | Set execution start date |\n| `--end-date DATE` | Set execution end date |\n| `--full-refresh` | Drop and recreate tables (overrides incremental) |\n| `--exclusive-end-date` | End date is exclusive (default: inclusive) |\n| `--environment ENV` | Use specific environment (dev/prod) |\n| `--var KEY=VALUE` | Override custom variables |\n\n### Example Run Commands\n\n```bash\n# Simple run\nbruin run ./pipelines/nyc-taxi/pipeline.yml\n\n# With date range\nbruin run ./pipelines/nyc-taxi/pipeline.yml \\\n  --start-date 2020-01-01 \\\n  --end-date 2020-01-31\n\n# Full refresh with variables\nbruin run ./pipelines/nyc-taxi/pipeline.yml \\\n  --full-refresh \\\n  --var taxi_types=[\"yellow\",\"green\"] \\\n  --environment default\n```\n\n## `bruin validate` - Validate Pipeline\n\nChecks for configuration issues before running:\n\n```bash\nbruin validate ./pipelines/nyc-taxi/pipeline.yml\n```\n\n**Validates:**\n- No circular dependencies in lineage\n- Asset definitions are correct\n- Connections exist and are properly configured\n- No broken references\n\n**Always 
validate before running!**\n\n## `bruin lineage` - View Dependency Graph\n\nVisualize how assets are connected:\n\n```bash\nbruin lineage ./pipelines/nyc-taxi/pipeline.yml\n```\n\nShows upstream and downstream relationships between assets.\n\n## `bruin query` - Query Data\n\nRun ad-hoc queries against your connections:\n\n```bash\nbruin query --connection duckdb-default \\\n  --query \"SELECT * FROM ingestion.trips LIMIT 10\"\n```\n\n## What is a \"Run\"?\n\nA **run** is a single instance of pipeline execution:\n- Has unique start/end times\n- May run all assets or a subset\n- Has its own variable values\n- Creates execution logs and results\n\n## Putting It All Together\n\nThe complete Bruin workflow:\n\n```\n1. Project (root, initialized)\n   └── .bruin.yml (environments, connections)\n\n2. Pipeline (scheduled grouping)\n   └── pipeline.yml (schedule, default connection, variables)\n\n3. Assets (the actual work)\n   ├── Python (ingestion, processing)\n   ├── SQL (transformations)\n   └── YAML/Seed (static data)\n\n4. Commands (make it happen)\n   ├── bruin run (execute)\n   ├── bruin validate (check)\n   └── bruin query (inspect)\n```\n\n## Quick Reference\n\n```bash\n# Initialize new project\nbruin init zoomcamp my-pipeline\n\n# Validate before running\nbruin validate ./pipeline/pipeline.yml\n\n# Run entire pipeline\nbruin run ./pipeline/pipeline.yml\n\n# Run with date range\nbruin run ./pipeline/pipeline.yml \\\n  --start-date 2020-01-01 \\\n  --end-date 2020-01-31\n\n# Run single asset with downstream\nbruin run ./pipeline/pipeline.yml \\\n  --asset raw.trips \\\n  --downstream\n\n# View lineage\nbruin lineage ./pipeline/pipeline.yml\n\n# Query a table\nbruin query --connection duckdb-default \\\n  --query \"SELECT COUNT(*) FROM staging.trips\"\n```\n\n## Further Reading\n\n- [Bruin Documentation - CLI Reference](https://getbruin.com/docs/bruin/commands/overview.html)\n- [Bruin GitHub Repository](https://github.com/bruin-data/bruin)\n"
  },
  {
    "path": "06-batch/.gitignore",
    "content": ""
  },
  {
    "path": "06-batch/README.md",
    "content": "# Module 6: Batch Processing\n\n## 6.1 Introduction\n\n* :movie_camera: 6.1.1 Introduction to Batch Processing\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/dcHe5Fl3MF8)](https://youtu.be/dcHe5Fl3MF8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=51)\n\n* :movie_camera: 6.1.2 Introduction to Spark\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/FhaqbEOuQ8U)](https://youtu.be/FhaqbEOuQ8U&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=52)\n\n\n## 6.2 Installation\n\nFollow [these instructions](setup/) to install Spark:\n\n* [Windows](setup/windows.md)\n* [Linux](setup/linux.md)\n* [MacOS](setup/macos.md)\n\n:movie_camera: 6.2.1 (Optional) Installing Spark (Linux)\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/hqUbB9c8sKg)](https://youtu.be/hqUbB9c8sKg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=53)\n\nAlternatively, if the setups above don't work, you can run Spark in Google Colab.\n> [!NOTE]  \n> It's advisable to invest some time in setting things up locally rather than immediately jumping into this solution\n\n* [Google Colab Instructions](https://medium.com/gitconnected/launch-spark-on-google-colab-and-connect-to-sparkui-342cad19b304)\n* [Google Colab Starter Notebook](https://github.com/aaalexlit/medium_articles/blob/main/Spark_in_Colab.ipynb)\n\n\n## 6.3 Spark SQL and DataFrames\n\n* :movie_camera: 6.3.1 First Look at Spark/PySpark\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/r_Sf6fCB40c)](https://youtu.be/r_Sf6fCB40c&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=54)\n\n* :movie_camera: 6.3.2 Spark Dataframes\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/ti3aC1m3rE8)](https://youtu.be/ti3aC1m3rE8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=55)\n\n* :movie_camera: 6.3.3 (Optional) Preparing Yellow and Green Taxi Data\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/CI3P4tAtru4)](https://youtu.be/CI3P4tAtru4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=56)\n\nScript to prepare the Dataset 
[download_data.sh](code/download_data.sh)\n\n> [!NOTE]  \n> Another way to infer the schema of the CSV files (apart from pandas) is to set the `inferSchema` option to `true` when reading them in Spark.\n\n* :movie_camera: 6.3.4 SQL with Spark\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/uAlp2VuZZPY)](https://youtu.be/uAlp2VuZZPY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=57)\n\n\n## 6.4 Spark Internals\n\n* :movie_camera: 6.4.1 Anatomy of a Spark Cluster\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/68CipcZt7ZA)](https://youtu.be/68CipcZt7ZA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=58)\n\n* :movie_camera: 6.4.2 GroupBy in Spark\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/9qrDsY_2COo)](https://youtu.be/9qrDsY_2COo&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=59)\n\n* :movie_camera: 6.4.3 Joins in Spark\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/lu7TrqAWuH4)](https://youtu.be/lu7TrqAWuH4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=60)\n\n## 6.5 (Optional) Resilient Distributed Datasets\n\n* :movie_camera: 6.5.1 Operations on Spark RDDs\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/Bdu-xIrF3OM)](https://youtu.be/Bdu-xIrF3OM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=61)\n\n* :movie_camera: 6.5.2 Spark RDD mapPartition\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/k3uB2K99roI)](https://youtu.be/k3uB2K99roI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=62)\n\n\n## 6.6 Running Spark in the Cloud\n\n* :movie_camera: 6.6.1 Connecting to Google Cloud Storage\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/Yyz293hBVcQ)](https://youtu.be/Yyz293hBVcQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=63)\n\n* :movie_camera: 6.6.2 Creating a Local Spark Cluster\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/HXBwSlXo5IA)](https://youtu.be/HXBwSlXo5IA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=64)\n\n* :movie_camera: 6.6.3 Setting up a Dataproc 
Cluster\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/osAiAYahvh8)](https://youtu.be/osAiAYahvh8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=65)\n\n* :movie_camera: 6.6.4 Connecting Spark to BigQuery\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/HIm2BOj8C0Q)](https://youtu.be/HIm2BOj8C0Q&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=66)\n\n\n# Homework\n\n* [2026 Homework](../cohorts/2026/06-batch/homework.md)\n\n\n# Community notes\n\n<details>\n<summary>Did you take notes? You can share them here</summary>\n\n* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/5_batch_processing.md)\n* [Sandy's DE Learning Blog](https://learningdataengineering540969211.wordpress.com/2022/02/24/week-5-de-zoomcamp-5-2-1-installing-spark-on-linux/)\n* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week5.md)\n* [Alternative: Using docker-compose to launch Spark, by rafik](https://gist.github.com/rafik-rahoui/f98df941c4ccced9c46e9ccbdef63a03)\n* [Marcos Torregrosa's blog (Spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-5-batch-spark)\n* [Notes by Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week5)\n* [Notes by Oscar Garcia](https://github.com/ozkary/Data-Engineering-Bootcamp/tree/main/Step5-Batch-Processing)\n* [Notes by HongWei](https://github.com/hwchua0209/data-engineering-zoomcamp-submission/blob/main/05-batch-processing/README.md)\n* [2024 videos transcript](https://drive.google.com/drive/folders/1XMmP4H5AMm1qCfMFxc_hqaPGw31KIVcb?usp=drive_link) by Maria Fisher\n* [2025 Notes by Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/5_Batch-Processing-Spark/README.md)\n* [2025 Notes by Gabi Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/05_batch_processing/00_notes.md)\n* [2025 Notes on Installing Spark on macOS (with Anaconda + brew) by Gabi 
Fonseca](https://github.com/fonsecagabriella/data_engineering/blob/main/05_batch_processing/01_env_setup.md)\n* [2025 Notes by Daniel Lachner](https://github.com/mossdet/dlp_data_eng/blob/main/Notes/05_01_Batch_Processing_Spark_GCP.pdf)\n* [2026 Notes by Ajay Katte](https://github.com/mushroomsandchai/dtdez/tree/main/06_batch_processing/notes)\n* Add your notes here (above this line)\n\n</details>\n"
  },
  {
    "path": "06-batch/code/03_test.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"72505747\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pyspark\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"bd55afbe\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"'/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/__init__.py'\"\n      ]\n     },\n     \"execution_count\": 3,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"pyspark.__file__\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"id\": \"29f1cf4c\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from pyspark.sql import SparkSession\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"id\": \"cf6d80ad\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING: An illegal reflective access operation has occurred\\n\",\n      \"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\\n\",\n      \"WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\\n\",\n      \"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\\n\",\n      \"WARNING: All illegal access operations will be denied in a future release\\n\",\n      \"22/02/15 22:22:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable\\n\",\n      \"Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\\n\",\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"spark = SparkSession.builder \\\\\\n\",\n    \"    .master(\\\"local[*]\\\") \\\\\\n\",\n    \"    .appName('test') \\\\\\n\",\n    \"    .getOrCreate()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"3f604529\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"--2022-02-15 22:23:22--  https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv\\n\",\n      \"Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.196.8\\n\",\n      \"Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.196.8|:443... connected.\\n\",\n      \"HTTP request sent, awaiting response... 
200 OK\\n\",\n      \"Length: 12322 (12K) [application/octet-stream]\\n\",\n      \"Saving to: ‘taxi+_zone_lookup.csv’\\n\",\n      \"\\n\",\n      \"taxi+_zone_lookup.c 100%[===================>]  12.03K  --.-KB/s    in 0s      \\n\",\n      \"\\n\",\n      \"2022-02-15 22:23:23 (114 MB/s) - ‘taxi+_zone_lookup.csv’ saved [12322/12322]\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"12342345\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\\"LocationID\\\",\\\"Borough\\\",\\\"Zone\\\",\\\"service_zone\\\"\\r\\n\",\n      \"\\r\\n\",\n      \"1,\\\"EWR\\\",\\\"Newark Airport\\\",\\\"EWR\\\"\\r\\n\",\n      \"\\r\\n\",\n      \"2,\\\"Queens\\\",\\\"Jamaica Bay\\\",\\\"Boro Zone\\\"\\r\\n\",\n      \"\\r\\n\",\n      \"3,\\\"Bronx\\\",\\\"Allerton/Pelham Gardens\\\",\\\"Boro Zone\\\"\\r\\n\",\n      \"\\r\\n\",\n      \"4,\\\"Manhattan\\\",\\\"Alphabet City\\\",\\\"Yellow Zone\\\"\\r\\n\",\n      \"\\r\\n\",\n      \"5,\\\"Staten Island\\\",\\\"Arden Heights\\\",\\\"Boro Zone\\\"\\r\\n\",\n      \"\\r\\n\",\n      \"6,\\\"Staten Island\\\",\\\"Arrochar/Fort Wadsworth\\\",\\\"Boro Zone\\\"\\r\\n\",\n      \"\\r\\n\",\n      \"7,\\\"Queens\\\",\\\"Astoria\\\",\\\"Boro Zone\\\"\\r\\n\",\n      \"\\r\\n\",\n      \"8,\\\"Queens\\\",\\\"Astoria Park\\\",\\\"Boro Zone\\\"\\r\\n\",\n      \"\\r\\n\",\n      \"9,\\\"Queens\\\",\\\"Auburndale\\\",\\\"Boro Zone\\\"\\r\\n\",\n      \"\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!head taxi_zone_lookup.csv\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"809464d0\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = spark.read \\\\\\n\",\n    \"    .option(\\\"header\\\", 
\\\"true\\\") \\\\\\n\",\n    \"    .csv('taxi_zone_lookup.csv')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"id\": \"e36dd996\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+----------+-------------+--------------------+------------+\\n\",\n      \"|LocationID|      Borough|                Zone|service_zone|\\n\",\n      \"+----------+-------------+--------------------+------------+\\n\",\n      \"|         1|          EWR|      Newark Airport|         EWR|\\n\",\n      \"|         2|       Queens|         Jamaica Bay|   Boro Zone|\\n\",\n      \"|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|\\n\",\n      \"|         4|    Manhattan|       Alphabet City| Yellow Zone|\\n\",\n      \"|         5|Staten Island|       Arden Heights|   Boro Zone|\\n\",\n      \"|         6|Staten Island|Arrochar/Fort Wad...|   Boro Zone|\\n\",\n      \"|         7|       Queens|             Astoria|   Boro Zone|\\n\",\n      \"|         8|       Queens|        Astoria Park|   Boro Zone|\\n\",\n      \"|         9|       Queens|          Auburndale|   Boro Zone|\\n\",\n      \"|        10|       Queens|        Baisley Park|   Boro Zone|\\n\",\n      \"|        11|     Brooklyn|          Bath Beach|   Boro Zone|\\n\",\n      \"|        12|    Manhattan|        Battery Park| Yellow Zone|\\n\",\n      \"|        13|    Manhattan|   Battery Park City| Yellow Zone|\\n\",\n      \"|        14|     Brooklyn|           Bay Ridge|   Boro Zone|\\n\",\n      \"|        15|       Queens|Bay Terrace/Fort ...|   Boro Zone|\\n\",\n      \"|        16|       Queens|             Bayside|   Boro Zone|\\n\",\n      \"|        17|     Brooklyn|             Bedford|   Boro Zone|\\n\",\n      \"|        18|        Bronx|        Bedford Park|   Boro Zone|\\n\",\n      \"|        19|       Queens|           Bellerose|   Boro Zone|\\n\",\n      \"|        20|        
Bronx|             Belmont|   Boro Zone|\\n\",\n      \"+----------+-------------+--------------------+------------+\\n\",\n      \"only showing top 20 rows\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"id\": \"cb547351\",\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\\n\",\n      \"[Stage 4:>                                                          (0 + 1) / 1]\\r\\n\",\n      \"\\r\\n\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df.write.parquet('zones')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"id\": \"02fe2bdb\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"total 28K\\r\\n\",\n      \"-rw-rw-r-- 1 alexey alexey 6.8K Feb 15 22:25 Untitled.ipynb\\r\\n\",\n      \"-rw-rw-r-- 1 alexey alexey  13K Aug 17  2016 taxi+_zone_lookup.csv\\r\\n\",\n      \"drwxr-xr-x 2 alexey alexey 4.0K Feb 15 22:25 zones\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!ls -lh\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"659f0812\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.7\"\n  }\n },\n 
\"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "06-batch/code/04_pyspark.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"07de9dc3\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pyspark\\n\",\n    \"from pyspark.sql import SparkSession\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"ca5bbb06\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING: An illegal reflective access operation has occurred\\n\",\n      \"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\\n\",\n      \"WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\\n\",\n      \"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\\n\",\n      \"WARNING: All illegal access operations will be denied in a future release\\n\",\n      \"22/02/16 21:11:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\\n\",\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"spark = SparkSession.builder \\\\\\n\",\n    \"    .master(\\\"local[*]\\\") \\\\\\n\",\n    \"    .appName('test') \\\\\\n\",\n    \"    .getOrCreate()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"cf8de204\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"--2022-02-16 21:13:50--  https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-01.csv\\n\",\n      \"Resolving nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)... 52.217.84.132\\n\",\n      \"Connecting to nyc-tlc.s3.amazonaws.com (nyc-tlc.s3.amazonaws.com)|52.217.84.132|:443... connected.\\n\",\n      \"HTTP request sent, awaiting response... 200 OK\\n\",\n      \"Length: 752335705 (717M) [text/csv]\\n\",\n      \"Saving to: ‘fhvhv_tripdata_2021-01.csv’\\n\",\n      \"\\n\",\n      \"fhvhv_tripdata_2021 100%[===================>] 717.48M  35.6MB/s    in 21s     \\n\",\n      \"\\n\",\n      \"2022-02-16 21:14:11 (34.4 MB/s) - ‘fhvhv_tripdata_2021-01.csv’ saved [752335705/752335705]\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"201a5957\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"!gzip -dc fhvhv_tripdata_2021-01.csv.gz > fhvhv_tripdata_2021-01.csv\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"id\": \"2a52087c\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"11908469 fhvhv_tripdata_2021-01.csv\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!wc -l fhvhv_tripdata_2021-01.csv\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 5,\n   \"id\": \"931021a7\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = spark.read \\\\\\n\",\n    \"    .option(\\\"header\\\", \\\"true\\\") \\\\\\n\",\n    \"    .csv('fhvhv_tripdata_2021-01.csv')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"id\": \"d44b7839\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"StructType(List(StructField(hvfhs_license_num,StringType,true),StructField(dispatching_base_num,StringType,true),StructField(pickup_datetime,StringType,true),StructField(dropoff_datetime,StringType,true),StructField(PULocationID,StringType,true),StructField(DOLocationID,StringType,true),StructField(SR_Flag,StringType,true)))\"\n      ]\n     },\n     \"execution_count\": 10,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.schema\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"id\": \"4249e790\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"!head -n 1001 fhvhv_tripdata_2021-01.csv > head.csv\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"id\": \"6894312c\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"id\": \"f3ca771b\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_pandas = pd.read_csv('head.csv')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"id\": \"f1066b4f\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"hvfhs_license_num        object\\n\",\n       \"dispatching_base_num     object\\n\",\n       \"pickup_datetime          object\\n\",\n       \"dropoff_datetime         object\\n\",\n       \"PULocationID              
int64\\n\",\n       \"DOLocationID              int64\\n\",\n       \"SR_Flag                 float64\\n\",\n       \"dtype: object\"\n      ]\n     },\n     \"execution_count\": 19,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df_pandas.dtypes\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"id\": \"f8413c9d\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"StructType(List(StructField(hvfhs_license_num,StringType,true),StructField(dispatching_base_num,StringType,true),StructField(pickup_datetime,StringType,true),StructField(dropoff_datetime,StringType,true),StructField(PULocationID,LongType,true),StructField(DOLocationID,LongType,true),StructField(SR_Flag,DoubleType,true)))\"\n      ]\n     },\n     \"execution_count\": 23,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"spark.createDataFrame(df_pandas).schema\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"80f252c1\",\n   \"metadata\": {},\n   \"source\": [\n    \"Integer - 4 bytes\\n\",\n    \"Long - 8 bytes\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 24,\n   \"id\": \"16937bfd\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from pyspark.sql import types\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"id\": \"fc61a99a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"schema = types.StructType([\\n\",\n    \"    types.StructField('hvfhs_license_num', types.StringType(), True),\\n\",\n    \"    types.StructField('dispatching_base_num', types.StringType(), True),\\n\",\n    \"    types.StructField('pickup_datetime', types.TimestampType(), True),\\n\",\n    \"    types.StructField('dropoff_datetime', types.TimestampType(), True),\\n\",\n    \"    types.StructField('PULocationID', 
types.IntegerType(), True),\\n\",\n    \"    types.StructField('DOLocationID', types.IntegerType(), True),\\n\",\n    \"    types.StructField('SR_Flag', types.StringType(), True)\\n\",\n    \"])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 32,\n   \"id\": \"f94052ae\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = spark.read \\\\\\n\",\n    \"    .option(\\\"header\\\", \\\"true\\\") \\\\\\n\",\n    \"    .schema(schema) \\\\\\n\",\n    \"    .csv('fhvhv_tripdata_2021-01.csv')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 36,\n   \"id\": \"c270d9d6\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = df.repartition(24)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"7796c2b2\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df.write.parquet('fhvhv/2021/01/')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 44,\n   \"id\": \"c3cab876\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = spark.read.parquet('fhvhv/2021/01/')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 48,\n   \"id\": \"203b5627\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- hvfhs_license_num: string (nullable = true)\\n\",\n      \" |-- dispatching_base_num: string (nullable = true)\\n\",\n      \" |-- pickup_datetime: timestamp (nullable = true)\\n\",\n      \" |-- dropoff_datetime: timestamp (nullable = true)\\n\",\n      \" |-- PULocationID: integer (nullable = true)\\n\",\n      \" |-- DOLocationID: integer (nullable = true)\\n\",\n      \" |-- SR_Flag: string (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"64172a47\",\n 
  \"metadata\": {},\n   \"source\": [\n    \"SELECT * FROM df WHERE hvfhs_license_num =  HV0003\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 56,\n   \"id\": \"d24840a0\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from pyspark.sql import functions as F\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 61,\n   \"id\": \"3ab1ca44\",\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+\\n\",\n      \"|hvfhs_license_num|dispatching_base_num|    pickup_datetime|   dropoff_datetime|PULocationID|DOLocationID|SR_Flag|\\n\",\n      \"+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+\\n\",\n      \"|           HV0005|              B02510|2021-01-07 06:43:22|2021-01-07 06:55:06|         142|         230|   null|\\n\",\n      \"|           HV0005|              B02510|2021-01-01 16:01:26|2021-01-01 16:20:20|         133|          91|   null|\\n\",\n      \"|           HV0003|              B02764|2021-01-01 00:23:13|2021-01-01 00:30:35|         147|         159|   null|\\n\",\n      \"|           HV0003|              B02869|2021-01-06 11:43:12|2021-01-06 11:55:07|          79|         164|   null|\\n\",\n      \"|           HV0003|              B02884|2021-01-04 15:35:32|2021-01-04 15:52:02|         174|          18|   null|\\n\",\n      \"|           HV0003|              B02875|2021-01-04 13:42:15|2021-01-04 14:04:57|         201|         180|   null|\\n\",\n      \"|           HV0005|              B02510|2021-01-04 18:57:31|2021-01-04 19:09:55|         230|         142|   null|\\n\",\n      \"|           HV0003|              B02872|2021-01-03 18:42:03|2021-01-03 19:12:22|         132|          72|   null|\\n\",\n  
    \"|           HV0004|              B02800|2021-01-01 05:31:50|2021-01-01 05:40:03|         188|          61|   null|\\n\",\n      \"|           HV0005|              B02510|2021-01-04 20:21:47|2021-01-04 20:26:03|          97|         189|   null|\\n\",\n      \"|           HV0003|              B02764|2021-01-01 01:51:18|2021-01-01 02:05:32|         174|         235|   null|\\n\",\n      \"|           HV0003|              B02871|2021-01-05 10:20:54|2021-01-05 10:32:44|          35|          76|   null|\\n\",\n      \"|           HV0005|              B02510|2021-01-06 02:32:09|2021-01-06 02:43:35|          35|          39|   null|\\n\",\n      \"|           HV0003|              B02882|2021-01-04 12:34:52|2021-01-04 12:38:59|         231|          13|   null|\\n\",\n      \"|           HV0003|              B02617|2021-01-02 20:12:56|2021-01-02 20:41:18|          87|         127|   null|\\n\",\n      \"|           HV0005|              B02510|2021-01-02 16:55:48|2021-01-02 17:20:40|          17|          89|   null|\\n\",\n      \"|           HV0003|              B02869|2021-01-02 15:14:38|2021-01-02 15:23:27|          11|          14|   null|\\n\",\n      \"|           HV0005|              B02510|2021-01-01 05:54:50|2021-01-01 06:03:46|          21|          26|   null|\\n\",\n      \"|           HV0003|              B02869|2021-01-04 12:40:42|2021-01-04 12:48:34|          83|         260|   null|\\n\",\n      \"|           HV0005|              B02510|2021-01-01 14:58:57|2021-01-01 15:09:53|         189|          52|   null|\\n\",\n      \"+-----------------+--------------------+-------------------+-------------------+------------+------------+-------+\\n\",\n      \"only showing top 20 rows\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 63,\n   \"id\": \"6d98c2ce\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def 
crazy_stuff(base_num):\\n\",\n    \"    num = int(base_num[1:])\\n\",\n    \"    if num % 7 == 0:\\n\",\n    \"        return f's/{num:03x}'\\n\",\n    \"    elif num % 3 == 0:\\n\",\n    \"        return f'a/{num:03x}'\\n\",\n    \"    else:\\n\",\n    \"        return f'e/{num:03x}'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 65,\n   \"id\": \"f3175419\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"'s/b44'\"\n      ]\n     },\n     \"execution_count\": 65,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"crazy_stuff('B02884')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 66,\n   \"id\": \"9bb5d503\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"crazy_stuff_udf = F.udf(crazy_stuff, returnType=types.StringType())\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 67,\n   \"id\": \"b38f0465\",\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+-------+-----------+------------+------------+------------+\\n\",\n      \"|base_id|pickup_date|dropoff_date|PULocationID|DOLocationID|\\n\",\n      \"+-------+-----------+------------+------------+------------+\\n\",\n      \"|  e/9ce| 2021-01-07|  2021-01-07|         142|         230|\\n\",\n      \"|  e/9ce| 2021-01-01|  2021-01-01|         133|          91|\\n\",\n      \"|  e/acc| 2021-01-01|  2021-01-01|         147|         159|\\n\",\n      \"|  e/b35| 2021-01-06|  2021-01-06|          79|         164|\\n\",\n      \"|  s/b44| 2021-01-04|  2021-01-04|         174|          18|\\n\",\n      \"|  e/b3b| 2021-01-04|  2021-01-04|         201|         180|\\n\",\n      \"|  e/9ce| 2021-01-04|  2021-01-04|         230|         142|\\n\",\n      \"|  e/b38| 2021-01-03|  2021-01-03|         132|        
  72|\\n\",\n      \"|  s/af0| 2021-01-01|  2021-01-01|         188|          61|\\n\",\n      \"|  e/9ce| 2021-01-04|  2021-01-04|          97|         189|\\n\",\n      \"|  e/acc| 2021-01-01|  2021-01-01|         174|         235|\\n\",\n      \"|  a/b37| 2021-01-05|  2021-01-05|          35|          76|\\n\",\n      \"|  e/9ce| 2021-01-06|  2021-01-06|          35|          39|\\n\",\n      \"|  e/b42| 2021-01-04|  2021-01-04|         231|          13|\\n\",\n      \"|  e/a39| 2021-01-02|  2021-01-02|          87|         127|\\n\",\n      \"|  e/9ce| 2021-01-02|  2021-01-02|          17|          89|\\n\",\n      \"|  e/b35| 2021-01-02|  2021-01-02|          11|          14|\\n\",\n      \"|  e/9ce| 2021-01-01|  2021-01-01|          21|          26|\\n\",\n      \"|  e/b35| 2021-01-04|  2021-01-04|          83|         260|\\n\",\n      \"|  e/9ce| 2021-01-01|  2021-01-01|         189|          52|\\n\",\n      \"+-------+-----------+------------+------------+------------+\\n\",\n      \"only showing top 20 rows\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df \\\\\\n\",\n    \"    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \\\\\\n\",\n    \"    .withColumn('dropoff_date', F.to_date(df.dropoff_datetime)) \\\\\\n\",\n    \"    .withColumn('base_id', crazy_stuff_udf(df.dispatching_base_num)) \\\\\\n\",\n    \"    .select('base_id', 'pickup_date', 'dropoff_date', 'PULocationID', 'DOLocationID') \\\\\\n\",\n    \"    .show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 55,\n   \"id\": \"00921644\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"[Row(pickup_datetime=datetime.datetime(2021, 1, 1, 0, 23, 13), dropoff_datetime=datetime.datetime(2021, 1, 1, 0, 30, 35), PULocationID=147, DOLocationID=159),\\n\",\n       \" Row(pickup_datetime=datetime.datetime(2021, 1, 6, 11, 43, 12), dropoff_datetime=datetime.datetime(2021, 1, 6, 11, 55, 7), 
PULocationID=79, DOLocationID=164),\\n\",\n       \" Row(pickup_datetime=datetime.datetime(2021, 1, 4, 15, 35, 32), dropoff_datetime=datetime.datetime(2021, 1, 4, 15, 52, 2), PULocationID=174, DOLocationID=18),\\n\",\n       \" Row(pickup_datetime=datetime.datetime(2021, 1, 4, 13, 42, 15), dropoff_datetime=datetime.datetime(2021, 1, 4, 14, 4, 57), PULocationID=201, DOLocationID=180),\\n\",\n       \" Row(pickup_datetime=datetime.datetime(2021, 1, 3, 18, 42, 3), dropoff_datetime=datetime.datetime(2021, 1, 3, 19, 12, 22), PULocationID=132, DOLocationID=72)]\"\n      ]\n     },\n     \"execution_count\": 55,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.select('pickup_datetime', 'dropoff_datetime', 'PULocationID', 'DOLocationID') \\\\\\n\",\n    \"  .filter(df.hvfhs_license_num == 'HV0003')\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 50,\n   \"id\": \"0866f9c0\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"hvfhs_license_num,dispatching_base_num,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,SR_Flag\\r\\n\",\n      \"\\r\\n\",\n      \"HV0003,B02682,2021-01-01 00:33:44,2021-01-01 00:49:07,230,166,\\r\\n\",\n      \"\\r\\n\",\n      \"HV0003,B02682,2021-01-01 00:55:19,2021-01-01 01:18:21,152,167,\\r\\n\",\n      \"\\r\\n\",\n      \"HV0003,B02764,2021-01-01 00:23:56,2021-01-01 00:38:05,233,142,\\r\\n\",\n      \"\\r\\n\",\n      \"HV0003,B02764,2021-01-01 00:42:51,2021-01-01 00:45:50,142,143,\\r\\n\",\n      \"\\r\\n\",\n      \"HV0003,B02764,2021-01-01 00:48:14,2021-01-01 01:08:42,143,78,\\r\\n\",\n      \"\\r\\n\",\n      \"HV0005,B02510,2021-01-01 00:06:59,2021-01-01 00:43:01,88,42,\\r\\n\",\n      \"\\r\\n\",\n      \"HV0005,B02510,2021-01-01 00:50:00,2021-01-01 01:04:57,42,151,\\r\\n\",\n      \"\\r\\n\",\n      \"HV0003,B02764,2021-01-01 00:14:30,2021-01-01 
00:50:27,71,226,\\r\\n\",\n      \"\\r\\n\",\n      \"HV0003,B02875,2021-01-01 00:22:54,2021-01-01 00:30:20,112,255,\\r\\n\",\n      \"\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!head -n 10 head.csv\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"aa1b0e18\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.7\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "06-batch/code/05_taxi_schema.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"8c1d0c08\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pyspark\\n\",\n    \"from pyspark.sql import SparkSession\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"96a248f5\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING: An illegal reflective access operation has occurred\\n\",\n      \"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\\n\",\n      \"WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\\n\",\n      \"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\\n\",\n      \"WARNING: All illegal access operations will be denied in a future release\\n\",\n      \"22/02/17 21:59:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\\n\",\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"spark = SparkSession.builder \\\\\\n\",\n    \"    .master(\\\"local[*]\\\") \\\\\\n\",\n    \"    .appName('test') \\\\\\n\",\n    \"    .getOrCreate()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"id\": \"c53274b1\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"id\": \"5d8434e1\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from pyspark.sql import types\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"id\": \"a84c6c6d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"green_schema = types.StructType([\\n\",\n    \"    types.StructField(\\\"VendorID\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"lpep_pickup_datetime\\\", types.TimestampType(), True),\\n\",\n    \"    types.StructField(\\\"lpep_dropoff_datetime\\\", types.TimestampType(), True),\\n\",\n    \"    types.StructField(\\\"store_and_fwd_flag\\\", types.StringType(), True),\\n\",\n    \"    types.StructField(\\\"RatecodeID\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"PULocationID\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"DOLocationID\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"passenger_count\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"trip_distance\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"fare_amount\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"extra\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"mta_tax\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"tip_amount\\\", types.DoubleType(), True),\\n\",\n    \"    
types.StructField(\\\"tolls_amount\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"ehail_fee\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"improvement_surcharge\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"total_amount\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"payment_type\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"trip_type\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"congestion_surcharge\\\", types.DoubleType(), True)\\n\",\n    \"])\\n\",\n    \"\\n\",\n    \"yellow_schema = types.StructType([\\n\",\n    \"    types.StructField(\\\"VendorID\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"tpep_pickup_datetime\\\", types.TimestampType(), True),\\n\",\n    \"    types.StructField(\\\"tpep_dropoff_datetime\\\", types.TimestampType(), True),\\n\",\n    \"    types.StructField(\\\"passenger_count\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"trip_distance\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"RatecodeID\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"store_and_fwd_flag\\\", types.StringType(), True),\\n\",\n    \"    types.StructField(\\\"PULocationID\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"DOLocationID\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"payment_type\\\", types.IntegerType(), True),\\n\",\n    \"    types.StructField(\\\"fare_amount\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"extra\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"mta_tax\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"tip_amount\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"tolls_amount\\\", types.DoubleType(), True),\\n\",\n    \"    
types.StructField(\\\"improvement_surcharge\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"total_amount\\\", types.DoubleType(), True),\\n\",\n    \"    types.StructField(\\\"congestion_surcharge\\\", types.DoubleType(), True)\\n\",\n    \"])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"id\": \"3f7e0cb9\",\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/1\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/2\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/3\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/4\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/5\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n   
  \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/6\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/7\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/8\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/9\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/10\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 
2020/11\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/12\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"year = 2020\\n\",\n    \"\\n\",\n    \"for month in range(1, 13):\\n\",\n    \"    print(f'processing data for {year}/{month}')\\n\",\n    \"\\n\",\n    \"    input_path = f'data/raw/green/{year}/{month:02d}/'\\n\",\n    \"    output_path = f'data/pq/green/{year}/{month:02d}/'\\n\",\n    \"\\n\",\n    \"    df_green = spark.read \\\\\\n\",\n    \"        .option(\\\"header\\\", \\\"true\\\") \\\\\\n\",\n    \"        .schema(green_schema) \\\\\\n\",\n    \"        .csv(input_path)\\n\",\n    \"\\n\",\n    \"    df_green \\\\\\n\",\n    \"        .repartition(4) \\\\\\n\",\n    \"        .write.parquet(output_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"id\": \"96ac2ad7\",\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/1\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/2\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"               
                                                                 \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/3\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/4\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/5\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/6\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/7\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 15:>                                                         (0 + 1) / 1]\\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/8\\n\"\n     ]\n    },\n    {\n     \"name\": 
\"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"ename\": \"AnalysisException\",\n     \"evalue\": \"Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/green/2021/08;\",\n     \"output_type\": \"error\",\n     \"traceback\": [\n      \"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mAnalysisException\\u001b[0m                         Traceback (most recent call last)\",\n      \"\\u001b[0;32m/tmp/ipykernel_129101/906373977.py\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m\\u001b[0m\\n\\u001b[1;32m      7\\u001b[0m     \\u001b[0moutput_path\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0;34mf'data/pq/green/{year}/{month:02d}/'\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m      8\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m----> 9\\u001b[0;31m     \\u001b[0mdf_green\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mspark\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread\\u001b[0m\\u001b[0;31m \\u001b[0m\\u001b[0;31m\\\\\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m     10\\u001b[0m         \\u001b[0;34m.\\u001b[0m\\u001b[0moption\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m\\\"header\\\"\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0;34m\\\"true\\\"\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;31m \\u001b[0m\\u001b[0;31m\\\\\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     11\\u001b[0m         \\u001b[0;34m.\\u001b[0m\\u001b[0mschema\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mgreen_schema\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;31m \\u001b[0m\\u001b[0;31m\\\\\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      
\"\\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/readwriter.py\\u001b[0m in \\u001b[0;36mcsv\\u001b[0;34m(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup)\\u001b[0m\\n\\u001b[1;32m    536\\u001b[0m             \\u001b[0mpath\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0;34m[\\u001b[0m\\u001b[0mpath\\u001b[0m\\u001b[0;34m]\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    537\\u001b[0m         \\u001b[0;32mif\\u001b[0m \\u001b[0mtype\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mpath\\u001b[0m\\u001b[0;34m)\\u001b[0m \\u001b[0;34m==\\u001b[0m \\u001b[0mlist\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 538\\u001b[0;31m             \\u001b[0;32mreturn\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_df\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_jreader\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcsv\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_spark\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_sc\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_jvm\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mPythonUtils\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mtoSeq\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mpath\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    539\\u001b[0m         \\u001b[0;32melif\\u001b[0m 
\\u001b[0misinstance\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mpath\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mRDD\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    540\\u001b[0m             \\u001b[0;32mdef\\u001b[0m \\u001b[0mfunc\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0miterator\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\\u001b[0m in \\u001b[0;36m__call__\\u001b[0;34m(self, *args)\\u001b[0m\\n\\u001b[1;32m   1302\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1303\\u001b[0m         \\u001b[0manswer\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mgateway_client\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0msend_command\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mcommand\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m-> 1304\\u001b[0;31m         return_value = get_return_value(\\n\\u001b[0m\\u001b[1;32m   1305\\u001b[0m             answer, self.gateway_client, self.target_id, self.name)\\n\\u001b[1;32m   1306\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\\u001b[0m in \\u001b[0;36mdeco\\u001b[0;34m(*a, **kw)\\u001b[0m\\n\\u001b[1;32m    132\\u001b[0m                 \\u001b[0;31m# Hide where the exception came from that shows a non-Pythonic\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    133\\u001b[0m                 \\u001b[0;31m# JVM exception message.\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 134\\u001b[0;31m                 
\\u001b[0mraise_from\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mconverted\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    135\\u001b[0m             \\u001b[0;32melse\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    136\\u001b[0m                 \\u001b[0;32mraise\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\\u001b[0m in \\u001b[0;36mraise_from\\u001b[0;34m(e)\\u001b[0m\\n\",\n      \"\\u001b[0;31mAnalysisException\\u001b[0m: Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/green/2021/08;\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"year = 2021 \\n\",\n    \"\\n\",\n    \"for month in range(1, 13):\\n\",\n    \"    print(f'processing data for {year}/{month}')\\n\",\n    \"\\n\",\n    \"    input_path = f'data/raw/green/{year}/{month:02d}/'\\n\",\n    \"    output_path = f'data/pq/green/{year}/{month:02d}/'\\n\",\n    \"\\n\",\n    \"    df_green = spark.read \\\\\\n\",\n    \"        .option(\\\"header\\\", \\\"true\\\") \\\\\\n\",\n    \"        .schema(green_schema) \\\\\\n\",\n    \"        .csv(input_path)\\n\",\n    \"\\n\",\n    \"    df_green \\\\\\n\",\n    \"        .repartition(4) \\\\\\n\",\n    \"        .write.parquet(output_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"463c7dc8\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"id\": \"6ff4265d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 24,\n   \"id\": \"6e982d29\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 28,\n   \"id\": \"19326bc9\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/1\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/2\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/3\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/4\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/5\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/6\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     
\"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/7\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/8\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/9\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/10\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2020/11\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 
2020/12\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"year = 2020\\n\",\n    \"\\n\",\n    \"for month in range(1, 13):\\n\",\n    \"    print(f'processing data for {year}/{month}')\\n\",\n    \"\\n\",\n    \"    input_path = f'data/raw/yellow/{year}/{month:02d}/'\\n\",\n    \"    output_path = f'data/pq/yellow/{year}/{month:02d}/'\\n\",\n    \"\\n\",\n    \"    df_yellow = spark.read \\\\\\n\",\n    \"        .option(\\\"header\\\", \\\"true\\\") \\\\\\n\",\n    \"        .schema(yellow_schema) \\\\\\n\",\n    \"        .csv(input_path)\\n\",\n    \"\\n\",\n    \"    df_yellow \\\\\\n\",\n    \"        .repartition(4) \\\\\\n\",\n    \"        .write.parquet(output_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 29,\n   \"id\": \"aeca811a\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/1\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/2\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/3\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                        
                                        \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/4\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/5\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/6\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/7\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[Stage 78:===========================================>              (3 + 1) / 4]\\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"processing data for 2021/8\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"ename\": \"AnalysisException\",\n     \"evalue\": \"Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/yellow/2021/08;\",\n     
\"output_type\": \"error\",\n     \"traceback\": [\n      \"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mAnalysisException\\u001b[0m                         Traceback (most recent call last)\",\n      \"\\u001b[0;32m/tmp/ipykernel_129101/2088663510.py\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m\\u001b[0m\\n\\u001b[1;32m      7\\u001b[0m     \\u001b[0moutput_path\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0;34mf'data/pq/yellow/{year}/{month:02d}/'\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m      8\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m----> 9\\u001b[0;31m     \\u001b[0mdf_yellow\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mspark\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread\\u001b[0m\\u001b[0;31m \\u001b[0m\\u001b[0;31m\\\\\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m     10\\u001b[0m         \\u001b[0;34m.\\u001b[0m\\u001b[0moption\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m\\\"header\\\"\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0;34m\\\"true\\\"\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;31m \\u001b[0m\\u001b[0;31m\\\\\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     11\\u001b[0m         \\u001b[0;34m.\\u001b[0m\\u001b[0mschema\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0myellow_schema\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;31m \\u001b[0m\\u001b[0;31m\\\\\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/readwriter.py\\u001b[0m in \\u001b[0;36mcsv\\u001b[0;34m(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, 
columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup)\\u001b[0m\\n\\u001b[1;32m    536\\u001b[0m             \\u001b[0mpath\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0;34m[\\u001b[0m\\u001b[0mpath\\u001b[0m\\u001b[0;34m]\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    537\\u001b[0m         \\u001b[0;32mif\\u001b[0m \\u001b[0mtype\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mpath\\u001b[0m\\u001b[0;34m)\\u001b[0m \\u001b[0;34m==\\u001b[0m \\u001b[0mlist\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 538\\u001b[0;31m             \\u001b[0;32mreturn\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_df\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_jreader\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcsv\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_spark\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_sc\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_jvm\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mPythonUtils\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mtoSeq\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mpath\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    539\\u001b[0m         \\u001b[0;32melif\\u001b[0m \\u001b[0misinstance\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mpath\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mRDD\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    540\\u001b[0m             \\u001b[0;32mdef\\u001b[0m 
\\u001b[0mfunc\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0miterator\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\\u001b[0m in \\u001b[0;36m__call__\\u001b[0;34m(self, *args)\\u001b[0m\\n\\u001b[1;32m   1302\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1303\\u001b[0m         \\u001b[0manswer\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mgateway_client\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0msend_command\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mcommand\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m-> 1304\\u001b[0;31m         return_value = get_return_value(\\n\\u001b[0m\\u001b[1;32m   1305\\u001b[0m             answer, self.gateway_client, self.target_id, self.name)\\n\\u001b[1;32m   1306\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\\u001b[0m in \\u001b[0;36mdeco\\u001b[0;34m(*a, **kw)\\u001b[0m\\n\\u001b[1;32m    132\\u001b[0m                 \\u001b[0;31m# Hide where the exception came from that shows a non-Pythonic\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    133\\u001b[0m                 \\u001b[0;31m# JVM exception message.\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 134\\u001b[0;31m                 \\u001b[0mraise_from\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mconverted\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    135\\u001b[0m             \\u001b[0;32melse\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    136\\u001b[0m                 
\\u001b[0;32mraise\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m~/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/sql/utils.py\\u001b[0m in \\u001b[0;36mraise_from\\u001b[0;34m(e)\\u001b[0m\\n\",\n      \"\\u001b[0;31mAnalysisException\\u001b[0m: Path does not exist: file:/home/alexey/data-engineering-zoomcamp/week_5_batch_processing/code/data/raw/yellow/2021/08;\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"year = 2021\\n\",\n    \"\\n\",\n    \"for month in range(1, 13):\\n\",\n    \"    print(f'processing data for {year}/{month}')\\n\",\n    \"\\n\",\n    \"    input_path = f'data/raw/yellow/{year}/{month:02d}/'\\n\",\n    \"    output_path = f'data/pq/yellow/{year}/{month:02d}/'\\n\",\n    \"\\n\",\n    \"    df_yellow = spark.read \\\\\\n\",\n    \"        .option(\\\"header\\\", \\\"true\\\") \\\\\\n\",\n    \"        .schema(yellow_schema) \\\\\\n\",\n    \"        .csv(input_path)\\n\",\n    \"\\n\",\n    \"    df_yellow \\\\\\n\",\n    \"        .repartition(4) \\\\\\n\",\n    \"        .write.parquet(output_path)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"d7eb0da9\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.7\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "06-batch/code/06_spark_sql.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"3307b886\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING: An illegal reflective access operation has occurred\\n\",\n      \"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\\n\",\n      \"WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\\n\",\n      \"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\\n\",\n      \"WARNING: All illegal access operations will be denied in a future release\\n\",\n      \"22/02/17 22:43:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\\n\",\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import pyspark\\n\",\n    \"from pyspark.sql import SparkSession\\n\",\n    \"\\n\",\n    \"spark = SparkSession.builder \\\\\\n\",\n    \"    .master(\\\"local[*]\\\") \\\\\\n\",\n    \"    .appName('test') \\\\\\n\",\n    \"    .getOrCreate()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"1ee1eb1d\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_green = spark.read.parquet('data/pq/green/*/*')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"0ca5ee99\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"id\": \"649bb4da\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_green = df_green \\\\\\n\",\n    \"    .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \\\\\\n\",\n    \"    .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"id\": \"90cd6845\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_yellow = spark.read.parquet('data/pq/yellow/*/*')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"id\": \"88822efd\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_yellow = df_yellow \\\\\\n\",\n    \"    .withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \\\\\\n\",\n    \"    .withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"id\": \"610167a2\",\n   \"metadata\": {},\n   
\"outputs\": [],\n   \"source\": [\n    \"common_columns = []\\n\",\n    \"\\n\",\n    \"yellow_columns = set(df_yellow.columns)\\n\",\n    \"\\n\",\n    \"for col in df_green.columns:\\n\",\n    \"    if col in yellow_columns:\\n\",\n    \"        common_columns.append(col)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"id\": \"839d773f\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from pyspark.sql import functions as F\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 28,\n   \"id\": \"2498810a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_green_sel = df_green \\\\\\n\",\n    \"    .select(common_columns) \\\\\\n\",\n    \"    .withColumn('service_type', F.lit('green'))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 29,\n   \"id\": \"19032efc\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_yellow_sel = df_yellow \\\\\\n\",\n    \"    .select(common_columns) \\\\\\n\",\n    \"    .withColumn('service_type', F.lit('yellow'))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 30,\n   \"id\": \"f5b0f3d1\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_trips_data = df_green_sel.unionAll(df_yellow_sel)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 33,\n   \"id\": \"1bed8b33\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+------------+--------+\\n\",\n      \"|service_type|   count|\\n\",\n      \"+------------+--------+\\n\",\n      \"|       green| 2304517|\\n\",\n      \"|      yellow|39649199|\\n\",\n      \"+------------+--------+\\n\",\n      
\"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_trips_data.groupBy('service_type').count().show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 40,\n   \"id\": \"28cc8fa3\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"['VendorID',\\n\",\n       \" 'pickup_datetime',\\n\",\n       \" 'dropoff_datetime',\\n\",\n       \" 'store_and_fwd_flag',\\n\",\n       \" 'RatecodeID',\\n\",\n       \" 'PULocationID',\\n\",\n       \" 'DOLocationID',\\n\",\n       \" 'passenger_count',\\n\",\n       \" 'trip_distance',\\n\",\n       \" 'fare_amount',\\n\",\n       \" 'extra',\\n\",\n       \" 'mta_tax',\\n\",\n       \" 'tip_amount',\\n\",\n       \" 'tolls_amount',\\n\",\n       \" 'improvement_surcharge',\\n\",\n       \" 'total_amount',\\n\",\n       \" 'payment_type',\\n\",\n       \" 'congestion_surcharge',\\n\",\n       \" 'service_type']\"\n      ]\n     },\n     \"execution_count\": 40,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df_trips_data.columns\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 35,\n   \"id\": \"36e90cbc\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_trips_data.registerTempTable('trips_data')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 38,\n   \"id\": \"d0e01bf1\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+------------+--------+\\n\",\n      \"|service_type|count(1)|\\n\",\n      \"+------------+--------+\\n\",\n      \"|       green| 2304517|\\n\",\n      \"|      yellow|39649199|\\n\",\n      \"+------------+--------+\\n\",\n   
   \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"spark.sql(\\\"\\\"\\\"\\n\",\n    \"SELECT\\n\",\n    \"    service_type,\\n\",\n    \"    count(1)\\n\",\n    \"FROM\\n\",\n    \"    trips_data\\n\",\n    \"GROUP BY \\n\",\n    \"    service_type\\n\",\n    \"\\\"\\\"\\\").show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"b2ee7038\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_result = spark.sql(\\\"\\\"\\\"\\n\",\n    \"SELECT \\n\",\n    \"    -- Revenue grouping \\n\",\n    \"    PULocationID AS revenue_zone,\\n\",\n    \"    date_trunc('month', pickup_datetime) AS revenue_month, \\n\",\n    \"    service_type, \\n\",\n    \"\\n\",\n    \"    -- Revenue calculation \\n\",\n    \"    SUM(fare_amount) AS revenue_monthly_fare,\\n\",\n    \"    SUM(extra) AS revenue_monthly_extra,\\n\",\n    \"    SUM(mta_tax) AS revenue_monthly_mta_tax,\\n\",\n    \"    SUM(tip_amount) AS revenue_monthly_tip_amount,\\n\",\n    \"    SUM(tolls_amount) AS revenue_monthly_tolls_amount,\\n\",\n    \"    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,\\n\",\n    \"    SUM(total_amount) AS revenue_monthly_total_amount,\\n\",\n    \"    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,\\n\",\n    \"\\n\",\n    \"    -- Additional calculations\\n\",\n    \"    AVG(passenger_count) AS avg_monthly_passenger_count,\\n\",\n    \"    AVG(trip_distance) AS avg_monthly_trip_distance\\n\",\n    \"FROM\\n\",\n    \"    trips_data\\n\",\n    \"GROUP BY\\n\",\n    \"    1, 2, 3\\n\",\n    \"\\\"\\\"\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 49,\n   \"id\": \"f67eeb92\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    
\"df_result.coalesce(1).write.parquet('data/report/revenue/', mode='overwrite')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"f56a885d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.7\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "06-batch/code/06_spark_sql.py",
    "content": "#!/usr/bin/env python\n# coding: utf-8\n\nimport argparse\n\nimport pyspark\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import functions as F\n\n\nparser = argparse.ArgumentParser()\n\nparser.add_argument('--input_green', required=True)\nparser.add_argument('--input_yellow', required=True)\nparser.add_argument('--output', required=True)\n\nargs = parser.parse_args()\n\ninput_green = args.input_green\ninput_yellow = args.input_yellow\noutput = args.output\n\n\nspark = SparkSession.builder \\\n    .appName('test') \\\n    .getOrCreate()\n\ndf_green = spark.read.parquet(input_green)\n\ndf_green = df_green \\\n    .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \\\n    .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')\n\ndf_yellow = spark.read.parquet(input_yellow)\n\n\ndf_yellow = df_yellow \\\n    .withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \\\n    .withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')\n\n\ncommon_columns = [\n    'VendorID',\n    'pickup_datetime',\n    'dropoff_datetime',\n    'store_and_fwd_flag',\n    'RatecodeID',\n    'PULocationID',\n    'DOLocationID',\n    'passenger_count',\n    'trip_distance',\n    'fare_amount',\n    'extra',\n    'mta_tax',\n    'tip_amount',\n    'tolls_amount',\n    'improvement_surcharge',\n    'total_amount',\n    'payment_type',\n    'congestion_surcharge'\n]\n\n\n\ndf_green_sel = df_green \\\n    .select(common_columns) \\\n    .withColumn('service_type', F.lit('green'))\n\ndf_yellow_sel = df_yellow \\\n    .select(common_columns) \\\n    .withColumn('service_type', F.lit('yellow'))\n\n\ndf_trips_data = df_green_sel.unionAll(df_yellow_sel)\n\ndf_trips_data.registerTempTable('trips_data')\n\n\ndf_result = spark.sql(\"\"\"\nSELECT \n    -- Revenue grouping \n    PULocationID AS revenue_zone,\n    date_trunc('month', pickup_datetime) AS revenue_month, \n    service_type, \n\n    -- Revenue calculation \n    SUM(fare_amount) AS 
revenue_monthly_fare,\n    SUM(extra) AS revenue_monthly_extra,\n    SUM(mta_tax) AS revenue_monthly_mta_tax,\n    SUM(tip_amount) AS revenue_monthly_tip_amount,\n    SUM(tolls_amount) AS revenue_monthly_tolls_amount,\n    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,\n    SUM(total_amount) AS revenue_monthly_total_amount,\n    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,\n\n    -- Additional calculations\n    AVG(passenger_count) AS avg_monthly_passenger_count,\n    AVG(trip_distance) AS avg_monthly_trip_distance\nFROM\n    trips_data\nGROUP BY\n    1, 2, 3\n\"\"\")\n\n\ndf_result.coalesce(1) \\\n    .write.parquet(output, mode='overwrite')\n\n\n\n\n"
  },
  {
    "path": "06-batch/code/06_spark_sql_big_query.py",
    "content": "#!/usr/bin/env python\n# coding: utf-8\n\nimport argparse\n\nimport pyspark\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import functions as F\n\n\nparser = argparse.ArgumentParser()\n\nparser.add_argument('--input_green', required=True)\nparser.add_argument('--input_yellow', required=True)\nparser.add_argument('--output', required=True)\n\nargs = parser.parse_args()\n\ninput_green = args.input_green\ninput_yellow = args.input_yellow\noutput = args.output\n\n\nspark = SparkSession.builder \\\n    .appName('test') \\\n    .getOrCreate()\n\nspark.conf.set('temporaryGcsBucket', 'dataproc-temp-europe-west6-828225226997-fckhkym8')\n\ndf_green = spark.read.parquet(input_green)\n\ndf_green = df_green \\\n    .withColumnRenamed('lpep_pickup_datetime', 'pickup_datetime') \\\n    .withColumnRenamed('lpep_dropoff_datetime', 'dropoff_datetime')\n\ndf_yellow = spark.read.parquet(input_yellow)\n\n\ndf_yellow = df_yellow \\\n    .withColumnRenamed('tpep_pickup_datetime', 'pickup_datetime') \\\n    .withColumnRenamed('tpep_dropoff_datetime', 'dropoff_datetime')\n\n\ncommon_columns = [\n    'VendorID',\n    'pickup_datetime',\n    'dropoff_datetime',\n    'store_and_fwd_flag',\n    'RatecodeID',\n    'PULocationID',\n    'DOLocationID',\n    'passenger_count',\n    'trip_distance',\n    'fare_amount',\n    'extra',\n    'mta_tax',\n    'tip_amount',\n    'tolls_amount',\n    'improvement_surcharge',\n    'total_amount',\n    'payment_type',\n    'congestion_surcharge'\n]\n\n\n\ndf_green_sel = df_green \\\n    .select(common_columns) \\\n    .withColumn('service_type', F.lit('green'))\n\ndf_yellow_sel = df_yellow \\\n    .select(common_columns) \\\n    .withColumn('service_type', F.lit('yellow'))\n\n\ndf_trips_data = df_green_sel.unionAll(df_yellow_sel)\n\ndf_trips_data.registerTempTable('trips_data')\n\n\ndf_result = spark.sql(\"\"\"\nSELECT \n    -- Revenue grouping \n    PULocationID AS revenue_zone,\n    date_trunc('month', pickup_datetime) AS 
revenue_month, \n    service_type, \n\n    -- Revenue calculation \n    SUM(fare_amount) AS revenue_monthly_fare,\n    SUM(extra) AS revenue_monthly_extra,\n    SUM(mta_tax) AS revenue_monthly_mta_tax,\n    SUM(tip_amount) AS revenue_monthly_tip_amount,\n    SUM(tolls_amount) AS revenue_monthly_tolls_amount,\n    SUM(improvement_surcharge) AS revenue_monthly_improvement_surcharge,\n    SUM(total_amount) AS revenue_monthly_total_amount,\n    SUM(congestion_surcharge) AS revenue_monthly_congestion_surcharge,\n\n    -- Additional calculations\n    AVG(passenger_count) AS avg_monthly_passenger_count,\n    AVG(trip_distance) AS avg_monthly_trip_distance\nFROM\n    trips_data\nGROUP BY\n    1, 2, 3\n\"\"\")\n\n\ndf_result.write.format('bigquery') \\\n    .option('table', output) \\\n    .save()\n    \n\n\n\n"
  },
  {
    "path": "06-batch/code/07_groupby_join.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"4341e0e6\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING: An illegal reflective access operation has occurred\\n\",\n      \"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\\n\",\n      \"WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\\n\",\n      \"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\\n\",\n      \"WARNING: All illegal access operations will be denied in a future release\\n\",\n      \"22/02/18 21:41:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\\n\",\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import pyspark\\n\",\n    \"from pyspark.sql import SparkSession\\n\",\n    \"\\n\",\n    \"spark = SparkSession.builder \\\\\\n\",\n    \"    .master(\\\"local[*]\\\") \\\\\\n\",\n    \"    .appName('test') \\\\\\n\",\n    \"    .getOrCreate()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"cd304aec\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_green = spark.read.parquet('data/pq/green/*/*')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"243991f3\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_green.registerTempTable('green')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"id\": \"e43764a7\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_green_revenue = spark.sql(\\\"\\\"\\\"\\n\",\n    \"SELECT \\n\",\n    \"    date_trunc('hour', lpep_pickup_datetime) AS hour, \\n\",\n    \"    PULocationID AS zone,\\n\",\n    \"\\n\",\n    \"    SUM(total_amount) AS amount,\\n\",\n    \"    COUNT(1) AS number_records\\n\",\n    \"FROM\\n\",\n    \"    green\\n\",\n    \"WHERE\\n\",\n    \"    lpep_pickup_datetime >= '2020-01-01 00:00:00'\\n\",\n    \"GROUP BY\\n\",\n    \"    1, 2\\n\",\n    \"\\\"\\\"\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"id\": \"3e00310e\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_green_revenue 
\\\\\\n\",\n    \"    .repartition(20) \\\\\\n\",\n    \"    .write.parquet('data/report/revenue/green', mode='overwrite')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"id\": \"07ebb68c\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_yellow = spark.read.parquet('data/pq/yellow/*/*')\\n\",\n    \"df_yellow.registerTempTable('yellow')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"id\": \"9d5be29d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_yellow_revenue = spark.sql(\\\"\\\"\\\"\\n\",\n    \"SELECT \\n\",\n    \"    date_trunc('hour', tpep_pickup_datetime) AS hour, \\n\",\n    \"    PULocationID AS zone,\\n\",\n    \"\\n\",\n    \"    SUM(total_amount) AS amount,\\n\",\n    \"    COUNT(1) AS number_records\\n\",\n    \"FROM\\n\",\n    \"    yellow\\n\",\n    \"WHERE\\n\",\n    \"    tpep_pickup_datetime >= '2020-01-01 00:00:00'\\n\",\n    \"GROUP BY\\n\",\n    \"    1, 2\\n\",\n    \"\\\"\\\"\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 27,\n   \"id\": \"8bd9264e\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_yellow_revenue \\\\\\n\",\n    \"    .repartition(20) \\\\\\n\",\n    \"    .write.parquet('data/report/revenue/yellow', mode='overwrite')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 46,\n   \"id\": \"fd5d74d7\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_green_revenue = spark.read.parquet('data/report/revenue/green')\\n\",\n    \"df_yellow_revenue = spark.read.parquet('data/report/revenue/yellow')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 47,\n   \"id\": \"35015ee6\",\n   \"metadata\": 
{},\n   \"outputs\": [],\n   \"source\": [\n    \"df_green_revenue_tmp = df_green_revenue \\\\\\n\",\n    \"    .withColumnRenamed('amount', 'green_amount') \\\\\\n\",\n    \"    .withColumnRenamed('number_records', 'green_number_records')\\n\",\n    \"\\n\",\n    \"df_yellow_revenue_tmp = df_yellow_revenue \\\\\\n\",\n    \"    .withColumnRenamed('amount', 'yellow_amount') \\\\\\n\",\n    \"    .withColumnRenamed('number_records', 'yellow_number_records')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 48,\n   \"id\": \"ec9f34ea\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_join = df_green_revenue_tmp.join(df_yellow_revenue_tmp, on=['hour', 'zone'], how='outer')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 50,\n   \"id\": \"10238be7\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_join.write.parquet('data/report/revenue/total', mode='overwrite')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 51,\n   \"id\": \"c3af7169\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_join = spark.read.parquet('data/report/revenue/total')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 56,\n   \"id\": \"bc2a6680\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"DataFrame[hour: timestamp, zone: int, green_amount: double, green_number_records: bigint, yellow_amount: double, yellow_number_records: bigint]\"\n      ]\n     },\n     \"execution_count\": 56,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df_join\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 54,\n   \"id\": 
\"abb46398\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_zones = spark.read.parquet('zones/')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 57,\n   \"id\": \"b3cf98a5\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_result = df_join.join(df_zones, df_join.zone == df_zones.LocationID)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 62,\n   \"id\": \"5e0614ba\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_result.drop('LocationID', 'zone').write.parquet('tmp/revenue-zones')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"9f5ca913\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.7\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "06-batch/code/08_rdds.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"d66f42fd\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING: An illegal reflective access operation has occurred\\n\",\n      \"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\\n\",\n      \"WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\\n\",\n      \"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\\n\",\n      \"WARNING: All illegal access operations will be denied in a future release\\n\",\n      \"22/02/21 22:25:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\\n\",\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import pyspark\\n\",\n    \"from pyspark.sql import SparkSession\\n\",\n    \"\\n\",\n    \"spark = SparkSession.builder \\\\\\n\",\n    \"    .master(\\\"local[*]\\\") \\\\\\n\",\n    \"    .appName('test') \\\\\\n\",\n    \"    .getOrCreate()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"646fc343\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 0:>                                                          (0 + 1) / 1]\\r\",\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_green = spark.read.parquet('data/pq/green/*/*')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"196cccd5\",\n   \"metadata\": {},\n   \"source\": [\n    \"```\\n\",\n    \"SELECT \\n\",\n    \"    date_trunc('hour', lpep_pickup_datetime) AS hour, \\n\",\n    \"    PULocationID AS zone,\\n\",\n    \"\\n\",\n    \"    SUM(total_amount) AS amount,\\n\",\n    \"    COUNT(1) AS number_records\\n\",\n    \"FROM\\n\",\n    \"    green\\n\",\n    \"WHERE\\n\",\n    \"    lpep_pickup_datetime >= '2020-01-01 00:00:00'\\n\",\n    \"GROUP BY\\n\",\n    \"    1, 2\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"id\": \"74fe52cb\",\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"rdd = df_green \\\\\\n\",\n    \"    .select('lpep_pickup_datetime', 'PULocationID', 'total_amount') \\\\\\n\",\n    \"    .rdd\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"id\": \"1a0bf382\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from datetime import datetime\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"code\",\n   \"execution_count\": 19,\n   \"id\": \"fa2b00f1\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"start = datetime(year=2020, month=1, day=1)\\n\",\n    \"\\n\",\n    \"def filter_outliers(row):\\n\",\n    \"    return row.lpep_pickup_datetime >= start\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"id\": \"69dd326d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"rows = rdd.take(10)\\n\",\n    \"row = rows[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 29,\n   \"id\": \"cd4b7006\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Row(lpep_pickup_datetime=datetime.datetime(2020, 1, 16, 19, 49, 27), PULocationID=260, total_amount=14.3)\"\n      ]\n     },\n     \"execution_count\": 29,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"row\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 31,\n   \"id\": \"d99eb089\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def prepare_for_grouping(row): \\n\",\n    \"    hour = row.lpep_pickup_datetime.replace(minute=0, second=0, microsecond=0)\\n\",\n    \"    zone = row.PULocationID\\n\",\n    \"    key = (hour, zone)\\n\",\n    \"    \\n\",\n    \"    amount = row.total_amount\\n\",\n    \"    count = 1\\n\",\n    \"    value = (amount, count)\\n\",\n    \"\\n\",\n    \"    return (key, value)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 34,\n   \"id\": \"cb328a44\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def calculate_revenue(left_value, right_value):\\n\",\n    \"    left_amount, left_count = left_value\\n\",\n    \"    right_amount, right_count = right_value\\n\",\n    \"    \\n\",\n    \"    output_amount = left_amount + right_amount\\n\",\n    \"    output_count = left_count + 
right_count\\n\",\n    \"    \\n\",\n    \"    return (output_amount, output_count)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 39,\n   \"id\": \"2ea260f1\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from collections import namedtuple\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 40,\n   \"id\": \"7dae6064\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"RevenueRow = namedtuple('RevenueRow', ['hour', 'zone', 'revenue', 'count'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 41,\n   \"id\": \"e0a98ee4\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def unwrap(row):\\n\",\n    \"    return RevenueRow(\\n\",\n    \"        hour=row[0][0], \\n\",\n    \"        zone=row[0][1],\\n\",\n    \"        revenue=row[1][0],\\n\",\n    \"        count=row[1][1]\\n\",\n    \"    )\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 45,\n   \"id\": \"a09200b8\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from pyspark.sql import types\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 46,\n   \"id\": \"5c14d15e\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"result_schema = types.StructType([\\n\",\n    \"    types.StructField('hour', types.TimestampType(), True),\\n\",\n    \"    types.StructField('zone', types.IntegerType(), True),\\n\",\n    \"    types.StructField('revenue', types.DoubleType(), True),\\n\",\n    \"    types.StructField('count', types.IntegerType(), True)\\n\",\n    \"])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 47,\n   \"id\": \"56ea72ff\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_result = rdd \\\\\\n\",\n    \"    .filter(filter_outliers) \\\\\\n\",\n    \"    .map(prepare_for_grouping) \\\\\\n\",\n    \"    .reduceByKey(calculate_revenue) \\\\\\n\",\n    \"  
  .map(unwrap) \\\\\\n\",\n    \"    .toDF(result_schema) \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 50,\n   \"id\": \"4675bd3f\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_result.write.parquet('tmp/green-revenue')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 55,\n   \"id\": \"255b5503\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"columns = ['VendorID', 'lpep_pickup_datetime', 'PULocationID', 'DOLocationID', 'trip_distance']\\n\",\n    \"\\n\",\n    \"duration_rdd = df_green \\\\\\n\",\n    \"    .select(columns) \\\\\\n\",\n    \"    .rdd\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 67,\n   \"id\": \"645c3190\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 68,\n   \"id\": \"921e4ef9\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"rows = duration_rdd.take(10)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 81,\n   \"id\": \"f50db3eb\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = pd.DataFrame(rows, columns=columns)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 74,\n   \"id\": \"5b8ecc53\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"['VendorID',\\n\",\n       \" 'lpep_pickup_datetime',\\n\",\n       \" 'PULocationID',\\n\",\n       \" 'DOLocationID',\\n\",\n       \" 'trip_distance']\"\n      ]\n     },\n     \"execution_count\": 74,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"columns\"\n   ]\n  
},\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 76,\n   \"id\": \"6766c0f8\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"#model = ...\\n\",\n    \"\\n\",\n    \"def model_predict(df):\\n\",\n    \"#     y_pred = model.predict(df)\\n\",\n    \"    y_pred = df.trip_distance * 5\\n\",\n    \"    return y_pred\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 98,\n   \"id\": \"7437b848\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"def apply_model_in_batch(rows):\\n\",\n    \"    df = pd.DataFrame(rows, columns=columns)\\n\",\n    \"    predictions = model_predict(df)\\n\",\n    \"    df['predicted_duration'] = predictions\\n\",\n    \"\\n\",\n    \"    for row in df.itertuples():\\n\",\n    \"        yield row\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 102,\n   \"id\": \"580b5845\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_predicts = duration_rdd \\\\\\n\",\n    \"    .mapPartitions(apply_model_in_batch)\\\\\\n\",\n    \"    .toDF() \\\\\\n\",\n    \"    .drop('Index')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 104,\n   \"id\": \"6055d543\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 48:>                                                         (0 + 1) / 1]\\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+------------------+\\n\",\n      \"|predicted_duration|\\n\",\n      \"+------------------+\\n\",\n      \"|             12.95|\\n\",\n      \"|             31.25|\\n\",\n      \"|              14.0|\\n\",\n     
 \"|             12.75|\\n\",\n      \"|               0.1|\\n\",\n      \"|             11.05|\\n\",\n      \"|11.299999999999999|\\n\",\n      \"|54.349999999999994|\\n\",\n      \"|             15.25|\\n\",\n      \"|             91.75|\\n\",\n      \"|             12.25|\\n\",\n      \"|               3.1|\\n\",\n      \"|               7.5|\\n\",\n      \"|11.899999999999999|\\n\",\n      \"| 78.89999999999999|\\n\",\n      \"|              4.45|\\n\",\n      \"|              23.2|\\n\",\n      \"|              4.85|\\n\",\n      \"|              6.65|\\n\",\n      \"|              15.1|\\n\",\n      \"+------------------+\\n\",\n      \"only showing top 20 rows\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_predicts.select('predicted_duration').show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"9e91d243\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.7\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "06-batch/code/09_spark_gcs.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"3307b886\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pyspark\\n\",\n    \"from pyspark.sql import SparkSession\\n\",\n    \"from pyspark.conf import SparkConf\\n\",\n    \"from pyspark.context import SparkContext\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"9f0ddbff\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"credentials_location = '/home/alexey/.google/credentials/google_credentials.json'\\n\",\n    \"\\n\",\n    \"conf = SparkConf() \\\\\\n\",\n    \"    .setMaster('local[*]') \\\\\\n\",\n    \"    .setAppName('test') \\\\\\n\",\n    \"    .set(\\\"spark.jars\\\", \\\"./lib/gcs-connector-hadoop3-2.2.5.jar\\\") \\\\\\n\",\n    \"    .set(\\\"spark.hadoop.google.cloud.auth.service.account.enable\\\", \\\"true\\\") \\\\\\n\",\n    \"    .set(\\\"spark.hadoop.google.cloud.auth.service.account.json.keyfile\\\", credentials_location)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"b83404e8\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING: An illegal reflective access operation has occurred\\n\",\n      \"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\\n\",\n      \"WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\\n\",\n      \"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\\n\",\n      \"WARNING: All illegal access operations will be denied in a future release\\n\",\n      \"22/03/30 12:25:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your 
platform... using builtin-java classes where applicable\\n\",\n      \"Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\\n\",\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"sc = SparkContext(conf=conf)\\n\",\n    \"\\n\",\n    \"hadoop_conf = sc._jsc.hadoopConfiguration()\\n\",\n    \"\\n\",\n    \"hadoop_conf.set(\\\"fs.AbstractFileSystem.gs.impl\\\",  \\\"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS\\\")\\n\",\n    \"hadoop_conf.set(\\\"fs.gs.impl\\\", \\\"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem\\\")\\n\",\n    \"hadoop_conf.set(\\\"fs.gs.auth.service.account.json.keyfile\\\", credentials_location)\\n\",\n    \"hadoop_conf.set(\\\"fs.gs.auth.service.account.enable\\\", \\\"true\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"id\": \"c4713e2b\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"spark = SparkSession.builder \\\\\\n\",\n    \"    .config(conf=sc.getConf()) \\\\\\n\",\n    \"    .getOrCreate()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"id\": \"1ee1eb1d\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_green = spark.read.parquet('gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/*/*')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"id\": \"104b40ab\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n   
  \"data\": {\n      \"text/plain\": [\n       \"2304517\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df_green.count()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"f56a885d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.7\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "06-batch/code/cloud.md",
    "content": "## Running Spark in the Cloud\n\n### Connecting to Google Cloud Storage \n\nUploading data to GCS:\n\n```bash\ngsutil -m cp -r pq/ gs://dtc_data_lake_de-zoomcamp-nytaxi/pq\n```\n\nDownload the jar for connecting to GCS to any location (e.g. the `lib` folder):\n\n**Note**: For other versions of GCS connector for Hadoop see [Cloud Storage connector ](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#connector-setup-on-non-dataproc-clusters).\n\n```bash\ngsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar ./lib/\n```\n\nSee the notebook with configuration in [09_spark_gcs.ipynb](09_spark_gcs.ipynb)\n\n(Thanks Alvin Do for the instructions!)\n\n\n### Local Cluster and Spark-Submit\n\nCreating a stand-alone cluster ([docs](https://spark.apache.org/docs/latest/spark-standalone.html)):\n\n```bash\n./sbin/start-master.sh\n```\n\nCreating a worker:\n\n```bash\nURL=\"spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077\"\n./sbin/start-slave.sh ${URL}\n\n# for newer versions of spark use that:\n#./sbin/start-worker.sh ${URL}\n```\n\nTurn the notebook into a script:\n\n```bash\njupyter nbconvert --to=script 06_spark_sql.ipynb\n```\n\nEdit the script and then run it:\n\n```bash \npython 06_spark_sql.py \\\n    --input_green=data/pq/green/2020/*/ \\\n    --input_yellow=data/pq/yellow/2020/*/ \\\n    --output=data/report-2020\n```\n\nUse `spark-submit` for running the script on the cluster\n\n```bash\nURL=\"spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077\"\n\nspark-submit \\\n    --master=\"${URL}\" \\\n    06_spark_sql.py \\\n        --input_green=data/pq/green/2021/*/ \\\n        --input_yellow=data/pq/yellow/2021/*/ \\\n        --output=data/report-2021\n```\n\n### Data Proc\n\nUpload the script to GCS:\n\n```bash\ngsutil -m cp -r 06_spark_sql.py gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py\n```\n\nParams for the job:\n\n* 
`--input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2021/*/`\n* `--input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2021/*/`\n* `--output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2021`\n\n\nUse the Google Cloud SDK to submit the job to Dataproc\n([link](https://cloud.google.com/dataproc/docs/guides/submit-job#dataproc-submit-job-gcloud))\n\n```bash\ngcloud dataproc jobs submit pyspark \\\n    --cluster=de-zoomcamp-cluster \\\n    --region=europe-west6 \\\n    gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py \\\n    -- \\\n        --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \\\n        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \\\n        --output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2020\n```\n\n### Big Query\n\nUpload the script to GCS:\n\n```bash\ngsutil -m cp -r 06_spark_sql_big_query.py gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql_big_query.py\n```\n\nWrite results to BigQuery ([docs](https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#pyspark)):\n\n```bash\ngcloud dataproc jobs submit pyspark \\\n    --cluster=de-zoomcamp-cluster \\\n    --region=europe-west6 \\\n    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \\\n    gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql_big_query.py \\\n    -- \\\n        --input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \\\n        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \\\n        --output=trips_data_all.reports-2020\n```\n\nThere can be compatibility issues between the latest Spark version and the BigQuery connector. 
Download links to the jar file for each Spark version can be found at:\n[Spark and BigQuery connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector)\n\n**Note**: Dataproc on GCE 2.1+ images come with the Spark BigQuery connector pre-installed: [Dataproc Release 2.2](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.2). With those images there is no need to include the jar file in the job submission."
  },
  {
    "path": "06-batch/code/download_data.sh",
    "content": "\nset -e\n\nTAXI_TYPE=$1 # \"yellow\"\nYEAR=$2 # 2020\n\nURL_PREFIX=\"https://github.com/DataTalksClub/nyc-tlc-data/releases/download\"\n\nfor MONTH in {1..12}; do\n  FMONTH=`printf \"%02d\" ${MONTH}`\n\n  URL=\"${URL_PREFIX}/${TAXI_TYPE}/${TAXI_TYPE}_tripdata_${YEAR}-${FMONTH}.csv.gz\"\n\n  LOCAL_PREFIX=\"data/raw/${TAXI_TYPE}/${YEAR}/${FMONTH}\"\n  LOCAL_FILE=\"${TAXI_TYPE}_tripdata_${YEAR}_${FMONTH}.csv.gz\"\n  LOCAL_PATH=\"${LOCAL_PREFIX}/${LOCAL_FILE}\"\n\n  echo \"downloading ${URL} to ${LOCAL_PATH}\"\n  mkdir -p ${LOCAL_PREFIX}\n  wget ${URL} -O ${LOCAL_PATH}\n\ndone\n"
  },
  {
    "path": "06-batch/code/homework.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"id\": \"00bc6543\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pyspark\\n\",\n    \"from pyspark.sql import SparkSession\\n\",\n    \"from pyspark.sql import types\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"cd4a0f3d\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"WARNING: An illegal reflective access operation has occurred\\n\",\n      \"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/alexey/spark/spark-3.0.3-bin-hadoop3.2/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)\\n\",\n      \"WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\\n\",\n      \"WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\\n\",\n      \"WARNING: All illegal access operations will be denied in a future release\\n\",\n      \"22/03/07 21:55:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\\n\",\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"spark = SparkSession.builder \\\\\\n\",\n    \"    .master(\\\"local[*]\\\") \\\\\\n\",\n    \"    .appName('test') \\\\\\n\",\n    \"    .getOrCreate()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"eb3e4c36\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"'3.0.3'\"\n      ]\n     },\n     \"execution_count\": 3,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"spark.version\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"id\": \"5236cebd\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"-rw-rw-r-- 1 alexey alexey 700M Oct 29 18:53 fhvhv_tripdata_2021-02.csv\\r\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!ls -lh fhvhv_tripdata_2021-02.csv\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"id\": \"0a3399a3\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"schema = types.StructType([\\n\",\n    \"    types.StructField('hvfhs_license_num', types.StringType(), True),\\n\",\n    \"    types.StructField('dispatching_base_num', types.StringType(), True),\\n\",\n    \"    types.StructField('pickup_datetime', types.TimestampType(), True),\\n\",\n    \"    types.StructField('dropoff_datetime', types.TimestampType(), True),\\n\",\n    \"    types.StructField('PULocationID', types.IntegerType(), True),\\n\",\n    \"    types.StructField('DOLocationID', types.IntegerType(), True),\\n\",\n    \"    types.StructField('SR_Flag', types.StringType(), True)\\n\",\n    \"])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"id\": \"68bc8b72\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = spark.read 
\\\\\\n\",\n    \"    .option(\\\"header\\\", \\\"true\\\") \\\\\\n\",\n    \"    .schema(schema) \\\\\\n\",\n    \"    .csv('fhvhv_tripdata_2021-02.csv')\\n\",\n    \"\\n\",\n    \"df = df.repartition(24)\\n\",\n    \"\\n\",\n    \"df.write.parquet('data/pq/fhvhv/2021/02/')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"id\": \"58989b55\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 0:>                                                          (0 + 1) / 1]\\r\",\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df = spark.read.parquet('data/pq/fhvhv/2021/02/')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"48b01d2f\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Q3**: How many taxi trips were there on February 15?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"id\": \"f7489aea\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from pyspark.sql import functions as F\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 24,\n   \"id\": \"6c2500fd\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"367170\"\n      ]\n     },\n     \"execution_count\": 24,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df \\\\\\n\",\n    \"    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \\\\\\n\",\n    \"    .filter(\\\"pickup_date = '2021-02-15'\\\") \\\\\\n\",\n    \"    .count()\"\n   ]\n  },\n  {\n  
 \"cell_type\": \"code\",\n   \"execution_count\": 25,\n   \"id\": \"dd7ae60d\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df.registerTempTable('fhvhv_2021_02')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 28,\n   \"id\": \"6d47c147\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 20:>                                                         (0 + 4) / 4]\\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+--------+\\n\",\n      \"|count(1)|\\n\",\n      \"+--------+\\n\",\n      \"|  367170|\\n\",\n      \"+--------+\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 20:==============>                                           (1 + 3) / 4]\\r\",\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"spark.sql(\\\"\\\"\\\"\\n\",\n    \"SELECT\\n\",\n    \"    COUNT(1)\\n\",\n    \"FROM \\n\",\n    \"    fhvhv_2021_02\\n\",\n    \"WHERE\\n\",\n    \"    to_date(pickup_datetime) = '2021-02-15';\\n\",\n    \"\\\"\\\"\\\").show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"ae3f533b\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Q4**: Longest trip for each day\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 29,\n   \"id\": \"7befe422\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"['hvfhs_license_num',\\n\",\n       \" 'dispatching_base_num',\\n\",\n       \" 'pickup_datetime',\\n\",\n       \" 'dropoff_datetime',\\n\",\n       \" 'PULocationID',\\n\",\n       \" 'DOLocationID',\\n\",\n       \" 'SR_Flag']\"\n      ]\n     },\n   
  \"execution_count\": 29,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.columns\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 36,\n   \"id\": \"279d9161\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[Stage 37:==============>                                           (1 + 3) / 4]\\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+-----------+-------------+\\n\",\n      \"|pickup_date|max(duration)|\\n\",\n      \"+-----------+-------------+\\n\",\n      \"| 2021-02-11|        75540|\\n\",\n      \"| 2021-02-17|        57221|\\n\",\n      \"| 2021-02-20|        44039|\\n\",\n      \"| 2021-02-03|        40653|\\n\",\n      \"| 2021-02-19|        37577|\\n\",\n      \"+-----------+-------------+\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 38:==================================================>   (187 + 4) / 200]\\r\",\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df \\\\\\n\",\n    \"    .withColumn('duration', df.dropoff_datetime.cast('long') - df.pickup_datetime.cast('long')) \\\\\\n\",\n    \"    .withColumn('pickup_date', F.to_date(df.pickup_datetime)) \\\\\\n\",\n    \"    .groupBy('pickup_date') \\\\\\n\",\n    \"        .max('duration') \\\\\\n\",\n    \"    .orderBy('max(duration)', ascending=False) \\\\\\n\",\n    \"    .limit(5) \\\\\\n\",\n    \"    .show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 38,\n   \"id\": \"74cf0e8b\",\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": 
\"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 43:>                                                         (0 + 4) / 4]\\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+-----------+-----------------+\\n\",\n      \"|pickup_date|         duration|\\n\",\n      \"+-----------+-----------------+\\n\",\n      \"| 2021-02-11|           1259.0|\\n\",\n      \"| 2021-02-17|953.6833333333333|\\n\",\n      \"| 2021-02-20|733.9833333333333|\\n\",\n      \"| 2021-02-03|           677.55|\\n\",\n      \"| 2021-02-19|626.2833333333333|\\n\",\n      \"| 2021-02-25|            583.5|\\n\",\n      \"| 2021-02-18|576.8666666666667|\\n\",\n      \"| 2021-02-10|569.4833333333333|\\n\",\n      \"| 2021-02-21|           537.05|\\n\",\n      \"| 2021-02-09|534.7833333333333|\\n\",\n      \"+-----------+-----------------+\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 44:================================================>     (180 + 4) / 200]\\r\",\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"spark.sql(\\\"\\\"\\\"\\n\",\n    \"SELECT\\n\",\n    \"    to_date(pickup_datetime) AS pickup_date,\\n\",\n    \"    MAX((CAST(dropoff_datetime AS LONG) - CAST(pickup_datetime AS LONG)) / 60) AS duration\\n\",\n    \"FROM \\n\",\n    \"    fhvhv_2021_02\\n\",\n    \"GROUP BY\\n\",\n    \"    1\\n\",\n    \"ORDER BY\\n\",\n    \"    2 DESC\\n\",\n    \"LIMIT 10;\\n\",\n    \"\\\"\\\"\\\").show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"d915096b\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Q5**: Most frequent `dispatching_base_num`\\n\",\n    \"\\n\",\n    \"How many stages this spark job has?\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": 44,\n   \"id\": \"25816aa2\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 73:>                                                         (0 + 4) / 4]\\r\",\n      \"\\r\",\n      \"[Stage 73:==============>                                           (1 + 3) / 4]\\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+--------------------+--------+\\n\",\n      \"|dispatching_base_num|count(1)|\\n\",\n      \"+--------------------+--------+\\n\",\n      \"|              B02510| 3233664|\\n\",\n      \"|              B02764|  965568|\\n\",\n      \"|              B02872|  882689|\\n\",\n      \"|              B02875|  685390|\\n\",\n      \"|              B02765|  559768|\\n\",\n      \"+--------------------+--------+\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 74:===================================================>  (189 + 5) / 200]\\r\",\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"spark.sql(\\\"\\\"\\\"\\n\",\n    \"SELECT\\n\",\n    \"    dispatching_base_num,\\n\",\n    \"    COUNT(1)\\n\",\n    \"FROM \\n\",\n    \"    fhvhv_2021_02\\n\",\n    \"GROUP BY\\n\",\n    \"    1\\n\",\n    \"ORDER BY\\n\",\n    \"    2 DESC\\n\",\n    \"LIMIT 5;\\n\",\n    \"\\\"\\\"\\\").show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 46,\n   \"id\": \"a78f9fe3\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 86:>                                                         (0 + 4) / 4]\\r\",\n      \"\\r\",\n      \"[Stage 
86:=============================>                            (2 + 2) / 4]\\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+--------------------+-------+\\n\",\n      \"|dispatching_base_num|  count|\\n\",\n      \"+--------------------+-------+\\n\",\n      \"|              B02510|3233664|\\n\",\n      \"|              B02764| 965568|\\n\",\n      \"|              B02872| 882689|\\n\",\n      \"|              B02875| 685390|\\n\",\n      \"|              B02765| 559768|\\n\",\n      \"+--------------------+-------+\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"[Stage 87:===========================================>          (161 + 5) / 200]\\r\",\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df \\\\\\n\",\n    \"    .groupBy('dispatching_base_num') \\\\\\n\",\n    \"        .count() \\\\\\n\",\n    \"    .orderBy('count', ascending=False) \\\\\\n\",\n    \"    .limit(5) \\\\\\n\",\n    \"    .show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"0d10173a\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Q6**: Most common locations pair\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 47,\n   \"id\": \"74b7f664\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_zones = spark.read.parquet('zones')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 49,\n   \"id\": \"81642d3b\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"['LocationID', 'Borough', 'Zone', 'service_zone']\"\n      ]\n     },\n     \"execution_count\": 49,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df_zones.columns\"\n  
 ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 51,\n   \"id\": \"4f460dda\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"['hvfhs_license_num',\\n\",\n       \" 'dispatching_base_num',\\n\",\n       \" 'pickup_datetime',\\n\",\n       \" 'dropoff_datetime',\\n\",\n       \" 'PULocationID',\\n\",\n       \" 'DOLocationID',\\n\",\n       \" 'SR_Flag']\"\n      ]\n     },\n     \"execution_count\": 51,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.columns\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 50,\n   \"id\": \"ad8f0101\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df_zones.registerTempTable('zones')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 57,\n   \"id\": \"6f738414\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"[Stage 103:==============================================>      (176 + 4) / 200]\\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+--------------------+--------+\\n\",\n      \"|          pu_do_pair|count(1)|\\n\",\n      \"+--------------------+--------+\\n\",\n      \"|East New York / E...|   45041|\\n\",\n      \"|Borough Park / Bo...|   37329|\\n\",\n      \"| Canarsie / Canarsie|   28026|\\n\",\n      \"|Crown Heights Nor...|   25976|\\n\",\n      \"|Bay Ridge / Bay R...|   17934|\\n\",\n      \"+--------------------+--------+\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"\\r\",\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"spark.sql(\\\"\\\"\\\"\\n\",\n    \"SELECT\\n\",\n    \"    
CONCAT(pul.Zone, ' / ', dol.Zone) AS pu_do_pair,\\n\",\n    \"    COUNT(1)\\n\",\n    \"FROM \\n\",\n    \"    fhvhv_2021_02 fhv LEFT JOIN zones pul ON fhv.PULocationID = pul.LocationID\\n\",\n    \"                      LEFT JOIN zones dol ON fhv.DOLocationID = dol.LocationID\\n\",\n    \"GROUP BY \\n\",\n    \"    1\\n\",\n    \"ORDER BY\\n\",\n    \"    2 DESC\\n\",\n    \"LIMIT 5;\\n\",\n    \"\\\"\\\"\\\").show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"e4b754d1\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.9.7\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "06-batch/setup/config/core-site.xml",
    "content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<?xml-stylesheet type=\"text/xsl\" href=\"configuration.xsl\"?>\n\n<configuration>\n  <property>\n    <name>fs.AbstractFileSystem.gs.impl</name>\n    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>\n  </property>\n  <property>\n    <name>fs.gs.impl</name>\n    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>\n  </property>\n  <property>\n    <name>fs.gs.auth.service.account.json.keyfile</name>\n    <value>/home/alexey/.google/credentials/google_credentials.json</value>\n  </property>\n  <property>\n    <name>fs.gs.auth.service.account.enable</name>\n    <value>true</value>\n  </property>\n</configuration>"
  },
  {
    "path": "06-batch/setup/config/spark-defaults.conf",
    "content": "spark-master    yarn\nspark.hadoop.google.cloud.auth.service.account.enable        true\nspark.hadoop.google.cloud.auth.service.account.json.keyfile  /home/alexey\n"
  },
  {
    "path": "06-batch/setup/config/spark.dockerfile",
    "content": "FROM library/openjdk:11"
  },
  {
    "path": "06-batch/setup/hadoop-yarn.md",
    "content": "## Spark on YARN \n\nFor the Spark and Docker module, we need YARN, which\ncomes together with Hadoop. So we need to install Hadoop\n\nIn this document, we'll assume you use Linux. For Windows, use WSL. It should work (supposedly) on MacOS as well. \n\nWe'll need to run it in a pseudo-distributed mode.\n\n\n### Configuring ssh\n\nYou need to run be able to `ssh` to your localhost without having to type any password. In other words, you execute \n\n```bash\nssh localhost\n```\n\nAnd you get ssh access. \n\nIf you don't have it, add your `id_rsa.pub` key to the list of keys authorized to access your computer:\n\n```bash\ncat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys\nchmod 0600 ~/.ssh/authorized_keys\n```\n\n(This assumes you already have `id_rsa.pub` in `~/.ssh`)\n\nOn WSL, you may need to start the ssh service:\n\n```bash\nsudo service ssh start\n```\n\n### Download Hadoop binaries\n\nWe use Spark that expects Hadoop 3.2 version. So we'll install it.\n\nGo to the [Hadoop's website](https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz) to get the closest mirror. 
And then download it:\n\n```bash\nwget https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz\n```\n\nUnpack it and go to this directory\n\n```bash\ntar xzfv hadoop-3.2.3.tar.gz\ncd hadoop-3.2.3/\n```\n\n\n### YARN on a Single Node\n\nSet `JAVA_HOME` in `etc/hadoop/hadoop-env.sh`:\n\n```bash\necho \"export JAVA_HOME=${JAVA_HOME}\" >> etc/hadoop/hadoop-env.sh\n```\n\nStart YARN\n\n```bash\n./sbin/start-yarn.sh\n```\n\nYARN should work on port 8088: http://localhost:8088/\n\n\n### Running Spark on YARN\n\nFor submitting spark jobs, we'll need to use `master=\"yarn\"`.\n\nSpark needs to know where to look for YARN config files, so we need to set it:\n\n\n```bash\nexport HADOOP_HOME=\"${HOME}/spark/hadoop-3.2.3\"\nexport YARN_CONF_DIR=\"${HADOOP_HOME}/etc/hadoop\"\n```\n\nThen run Jupyter or use spark-submit.\n\n\n### Connecting Spark and YARN to GCS\n\nDownload the GCS connector:\n\n```bash\ngsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar .\n```\n\nConfig changes:\n\n* Change `${SPARK_HOME}/conf/spark-defaults.conf` (see [here]())\n* Change `${YARN_CONF_DIR}/core-site.xml` (see [here](config/core-site.xml))\n\nTemplate for hadoop properties:\n\n```xml\n  <property>\n    <name></name>\n    <value></value>\n  </property>\n```\n\n### Spark and YARN with Docker\n\nCopy the config from [here](https://hadoop.apache.org/docs/r3.2.3/hadoop-yarn/hadoop-yarn-site/DockerContainers.html)\n\nRunning spark-submit:\n\n```bash\nMOUNTS=\"$HADOOP_HOME:$HADOOP_HOME:ro,/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro\"\nIMAGE_ID=\"pyspark-docker:test\"\n\nspark-submit \\\n    --master yarn \\\n    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \\\n    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE_ID} \\\n    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \\\n    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${IMAGE_ID} \\\n    06_spark_sql.py \\\n        
--input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2021/*/ \\\n        --input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2021/*/ \\\n        --output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2021\n```\n\n\n\n### Sources\n\n* https://hadoop.apache.org/docs/r3.2.3/hadoop-project-dist/hadoop-common/SingleCluster.html\n* https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration\n"
  },
  {
    "path": "06-batch/setup/linux.md",
    "content": "\n## Linux\n\nHere we'll show you how to install Spark 4.x for Linux.\nWe tested it on Ubuntu 24.04 (also WSL), but it should work\nfor other Linux distros as well\n\n\n### Installing Java\n\nSpark 4.x requires Java 17 or 21. The simplest way is to install it via your package manager:\n\n```bash\nsudo apt update\nsudo apt install default-jdk\n```\n\nCheck that it works:\n\n```bash\njava --version\n```\n\nOutput (example):\n\n```\nopenjdk 21.0.10 2026-01-20\nOpenJDK Runtime Environment (build 21.0.10+7-Ubuntu-124.04)\nOpenJDK 64-Bit Server VM (build 21.0.10+7-Ubuntu-124.04, mixed mode, sharing)\n```\n\nSet `JAVA_HOME` (add to your `.bashrc` or `.zshrc`):\n\n```bash\nexport JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))\nexport PATH=\"${JAVA_HOME}/bin:${PATH}\"\n```\n\n\n### PySpark\n\nWe recommend using [uv](https://docs.astral.sh/uv/) for managing Python packages:\n\n```bash\nuv init\nuv add pyspark\n```\n\nThen run your scripts with `uv run`:\n\n```bash\nuv run python your_script.py\n```\n\nAlternatively, you can use pip:\n\n```bash\npip install pyspark\n```\n\nBoth approaches install PySpark along with a bundled Spark distribution - no separate Spark download needed.\n\n\n### Testing it\n\nCreate a test script `test_spark.py`:\n\n```python\nimport pyspark\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder \\\n    .master(\"local[*]\") \\\n    .appName('test') \\\n    .getOrCreate()\n\nprint(f\"Spark version: {spark.version}\")\n\ndf = spark.range(10)\ndf.show()\n\nspark.stop()\n```\n\nRun it:\n\n```bash\nuv run python test_spark.py\n```\n\n"
  },
  {
    "path": "06-batch/setup/macos.md",
    "content": "\n## MacOS\n\nHere we'll show you how to install Spark 4.x for macOS.\nWe tested it on macOS 15 (Sequoia), but it should work\nfor other versions as well.\n\n\n### Installing Java\n\nSpark 4.x requires Java 17. Ensure [Homebrew](https://brew.sh/) is installed, then install OpenJDK 17:\n\n```bash\nbrew install openjdk@17\n```\n\nAdd the following environment variables to your `.zshrc` (or `.bash_profile`):\n\n```bash\nexport JAVA_HOME=$(brew --prefix openjdk@17)\nexport PATH=\"$JAVA_HOME/bin:$PATH\"\n```\n\nCheck that Java works correctly:\n\n```bash\njava --version\n```\n\nOutput (example):\n\n```\nopenjdk 17.0.14 2026-01-21\nOpenJDK Runtime Environment Homebrew (build 17.0.14+0)\nOpenJDK 64-Bit Server VM Homebrew (build 17.0.14+0, mixed mode, sharing)\n```\n\n\n### PySpark\n\nWe recommend using [uv](https://docs.astral.sh/uv/) for managing Python packages:\n\n```bash\nuv init\nuv add pyspark\n```\n\nThen run your scripts with `uv run`:\n\n```bash\nuv run python your_script.py\n```\n\nAlternatively, you can use pip:\n\n```bash\npip install pyspark\n```\n\nBoth approaches install PySpark along with a bundled Spark distribution — no separate Spark download needed.\n\n> If you previously installed Spark 3.x and have `SPARK_HOME` set in your `.zshrc` or `.bash_profile` (e.g. pointing to a local Spark directory), remove that line. PySpark 4.x bundles its own Spark, so `SPARK_HOME` is no longer needed. 
If the old `SPARK_HOME` is still set, PySpark 4.x will load the old JARs and fail.\n\n\n### Testing it\n\nCreate a test script `test_spark.py`:\n\n```python\nimport pyspark\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder \\\n    .master(\"local[*]\") \\\n    .appName('test') \\\n    .getOrCreate()\n\nprint(f\"Spark version: {spark.version}\")\n\ndf = spark.range(10)\ndf.show()\n\nspark.stop()\n```\n\nRun it:\n\n```bash\nuv run python test_spark.py\n```\n\nYou may see a warning like `WARNING: Using incubator modules: jdk.incubator.vector` — you can safely ignore it.\n"
  },
  {
    "path": "06-batch/setup/windows.md",
    "content": "## Windows\n\nHere we'll show you how to install Spark 4.x for Windows.\nWe tested it on Windows 10 and 11, but it should work\nfor other versions as well.\n\nIn this tutorial, we'll use [MINGW](https://www.mingw-w64.org/)/[Git Bash](https://gitforwindows.org/) for the command line.\n\nIf you use WSL, follow the instructions from [linux.md](linux.md).\n\n\n### Installing Java\n\nSpark 4.x requires Java 17. Download and unpack the Adoptium JDK 17:\n\n```bash\nwget https://github.com/adoptium/temurin17-binaries/releases/download/jdk-17.0.18%2B8/OpenJDK17U-jdk_x64_windows_hotspot_17.0.18_8.zip\nunzip OpenJDK17U-jdk_x64_windows_hotspot_17.0.18_8.zip -d /c/tools/\n```\n\nThe full path to JDK will be `/c/tools/jdk-17.0.18+8`.\n\nNow let's configure it and add it to `PATH` (add to your `.bashrc`):\n\n```bash\nexport JAVA_HOME=\"/c/tools/jdk-17.0.18+8\"\nexport PATH=\"${JAVA_HOME}/bin:${PATH}\"\n```\n\nCheck that Java works correctly:\n\n```bash\njava --version\n```\n\nOutput:\n\n```\nopenjdk 17.0.18 2026-01-20 LTS\nOpenJDK Runtime Environment Temurin-17.0.18+8 (build 17.0.18+8-LTS)\nOpenJDK 64-Bit Server VM Temurin-17.0.18+8 (build 17.0.18+8-LTS, mixed mode, sharing)\n```\n\n\n### PySpark\n\nWe recommend using [uv](https://docs.astral.sh/uv/) for managing Python packages:\n\n```bash\nuv init\nuv add pyspark\n```\n\nThen run your scripts with `uv run`:\n\n```bash\nuv run python your_script.py\n```\n\nAlternatively, you can use pip:\n\n```bash\npip install pyspark\n```\n\nBoth approaches install PySpark along with a bundled Spark distribution — no separate Spark or Hadoop download needed.\n\n> If you previously installed Spark 3.x and have `SPARK_HOME` set in your `.bashrc` (e.g. pointing to `C:/tools/spark-3.3.2-bin-hadoop3`), remove that line. PySpark 4.x bundles its own Spark, so `SPARK_HOME` is no longer needed. 
If the old `SPARK_HOME` is still set, PySpark 4.x will load the old JARs and fail.\n\n\n### Testing it\n\nCreate a test script `test_spark.py`:\n\n```python\nimport pyspark\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder \\\n    .master(\"local[*]\") \\\n    .appName('test') \\\n    .getOrCreate()\n\nprint(f\"Spark version: {spark.version}\")\n\ndf = spark.range(10)\ndf.show()\n\nspark.stop()\n```\n\nRun it:\n\n```bash\nuv run python test_spark.py\n```\n\nAt this point you may get a message from Windows Firewall — allow it.\n\nYou may see a warning like `WARNING: Using incubator modules: jdk.incubator.vector` — you can safely ignore it.\n\n"
  },
  {
    "path": "07-streaming/.gitignore",
    "content": "week6_venv"
  },
  {
    "path": "07-streaming/README.md",
    "content": "# Module 7: Stream Processing\n\nVideo: https://www.youtube.com/live/YDUgFeHQzJU\n\n- [PyFlink workshop](workshop/) - build a real-time streaming pipeline step by step (Redpanda, Python, Flink, PostgreSQL)\n- [Homework](../cohorts/2026/07-streaming/homework.md)\n- [Kafka theory](theory/) - video lectures on Kafka concepts with Java code examples (optional)\n- [Extras](extras/) - supplementary Python and PyFlink examples from previous years (optional)\n\n\n## Community notes\n\n<details>\n<summary>Did you take notes? You can share them here</summary>\n\n* [Notes by Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/6_streaming.md )\n* [Marcos Torregrosa's blog (spanish)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-6-stream-processing/)\n* [Notes by Oscar Garcia](https://github.com/ozkary/Data-Engineering-Bootcamp/tree/main/Step6-Streaming)\n* [Notes by Shayan Shafiee Moghadam](https://github.com/shayansm2/eng-notebook/blob/main/kafka/readme.md)\n* Add your notes here (above this line)\n</details>\n"
  },
  {
    "path": "07-streaming/extras/README.md",
    "content": "# Supplementary streaming examples\n\nAdditional stream processing examples from previous course years. These are\nnot part of the main workshop but may be useful as reference material.\n\n\n## python/\n\nPython Kafka examples by Irem Erturk, using various libraries.\n\n- [json_example/](python/json_example) - producer and consumer using\n  `kafka-python` with JSON serialization\n- [avro_example/](python/avro_example) - producer and consumer using\n  `confluent-kafka` with Avro serialization and Schema Registry\n- [redpanda_example/](python/redpanda_example) - same as the JSON example\n  but running against Redpanda instead of Kafka, with a local\n  docker-compose setup\n- [streams-example/faust/](python/streams-example/faust) - stream processing\n  with [Faust](https://faust-streaming.github.io/faust/), a Python library\n  for Kafka Streams. Includes windowing, branching, and counting examples.\n- [streams-example/pyspark/](python/streams-example/pyspark) - Spark\n  Structured Streaming consuming from Kafka, with a Jupyter notebook\n- [streams-example/redpanda/](python/streams-example/redpanda) - same as\n  the PySpark example but using Redpanda as the broker\n- [docker/](python/docker) - Docker Compose files for running Kafka and\n  Spark clusters locally\n- [resources/](python/resources) - sample data (rides.csv) and Avro schemas\n\n\n## pyflink/\n\nPyFlink workshop by Irem Erturk. Uses Apache Flink 1.x with a\nMakefile-based workflow, PostgreSQL sink, and Docker Compose setup. The\n[2025 stream with Zach Wilson](https://www.youtube.com/watch?v=P2loELMUUeI)\nwas rewritten into the current [2026 workshop](../workshop/) by Alexey,\nusing Flink 2.2, uv, and a step-by-step README.\n\n\n## ksqldb/\n\n[commands.md](ksqldb/commands.md) - example ksqlDB queries for creating\nstreams, filtering, grouping, and windowed aggregations over Kafka topics.\nCompanion to the [ksqlDB and Connect video](../theory/#kafka-streams) in\nthe theory section.\n"
  },
  {
    "path": "07-streaming/extras/ksqldb/commands.md",
    "content": "## KSQL DB Examples\n### Create streams\n```sql\nCREATE STREAM ride_streams (\n    VendorId varchar, \n    trip_distance double,\n    payment_type varchar\n)  WITH (KAFKA_TOPIC='rides',\n        VALUE_FORMAT='JSON');\n```\n\n### Query stream\n```sql\nselect * from RIDE_STREAMS \nEMIT CHANGES;\n```\n\n### Query stream count\n```sql\nSELECT VENDORID, count(*) FROM RIDE_STREAMS \nGROUP BY VENDORID\nEMIT CHANGES;\n```\n\n### Query stream with filters\n```sql\nSELECT payment_type, count(*) FROM RIDE_STREAMS \nWHERE payment_type IN ('1', '2')\nGROUP BY payment_type\nEMIT CHANGES;\n```\n\n### Query stream with window functions\n```sql\nCREATE TABLE payment_type_sessions AS\n  SELECT payment_type,\n         count(*)\n  FROM  RIDE_STREAMS \n  WINDOW SESSION (60 SECONDS)\n  GROUP BY payment_type\n  EMIT CHANGES;\n```\n\n## KSQL documentation for details\n[KSQL DB Documentation](https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/quick-reference/)\n\n[KSQL DB Java client](https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-clients/java-client/)"
  },
  {
    "path": "07-streaming/extras/pyflink/.gitignore",
    "content": "data/\npostgres-data\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\npip-wheel-metadata/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n.python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# PEP 582; used by e.g. 
github.com/David-OConnor/pyflow\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\ndump.sql\n\n# Personal workspace files\n.idea/*\n.vscode/*"
  },
  {
    "path": "07-streaming/extras/pyflink/Dockerfile.flink",
    "content": "FROM --platform=linux/amd64 flink:1.16.0-scala_2.12-java8\n\n# install python3: it has updated Python to 3.9 in Debian 11 and so install Python 3.7 from source\n# it currently only supports Python 3.6, 3.7 and 3.8 in PyFlink officially.\n\n# ref: https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker\n\nRUN apt-get update -y && \\\n    apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libffi-dev liblzma-dev && \\\n    wget https://www.python.org/ftp/python/3.7.9/Python-3.7.9.tgz && \\\n    tar -xvf Python-3.7.9.tgz && \\\n    cd Python-3.7.9 && \\\n    ./configure --without-tests --enable-shared && \\\n    make -j6 && \\\n    make install && \\\n    ldconfig /usr/local/lib && \\\n    cd .. && rm -f Python-3.7.9.tgz && rm -rf Python-3.7.9 && \\\n    ln -s /usr/local/bin/python3 /usr/local/bin/python && \\\n    apt-get clean && \\\n    rm -rf /var/lib/apt/lists/*\n\n# install PyFlink\nCOPY requirements.txt .\nRUN python -m pip install --upgrade pip; \\\n    pip3 install --upgrade google-api-python-client; \\\n    pip3 install -r requirements.txt  --no-cache-dir;\n\n# Download connector libraries\nRUN wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/1.16.0/flink-json-1.16.0.jar; \\\n    wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka/1.16.0/flink-sql-connector-kafka-1.16.0.jar; \\\n    wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc/1.16.0/flink-connector-jdbc-1.16.0.jar; \\\n    wget -P /opt/flink/lib/ https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.24/postgresql-42.2.24.jar;\n\nRUN echo \"taskmanager.memory.jvm-metaspace.size: 512m\" >> /opt/flink/conf/flink-conf.yaml;\n\nWORKDIR /opt/flink\n"
  },
  {
    "path": "07-streaming/extras/pyflink/LICENSE",
    "content": "MIT License\n\nCopyright (c) 2025 Sreela Das, Julie Scherer, Zach Wilson\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "07-streaming/extras/pyflink/Makefile",
    "content": "PLATFORM ?= linux/amd64\n\n# COLORS\nGREEN  := $(shell tput -Txterm setaf 2)\nYELLOW := $(shell tput -Txterm setaf 3)\nWHITE  := $(shell tput -Txterm setaf 7)\nRESET  := $(shell tput -Txterm sgr0)\n\n\nTARGET_MAX_CHAR_NUM=20\n\n## Show help with `make help`\nhelp:\n\t@echo ''\n\t@echo 'Usage:'\n\t@echo '  ${YELLOW}make${RESET} ${GREEN}<target>${RESET}'\n\t@echo ''\n\t@echo 'Targets:'\n\t@awk '/^[a-zA-Z\\-\\_0-9]+:/ { \\\n\t\thelpMessage = match(lastLine, /^## (.*)/); \\\n\t\tif (helpMessage) { \\\n\t\t\thelpCommand = substr($$1, 0, index($$1, \":\")-1); \\\n\t\t\thelpMessage = substr(lastLine, RSTART + 3, RLENGTH); \\\n\t\t\tprintf \"  ${YELLOW}%-$(TARGET_MAX_CHAR_NUM)s${RESET} ${GREEN}%s${RESET}\\n\", helpCommand, helpMessage; \\\n\t\t} \\\n\t} \\\n\t{ lastLine = $$0 }' $(MAKEFILE_LIST)\n\n.PHONY: build\n## Builds the Flink base image with pyFlink and connectors installed\nbuild:\n\tdocker build .\n\n.PHONY: up\n## Builds the base Docker image and starts Flink cluster\nup:\n\tdocker compose up --build --remove-orphans  -d\n\n.PHONY: down\n## Shuts down the Flink cluster\ndown:\n\tdocker compose down --remove-orphans\n\n.PHONY: job\n## Submit the Flink job\njob:\n\tdocker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d\n\naggregation_job:\n\tdocker compose exec jobmanager ./bin/flink run -py /opt/src/job/aggregation_job.py --pyFiles /opt/src -d\n\n.PHONY: stop\n## Stops all services in Docker compose\nstop:\n\tdocker compose stop\n\n.PHONY: start\n## Starts all services in Docker compose\nstart:\n\tdocker compose start\n"
  },
  {
    "path": "07-streaming/extras/pyflink/README.md",
    "content": "# Apache Flink Training\nApache Flink Streaming Pipelines\n\n## :pushpin: Getting started \n\n### :whale: Installations\n\nTo run this repo, the following components will need to be installed:\n\n1. [Docker](https://docs.docker.com/get-docker/) (required)\n2. [Docker compose](https://docs.docker.com/compose/install/#installation-scenarios) (required)\n3. Make (recommended) -- see below\n    - On most Linux distributions and macOS, `make` is typically pre-installed by default. To check if `make` is installed on your system, you can run the `make --version` command in your terminal or command prompt. If it's installed, it will display the version information. \n    - Otherwise, you can try following the instructions below, or you can just copy+paste the commands from the `Makefile` into your terminal or command prompt and run manually.\n\n        ```bash\n        # On Ubuntu or Debian:\n        sudo apt-get update\n        sudo apt-get install build-essential\n\n        # On CentOS or Fedora:\n        sudo dnf install make\n\n        # On macOS:\n        xcode-select --install\n\n        # On windows:\n        choco install make # uses Chocolatey, https://chocolatey.org/install\n        ```\n\n### :computer: Local setup\n\nMake sure you're in the `pyflick` folder:\n\n```bash\ncd 07-streaming/pyflink\n```\n\n## :boom: Running the pipeline\n\n1. Build the Docker image and deploy the services in the `docker-compose.yml` file, including the PostgreSQL database and Flink cluster. This will (should) also create the sink table, `processed_events`, where Flink will write the Kafka messages to.\n\n    ```bash\n    make up\n\n    #// if you dont have make, you can run:\n    # docker compose up --build --remove-orphans  -d\n    ```\n\n    **:star: Wait until the Flink UI is running at [http://localhost:8081/](http://localhost:8081/) before proceeding to the next step.** _Note the first time you build the Docker image it can take anywhere from 5 to 30 minutes. 
Future builds should only take a few seconds, assuming you haven't deleted the image since._\n\n    :information_source: After the image is built, Docker will automatically start up the job manager and task manager services. This will take a minute or so. Check the container logs in Docker Desktop and when you see the line below, you know you're good to move onto the next step.\n\n    ```\n    taskmanager Successful registration at resource manager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_* under registration id <id_number>\n    ```\n\n2. Now that the Flink cluster is up and running, it's time to finally run the PyFlink job! :smile:\n\n    ```bash\n    make job\n\n    #// if you don't have make, you can run:\n    # docker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d\n    ```\n\n    After about a minute, you should see a prompt that the job's been submitted (e.g., `Job has been submitted with JobID <job_id_number>`). Now go back to the [Flink UI](http://localhost:8081/#/job/running) to see the job running! :tada:\n\n\n3. When you're done, you can stop and/or clean up the Docker resources by running the commands below.\n\n    ```bash\n    make stop # to stop running services in docker compose\n    make down # to stop and remove docker compose services\n    ```\n\n    :grey_exclamation: Note the `/var/lib/postgresql/data` directory inside the PostgreSQL container is mounted to the `./postgres-data` directory on your local machine. 
This means the data will persist across container restarts or removals, so even if you stop/remove the container, you won't lose any data written within the container.\n\n------\n\n:information_source: To see all the available make commands and what they do, run:\n\n```bash\nmake help\n```\n\nAt the time of writing, the documented commands in the `Makefile` are:\n\n```bash\n\nUsage:\n  make <target>\n\nTargets:\n  help                 Show help with `make help`\n  build                Builds the Flink base image with pyFlink and connectors installed\n  up                   Builds the base Docker image and starts Flink cluster\n  down                 Shuts down the Flink cluster\n  job                  Submit the Flink job\n  stop                 Stops all services in Docker compose\n  start                Starts all services in Docker compose\n```\n"
  },
  {
    "path": "07-streaming/extras/pyflink/docker-compose.yml",
    "content": "version: \"3.9\"\nservices:\n  redpanda-1:\n    image: redpandadata/redpanda:v24.2.18\n    container_name: redpanda-1\n    command:\n      - redpanda\n      - start\n      - --smp\n      - '1'\n      - --reserve-memory\n      - 0M\n      - --overprovisioned\n      - --node-id\n      - '1'\n      - --kafka-addr\n      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092\n      - --advertise-kafka-addr\n      - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092\n      - --pandaproxy-addr\n      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082\n      - --advertise-pandaproxy-addr\n      - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082\n      - --rpc-addr\n      - 0.0.0.0:33145\n      - --advertise-rpc-addr\n      - redpanda-1:33145\n    ports:\n      # - 8081:8081\n      - 8082:8082\n      - 9092:9092\n      - 28082:28082\n      - 29092:29092\n\n  jobmanager:\n    build:\n      context: .\n      dockerfile: ./Dockerfile.flink\n    image: pyflink:1.16.0\n    container_name: \"flink-jobmanager\"\n    pull_policy: never\n    platform: \"linux/amd64\"\n    hostname: \"jobmanager\"\n    expose:\n      - \"6123\"\n    ports:\n      - \"8081:8081\"\n    volumes:\n      - ./:/opt/flink/usrlib\n      - ./keys/:/var/private/ssl/\n      - ./src/:/opt/src\n    command: jobmanager \n    extra_hosts:\n      - \"host.docker.internal:127.0.0.1\" #// Linux\n      - \"host.docker.internal:host-gateway\" #// Access services on the host machine from within the Docker container\n    environment:\n      - POSTGRES_URL=${POSTGRES_URL:-jdbc:postgresql://host.docker.internal:5432/postgres}\n      - POSTGRES_USER=${POSTGRES_USER:-postgres}\n      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-postgres}\n      - POSTGRES_DB=${POSTGRES_DB:-postgres}\n      - |\n        FLINK_PROPERTIES=\n        jobmanager.rpc.address: jobmanager        \n  \n  # Flink task manager\n  taskmanager:\n    image: pyflink:1.16.0\n    container_name: \"flink-taskmanager\"\n    pull_policy: never\n  
  platform: \"linux/amd64\"\n    expose:\n      - \"6121\"\n      - \"6122\"\n    volumes:\n      - ./:/opt/flink/usrlib\n      - ./src/:/opt/src\n    depends_on:\n      - jobmanager\n    command: taskmanager --taskmanager.registration.timeout 5 min\n    extra_hosts:\n      - \"host.docker.internal:127.0.0.1\" #// Linux\n      - \"host.docker.internal:host-gateway\" #// Access services on the host machine from within the Docker container\n    environment:\n      - |\n        FLINK_PROPERTIES=\n        jobmanager.rpc.address: jobmanager\n        taskmanager.numberOfTaskSlots: 15\n        parallelism.default: 3\n  postgres:\n    image: postgres:14\n    restart: on-failure\n    container_name: \"postgres\"\n    environment:\n      - POSTGRES_DB=postgres\n      - POSTGRES_USER=postgres\n      - POSTGRES_PASSWORD=postgres\n    ports:\n      - \"5432:5432\"\n    extra_hosts:\n     - \"host.docker.internal:127.0.0.1\" #// Linux\n     - \"host.docker.internal:host-gateway\" #// Access services on the host machine from within the Docker container\n\n"
  },
  {
    "path": "07-streaming/extras/pyflink/homework.md",
    "content": "# Homework\n\nFor this homework we will be using the Taxi data:\n- Green 2019-10 data from [here](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz)\n\n\n## Start Red Panda, Flink Job Manager, Flink Task Manager, and Postgres \n\nThere's a `docker-compose.yml` file in the homework folder (taken from [here](https://github.com/redpanda-data-blog/2023-python-gsg/blob/main/docker-compose.yml))\n\nCopy this file to your homework directory and run\n\n```bash\ndocker-compose up\n```\n\n(Add `-d` if you want to run in detached mode)\n\nVisit `localhost:8081` to see the Flink Job Manager\n\nConnect to Postgres with [DBeaver](https://dbeaver.io/).\n\nThe connection credentials are:\n- Username `postgres`\n- Password `postgres`\n- Database `postgres`\n- Host `localhost`\n- Port `5432`\n\n\nIn DBeaver, run this query to create the Postgres landing zone for the first events:\n```sql \nCREATE TABLE processed_events (\n    test_data INTEGER,\n    event_timestamp TIMESTAMP\n)\n```\n\n\n## Question 1. 
Connecting to the Kafka server\n\nWe need to make sure we can connect to the server so that\nwe can send data to its topics later.\n\nFirst, install the Kafka client library (it's up to you whether\nto use a separate virtual environment for it):\n\n```bash\npip install kafka-python\n```\n\nYou can start a Jupyter notebook in your solution folder or\ncreate a script.\n\nLet's try to connect to our server:\n\n```python\nimport json\nimport time \n\nfrom kafka import KafkaProducer\n\ndef json_serializer(data):\n    return json.dumps(data).encode('utf-8')\n\nserver = 'localhost:9092'\n\nproducer = KafkaProducer(\n    bootstrap_servers=[server],\n    value_serializer=json_serializer\n)\n\nproducer.bootstrap_connected()\n```\n\n## Question 3: Sending the Trip Data\n\n* Read the green csv.gz file\n* We will only need these columns:\n  * `'lpep_pickup_datetime',`\n  * `'lpep_dropoff_datetime',`\n  * `'PULocationID',`\n  * `'DOLocationID',`\n  * `'passenger_count',`\n  * `'trip_distance',`\n  * `'tip_amount'`\n\n* Create a topic `green-trips` and send the data there with `load_taxi_data.py`\n* How much time in seconds did it take? (You can round it to a whole number)\n* Make sure you don't include sleeps in your code\n\n## Question 4: Build a Sessionization Window\n\n* Copy `aggregation_job.py` and rename it to `session_job.py`\n* Have it read from `green-trips`, adjusting the schema to match\n* Use a [session window](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/) with a gap of 5 minutes\n* Use `lpep_dropoff_datetime` as your watermark with a 5-second tolerance\n* Which pickup and drop-off locations have the longest unbroken streak of taxi trips?\n\n\n\n\n"
  },
  {
    "path": "07-streaming/extras/pyflink/requirements.txt",
    "content": "apache-flink==1.16.0\npsycopg2-binary==2.9.1\nrequests\nkafka-python"
  },
  {
    "path": "07-streaming/extras/pyflink/src/job/aggregation_job.py",
    "content": "from pyflink.datastream import StreamExecutionEnvironment\nfrom pyflink.table import EnvironmentSettings, DataTypes, TableEnvironment, StreamTableEnvironment\nfrom pyflink.common.watermark_strategy import WatermarkStrategy\nfrom pyflink.common.time import Duration\n\ndef create_events_aggregated_sink(t_env):\n    table_name = 'processed_events_aggregated'\n    sink_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            event_hour TIMESTAMP(3),\n            test_data INT,\n            num_hits BIGINT,\n            PRIMARY KEY (event_hour, test_data) NOT ENFORCED\n        ) WITH (\n            'connector' = 'jdbc',\n            'url' = 'jdbc:postgresql://postgres:5432/postgres',\n            'table-name' = '{table_name}',\n            'username' = 'postgres',\n            'password' = 'postgres',\n            'driver' = 'org.postgresql.Driver'\n        );\n        \"\"\"\n    t_env.execute_sql(sink_ddl)\n    return table_name\n\ndef create_events_source_kafka(t_env):\n    table_name = \"events\"\n    source_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            test_data INTEGER,\n            event_timestamp BIGINT,\n            event_watermark AS TO_TIMESTAMP_LTZ(event_timestamp, 3),\n            WATERMARK for event_watermark as event_watermark - INTERVAL '1' SECOND\n        ) WITH (\n            'connector' = 'kafka',\n            'properties.bootstrap.servers' = 'redpanda-1:29092',\n            'topic' = 'test-topic',\n            'scan.startup.mode' = 'earliest-offset',\n            'properties.auto.offset.reset' = 'earliest',\n            'format' = 'json'\n        );\n        \"\"\"\n    t_env.execute_sql(source_ddl)\n    return table_name\n\n\ndef log_aggregation():\n    # Set up the execution environment\n    env = StreamExecutionEnvironment.get_execution_environment()\n    env.enable_checkpointing(10 * 1000)\n    env.set_parallelism(3)\n\n    # Set up the table environment\n    settings = 
EnvironmentSettings.new_instance().in_streaming_mode().build()\n    t_env = StreamTableEnvironment.create(env, environment_settings=settings)\n\n    # NOTE: this DataStream-style strategy is never applied below; the WATERMARK\n    # clause in the source DDL is what defines event time for the query.\n    watermark_strategy = (\n        WatermarkStrategy\n        .for_bounded_out_of_orderness(Duration.of_seconds(5))\n        .with_timestamp_assigner(\n            # This lambda is your timestamp assigner:\n            #   event -> The data record\n            #   timestamp -> The previously assigned (or default) timestamp\n            lambda event, timestamp: event[2]  # Index 2 is the third tuple element, treated as the event time (ms).\n        )\n    )\n    try:\n        # Create Kafka table\n        source_table = create_events_source_kafka(t_env)\n        aggregated_table = create_events_aggregated_sink(t_env)\n\n        t_env.execute_sql(f\"\"\"\n        INSERT INTO {aggregated_table}\n        SELECT\n            window_start as event_hour,\n            test_data,\n            COUNT(*) AS num_hits\n        FROM TABLE(\n            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_watermark), INTERVAL '1' MINUTE)\n        )\n        GROUP BY window_start, test_data;\n        \n        \"\"\").wait()\n\n    except Exception as e:\n        print(\"Writing records from Kafka to JDBC failed:\", str(e))\n\n\nif __name__ == '__main__':\n    log_aggregation()\n"
  },
  {
    "path": "07-streaming/extras/pyflink/src/job/start_job.py",
    "content": "from pyflink.datastream import StreamExecutionEnvironment\nfrom pyflink.table import EnvironmentSettings, DataTypes, TableEnvironment, StreamTableEnvironment\n\n\ndef create_processed_events_sink_postgres(t_env):\n    table_name = 'processed_events'\n    sink_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            test_data INTEGER,\n            event_timestamp TIMESTAMP\n        ) WITH (\n            'connector' = 'jdbc',\n            'url' = 'jdbc:postgresql://postgres:5432/postgres',\n            'table-name' = '{table_name}',\n            'username' = 'postgres',\n            'password' = 'postgres',\n            'driver' = 'org.postgresql.Driver'\n        );\n        \"\"\"\n    t_env.execute_sql(sink_ddl)\n    return table_name\n\n\ndef create_events_source_kafka(t_env):\n    table_name = \"events\"\n    pattern = \"yyyy-MM-dd HH:mm:ss.SSS\"\n    source_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            test_data INTEGER,\n            event_timestamp BIGINT,\n            event_watermark AS TO_TIMESTAMP_LTZ(event_timestamp, 3),\n            WATERMARK for event_watermark as event_watermark - INTERVAL '5' SECOND\n        ) WITH (\n            'connector' = 'kafka',\n            'properties.bootstrap.servers' = 'redpanda-1:29092',\n            'topic' = 'test-topic',\n            'scan.startup.mode' = 'latest-offset',\n            'properties.auto.offset.reset' = 'latest',\n            'format' = 'json'\n        );\n        \"\"\"\n    t_env.execute_sql(source_ddl)\n    return table_name\n\ndef log_processing():\n    # Set up the execution environment\n    env = StreamExecutionEnvironment.get_execution_environment()\n    env.enable_checkpointing(10 * 1000)\n    # env.set_parallelism(1)\n\n    # Set up the table environment\n    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()\n    t_env = StreamTableEnvironment.create(env, environment_settings=settings)\n    try:\n        # Create Kafka table\n        
source_table = create_events_source_kafka(t_env)\n        postgres_sink = create_processed_events_sink_postgres(t_env)\n        # write records to postgres too!\n        t_env.execute_sql(\n            f\"\"\"\n                    INSERT INTO {postgres_sink}\n                    SELECT\n                        test_data,\n                        TO_TIMESTAMP_LTZ(event_timestamp, 3) as event_timestamp\n                    FROM {source_table}\n                    \"\"\"\n        ).wait()\n\n    except Exception as e:\n        print(\"Writing records from Kafka to JDBC failed:\", str(e))\n\n\nif __name__ == '__main__':\n    log_processing()\n"
  },
  {
    "path": "07-streaming/extras/pyflink/src/job/taxi_job.py",
    "content": "from pyflink.datastream import StreamExecutionEnvironment\nfrom pyflink.table import EnvironmentSettings, DataTypes, TableEnvironment, StreamTableEnvironment\n\n\ndef create_taxi_events_sink_postgres(t_env):\n    table_name = 'taxi_events'\n    sink_ddl = f\"\"\"\n        CREATE OR REPLACE TABLE {table_name} (\n            VendorID INTEGER,\n            lpep_pickup_datetime VARCHAR,\n            lpep_dropoff_datetime VARCHAR,\n            store_and_fwd_flag VARCHAR,\n            RatecodeID INTEGER ,\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            passenger_count INTEGER,\n            trip_distance DOUBLE,\n            fare_amount DOUBLE,\n            extra DOUBLE,\n            mta_tax DOUBLE,\n            tip_amount DOUBLE,\n            tolls_amount DOUBLE,\n            ehail_fee DOUBLE,\n            improvement_surcharge DOUBLE,\n            total_amount DOUBLE,\n            payment_type INTEGER,\n            trip_type INTEGER,\n            congestion_surcharge DOUBLE\n        ) WITH (\n            'connector' = 'jdbc',\n            'url' = 'jdbc:postgresql://postgres:5432/postgres',\n            'table-name' = '{table_name}',\n            'username' = 'postgres',\n            'password' = 'postgres',\n            'driver' = 'org.postgresql.Driver'\n        );\n        \"\"\"\n    t_env.execute_sql(sink_ddl)\n    return table_name\n\n\ndef create_events_source_kafka(t_env):\n    table_name = \"taxi_events\"\n    pattern = \"yyyy-MM-dd HH:mm:ss\"\n    source_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            VendorID INTEGER,\n            lpep_pickup_datetime VARCHAR,\n            lpep_dropoff_datetime VARCHAR,\n            store_and_fwd_flag VARCHAR,\n            RatecodeID INTEGER ,\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            passenger_count INTEGER,\n            trip_distance DOUBLE,\n            fare_amount DOUBLE,\n            extra DOUBLE,\n            
mta_tax DOUBLE,\n            tip_amount DOUBLE,\n            tolls_amount DOUBLE,\n            ehail_fee DOUBLE,\n            improvement_surcharge DOUBLE,\n            total_amount DOUBLE,\n            payment_type INTEGER,\n            trip_type INTEGER,\n            congestion_surcharge DOUBLE,\n            pickup_timestamp AS TO_TIMESTAMP(lpep_pickup_datetime, '{pattern}'),\n            WATERMARK FOR pickup_timestamp AS pickup_timestamp - INTERVAL '15' SECOND\n        ) WITH (\n            'connector' = 'kafka',\n            'properties.bootstrap.servers' = 'redpanda-1:29092',\n            'topic' = 'green-data',\n            'scan.startup.mode' = 'earliest-offset',\n            'properties.auto.offset.reset' = 'earliest',\n            'format' = 'json'\n        );\n        \"\"\"\n    t_env.execute_sql(source_ddl)\n    return table_name\n\ndef log_processing():\n    # Set up the execution environment\n    env = StreamExecutionEnvironment.get_execution_environment()\n    env.enable_checkpointing(10 * 1000)\n    # env.set_parallelism(1)\n\n    # Set up the table environment\n    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()\n    t_env = StreamTableEnvironment.create(env, environment_settings=settings)\n    try:\n        # Create Kafka table\n        source_table = create_events_source_kafka(t_env)\n        postgres_sink = create_taxi_events_sink_postgres(t_env)\n        # Write records to Postgres. The sink columns are listed explicitly because\n        # SELECT * would also return the computed pickup_timestamp column, which\n        # the sink table does not have.\n        t_env.execute_sql(\n            f\"\"\"\n                    INSERT INTO {postgres_sink}\n                    SELECT\n                        VendorID,\n                        lpep_pickup_datetime,\n                        lpep_dropoff_datetime,\n                        store_and_fwd_flag,\n                        RatecodeID,\n                        PULocationID,\n                        DOLocationID,\n                        passenger_count,\n                        trip_distance,\n                        fare_amount,\n                        extra,\n                        mta_tax,\n                        tip_amount,\n                        tolls_amount,\n                        ehail_fee,\n                        improvement_surcharge,\n                        total_amount,\n                        payment_type,\n                        trip_type,\n                        congestion_surcharge\n                    FROM {source_table}\n                    \"\"\"\n        ).wait()\n\n    except Exception as e:\n        print(\"Writing records from Kafka to JDBC failed:\", str(e))\n\n\nif __name__ == '__main__':\n    log_processing()\n"
  },
  {
    "path": "07-streaming/extras/pyflink/src/producers/load_taxi_data.py",
    "content": "import csv\nimport json\nfrom kafka import KafkaProducer\n\ndef main():\n    # Create a Kafka producer\n    producer = KafkaProducer(\n        bootstrap_servers='localhost:9092',\n        value_serializer=lambda v: json.dumps(v).encode('utf-8')\n    )\n\n    csv_file = 'data/green_tripdata_2019-10.csv'  # change to your CSV file path if needed\n\n    with open(csv_file, 'r', newline='', encoding='utf-8') as file:\n        reader = csv.DictReader(file)\n\n        for row in reader:\n            # Each row will be a dictionary keyed by the CSV headers\n            # Send data to Kafka topic \"green-data\"\n            producer.send('green-data', value=row)\n\n    # Make sure any remaining messages are delivered\n    producer.flush()\n    producer.close()\n\n\nif __name__ == \"__main__\":\n    main()"
  },
  {
    "path": "07-streaming/extras/pyflink/src/producers/producer.py",
    "content": "import json\nimport time\nfrom kafka import KafkaProducer\n\ndef json_serializer(data):\n    return json.dumps(data).encode('utf-8')\n\nserver = 'localhost:9092'\n\nproducer = KafkaProducer(\n    bootstrap_servers=[server],\n    value_serializer=json_serializer\n)\nt0 = time.time()\n\ntopic_name = 'test-topic'\n\nfor i in range(10, 1000):\n    message = {'test_data': i, 'event_timestamp': time.time() * 1000}\n    producer.send(topic_name, value=message)\n    print(f\"Sent: {message}\")\n    time.sleep(0.05)\n\nproducer.flush()\n\nt1 = time.time()\nprint(f'took {(t1 - t0):.2f} seconds')"
  },
  {
    "path": "07-streaming/extras/python/README.md",
    "content": "### Stream-Processing with Python\n\nIn this document, you will be finding information about stream processing \nusing different Python libraries (`kafka-python`,`confluent-kafka`,`pyspark`, `faust`).\n\nThis Python module can be separated in following modules.\n\n####  1. Docker\nDocker module includes, Dockerfiles and docker-compose definitions \nto run Kafka and Spark in a docker container. Setting up required services is\nthe prerequsite step for running following modules.\n\n#### 2. Kafka Producer - Consumer Examples\n- [Json Producer-Consumer Example](json_example) using `kafka-python` library\n- [Avro Producer-Consumer Example](avro_example) using `confluent-kafka` library\n\nBoth of these examples require, up-and running Kafka services, therefore please ensure\nfollowing steps under [docker-README](docker/README.md)\n\nTo run the producer-consumer examples in the respective example folder, run following commands\n```bash\n# Start producer script\npython3 producer.py\n# Start consumer script\npython3 consumer.py\n```\n\n\n\n\n\n"
  },
  {
    "path": "07-streaming/extras/python/avro_example/consumer.py",
    "content": "import os\nfrom typing import Dict, List\n\nfrom confluent_kafka import Consumer\nfrom confluent_kafka.schema_registry import SchemaRegistryClient\nfrom confluent_kafka.schema_registry.avro import AvroDeserializer\nfrom confluent_kafka.serialization import SerializationContext, MessageField\n\nfrom ride_record_key import dict_to_ride_record_key\nfrom ride_record import dict_to_ride_record\nfrom settings import BOOTSTRAP_SERVERS, SCHEMA_REGISTRY_URL, \\\n    RIDE_KEY_SCHEMA_PATH, RIDE_VALUE_SCHEMA_PATH, KAFKA_TOPIC\n\n\nclass RideAvroConsumer:\n    def __init__(self, props: Dict):\n\n        # Schema Registry and Serializer-Deserializer Configurations\n        key_schema_str = self.load_schema(props['schema.key'])\n        value_schema_str = self.load_schema(props['schema.value'])\n        schema_registry_props = {'url': props['schema_registry.url']}\n        schema_registry_client = SchemaRegistryClient(schema_registry_props)\n        self.avro_key_deserializer = AvroDeserializer(schema_registry_client=schema_registry_client,\n                                                      schema_str=key_schema_str,\n                                                      from_dict=dict_to_ride_record_key)\n        self.avro_value_deserializer = AvroDeserializer(schema_registry_client=schema_registry_client,\n                                                        schema_str=value_schema_str,\n                                                        from_dict=dict_to_ride_record)\n\n        consumer_props = {'bootstrap.servers': props['bootstrap.servers'],\n                          'group.id': 'datatalkclubs.taxirides.avro.consumer.2',\n                          'auto.offset.reset': \"earliest\"}\n        self.consumer = Consumer(consumer_props)\n\n    @staticmethod\n    def load_schema(schema_path: str):\n        path = os.path.realpath(os.path.dirname(__file__))\n        with open(f\"{path}/{schema_path}\") as f:\n            schema_str = f.read()\n        
return schema_str\n\n    def consume_from_kafka(self, topics: List[str]):\n        self.consumer.subscribe(topics=topics)\n        while True:\n            try:\n                # SIGINT can't be handled when polling, limit timeout to 1 second.\n                msg = self.consumer.poll(1.0)\n                if msg is None:\n                    continue\n                key = self.avro_key_deserializer(msg.key(), SerializationContext(msg.topic(), MessageField.KEY))\n                record = self.avro_value_deserializer(msg.value(),\n                                                      SerializationContext(msg.topic(), MessageField.VALUE))\n                if record is not None:\n                    print(\"{}, {}\".format(key, record))\n            except KeyboardInterrupt:\n                break\n\n        self.consumer.close()\n\n\nif __name__ == \"__main__\":\n    config = {\n        'bootstrap.servers': BOOTSTRAP_SERVERS,\n        'schema_registry.url': SCHEMA_REGISTRY_URL,\n        'schema.key': RIDE_KEY_SCHEMA_PATH,\n        'schema.value': RIDE_VALUE_SCHEMA_PATH,\n    }\n    avro_consumer = RideAvroConsumer(props=config)\n    avro_consumer.consume_from_kafka(topics=[KAFKA_TOPIC])\n"
  },
  {
    "path": "07-streaming/extras/python/avro_example/producer.py",
    "content": "import os\nimport csv\nfrom time import sleep\nfrom typing import Dict\n\nfrom confluent_kafka import Producer\nfrom confluent_kafka.schema_registry import SchemaRegistryClient\nfrom confluent_kafka.schema_registry.avro import AvroSerializer\nfrom confluent_kafka.serialization import SerializationContext, MessageField\n\nfrom ride_record_key import RideRecordKey, ride_record_key_to_dict\nfrom ride_record import RideRecord, ride_record_to_dict\nfrom settings import RIDE_KEY_SCHEMA_PATH, RIDE_VALUE_SCHEMA_PATH, \\\n    SCHEMA_REGISTRY_URL, BOOTSTRAP_SERVERS, INPUT_DATA_PATH, KAFKA_TOPIC\n\n\ndef delivery_report(err, msg):\n    if err is not None:\n        print(\"Delivery failed for record {}: {}\".format(msg.key(), err))\n        return\n    print('Record {} successfully produced to {} [{}] at offset {}'.format(\n        msg.key(), msg.topic(), msg.partition(), msg.offset()))\n\n\nclass RideAvroProducer:\n    def __init__(self, props: Dict):\n        # Schema Registry and Serializer-Deserializer Configurations\n        key_schema_str = self.load_schema(props['schema.key'])\n        value_schema_str = self.load_schema(props['schema.value'])\n        schema_registry_props = {'url': props['schema_registry.url']}\n        schema_registry_client = SchemaRegistryClient(schema_registry_props)\n        self.key_serializer = AvroSerializer(schema_registry_client, key_schema_str, ride_record_key_to_dict)\n        self.value_serializer = AvroSerializer(schema_registry_client, value_schema_str, ride_record_to_dict)\n\n        # Producer Configuration\n        producer_props = {'bootstrap.servers': props['bootstrap.servers']}\n        self.producer = Producer(producer_props)\n\n    @staticmethod\n    def load_schema(schema_path: str):\n        path = os.path.realpath(os.path.dirname(__file__))\n        with open(f\"{path}/{schema_path}\") as f:\n            schema_str = f.read()\n        return schema_str\n\n    @staticmethod\n    def delivery_report(err, msg):\n 
       if err is not None:\n            print(\"Delivery failed for record {}: {}\".format(msg.key(), err))\n            return\n        print('Record {} successfully produced to {} [{}] at offset {}'.format(\n            msg.key(), msg.topic(), msg.partition(), msg.offset()))\n\n    @staticmethod\n    def read_records(resource_path: str):\n        ride_records, ride_keys = [], []\n        with open(resource_path, 'r') as f:\n            reader = csv.reader(f)\n            header = next(reader)  # skip the header\n            for row in reader:\n                ride_records.append(RideRecord(arr=[row[0], row[3], row[4], row[9], row[16]]))\n                ride_keys.append(RideRecordKey(vendor_id=int(row[0])))\n        return zip(ride_keys, ride_records)\n\n    def publish(self, topic: str, records: [RideRecordKey, RideRecord]):\n        for key_value in records:\n            key, value = key_value\n            try:\n                self.producer.produce(topic=topic,\n                                      key=self.key_serializer(key, SerializationContext(topic=topic,\n                                                                                        field=MessageField.KEY)),\n                                      value=self.value_serializer(value, SerializationContext(topic=topic,\n                                                                                              field=MessageField.VALUE)),\n                                      on_delivery=delivery_report)\n            except KeyboardInterrupt:\n                break\n            except Exception as e:\n                print(f\"Exception while producing record - {value}: {e}\")\n\n        self.producer.flush()\n        sleep(1)\n\n\nif __name__ == \"__main__\":\n    config = {\n        'bootstrap.servers': BOOTSTRAP_SERVERS,\n        'schema_registry.url': SCHEMA_REGISTRY_URL,\n        'schema.key': RIDE_KEY_SCHEMA_PATH,\n        'schema.value': RIDE_VALUE_SCHEMA_PATH\n    }\n    producer = 
RideAvroProducer(props=config)\n    ride_records = producer.read_records(resource_path=INPUT_DATA_PATH)\n    producer.publish(topic=KAFKA_TOPIC, records=ride_records)\n"
  },
  {
    "path": "07-streaming/extras/python/avro_example/ride_record.py",
    "content": "from typing import List, Dict\n\n\nclass RideRecord:\n\n    def __init__(self, arr: List[str]):\n        self.vendor_id = int(arr[0])\n        self.passenger_count = int(arr[1])\n        self.trip_distance = float(arr[2])\n        self.payment_type = int(arr[3])\n        self.total_amount = float(arr[4])\n\n    @classmethod\n    def from_dict(cls, d: Dict):\n        return cls(arr=[\n            d['vendor_id'],\n            d['passenger_count'],\n            d['trip_distance'],\n            d['payment_type'],\n            d['total_amount']\n        ]\n        )\n\n    def __repr__(self):\n        return f'{self.__class__.__name__}: {self.__dict__}'\n\n\ndef dict_to_ride_record(obj, ctx):\n    if obj is None:\n        return None\n\n    return RideRecord.from_dict(obj)\n\n\ndef ride_record_to_dict(ride_record: RideRecord, ctx):\n    return ride_record.__dict__\n"
  },
  {
    "path": "07-streaming/extras/python/avro_example/ride_record_key.py",
    "content": "from typing import Dict\n\n\nclass RideRecordKey:\n    def __init__(self, vendor_id):\n        self.vendor_id = vendor_id\n\n    @classmethod\n    def from_dict(cls, d: Dict):\n        return cls(vendor_id=d['vendor_id'])\n\n    def __repr__(self):\n        return f'{self.__class__.__name__}: {self.__dict__}'\n\n\ndef dict_to_ride_record_key(obj, ctx):\n    if obj is None:\n        return None\n\n    return RideRecordKey.from_dict(obj)\n\n\ndef ride_record_key_to_dict(ride_record_key: RideRecordKey, ctx):\n    return ride_record_key.__dict__\n"
  },
  {
    "path": "07-streaming/extras/python/avro_example/settings.py",
    "content": "INPUT_DATA_PATH = '../resources/rides.csv'\n\nRIDE_KEY_SCHEMA_PATH = '../resources/schemas/taxi_ride_key.avsc'\nRIDE_VALUE_SCHEMA_PATH = '../resources/schemas/taxi_ride_value.avsc'\n\nSCHEMA_REGISTRY_URL = 'http://localhost:8081'\nBOOTSTRAP_SERVERS = 'localhost:9092'\nKAFKA_TOPIC = 'rides_avro'\n"
  },
  {
    "path": "07-streaming/extras/python/docker/README.md",
    "content": "\n# Running Spark and Kafka Clusters on Docker\n\n### 1. Build Required Images for running Spark\n\nThe details of how to spark-images are build in different layers can be created can be read through \nthe blog post written by André Perez on [Medium blog -Towards Data Science](https://towardsdatascience.com/apache-spark-cluster-on-docker-ft-a-juyterlab-interface-418383c95445)\n\n```bash\n# Build Spark Images\n./build.sh \n```\n\n### 2. Create Docker Network & Volume\n\n```bash\n# Create Network\ndocker network  create kafka-spark-network\n\n# Create Volume\ndocker volume create --name=hadoop-distributed-file-system\n```\n\n### 3. Run Services on Docker\n```bash\n# Start Docker-Compose (within for kafka and spark folders)\ndocker compose up -d\n```\nIn depth explanation of [Kafka Listeners](https://www.confluent.io/blog/kafka-listeners-explained/)\n\nExplanation of [Kafka Listeners](https://www.confluent.io/blog/kafka-listeners-explained/)\n\n### 4. Stop Services on Docker\n```bash\n# Stop Docker-Compose (within for kafka and spark folders)\ndocker compose down\n```\n\n### 5. Helpful Comands\n```bash\n# Delete all Containers\ndocker rm -f $(docker ps -a -q)\n\n# Delete all volumes\ndocker volume rm $(docker volume ls -q)\n```\n\n"
  },
  {
    "path": "07-streaming/extras/python/docker/docker-compose.yml",
    "content": "version: \"3.6\"\nvolumes:\n  shared-workspace:\n    name: \"hadoop-distributed-file-system\"\n    driver: local\nservices:\n  jupyterlab:\n    image: jupyterlab\n    container_name: jupyterlab\n    ports:\n      - 8888:8888\n    volumes:\n      - shared-workspace:/opt/workspace\n  spark-master:\n    image: spark-master\n    container_name: spark-master\n    environment:\n      SPARK_LOCAL_IP: 'spark-master'\n    ports:\n      - 8080:8080\n      - 7077:7077\n    volumes:\n      - shared-workspace:/opt/workspace\n  spark-worker-1:\n    image: spark-worker\n    container_name: spark-worker-1\n    environment:\n      - SPARK_WORKER_CORES=1\n      - SPARK_WORKER_MEMORY=4g\n    ports:\n      - 8083:8081\n    volumes:\n      - shared-workspace:/opt/workspace\n    depends_on:\n      - spark-master\n  spark-worker-2:\n    image: spark-worker\n    container_name: spark-worker-2\n    environment:\n      - SPARK_WORKER_CORES=1\n      - SPARK_WORKER_MEMORY=4g\n    ports:\n      - 8082:8081\n    volumes:\n      - shared-workspace:/opt/workspace\n    depends_on:\n      - spark-master\n\n  broker:\n    image: confluentinc/cp-kafka:7.2.0\n    hostname: broker\n    container_name: broker\n    depends_on:\n      - zookeeper\n    ports:\n      - '9092:9092'\n    environment:\n      KAFKA_BROKER_ID: 1\n      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'\n      # KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT\n      # KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:9092,PLAINTEXT_HOST://localhost:9092\n      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_BOB:PLAINTEXT,LISTENER_FRED:PLAINTEXT\n      KAFKA_LISTENERS: LISTENER_BOB://broker:29092,LISTENER_FRED://broker:9092\n      KAFKA_ADVERTISED_LISTENERS: LISTENER_BOB://broker:29092,LISTENER_FRED://localhost:9092\n      KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_BOB\n      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1\n      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0\n      
KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1\n      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1\n  schema-registry:\n    image: confluentinc/cp-schema-registry:7.2.0\n    hostname: schema-registry\n    container_name: schema-registry\n    depends_on:\n      - zookeeper\n      - broker\n    ports:\n      - \"8081:8081\"\n    environment:\n      # SCHEMA_REGISTRY_HOST_NAME: schema-registry # used for intercommunication\n      # SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: \"zookeeper:2181\" #(deprecated)\n      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: \"broker:29092\"\n      SCHEMA_REGISTRY_HOST_NAME: \"localhost\"\n      SCHEMA_REGISTRY_LISTENERS: \"http://0.0.0.0:8081\" #(default: http://0.0.0.0:8081)\n  zookeeper:\n    image: confluentinc/cp-zookeeper:7.2.0\n    hostname: zookeeper\n    container_name: zookeeper\n    ports:\n      - '2181:2181'\n    environment:\n      ZOOKEEPER_CLIENT_PORT: 2181\n      ZOOKEEPER_TICK_TIME: 2000\n  control-center:\n    image: confluentinc/cp-enterprise-control-center:7.2.0\n    hostname: control-center\n    container_name: control-center\n    depends_on:\n      - zookeeper\n      - broker\n      - schema-registry\n    ports:\n      - \"9021:9021\"\n    environment:\n      CONTROL_CENTER_BOOTSTRAP_SERVERS: 'broker:29092'\n      CONTROL_CENTER_ZOOKEEPER_CONNECT: 'zookeeper:2181'\n      CONTROL_CENTER_SCHEMA_REGISTRY_URL: \"http://localhost:8081\"\n      CONTROL_CENTER_REPLICATION_FACTOR: 1\n      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1\n      CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1\n      CONFLUENT_METRICS_TOPIC_REPLICATION: 1\n      PORT: 9021\n"
  },
  {
    "path": "07-streaming/extras/python/docker/kafka/docker-compose.yml",
    "content": "version: '3.6'\nnetworks:\n  default:\n    name: kafka-spark-network\n    external: true\nservices:\n  broker:\n    image: confluentinc/cp-kafka:7.2.0\n    hostname: broker\n    container_name: broker\n    depends_on:\n      - zookeeper\n    ports:\n      - '9092:9092'\n    environment:\n      KAFKA_BROKER_ID: 1\n      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'\n      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT\n      KAFKA_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://broker:9092\n      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092\n      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT\n      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1\n      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0\n      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1\n      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1\n  schema-registry:\n    image: confluentinc/cp-schema-registry:7.2.0\n    hostname: schema-registry\n    container_name: schema-registry\n    depends_on:\n      - zookeeper\n      - broker\n    ports:\n      - \"8081:8081\"\n    environment:\n      # SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: \"zookeeper:2181\" #(depreciated)\n      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: \"broker:29092\"\n      SCHEMA_REGISTRY_HOST_NAME: \"localhost\"\n      SCHEMA_REGISTRY_LISTENERS: \"http://0.0.0.0:8081\" #(default: http://0.0.0.0:8081)\n  zookeeper:\n    image: confluentinc/cp-zookeeper:7.2.0\n    hostname: zookeeper\n    container_name: zookeeper\n    ports:\n      - '2181:2181'\n    environment:\n      ZOOKEEPER_CLIENT_PORT: 2181\n      ZOOKEEPER_TICK_TIME: 2000\n  control-center:\n    image: confluentinc/cp-enterprise-control-center:7.2.0\n    hostname: control-center\n    container_name: control-center\n    depends_on:\n      - zookeeper\n      - broker\n      - schema-registry\n    ports:\n      - \"9021:9021\"\n    environment:\n      CONTROL_CENTER_BOOTSTRAP_SERVERS: 'broker:29092'\n      
CONTROL_CENTER_ZOOKEEPER_CONNECT: 'zookeeper:2181'\n      CONTROL_CENTER_SCHEMA_REGISTRY_URL: \"http://localhost:8081\"\n      CONTROL_CENTER_REPLICATION_FACTOR: 1\n      CONTROL_CENTER_INTERNAL_TOPICS_PARTITIONS: 1\n      CONTROL_CENTER_MONITORING_INTERCEPTOR_TOPIC_PARTITIONS: 1\n      CONFLUENT_METRICS_TOPIC_REPLICATION: 1\n      PORT: 9021\n\n  kafka-rest:\n    image: confluentinc/cp-kafka-rest:7.2.0\n    hostname: kafka-rest\n    ports:\n      - \"8082:8082\"\n    depends_on:\n      - schema-registry\n      - broker\n    environment:\n      KAFKA_REST_BOOTSTRAP_SERVERS: 'broker:29092'\n      KAFKA_REST_ZOOKEEPER_CONNECT: 'zookeeper:2181'\n      KAFKA_REST_SCHEMA_REGISTRY_URL: 'http://localhost:8081'\n      KAFKA_REST_HOST_NAME: localhost\n      KAFKA_REST_LISTENERS: 'http://0.0.0.0:8082'"
  },
  {
    "path": "07-streaming/extras/python/docker/spark/build.sh",
    "content": "# -- Software Stack Version\n\nSPARK_VERSION=\"3.3.1\"\nHADOOP_VERSION=\"3\"\nJUPYTERLAB_VERSION=\"3.6.1\"\n\n# -- Building the Images\n\ndocker build \\\n  -f cluster-base.Dockerfile \\\n  -t cluster-base .\n\ndocker build \\\n  --build-arg spark_version=\"${SPARK_VERSION}\" \\\n  --build-arg hadoop_version=\"${HADOOP_VERSION}\" \\\n  -f spark-base.Dockerfile \\\n  -t spark-base .\n\ndocker build \\\n  -f spark-master.Dockerfile \\\n  -t spark-master .\n\ndocker build \\\n  -f spark-worker.Dockerfile \\\n  -t spark-worker .\n\ndocker build \\\n  --build-arg spark_version=\"${SPARK_VERSION}\" \\\n  --build-arg jupyterlab_version=\"${JUPYTERLAB_VERSION}\" \\\n  -f jupyterlab.Dockerfile \\\n  -t jupyterlab .\n"
  },
  {
    "path": "07-streaming/extras/python/docker/spark/cluster-base.Dockerfile",
    "content": "# Reference from offical Apache Spark repository Dockerfile for Kubernetes\n# https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile\nARG java_image_tag=17-jre\nFROM eclipse-temurin:${java_image_tag}\n\n# -- Layer: OS + Python\n\nARG shared_workspace=/opt/workspace\n\nRUN mkdir -p ${shared_workspace} && \\\n    apt-get update -y && \\\n    apt-get install -y python3 && \\\n    ln -s /usr/bin/python3 /usr/bin/python && \\\n    rm -rf /var/lib/apt/lists/*\n\nENV SHARED_WORKSPACE=${shared_workspace}\n\n# -- Runtime\n\nVOLUME ${shared_workspace}\nCMD [\"bash\"]"
  },
  {
    "path": "07-streaming/extras/python/docker/spark/docker-compose.yml",
    "content": "version: \"3.6\"\nvolumes:\n  shared-workspace:\n    name: \"hadoop-distributed-file-system\"\n    driver: local\nnetworks:\n  default:\n    name: kafka-spark-network\n    external: true\n\nservices:\n  jupyterlab:\n    image: jupyterlab\n    container_name: jupyterlab\n    ports:\n      - 8888:8888\n    volumes:\n      - shared-workspace:/opt/workspace\n  spark-master:\n    image: spark-master\n    container_name: spark-master\n    environment:\n      SPARK_LOCAL_IP: 'spark-master'\n    ports:\n      - 8080:8080\n      - 7077:7077\n    volumes:\n      - shared-workspace:/opt/workspace\n  spark-worker-1:\n    image: spark-worker\n    container_name: spark-worker-1\n    environment:\n      - SPARK_WORKER_CORES=1\n      - SPARK_WORKER_MEMORY=4g\n    ports:\n      - 8083:8081\n    volumes:\n      - shared-workspace:/opt/workspace\n    depends_on:\n      - spark-master\n  spark-worker-2:\n    image: spark-worker\n    container_name: spark-worker-2\n    environment:\n      - SPARK_WORKER_CORES=1\n      - SPARK_WORKER_MEMORY=4g\n    ports:\n      - 8084:8081\n    volumes:\n      - shared-workspace:/opt/workspace\n    depends_on:\n      - spark-master\n"
  },
  {
    "path": "07-streaming/extras/python/docker/spark/jupyterlab.Dockerfile",
    "content": "FROM cluster-base\n\n# -- Layer: JupyterLab\n\nARG spark_version=3.3.1\nARG jupyterlab_version=3.6.1\n\nRUN apt-get update -y && \\\n    apt-get install -y python3-pip && \\\n    pip3 install wget pyspark==${spark_version} jupyterlab==${jupyterlab_version}\n\n# -- Runtime\n\nEXPOSE 8888\nWORKDIR ${SHARED_WORKSPACE}\nCMD jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token=\n"
  },
  {
    "path": "07-streaming/extras/python/docker/spark/spark-base.Dockerfile",
    "content": "FROM cluster-base\n\n# -- Layer: Apache Spark\n\nARG spark_version=3.3.1\nARG hadoop_version=3\n\nRUN apt-get update -y && \\\n    apt-get install -y curl && \\\n    curl https://archive.apache.org/dist/spark/spark-${spark_version}/spark-${spark_version}-bin-hadoop${hadoop_version}.tgz -o spark.tgz && \\\n    tar -xf spark.tgz && \\\n    mv spark-${spark_version}-bin-hadoop${hadoop_version} /usr/bin/ && \\\n    mkdir /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}/logs && \\\n    rm spark.tgz\n\nENV SPARK_HOME /usr/bin/spark-${spark_version}-bin-hadoop${hadoop_version}\nENV SPARK_MASTER_HOST spark-master\nENV SPARK_MASTER_PORT 7077\nENV PYSPARK_PYTHON python3\n\n# -- Runtime\n\nWORKDIR ${SPARK_HOME}"
  },
  {
    "path": "07-streaming/extras/python/docker/spark/spark-master.Dockerfile",
    "content": "FROM spark-base\n\n# -- Runtime\n\nARG spark_master_web_ui=8080\n\nEXPOSE ${spark_master_web_ui} ${SPARK_MASTER_PORT}\nCMD bin/spark-class org.apache.spark.deploy.master.Master >> logs/spark-master.out"
  },
  {
    "path": "07-streaming/extras/python/docker/spark/spark-worker.Dockerfile",
    "content": "FROM spark-base\n\n# -- Runtime\n\nARG spark_worker_web_ui=8081\n\nEXPOSE ${spark_worker_web_ui}\nCMD bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT} >> logs/spark-worker.out\n"
  },
  {
    "path": "07-streaming/extras/python/json_example/consumer.py",
    "content": "from typing import Dict, List\nfrom json import loads\nfrom kafka import KafkaConsumer\n\nfrom ride import Ride\nfrom settings import BOOTSTRAP_SERVERS, KAFKA_TOPIC\n\n\nclass JsonConsumer:\n    def __init__(self, props: Dict):\n        self.consumer = KafkaConsumer(**props)\n\n    def consume_from_kafka(self, topics: List[str]):\n        self.consumer.subscribe(topics)\n        print('Consuming from Kafka started')\n        print('Available topics to consume: ', self.consumer.subscription())\n        while True:\n            try:\n                # SIGINT can't be handled when polling, limit timeout to 1 second.\n                message = self.consumer.poll(1.0)\n                if message is None or message == {}:\n                    continue\n                for message_key, message_value in message.items():\n                    for msg_val in message_value:\n                        print(msg_val.key, msg_val.value)\n            except KeyboardInterrupt:\n                break\n\n        self.consumer.close()\n\n\nif __name__ == '__main__':\n    config = {\n        'bootstrap_servers': BOOTSTRAP_SERVERS,\n        'auto_offset_reset': 'earliest',\n        'enable_auto_commit': True,\n        'key_deserializer': lambda key: int(key.decode('utf-8')),\n        'value_deserializer': lambda x: loads(x.decode('utf-8'), object_hook=lambda d: Ride.from_dict(d)),\n        'group_id': 'consumer.group.id.json-example.1',\n    }\n\n    json_consumer = JsonConsumer(props=config)\n    json_consumer.consume_from_kafka(topics=[KAFKA_TOPIC])\n"
  },
  {
    "path": "07-streaming/extras/python/json_example/producer.py",
    "content": "import csv\nimport json\nfrom typing import List, Dict\nfrom kafka import KafkaProducer\nfrom kafka.errors import KafkaTimeoutError\n\nfrom ride import Ride\nfrom settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, KAFKA_TOPIC\n\n\nclass JsonProducer(KafkaProducer):\n    def __init__(self, props: Dict):\n        self.producer = KafkaProducer(**props)\n\n    @staticmethod\n    def read_records(resource_path: str):\n        records = []\n        with open(resource_path, 'r') as f:\n            reader = csv.reader(f)\n            header = next(reader)  # skip the header row\n            for row in reader:\n                records.append(Ride(arr=row))\n        return records\n\n    def publish_rides(self, topic: str, messages: List[Ride]):\n        for ride in messages:\n            try:\n                record = self.producer.send(topic=topic, key=ride.pu_location_id, value=ride)\n                print('Record {} successfully produced at offset {}'.format(ride.pu_location_id, record.get().offset))\n            except KafkaTimeoutError as e:\n                print(e.__str__())\n\n\nif __name__ == '__main__':\n    # Config Should match with the KafkaProducer expectation\n    config = {\n        'bootstrap_servers': BOOTSTRAP_SERVERS,\n        'key_serializer': lambda key: str(key).encode(),\n        'value_serializer': lambda x: json.dumps(x.__dict__, default=str).encode('utf-8')\n    }\n    producer = JsonProducer(props=config)\n    rides = producer.read_records(resource_path=INPUT_DATA_PATH)\n    producer.publish_rides(topic=KAFKA_TOPIC, messages=rides)\n"
  },
  {
    "path": "07-streaming/extras/python/json_example/ride.py",
    "content": "from typing import List, Dict\nfrom decimal import Decimal\nfrom datetime import datetime\n\n\nclass Ride:\n    def __init__(self, arr: List[str]):\n        self.vendor_id = arr[0]\n        self.tpep_pickup_datetime = datetime.strptime(arr[1], \"%Y-%m-%d %H:%M:%S\"),\n        self.tpep_dropoff_datetime = datetime.strptime(arr[2], \"%Y-%m-%d %H:%M:%S\"),\n        self.passenger_count = int(arr[3])\n        self.trip_distance = Decimal(arr[4])\n        self.rate_code_id = int(arr[5])\n        self.store_and_fwd_flag = arr[6]\n        self.pu_location_id = int(arr[7])\n        self.do_location_id = int(arr[8])\n        self.payment_type = arr[9]\n        self.fare_amount = Decimal(arr[10])\n        self.extra = Decimal(arr[11])\n        self.mta_tax = Decimal(arr[12])\n        self.tip_amount = Decimal(arr[13])\n        self.tolls_amount = Decimal(arr[14])\n        self.improvement_surcharge = Decimal(arr[15])\n        self.total_amount = Decimal(arr[16])\n        self.congestion_surcharge = Decimal(arr[17])\n\n    @classmethod\n    def from_dict(cls, d: Dict):\n        return cls(arr=[\n            d['vendor_id'],\n            d['tpep_pickup_datetime'][0],\n            d['tpep_dropoff_datetime'][0],\n            d['passenger_count'],\n            d['trip_distance'],\n            d['rate_code_id'],\n            d['store_and_fwd_flag'],\n            d['pu_location_id'],\n            d['do_location_id'],\n            d['payment_type'],\n            d['fare_amount'],\n            d['extra'],\n            d['mta_tax'],\n            d['tip_amount'],\n            d['tolls_amount'],\n            d['improvement_surcharge'],\n            d['total_amount'],\n            d['congestion_surcharge'],\n        ]\n        )\n\n    def __repr__(self):\n        return f'{self.__class__.__name__}: {self.__dict__}'\n"
  },
  {
    "path": "07-streaming/extras/python/json_example/settings.py",
    "content": "INPUT_DATA_PATH = '../resources/rides.csv'\n\nBOOTSTRAP_SERVERS = ['localhost:9092']\nKAFKA_TOPIC = 'rides_json'\n"
  },
  {
    "path": "07-streaming/extras/python/redpanda_example/README.md",
    "content": "# Basic PubSub example with Redpanda\n\nThe aim of this module is to have a good grasp on the foundation of these Kafka/Redpanda concepts, to be able to submit a capstone project using streaming:\n- clusters\n- brokers\n- topics\n- producers\n- consumers and consumer groups\n- data serialization and deserialization\n- replication and retention\n- offsets\n- consumer-groups\n- \n\n## 1. Pre-requisites\n\nIf you have been following the [module-07](./../../../07-streaming/README.md) videos, you might already have installed the `kafka-python` library, so you can move on to [Docker](#2-docker) section.\n\nIf you have not, this is the only package you need to install in your virtual environment for this Redpanda lesson. \n\n1. activate your environment\n2. `pip install kafka-python`\n\n## 2. Docker\n\nStart a Redpanda cluster. Redpanda is a single binary image, so it is very easy to start learning kafka concepts with Redpanda.\n\n```bash\ncd 07-streaming/python/redpanda_example/\ndocker-compose up -d\n```\n\n## 3. Set RPK alias\n\nRedpanda has a console command `rpk` which means `Redpanda keeper`, the CLI tool that ships with Redpanda and is already available in the Docker image. \n\nSet the following `rpk` alias so we can use it from our terminal, without having to open a Docker interactive terminal. We can use this `rpk` alias directly in our terminal. \n\n```bash\nalias rpk=\"docker exec -ti redpanda-1 rpk\"\nrpk version\n```\n\nAt this time, the verion is shown as `v23.2.26 (rev 328d83a06e)`. The important version munber is the major one `v23` following the versioning semantics `major.minor[.build[.revision]]`, to ensure that you get the same results as whatever is shared in this document.\n\n> [!TIP]\n> If you're reading this after Mar, 2024 and want to update the Docker file to use the latest Redpanda images, just visit [Docker hub](https://hub.docker.com/r/vectorized/redpanda/tags), and paste the new version number.\n\n\n## 4. 
Kafka Producer - Consumer Examples\n\nTo run the producer-consumer examples, open 2 shell terminals in 2 side-by-side tabs and run the following commands. Be sure to activate your virtual environment in each terminal.\n\n```bash\n# Start consumer script, in 1st terminal tab\npython consumer.py\n# Start producer script, in 2nd terminal tab\npython producer.py\n```\n\nRun the `python producer.py` command again (and again) to observe that the `consumer` tab automatically consumes messages in real time as new `events` occur.\n\n## 5. Redpanda UI\n\nYou can also see the clusters, topics, etc. from the Redpanda Console UI via your browser at [http://localhost:8080](http://localhost:8080)\n\n\n## 6. rpk commands glossary\n\nVisit [get-started-rpk blog post](https://redpanda.com/blog/get-started-rpk-manage-streaming-data-clusters) for more.\n\n```bash\n# set alias for rpk\nalias rpk=\"docker exec -ti redpanda-1 rpk\"\n\n# get info on cluster\nrpk cluster info\n\n# create topic_name with m partitions and n replication factor\nrpk topic create [topic_name] --partitions m --replicas n\n\n# get list of available topics, without extra details and with details\nrpk topic list\nrpk topic list --detailed\n\n# inspect topic config\nrpk topic describe [topic_name]\n\n# consume [topic_name]\nrpk topic consume [topic_name]\n\n# list the consumer groups in a Redpanda cluster\nrpk group list\n\n# get additional information about a consumer group, from above listed result\nrpk group describe my-group\n```\n\n## 7. 
Additional Resources\n\nRedpanda University (requires a Redpanda account; enrolling and taking the courses is free)\n- [RP101: Getting Started with Redpanda](https://university.redpanda.com/courses/hands-on-redpanda-getting-started)\n- [RP102: Stream Processing with Redpanda](https://university.redpanda.com/courses/take/hands-on-redpanda-stream-processing/lessons/37830192-intro)\n- [SF101: Streaming Fundamentals](https://university.redpanda.com/courses/streaming-fundamentals)\n- [SF102: Kafka building blocks](https://university.redpanda.com/courses/kafka-building-blocks)\n\nIf you feel that you already have a good foundation in streaming and Kafka, feel free to skip these supplementary courses.\n\n"
  },
  {
    "path": "07-streaming/extras/python/redpanda_example/consumer.py",
    "content": "import os\nfrom typing import Dict, List\nfrom json import loads\nfrom kafka import KafkaConsumer\n\nfrom ride import Ride\nfrom settings import BOOTSTRAP_SERVERS, KAFKA_TOPIC\n\n\nclass JsonConsumer:\n    def __init__(self, props: Dict):\n        self.consumer = KafkaConsumer(**props)\n\n    def consume_from_kafka(self, topics: List[str]):\n        self.consumer.subscribe(topics)\n        print('Consuming from Kafka started')\n        print('Available topics to consume: ', self.consumer.subscription())\n        while True:\n            try:\n                # SIGINT can't be handled when polling, limit timeout to 1 second.\n                message = self.consumer.poll(1.0)\n                if message is None or message == {}:\n                    continue\n                for message_key, message_value in message.items():\n                    for msg_val in message_value:\n                        print(msg_val.key, msg_val.value)\n            except KeyboardInterrupt:\n                break\n\n        self.consumer.close()\n\n\nif __name__ == '__main__':\n    config = {\n        'bootstrap_servers': BOOTSTRAP_SERVERS,\n        'auto_offset_reset': 'earliest',\n        'enable_auto_commit': True,\n        'key_deserializer': lambda key: int(key.decode('utf-8')),\n        'value_deserializer': lambda x: loads(x.decode('utf-8'), object_hook=lambda d: Ride.from_dict(d)),\n        'group_id': 'consumer.group.id.json-example.1',\n    }\n\n    json_consumer = JsonConsumer(props=config)\n    json_consumer.consume_from_kafka(topics=[KAFKA_TOPIC])\n\n\n# There's no schema in JSON format, so if the schema changes and one column is removed or new one added or the data types is changed, the Ride class would still work and produce-consume messages would still run without a hitch.\n# But the issue is in the downstream Analytics as the dataset would no longer have that column and the dashboards would thus fail. 
Therefore, trust in our data and processes would erode."
  },
  {
    "path": "07-streaming/extras/python/redpanda_example/docker-compose.yaml",
    "content": "version: '3.7'\nservices:\n  # Redpanda cluster\n  redpanda-1:\n    image: docker.redpanda.com/redpandadata/redpanda:v23.2.26\n    container_name: redpanda-1\n    command:\n      - redpanda\n      - start\n      - --smp\n      - '1'\n      - --reserve-memory\n      - 0M\n      - --overprovisioned\n      - --node-id\n      - '1'\n      - --kafka-addr\n      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092\n      - --advertise-kafka-addr\n      - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092\n      - --pandaproxy-addr\n      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082\n      - --advertise-pandaproxy-addr\n      - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082\n      - --rpc-addr\n      - 0.0.0.0:33145\n      - --advertise-rpc-addr\n      - redpanda-1:33145\n    ports:\n      # - 8081:8081\n      - 8082:8082\n      - 9092:9092\n      - 9644:9644\n      - 28082:28082\n      - 29092:29092\n\n  # Want a two node Redpanda cluster? Uncomment this block :)\n  # redpanda-2:\n  #   image: docker.redpanda.com/redpandadata/redpanda:v23.1.1\n  #   container_name: redpanda-2\n  #   command:\n  #     - redpanda\n  #     - start\n  #     - --smp\n  #     - '1'\n  #     - --reserve-memory\n  #     - 0M\n  #     - --overprovisioned\n  #     - --node-id\n  #     - '2'\n  #     - --seeds\n  #     - redpanda-1:33145\n  #     - --kafka-addr\n  #     - PLAINTEXT://0.0.0.0:29093,OUTSIDE://0.0.0.0:9093\n  #     - --advertise-kafka-addr\n  #     - PLAINTEXT://redpanda-2:29093,OUTSIDE://localhost:9093\n  #     - --pandaproxy-addr\n  #     - PLAINTEXT://0.0.0.0:28083,OUTSIDE://0.0.0.0:8083\n  #     - --advertise-pandaproxy-addr\n  #     - PLAINTEXT://redpanda-2:28083,OUTSIDE://localhost:8083\n  #     - --rpc-addr\n  #     - 0.0.0.0:33146\n  #     - --advertise-rpc-addr\n  #     - redpanda-2:33146\n  #   ports:\n  #     - 8083:8083\n  #     - 9093:9093\n\n  redpanda-console:\n    image: docker.redpanda.com/redpandadata/console:v2.2.2\n    container_name: 
redpanda-console\n    entrypoint: /bin/sh\n    command: -c \"echo \\\"$$CONSOLE_CONFIG_FILE\\\" > /tmp/config.yml; /app/console\"\n    environment:\n      CONFIG_FILEPATH: /tmp/config.yml\n      CONSOLE_CONFIG_FILE: |\n        kafka:\n          brokers: [\"redpanda-1:29092\"]\n          schemaRegistry:\n            enabled: false\n        redpanda:\n          adminApi:\n            enabled: true\n            urls: [\"http://redpanda-1:9644\"]\n        connect:\n          enabled: false\n    ports:\n      - 8080:8080\n    depends_on:\n      - redpanda-1\n"
  },
  {
    "path": "07-streaming/extras/python/redpanda_example/producer.py",
    "content": "import csv\nimport json\nfrom typing import List, Dict\nfrom kafka import KafkaProducer\nfrom kafka.errors import KafkaTimeoutError\n\nfrom ride import Ride\nfrom settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, KAFKA_TOPIC\n\n\n# Wraps a KafkaProducer instead of inheriting from it: the base __init__ was never called\nclass JsonProducer:\n    def __init__(self, props: Dict):\n        self.producer = KafkaProducer(**props)\n\n    @staticmethod\n    def read_records(resource_path: str):\n        records = []\n        with open(resource_path, 'r') as f:\n            reader = csv.reader(f)\n            header = next(reader)  # skip the header row\n            for row in reader:\n                records.append(Ride(arr=row))\n        return records\n\n    def publish_rides(self, topic: str, messages: List[Ride]):\n        for ride in messages:\n            try:\n                record = self.producer.send(topic=topic, key=ride.pu_location_id, value=ride)\n                print('Record {} successfully produced at offset {}'.format(ride.pu_location_id, record.get().offset))\n            except KafkaTimeoutError as e:\n                print(e)\n\n\nif __name__ == '__main__':\n    # The config should match what KafkaProducer expects\n    # Kafka expects a binary format for the key-value pair\n    config = {\n        'bootstrap_servers': BOOTSTRAP_SERVERS,\n        'key_serializer': lambda key: str(key).encode(),\n        'value_serializer': lambda x: json.dumps(x.__dict__, default=str).encode('utf-8')\n    }\n    producer = JsonProducer(props=config)\n    rides = producer.read_records(resource_path=INPUT_DATA_PATH)\n    producer.publish_rides(topic=KAFKA_TOPIC, messages=rides)\n"
  },
  {
    "path": "07-streaming/extras/python/redpanda_example/ride.py",
    "content": "from typing import List, Dict\nfrom decimal import Decimal\nfrom datetime import datetime\n\n\nclass Ride:\n    def __init__(self, arr: List[str]):\n        self.vendor_id = arr[0]\n        self.tpep_pickup_datetime = datetime.strptime(arr[1], \"%Y-%m-%d %H:%M:%S\"),\n        self.tpep_dropoff_datetime = datetime.strptime(arr[2], \"%Y-%m-%d %H:%M:%S\"),\n        self.passenger_count = int(arr[3])\n        self.trip_distance = Decimal(arr[4])\n        self.rate_code_id = int(arr[5])\n        self.store_and_fwd_flag = arr[6]\n        self.pu_location_id = int(arr[7])\n        self.do_location_id = int(arr[8])\n        self.payment_type = arr[9]\n        self.fare_amount = Decimal(arr[10])\n        self.extra = Decimal(arr[11])\n        self.mta_tax = Decimal(arr[12])\n        self.tip_amount = Decimal(arr[13])\n        self.tolls_amount = Decimal(arr[14])\n        self.improvement_surcharge = Decimal(arr[15])\n        self.total_amount = Decimal(arr[16])\n        self.congestion_surcharge = Decimal(arr[17])\n\n    @classmethod\n    def from_dict(cls, d: Dict):\n        return cls(arr=[\n            d['vendor_id'],\n            d['tpep_pickup_datetime'][0],\n            d['tpep_dropoff_datetime'][0],\n            d['passenger_count'],\n            d['trip_distance'],\n            d['rate_code_id'],\n            d['store_and_fwd_flag'],\n            d['pu_location_id'],\n            d['do_location_id'],\n            d['payment_type'],\n            d['fare_amount'],\n            d['extra'],\n            d['mta_tax'],\n            d['tip_amount'],\n            d['tolls_amount'],\n            d['improvement_surcharge'],\n            d['total_amount'],\n            d['congestion_surcharge'],\n        ]\n        )\n\n    def __repr__(self):\n        return f'{self.__class__.__name__}: {self.__dict__}'\n"
  },
  {
    "path": "07-streaming/extras/python/redpanda_example/settings.py",
    "content": "INPUT_DATA_PATH = '../resources/rides.csv'\n\nBOOTSTRAP_SERVERS = ['localhost:9092']\nKAFKA_TOPIC = 'rides_json'\n"
  },
  {
    "path": "07-streaming/extras/python/requirements.txt",
    "content": "kafka-python==1.4.6\nconfluent_kafka\nrequests\navro\nfaust\nfastavro\n"
  },
  {
    "path": "07-streaming/extras/python/resources/schemas/taxi_ride_key.avsc",
    "content": "{\n  \"namespace\": \"com.datatalksclub.taxi\",\n  \"type\": \"record\",\n  \"name\": \"RideRecordKey\",\n  \"fields\": [\n    {\n      \"name\": \"vendor_id\",\n      \"type\": \"int\"\n    }\n  ]\n}"
  },
  {
    "path": "07-streaming/extras/python/resources/schemas/taxi_ride_value.avsc",
    "content": "{\n  \"namespace\": \"com.datatalksclub.taxi\",\n  \"type\": \"record\",\n  \"name\": \"RideRecord\",\n  \"fields\": [\n    {\n      \"name\": \"vendor_id\",\n      \"type\": \"int\"\n    },\n    {\n      \"name\": \"passenger_count\",\n      \"type\": \"int\"\n    },\n    {\n      \"name\": \"trip_distance\",\n      \"type\": \"float\"\n    },\n    {\n      \"name\": \"payment_type\",\n      \"type\": \"int\"\n    },\n    {\n      \"name\": \"total_amount\",\n      \"type\": \"float\"\n    }\n  ]\n}"
  },
  {
    "path": "07-streaming/extras/python/streams-example/faust/branch_price.py",
    "content": "import faust\nfrom taxi_rides import TaxiRide\nfrom faust import current_event\n\napp = faust.App('datatalksclub.stream.v3', broker='kafka://localhost:9092', consumer_auto_offset_reset=\"earliest\")\ntopic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide)\n\nhigh_amount_rides = app.topic('datatalks.yellow_taxi_rides.high_amount')\nlow_amount_rides = app.topic('datatalks.yellow_taxi_rides.low_amount')\n\n\n@app.agent(topic)\nasync def process(stream):\n    async for event in stream:\n        if event.total_amount >= 40.0:\n            await current_event().forward(high_amount_rides)\n        else:\n            await current_event().forward(low_amount_rides)\n\nif __name__ == '__main__':\n    app.main()\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/faust/producer_taxi_json.py",
    "content": "import csv\nfrom json import dumps\nfrom kafka import KafkaProducer\nfrom time import sleep\n\n\nproducer = KafkaProducer(bootstrap_servers=['localhost:9092'],\n                         key_serializer=lambda x: dumps(x).encode('utf-8'),\n                         value_serializer=lambda x: dumps(x).encode('utf-8'))\n\nfile = open('../../resources/rides.csv')\n\ncsvreader = csv.reader(file)\nheader = next(csvreader)\nfor row in csvreader:\n    key = {\"vendorId\": int(row[0])}\n    value = {\"vendorId\": int(row[0]), \"passenger_count\": int(row[3]), \"trip_distance\": float(row[4]), \"payment_type\": int(row[9]), \"total_amount\": float(row[16])}\n    producer.send('datatalkclub.yellow_taxi_ride.json', value=value, key=key)\n    print(\"producing\")\n    sleep(1)"
  },
  {
    "path": "07-streaming/extras/python/streams-example/faust/stream.py",
    "content": "import faust\nfrom taxi_rides import TaxiRide\n\n\napp = faust.App('datatalksclub.stream.v2', broker='kafka://localhost:9092')\ntopic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide)\n\n\n@app.agent(topic)\nasync def start_reading(records):\n    async for record in records:\n        print(record)\n\n\nif __name__ == '__main__':\n    app.main()\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/faust/stream_count_vendor_trips.py",
    "content": "import faust\nfrom taxi_rides import TaxiRide\n\n\napp = faust.App('datatalksclub.stream.v2', broker='kafka://localhost:9092')\ntopic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide)\n\nvendor_rides = app.Table('vendor_rides', default=int)\n\n\n@app.agent(topic)\nasync def process(stream):\n    async for event in stream.group_by(TaxiRide.vendorId):\n        vendor_rides[event.vendorId] += 1\n\nif __name__ == '__main__':\n    app.main()\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/faust/taxi_rides.py",
    "content": "import faust\n\n\nclass TaxiRide(faust.Record, validation=True):\n    vendorId: str\n    passenger_count: int\n    trip_distance: float\n    payment_type: int\n    total_amount: float\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/faust/windowing.py",
    "content": "from datetime import timedelta\nimport faust\nfrom taxi_rides import TaxiRide\n\n\napp = faust.App('datatalksclub.stream.v2', broker='kafka://localhost:9092')\ntopic = app.topic('datatalkclub.yellow_taxi_ride.json', value_type=TaxiRide)\n\nvendor_rides = app.Table('vendor_rides_windowed', default=int).tumbling(\n    timedelta(minutes=1),\n    expires=timedelta(hours=1),\n)\n\n\n@app.agent(topic)\nasync def process(stream):\n    async for event in stream.group_by(TaxiRide.vendorId):\n        vendor_rides[event.vendorId] += 1\n\n\nif __name__ == '__main__':\n    app.main()\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/pyspark/README.md",
    "content": "\n# Running PySpark Streaming \n\n#### Prerequisite\n\nEnsure your Kafka and Spark services are up and running by following the [docker setup readme](./../../docker/README.md). \nIt is important to create the network and volume as described in that document, so please verify that both were created correctly:\n\n```bash\ndocker volume ls # should list hadoop-distributed-file-system\ndocker network ls # should list kafka-spark-network \n```\n\n\n### Running Producer and Consumer\n```bash\n# Run producer\npython3 producer.py\n\n# Run consumer with default settings\npython3 consumer.py\n# Run consumer for specific topic\npython3 consumer.py --topic <topic-name>\n```\n\n### Running Streaming Script\n\nThe `spark-submit.sh` script ensures the necessary jars are installed before running `streaming.py`:\n\n```bash\n./spark-submit.sh streaming.py \n```\n\n### Additional Resources\n- [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streaming-programming-guide)\n- [Structured Streaming + Kafka Integration](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#structured-streaming-kafka-integration-guide-kafka-broker-versio)\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/pyspark/consumer.py",
    "content": "import argparse\nfrom typing import Dict, List\nfrom kafka import KafkaConsumer\n\nfrom settings import BOOTSTRAP_SERVERS, CONSUME_TOPIC_RIDES_CSV\n\n\nclass RideCSVConsumer:\n    def __init__(self, props: Dict):\n        self.consumer = KafkaConsumer(**props)\n\n    def consume_from_kafka(self, topics: List[str]):\n        self.consumer.subscribe(topics=topics)\n        print('Consuming from Kafka started')\n        print('Available topics to consume: ', self.consumer.subscription())\n        while True:\n            try:\n                # SIGINT can't be handled when polling, limit timeout to 1 second.\n                msg = self.consumer.poll(1.0)\n                if msg is None or msg == {}:\n                    continue\n                for msg_key, msg_values in msg.items():\n                    for msg_val in msg_values:\n                        print(f'Key:{msg_val.key}-type({type(msg_val.key)}), '\n                              f'Value:{msg_val.value}-type({type(msg_val.value)})')\n            except KeyboardInterrupt:\n                break\n\n        self.consumer.close()\n\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser(description='Kafka Consumer')\n    parser.add_argument('--topic', type=str, default=CONSUME_TOPIC_RIDES_CSV)\n    args = parser.parse_args()\n\n    topic = args.topic\n    config = {\n        'bootstrap_servers': [BOOTSTRAP_SERVERS],\n        'auto_offset_reset': 'earliest',\n        'enable_auto_commit': True,\n        'key_deserializer': lambda key: int(key.decode('utf-8')),\n        'value_deserializer': lambda value: value.decode('utf-8'),\n        'group_id': 'consumer.group.id.csv-example.1',\n    }\n    csv_consumer = RideCSVConsumer(props=config)\n    csv_consumer.consume_from_kafka(topics=[topic])\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/pyspark/producer.py",
    "content": "import csv\nfrom time import sleep\nfrom typing import Dict\nfrom kafka import KafkaProducer\n\nfrom settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, PRODUCE_TOPIC_RIDES_CSV\n\n\ndef delivery_report(err, msg):\n    if err is not None:\n        print(\"Delivery failed for record {}: {}\".format(msg.key(), err))\n        return\n    print('Record {} successfully produced to {} [{}] at offset {}'.format(\n        msg.key(), msg.topic(), msg.partition(), msg.offset()))\n\n\nclass RideCSVProducer:\n    def __init__(self, props: Dict):\n        self.producer = KafkaProducer(**props)\n        # self.producer = Producer(producer_props)\n\n    @staticmethod\n    def read_records(resource_path: str):\n        records, ride_keys = [], []\n        i = 0\n        with open(resource_path, 'r') as f:\n            reader = csv.reader(f)\n            header = next(reader)  # skip the header\n            for row in reader:\n                # vendor_id, passenger_count, trip_distance, payment_type, total_amount\n                records.append(f'{row[0]}, {row[1]}, {row[2]}, {row[3]}, {row[4]}, {row[9]}, {row[16]}')\n                ride_keys.append(str(row[0]))\n                i += 1\n                if i == 5:\n                    break\n        return zip(ride_keys, records)\n\n    def publish(self, topic: str, records: [str, str]):\n        for key_value in records:\n            key, value = key_value\n            try:\n                self.producer.send(topic=topic, key=key, value=value)\n                print(f\"Producing record for <key: {key}, value:{value}>\")\n            except KeyboardInterrupt:\n                break\n            except Exception as e:\n                print(f\"Exception while producing record - {value}: {e}\")\n\n        self.producer.flush()\n        sleep(1)\n\n\nif __name__ == \"__main__\":\n    config = {\n        'bootstrap_servers': [BOOTSTRAP_SERVERS],\n        'key_serializer': lambda x: x.encode('utf-8'),\n        
'value_serializer': lambda x: x.encode('utf-8')\n    }\n    producer = RideCSVProducer(props=config)\n    ride_records = producer.read_records(resource_path=INPUT_DATA_PATH)\n    print(ride_records)\n    producer.publish(topic=PRODUCE_TOPIC_RIDES_CSV, records=ride_records)\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/pyspark/settings.py",
    "content": "import pyspark.sql.types as T\n\nINPUT_DATA_PATH = '../../resources/rides.csv'\nBOOTSTRAP_SERVERS = 'localhost:9092'\n\nTOPIC_WINDOWED_VENDOR_ID_COUNT = 'vendor_counts_windowed'\n\nPRODUCE_TOPIC_RIDES_CSV = CONSUME_TOPIC_RIDES_CSV = 'rides_csv'\n\nRIDE_SCHEMA = T.StructType(\n    [T.StructField(\"vendor_id\", T.IntegerType()),\n     T.StructField('tpep_pickup_datetime', T.TimestampType()),\n     T.StructField('tpep_dropoff_datetime', T.TimestampType()),\n     T.StructField(\"passenger_count\", T.IntegerType()),\n     T.StructField(\"trip_distance\", T.FloatType()),\n     T.StructField(\"payment_type\", T.IntegerType()),\n     T.StructField(\"total_amount\", T.FloatType()),\n     ])\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/pyspark/spark-submit.sh",
    "content": "# Submit Python code to SparkMaster\n\nif [ $# -lt 1 ]\nthen\n\techo \"Usage: $0 <pyspark-job.py> [ executor-memory ]\"\n\techo \"(specify memory in string format such as \\\"512M\\\" or \\\"2G\\\")\"\n\texit 1\nfi\nPYTHON_JOB=$1\n\nif [ -z $2 ]\nthen\n\tEXEC_MEM=\"1G\"\nelse\n\tEXEC_MEM=$2\nfi\nspark-submit --master spark://localhost:7077 --num-executors 2 \\\n\t           --executor-memory $EXEC_MEM --executor-cores 1 \\\n             --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.3.1 \\\n             $PYTHON_JOB"
  },
  {
    "path": "07-streaming/extras/python/streams-example/pyspark/streaming-notebook.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"c4419168-c0e6-4a65-b56e-8454c42060ac\",\n   \"metadata\": {\n    \"jp-MarkdownHeadingCollapsed\": true,\n    \"tags\": []\n   },\n   \"source\": [\n    \"### 0. Spark Setup\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"32bd7cdd-8504-4a54-a461-244bf7878d2a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1 pyspark-shell'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"3aab2a7e-a685-4925-9c9a-b5adf201af77\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \":: loading settings :: url = jar:file:/usr/local/lib/python3.10/dist-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Ivy Default Cache set to: /root/.ivy2/cache\\n\",\n      \"The jars for the packages stored in: /root/.ivy2/jars\\n\",\n      \"org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\\n\",\n      \"org.apache.spark#spark-avro_2.12 added as a dependency\\n\",\n      \":: resolving dependencies :: org.apache.spark#spark-submit-parent-5a3a4db6-be91-4d32-9884-8b0f38241b3f;1.0\\n\",\n      \"\\tconfs: [default]\\n\",\n      \"\\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 in central\\n\",\n      \"\\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 in central\\n\",\n      \"\\tfound org.apache.kafka#kafka-clients;2.8.1 in central\\n\",\n      \"\\tfound org.lz4#lz4-java;1.8.0 in central\\n\",\n      \"\\tfound org.xerial.snappy#snappy-java;1.1.8.4 in 
central\\n\",\n      \"\\tfound org.slf4j#slf4j-api;1.7.32 in central\\n\",\n      \"\\tfound org.apache.hadoop#hadoop-client-runtime;3.3.2 in central\\n\",\n      \"\\tfound org.spark-project.spark#unused;1.0.0 in central\\n\",\n      \"\\tfound org.apache.hadoop#hadoop-client-api;3.3.2 in central\\n\",\n      \"\\tfound commons-logging#commons-logging;1.1.3 in central\\n\",\n      \"\\tfound com.google.code.findbugs#jsr305;3.0.0 in central\\n\",\n      \"\\tfound org.apache.commons#commons-pool2;2.11.1 in central\\n\",\n      \"\\tfound org.apache.spark#spark-avro_2.12;3.3.1 in central\\n\",\n      \"\\tfound org.tukaani#xz;1.8 in central\\n\",\n      \":: resolution report :: resolve 544ms :: artifacts dl 11ms\\n\",\n      \"\\t:: modules in use:\\n\",\n      \"\\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\\n\",\n      \"\\tcommons-logging#commons-logging;1.1.3 from central in [default]\\n\",\n      \"\\torg.apache.commons#commons-pool2;2.11.1 from central in [default]\\n\",\n      \"\\torg.apache.hadoop#hadoop-client-api;3.3.2 from central in [default]\\n\",\n      \"\\torg.apache.hadoop#hadoop-client-runtime;3.3.2 from central in [default]\\n\",\n      \"\\torg.apache.kafka#kafka-clients;2.8.1 from central in [default]\\n\",\n      \"\\torg.apache.spark#spark-avro_2.12;3.3.1 from central in [default]\\n\",\n      \"\\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 from central in [default]\\n\",\n      \"\\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 from central in [default]\\n\",\n      \"\\torg.lz4#lz4-java;1.8.0 from central in [default]\\n\",\n      \"\\torg.slf4j#slf4j-api;1.7.32 from central in [default]\\n\",\n      \"\\torg.spark-project.spark#unused;1.0.0 from central in [default]\\n\",\n      \"\\torg.tukaani#xz;1.8 from central in [default]\\n\",\n      \"\\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\\n\",\n      
\"\\t---------------------------------------------------------------------\\n\",\n      \"\\t|                  |            modules            ||   artifacts   |\\n\",\n      \"\\t|       conf       | number| search|dwnlded|evicted|| number|dwnlded|\\n\",\n      \"\\t---------------------------------------------------------------------\\n\",\n      \"\\t|      default     |   14  |   0   |   0   |   0   ||   14  |   0   |\\n\",\n      \"\\t---------------------------------------------------------------------\\n\",\n      \":: retrieving :: org.apache.spark#spark-submit-parent-5a3a4db6-be91-4d32-9884-8b0f38241b3f\\n\",\n      \"\\tconfs: [default]\\n\",\n      \"\\t0 artifacts copied, 14 already retrieved (0kB/8ms)\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"23/02/21 21:20:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from pyspark.sql import SparkSession\\n\",\n    \"import pyspark.sql.types as T\\n\",\n    \"import pyspark.sql.functions as F\\n\",\n    \"\\n\",\n    \"spark = SparkSession \\\\\\n\",\n    \"    .builder \\\\\\n\",\n    \"    .appName(\\\"Spark-Notebook\\\") \\\\\\n\",\n    \"    .getOrCreate()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"6f4b62fa-b3ce-4a1b-a1f4-2ed332a0d55a\",\n   \"metadata\": {\n    \"jp-MarkdownHeadingCollapsed\": true,\n    \"tags\": []\n   },\n   \"source\": [\n    \"### 1. 
Reading from Kafka Stream\\n\",\n    \"\\n\",\n    \"through `readStream`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"f491fa45-4471-4bc5-92f7-48081f687140\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### 1.1 Raw Kafka Stream\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"82c25cb2-2599-4f9b-8849-967fbb604a44\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# default for startingOffsets is \\\"latest\\\"\\n\",\n    \"df_kafka_raw = spark \\\\\\n\",\n    \"    .readStream \\\\\\n\",\n    \"    .format(\\\"kafka\\\") \\\\\\n\",\n    \"    .option(\\\"kafka.bootstrap.servers\\\", \\\"localhost:9092,broker:29092\\\") \\\\\\n\",\n    \"    .option(\\\"subscribe\\\", \\\"rides_csv\\\") \\\\\\n\",\n    \"    .option(\\\"startingOffsets\\\", \\\"earliest\\\") \\\\\\n\",\n    \"    .option(\\\"checkpointLocation\\\", \\\"checkpoint\\\") \\\\\\n\",\n    \"    .load()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"id\": \"d9149ccd-69b2-4f5b-afc0-43567673c634\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- key: binary (nullable = true)\\n\",\n      \" |-- value: binary (nullable = true)\\n\",\n      \" |-- topic: string (nullable = true)\\n\",\n      \" |-- partition: integer (nullable = true)\\n\",\n      \" |-- offset: long (nullable = true)\\n\",\n      \" |-- timestamp: timestamp (nullable = true)\\n\",\n      \" |-- timestampType: integer (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_kafka_raw.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"62e5e753-89c7-460f-a8be-16868ce5c680\",\n   \"metadata\": {\n    \"jp-MarkdownHeadingCollapsed\": true,\n    \"tags\": []\n   },\n   \"source\": [\n    \"#### 1.2 Encoded Kafka 
Stream\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"id\": \"0b745eed-7d74-421e-8e4b-c8343fda4de3\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"df_kafka_encoded = df_kafka_raw.selectExpr(\\\"CAST(key AS STRING)\\\",\\\"CAST(value AS STRING)\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"id\": \"6839addc-c7c0-4117-8c9c-d2cd59cbf136\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- key: string (nullable = true)\\n\",\n      \" |-- value: string (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_kafka_encoded.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"6749c4de-6f80-4b91-b2b8-b2968c761d75\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### 1.3 Structured Streaming DataFrame\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"id\": \"ca20ae37-49f0-421f-9859-73fac8d4ca45\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def parse_ride_from_kafka_message(df_raw, schema):\\n\",\n    \"    \\\"\\\"\\\" take a Spark Streaming df and parse value col based on <schema>, return streaming df cols in schema \\\"\\\"\\\"\\n\",\n    \"    assert df_raw.isStreaming is True, \\\"DataFrame doesn't receive streaming data\\\"\\n\",\n    \"\\n\",\n    \"    df = df_raw.selectExpr(\\\"CAST(key AS STRING)\\\", \\\"CAST(value AS STRING)\\\")\\n\",\n    \"\\n\",\n    \"    # split attributes to nested array in one Column\\n\",\n    \"    col = F.split(df['value'], ', ')\\n\",\n    \"\\n\",\n    \"    # expand col to multiple top-level columns\\n\",\n    \"    for idx, field in enumerate(schema):\\n\",\n    \"        df = df.withColumn(field.name, 
col.getItem(idx).cast(field.dataType))\\n\",\n    \"    return df.select([field.name for field in schema])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"id\": \"e1737bd0-146f-4ee2-a70f-a4657af5bbc6\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"ride_schema = T.StructType(\\n\",\n    \"    [T.StructField(\\\"vendor_id\\\", T.IntegerType()),\\n\",\n    \"     T.StructField('tpep_pickup_datetime', T.TimestampType()),\\n\",\n    \"     T.StructField('tpep_dropoff_datetime', T.TimestampType()),\\n\",\n    \"     T.StructField(\\\"passenger_count\\\", T.IntegerType()),\\n\",\n    \"     T.StructField(\\\"trip_distance\\\", T.FloatType()),\\n\",\n    \"     T.StructField(\\\"payment_type\\\", T.IntegerType()),\\n\",\n    \"     T.StructField(\\\"total_amount\\\", T.FloatType()),\\n\",\n    \"     ])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"id\": \"ae2ce896-f54b-4166-b01f-b5532ab292fe\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"df_rides = parse_ride_from_kafka_message(df_raw=df_kafka_raw, schema=ride_schema)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"id\": \"cd848228-97c5-4325-8457-97f35e533cd8\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- vendor_id: integer (nullable = true)\\n\",\n      \" |-- tpep_pickup_datetime: timestamp (nullable = true)\\n\",\n      \" |-- tpep_dropoff_datetime: timestamp (nullable = true)\\n\",\n      \" |-- passenger_count: integer (nullable = true)\\n\",\n      \" |-- trip_distance: float (nullable = true)\\n\",\n      \" |-- payment_type: integer (nullable = true)\\n\",\n      \" |-- total_amount: float (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    
\"df_rides.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"60277fdc-2797-4b23-9ecf-956b76db5778\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"source\": [\n    \"### 2 Sink Operation & Streaming Query\\n\",\n    \"\\n\",\n    \"through `writeStream`\\n\",\n    \"\\n\",\n    \"---\\n\",\n    \"**Output Sinks**\\n\",\n    \"- File Sink: stores the output in a directory\\n\",\n    \"- Kafka Sink: stores the output to one or more topics in Kafka\\n\",\n    \"- Foreach Sink: runs arbitrary computation on the records in the output\\n\",\n    \"- (for debugging) Console Sink, Memory Sink\\n\",\n    \"\\n\",\n    \"Further details can be found in [Output Sinks](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)\\n\",\n    \"\\n\",\n    \"---\\n\",\n    \"There are three types of **Output Modes**:\\n\",\n    \"- Complete: The whole Result Table is written to the sink after every trigger. This is supported for aggregation queries.\\n\",\n    \"- Append (default): Only the new rows added to the Result Table since the last trigger are written to the sink\\n\",\n    \"- Update: Only the rows updated in the Result Table since the last trigger are written to the sink\\n\",\n    \"\\n\",\n    \"The supported [Output Modes](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes) differ based on the set of transformations applied to the streaming data. \\n\",\n    \"\\n\",\n    \"--- \\n\",\n    \"**Triggers**\\n\",\n    \"\\n\",\n    \"The [trigger settings](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers) of a streaming query define the timing of streaming data processing. 
Spark Structured Streaming supports micro-batch processing, and you can select one of the following options based on your requirements.\\n\",\n    \"\\n\",\n    \"- default-micro-batch-mode\\n\",\n    \"- fixed-interval-micro-batch-mode\\n\",\n    \"- one-time-micro-batch-mode\\n\",\n    \"- available-now-micro-batch-mode\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"02ca9b08-aa61-46cd-b946-4457ce2cdf5d\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"source\": [\n    \"#### Console and Memory Sink\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"id\": \"74c72469-4c37-417c-a866-a1c1ef75ae8b\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):\\n\",\n    \"    write_query = df.writeStream \\\\\\n\",\n    \"        .outputMode(output_mode) \\\\\\n\",\n    \"        .trigger(processingTime=processing_time) \\\\\\n\",\n    \"        .format(\\\"console\\\") \\\\\\n\",\n    \"        .option(\\\"truncate\\\", False) \\\\\\n\",\n    \"        .start()\\n\",\n    \"    return write_query # pyspark.sql.streaming.StreamingQuery\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 22,\n   \"id\": \"d866c7ba-f8e9-475d-830a-50ffb2c5472b\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"23/02/21 21:46:12 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-289a958e-f6b6-4b38-a87b-50002d82ec8b. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. 
Important to know deleting temp checkpoint folder is best effort.\\n\",\n      \"23/02/21 21:46:12 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\\n\",\n      \"23/02/21 21:46:12 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0-3, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"23/02/21 21:46:12 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0-3, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-driver-0] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\\n\",\n      \"23/02/21 21:46:13 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-4, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"23/02/21 21:46:13 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-4, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\\n\",\n      \"-------------------------------------------\\n\",\n      \"Batch: 0\\n\",\n      \"-------------------------------------------\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\\n\",\n      \"|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\\n\",\n      \"|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\\n\",\n      \"|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\\n\",\n      \"|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\\n\",\n      \"|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\\n\",\n      \"|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\\n\",\n      \"|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\\n\",\n      \"|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       
|\\n\",\n      \"|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"\\n\",\n      \"23/02/21 22:11:05 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-5, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"23/02/21 22:11:05 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor-5, groupId=spark-kafka-source-a303026d-ebd2-4fd3-a000-adb99dfea4a9--717872766-executor] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"-------------------------------------------\\n\",\n      \"Batch: 1\\n\",\n      \"-------------------------------------------\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\\n\",\n      \"|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8       
 |\\n\",\n      \"|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\\n\",\n      \"|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\\n\",\n      \"|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"write_query = sink_console(df_rides, output_mode='append')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"id\": \"a9bfa73f-a8cc-4988-a8cf-bf31ee6c449c\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def sink_memory(df, query_name, query_template):\\n\",\n    \"    write_query = df \\\\\\n\",\n    \"        .writeStream \\\\\\n\",\n    \"        .queryName(query_name) \\\\\\n\",\n    \"        .format('memory') \\\\\\n\",\n    \"        .start()\\n\",\n    \"    query_str = query_template.format(table_name=query_name)\\n\",\n    \"    query_results = spark.sql(query_str)\\n\",\n    \"    return write_query, query_results\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 16,\n   \"id\": \"b31d0b76-e917-44e7-a14d-f9ce6901c23a\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"23/02/21 21:31:47 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-b3e2c096-aa06-4083-9cdf-d6f3cf04fc06. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. 
Important to know deleting temp checkpoint folder is best effort.\\n\",\n      \"23/02/21 21:31:47 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\\n\",\n      \"23/02/21 21:31:48 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0-1, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"23/02/21 21:31:48 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0-1, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-driver-0] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\\n\",\n      \"23/02/21 21:31:49 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor-2, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"23/02/21 21:31:49 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor-2, groupId=spark-kafka-source-f07faf6a-cb53-4ec8-bf58-1685d976f432--722858875-executor] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"                                                                                \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"query_name = 'vendor_id_counts'\\n\",\n    \"query_template = 'select count(distinct(vendor_id)) from {table_name}'\\n\",\n    \"write_query, df_vendor_id_counts = sink_memory(df=df_rides, query_name=query_name, query_template=query_template)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"id\": \"4ba56111-83bf-4028-ac65-565e0190f310\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"<class 'pyspark.sql.streaming.StreamingQuery'>\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'message': 'Waiting for data to arrive',\\n\",\n       \" 'isDataAvailable': False,\\n\",\n       \" 'isTriggerActive': True}\"\n      ]\n     },\n     \"execution_count\": 18,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"print(type(write_query)) # pyspark.sql.streaming.StreamingQuery\\n\",\n    \"write_query.status\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"id\": \"7cc37bda-9cfa-402b-9d42-a6ba5271476b\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+-------------------------+\\n\",\n      \"|count(DISTINCT 
vendor_id)|\\n\",\n      \"+-------------------------+\\n\",\n      \"|                        2|\\n\",\n      \"+-------------------------+\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_vendor_id_counts.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"id\": \"88862ca9-4d89-487e-987f-08a2b9e83efe\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"write_query.stop()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"443d4041-06db-4a4a-89c1-348848cc7ca8\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"source\": [\n    \"#### Kafka Sink\\n\",\n    \"\\n\",\n    \"To write stream results to a Kafka topic, the streaming DataFrame must have at least one column named `value`.\\n\",\n    \"\\n\",\n    \"Therefore, before starting `writeStream` in `kafka` format, the DataFrame needs to be reshaped accordingly.\\n\",\n    \"\\n\",\n    \"More information about the data structure the Kafka sink expects is available [here](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 21,\n   \"id\": \"8b08a013-d039-41cf-94fd-a1a57571d25f\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def prepare_dataframe_to_kafka_sink(df, value_columns, key_column=None):\\n\",\n    \"    columns = df.columns\\n\",\n    \"    df = df.withColumn(\\\"value\\\", F.concat_ws(', ',*value_columns))    \\n\",\n    \"    if key_column:\\n\",\n    \"        df = df.withColumnRenamed(key_column,\\\"key\\\")\\n\",\n    \"        df = df.withColumn(\\\"key\\\",df.key.cast('string'))\\n\",\n    \"    return df.select(['key', 'value'])\\n\",\n    \"    \\n\",\n    \"def sink_kafka(df, topic, output_mode='append'):\\n\",\n    \"    write_query = df.writeStream \\\\\\n\",\n    \"        .format(\\\"kafka\\\") \\\\\\n\",\n    \"        
.option(\\\"kafka.bootstrap.servers\\\", \\\"localhost:9092,broker:29092\\\") \\\\\\n\",\n    \"        .outputMode(output_mode) \\\\\\n\",\n    \"        .option(\\\"topic\\\", topic) \\\\\\n\",\n    \"        .option(\\\"checkpointLocation\\\", \\\"checkpoint\\\") \\\\\\n\",\n    \"        .start()\\n\",\n    \"    return write_query\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"e4cb2140-9f2e-4914-b74c-be4c18cdbe8a\",\n   \"metadata\": {},\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.10.6\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/pyspark/streaming.py",
    "content": "from pyspark.sql import SparkSession\nimport pyspark.sql.functions as F\n\nfrom settings import RIDE_SCHEMA, CONSUME_TOPIC_RIDES_CSV, TOPIC_WINDOWED_VENDOR_ID_COUNT\n\n\ndef read_from_kafka(consume_topic: str):\n    # Streaming DataFrame subscribed to the Kafka topic served by the brokers in the kafka.bootstrap.servers option\n    df_stream = spark \\\n        .readStream \\\n        .format(\"kafka\") \\\n        .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n        .option(\"subscribe\", consume_topic) \\\n        .option(\"startingOffsets\", \"earliest\") \\\n        .option(\"checkpointLocation\", \"checkpoint\") \\\n        .load()\n    return df_stream\n\n\ndef parse_ride_from_kafka_message(df, schema):\n    \"\"\" Take a streaming DataFrame, parse its value column according to <schema>, and return a streaming DataFrame with the schema's columns \"\"\"\n    assert df.isStreaming is True, \"DataFrame doesn't receive streaming data\"\n\n    df = df.selectExpr(\"CAST(key AS STRING)\", \"CAST(value AS STRING)\")\n\n    # split the comma-separated attributes into an array held in a single Column\n    col = F.split(df['value'], ', ')\n\n    # expand col to multiple top-level columns\n    for idx, field in enumerate(schema):\n        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))\n    return df.select([field.name for field in schema])\n\n\ndef sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):\n    write_query = df.writeStream \\\n        .outputMode(output_mode) \\\n        .trigger(processingTime=processing_time) \\\n        .format(\"console\") \\\n        .option(\"truncate\", False) \\\n        .start()\n    return write_query  # pyspark.sql.streaming.StreamingQuery\n\n\ndef sink_memory(df, query_name, query_template):\n    query_df = df \\\n        .writeStream \\\n        .queryName(query_name) \\\n        .format(\"memory\") \\\n        .start()\n    query_str = query_template.format(table_name=query_name)\n    query_results = 
spark.sql(query_str)\n    return query_results, query_df\n\n\ndef sink_kafka(df, topic):\n    write_query = df.writeStream \\\n        .format(\"kafka\") \\\n        .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n        .outputMode('complete') \\\n        .option(\"topic\", topic) \\\n        .option(\"checkpointLocation\", \"checkpoint\") \\\n        .start()\n    return write_query\n\n\ndef prepare_df_to_kafka_sink(df, value_columns, key_column=None):\n    df = df.withColumn(\"value\", F.concat_ws(', ', *value_columns))\n    if key_column:\n        df = df.withColumnRenamed(key_column, \"key\")\n        df = df.withColumn(\"key\", df.key.cast('string'))\n        return df.select(['key', 'value'])\n    # the Kafka sink only requires a value column; the key is optional\n    return df.select(['value'])\n\n\ndef op_groupby(df, column_names):\n    df_aggregation = df.groupBy(column_names).count()\n    return df_aggregation\n\n\ndef op_windowed_groupby(df, window_duration, slide_duration):\n    df_windowed_aggregation = df.groupBy(\n        F.window(timeColumn=df.tpep_pickup_datetime, windowDuration=window_duration, slideDuration=slide_duration),\n        df.vendor_id\n    ).count()\n    return df_windowed_aggregation\n\n\nif __name__ == \"__main__\":\n    spark = SparkSession.builder.appName('streaming-examples').getOrCreate()\n    spark.sparkContext.setLogLevel('WARN')\n\n    # read_streaming data\n    df_consume_stream = read_from_kafka(consume_topic=CONSUME_TOPIC_RIDES_CSV)\n    df_consume_stream.printSchema()  # printSchema() prints the schema itself and returns None\n\n    # parse streaming data\n    df_rides = parse_ride_from_kafka_message(df_consume_stream, RIDE_SCHEMA)\n    df_rides.printSchema()\n\n    sink_console(df_rides, output_mode='append')\n\n    df_trip_count_by_vendor_id = op_groupby(df_rides, ['vendor_id'])\n    df_trip_count_by_pickup_date_vendor_id = op_windowed_groupby(df_rides, window_duration=\"10 minutes\",\n                                                                 slide_duration='5 minutes')\n\n    # write the output out to the 
console for debugging / testing\n    sink_console(df_trip_count_by_vendor_id)\n    # write the output to the kafka topic\n    df_trip_count_messages = prepare_df_to_kafka_sink(df=df_trip_count_by_pickup_date_vendor_id,\n                                                      value_columns=['count'], key_column='vendor_id')\n    kafka_sink_query = sink_kafka(df=df_trip_count_messages, topic=TOPIC_WINDOWED_VENDOR_ID_COUNT)\n\n    spark.streams.awaitAnyTermination()\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/redpanda/README.md",
    "content": "\n# Running PySpark Streaming with Redpanda\n\n### 1. Prerequisite\n\nThis example relies on a Docker network and a Docker volume that are shared with the other examples. Check that both exist:\n\n```bash\ndocker volume ls # should list hadoop-distributed-file-system\ndocker network ls # should list kafka-spark-network\n```\n\n### 2. Create Docker Network & Volume\n\nIf the `ls` commands above do not list them (for example, because you have not followed any of the other examples), create them now.\n\n```bash\n# Create Network\ndocker network create kafka-spark-network\n\n# Create Volume\ndocker volume create --name=hadoop-distributed-file-system\n```\n\n### Running Producer and Consumer\n```bash\n# Run producer\npython producer.py\n\n# Run consumer with default settings\npython consumer.py\n# Run consumer for specific topic\npython consumer.py --topic <topic-name>\n```\n\n### Running Streaming Script\n\nThe `spark-submit.sh` script passes the required Kafka and Avro packages to `spark-submit`, so the necessary jars are installed before `streaming.py` runs.\n\n```bash\n./spark-submit.sh streaming.py\n```\n\n### Additional Resources\n- [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#structured-streaming-programming-guide)\n- [Structured Streaming + Kafka Integration](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#structured-streaming-kafka-integration-guide-kafka-broker-versio)\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/redpanda/consumer.py",
    "content": "import argparse\nfrom typing import Dict, List\nfrom kafka import KafkaConsumer\n\nfrom settings import BOOTSTRAP_SERVERS, CONSUME_TOPIC_RIDES_CSV\n\n\nclass RideCSVConsumer:\n    def __init__(self, props: Dict):\n        self.consumer = KafkaConsumer(**props)\n\n    def consume_from_kafka(self, topics: List[str]):\n        self.consumer.subscribe(topics=topics)\n        print('Consuming from Kafka started')\n        print('Available topics to consume: ', self.consumer.subscription())\n        while True:\n            try:\n                # SIGINT can't be handled when polling, limit timeout to 1 second.\n                msg = self.consumer.poll(1.0)\n                if msg is None or msg == {}:\n                    continue\n                for msg_key, msg_values in msg.items():\n                    for msg_val in msg_values:\n                        print(f'Key:{msg_val.key}-type({type(msg_val.key)}), '\n                              f'Value:{msg_val.value}-type({type(msg_val.value)})')\n            except KeyboardInterrupt:\n                break\n\n        self.consumer.close()\n\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser(description='Kafka Consumer')\n    parser.add_argument('--topic', type=str, default=CONSUME_TOPIC_RIDES_CSV)\n    args = parser.parse_args()\n\n    topic = args.topic\n    config = {\n        'bootstrap_servers': [BOOTSTRAP_SERVERS],\n        'auto_offset_reset': 'earliest',\n        'enable_auto_commit': True,\n        'key_deserializer': lambda key: int(key.decode('utf-8')),\n        'value_deserializer': lambda value: value.decode('utf-8'),\n        'group_id': 'consumer.group.id.csv-example.1',\n    }\n    csv_consumer = RideCSVConsumer(props=config)\n    csv_consumer.consume_from_kafka(topics=[topic])\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/redpanda/docker-compose.yaml",
    "content": "version: '3.7'\nvolumes:\n  shared-workspace:\n    name: \"hadoop-distributed-file-system\"\n    driver: local\nnetworks:\n  default:\n    name: kafka-spark-network\n    external: true\nservices:\n  # Redpanda cluster\n  redpanda-1:\n    image: docker.redpanda.com/redpandadata/redpanda:v23.2.26\n    container_name: redpanda-1\n    command:\n      - redpanda\n      - start\n      - --smp\n      - '1'\n      - --reserve-memory\n      - 0M\n      - --overprovisioned\n      - --node-id\n      - '1'\n      - --kafka-addr\n      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092\n      - --advertise-kafka-addr\n      - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092\n      - --pandaproxy-addr\n      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082\n      - --advertise-pandaproxy-addr\n      - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082\n      - --rpc-addr\n      - 0.0.0.0:33145\n      - --advertise-rpc-addr\n      - redpanda-1:33145\n    ports:\n      # - 8081:8081\n      - 8082:8082\n      - 9092:9092\n      - 9644:9644\n      - 28082:28082\n      - 29092:29092\n    volumes:\n      - shared-workspace:/opt/workspace\n\n  # Want a two node Redpanda cluster? 
Uncomment this block :)\n  redpanda-2:\n    image: docker.redpanda.com/redpandadata/redpanda:v23.1.1\n    container_name: redpanda-2\n    command:\n      - redpanda\n      - start\n      - --smp\n      - '1'\n      - --reserve-memory\n      - 0M\n      - --overprovisioned\n      - --node-id\n      - '2'\n      - --seeds\n      - redpanda-1:33145\n      - --kafka-addr\n      - PLAINTEXT://0.0.0.0:29093,OUTSIDE://0.0.0.0:9093\n      - --advertise-kafka-addr\n      - PLAINTEXT://redpanda-2:29093,OUTSIDE://localhost:9093\n      - --pandaproxy-addr\n      - PLAINTEXT://0.0.0.0:28083,OUTSIDE://0.0.0.0:8083\n      - --advertise-pandaproxy-addr\n      - PLAINTEXT://redpanda-2:28083,OUTSIDE://localhost:8083\n      - --rpc-addr\n      - 0.0.0.0:33146\n      - --advertise-rpc-addr\n      - redpanda-2:33146\n    ports:\n      - 8083:8083\n      - 9093:9093\n    volumes:\n      - shared-workspace:/opt/workspace\n\n  redpanda-console:\n    image: docker.redpanda.com/redpandadata/console:v2.2.2\n    container_name: redpanda-console\n    entrypoint: /bin/sh\n    command: -c \"echo \\\"$$CONSOLE_CONFIG_FILE\\\" > /tmp/config.yml; /app/console\"\n    environment:\n      CONFIG_FILEPATH: /tmp/config.yml\n      CONSOLE_CONFIG_FILE: |\n        kafka:\n          brokers: [\"redpanda-1:29092\"]\n          schemaRegistry:\n            enabled: false\n        redpanda:\n          adminApi:\n            enabled: true\n            urls: [\"http://redpanda-1:9644\"]\n        connect:\n          enabled: false\n    ports:\n      - 8080:8080\n    depends_on:\n      - redpanda-1\n    volumes:\n      - shared-workspace:/opt/workspace\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/redpanda/producer.py",
    "content": "import csv\nfrom time import sleep\nfrom typing import Dict\nfrom kafka import KafkaProducer\n\nfrom settings import BOOTSTRAP_SERVERS, INPUT_DATA_PATH, PRODUCE_TOPIC_RIDES_CSV\n\n\ndef delivery_report(err, msg):\n    if err is not None:\n        print(\"Delivery failed for record {}: {}\".format(msg.key(), err))\n        return\n    print('Record {} successfully produced to {} [{}] at offset {}'.format(\n        msg.key(), msg.topic(), msg.partition(), msg.offset()))\n\n\nclass RideCSVProducer:\n    def __init__(self, props: Dict):\n        self.producer = KafkaProducer(**props)\n        # self.producer = Producer(producer_props)\n\n    @staticmethod\n    def read_records(resource_path: str):\n        records, ride_keys = [], []\n        i = 0\n        with open(resource_path, 'r') as f:\n            reader = csv.reader(f)\n            header = next(reader)  # skip the header\n            for row in reader:\n                # vendor_id, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, payment_type, total_amount\n                records.append(f'{row[0]}, {row[1]}, {row[2]}, {row[3]}, {row[4]}, {row[9]}, {row[16]}')\n                ride_keys.append(str(row[0]))\n                i += 1\n                if i == 5:\n                    break\n        return zip(ride_keys, records)\n\n    def publish(self, topic: str, records: [str, str]):\n        for key_value in records:\n            key, value = key_value\n            try:\n                self.producer.send(topic=topic, key=key, value=value)\n                print(f\"Producing record for <key: {key}, value:{value}>\")\n            except KeyboardInterrupt:\n                break\n            except Exception as e:\n                print(f\"Exception while producing record - {value}: {e}\")\n\n        self.producer.flush()\n        sleep(1)\n\n\nif __name__ == \"__main__\":\n    config = {\n        'bootstrap_servers': [BOOTSTRAP_SERVERS],\n        'key_serializer': lambda x: x.encode('utf-8'),\n        
'value_serializer': lambda x: x.encode('utf-8')\n    }\n    producer = RideCSVProducer(props=config)\n    ride_records = producer.read_records(resource_path=INPUT_DATA_PATH)\n    print(ride_records)\n    producer.publish(topic=PRODUCE_TOPIC_RIDES_CSV, records=ride_records)\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/redpanda/settings.py",
    "content": "import pyspark.sql.types as T\n\nINPUT_DATA_PATH = '../../resources/rides.csv'\nBOOTSTRAP_SERVERS = 'localhost:9092'\n\nTOPIC_WINDOWED_VENDOR_ID_COUNT = 'vendor_counts_windowed'\n\nPRODUCE_TOPIC_RIDES_CSV = CONSUME_TOPIC_RIDES_CSV = 'rides_csv'\n\nRIDE_SCHEMA = T.StructType(\n    [T.StructField(\"vendor_id\", T.IntegerType()),\n     T.StructField('tpep_pickup_datetime', T.TimestampType()),\n     T.StructField('tpep_dropoff_datetime', T.TimestampType()),\n     T.StructField(\"passenger_count\", T.IntegerType()),\n     T.StructField(\"trip_distance\", T.FloatType()),\n     T.StructField(\"payment_type\", T.IntegerType()),\n     T.StructField(\"total_amount\", T.FloatType()),\n     ])\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/redpanda/spark-submit.sh",
    "content": "# Submit Python code to SparkMaster\n\nif [ $# -lt 1 ]\nthen\n\techo \"Usage: $0 <pyspark-job.py> [ executor-memory ]\"\n\techo \"(specify memory in string format such as \\\"512M\\\" or \\\"2G\\\")\"\n\texit 1\nfi\nPYTHON_JOB=\"$1\"\n\n# Default to 1G of executor memory when no second argument is given\nif [ -z \"$2\" ]\nthen\n\tEXEC_MEM=\"1G\"\nelse\n\tEXEC_MEM=\"$2\"\nfi\nspark-submit --master spark://localhost:7077 --num-executors 2 \\\n             --executor-memory \"$EXEC_MEM\" --executor-cores 1 \\\n             --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.spark:spark-avro_2.12:3.5.1,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.5.1 \\\n             \"$PYTHON_JOB\"\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/redpanda/streaming-notebook.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"c4419168-c0e6-4a65-b56e-8454c42060ac\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"source\": [\n    \"### 0. Spark Setup\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"32bd7cdd-8504-4a54-a461-244bf7878d2a\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1 pyspark-shell'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"3aab2a7e-a685-4925-9c9a-b5adf201af77\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"your 131072x1 screen size is bogus. expect trouble\\n\",\n      \"24/03/11 00:28:48 WARN Utils: Your hostname, Cinders resolves to a loopback address: 127.0.1.1; using 172.17.156.62 instead (on interface eth0)\\n\",\n      \"24/03/11 00:28:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \":: loading settings :: url = jar:file:/home/ellabelle/spark/spark-3.5.1-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Ivy Default Cache set to: /home/ellabelle/.ivy2/cache\\n\",\n      \"The jars for the packages stored in: /home/ellabelle/.ivy2/jars\\n\",\n      \"org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency\\n\",\n      \"org.apache.spark#spark-avro_2.12 added as a dependency\\n\",\n      \":: resolving dependencies :: org.apache.spark#spark-submit-parent-0c8615d6-fa19-46ec-942b-46e9fe0012aa;1.0\\n\",\n      
\"\\tconfs: [default]\\n\",\n      \"\\tfound org.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 in central\\n\",\n      \"\\tfound org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 in central\\n\",\n      \"\\tfound org.apache.kafka#kafka-clients;2.8.1 in central\\n\",\n      \"\\tfound org.lz4#lz4-java;1.8.0 in central\\n\",\n      \"\\tfound org.xerial.snappy#snappy-java;1.1.8.4 in central\\n\",\n      \"\\tfound org.slf4j#slf4j-api;1.7.32 in central\\n\",\n      \"\\tfound org.apache.hadoop#hadoop-client-runtime;3.3.2 in central\\n\",\n      \"\\tfound org.spark-project.spark#unused;1.0.0 in central\\n\",\n      \"\\tfound org.apache.hadoop#hadoop-client-api;3.3.2 in central\\n\",\n      \"\\tfound commons-logging#commons-logging;1.1.3 in central\\n\",\n      \"\\tfound com.google.code.findbugs#jsr305;3.0.0 in central\\n\",\n      \"\\tfound org.apache.commons#commons-pool2;2.11.1 in central\\n\",\n      \"\\tfound org.apache.spark#spark-avro_2.12;3.3.1 in central\\n\",\n      \"\\tfound org.tukaani#xz;1.8 in central\\n\",\n      \":: resolution report :: resolve 328ms :: artifacts dl 13ms\\n\",\n      \"\\t:: modules in use:\\n\",\n      \"\\tcom.google.code.findbugs#jsr305;3.0.0 from central in [default]\\n\",\n      \"\\tcommons-logging#commons-logging;1.1.3 from central in [default]\\n\",\n      \"\\torg.apache.commons#commons-pool2;2.11.1 from central in [default]\\n\",\n      \"\\torg.apache.hadoop#hadoop-client-api;3.3.2 from central in [default]\\n\",\n      \"\\torg.apache.hadoop#hadoop-client-runtime;3.3.2 from central in [default]\\n\",\n      \"\\torg.apache.kafka#kafka-clients;2.8.1 from central in [default]\\n\",\n      \"\\torg.apache.spark#spark-avro_2.12;3.3.1 from central in [default]\\n\",\n      \"\\torg.apache.spark#spark-sql-kafka-0-10_2.12;3.3.1 from central in [default]\\n\",\n      \"\\torg.apache.spark#spark-token-provider-kafka-0-10_2.12;3.3.1 from central in [default]\\n\",\n      \"\\torg.lz4#lz4-java;1.8.0 from central in 
[default]\\n\",\n      \"\\torg.slf4j#slf4j-api;1.7.32 from central in [default]\\n\",\n      \"\\torg.spark-project.spark#unused;1.0.0 from central in [default]\\n\",\n      \"\\torg.tukaani#xz;1.8 from central in [default]\\n\",\n      \"\\torg.xerial.snappy#snappy-java;1.1.8.4 from central in [default]\\n\",\n      \"\\t---------------------------------------------------------------------\\n\",\n      \"\\t|                  |            modules            ||   artifacts   |\\n\",\n      \"\\t|       conf       | number| search|dwnlded|evicted|| number|dwnlded|\\n\",\n      \"\\t---------------------------------------------------------------------\\n\",\n      \"\\t|      default     |   14  |   0   |   0   |   0   ||   14  |   0   |\\n\",\n      \"\\t---------------------------------------------------------------------\\n\",\n      \":: retrieving :: org.apache.spark#spark-submit-parent-0c8615d6-fa19-46ec-942b-46e9fe0012aa\\n\",\n      \"\\tconfs: [default]\\n\",\n      \"\\t0 artifacts copied, 14 already retrieved (0kB/8ms)\\n\",\n      \"24/03/11 00:28:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\\n\",\n      \"Setting default log level to \\\"WARN\\\".\\n\",\n      \"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\\n\",\n      \"24/03/11 00:28:51 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from pyspark.sql import SparkSession\\n\",\n    \"import pyspark.sql.types as T\\n\",\n    \"import pyspark.sql.functions as F\\n\",\n    \"\\n\",\n    \"spark = SparkSession \\\\\\n\",\n    \"    .builder \\\\\\n\",\n    \"    .appName(\\\"Spark-Notebook\\\") \\\\\\n\",\n    \"    .getOrCreate()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"6f4b62fa-b3ce-4a1b-a1f4-2ed332a0d55a\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"source\": [\n    \"### 1. Reading from Kafka Stream\\n\",\n    \"\\n\",\n    \"through `readStream`\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"f491fa45-4471-4bc5-92f7-48081f687140\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### 1.1 Raw Kafka Stream\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"82c25cb2-2599-4f9b-8849-967fbb604a44\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# default for startingOffsets is \\\"latest\\\"\\n\",\n    \"df_kafka_raw = spark \\\\\\n\",\n    \"    .readStream \\\\\\n\",\n    \"    .format(\\\"kafka\\\") \\\\\\n\",\n    \"    .option(\\\"kafka.bootstrap.servers\\\", \\\"localhost:9092,broker:29092\\\") \\\\\\n\",\n    \"    .option(\\\"subscribe\\\", \\\"rides_csv\\\") \\\\\\n\",\n    \"    .option(\\\"startingOffsets\\\", \\\"earliest\\\") \\\\\\n\",\n    \"    .option(\\\"checkpointLocation\\\", \\\"checkpoint\\\") \\\\\\n\",\n    \"    .load()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"id\": \"d9149ccd-69b2-4f5b-afc0-43567673c634\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- key: binary (nullable = true)\\n\",\n      \" |-- value: binary (nullable = true)\\n\",\n      \" |-- topic: string (nullable = true)\\n\",\n    
  \" |-- partition: integer (nullable = true)\\n\",\n      \" |-- offset: long (nullable = true)\\n\",\n      \" |-- timestamp: timestamp (nullable = true)\\n\",\n      \" |-- timestampType: integer (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_kafka_raw.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"62e5e753-89c7-460f-a8be-16868ce5c680\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"source\": [\n    \"#### 1.2 Encoded Kafka Stream\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"id\": \"0b745eed-7d74-421e-8e4b-c8343fda4de3\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"df_kafka_encoded = df_kafka_raw.selectExpr(\\\"CAST(key AS STRING)\\\",\\\"CAST(value AS STRING)\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"id\": \"6839addc-c7c0-4117-8c9c-d2cd59cbf136\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- key: string (nullable = true)\\n\",\n      \" |-- value: string (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_kafka_encoded.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"6749c4de-6f80-4b91-b2b8-b2968c761d75\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### 1.3 Structure Streaming DataFrame\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"id\": \"ca20ae37-49f0-421f-9859-73fac8d4ca45\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def parse_ride_from_kafka_message(df_raw, schema):\\n\",\n    \"    \\\"\\\"\\\" take a Spark Streaming df and parse value col based on <schema>, return streaming df cols in schema \\\"\\\"\\\"\\n\",\n    \"    assert df_raw.isStreaming is True, 
\\\"DataFrame doesn't receive streaming data\\\"\\n\",\n    \"\\n\",\n    \"    df = df_raw.selectExpr(\\\"CAST(key AS STRING)\\\", \\\"CAST(value AS STRING)\\\")\\n\",\n    \"\\n\",\n    \"    # split attributes to nested array in one Column\\n\",\n    \"    col = F.split(df['value'], ', ')\\n\",\n    \"\\n\",\n    \"    # expand col to multiple top-level columns\\n\",\n    \"    for idx, field in enumerate(schema):\\n\",\n    \"        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))\\n\",\n    \"    return df.select([field.name for field in schema])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"id\": \"e1737bd0-146f-4ee2-a70f-a4657af5bbc6\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"ride_schema = T.StructType(\\n\",\n    \"    [T.StructField(\\\"vendor_id\\\", T.IntegerType()),\\n\",\n    \"     T.StructField('tpep_pickup_datetime', T.TimestampType()),\\n\",\n    \"     T.StructField('tpep_dropoff_datetime', T.TimestampType()),\\n\",\n    \"     T.StructField(\\\"passenger_count\\\", T.IntegerType()),\\n\",\n    \"     T.StructField(\\\"trip_distance\\\", T.FloatType()),\\n\",\n    \"     T.StructField(\\\"payment_type\\\", T.IntegerType()),\\n\",\n    \"     T.StructField(\\\"total_amount\\\", T.FloatType()),\\n\",\n    \"     ])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"id\": \"ae2ce896-f54b-4166-b01f-b5532ab292fe\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"df_rides = parse_ride_from_kafka_message(df_raw=df_kafka_raw, schema=ride_schema)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"id\": \"cd848228-97c5-4325-8457-97f35e533cd8\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"root\\n\",\n      \" |-- vendor_id: integer 
(nullable = true)\\n\",\n      \" |-- tpep_pickup_datetime: timestamp (nullable = true)\\n\",\n      \" |-- tpep_dropoff_datetime: timestamp (nullable = true)\\n\",\n      \" |-- passenger_count: integer (nullable = true)\\n\",\n      \" |-- trip_distance: float (nullable = true)\\n\",\n      \" |-- payment_type: integer (nullable = true)\\n\",\n      \" |-- total_amount: float (nullable = true)\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_rides.printSchema()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"f1cdb53e-f477-4137-8412-6915d7772125\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"df_rides.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"60277fdc-2797-4b23-9ecf-956b76db5778\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"source\": [\n    \"### 2 Sink Operation & Streaming Query\\n\",\n    \"\\n\",\n    \"through `writeStream`\\n\",\n    \"\\n\",\n    \"---\\n\",\n    \"**Output Sinks**\\n\",\n    \"- File Sink: stores the output to the directory\\n\",\n    \"- Kafka Sink: stores the output to one or more topics in Kafka\\n\",\n    \"- Foreach Sink:\\n\",\n    \"- (for debugging) Console Sink, Memory Sink\\n\",\n    \"\\n\",\n    \"Further details can be found in [Output Sinks](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)\\n\",\n    \"\\n\",\n    \"---\\n\",\n    \"There are three types of **Output Modes**:\\n\",\n    \"- Complete: The whole Result Table will be outputted to the sink after every trigger. 
This is supported for aggregation queries.\\n\",\n    \"- Append (default): Only new rows are added to the Result Table\\n\",\n    \"- Update: Only updated rows are outputted\\n\",\n    \"\\n\",\n    \"[Output Modes](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes) differs based on the set of transformations applied to the streaming data. \\n\",\n    \"\\n\",\n    \"--- \\n\",\n    \"**Triggers**\\n\",\n    \"\\n\",\n    \"The [trigger settings](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers) of a streaming query define the timing of streaming data processing. Spark streaming support micro-batch streamings schema and you can select following options based on requirements.\\n\",\n    \"\\n\",\n    \"- default-micro-batch-mode\\n\",\n    \"- fixed-interval-micro-batch-mode\\n\",\n    \"- one-time-micro-batch-mode\\n\",\n    \"- available-now-micro-batch-mode\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"02ca9b08-aa61-46cd-b946-4457ce2cdf5d\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"source\": [\n    \"#### Console and Memory Sink\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"id\": \"74c72469-4c37-417c-a866-a1c1ef75ae8b\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):\\n\",\n    \"    write_query = df.writeStream \\\\\\n\",\n    \"        .outputMode(output_mode) \\\\\\n\",\n    \"        .trigger(processingTime=processing_time) \\\\\\n\",\n    \"        .format(\\\"console\\\") \\\\\\n\",\n    \"        .option(\\\"truncate\\\", False) \\\\\\n\",\n    \"        .start()\\n\",\n    \"    return write_query # pyspark.sql.streaming.StreamingQuery\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"id\": \"d866c7ba-f8e9-475d-830a-50ffb2c5472b\",\n   
\"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"24/03/11 00:30:31 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-2b8e8845-1369-4653-8c23-c45a98e194a9. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\\n\",\n      \"24/03/11 00:30:31 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"24/03/11 00:30:32 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\\n\",\n      \"24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:30:32 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:30:33 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\\n\",\n      \"                                                                                \\r\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      
\"-------------------------------------------\\n\",\n      \"Batch: 0\\n\",\n      \"-------------------------------------------\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\\n\",\n      \"|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\\n\",\n      \"|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\\n\",\n      \"|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\\n\",\n      \"|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"write_query = sink_console(df_rides, output_mode='append')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"id\": \"a9bfa73f-a8cc-4988-a8cf-bf31ee6c449c\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def sink_memory(df, query_name, query_template):\\n\",\n    \"    write_query = df \\\\\\n\",\n    \"        .writeStream \\\\\\n\",\n    \"        .queryName(query_name) \\\\\\n\",\n    \"        .format('memory') \\\\\\n\",\n    \"        .start()\\n\",\n    \"    query_str = query_template.format(table_name=query_name)\\n\",\n    \"    query_results = spark.sql(query_str)\\n\",\n   
 \"    return write_query, query_results\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"id\": \"b31d0b76-e917-44e7-a14d-f9ce6901c23a\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"24/03/11 00:30:42 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-c7621425-b7fb-47fe-8b42-791c9c5d3186. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.\\n\",\n      \"24/03/11 00:30:42 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.\\n\",\n      \"24/03/11 00:30:43 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\\n\",\n      \"24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:30:43 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"24/03/11 00:30:43 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\\n\",\n      \"                                         
                                       \\r\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"query_name = 'vendor_id_counts'\\n\",\n    \"query_template = 'select count(distinct(vendor_id)) from {table_name}'\\n\",\n    \"write_query, df_vendor_id_counts = sink_memory(df=df_rides, query_name=query_name, query_template=query_template)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"id\": \"4ba56111-83bf-4028-ac65-565e0190f310\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"<class 'pyspark.sql.streaming.query.StreamingQuery'>\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"{'message': 'Waiting for data to arrive',\\n\",\n       \" 'isDataAvailable': False,\\n\",\n       \" 'isTriggerActive': False}\"\n      ]\n     },\n     \"execution_count\": 15,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"-------------------------------------------\\n\",\n      \"Batch: 1\\n\",\n      \"-------------------------------------------\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\\n\",\n      \"|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\\n\",\n      \"|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        
|\\n\",\n      \"|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\\n\",\n      \"|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"\\n\",\n      \"-------------------------------------------\\n\",\n      \"Batch: 2\\n\",\n      \"-------------------------------------------\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"|vendor_id|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"|1        |2020-07-01 00:25:32 |2020-07-01 00:33:39  |1              |1.5          |2           |9.3         |\\n\",\n      \"|1        |2020-07-01 00:03:19 |2020-07-01 00:25:43  |1              |9.5          |1           |27.8        |\\n\",\n      \"|2        |2020-07-01 00:15:11 |2020-07-01 00:29:24  |1              |5.85         |2           |22.3        |\\n\",\n      \"|2        |2020-07-01 00:30:49 |2020-07-01 00:38:26  |1              |1.9          |1           |14.16       |\\n\",\n      \"|2        |2020-07-01 00:31:26 |2020-07-01 00:38:02  |1              |1.25         |2           |7.8         |\\n\",\n      \"+---------+--------------------+---------------------+---------------+-------------+------------+------------+\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print(type(write_query)) # pyspark.sql.streaming.StreamingQuery\\n\",\n    \"write_query.status\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"id\": \"7cc37bda-9cfa-402b-9d42-a6ba5271476b\",\n   \"metadata\": {\n    
\"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"+-------------------------+\\n\",\n      \"|count(DISTINCT vendor_id)|\\n\",\n      \"+-------------------------+\\n\",\n      \"|                        2|\\n\",\n      \"+-------------------------+\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df_vendor_id_counts.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 19,\n   \"id\": \"88862ca9-4d89-487e-987f-08a2b9e83efe\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"write_query.stop()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"443d4041-06db-4a4a-89c1-348848cc7ca8\",\n   \"metadata\": {\n    \"tags\": []\n   },\n   \"source\": [\n    \"#### Kafka Sink\\n\",\n    \"\\n\",\n    \"To write stream results to `kafka-topic`, the stream dataframe has at least a column with name `value`.\\n\",\n    \"\\n\",\n    \"Therefore before starting `writeStream` in kafka format, dataframe needs to be updated accordingly.\\n\",\n    \"\\n\",\n    \"More information regarding kafka sink expected data structure [here](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"id\": \"8b08a013-d039-41cf-94fd-a1a57571d25f\",\n   \"metadata\": {\n    \"scrolled\": true,\n    \"tags\": []\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:36 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:37 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:39 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:40 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:41 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:42 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:43 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:44 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:34:45 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:46 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:47 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:48 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:49 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:50 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:51 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:52 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:53 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:54 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:34:55 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:56 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:57 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:34:58 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:00 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:01 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:02 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:03 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:04 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:05 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:35:06 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:07 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:08 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:09 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:10 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:11 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:12 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:13 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:14 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:16 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:35:17 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:17 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:19 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:20 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:21 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:22 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:23 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:24 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:25 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:26 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:35:27 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:28 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:29 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:30 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:31 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:32 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:33 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:34 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:35:35 WARN KafkaOffsetReaderAdmin: Error in attempt 1 getting Kafka offsets: \\n\",\n      \"java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088535000, tries=1, nextAllowedTryMs=1710088535101) timed out at 1710088535001 after 1 attempt(s)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\\n\",\n      \"\\tat 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\\n\",\n      \"\\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\\n\",\n      \"\\tat scala.collection.Iterator.foreach(Iterator.scala:943)\\n\",\n      \"\\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\\n\",\n      \"\\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\\n\",\n      \"\\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\\n\",\n      \"\\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\\n\",\n      \"\\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\\n\",\n      \"\\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\\n\",\n      \"\\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\\n\",\n      \"\\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\\n\",\n      \"\\tat 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\\n\",\n      \"Caused by: org.apache.kafka.common.errors.TimeoutException: 
Call(callName=describeTopics, deadlineMs=1710088535000, tries=1, nextAllowedTryMs=1710088535101) timed out at 1710088535001 after 1 attempt(s)\\n\",\n      \"Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\\n\",\n      \"24/03/11 00:35:35 WARN NetworkClient: [AdminClient clientId=adminclient-1] Connection to node 1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:36 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\\n\",\n      \"24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:35:36 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:35:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:37 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:37 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:39 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:40 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:41 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:42 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:43 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:44 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:45 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:35:46 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:47 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:48 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:49 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:50 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:51 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:52 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:53 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:55 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:55 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:35:57 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:58 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:35:59 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:00 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:01 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:02 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:03 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:04 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:06 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:07 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:36:08 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:09 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:10 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:11 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:12 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:13 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:14 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:15 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:16 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:17 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:36:18 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:19 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:20 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:22 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:23 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:24 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:25 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:26 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:27 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:28 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:36:29 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:30 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:31 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:32 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:33 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:35 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:35 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:36:36 WARN KafkaOffsetReaderAdmin: Error in attempt 2 getting Kafka offsets: \\n\",\n      \"java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088596058, tries=1, nextAllowedTryMs=1710088596159) timed out at 1710088596059 after 1 attempt(s)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\\n\",\n      \"\\tat 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\\n\",\n      \"\\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\\n\",\n      \"\\tat scala.collection.Iterator.foreach(Iterator.scala:943)\\n\",\n      \"\\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\\n\",\n      \"\\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\\n\",\n      \"\\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\\n\",\n      \"\\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\\n\",\n      \"\\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\\n\",\n      \"\\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\\n\",\n      \"\\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\\n\",\n      \"\\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\\n\",\n      \"\\tat 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\\n\",\n      \"Caused by: org.apache.kafka.common.errors.TimeoutException: 
Call(callName=describeTopics, deadlineMs=1710088596058, tries=1, nextAllowedTryMs=1710088596159) timed out at 1710088596059 after 1 attempt(s)\\n\",\n      \"Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\\n\",\n      \"24/03/11 00:36:36 WARN NetworkClient: [AdminClient clientId=adminclient-3] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:37 WARN ClientUtils: Couldn't resolve server broker:29092 from bootstrap.servers as DNS resolution failed for broker\\n\",\n      \"24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:36:37 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.\\n\",\n      \"24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:36:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:38 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:38 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:40 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:41 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:42 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:43 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:44 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:45 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:46 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:36:47 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:47 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:48 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:49 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:50 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:52 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:52 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:54 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:55 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:56 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:36:57 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:58 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:36:59 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:00 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:01 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:02 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:03 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:05 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:05 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:06 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:37:08 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:09 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:10 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:11 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:12 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:13 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:14 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:15 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:16 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:17 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:37:18 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:19 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:20 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:21 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:22 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:23 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:24 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:25 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:26 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:27 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:37:28 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:29 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:31 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:32 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:33 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:34 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:35 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:36 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. 
Broker may not be available.\\n\",\n      \"24/03/11 00:37:37 WARN KafkaOffsetReaderAdmin: Error in attempt 3 getting Kafka offsets: \\n\",\n      \"java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\\n\",\n      \"\\tat 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\\n\",\n      \"\\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\\n\",\n      \"\\tat scala.collection.Iterator.foreach(Iterator.scala:943)\\n\",\n      \"\\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\\n\",\n      \"\\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\\n\",\n      \"\\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\\n\",\n      \"\\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\\n\",\n      \"\\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\\n\",\n      \"\\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\\n\",\n      \"\\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\\n\",\n      \"\\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\\n\",\n      \"\\tat 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\\n\",\n      \"Caused by: org.apache.kafka.common.errors.TimeoutException: 
Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\\n\",\n      \"Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\\n\",\n      \"24/03/11 00:37:37 WARN NetworkClient: [AdminClient clientId=adminclient-4] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.\\n\",\n      \"24/03/11 00:37:38 ERROR MicroBatchExecution: Query [id = 4dfba771-eff7-49e7-a3ff-f1aa03a6e840, runId = 0f86ad02-1d50-487a-97c7-72790d8857d8] terminated with error\\n\",\n      \"java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)\\n\",\n      \"\\tat org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions(ConsumerStrategy.scala:66)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.ConsumerStrategy.retrieveAllPartitions$(ConsumerStrategy.scala:65)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.SubscribeStrategy.retrieveAllPartitions(ConsumerStrategy.scala:102)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.SubscribeStrategy.assignedTopicPartitions(ConsumerStrategy.scala:113)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.$anonfun$partitionsAssignedToAdmin$1(KafkaOffsetReaderAdmin.scala:499)\\n\",\n      \"\\tat 
org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.withRetries(KafkaOffsetReaderAdmin.scala:518)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.partitionsAssignedToAdmin(KafkaOffsetReaderAdmin.scala:498)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaOffsetReaderAdmin.fetchLatestOffsets(KafkaOffsetReaderAdmin.scala:297)\\n\",\n      \"\\tat org.apache.spark.sql.kafka010.KafkaMicroBatchStream.latestOffset(KafkaMicroBatchStream.scala:132)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$4(MicroBatchExecution.scala:491)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:490)\\n\",\n      \"\\tat scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)\\n\",\n      \"\\tat scala.collection.Iterator.foreach(Iterator.scala:943)\\n\",\n      \"\\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\\n\",\n      \"\\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\\n\",\n      \"\\tat scala.collection.IterableLike.foreach(IterableLike.scala:74)\\n\",\n      \"\\tat scala.collection.IterableLike.foreach$(IterableLike.scala:73)\\n\",\n      \"\\tat scala.collection.AbstractIterable.foreach(Iterable.scala:56)\\n\",\n      \"\\tat scala.collection.TraversableLike.map(TraversableLike.scala:286)\\n\",\n      \"\\tat scala.collection.TraversableLike.map$(TraversableLike.scala:279)\\n\",\n      \"\\tat scala.collection.AbstractTraversable.map(Traversable.scala:108)\\n\",\n      \"\\tat 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:479)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:810)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:475)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:268)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289)\\n\",\n      \"\\tat 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211)\\n\",\n      \"\\tat scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)\\n\",\n      \"\\tat org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)\\n\",\n      \"\\tat org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)\\n\",\n      \"Caused by: org.apache.kafka.common.errors.TimeoutException: Call(callName=describeTopics, deadlineMs=1710088657106, tries=1, nextAllowedTryMs=1710088657207) timed out at 1710088657107 after 1 attempt(s)\\n\",\n      \"Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"def prepare_dataframe_to_kafka_sink(df, value_columns, key_column=None):\\n\",\n    \"    columns = df.columns\\n\",\n    \"    df = df.withColumn(\\\"value\\\", F.concat_ws(', ',*value_columns))    \\n\",\n    \"    if key_column:\\n\",\n    \"        df = df.withColumnRenamed(key_column,\\\"key\\\")\\n\",\n    \"        df = df.withColumn(\\\"key\\\",df.key.cast('string'))\\n\",\n    \"    return df.select(['key', 'value'])\\n\",\n    \"    \\n\",\n    \"def sink_kafka(df, topic, output_mode='append'):\\n\",\n    \"    write_query = df.writeStream \\\\\\n\",\n    \"        .format(\\\"kafka\\\") \\\\\\n\",\n    \"        .option(\\\"kafka.bootstrap.servers\\\", \\\"localhost:9092,broker:29092\\\") \\\\\\n\",\n    \"        .outputMode(output_mode) \\\\\\n\",\n    \"        .option(\\\"topic\\\", topic) \\\\\\n\",\n    \"        .option(\\\"checkpointLocation\\\", \\\"checkpoint\\\") \\\\\\n\",\n    \"        .start()\\n\",\n    \"    return write_query\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"e4cb2140-9f2e-4914-b74c-be4c18cdbe8a\",\n   \"metadata\": {},\n   \"source\": []\n  },\n  {\n   \"cell_type\": \"code\",\n   
\"execution_count\": null,\n   \"id\": \"63abe115-879c-4863-97d3-b22cda7f7469\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.11.8\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "07-streaming/extras/python/streams-example/redpanda/streaming.py",
    "content": "from pyspark.sql import SparkSession\nimport pyspark.sql.functions as F\n\nfrom settings import RIDE_SCHEMA, CONSUME_TOPIC_RIDES_CSV, TOPIC_WINDOWED_VENDOR_ID_COUNT\n\n\ndef read_from_kafka(consume_topic: str):\n    # Spark Streaming DataFrame, connect to Kafka topic served at host in bootrap.servers option\n    df_stream = spark \\\n        .readStream \\\n        .format(\"kafka\") \\\n        .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n        .option(\"subscribe\", consume_topic) \\\n        .option(\"startingOffsets\", \"earliest\") \\\n        .option(\"checkpointLocation\", \"checkpoint\") \\\n        .load()\n    return df_stream\n\n\ndef parse_ride_from_kafka_message(df, schema):\n    \"\"\" take a Spark Streaming df and parse value col based on <schema>, return streaming df cols in schema \"\"\"\n    assert df.isStreaming is True, \"DataFrame doesn't receive streaming data\"\n\n    df = df.selectExpr(\"CAST(key AS STRING)\", \"CAST(value AS STRING)\")\n\n    # split attributes to nested array in one Column\n    col = F.split(df['value'], ', ')\n\n    # expand col to multiple top-level columns\n    for idx, field in enumerate(schema):\n        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))\n    return df.select([field.name for field in schema])\n\n\ndef sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):\n    write_query = df.writeStream \\\n        .outputMode(output_mode) \\\n        .trigger(processingTime=processing_time) \\\n        .format(\"console\") \\\n        .option(\"truncate\", False) \\\n        .start()\n    return write_query  # pyspark.sql.streaming.StreamingQuery\n\n\ndef sink_memory(df, query_name, query_template):\n    query_df = df \\\n        .writeStream \\\n        .queryName(query_name) \\\n        .format(\"memory\") \\\n        .start()\n    query_str = query_template.format(table_name=query_name)\n    query_results = 
spark.sql(query_str)\n    return query_results, query_df\n\n\ndef sink_kafka(df, topic):\n    write_query = df.writeStream \\\n        .format(\"kafka\") \\\n        .option(\"kafka.bootstrap.servers\", \"localhost:9092,broker:29092\") \\\n        .outputMode('complete') \\\n        .option(\"topic\", topic) \\\n        .option(\"checkpointLocation\", \"checkpoint\") \\\n        .start()\n    return write_query\n\n\ndef prepare_df_to_kafka_sink(df, value_columns, key_column=None):\n    # the Kafka sink expects a string 'value' column and an optional 'key' column\n    df = df.withColumn(\"value\", F.concat_ws(', ', *value_columns))\n    if key_column:\n        df = df.withColumnRenamed(key_column, \"key\")\n        df = df.withColumn(\"key\", df.key.cast('string'))\n        return df.select(['key', 'value'])\n    return df.select(['value'])\n\n\ndef op_groupby(df, column_names):\n    df_aggregation = df.groupBy(column_names).count()\n    return df_aggregation\n\n\ndef op_windowed_groupby(df, window_duration, slide_duration):\n    df_windowed_aggregation = df.groupBy(\n        F.window(timeColumn=df.tpep_pickup_datetime, windowDuration=window_duration, slideDuration=slide_duration),\n        df.vendor_id\n    ).count()\n    return df_windowed_aggregation\n\n\nif __name__ == \"__main__\":\n    spark = SparkSession.builder.appName('streaming-examples').getOrCreate()\n    spark.sparkContext.setLogLevel('WARN')\n\n    # read streaming data\n    df_consume_stream = read_from_kafka(consume_topic=CONSUME_TOPIC_RIDES_CSV)\n    # printSchema() prints the schema itself and returns None, so it is not wrapped in print()\n    df_consume_stream.printSchema()\n\n    # parse streaming data\n    df_rides = parse_ride_from_kafka_message(\n        df_consume_stream,\n        RIDE_SCHEMA\n    )\n    df_rides.printSchema()\n\n    sink_console(df_rides, output_mode='append')\n\n    df_trip_count_by_vendor_id = op_groupby(df_rides, ['vendor_id'])\n    df_trip_count_by_pickup_date_vendor_id = op_windowed_groupby(\n        df_rides,\n        window_duration=\"10 minutes\",\n        slide_duration='5 minutes'\n    )\n\n    # write the output out to the console 
for debugging / testing\n    sink_console(df_trip_count_by_vendor_id)\n    # write the output to the kafka topic\n    df_trip_count_messages = prepare_df_to_kafka_sink(\n        df=df_trip_count_by_pickup_date_vendor_id, \n        value_columns=['count'], \n        key_column='vendor_id'\n    )\n    kafka_sink_query = sink_kafka(\n        df=df_trip_count_messages, \n        topic=TOPIC_WINDOWED_VENDOR_ID_COUNT\n    )\n\n    spark.streams.awaitAnyTermination()\n"
  },
  {
    "path": "07-streaming/theory/README.md",
    "content": "# Kafka theory (optional)\n\nVideo lectures covering Kafka concepts, with code examples in Java.\n\nCode: [java/kafka_examples](java/kafka_examples)\n\n\n## Stream processing\n\n- [7.0.1 Introduction](https://youtu.be/hfvju3iOIP0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=67)\n- [7.0.2 What is stream processing](https://youtu.be/WxTxKGcfA-k&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=68)\n- [7.3 What is Kafka?](https://youtu.be/zPLZUDPi4AY&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=69)\n- [7.4 Confluent Cloud](https://youtu.be/ZnEZFEYKppw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=70)\n- [7.5 Kafka producer consumer](https://youtu.be/aegTuyxX7Yg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=71)\n- [7.6 Kafka configuration](https://youtu.be/SXQtWyRpMKs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=72)\n\nLinks:\n\n- [Slides](https://docs.google.com/presentation/d/1bCtdCba8v1HxJ_uMm9pwjRUC-NAMeB-6nOG2ng3KujA/edit?usp=sharing)\n- [Kafka Configuration Reference](https://docs.confluent.io/platform/current/installation/configuration/)\n- [Confluent Cloud trial](https://www.confluent.io/confluent-cloud/tryfree/)\n\n\n## Kafka Streams\n\n- [7.7 Kafka stream basics](https://youtu.be/dUyA_63eRb0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=73)\n- [7.8 Kafka stream join](https://youtu.be/NcpKlujh34Y&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=74)\n- [7.9 Kafka stream testing](https://youtu.be/TNx5rmLY8Pk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=75)\n- [7.10 Kafka stream windowing](https://youtu.be/r1OuLdwxbRc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=76)\n- [7.11 Kafka ksqlDB and Connect](https://youtu.be/DziQ4a4tn9Y&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=77)\n- [7.12 Kafka Schema registry](https://youtu.be/tBY_hBuyzwI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=78)\n\nLinks:\n\n- [Slides](https://docs.google.com/presentation/d/1fVi9sFa7fL2ZW3ynS5MAZm0bRSZ4jO10fymPmrfTUjE/edit?usp=sharing)\n- [Streams 
Concepts](https://docs.confluent.io/platform/current/streams/concepts.html)\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/.gitignore",
    "content": ".gradle\nbin\n!src/main/resources/rides.csv\n\nbuild/classes\nbuild/generated\nbuild/libs\nbuild/reports\nbuild/resources\nbuild/test-results\nbuild/tmp\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/build/generated-main-avro-java/schemaregistry/RideRecord.java",
    "content": "/**\n * Autogenerated by Avro\n *\n * DO NOT EDIT DIRECTLY\n */\npackage schemaregistry;\n\nimport org.apache.avro.generic.GenericArray;\nimport org.apache.avro.specific.SpecificData;\nimport org.apache.avro.util.Utf8;\nimport org.apache.avro.message.BinaryMessageEncoder;\nimport org.apache.avro.message.BinaryMessageDecoder;\nimport org.apache.avro.message.SchemaStore;\n\n@org.apache.avro.specific.AvroGenerated\npublic class RideRecord extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {\n  private static final long serialVersionUID = 6805437803204402942L;\n\n\n  public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse(\"{\\\"type\\\":\\\"record\\\",\\\"name\\\":\\\"RideRecord\\\",\\\"namespace\\\":\\\"schemaregistry\\\",\\\"fields\\\":[{\\\"name\\\":\\\"vendor_id\\\",\\\"type\\\":{\\\"type\\\":\\\"string\\\",\\\"avro.java.string\\\":\\\"String\\\"}},{\\\"name\\\":\\\"passenger_count\\\",\\\"type\\\":\\\"int\\\"},{\\\"name\\\":\\\"trip_distance\\\",\\\"type\\\":\\\"double\\\"}]}\");\n  public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }\n\n  private static final SpecificData MODEL$ = new SpecificData();\n\n  private static final BinaryMessageEncoder<RideRecord> ENCODER =\n      new BinaryMessageEncoder<>(MODEL$, SCHEMA$);\n\n  private static final BinaryMessageDecoder<RideRecord> DECODER =\n      new BinaryMessageDecoder<>(MODEL$, SCHEMA$);\n\n  /**\n   * Return the BinaryMessageEncoder instance used by this class.\n   * @return the message encoder used by this class\n   */\n  public static BinaryMessageEncoder<RideRecord> getEncoder() {\n    return ENCODER;\n  }\n\n  /**\n   * Return the BinaryMessageDecoder instance used by this class.\n   * @return the message decoder used by this class\n   */\n  public static BinaryMessageDecoder<RideRecord> getDecoder() {\n    return DECODER;\n  }\n\n  /**\n   * Create a new BinaryMessageDecoder 
instance for this class that uses the specified {@link SchemaStore}.\n   * @param resolver a {@link SchemaStore} used to find schemas by fingerprint\n   * @return a BinaryMessageDecoder instance for this class backed by the given SchemaStore\n   */\n  public static BinaryMessageDecoder<RideRecord> createDecoder(SchemaStore resolver) {\n    return new BinaryMessageDecoder<>(MODEL$, SCHEMA$, resolver);\n  }\n\n  /**\n   * Serializes this RideRecord to a ByteBuffer.\n   * @return a buffer holding the serialized data for this instance\n   * @throws java.io.IOException if this instance could not be serialized\n   */\n  public java.nio.ByteBuffer toByteBuffer() throws java.io.IOException {\n    return ENCODER.encode(this);\n  }\n\n  /**\n   * Deserializes a RideRecord from a ByteBuffer.\n   * @param b a byte buffer holding serialized data for an instance of this class\n   * @return a RideRecord instance decoded from the given buffer\n   * @throws java.io.IOException if the given bytes could not be deserialized into an instance of this class\n   */\n  public static RideRecord fromByteBuffer(\n      java.nio.ByteBuffer b) throws java.io.IOException {\n    return DECODER.decode(b);\n  }\n\n  private java.lang.String vendor_id;\n  private int passenger_count;\n  private double trip_distance;\n\n  /**\n   * Default constructor.  Note that this does not initialize fields\n   * to their default values from the schema.  
If that is desired then\n   * one should use <code>newBuilder()</code>.\n   */\n  public RideRecord() {}\n\n  /**\n   * All-args constructor.\n   * @param vendor_id The new value for vendor_id\n   * @param passenger_count The new value for passenger_count\n   * @param trip_distance The new value for trip_distance\n   */\n  public RideRecord(java.lang.String vendor_id, java.lang.Integer passenger_count, java.lang.Double trip_distance) {\n    this.vendor_id = vendor_id;\n    this.passenger_count = passenger_count;\n    this.trip_distance = trip_distance;\n  }\n\n  @Override\n  public org.apache.avro.specific.SpecificData getSpecificData() { return MODEL$; }\n\n  @Override\n  public org.apache.avro.Schema getSchema() { return SCHEMA$; }\n\n  // Used by DatumWriter.  Applications should not call.\n  @Override\n  public java.lang.Object get(int field$) {\n    switch (field$) {\n    case 0: return vendor_id;\n    case 1: return passenger_count;\n    case 2: return trip_distance;\n    default: throw new IndexOutOfBoundsException(\"Invalid index: \" + field$);\n    }\n  }\n\n  // Used by DatumReader.  Applications should not call.\n  @Override\n  @SuppressWarnings(value=\"unchecked\")\n  public void put(int field$, java.lang.Object value$) {\n    switch (field$) {\n    case 0: vendor_id = value$ != null ? 
value$.toString() : null; break;\n    case 1: passenger_count = (java.lang.Integer)value$; break;\n    case 2: trip_distance = (java.lang.Double)value$; break;\n    default: throw new IndexOutOfBoundsException(\"Invalid index: \" + field$);\n    }\n  }\n\n  /**\n   * Gets the value of the 'vendor_id' field.\n   * @return The value of the 'vendor_id' field.\n   */\n  public java.lang.String getVendorId() {\n    return vendor_id;\n  }\n\n\n  /**\n   * Sets the value of the 'vendor_id' field.\n   * @param value the value to set.\n   */\n  public void setVendorId(java.lang.String value) {\n    this.vendor_id = value;\n  }\n\n  /**\n   * Gets the value of the 'passenger_count' field.\n   * @return The value of the 'passenger_count' field.\n   */\n  public int getPassengerCount() {\n    return passenger_count;\n  }\n\n\n  /**\n   * Sets the value of the 'passenger_count' field.\n   * @param value the value to set.\n   */\n  public void setPassengerCount(int value) {\n    this.passenger_count = value;\n  }\n\n  /**\n   * Gets the value of the 'trip_distance' field.\n   * @return The value of the 'trip_distance' field.\n   */\n  public double getTripDistance() {\n    return trip_distance;\n  }\n\n\n  /**\n   * Sets the value of the 'trip_distance' field.\n   * @param value the value to set.\n   */\n  public void setTripDistance(double value) {\n    this.trip_distance = value;\n  }\n\n  /**\n   * Creates a new RideRecord RecordBuilder.\n   * @return A new RideRecord RecordBuilder\n   */\n  public static schemaregistry.RideRecord.Builder newBuilder() {\n    return new schemaregistry.RideRecord.Builder();\n  }\n\n  /**\n   * Creates a new RideRecord RecordBuilder by copying an existing Builder.\n   * @param other The existing builder to copy.\n   * @return A new RideRecord RecordBuilder\n   */\n  public static schemaregistry.RideRecord.Builder newBuilder(schemaregistry.RideRecord.Builder other) {\n    if (other == null) {\n      return new 
schemaregistry.RideRecord.Builder();\n    } else {\n      return new schemaregistry.RideRecord.Builder(other);\n    }\n  }\n\n  /**\n   * Creates a new RideRecord RecordBuilder by copying an existing RideRecord instance.\n   * @param other The existing instance to copy.\n   * @return A new RideRecord RecordBuilder\n   */\n  public static schemaregistry.RideRecord.Builder newBuilder(schemaregistry.RideRecord other) {\n    if (other == null) {\n      return new schemaregistry.RideRecord.Builder();\n    } else {\n      return new schemaregistry.RideRecord.Builder(other);\n    }\n  }\n\n  /**\n   * RecordBuilder for RideRecord instances.\n   */\n  @org.apache.avro.specific.AvroGenerated\n  public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<RideRecord>\n    implements org.apache.avro.data.RecordBuilder<RideRecord> {\n\n    private java.lang.String vendor_id;\n    private int passenger_count;\n    private double trip_distance;\n\n    /** Creates a new Builder */\n    private Builder() {\n      super(SCHEMA$, MODEL$);\n    }\n\n    /**\n     * Creates a Builder by copying an existing Builder.\n     * @param other The existing Builder to copy.\n     */\n    private Builder(schemaregistry.RideRecord.Builder other) {\n      super(other);\n      if (isValidValue(fields()[0], other.vendor_id)) {\n        this.vendor_id = data().deepCopy(fields()[0].schema(), other.vendor_id);\n        fieldSetFlags()[0] = other.fieldSetFlags()[0];\n      }\n      if (isValidValue(fields()[1], other.passenger_count)) {\n        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);\n        fieldSetFlags()[1] = other.fieldSetFlags()[1];\n      }\n      if (isValidValue(fields()[2], other.trip_distance)) {\n        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);\n        fieldSetFlags()[2] = other.fieldSetFlags()[2];\n      }\n    }\n\n    /**\n     * Creates a Builder by copying an existing 
RideRecord instance\n     * @param other The existing instance to copy.\n     */\n    private Builder(schemaregistry.RideRecord other) {\n      super(SCHEMA$, MODEL$);\n      if (isValidValue(fields()[0], other.vendor_id)) {\n        this.vendor_id = data().deepCopy(fields()[0].schema(), other.vendor_id);\n        fieldSetFlags()[0] = true;\n      }\n      if (isValidValue(fields()[1], other.passenger_count)) {\n        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);\n        fieldSetFlags()[1] = true;\n      }\n      if (isValidValue(fields()[2], other.trip_distance)) {\n        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);\n        fieldSetFlags()[2] = true;\n      }\n    }\n\n    /**\n      * Gets the value of the 'vendor_id' field.\n      * @return The value.\n      */\n    public java.lang.String getVendorId() {\n      return vendor_id;\n    }\n\n\n    /**\n      * Sets the value of the 'vendor_id' field.\n      * @param value The value of 'vendor_id'.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecord.Builder setVendorId(java.lang.String value) {\n      validate(fields()[0], value);\n      this.vendor_id = value;\n      fieldSetFlags()[0] = true;\n      return this;\n    }\n\n    /**\n      * Checks whether the 'vendor_id' field has been set.\n      * @return True if the 'vendor_id' field has been set, false otherwise.\n      */\n    public boolean hasVendorId() {\n      return fieldSetFlags()[0];\n    }\n\n\n    /**\n      * Clears the value of the 'vendor_id' field.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecord.Builder clearVendorId() {\n      vendor_id = null;\n      fieldSetFlags()[0] = false;\n      return this;\n    }\n\n    /**\n      * Gets the value of the 'passenger_count' field.\n      * @return The value.\n      */\n    public int getPassengerCount() {\n      return passenger_count;\n    }\n\n\n    /**\n      * Sets 
the value of the 'passenger_count' field.\n      * @param value The value of 'passenger_count'.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecord.Builder setPassengerCount(int value) {\n      validate(fields()[1], value);\n      this.passenger_count = value;\n      fieldSetFlags()[1] = true;\n      return this;\n    }\n\n    /**\n      * Checks whether the 'passenger_count' field has been set.\n      * @return True if the 'passenger_count' field has been set, false otherwise.\n      */\n    public boolean hasPassengerCount() {\n      return fieldSetFlags()[1];\n    }\n\n\n    /**\n      * Clears the value of the 'passenger_count' field.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecord.Builder clearPassengerCount() {\n      fieldSetFlags()[1] = false;\n      return this;\n    }\n\n    /**\n      * Gets the value of the 'trip_distance' field.\n      * @return The value.\n      */\n    public double getTripDistance() {\n      return trip_distance;\n    }\n\n\n    /**\n      * Sets the value of the 'trip_distance' field.\n      * @param value The value of 'trip_distance'.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecord.Builder setTripDistance(double value) {\n      validate(fields()[2], value);\n      this.trip_distance = value;\n      fieldSetFlags()[2] = true;\n      return this;\n    }\n\n    /**\n      * Checks whether the 'trip_distance' field has been set.\n      * @return True if the 'trip_distance' field has been set, false otherwise.\n      */\n    public boolean hasTripDistance() {\n      return fieldSetFlags()[2];\n    }\n\n\n    /**\n      * Clears the value of the 'trip_distance' field.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecord.Builder clearTripDistance() {\n      fieldSetFlags()[2] = false;\n      return this;\n    }\n\n    @Override\n    @SuppressWarnings(\"unchecked\")\n    public RideRecord build() {\n      try {\n        
RideRecord record = new RideRecord();\n        record.vendor_id = fieldSetFlags()[0] ? this.vendor_id : (java.lang.String) defaultValue(fields()[0]);\n        record.passenger_count = fieldSetFlags()[1] ? this.passenger_count : (java.lang.Integer) defaultValue(fields()[1]);\n        record.trip_distance = fieldSetFlags()[2] ? this.trip_distance : (java.lang.Double) defaultValue(fields()[2]);\n        return record;\n      } catch (org.apache.avro.AvroMissingFieldException e) {\n        throw e;\n      } catch (java.lang.Exception e) {\n        throw new org.apache.avro.AvroRuntimeException(e);\n      }\n    }\n  }\n\n  @SuppressWarnings(\"unchecked\")\n  private static final org.apache.avro.io.DatumWriter<RideRecord>\n    WRITER$ = (org.apache.avro.io.DatumWriter<RideRecord>)MODEL$.createDatumWriter(SCHEMA$);\n\n  @Override public void writeExternal(java.io.ObjectOutput out)\n    throws java.io.IOException {\n    WRITER$.write(this, SpecificData.getEncoder(out));\n  }\n\n  @SuppressWarnings(\"unchecked\")\n  private static final org.apache.avro.io.DatumReader<RideRecord>\n    READER$ = (org.apache.avro.io.DatumReader<RideRecord>)MODEL$.createDatumReader(SCHEMA$);\n\n  @Override public void readExternal(java.io.ObjectInput in)\n    throws java.io.IOException {\n    READER$.read(this, SpecificData.getDecoder(in));\n  }\n\n  @Override protected boolean hasCustomCoders() { return true; }\n\n  @Override public void customEncode(org.apache.avro.io.Encoder out)\n    throws java.io.IOException\n  {\n    out.writeString(this.vendor_id);\n\n    out.writeInt(this.passenger_count);\n\n    out.writeDouble(this.trip_distance);\n\n  }\n\n  @Override public void customDecode(org.apache.avro.io.ResolvingDecoder in)\n    throws java.io.IOException\n  {\n    org.apache.avro.Schema.Field[] fieldOrder = in.readFieldOrderIfDiff();\n    if (fieldOrder == null) {\n      this.vendor_id = in.readString();\n\n      this.passenger_count = in.readInt();\n\n      this.trip_distance = 
in.readDouble();\n\n    } else {\n      for (int i = 0; i < 3; i++) {\n        switch (fieldOrder[i].pos()) {\n        case 0:\n          this.vendor_id = in.readString();\n          break;\n\n        case 1:\n          this.passenger_count = in.readInt();\n          break;\n\n        case 2:\n          this.trip_distance = in.readDouble();\n          break;\n\n        default:\n          throw new java.io.IOException(\"Corrupt ResolvingDecoder.\");\n        }\n      }\n    }\n  }\n}\n\n\n\n\n\n\n\n\n\n\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/build/generated-main-avro-java/schemaregistry/RideRecordCompatible.java",
    "content": "/**\n * Autogenerated by Avro\n *\n * DO NOT EDIT DIRECTLY\n */\npackage schemaregistry;\n\nimport org.apache.avro.generic.GenericArray;\nimport org.apache.avro.specific.SpecificData;\nimport org.apache.avro.util.Utf8;\nimport org.apache.avro.message.BinaryMessageEncoder;\nimport org.apache.avro.message.BinaryMessageDecoder;\nimport org.apache.avro.message.SchemaStore;\n\n@org.apache.avro.specific.AvroGenerated\npublic class RideRecordCompatible extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {\n  private static final long serialVersionUID = 7163300507090021229L;\n\n\n  public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse(\"{\\\"type\\\":\\\"record\\\",\\\"name\\\":\\\"RideRecordCompatible\\\",\\\"namespace\\\":\\\"schemaregistry\\\",\\\"fields\\\":[{\\\"name\\\":\\\"vendorId\\\",\\\"type\\\":{\\\"type\\\":\\\"string\\\",\\\"avro.java.string\\\":\\\"String\\\"}},{\\\"name\\\":\\\"passenger_count\\\",\\\"type\\\":\\\"int\\\"},{\\\"name\\\":\\\"trip_distance\\\",\\\"type\\\":\\\"double\\\"},{\\\"name\\\":\\\"pu_location_id\\\",\\\"type\\\":[\\\"null\\\",\\\"long\\\"],\\\"default\\\":null}]}\");\n  public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }\n\n  private static final SpecificData MODEL$ = new SpecificData();\n\n  private static final BinaryMessageEncoder<RideRecordCompatible> ENCODER =\n      new BinaryMessageEncoder<>(MODEL$, SCHEMA$);\n\n  private static final BinaryMessageDecoder<RideRecordCompatible> DECODER =\n      new BinaryMessageDecoder<>(MODEL$, SCHEMA$);\n\n  /**\n   * Return the BinaryMessageEncoder instance used by this class.\n   * @return the message encoder used by this class\n   */\n  public static BinaryMessageEncoder<RideRecordCompatible> getEncoder() {\n    return ENCODER;\n  }\n\n  /**\n   * Return the BinaryMessageDecoder instance used by this class.\n   * @return the message decoder used by this class\n   
*/\n  public static BinaryMessageDecoder<RideRecordCompatible> getDecoder() {\n    return DECODER;\n  }\n\n  /**\n   * Create a new BinaryMessageDecoder instance for this class that uses the specified {@link SchemaStore}.\n   * @param resolver a {@link SchemaStore} used to find schemas by fingerprint\n   * @return a BinaryMessageDecoder instance for this class backed by the given SchemaStore\n   */\n  public static BinaryMessageDecoder<RideRecordCompatible> createDecoder(SchemaStore resolver) {\n    return new BinaryMessageDecoder<>(MODEL$, SCHEMA$, resolver);\n  }\n\n  /**\n   * Serializes this RideRecordCompatible to a ByteBuffer.\n   * @return a buffer holding the serialized data for this instance\n   * @throws java.io.IOException if this instance could not be serialized\n   */\n  public java.nio.ByteBuffer toByteBuffer() throws java.io.IOException {\n    return ENCODER.encode(this);\n  }\n\n  /**\n   * Deserializes a RideRecordCompatible from a ByteBuffer.\n   * @param b a byte buffer holding serialized data for an instance of this class\n   * @return a RideRecordCompatible instance decoded from the given buffer\n   * @throws java.io.IOException if the given bytes could not be deserialized into an instance of this class\n   */\n  public static RideRecordCompatible fromByteBuffer(\n      java.nio.ByteBuffer b) throws java.io.IOException {\n    return DECODER.decode(b);\n  }\n\n  private java.lang.String vendorId;\n  private int passenger_count;\n  private double trip_distance;\n  private java.lang.Long pu_location_id;\n\n  /**\n   * Default constructor.  Note that this does not initialize fields\n   * to their default values from the schema.  
If that is desired then\n   * one should use <code>newBuilder()</code>.\n   */\n  public RideRecordCompatible() {}\n\n  /**\n   * All-args constructor.\n   * @param vendorId The new value for vendorId\n   * @param passenger_count The new value for passenger_count\n   * @param trip_distance The new value for trip_distance\n   * @param pu_location_id The new value for pu_location_id\n   */\n  public RideRecordCompatible(java.lang.String vendorId, java.lang.Integer passenger_count, java.lang.Double trip_distance, java.lang.Long pu_location_id) {\n    this.vendorId = vendorId;\n    this.passenger_count = passenger_count;\n    this.trip_distance = trip_distance;\n    this.pu_location_id = pu_location_id;\n  }\n\n  @Override\n  public org.apache.avro.specific.SpecificData getSpecificData() { return MODEL$; }\n\n  @Override\n  public org.apache.avro.Schema getSchema() { return SCHEMA$; }\n\n  // Used by DatumWriter.  Applications should not call.\n  @Override\n  public java.lang.Object get(int field$) {\n    switch (field$) {\n    case 0: return vendorId;\n    case 1: return passenger_count;\n    case 2: return trip_distance;\n    case 3: return pu_location_id;\n    default: throw new IndexOutOfBoundsException(\"Invalid index: \" + field$);\n    }\n  }\n\n  // Used by DatumReader.  Applications should not call.\n  @Override\n  @SuppressWarnings(value=\"unchecked\")\n  public void put(int field$, java.lang.Object value$) {\n    switch (field$) {\n    case 0: vendorId = value$ != null ? 
value$.toString() : null; break;\n    case 1: passenger_count = (java.lang.Integer)value$; break;\n    case 2: trip_distance = (java.lang.Double)value$; break;\n    case 3: pu_location_id = (java.lang.Long)value$; break;\n    default: throw new IndexOutOfBoundsException(\"Invalid index: \" + field$);\n    }\n  }\n\n  /**\n   * Gets the value of the 'vendorId' field.\n   * @return The value of the 'vendorId' field.\n   */\n  public java.lang.String getVendorId() {\n    return vendorId;\n  }\n\n\n  /**\n   * Sets the value of the 'vendorId' field.\n   * @param value the value to set.\n   */\n  public void setVendorId(java.lang.String value) {\n    this.vendorId = value;\n  }\n\n  /**\n   * Gets the value of the 'passenger_count' field.\n   * @return The value of the 'passenger_count' field.\n   */\n  public int getPassengerCount() {\n    return passenger_count;\n  }\n\n\n  /**\n   * Sets the value of the 'passenger_count' field.\n   * @param value the value to set.\n   */\n  public void setPassengerCount(int value) {\n    this.passenger_count = value;\n  }\n\n  /**\n   * Gets the value of the 'trip_distance' field.\n   * @return The value of the 'trip_distance' field.\n   */\n  public double getTripDistance() {\n    return trip_distance;\n  }\n\n\n  /**\n   * Sets the value of the 'trip_distance' field.\n   * @param value the value to set.\n   */\n  public void setTripDistance(double value) {\n    this.trip_distance = value;\n  }\n\n  /**\n   * Gets the value of the 'pu_location_id' field.\n   * @return The value of the 'pu_location_id' field.\n   */\n  public java.lang.Long getPuLocationId() {\n    return pu_location_id;\n  }\n\n\n  /**\n   * Sets the value of the 'pu_location_id' field.\n   * @param value the value to set.\n   */\n  public void setPuLocationId(java.lang.Long value) {\n    this.pu_location_id = value;\n  }\n\n  /**\n   * Creates a new RideRecordCompatible RecordBuilder.\n   * @return A new RideRecordCompatible RecordBuilder\n   */\n  public static 
schemaregistry.RideRecordCompatible.Builder newBuilder() {\n    return new schemaregistry.RideRecordCompatible.Builder();\n  }\n\n  /**\n   * Creates a new RideRecordCompatible RecordBuilder by copying an existing Builder.\n   * @param other The existing builder to copy.\n   * @return A new RideRecordCompatible RecordBuilder\n   */\n  public static schemaregistry.RideRecordCompatible.Builder newBuilder(schemaregistry.RideRecordCompatible.Builder other) {\n    if (other == null) {\n      return new schemaregistry.RideRecordCompatible.Builder();\n    } else {\n      return new schemaregistry.RideRecordCompatible.Builder(other);\n    }\n  }\n\n  /**\n   * Creates a new RideRecordCompatible RecordBuilder by copying an existing RideRecordCompatible instance.\n   * @param other The existing instance to copy.\n   * @return A new RideRecordCompatible RecordBuilder\n   */\n  public static schemaregistry.RideRecordCompatible.Builder newBuilder(schemaregistry.RideRecordCompatible other) {\n    if (other == null) {\n      return new schemaregistry.RideRecordCompatible.Builder();\n    } else {\n      return new schemaregistry.RideRecordCompatible.Builder(other);\n    }\n  }\n\n  /**\n   * RecordBuilder for RideRecordCompatible instances.\n   */\n  @org.apache.avro.specific.AvroGenerated\n  public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<RideRecordCompatible>\n    implements org.apache.avro.data.RecordBuilder<RideRecordCompatible> {\n\n    private java.lang.String vendorId;\n    private int passenger_count;\n    private double trip_distance;\n    private java.lang.Long pu_location_id;\n\n    /** Creates a new Builder */\n    private Builder() {\n      super(SCHEMA$, MODEL$);\n    }\n\n    /**\n     * Creates a Builder by copying an existing Builder.\n     * @param other The existing Builder to copy.\n     */\n    private Builder(schemaregistry.RideRecordCompatible.Builder other) {\n      super(other);\n      if (isValidValue(fields()[0], 
other.vendorId)) {\n        this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId);\n        fieldSetFlags()[0] = other.fieldSetFlags()[0];\n      }\n      if (isValidValue(fields()[1], other.passenger_count)) {\n        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);\n        fieldSetFlags()[1] = other.fieldSetFlags()[1];\n      }\n      if (isValidValue(fields()[2], other.trip_distance)) {\n        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);\n        fieldSetFlags()[2] = other.fieldSetFlags()[2];\n      }\n      if (isValidValue(fields()[3], other.pu_location_id)) {\n        this.pu_location_id = data().deepCopy(fields()[3].schema(), other.pu_location_id);\n        fieldSetFlags()[3] = other.fieldSetFlags()[3];\n      }\n    }\n\n    /**\n     * Creates a Builder by copying an existing RideRecordCompatible instance\n     * @param other The existing instance to copy.\n     */\n    private Builder(schemaregistry.RideRecordCompatible other) {\n      super(SCHEMA$, MODEL$);\n      if (isValidValue(fields()[0], other.vendorId)) {\n        this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId);\n        fieldSetFlags()[0] = true;\n      }\n      if (isValidValue(fields()[1], other.passenger_count)) {\n        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);\n        fieldSetFlags()[1] = true;\n      }\n      if (isValidValue(fields()[2], other.trip_distance)) {\n        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);\n        fieldSetFlags()[2] = true;\n      }\n      if (isValidValue(fields()[3], other.pu_location_id)) {\n        this.pu_location_id = data().deepCopy(fields()[3].schema(), other.pu_location_id);\n        fieldSetFlags()[3] = true;\n      }\n    }\n\n    /**\n      * Gets the value of the 'vendorId' field.\n      * @return The value.\n      */\n    public java.lang.String 
getVendorId() {\n      return vendorId;\n    }\n\n\n    /**\n      * Sets the value of the 'vendorId' field.\n      * @param value The value of 'vendorId'.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordCompatible.Builder setVendorId(java.lang.String value) {\n      validate(fields()[0], value);\n      this.vendorId = value;\n      fieldSetFlags()[0] = true;\n      return this;\n    }\n\n    /**\n      * Checks whether the 'vendorId' field has been set.\n      * @return True if the 'vendorId' field has been set, false otherwise.\n      */\n    public boolean hasVendorId() {\n      return fieldSetFlags()[0];\n    }\n\n\n    /**\n      * Clears the value of the 'vendorId' field.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordCompatible.Builder clearVendorId() {\n      vendorId = null;\n      fieldSetFlags()[0] = false;\n      return this;\n    }\n\n    /**\n      * Gets the value of the 'passenger_count' field.\n      * @return The value.\n      */\n    public int getPassengerCount() {\n      return passenger_count;\n    }\n\n\n    /**\n      * Sets the value of the 'passenger_count' field.\n      * @param value The value of 'passenger_count'.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordCompatible.Builder setPassengerCount(int value) {\n      validate(fields()[1], value);\n      this.passenger_count = value;\n      fieldSetFlags()[1] = true;\n      return this;\n    }\n\n    /**\n      * Checks whether the 'passenger_count' field has been set.\n      * @return True if the 'passenger_count' field has been set, false otherwise.\n      */\n    public boolean hasPassengerCount() {\n      return fieldSetFlags()[1];\n    }\n\n\n    /**\n      * Clears the value of the 'passenger_count' field.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordCompatible.Builder clearPassengerCount() {\n      fieldSetFlags()[1] = false;\n      return this;\n    }\n\n    
/**\n      * Gets the value of the 'trip_distance' field.\n      * @return The value.\n      */\n    public double getTripDistance() {\n      return trip_distance;\n    }\n\n\n    /**\n      * Sets the value of the 'trip_distance' field.\n      * @param value The value of 'trip_distance'.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordCompatible.Builder setTripDistance(double value) {\n      validate(fields()[2], value);\n      this.trip_distance = value;\n      fieldSetFlags()[2] = true;\n      return this;\n    }\n\n    /**\n      * Checks whether the 'trip_distance' field has been set.\n      * @return True if the 'trip_distance' field has been set, false otherwise.\n      */\n    public boolean hasTripDistance() {\n      return fieldSetFlags()[2];\n    }\n\n\n    /**\n      * Clears the value of the 'trip_distance' field.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordCompatible.Builder clearTripDistance() {\n      fieldSetFlags()[2] = false;\n      return this;\n    }\n\n    /**\n      * Gets the value of the 'pu_location_id' field.\n      * @return The value.\n      */\n    public java.lang.Long getPuLocationId() {\n      return pu_location_id;\n    }\n\n\n    /**\n      * Sets the value of the 'pu_location_id' field.\n      * @param value The value of 'pu_location_id'.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordCompatible.Builder setPuLocationId(java.lang.Long value) {\n      validate(fields()[3], value);\n      this.pu_location_id = value;\n      fieldSetFlags()[3] = true;\n      return this;\n    }\n\n    /**\n      * Checks whether the 'pu_location_id' field has been set.\n      * @return True if the 'pu_location_id' field has been set, false otherwise.\n      */\n    public boolean hasPuLocationId() {\n      return fieldSetFlags()[3];\n    }\n\n\n    /**\n      * Clears the value of the 'pu_location_id' field.\n      * @return This builder.\n      */\n    
public schemaregistry.RideRecordCompatible.Builder clearPuLocationId() {\n      pu_location_id = null;\n      fieldSetFlags()[3] = false;\n      return this;\n    }\n\n    @Override\n    @SuppressWarnings(\"unchecked\")\n    public RideRecordCompatible build() {\n      try {\n        RideRecordCompatible record = new RideRecordCompatible();\n        record.vendorId = fieldSetFlags()[0] ? this.vendorId : (java.lang.String) defaultValue(fields()[0]);\n        record.passenger_count = fieldSetFlags()[1] ? this.passenger_count : (java.lang.Integer) defaultValue(fields()[1]);\n        record.trip_distance = fieldSetFlags()[2] ? this.trip_distance : (java.lang.Double) defaultValue(fields()[2]);\n        record.pu_location_id = fieldSetFlags()[3] ? this.pu_location_id : (java.lang.Long) defaultValue(fields()[3]);\n        return record;\n      } catch (org.apache.avro.AvroMissingFieldException e) {\n        throw e;\n      } catch (java.lang.Exception e) {\n        throw new org.apache.avro.AvroRuntimeException(e);\n      }\n    }\n  }\n\n  @SuppressWarnings(\"unchecked\")\n  private static final org.apache.avro.io.DatumWriter<RideRecordCompatible>\n    WRITER$ = (org.apache.avro.io.DatumWriter<RideRecordCompatible>)MODEL$.createDatumWriter(SCHEMA$);\n\n  @Override public void writeExternal(java.io.ObjectOutput out)\n    throws java.io.IOException {\n    WRITER$.write(this, SpecificData.getEncoder(out));\n  }\n\n  @SuppressWarnings(\"unchecked\")\n  private static final org.apache.avro.io.DatumReader<RideRecordCompatible>\n    READER$ = (org.apache.avro.io.DatumReader<RideRecordCompatible>)MODEL$.createDatumReader(SCHEMA$);\n\n  @Override public void readExternal(java.io.ObjectInput in)\n    throws java.io.IOException {\n    READER$.read(this, SpecificData.getDecoder(in));\n  }\n\n  @Override protected boolean hasCustomCoders() { return true; }\n\n  @Override public void customEncode(org.apache.avro.io.Encoder out)\n    throws java.io.IOException\n  {\n    
out.writeString(this.vendorId);\n\n    out.writeInt(this.passenger_count);\n\n    out.writeDouble(this.trip_distance);\n\n    if (this.pu_location_id == null) {\n      out.writeIndex(0);\n      out.writeNull();\n    } else {\n      out.writeIndex(1);\n      out.writeLong(this.pu_location_id);\n    }\n\n  }\n\n  @Override public void customDecode(org.apache.avro.io.ResolvingDecoder in)\n    throws java.io.IOException\n  {\n    org.apache.avro.Schema.Field[] fieldOrder = in.readFieldOrderIfDiff();\n    if (fieldOrder == null) {\n      this.vendorId = in.readString();\n\n      this.passenger_count = in.readInt();\n\n      this.trip_distance = in.readDouble();\n\n      if (in.readIndex() != 1) {\n        in.readNull();\n        this.pu_location_id = null;\n      } else {\n        this.pu_location_id = in.readLong();\n      }\n\n    } else {\n      for (int i = 0; i < 4; i++) {\n        switch (fieldOrder[i].pos()) {\n        case 0:\n          this.vendorId = in.readString();\n          break;\n\n        case 1:\n          this.passenger_count = in.readInt();\n          break;\n\n        case 2:\n          this.trip_distance = in.readDouble();\n          break;\n\n        case 3:\n          if (in.readIndex() != 1) {\n            in.readNull();\n            this.pu_location_id = null;\n          } else {\n            this.pu_location_id = in.readLong();\n          }\n          break;\n\n        default:\n          throw new java.io.IOException(\"Corrupt ResolvingDecoder.\");\n        }\n      }\n    }\n  }\n}\n\n\n\n\n\n\n\n\n\n\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/build/generated-main-avro-java/schemaregistry/RideRecordNoneCompatible.java",
    "content": "/**\n * Autogenerated by Avro\n *\n * DO NOT EDIT DIRECTLY\n */\npackage schemaregistry;\n\nimport org.apache.avro.generic.GenericArray;\nimport org.apache.avro.specific.SpecificData;\nimport org.apache.avro.util.Utf8;\nimport org.apache.avro.message.BinaryMessageEncoder;\nimport org.apache.avro.message.BinaryMessageDecoder;\nimport org.apache.avro.message.SchemaStore;\n\n@org.apache.avro.specific.AvroGenerated\npublic class RideRecordNoneCompatible extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {\n  private static final long serialVersionUID = -4618980179396772493L;\n\n\n  public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse(\"{\\\"type\\\":\\\"record\\\",\\\"name\\\":\\\"RideRecordNoneCompatible\\\",\\\"namespace\\\":\\\"schemaregistry\\\",\\\"fields\\\":[{\\\"name\\\":\\\"vendorId\\\",\\\"type\\\":\\\"int\\\"},{\\\"name\\\":\\\"passenger_count\\\",\\\"type\\\":\\\"int\\\"},{\\\"name\\\":\\\"trip_distance\\\",\\\"type\\\":\\\"double\\\"}]}\");\n  public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }\n\n  private static final SpecificData MODEL$ = new SpecificData();\n\n  private static final BinaryMessageEncoder<RideRecordNoneCompatible> ENCODER =\n      new BinaryMessageEncoder<>(MODEL$, SCHEMA$);\n\n  private static final BinaryMessageDecoder<RideRecordNoneCompatible> DECODER =\n      new BinaryMessageDecoder<>(MODEL$, SCHEMA$);\n\n  /**\n   * Return the BinaryMessageEncoder instance used by this class.\n   * @return the message encoder used by this class\n   */\n  public static BinaryMessageEncoder<RideRecordNoneCompatible> getEncoder() {\n    return ENCODER;\n  }\n\n  /**\n   * Return the BinaryMessageDecoder instance used by this class.\n   * @return the message decoder used by this class\n   */\n  public static BinaryMessageDecoder<RideRecordNoneCompatible> getDecoder() {\n    return DECODER;\n  }\n\n  /**\n   * Create a new 
BinaryMessageDecoder instance for this class that uses the specified {@link SchemaStore}.\n   * @param resolver a {@link SchemaStore} used to find schemas by fingerprint\n   * @return a BinaryMessageDecoder instance for this class backed by the given SchemaStore\n   */\n  public static BinaryMessageDecoder<RideRecordNoneCompatible> createDecoder(SchemaStore resolver) {\n    return new BinaryMessageDecoder<>(MODEL$, SCHEMA$, resolver);\n  }\n\n  /**\n   * Serializes this RideRecordNoneCompatible to a ByteBuffer.\n   * @return a buffer holding the serialized data for this instance\n   * @throws java.io.IOException if this instance could not be serialized\n   */\n  public java.nio.ByteBuffer toByteBuffer() throws java.io.IOException {\n    return ENCODER.encode(this);\n  }\n\n  /**\n   * Deserializes a RideRecordNoneCompatible from a ByteBuffer.\n   * @param b a byte buffer holding serialized data for an instance of this class\n   * @return a RideRecordNoneCompatible instance decoded from the given buffer\n   * @throws java.io.IOException if the given bytes could not be deserialized into an instance of this class\n   */\n  public static RideRecordNoneCompatible fromByteBuffer(\n      java.nio.ByteBuffer b) throws java.io.IOException {\n    return DECODER.decode(b);\n  }\n\n  private int vendorId;\n  private int passenger_count;\n  private double trip_distance;\n\n  /**\n   * Default constructor.  Note that this does not initialize fields\n   * to their default values from the schema.  
If that is desired then\n   * one should use <code>newBuilder()</code>.\n   */\n  public RideRecordNoneCompatible() {}\n\n  /**\n   * All-args constructor.\n   * @param vendorId The new value for vendorId\n   * @param passenger_count The new value for passenger_count\n   * @param trip_distance The new value for trip_distance\n   */\n  public RideRecordNoneCompatible(java.lang.Integer vendorId, java.lang.Integer passenger_count, java.lang.Double trip_distance) {\n    this.vendorId = vendorId;\n    this.passenger_count = passenger_count;\n    this.trip_distance = trip_distance;\n  }\n\n  @Override\n  public org.apache.avro.specific.SpecificData getSpecificData() { return MODEL$; }\n\n  @Override\n  public org.apache.avro.Schema getSchema() { return SCHEMA$; }\n\n  // Used by DatumWriter.  Applications should not call.\n  @Override\n  public java.lang.Object get(int field$) {\n    switch (field$) {\n    case 0: return vendorId;\n    case 1: return passenger_count;\n    case 2: return trip_distance;\n    default: throw new IndexOutOfBoundsException(\"Invalid index: \" + field$);\n    }\n  }\n\n  // Used by DatumReader.  
Applications should not call.\n  @Override\n  @SuppressWarnings(value=\"unchecked\")\n  public void put(int field$, java.lang.Object value$) {\n    switch (field$) {\n    case 0: vendorId = (java.lang.Integer)value$; break;\n    case 1: passenger_count = (java.lang.Integer)value$; break;\n    case 2: trip_distance = (java.lang.Double)value$; break;\n    default: throw new IndexOutOfBoundsException(\"Invalid index: \" + field$);\n    }\n  }\n\n  /**\n   * Gets the value of the 'vendorId' field.\n   * @return The value of the 'vendorId' field.\n   */\n  public int getVendorId() {\n    return vendorId;\n  }\n\n\n  /**\n   * Sets the value of the 'vendorId' field.\n   * @param value the value to set.\n   */\n  public void setVendorId(int value) {\n    this.vendorId = value;\n  }\n\n  /**\n   * Gets the value of the 'passenger_count' field.\n   * @return The value of the 'passenger_count' field.\n   */\n  public int getPassengerCount() {\n    return passenger_count;\n  }\n\n\n  /**\n   * Sets the value of the 'passenger_count' field.\n   * @param value the value to set.\n   */\n  public void setPassengerCount(int value) {\n    this.passenger_count = value;\n  }\n\n  /**\n   * Gets the value of the 'trip_distance' field.\n   * @return The value of the 'trip_distance' field.\n   */\n  public double getTripDistance() {\n    return trip_distance;\n  }\n\n\n  /**\n   * Sets the value of the 'trip_distance' field.\n   * @param value the value to set.\n   */\n  public void setTripDistance(double value) {\n    this.trip_distance = value;\n  }\n\n  /**\n   * Creates a new RideRecordNoneCompatible RecordBuilder.\n   * @return A new RideRecordNoneCompatible RecordBuilder\n   */\n  public static schemaregistry.RideRecordNoneCompatible.Builder newBuilder() {\n    return new schemaregistry.RideRecordNoneCompatible.Builder();\n  }\n\n  /**\n   * Creates a new RideRecordNoneCompatible RecordBuilder by copying an existing Builder.\n   * @param other The existing builder to copy.\n   * 
@return A new RideRecordNoneCompatible RecordBuilder\n   */\n  public static schemaregistry.RideRecordNoneCompatible.Builder newBuilder(schemaregistry.RideRecordNoneCompatible.Builder other) {\n    if (other == null) {\n      return new schemaregistry.RideRecordNoneCompatible.Builder();\n    } else {\n      return new schemaregistry.RideRecordNoneCompatible.Builder(other);\n    }\n  }\n\n  /**\n   * Creates a new RideRecordNoneCompatible RecordBuilder by copying an existing RideRecordNoneCompatible instance.\n   * @param other The existing instance to copy.\n   * @return A new RideRecordNoneCompatible RecordBuilder\n   */\n  public static schemaregistry.RideRecordNoneCompatible.Builder newBuilder(schemaregistry.RideRecordNoneCompatible other) {\n    if (other == null) {\n      return new schemaregistry.RideRecordNoneCompatible.Builder();\n    } else {\n      return new schemaregistry.RideRecordNoneCompatible.Builder(other);\n    }\n  }\n\n  /**\n   * RecordBuilder for RideRecordNoneCompatible instances.\n   */\n  @org.apache.avro.specific.AvroGenerated\n  public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<RideRecordNoneCompatible>\n    implements org.apache.avro.data.RecordBuilder<RideRecordNoneCompatible> {\n\n    private int vendorId;\n    private int passenger_count;\n    private double trip_distance;\n\n    /** Creates a new Builder */\n    private Builder() {\n      super(SCHEMA$, MODEL$);\n    }\n\n    /**\n     * Creates a Builder by copying an existing Builder.\n     * @param other The existing Builder to copy.\n     */\n    private Builder(schemaregistry.RideRecordNoneCompatible.Builder other) {\n      super(other);\n      if (isValidValue(fields()[0], other.vendorId)) {\n        this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId);\n        fieldSetFlags()[0] = other.fieldSetFlags()[0];\n      }\n      if (isValidValue(fields()[1], other.passenger_count)) {\n        this.passenger_count = 
data().deepCopy(fields()[1].schema(), other.passenger_count);\n        fieldSetFlags()[1] = other.fieldSetFlags()[1];\n      }\n      if (isValidValue(fields()[2], other.trip_distance)) {\n        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);\n        fieldSetFlags()[2] = other.fieldSetFlags()[2];\n      }\n    }\n\n    /**\n     * Creates a Builder by copying an existing RideRecordNoneCompatible instance\n     * @param other The existing instance to copy.\n     */\n    private Builder(schemaregistry.RideRecordNoneCompatible other) {\n      super(SCHEMA$, MODEL$);\n      if (isValidValue(fields()[0], other.vendorId)) {\n        this.vendorId = data().deepCopy(fields()[0].schema(), other.vendorId);\n        fieldSetFlags()[0] = true;\n      }\n      if (isValidValue(fields()[1], other.passenger_count)) {\n        this.passenger_count = data().deepCopy(fields()[1].schema(), other.passenger_count);\n        fieldSetFlags()[1] = true;\n      }\n      if (isValidValue(fields()[2], other.trip_distance)) {\n        this.trip_distance = data().deepCopy(fields()[2].schema(), other.trip_distance);\n        fieldSetFlags()[2] = true;\n      }\n    }\n\n    /**\n      * Gets the value of the 'vendorId' field.\n      * @return The value.\n      */\n    public int getVendorId() {\n      return vendorId;\n    }\n\n\n    /**\n      * Sets the value of the 'vendorId' field.\n      * @param value The value of 'vendorId'.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordNoneCompatible.Builder setVendorId(int value) {\n      validate(fields()[0], value);\n      this.vendorId = value;\n      fieldSetFlags()[0] = true;\n      return this;\n    }\n\n    /**\n      * Checks whether the 'vendorId' field has been set.\n      * @return True if the 'vendorId' field has been set, false otherwise.\n      */\n    public boolean hasVendorId() {\n      return fieldSetFlags()[0];\n    }\n\n\n    /**\n      * Clears the value of the 
'vendorId' field.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordNoneCompatible.Builder clearVendorId() {\n      fieldSetFlags()[0] = false;\n      return this;\n    }\n\n    /**\n      * Gets the value of the 'passenger_count' field.\n      * @return The value.\n      */\n    public int getPassengerCount() {\n      return passenger_count;\n    }\n\n\n    /**\n      * Sets the value of the 'passenger_count' field.\n      * @param value The value of 'passenger_count'.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordNoneCompatible.Builder setPassengerCount(int value) {\n      validate(fields()[1], value);\n      this.passenger_count = value;\n      fieldSetFlags()[1] = true;\n      return this;\n    }\n\n    /**\n      * Checks whether the 'passenger_count' field has been set.\n      * @return True if the 'passenger_count' field has been set, false otherwise.\n      */\n    public boolean hasPassengerCount() {\n      return fieldSetFlags()[1];\n    }\n\n\n    /**\n      * Clears the value of the 'passenger_count' field.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordNoneCompatible.Builder clearPassengerCount() {\n      fieldSetFlags()[1] = false;\n      return this;\n    }\n\n    /**\n      * Gets the value of the 'trip_distance' field.\n      * @return The value.\n      */\n    public double getTripDistance() {\n      return trip_distance;\n    }\n\n\n    /**\n      * Sets the value of the 'trip_distance' field.\n      * @param value The value of 'trip_distance'.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordNoneCompatible.Builder setTripDistance(double value) {\n      validate(fields()[2], value);\n      this.trip_distance = value;\n      fieldSetFlags()[2] = true;\n      return this;\n    }\n\n    /**\n      * Checks whether the 'trip_distance' field has been set.\n      * @return True if the 'trip_distance' field has been set, false 
otherwise.\n      */\n    public boolean hasTripDistance() {\n      return fieldSetFlags()[2];\n    }\n\n\n    /**\n      * Clears the value of the 'trip_distance' field.\n      * @return This builder.\n      */\n    public schemaregistry.RideRecordNoneCompatible.Builder clearTripDistance() {\n      fieldSetFlags()[2] = false;\n      return this;\n    }\n\n    @Override\n    @SuppressWarnings(\"unchecked\")\n    public RideRecordNoneCompatible build() {\n      try {\n        RideRecordNoneCompatible record = new RideRecordNoneCompatible();\n        record.vendorId = fieldSetFlags()[0] ? this.vendorId : (java.lang.Integer) defaultValue(fields()[0]);\n        record.passenger_count = fieldSetFlags()[1] ? this.passenger_count : (java.lang.Integer) defaultValue(fields()[1]);\n        record.trip_distance = fieldSetFlags()[2] ? this.trip_distance : (java.lang.Double) defaultValue(fields()[2]);\n        return record;\n      } catch (org.apache.avro.AvroMissingFieldException e) {\n        throw e;\n      } catch (java.lang.Exception e) {\n        throw new org.apache.avro.AvroRuntimeException(e);\n      }\n    }\n  }\n\n  @SuppressWarnings(\"unchecked\")\n  private static final org.apache.avro.io.DatumWriter<RideRecordNoneCompatible>\n    WRITER$ = (org.apache.avro.io.DatumWriter<RideRecordNoneCompatible>)MODEL$.createDatumWriter(SCHEMA$);\n\n  @Override public void writeExternal(java.io.ObjectOutput out)\n    throws java.io.IOException {\n    WRITER$.write(this, SpecificData.getEncoder(out));\n  }\n\n  @SuppressWarnings(\"unchecked\")\n  private static final org.apache.avro.io.DatumReader<RideRecordNoneCompatible>\n    READER$ = (org.apache.avro.io.DatumReader<RideRecordNoneCompatible>)MODEL$.createDatumReader(SCHEMA$);\n\n  @Override public void readExternal(java.io.ObjectInput in)\n    throws java.io.IOException {\n    READER$.read(this, SpecificData.getDecoder(in));\n  }\n\n  @Override protected boolean hasCustomCoders() { return true; }\n\n  @Override public void 
customEncode(org.apache.avro.io.Encoder out)\n    throws java.io.IOException\n  {\n    out.writeInt(this.vendorId);\n\n    out.writeInt(this.passenger_count);\n\n    out.writeDouble(this.trip_distance);\n\n  }\n\n  @Override public void customDecode(org.apache.avro.io.ResolvingDecoder in)\n    throws java.io.IOException\n  {\n    org.apache.avro.Schema.Field[] fieldOrder = in.readFieldOrderIfDiff();\n    if (fieldOrder == null) {\n      this.vendorId = in.readInt();\n\n      this.passenger_count = in.readInt();\n\n      this.trip_distance = in.readDouble();\n\n    } else {\n      for (int i = 0; i < 3; i++) {\n        switch (fieldOrder[i].pos()) {\n        case 0:\n          this.vendorId = in.readInt();\n          break;\n\n        case 1:\n          this.passenger_count = in.readInt();\n          break;\n\n        case 2:\n          this.trip_distance = in.readDouble();\n          break;\n\n        default:\n          throw new java.io.IOException(\"Corrupt ResolvingDecoder.\");\n        }\n      }\n    }\n  }\n}\n\n\n\n\n\n\n\n\n\n\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/build.gradle",
    "content": "plugins {\n    id 'java'\n    id \"com.github.davidmc24.gradle.plugin.avro\" version \"1.5.0\"\n}\n\n\ngroup 'org.example'\nversion '1.0-SNAPSHOT'\n\nrepositories {\n    mavenCentral()\n    maven {\n        url \"https://packages.confluent.io/maven\"\n    }\n}\n\ndependencies {\n    implementation 'org.apache.kafka:kafka-clients:3.3.1'\n    implementation 'com.opencsv:opencsv:5.7.1'\n    implementation 'io.confluent:kafka-json-serializer:7.3.1'\n    implementation 'org.apache.kafka:kafka-streams:3.3.1'\n    implementation 'io.confluent:kafka-avro-serializer:7.3.1'\n    implementation 'io.confluent:kafka-schema-registry-client:7.3.1'\n    implementation 'io.confluent:kafka-streams-avro-serde:7.3.1'\n    implementation \"org.apache.avro:avro:1.11.0\"\n    testImplementation 'org.junit.jupiter:junit-jupiter-api:5.8.1'\n    testRuntimeOnly 'org.junit.jupiter:junit-jupiter-engine:5.8.1'\n    testImplementation 'org.apache.kafka:kafka-streams-test-utils:3.3.1'\n}\n\nsourceSets.main.java.srcDirs = ['build/generated-main-avro-java','src/main/java']\n\ntest {\n    useJUnitPlatform()\n}\n\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/gradle/wrapper/gradle-wrapper.properties",
    "content": "distributionBase=GRADLE_USER_HOME\ndistributionPath=wrapper/dists\ndistributionUrl=https\\://services.gradle.org/distributions/gradle-7.5.1-bin.zip\nzipStoreBase=GRADLE_USER_HOME\nzipStorePath=wrapper/dists\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/gradlew",
    "content": "#!/bin/sh\n\n#\n# Copyright © 2015-2021 the original authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#      https://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n#\n\n##############################################################################\n#\n#   Gradle start up script for POSIX generated by Gradle.\n#\n#   Important for running:\n#\n#   (1) You need a POSIX-compliant shell to run this script. If your /bin/sh is\n#       noncompliant, but you have some other compliant shell such as ksh or\n#       bash, then to run this script, type that shell name before the whole\n#       command line, like:\n#\n#           ksh Gradle\n#\n#       Busybox and similar reduced shells will NOT work, because this script\n#       requires all of these POSIX shell features:\n#         * functions;\n#         * expansions «$var», «${var}», «${var:-default}», «${var+SET}»,\n#           «${var#prefix}», «${var%suffix}», and «$( cmd )»;\n#         * compound commands having a testable exit status, especially «case»;\n#         * various built-in commands including «command», «set», and «ulimit».\n#\n#   Important for patching:\n#\n#   (2) This script targets any POSIX shell, so it avoids extensions provided\n#       by Bash, Ksh, etc; in particular arrays are avoided.\n#\n#       The \"traditional\" practice of packing multiple parameters into a\n#       space-separated string is a well documented source of bugs and security\n#       problems, so this is (mostly) avoided, by progressively accumulating\n#       
options in \"$@\", and eventually passing that to Java.\n#\n#       Where the inherited environment variables (DEFAULT_JVM_OPTS, JAVA_OPTS,\n#       and GRADLE_OPTS) rely on word-splitting, this is performed explicitly;\n#       see the in-line comments for details.\n#\n#       There are tweaks for specific operating systems such as AIX, CygWin,\n#       Darwin, MinGW, and NonStop.\n#\n#   (3) This script is generated from the Groovy template\n#       https://github.com/gradle/gradle/blob/master/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt\n#       within the Gradle project.\n#\n#       You can find Gradle at https://github.com/gradle/gradle/.\n#\n##############################################################################\n\n# Attempt to set APP_HOME\n\n# Resolve links: $0 may be a link\napp_path=$0\n\n# Need this for daisy-chained symlinks.\nwhile\n    APP_HOME=${app_path%\"${app_path##*/}\"}  # leaves a trailing /; empty if no leading path\n    [ -h \"$app_path\" ]\ndo\n    ls=$( ls -ld \"$app_path\" )\n    link=${ls#*' -> '}\n    case $link in             #(\n      /*)   app_path=$link ;; #(\n      *)    app_path=$APP_HOME$link ;;\n    esac\ndone\n\nAPP_HOME=$( cd \"${APP_HOME:-./}\" && pwd -P ) || exit\n\nAPP_NAME=\"Gradle\"\nAPP_BASE_NAME=${0##*/}\n\n# Add default JVM options here. 
You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.\nDEFAULT_JVM_OPTS='\"-Xmx64m\" \"-Xms64m\"'\n\n# Use the maximum available, or set MAX_FD != -1 to use that value.\nMAX_FD=maximum\n\nwarn () {\n    echo \"$*\"\n} >&2\n\ndie () {\n    echo\n    echo \"$*\"\n    echo\n    exit 1\n} >&2\n\n# OS specific support (must be 'true' or 'false').\ncygwin=false\nmsys=false\ndarwin=false\nnonstop=false\ncase \"$( uname )\" in                #(\n  CYGWIN* )         cygwin=true  ;; #(\n  Darwin* )         darwin=true  ;; #(\n  MSYS* | MINGW* )  msys=true    ;; #(\n  NONSTOP* )        nonstop=true ;;\nesac\n\nCLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar\n\n\n# Determine the Java command to use to start the JVM.\nif [ -n \"$JAVA_HOME\" ] ; then\n    if [ -x \"$JAVA_HOME/jre/sh/java\" ] ; then\n        # IBM's JDK on AIX uses strange locations for the executables\n        JAVACMD=$JAVA_HOME/jre/sh/java\n    else\n        JAVACMD=$JAVA_HOME/bin/java\n    fi\n    if [ ! -x \"$JAVACMD\" ] ; then\n        die \"ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME\n\nPlease set the JAVA_HOME variable in your environment to match the\nlocation of your Java installation.\"\n    fi\nelse\n    JAVACMD=java\n    which java >/dev/null 2>&1 || die \"ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.\n\nPlease set the JAVA_HOME variable in your environment to match the\nlocation of your Java installation.\"\nfi\n\n# Increase the maximum file descriptors if we can.\nif ! \"$cygwin\" && ! \"$darwin\" && ! 
\"$nonstop\" ; then\n    case $MAX_FD in #(\n      max*)\n        MAX_FD=$( ulimit -H -n ) ||\n            warn \"Could not query maximum file descriptor limit\"\n    esac\n    case $MAX_FD in  #(\n      '' | soft) :;; #(\n      *)\n        ulimit -n \"$MAX_FD\" ||\n            warn \"Could not set maximum file descriptor limit to $MAX_FD\"\n    esac\nfi\n\n# Collect all arguments for the java command, stacking in reverse order:\n#   * args from the command line\n#   * the main class name\n#   * -classpath\n#   * -D...appname settings\n#   * --module-path (only if needed)\n#   * DEFAULT_JVM_OPTS, JAVA_OPTS, and GRADLE_OPTS environment variables.\n\n# For Cygwin or MSYS, switch paths to Windows format before running java\nif \"$cygwin\" || \"$msys\" ; then\n    APP_HOME=$( cygpath --path --mixed \"$APP_HOME\" )\n    CLASSPATH=$( cygpath --path --mixed \"$CLASSPATH\" )\n\n    JAVACMD=$( cygpath --unix \"$JAVACMD\" )\n\n    # Now convert the arguments - kludge to limit ourselves to /bin/sh\n    for arg do\n        if\n            case $arg in                                #(\n              -*)   false ;;                            # don't mess with options #(\n              /?*)  t=${arg#/} t=/${t%%/*}              # looks like a POSIX filepath\n                    [ -e \"$t\" ] ;;                      #(\n              *)    false ;;\n            esac\n        then\n            arg=$( cygpath --path --ignore --mixed \"$arg\" )\n        fi\n        # Roll the args list around exactly as many times as the number of\n        # args, so each arg winds up back in the position where it started, but\n        # possibly modified.\n        #\n        # NB: a `for` loop captures its iteration list before it begins, so\n        # changing the positional parameters here affects neither the number of\n        # iterations, nor the values presented in `arg`.\n        shift                   # remove old arg\n        set -- \"$@\" \"$arg\"      # push replacement arg\n    
done\nfi\n\n# Collect all arguments for the java command;\n#   * $DEFAULT_JVM_OPTS, $JAVA_OPTS, and $GRADLE_OPTS can contain fragments of\n#     shell script including quotes and variable substitutions, so put them in\n#     double quotes to make sure that they get re-expanded; and\n#   * put everything else in single quotes, so that it's not re-expanded.\n\nset -- \\\n        \"-Dorg.gradle.appname=$APP_BASE_NAME\" \\\n        -classpath \"$CLASSPATH\" \\\n        org.gradle.wrapper.GradleWrapperMain \\\n        \"$@\"\n\n# Stop when \"xargs\" is not available.\nif ! command -v xargs >/dev/null 2>&1\nthen\n    die \"xargs is not available\"\nfi\n\n# Use \"xargs\" to parse quoted args.\n#\n# With -n1 it outputs one arg per line, with the quotes and backslashes removed.\n#\n# In Bash we could simply go:\n#\n#   readarray ARGS < <( xargs -n1 <<<\"$var\" ) &&\n#   set -- \"${ARGS[@]}\" \"$@\"\n#\n# but POSIX shell has neither arrays nor command substitution, so instead we\n# post-process each arg (as a line of input to sed) to backslash-escape any\n# character that might be a shell metacharacter, then use eval to reverse\n# that process (while maintaining the separation between arguments), and wrap\n# the whole thing up as a single \"set\" statement.\n#\n# This will of course break if any of these variables contains a newline or\n# an unmatched quote.\n#\n\neval \"set -- $(\n        printf '%s\\n' \"$DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS\" |\n        xargs -n1 |\n        sed ' s~[^-[:alnum:]+,./:=@_]~\\\\&~g; ' |\n        tr '\\n' ' '\n    )\" '\"$@\"'\n\nexec \"$JAVACMD\" \"$@\"\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/gradlew.bat",
    "content": "@rem\r\n@rem Copyright 2015 the original author or authors.\r\n@rem\r\n@rem Licensed under the Apache License, Version 2.0 (the \"License\");\r\n@rem you may not use this file except in compliance with the License.\r\n@rem You may obtain a copy of the License at\r\n@rem\r\n@rem      https://www.apache.org/licenses/LICENSE-2.0\r\n@rem\r\n@rem Unless required by applicable law or agreed to in writing, software\r\n@rem distributed under the License is distributed on an \"AS IS\" BASIS,\r\n@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\r\n@rem See the License for the specific language governing permissions and\r\n@rem limitations under the License.\r\n@rem\r\n\r\n@if \"%DEBUG%\"==\"\" @echo off\r\n@rem ##########################################################################\r\n@rem\r\n@rem  Gradle startup script for Windows\r\n@rem\r\n@rem ##########################################################################\r\n\r\n@rem Set local scope for the variables with windows NT shell\r\nif \"%OS%\"==\"Windows_NT\" setlocal\r\n\r\nset DIRNAME=%~dp0\r\nif \"%DIRNAME%\"==\"\" set DIRNAME=.\r\nset APP_BASE_NAME=%~n0\r\nset APP_HOME=%DIRNAME%\r\n\r\n@rem Resolve any \".\" and \"..\" in APP_HOME to make it shorter.\r\nfor %%i in (\"%APP_HOME%\") do set APP_HOME=%%~fi\r\n\r\n@rem Add default JVM options here. 
You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.\r\nset DEFAULT_JVM_OPTS=\"-Xmx64m\" \"-Xms64m\"\r\n\r\n@rem Find java.exe\r\nif defined JAVA_HOME goto findJavaFromJavaHome\r\n\r\nset JAVA_EXE=java.exe\r\n%JAVA_EXE% -version >NUL 2>&1\r\nif %ERRORLEVEL% equ 0 goto execute\r\n\r\necho.\r\necho ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.\r\necho.\r\necho Please set the JAVA_HOME variable in your environment to match the\r\necho location of your Java installation.\r\n\r\ngoto fail\r\n\r\n:findJavaFromJavaHome\r\nset JAVA_HOME=%JAVA_HOME:\"=%\r\nset JAVA_EXE=%JAVA_HOME%/bin/java.exe\r\n\r\nif exist \"%JAVA_EXE%\" goto execute\r\n\r\necho.\r\necho ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME%\r\necho.\r\necho Please set the JAVA_HOME variable in your environment to match the\r\necho location of your Java installation.\r\n\r\ngoto fail\r\n\r\n:execute\r\n@rem Setup the command line\r\n\r\nset CLASSPATH=%APP_HOME%\\gradle\\wrapper\\gradle-wrapper.jar\r\n\r\n\r\n@rem Execute Gradle\r\n\"%JAVA_EXE%\" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% \"-Dorg.gradle.appname=%APP_BASE_NAME%\" -classpath \"%CLASSPATH%\" org.gradle.wrapper.GradleWrapperMain %*\r\n\r\n:end\r\n@rem End local scope for the variables with windows NT shell\r\nif %ERRORLEVEL% equ 0 goto mainEnd\r\n\r\n:fail\r\nrem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of\r\nrem the _cmd.exe /c_ return code!\r\nset EXIT_CODE=%ERRORLEVEL%\r\nif %EXIT_CODE% equ 0 set EXIT_CODE=1\r\nif not \"\"==\"%GRADLE_EXIT_CONSOLE%\" exit %EXIT_CODE%\r\nexit /b %EXIT_CODE%\r\n\r\n:mainEnd\r\nif \"%OS%\"==\"Windows_NT\" endlocal\r\n\r\n:omega\r\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/settings.gradle",
    "content": "pluginManagement {\n    repositories {\n        gradlePluginPortal()\n        mavenCentral()\n    }\n}\nrootProject.name = 'kafka_examples'"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/avro/rides.avsc",
    "content": "{\n       \"type\": \"record\",\n       \"name\":\"RideRecord\",\n       \"namespace\": \"schemaregistry\",\n       \"fields\":[\n         {\"name\":\"vendor_id\",\"type\":\"string\"},\n         {\"name\":\"passenger_count\",\"type\":\"int\"},\n         {\"name\":\"trip_distance\",\"type\":\"double\"}\n       ]\n}"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/avro/rides_compatible.avsc",
    "content": "{\n   \"type\": \"record\",\n       \"name\":\"RideRecordCompatible\",\n       \"namespace\": \"schemaregistry\",\n       \"fields\":[\n         {\"name\":\"vendorId\",\"type\":\"string\"},\n         {\"name\":\"passenger_count\",\"type\":\"int\"},\n         {\"name\":\"trip_distance\",\"type\":\"double\"},\n         {\"name\":\"pu_location_id\", \"type\": [ \"null\", \"long\" ], \"default\": null}\n       ]\n}"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/avro/rides_non_compatible.avsc",
    "content": "{\n   \"type\": \"record\",\n       \"name\":\"RideRecordNoneCompatible\",\n       \"namespace\": \"schemaregistry\",\n       \"fields\":[\n         {\"name\":\"vendorId\",\"type\":\"int\"},\n         {\"name\":\"passenger_count\",\"type\":\"int\"},\n         {\"name\":\"trip_distance\",\"type\":\"double\"}\n       ]\n}"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/AvroProducer.java",
    "content": "package org.example;\n\nimport com.opencsv.CSVReader;\nimport com.opencsv.exceptions.CsvException;\nimport io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;\nimport io.confluent.kafka.serializers.KafkaAvroSerializer;\nimport org.apache.kafka.clients.producer.KafkaProducer;\nimport org.apache.kafka.clients.producer.ProducerConfig;\nimport org.apache.kafka.clients.producer.ProducerRecord;\nimport org.apache.kafka.streams.StreamsConfig;\nimport schemaregistry.RideRecord;\n\nimport java.io.FileReader;\nimport java.io.IOException;\nimport java.util.List;\nimport java.util.Properties;\nimport java.util.concurrent.ExecutionException;\nimport java.util.stream.Collectors;\n\npublic class AvroProducer {\n\n    private Properties props = new Properties();\n\n    public AvroProducer() {\n        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, \"pkc-75m1o.europe-west3.gcp.confluent.cloud:9092\");\n        props.put(\"security.protocol\", \"SASL_SSL\");\n        props.put(\"sasl.jaas.config\", \"org.apache.kafka.common.security.plain.PlainLoginModule required username='\"+Secrets.KAFKA_CLUSTER_KEY+\"' password='\"+Secrets.KAFKA_CLUSTER_SECRET+\"';\");\n        props.put(\"sasl.mechanism\", \"PLAIN\");\n        props.put(\"client.dns.lookup\", \"use_all_dns_ips\");\n        props.put(\"session.timeout.ms\", \"45000\");\n        props.put(ProducerConfig.ACKS_CONFIG, \"all\");\n        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, \"org.apache.kafka.common.serialization.StringSerializer\");\n        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());\n\n        props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, \"https://psrc-kk5gg.europe-west3.gcp.confluent.cloud\");\n        props.put(\"basic.auth.credentials.source\", \"USER_INFO\");\n        props.put(\"basic.auth.user.info\", Secrets.SCHEMA_REGISTRY_KEY+\":\"+Secrets.SCHEMA_REGISTRY_SECRET);\n    }\n\n    public List<RideRecord> 
getRides() throws IOException, CsvException {\n        var ridesStream = this.getClass().getResource(\"/rides.csv\");\n        // try-with-resources closes the CSV reader once all rows have been read\n        try (var reader = new CSVReader(new FileReader(ridesStream.getFile()))) {\n            reader.skip(1);\n\n            return reader.readAll().stream().map(row ->\n                RideRecord.newBuilder()\n                        .setVendorId(row[0])\n                        .setTripDistance(Double.parseDouble(row[4]))\n                        .setPassengerCount(Integer.parseInt(row[3]))\n                        .build()\n                    ).collect(Collectors.toList());\n        }\n    }\n\n    public void publishRides(List<RideRecord> rides) throws ExecutionException, InterruptedException {\n        // try-with-resources closes the producer on exit, waiting for in-flight sends to complete\n        try (KafkaProducer<String, RideRecord> kafkaProducer = new KafkaProducer<>(props)) {\n            for (RideRecord ride : rides) {\n                var record = kafkaProducer.send(new ProducerRecord<>(\"rides_avro\", String.valueOf(ride.getVendorId()), ride), (metadata, exception) -> {\n                    if (exception != null) {\n                        System.out.println(exception.getMessage());\n                    }\n                });\n                System.out.println(record.get().offset());\n                Thread.sleep(500);\n            }\n        }\n    }\n\n    public static void main(String[] args) throws IOException, CsvException, ExecutionException, InterruptedException {\n        var producer = new AvroProducer();\n        var rideRecords = producer.getRides();\n        producer.publishRides(rideRecords);\n    }\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonConsumer.java",
    "content": "package org.example;\n\nimport org.apache.kafka.clients.consumer.ConsumerConfig;\nimport org.apache.kafka.clients.consumer.ConsumerRecord;\nimport org.apache.kafka.clients.consumer.KafkaConsumer;\nimport org.apache.kafka.clients.producer.ProducerConfig;\nimport org.example.data.Ride;\n\nimport java.time.Duration;\nimport java.time.temporal.ChronoUnit;\nimport java.time.temporal.TemporalUnit;\nimport java.util.List;\nimport java.util.Properties;\nimport io.confluent.kafka.serializers.KafkaJsonDeserializerConfig;\npublic class JsonConsumer {\n\n    private Properties props = new Properties();\n    private KafkaConsumer<String, Ride> consumer;\n    public JsonConsumer() {\n        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, \"pkc-75m1o.europe-west3.gcp.confluent.cloud:9092\");\n        props.put(\"security.protocol\", \"SASL_SSL\");\n        props.put(\"sasl.jaas.config\", \"org.apache.kafka.common.security.plain.PlainLoginModule required username='\"+Secrets.KAFKA_CLUSTER_KEY+\"' password='\"+Secrets.KAFKA_CLUSTER_SECRET+\"';\");\n        props.put(\"sasl.mechanism\", \"PLAIN\");\n        props.put(\"client.dns.lookup\", \"use_all_dns_ips\");\n        props.put(\"session.timeout.ms\", \"45000\");\n        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, \"org.apache.kafka.common.serialization.StringDeserializer\");\n        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, \"io.confluent.kafka.serializers.KafkaJsonDeserializer\");\n        props.put(ConsumerConfig.GROUP_ID_CONFIG, \"kafka_tutorial_example.jsonconsumer.v2\");\n        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, \"earliest\");\n        props.put(KafkaJsonDeserializerConfig.JSON_VALUE_TYPE, Ride.class);\n        consumer = new KafkaConsumer<String, Ride>(props);\n        consumer.subscribe(List.of(\"rides\"));\n\n    }\n\n    public void consumeFromKafka() {\n        System.out.println(\"Consuming form kafka started\");\n        var results = 
consumer.poll(Duration.of(1, ChronoUnit.SECONDS));\n        var i = 0;\n        do {\n\n            for(ConsumerRecord<String, Ride> result: results) {\n                System.out.println(result.value().DOLocationID);\n            }\n            results =  consumer.poll(Duration.of(1, ChronoUnit.SECONDS));\n            System.out.println(\"RESULTS:::\" + results.count());\n            i++;\n        }\n        while(!results.isEmpty() || i < 10);\n    }\n\n    public static void main(String[] args) {\n        JsonConsumer jsonConsumer = new JsonConsumer();\n        jsonConsumer.consumeFromKafka();\n    }\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonKStream.java",
    "content": "package org.example;\n\nimport org.apache.kafka.clients.consumer.ConsumerConfig;\nimport org.apache.kafka.common.serialization.Serdes;\nimport org.apache.kafka.streams.KafkaStreams;\nimport org.apache.kafka.streams.StreamsBuilder;\nimport org.apache.kafka.streams.StreamsConfig;\nimport org.apache.kafka.streams.Topology;\nimport org.apache.kafka.streams.kstream.Consumed;\nimport org.apache.kafka.streams.kstream.Produced;\nimport org.example.customserdes.CustomSerdes;\nimport org.example.data.Ride;\n\nimport java.util.Properties;\n\npublic class JsonKStream {\n    private Properties props = new Properties();\n\n    public JsonKStream() {\n        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, \"pkc-75m1o.europe-west3.gcp.confluent.cloud:9092\");\n        props.put(\"security.protocol\", \"SASL_SSL\");\n        props.put(\"sasl.jaas.config\", \"org.apache.kafka.common.security.plain.PlainLoginModule required username='\"+Secrets.KAFKA_CLUSTER_KEY+\"' password='\"+Secrets.KAFKA_CLUSTER_SECRET+\"';\");\n        props.put(\"sasl.mechanism\", \"PLAIN\");\n        props.put(\"client.dns.lookup\", \"use_all_dns_ips\");\n        props.put(\"session.timeout.ms\", \"45000\");\n        props.put(StreamsConfig.APPLICATION_ID_CONFIG, \"kafka_tutorial.kstream.count.plocation.v1\");\n        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, \"latest\");\n        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);\n\n    }\n\n    public Topology createTopology() {\n        StreamsBuilder streamsBuilder = new StreamsBuilder();\n        var ridesStream = streamsBuilder.stream(\"rides\", Consumed.with(Serdes.String(), CustomSerdes.getSerde(Ride.class)));\n        var puLocationCount = ridesStream.groupByKey().count().toStream();\n        puLocationCount.to(\"rides-pulocation-count\", Produced.with(Serdes.String(), Serdes.Long()));\n        return streamsBuilder.build();\n    }\n\n    public void countPLocation() throws InterruptedException {\n        var 
topology = createTopology();\n        var kStreams = new KafkaStreams(topology, props);\n        kStreams.start();\n        while (kStreams.state() != KafkaStreams.State.RUNNING) {\n            System.out.println(kStreams.state());\n            Thread.sleep(1000);\n        }\n        System.out.println(kStreams.state());\n        Runtime.getRuntime().addShutdownHook(new Thread(kStreams::close));\n    }\n\n    public static void main(String[] args) throws InterruptedException {\n        var object = new JsonKStream();\n        object.countPLocation();\n    }\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonKStreamJoins.java",
    "content": "package org.example;\n\nimport org.apache.kafka.clients.consumer.ConsumerConfig;\nimport org.apache.kafka.common.serialization.Serdes;\nimport org.apache.kafka.streams.KafkaStreams;\nimport org.apache.kafka.streams.StreamsBuilder;\nimport org.apache.kafka.streams.StreamsConfig;\nimport org.apache.kafka.streams.Topology;\nimport org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler;\nimport org.apache.kafka.streams.kstream.*;\nimport org.example.customserdes.CustomSerdes;\nimport org.example.data.PickupLocation;\nimport org.example.data.Ride;\nimport org.example.data.VendorInfo;\n\nimport java.time.Duration;\nimport java.util.Optional;\nimport java.util.Properties;\npublic class JsonKStreamJoins {\n    private Properties props = new Properties();\n\n    public JsonKStreamJoins() {\n        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, \"pkc-75m1o.europe-west3.gcp.confluent.cloud:9092\");\n        props.put(\"security.protocol\", \"SASL_SSL\");\n        props.put(\"sasl.jaas.config\", \"org.apache.kafka.common.security.plain.PlainLoginModule required username='\"+Secrets.KAFKA_CLUSTER_KEY+\"' password='\"+Secrets.KAFKA_CLUSTER_SECRET+\"';\");\n        props.put(\"sasl.mechanism\", \"PLAIN\");\n        props.put(\"client.dns.lookup\", \"use_all_dns_ips\");\n        props.put(\"session.timeout.ms\", \"45000\");\n        props.put(StreamsConfig.APPLICATION_ID_CONFIG, \"kafka_tutorial.kstream.joined.rides.pickuplocation.v1\");\n        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, \"latest\");\n        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);\n    }\n\n    public Topology createTopology() {\n        StreamsBuilder streamsBuilder = new StreamsBuilder();\n        KStream<String, Ride> rides = streamsBuilder.stream(Topics.INPUT_RIDE_TOPIC, Consumed.with(Serdes.String(), CustomSerdes.getSerde(Ride.class)));\n        KStream<String, PickupLocation> pickupLocations = streamsBuilder.stream(Topics.INPUT_RIDE_LOCATION_TOPIC, 
Consumed.with(Serdes.String(), CustomSerdes.getSerde(PickupLocation.class)));\n\n        var pickupLocationsKeyedOnPUId = pickupLocations.selectKey((key, value) -> String.valueOf(value.PULocationID));\n\n        var joined = rides.join(pickupLocationsKeyedOnPUId, (ValueJoiner<Ride, PickupLocation, Optional<VendorInfo>>) (ride, pickupLocation) -> {\n                    var period = Duration.between(ride.tpep_dropoff_datetime, pickupLocation.tpep_pickup_datetime);\n                    if (period.abs().toMinutes() > 10) return Optional.empty();\n                    else return Optional.of(new VendorInfo(ride.VendorID, pickupLocation.PULocationID, pickupLocation.tpep_pickup_datetime, ride.tpep_dropoff_datetime));\n                }, JoinWindows.ofTimeDifferenceAndGrace(Duration.ofMinutes(20), Duration.ofMinutes(5)),\n                StreamJoined.with(Serdes.String(), CustomSerdes.getSerde(Ride.class), CustomSerdes.getSerde(PickupLocation.class)));\n\n        joined.filter(((key, value) -> value.isPresent())).mapValues(Optional::get)\n                .to(Topics.OUTPUT_TOPIC, Produced.with(Serdes.String(), CustomSerdes.getSerde(VendorInfo.class)));\n\n        return streamsBuilder.build();\n    }\n\n    public void joinRidesPickupLocation() throws InterruptedException {\n        var topology = createTopology();\n        var kStreams = new KafkaStreams(topology, props);\n\n        kStreams.setUncaughtExceptionHandler(exception -> {\n            System.out.println(exception.getMessage());\n            return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.SHUTDOWN_APPLICATION;\n        });\n        kStreams.start();\n        while (kStreams.state() != KafkaStreams.State.RUNNING) {\n            System.out.println(kStreams.state());\n            Thread.sleep(1000);\n        }\n        System.out.println(kStreams.state());\n        Runtime.getRuntime().addShutdownHook(new Thread(kStreams::close));\n\n    }\n\n    public static void main(String[] args) throws 
InterruptedException {\n        var object = new JsonKStreamJoins();\n        object.joinRidesPickupLocation();\n    }\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonKStreamWindow.java",
    "content": "package org.example;\n\nimport org.apache.kafka.clients.consumer.ConsumerConfig;\nimport org.apache.kafka.common.serialization.Serdes;\nimport org.apache.kafka.streams.KafkaStreams;\nimport org.apache.kafka.streams.StreamsBuilder;\nimport org.apache.kafka.streams.StreamsConfig;\nimport org.apache.kafka.streams.Topology;\nimport org.apache.kafka.streams.kstream.Consumed;\nimport org.apache.kafka.streams.kstream.Produced;\nimport org.apache.kafka.streams.kstream.TimeWindows;\nimport org.apache.kafka.streams.kstream.WindowedSerdes;\nimport org.example.customserdes.CustomSerdes;\nimport org.example.data.Ride;\n\nimport java.time.Duration;\nimport java.time.temporal.ChronoUnit;\nimport java.util.Properties;\n\npublic class JsonKStreamWindow {\n    private Properties props = new Properties();\n\n    public JsonKStreamWindow() {\n        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, \"pkc-75m1o.europe-west3.gcp.confluent.cloud:9092\");\n        props.put(\"security.protocol\", \"SASL_SSL\");\n        props.put(\"sasl.jaas.config\", \"org.apache.kafka.common.security.plain.PlainLoginModule required username='\"+Secrets.KAFKA_CLUSTER_KEY+\"' password='\"+Secrets.KAFKA_CLUSTER_SECRET+\"';\");\n        props.put(\"sasl.mechanism\", \"PLAIN\");\n        props.put(\"client.dns.lookup\", \"use_all_dns_ips\");\n        props.put(\"session.timeout.ms\", \"45000\");\n        props.put(StreamsConfig.APPLICATION_ID_CONFIG, \"kafka_tutorial.kstream.count.plocation.v1\");\n        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, \"latest\");\n        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);\n\n    }\n\n    public Topology createTopology() {\n        StreamsBuilder streamsBuilder = new StreamsBuilder();\n        var ridesStream = streamsBuilder.stream(\"rides\", Consumed.with(Serdes.String(), CustomSerdes.getSerde(Ride.class)));\n        var puLocationCount = ridesStream.groupByKey()\n                
.windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofSeconds(10), Duration.ofSeconds(5)))\n                .count().toStream();\n        var windowSerde = WindowedSerdes.timeWindowedSerdeFrom(String.class, 10*1000);\n\n        puLocationCount.to(\"rides-pulocation-window-count\", Produced.with(windowSerde, Serdes.Long()));\n        return streamsBuilder.build();\n    }\n\n    public void countPLocationWindowed() {\n        var topology = createTopology();\n        var kStreams = new KafkaStreams(topology, props);\n        kStreams.start();\n\n        Runtime.getRuntime().addShutdownHook(new Thread(kStreams::close));\n    }\n\n    public static void main(String[] args) {\n        var object = new JsonKStreamWindow();\n        object.countPLocationWindowed();\n    }\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonProducer.java",
    "content": "package org.example;\n\nimport com.opencsv.CSVReader;\nimport com.opencsv.exceptions.CsvException;\nimport org.apache.kafka.clients.producer.*;\nimport org.apache.kafka.streams.StreamsConfig;\nimport org.example.data.Ride;\n\nimport java.io.FileReader;\nimport java.io.IOException;\nimport java.time.LocalDateTime;\nimport java.util.List;\nimport java.util.Properties;\nimport java.util.concurrent.ExecutionException;\nimport java.util.stream.Collectors;\n\npublic class JsonProducer {\n    private Properties props = new Properties();\n    public JsonProducer() {\n        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, \"pkc-75m1o.europe-west3.gcp.confluent.cloud:9092\");\n        props.put(\"security.protocol\", \"SASL_SSL\");\n        props.put(\"sasl.jaas.config\", \"org.apache.kafka.common.security.plain.PlainLoginModule required username='\"+Secrets.KAFKA_CLUSTER_KEY+\"' password='\"+Secrets.KAFKA_CLUSTER_SECRET+\"';\");\n        props.put(\"sasl.mechanism\", \"PLAIN\");\n        props.put(\"client.dns.lookup\", \"use_all_dns_ips\");\n        props.put(\"session.timeout.ms\", \"45000\");\n        props.put(ProducerConfig.ACKS_CONFIG, \"all\");\n        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, \"org.apache.kafka.common.serialization.StringSerializer\");\n        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, \"io.confluent.kafka.serializers.KafkaJsonSerializer\");\n    }\n\n    public List<Ride> getRides() throws IOException, CsvException {\n        var ridesStream = this.getClass().getResource(\"/rides.csv\");\n        var reader = new CSVReader(new FileReader(ridesStream.getFile()));\n        reader.skip(1);\n        return reader.readAll().stream().map(arr -> new Ride(arr))\n                .collect(Collectors.toList());\n\n    }\n\n    public void publishRides(List<Ride> rides) throws ExecutionException, InterruptedException {\n        KafkaProducer<String, Ride> kafkaProducer = new KafkaProducer<String, Ride>(props);\n    
    for(Ride ride: rides) {\n            ride.tpep_pickup_datetime = LocalDateTime.now().minusMinutes(20);\n            ride.tpep_dropoff_datetime = LocalDateTime.now();\n            var record = kafkaProducer.send(new ProducerRecord<>(\"rides\", String.valueOf(ride.DOLocationID), ride), (metadata, exception) -> {\n                if(exception != null) {\n                    System.out.println(exception.getMessage());\n                }\n            });\n            System.out.println(record.get().offset());\n            System.out.println(ride.DOLocationID);\n            Thread.sleep(500);\n        }\n    }\n\n    public static void main(String[] args) throws IOException, CsvException, ExecutionException, InterruptedException {\n        var producer = new JsonProducer();\n        var rides = producer.getRides();\n        producer.publishRides(rides);\n    }\n}"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/JsonProducerPickupLocation.java",
    "content": "package org.example;\n\nimport com.opencsv.exceptions.CsvException;\nimport org.apache.kafka.clients.producer.KafkaProducer;\nimport org.apache.kafka.clients.producer.ProducerConfig;\nimport org.apache.kafka.clients.producer.ProducerRecord;\nimport org.example.data.PickupLocation;\n\nimport java.io.IOException;\nimport java.time.LocalDateTime;\nimport java.util.Properties;\nimport java.util.concurrent.ExecutionException;\n\npublic class JsonProducerPickupLocation {\n    private Properties props = new Properties();\n\n    public JsonProducerPickupLocation() {\n        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, \"pkc-75m1o.europe-west3.gcp.confluent.cloud:9092\");\n        props.put(\"security.protocol\", \"SASL_SSL\");\n        props.put(\"sasl.jaas.config\", \"org.apache.kafka.common.security.plain.PlainLoginModule required username='\"+Secrets.KAFKA_CLUSTER_KEY+\"' password='\"+Secrets.KAFKA_CLUSTER_SECRET+\"';\");\n        props.put(\"sasl.mechanism\", \"PLAIN\");\n        props.put(\"client.dns.lookup\", \"use_all_dns_ips\");\n        props.put(\"session.timeout.ms\", \"45000\");\n        props.put(ProducerConfig.ACKS_CONFIG, \"all\");\n        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, \"org.apache.kafka.common.serialization.StringSerializer\");\n        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, \"io.confluent.kafka.serializers.KafkaJsonSerializer\");\n    }\n\n    public void publish(PickupLocation pickupLocation) throws ExecutionException, InterruptedException {\n        KafkaProducer<String, PickupLocation> kafkaProducer = new KafkaProducer<String, PickupLocation>(props);\n        var record = kafkaProducer.send(new ProducerRecord<>(\"rides_location\", String.valueOf(pickupLocation.PULocationID), pickupLocation), (metadata, exception) -> {\n            if (exception != null) {\n                System.out.println(exception.getMessage());\n            }\n        });\n        
System.out.println(record.get().offset());\n    }\n\n\n    public static void main(String[] args) throws IOException, CsvException, ExecutionException, InterruptedException {\n        var producer = new JsonProducerPickupLocation();\n        producer.publish(new PickupLocation(186, LocalDateTime.now()));\n    }\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/Secrets.java",
    "content": "package org.example;\n\npublic class Secrets {\n    public static final String KAFKA_CLUSTER_KEY = \"REPLACE_WITH_YOUR_KAFKA_CLUSTER_KEY\";\n    public static final String KAFKA_CLUSTER_SECRET = \"REPLACE_WITH_YOUR_KAFKA_CLUSTER_SECRET\";\n\n    public static final String SCHEMA_REGISTRY_KEY = \"REPLACE_WITH_SCHEMA_REGISTRY_KEY\";\n    public static final String SCHEMA_REGISTRY_SECRET = \"REPLACE_WITH_SCHEMA_REGISTRY_SECRET\";\n\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/Topics.java",
    "content": "package org.example;\n\npublic class Topics {\n    public static final String INPUT_RIDE_TOPIC = \"rides\";\n    public static final String INPUT_RIDE_LOCATION_TOPIC = \"rides_location\";\n    public static final String OUTPUT_TOPIC = \"vendor_info\";\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/customserdes/CustomSerdes.java",
    "content": "package org.example.customserdes;\n\nimport io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;\nimport io.confluent.kafka.serializers.KafkaJsonDeserializer;\nimport io.confluent.kafka.serializers.KafkaJsonSerializer;\nimport io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde;\nimport org.apache.avro.specific.SpecificRecordBase;\nimport org.apache.kafka.common.serialization.Deserializer;\nimport org.apache.kafka.common.serialization.Serde;\nimport org.apache.kafka.common.serialization.Serdes;\nimport org.apache.kafka.common.serialization.Serializer;\nimport org.example.data.PickupLocation;\nimport org.example.data.Ride;\nimport org.example.data.VendorInfo;\n\nimport java.util.HashMap;\nimport java.util.Map;\n\npublic class CustomSerdes {\n\n    public static <T> Serde<T> getSerde(Class<T> classOf) {\n        Map<String, Object> serdeProps = new HashMap<>();\n        serdeProps.put(\"json.value.type\", classOf);\n        final Serializer<T> mySerializer = new KafkaJsonSerializer<>();\n        mySerializer.configure(serdeProps, false);\n\n        final Deserializer<T> myDeserializer = new KafkaJsonDeserializer<>();\n        myDeserializer.configure(serdeProps, false);\n        return Serdes.serdeFrom(mySerializer, myDeserializer);\n    }\n\n    public static <T extends SpecificRecordBase> SpecificAvroSerde getAvroSerde(boolean isKey, String schemaRegistryUrl) {\n        var serde = new SpecificAvroSerde<T>();\n\n        Map<String, Object> serdeProps = new HashMap<>();\n        serdeProps.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);\n        serde.configure(serdeProps, isKey);\n        return serde;\n    }\n\n\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/data/PickupLocation.java",
    "content": "package org.example.data;\n\nimport java.time.LocalDateTime;\n\npublic class PickupLocation {\n    public PickupLocation(long PULocationID, LocalDateTime tpep_pickup_datetime) {\n        this.PULocationID = PULocationID;\n        this.tpep_pickup_datetime = tpep_pickup_datetime;\n    }\n\n    public PickupLocation() {\n    }\n\n    public long PULocationID;\n    public LocalDateTime tpep_pickup_datetime;\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/data/Ride.java",
    "content": "package org.example.data;\n\nimport java.nio.DoubleBuffer;\nimport java.time.LocalDate;\nimport java.time.LocalDateTime;\nimport java.time.format.DateTimeFormatter;\n\npublic class Ride {\n    public Ride(String[] arr) {\n        VendorID = arr[0];\n        tpep_pickup_datetime = LocalDateTime.parse(arr[1], DateTimeFormatter.ofPattern(\"yyyy-MM-dd HH:mm:ss\"));\n        tpep_dropoff_datetime = LocalDateTime.parse(arr[2], DateTimeFormatter.ofPattern(\"yyyy-MM-dd HH:mm:ss\"));\n        passenger_count = Integer.parseInt(arr[3]);\n        trip_distance = Double.parseDouble(arr[4]);\n        RatecodeID = Long.parseLong(arr[5]);\n        store_and_fwd_flag = arr[6];\n        PULocationID = Long.parseLong(arr[7]);\n        DOLocationID = Long.parseLong(arr[8]);\n        payment_type = arr[9];\n        fare_amount = Double.parseDouble(arr[10]);\n        extra = Double.parseDouble(arr[11]);\n        mta_tax = Double.parseDouble(arr[12]);\n        tip_amount = Double.parseDouble(arr[13]);\n        tolls_amount = Double.parseDouble(arr[14]);\n        improvement_surcharge = Double.parseDouble(arr[15]);\n        total_amount = Double.parseDouble(arr[16]);\n        congestion_surcharge = Double.parseDouble(arr[17]);\n    }\n    public Ride(){}\n    public String VendorID;\n    public LocalDateTime tpep_pickup_datetime;\n    public LocalDateTime tpep_dropoff_datetime;\n    public int passenger_count;\n    public double trip_distance;\n    public long RatecodeID;\n    public String store_and_fwd_flag;\n    public long PULocationID;\n    public long DOLocationID;\n    public String payment_type;\n    public double fare_amount;\n    public double extra;\n    public double mta_tax;\n    public double tip_amount;\n    public double tolls_amount;\n    public double improvement_surcharge;\n    public double total_amount;\n    public double congestion_surcharge;\n\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/main/java/org/example/data/VendorInfo.java",
    "content": "package org.example.data;\n\nimport java.time.LocalDateTime;\n\npublic class VendorInfo {\n\n    public VendorInfo(String vendorID, long PULocationID, LocalDateTime pickupTime, LocalDateTime lastDropoffTime) {\n        VendorID = vendorID;\n        this.PULocationID = PULocationID;\n        this.pickupTime = pickupTime;\n        this.lastDropoffTime = lastDropoffTime;\n    }\n\n    public VendorInfo() {\n    }\n\n    public String VendorID;\n    public long PULocationID;\n    public LocalDateTime pickupTime;\n    public LocalDateTime lastDropoffTime;\n}\n"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/test/java/org/example/JsonKStreamJoinsTest.java",
    "content": "package org.example;\n\nimport org.apache.kafka.clients.consumer.ConsumerConfig;\nimport org.apache.kafka.common.internals.Topic;\nimport org.apache.kafka.common.serialization.Serdes;\nimport org.apache.kafka.streams.*;\nimport org.example.customserdes.CustomSerdes;\nimport org.example.data.PickupLocation;\nimport org.example.data.Ride;\nimport org.example.data.VendorInfo;\nimport org.example.helper.DataGeneratorHelper;\nimport org.junit.jupiter.api.AfterAll;\nimport org.junit.jupiter.api.BeforeEach;\nimport org.junit.jupiter.api.Test;\n\nimport javax.xml.crypto.Data;\nimport java.util.Properties;\n\nimport static org.junit.jupiter.api.Assertions.*;\n\nclass JsonKStreamJoinsTest {\n    private Properties props = new Properties();\n    private static TopologyTestDriver testDriver;\n    private TestInputTopic<String, Ride> ridesTopic;\n    private TestInputTopic<String, PickupLocation> pickLocationTopic;\n    private TestOutputTopic<String, VendorInfo> outputTopic;\n\n    private Topology topology = new JsonKStreamJoins().createTopology();\n    @BeforeEach\n    public void setup() {\n        props = new Properties();\n        props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, \"testing_count_application\");\n        props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, \"dummy:1234\");\n        if (testDriver != null) {\n            testDriver.close();\n        }\n        testDriver = new TopologyTestDriver(topology, props);\n        ridesTopic = testDriver.createInputTopic(Topics.INPUT_RIDE_TOPIC, Serdes.String().serializer(), CustomSerdes.getSerde(Ride.class).serializer());\n        pickLocationTopic = testDriver.createInputTopic(Topics.INPUT_RIDE_LOCATION_TOPIC, Serdes.String().serializer(), CustomSerdes.getSerde(PickupLocation.class).serializer());\n        outputTopic = testDriver.createOutputTopic(Topics.OUTPUT_TOPIC, Serdes.String().deserializer(), CustomSerdes.getSerde(VendorInfo.class).deserializer());\n    }\n\n    @Test\n    public 
void testIfJoinWorksOnSameDropOffPickupLocationId() {\n        Ride ride = DataGeneratorHelper.generateRide();\n        PickupLocation pickupLocation = DataGeneratorHelper.generatePickUpLocation(ride.DOLocationID);\n        ridesTopic.pipeInput(String.valueOf(ride.DOLocationID), ride);\n        pickLocationTopic.pipeInput(String.valueOf(pickupLocation.PULocationID), pickupLocation);\n\n        assertEquals(1, outputTopic.getQueueSize());\n        var expected = new VendorInfo(ride.VendorID, pickupLocation.PULocationID, pickupLocation.tpep_pickup_datetime, ride.tpep_dropoff_datetime);\n        var result = outputTopic.readKeyValue();\n        assertEquals(String.valueOf(ride.DOLocationID), result.key);\n        assertEquals(expected.VendorID, result.value.VendorID);\n        assertEquals(expected.pickupTime, result.value.pickupTime);\n    }\n\n\n    @AfterAll\n    public static void shutdown() {\n        testDriver.close();\n    }\n}"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/test/java/org/example/JsonKStreamTest.java",
    "content": "package org.example;\n\nimport org.apache.kafka.common.serialization.Serdes;\nimport org.apache.kafka.streams.*;\nimport org.example.customserdes.CustomSerdes;\nimport org.example.data.Ride;\nimport org.example.helper.DataGeneratorHelper;\nimport org.junit.jupiter.api.AfterAll;\nimport org.junit.jupiter.api.BeforeEach;\nimport org.junit.jupiter.api.Test;\nimport static org.junit.jupiter.api.Assertions.*;\nimport java.util.Properties;\n\nclass JsonKStreamTest {\n    private Properties props;\n    private static TopologyTestDriver testDriver;\n    private TestInputTopic<String, Ride> inputTopic;\n    private TestOutputTopic<String, Long> outputTopic;\n    private Topology topology = new JsonKStream().createTopology();\n\n    @BeforeEach\n    public void setup() {\n        props = new Properties();\n        props.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, \"testing_count_application\");\n        props.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, \"dummy:1234\");\n        if (testDriver != null) {\n            testDriver.close();\n        }\n        testDriver = new TopologyTestDriver(topology, props);\n        inputTopic = testDriver.createInputTopic(\"rides\", Serdes.String().serializer(), CustomSerdes.getSerde(Ride.class).serializer());\n        outputTopic = testDriver.createOutputTopic(\"rides-pulocation-count\", Serdes.String().deserializer(), Serdes.Long().deserializer());\n    }\n\n    @Test\n    public void testIfOneMessageIsPassedToInputTopicWeGetCountOfOne() {\n        Ride ride = DataGeneratorHelper.generateRide();\n        inputTopic.pipeInput(String.valueOf(ride.DOLocationID), ride);\n\n        assertEquals(outputTopic.readKeyValue(), KeyValue.pair(String.valueOf(ride.DOLocationID), 1L));\n        assertTrue(outputTopic.isEmpty());\n    }\n\n    @Test\n    public void testIfTwoMessageArePassedWithDifferentKey() {\n        Ride ride1 = DataGeneratorHelper.generateRide();\n        ride1.DOLocationID = 100L;\n        
inputTopic.pipeInput(String.valueOf(ride1.DOLocationID), ride1);\n\n        Ride ride2 = DataGeneratorHelper.generateRide();\n        ride2.DOLocationID = 200L;\n        inputTopic.pipeInput(String.valueOf(ride2.DOLocationID), ride2);\n\n        assertEquals(outputTopic.readKeyValue(), KeyValue.pair(String.valueOf(ride1.DOLocationID), 1L));\n        assertEquals(outputTopic.readKeyValue(), KeyValue.pair(String.valueOf(ride2.DOLocationID), 1L));\n        assertTrue(outputTopic.isEmpty());\n    }\n\n    @Test\n    public void testIfTwoMessageArePassedWithSameKey() {\n        Ride ride1 = DataGeneratorHelper.generateRide();\n        ride1.DOLocationID = 100L;\n        inputTopic.pipeInput(String.valueOf(ride1.DOLocationID), ride1);\n\n        Ride ride2 = DataGeneratorHelper.generateRide();\n        ride2.DOLocationID = 100L;\n        inputTopic.pipeInput(String.valueOf(ride2.DOLocationID), ride2);\n\n        assertEquals(outputTopic.readKeyValue(), KeyValue.pair(\"100\", 1L));\n        assertEquals(outputTopic.readKeyValue(), KeyValue.pair(\"100\", 2L));\n        assertTrue(outputTopic.isEmpty());\n    }\n\n\n    @AfterAll\n    public static void tearDown() {\n        testDriver.close();\n    }\n\n\n}"
  },
  {
    "path": "07-streaming/theory/java/kafka_examples/src/test/java/org/example/helper/DataGeneratorHelper.java",
    "content": "package org.example.helper;\n\nimport org.example.data.PickupLocation;\nimport org.example.data.Ride;\nimport org.example.data.VendorInfo;\n\nimport java.time.LocalDateTime;\nimport java.time.format.DateTimeFormatter;\nimport java.util.List;\n\npublic class DataGeneratorHelper {\n    public static Ride generateRide() {\n        var arrivalTime = LocalDateTime.now().format(DateTimeFormatter.ofPattern(\"yyyy-MM-dd HH:mm:ss\"));\n        var departureTime = LocalDateTime.now().minusMinutes(30).format(DateTimeFormatter.ofPattern(\"yyyy-MM-dd HH:mm:ss\"));\n        return new Ride(new String[]{\"1\", departureTime, arrivalTime,\"1\",\"1.50\",\"1\",\"N\",\"238\",\"75\",\"2\",\"8\",\"0.5\",\"0.5\",\"0\",\"0\",\"0.3\",\"9.3\",\"0\"});\n    }\n\n    public static PickupLocation generatePickUpLocation(long pickupLocationId) {\n        return new PickupLocation(pickupLocationId, LocalDateTime.now());\n    }\n}\n"
  },
  {
    "path": "07-streaming/workshop/.python-version",
    "content": "3.13\n"
  },
  {
    "path": "07-streaming/workshop/Dockerfile.flink",
    "content": "FROM flink:2.2.0-scala_2.12-java17\n\nCOPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/\n\n# ref: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker\n\nWORKDIR /opt/pyflink\nCOPY pyproject.flink.toml pyproject.toml\nRUN uv python install 3.12 && uv sync\nENV PATH=\"/opt/pyflink/.venv/bin:$PATH\"\n\n# Download connector libraries\n\nWORKDIR /opt/flink/lib\nRUN wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/2.2.0/flink-json-2.2.0.jar; \\\n    wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/4.0.1-2.0/flink-sql-connector-kafka-4.0.1-2.0.jar; \\\n    wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-core/4.0.0-2.0/flink-connector-jdbc-core-4.0.0-2.0.jar; \\\n    wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-postgres/4.0.0-2.0/flink-connector-jdbc-postgres-4.0.0-2.0.jar; \\\n    wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.10/postgresql-42.7.10.jar\n\nCOPY flink-config.yaml /opt/flink/conf/config.yaml\n\nWORKDIR /opt/flink\n"
  },
  {
    "path": "07-streaming/workshop/Dockerfile_ARM64.flink",
    "content": "FROM flink:2.2.0-scala_2.12-java17\n\nCOPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/\n\nUSER root\n\n# Install a full JDK (not just a runtime) plus native build tools for pemja\nRUN apt-get update && apt-get install -y --no-install-recommends \\\n    openjdk-17-jdk-headless \\\n    build-essential \\\n    python3-dev \\\n    wget \\\n    ca-certificates \\\n    && rm -rf /var/lib/apt/lists/*\n\n# Point JAVA_HOME at the full JDK and make /opt/java/openjdk match what pemja expects\nRUN JDK_DIR=\"$(dirname \"$(dirname \"$(readlink -f \"$(command -v javac)\")\")\")\" \\\n    && rm -rf /opt/java/openjdk \\\n    && ln -s \"${JDK_DIR}\" /opt/java/openjdk \\\n    && test -d /opt/java/openjdk/include\n\nENV JAVA_HOME=/opt/java/openjdk\nENV PATH=\"${JAVA_HOME}/bin:${PATH}\"\n\n# ref: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker\n\nWORKDIR /opt/pyflink\nCOPY pyproject.flink.toml pyproject.toml\nRUN uv python install 3.12 && uv sync\nENV PATH=\"/opt/pyflink/.venv/bin:$PATH\"\n\n# Download connector libraries\n# flink-json-2.2.0.jar is already bundled in the base image -- do NOT re-download it.\nWORKDIR /opt/flink/lib\nRUN wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/4.0.1-2.0/flink-sql-connector-kafka-4.0.1-2.0.jar \\\n    && wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-core/4.0.0-2.0/flink-connector-jdbc-core-4.0.0-2.0.jar \\\n    && wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-postgres/4.0.0-2.0/flink-connector-jdbc-postgres-4.0.0-2.0.jar \\\n    && wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.10/postgresql-42.7.10.jar\n\nCOPY flink-config.yaml /opt/flink/conf/config.yaml\n\nWORKDIR /opt/flink"
  },
  {
    "path": "07-streaming/workshop/Makefile",
    "content": ".PHONY: build up down job aggregation_job stop start\n\nbuild:\n\tdocker compose build\n\nup:\n\tdocker compose up --build --remove-orphans -d\n\ndown:\n\tdocker compose down --remove-orphans\n\njob:\n\tdocker compose exec jobmanager ./bin/flink run -py /opt/src/job/start_job.py --pyFiles /opt/src -d\n\naggregation_job:\n\tdocker compose exec jobmanager ./bin/flink run -py /opt/src/job/aggregation_job.py --pyFiles /opt/src -d\n\nstop:\n\tdocker compose stop\n\nstart:\n\tdocker compose start\n"
  },
  {
    "path": "07-streaming/workshop/README.md",
    "content": "# PyFlink: Stream Processing Workshop\n\nVideo: https://www.youtube.com/watch?v=YDUgFeHQzJU\n\nThis workshop is based on the\n[2025 stream with Zach Wilson](https://www.youtube.com/watch?v=P2loELMUUeI).\n\nIn this workshop, we build a real-time streaming pipeline step by step.\nWe start with the basics - a message broker, a producer, and a consumer -\nthen add a database and finally a stream processing framework.\n\nWe'll use NYC yellow taxi trip data as our data source.\n\nWhat we'll build by the end:\n\n```\nProducer (Python) -> Kafka (Redpanda) -> Flink -> PostgreSQL\n```\n\nPrerequisites:\n\n- Docker and Docker Compose\n- [uv](https://docs.astral.sh/uv/)\n- A SQL client - [pgcli](https://www.pgcli.com/) (`uvx pgcli`), DBeaver, pgAdmin, or DataGrip\n\nCode:\n\n- [Reference code](./) in this directory (`07-streaming/workshop/`)\n- [Code created during the workshop](live/) by Alexey\n\nThe README walks through building everything from scratch - you can follow\nalong step by step or study the existing files and run the commands.\n\n\n## Redpanda - a Kafka-compatible broker\n\nBefore we can produce or consume messages, we need a message broker -\na service that receives messages from producers, stores them, and delivers\nthem to consumers.\n\nWe use [Redpanda](https://redpanda.com/), a drop-in replacement for\nApache Kafka. Redpanda implements the same protocol, so any Kafka client\nlibrary works with it unchanged. The `kafka-python` library we'll use\ndoesn't know or care that Redpanda is running instead of Kafka.\n\nWhy Redpanda instead of Kafka?\n\n- No JVM - Kafka runs on Java and needs significant memory for the JVM.\n  Redpanda is written in C++ and starts in seconds with far less overhead.\n- No ZooKeeper - Kafka traditionally required a separate ZooKeeper cluster\n  for coordination (metadata, leader election). 
Redpanda handles this\n  internally using the Raft consensus protocol - one less service to run.\n- Single binary - just one container, nothing else to configure.\n\nFor this workshop, every time we say \"Kafka\" we mean the Kafka protocol\nand concepts. Redpanda is the actual broker running underneath.\n\nCreate `docker-compose.yml` with the Redpanda service:\n\n```yaml\nservices:\n  redpanda:\n    image: redpandadata/redpanda:v25.3.9\n    command:\n      - redpanda\n      - start\n      - --smp\n      - '1'\n      - --reserve-memory\n      - 0M\n      - --overprovisioned\n      - --node-id\n      - '1'\n      - --kafka-addr\n      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092\n      - --advertise-kafka-addr\n      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092\n      - --pandaproxy-addr\n      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082\n      - --advertise-pandaproxy-addr\n      - PLAINTEXT://redpanda:28082,OUTSIDE://localhost:8082\n      - --rpc-addr\n      - 0.0.0.0:33145\n      - --advertise-rpc-addr\n      - redpanda:33145\n    ports:\n      - 8082:8082\n      - 9092:9092\n      - 28082:28082\n      - 29092:29092\n```\n\nThe command has many parameters. Let's go through them.\n\nResource parameters:\n\n| Parameter | What it does |\n|---|---|\n| `--smp 1` | Use 1 CPU core. Redpanda is built on [Seastar](http://seastar.io/), a framework that pins threads to cores for high performance. For development, 1 core is enough. |\n| `--reserve-memory 0M` | Don't set aside any memory for the operating system. Seastar normally leaves part of the machine's RAM to the OS and claims the rest for Redpanda; in development we skip this reservation. |\n| `--overprovisioned` | Don't pin threads to specific CPU cores. On a shared development machine, this avoids contention with other processes. |\n| `--node-id 1` | Unique identifier for this broker in the cluster. With a single broker it doesn't matter, but the parameter is required. 
|\n\nNetworking parameters:\n\nRedpanda exposes two separate listeners for the Kafka protocol - one for\nconnections from inside Docker (other containers) and one for connections\nfrom outside Docker (your laptop):\n\n| Parameter | Internal (Docker) | External (your laptop) |\n|---|---|---|\n| `--kafka-addr` | `PLAINTEXT://0.0.0.0:29092` | `OUTSIDE://0.0.0.0:9092` |\n| `--advertise-kafka-addr` | `PLAINTEXT://redpanda:29092` | `OUTSIDE://localhost:9092` |\n\nWhy two addresses? Kafka clients use a two-step connection process:\n\n1. The client connects to a bootstrap server and asks for cluster metadata\n2. The broker responds with advertised addresses - where the client should\n   connect for actual data transfer\n\nInside Docker, containers find each other by service name, so the internal\nadvertised address is `redpanda:29092`. From your laptop, you connect via\nthe published port at `localhost:9092`. If we used only one address, either\nDocker containers or your laptop wouldn't be able to connect.\n\nThe `--pandaproxy-addr` / `--advertise-pandaproxy-addr` follow the same\npattern for Redpanda's HTTP REST API (not used in this workshop).\nThe `--rpc-addr` / `--advertise-rpc-addr` are for internal cluster\ncommunication between Redpanda nodes (not relevant with a single node).\n\nPublished ports:\n\n| Port | What it's for |\n|---|---|\n| `9092` | Kafka protocol (external) - your Python producer/consumer connects here |\n| `29092` | Kafka protocol (internal) - Flink containers will connect here later |\n| `8082` / `28082` | HTTP Proxy - REST API access (not used in this workshop) |\n\nStart Redpanda:\n\n```bash\ndocker compose up redpanda -d\n```\n\nVerify it's running:\n\n```bash\ndocker compose ps\n```\n\n```\nNAME                IMAGE                           SERVICE    STATUS\nworkshop-redpanda   redpandadata/redpanda:v25.3.9   redpanda   Up\n```\n\n\n## Produce messages to Kafka\n\nInitialize a Python project and add the dependencies we need:\n\n```bash\nuv 
init -p 3.12\nuv add kafka-python pandas pyarrow\n```\n\n> If you cloned the repository, `pyproject.toml` already exists.\n> Run `uv sync` instead.\n\nWe'll send NYC yellow taxi trip data to Kafka. You can run the code below\neither as a Python script or in a Jupyter notebook (`uv add jupyter`,\nthen `uv run jupyter lab`).\n\nFirst, download the data. We read a parquet file of yellow taxi trips and\ntake the first 1000 rows:\n\n```python\nimport pandas as pd\n\nurl = \"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet\"\ncolumns = ['PULocationID', 'DOLocationID', 'trip_distance', 'total_amount', 'tpep_pickup_datetime']\ndf = pd.read_parquet(url, columns=columns).head(1000)\ndf.head()\n```\n\nWe only read 5 columns to keep things focused. The full dataset has many\nmore (fare breakdown, rate codes, payment type, etc.).\n\nDefine a dataclass for our message. This gives us a clear schema for each\ntaxi trip:\n\n```python\nfrom dataclasses import dataclass\n\n@dataclass\nclass Ride:\n    PULocationID: int\n    DOLocationID: int\n    trip_distance: float\n    total_amount: float\n    tpep_pickup_datetime: int  # epoch milliseconds\n```\n\nWrite a function to convert a DataFrame row into a `Ride`. We convert the\npandas Timestamp to epoch milliseconds - that's the format Flink expects\nlater:\n\n```python\ndef ride_from_row(row):\n    return Ride(\n        PULocationID=int(row['PULocationID']),\n        DOLocationID=int(row['DOLocationID']),\n        trip_distance=float(row['trip_distance']),\n        total_amount=float(row['total_amount']),\n        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),\n    )\n```\n\nTest it:\n\n```python\nride = ride_from_row(df.iloc[0])\nride\n# Ride(PULocationID=186, DOLocationID=79, trip_distance=1.72,\n#      total_amount=17.31, tpep_pickup_datetime=1730429702000)\n```\n\nNext, connect to Kafka. 
The `bootstrap_servers` is where the broker accepts\nconnections - `localhost:9092` because we're running this from our laptop\n(outside Docker). In production with multiple brokers, you'd list several\nfor redundancy - if one is down, the client connects through another.\n\nKafka works with raw bytes, so we need a serializer that converts Python\ndicts to JSON:\n\n```python\nimport json\nfrom kafka import KafkaProducer\n\ndef json_serializer(data):\n    return json.dumps(data).encode('utf-8')\n\nserver = 'localhost:9092'\n\nproducer = KafkaProducer(\n    bootstrap_servers=[server],\n    value_serializer=json_serializer\n)\n```\n\nLet's send a single ride to try it out. `dataclasses.asdict(ride)` converts\nthe dataclass to a plain dict, which the serializer turns into JSON bytes.\nThe broker auto-creates the `rides` topic on first use:\n\n```python\nimport dataclasses\n\ntopic_name = 'rides'\n\nproducer.send(topic_name, value=dataclasses.asdict(ride))\nproducer.flush()\n```\n\nThis works, but calling `dataclasses.asdict()` every time is tedious. We\ncan make a serializer that handles dataclasses directly:\n\n```python\ndef ride_serializer(ride):\n    ride_dict = dataclasses.asdict(ride)\n    json_str = json.dumps(ride_dict)\n    return json_str.encode('utf-8')\n```\n\nNow recreate the producer with the new serializer - we can pass `Ride`\nobjects directly without converting them to dicts first:\n\n```python\nproducer = KafkaProducer(\n    bootstrap_servers=[server],\n    value_serializer=ride_serializer\n)\n```\n\nSend one ride to verify:\n\n```python\nproducer.send(topic_name, value=ride)\nproducer.flush()\n```\n\nThat sent one record. 
Now let's send all 1000 rides in a loop:\n\n```python\nimport time\n\nt0 = time.time()\n\nfor _, row in df.iterrows():\n    ride = ride_from_row(row)\n    producer.send(topic_name, value=ride)\n    print(f\"Sent: {ride}\")\n    time.sleep(0.01)\n\nproducer.flush()\n\nt1 = time.time()\nprint(f'took {(t1 - t0):.2f} seconds')\n```\n\nIf you're building from scratch (not using the cloned repo files), create\nthe source directory structure and save the shared data model. The\nproducer and consumer scripts both import from this file:\n\n```bash\nmkdir -p src/producers src/consumers src/job\n```\n\nCreate `src/models.py`:\n\n```python\nimport json\nfrom dataclasses import dataclass\n\n\n@dataclass\nclass Ride:\n    PULocationID: int\n    DOLocationID: int\n    trip_distance: float\n    total_amount: float\n    tpep_pickup_datetime: int  # epoch milliseconds\n\n\ndef ride_from_row(row):\n    return Ride(\n        PULocationID=int(row['PULocationID']),\n        DOLocationID=int(row['DOLocationID']),\n        trip_distance=float(row['trip_distance']),\n        total_amount=float(row['total_amount']),\n        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),\n    )\n\n\ndef ride_deserializer(data):\n    json_str = data.decode('utf-8')\n    ride_dict = json.loads(json_str)\n    return Ride(**ride_dict)\n```\n\n`ride_deserializer` is introduced in the next step - we include it here so\nthe file is complete.\n\n> The complete script is in `src/producers/producer.py`.\n\nRun it:\n\n```bash\nuv run python src/producers/producer.py\n```\n\nYou'll see 1000 taxi trips sent over ~10 seconds:\n\n```\nSent: Ride(PULocationID=..., DOLocationID=..., trip_distance=..., total_amount=..., tpep_pickup_datetime=...)\n...\ntook 10.23 seconds\n```\n\n\n## Consume messages with Python\n\nNow let's read back the messages. The consumer receives raw bytes from\nKafka. 
Instead of deserializing to a dict and then constructing a `Ride`\nmanually, let's write a function that does both in one step:\n\n```python\nimport json\n\ndef ride_deserializer(data):\n    json_str = data.decode('utf-8')\n    ride_dict = json.loads(json_str)\n    return Ride(**ride_dict)\n```\n\nTest it with a sample JSON binary string (this is what Kafka delivers):\n\n```python\ntest_bytes = json.dumps({\n    'PULocationID': 186,\n    'DOLocationID': 79,\n    'trip_distance': 1.72,\n    'total_amount': 17.31,\n    'tpep_pickup_datetime': 1730429702000\n}).encode('utf-8')\n\nride_deserializer(test_bytes)\n# Ride(PULocationID=186, DOLocationID=79, trip_distance=1.72,\n#      total_amount=17.31, tpep_pickup_datetime=1730429702000)\n```\n\nNow we can pass `ride_deserializer` directly as the `value_deserializer` -\nKafka calls it on every message, so `message.value` is already a `Ride`.\n\nConnect to Kafka as a consumer. `auto_offset_reset='earliest'` means we\nstart reading from the beginning of the topic (without this, new consumers\ndefault to `latest` and only see new messages). `group_id` identifies this\nconsumer group - Kafka tracks how far each group has read, so restarting\nwith the same group ID continues where it left off:\n\n```python\nfrom kafka import KafkaConsumer\n\nserver = 'localhost:9092'\ntopic_name = 'rides'\n\nconsumer = KafkaConsumer(\n    topic_name,\n    bootstrap_servers=[server],\n    auto_offset_reset='earliest',\n    group_id='rides-console',\n    value_deserializer=ride_deserializer\n)\n```\n\nRead messages and print them. 
Since `value_deserializer` returns a `Ride`,\n`message.value` is already a `Ride` object - no extra conversion needed:\n\n```python\nfrom datetime import datetime\n\nprint(f\"Listening to {topic_name}...\")\n\ncount = 0\nfor message in consumer:\n    ride = message.value\n    pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)\n    print(f\"Received: PU={ride.PULocationID}, DO={ride.DOLocationID}, \"\n          f\"distance={ride.trip_distance}, amount=${ride.total_amount:.2f}, \"\n          f\"pickup={pickup_dt}\")\n    count += 1\n    if count >= 10:\n        print(f\"\\n... received {count} messages so far (stopping after 10 for demo)\")\n        break\n\nconsumer.close()\n```\n\n> The complete script is in `src/consumers/consumer.py`.\n\nRun it:\n\n```bash\nuv run python src/consumers/consumer.py\n```\n\n```\nListening to rides...\nReceived: PU=..., DO=..., distance=..., amount=$..., pickup=2025-...\n...\n... received 10 messages so far (stopping after 10 for demo)\n```\n\n\n## Save events to PostgreSQL\n\nPrinting to the screen is fine for debugging, but let's save events to a\ndatabase. Add the PostgreSQL service to `docker-compose.yml`:\n\n```yaml\n  postgres:\n    image: postgres:18\n    restart: on-failure\n    environment:\n      - POSTGRES_DB=postgres\n      - POSTGRES_USER=postgres\n      - POSTGRES_PASSWORD=postgres\n    ports:\n      - \"5432:5432\"\n```\n\nStart it:\n\n```bash\ndocker compose up postgres -d\n```\n\nConnect to PostgreSQL. 
With `pgcli`:\n\n```bash\nuvx pgcli -h localhost -p 5432 -U postgres -d postgres\n# password: postgres\n```\n\nOr via Docker:\n\n```bash\ndocker compose exec postgres psql -U postgres -d postgres\n```\n\nCreate a table for our events:\n\n```sql\nCREATE TABLE processed_events (\n    PULocationID INTEGER,\n    DOLocationID INTEGER,\n    trip_distance DOUBLE PRECISION,\n    total_amount DOUBLE PRECISION,\n    pickup_datetime TIMESTAMP\n);\n```\n\nInstall the PostgreSQL client library:\n\n```bash\nuv add psycopg2-binary\n```\n\nCreate `src/consumers/consumer_postgres.py`.\n\nSet up the Kafka consumer. We reuse the same `ride_deserializer` from the\nprevious step. The `group_id` is different - each consumer group tracks its\noffsets independently, so the console consumer and the PostgreSQL consumer\neach read all messages:\n\n```python\nfrom kafka import KafkaConsumer\n\nserver = 'localhost:9092'\ntopic_name = 'rides'\n\nconsumer = KafkaConsumer(\n    topic_name,\n    bootstrap_servers=[server],\n    auto_offset_reset='earliest',\n    group_id='rides-to-postgres',\n    value_deserializer=ride_deserializer\n)\n```\n\nConnect to PostgreSQL:\n\n```python\nimport psycopg2\n\nconn = psycopg2.connect(\n    host='localhost',\n    port=5432,\n    database='postgres',\n    user='postgres',\n    password='postgres'\n)\nconn.autocommit = True\ncur = conn.cursor()\n```\n\n`autocommit = True` means each INSERT is committed immediately - no need\nto call `conn.commit()` after every row.\n\nRead messages and insert into PostgreSQL:\n\n```python\nfrom datetime import datetime\n\nprint(f\"Listening to {topic_name} and writing to PostgreSQL...\")\n\ncount = 0\nfor message in consumer:\n    ride = message.value\n    pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)\n    cur.execute(\n        \"\"\"INSERT INTO processed_events\n           (PULocationID, DOLocationID, trip_distance, total_amount, pickup_datetime)\n           VALUES (%s, %s, %s, %s, %s)\"\"\",\n        
(ride.PULocationID, ride.DOLocationID,\n         ride.trip_distance, ride.total_amount, pickup_dt)\n    )\n    count += 1\n    if count % 100 == 0:\n        print(f\"Inserted {count} rows...\")\n\nconsumer.close()\ncur.close()\nconn.close()\n```\n\nRun it (press Ctrl+C after it processes the data):\n\n```bash\nuv run python src/consumers/consumer_postgres.py\n```\n\nCheck PostgreSQL:\n\n```sql\nSELECT count(*) FROM processed_events;\n```\n\n```\n count\n-------\n  1000\n```\n\nThis works, but think about what's missing:\n\n- What if we want to aggregate by time window? We'd need to implement windowing\n  logic ourselves.\n- What if the consumer crashes? We'd need to track offsets ourselves to avoid\n  reprocessing or missing data.\n- What about parallelism? We'd need to manage multiple consumer instances and\n  partition assignment.\n- What about writing to different sinks? We'd need to write connector code for\n  each destination.\n\nThis is where Flink comes in. Clear the table before moving on:\n\n```sql\nTRUNCATE processed_events;\n```\n\n\n## Why Flink?\n\nFlink is a stream processing framework that handles all the hard parts:\n\n- Windowing - built-in tumbling, sliding, and session windows\n- Checkpointing - automatic state recovery after failures (no manual offset tracking)\n- Parallelism - distribute processing across multiple workers\n- Connectors - built-in JDBC, Kafka, filesystem sinks (no psycopg2 code)\n- SQL interface - express stream processing with SQL queries\n\nFlink can also connect to sources beyond Kafka - REST APIs, websockets,\nfilesystems, and more. But Kafka is the most common source in stream processing.\n\nThe trade-off is infrastructure complexity - we need the JobManager and\nTaskManager containers. A streaming job is more like owning a server than\nrunning a batch pipeline - it runs 24/7 and needs monitoring. 
But for anything\nbeyond simple consume-and-write, Flink pays for itself.\n\n\n## The Flink image and services\n\nFlink doesn't come with Python support out of the box. We need a custom\nDocker image with Python, PyFlink, and connector JARs.\n\nDownload the Flink build files:\n\n```bash\nPREFIX=\"https://raw.githubusercontent.com/DataTalksClub/data-engineering-zoomcamp/main/07-streaming/workshop\"\n\nwget ${PREFIX}/Dockerfile.flink\nwget ${PREFIX}/pyproject.flink.toml\nwget ${PREFIX}/flink-config.yaml\n```\n\n> If you cloned the repository, these files are already in the\n> `07-streaming/workshop/` directory.\n\nYou can look at\n[`Dockerfile.flink`](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/07-streaming/workshop/Dockerfile.flink)\nto see what it does:\n\n- Starts from the official Flink image (`flink:2.2.0-scala_2.12-java17`)\n- Installs Python 3.12 and PyFlink via uv\n- Downloads connector JARs (Kafka, JDBC, PostgreSQL driver)\n- Applies a custom Flink config to increase JVM metaspace for PyFlink\n\nNow add the Flink services to `docker-compose.yml`. A Flink cluster has\ntwo types of processes - let's add them one at a time.\n\nThe JobManager is the coordinator. It accepts jobs, manages checkpoints,\nand assigns work to task managers. You interact with it through the web UI\n(port `8081`) and submit jobs via its RPC port (`6123`):\n\n```yaml\n  jobmanager:\n    build:\n      context: .\n      dockerfile: ./Dockerfile.flink\n    image: pyflink-workshop\n    pull_policy: never\n    expose:\n      - \"6123\"\n    ports:\n      - \"8081:8081\"\n    volumes:\n      - ./:/opt/flink/usrlib\n      - ./src/:/opt/src\n    command: jobmanager\n    environment:\n      - |\n        FLINK_PROPERTIES=\n        jobmanager.rpc.address: jobmanager\n        jobmanager.memory.process.size: 1600m\n```\n\n- `build` + `image: pyflink-workshop` - builds our custom Docker image and\n  tags it as `pyflink-workshop`. 
The taskmanager will reuse this same image\n  without rebuilding.\n- `pull_policy: never` - don't try to pull `pyflink-workshop` from Docker Hub\n  (it doesn't exist there - we built it locally).\n- `volumes` - mount the source code into the container so we can submit jobs\n  without rebuilding the image.\n- `FLINK_PROPERTIES` - Flink configuration passed as an environment variable.\n  `jobmanager.rpc.address: jobmanager` tells Flink where the coordinator\n  lives (`jobmanager` is the Docker service name).\n\nThe TaskManager is the worker. It executes the actual data processing:\n\n```yaml\n  taskmanager:\n    image: pyflink-workshop\n    pull_policy: never\n    expose:\n      - \"6121\"\n      - \"6122\"\n    volumes:\n      - ./:/opt/flink/usrlib\n      - ./src/:/opt/src\n    depends_on:\n      - jobmanager\n    command: taskmanager --taskmanager.registration.timeout 5 min\n    environment:\n      - |\n        FLINK_PROPERTIES=\n        jobmanager.rpc.address: jobmanager\n        taskmanager.memory.process.size: 1728m\n        taskmanager.numberOfTaskSlots: 15\n        parallelism.default: 3\n```\n\n- `image: pyflink-workshop` - reuses the image built by the jobmanager\n  service, no `build` needed.\n- `depends_on: jobmanager` - start after the jobmanager.\n- `--taskmanager.registration.timeout 5 min` - give the task manager\n  5 minutes to find the job manager on startup (useful when services start\n  in parallel).\n- `taskmanager.numberOfTaskSlots: 15` - this task manager has 15 slots.\n- `parallelism.default: 3` - by default, each pipeline stage runs 3 copies\n  processing data in parallel.\n\nA task slot is a unit of resources (memory, CPU) that can run one parallel\ninstance of a pipeline stage. Think of slots like lanes on a highway - more\nlanes means more data can flow through at once. If you submit a job with\nparallelism 3, that job uses 3 slots. With 15 slots available, you can run\n5 such jobs simultaneously on this single task manager. 
In production, you'd\nhave multiple task managers across different machines, each contributing\nslots to the cluster. The job manager decides which slots run which parts\nof which jobs.\n\nMake sure `src/` exists before starting Docker - the volume mount\n`./src/:/opt/src` will create it as root if it doesn't exist, causing\npermission issues later when you try to create files inside it:\n\n```bash\nmkdir -p src/job\n```\n\nBuild the Flink image and start all services:\n\n```bash\ndocker compose up --build -d\n```\n\nThe first build takes a few minutes - it installs Python, PyFlink, and downloads\nthe connector JARs.\n\nVerify all four services are running:\n\n```bash\ndocker compose ps\n```\n\n```\nNAME                  IMAGE                           SERVICE        STATUS\nworkshop-jobmanager   pyflink-workshop                jobmanager     Up\nworkshop-taskmanager  pyflink-workshop                taskmanager    Up\nworkshop-postgres     postgres:18                     postgres       Up\nworkshop-redpanda     redpandadata/redpanda:v25.3.9   redpanda       Up\n```\n\nCheck the Flink dashboard at [http://localhost:8081](http://localhost:8081) -\nyou should see 1 task manager with 15 available task slots.\n\n\n## The pass-through Flink job\n\nNow let's do the same thing our Python consumer did, but with Flink.\n\nUnlike the producer and consumer scripts, Flink jobs can't run from a\nJupyter notebook. They are submitted to the Flink cluster as .py files\nusing `docker compose exec`. 
We cover how job submission works in\nproduction in the \"Flink in production\" section at the end.\n\nCreate `src/job/pass_through_job.py`.\n\nThe Kafka source table:\n\n```python\ndef create_events_source_kafka(t_env):\n    table_name = \"events\"\n    source_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            trip_distance DOUBLE,\n            total_amount DOUBLE,\n            tpep_pickup_datetime BIGINT\n        ) WITH (\n            'connector' = 'kafka',\n            'properties.bootstrap.servers' = 'redpanda:29092',\n            'topic' = 'rides',\n            'scan.startup.mode' = 'latest-offset',\n            'properties.auto.offset.reset' = 'latest',\n            'format' = 'json'\n        );\n        \"\"\"\n    t_env.execute_sql(source_ddl)\n    return table_name\n```\n\nThis is a Flink SQL DDL statement. Breaking it down:\n\n- `PULocationID`, `DOLocationID`, `trip_distance`, `total_amount`,\n  `tpep_pickup_datetime` - the JSON fields from our producer\n- `'properties.bootstrap.servers' = 'redpanda:29092'` - the internal Docker\n  network address (not `localhost` - Flink runs inside Docker)\n- `'scan.startup.mode' = 'latest-offset'` - only read new messages arriving\n  after the job starts\n- `'format' = 'json'` - Flink deserializes JSON automatically\n\nThe PostgreSQL sink table:\n\n```python\ndef create_processed_events_sink_postgres(t_env):\n    table_name = 'processed_events'\n    sink_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            trip_distance DOUBLE,\n            total_amount DOUBLE,\n            pickup_datetime TIMESTAMP\n        ) WITH (\n            'connector' = 'jdbc',\n            'url' = 'jdbc:postgresql://postgres:5432/postgres',\n            'table-name' = '{table_name}',\n            'username' = 'postgres',\n            'password' = 'postgres',\n            'driver' = 
'org.postgresql.Driver'\n        );\n        \"\"\"\n    t_env.execute_sql(sink_ddl)\n    return table_name\n```\n\nNo psycopg2, no INSERT statements - just declare the table and Flink handles\nthe rest.\n\nThe execution:\n\n```python\nfrom pyflink.datastream import StreamExecutionEnvironment\nfrom pyflink.table import EnvironmentSettings, StreamTableEnvironment\n\ndef log_processing():\n    env = StreamExecutionEnvironment.get_execution_environment()\n    env.enable_checkpointing(10 * 1000)  # checkpoint every 10 seconds\n\n    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()\n    t_env = StreamTableEnvironment.create(env, environment_settings=settings)\n\n    source_table = create_events_source_kafka(t_env)\n    postgres_sink = create_processed_events_sink_postgres(t_env)\n\n    t_env.execute_sql(\n        f\"\"\"\n        INSERT INTO {postgres_sink}\n        SELECT\n            PULocationID,\n            DOLocationID,\n            trip_distance,\n            total_amount,\n            TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3) as pickup_datetime\n        FROM {source_table}\n        \"\"\"\n    ).wait()\n\nif __name__ == '__main__':\n    log_processing()\n```\n\n- Streaming mode - the job runs continuously, waiting for new data\n- The `INSERT INTO ... SELECT` is the pipeline - read from Kafka, convert the\n  timestamp, write to PostgreSQL\n\n`enable_checkpointing(10 * 1000)` tells Flink to take a snapshot of the\njob's state every 10 seconds. A checkpoint captures the Kafka offsets (how\nfar Flink has read) and any in-flight data. If the job crashes, it resumes\nfrom the last checkpoint instead of starting from the beginning.\n\nCheckpointing gets especially important with windows. If you have a\n5-minute window and the job fails 2 minutes in, Flink doesn't just track\nthe offset - it also serializes the open windows to disk. 
When it\nrestarts, it picks up right where it left off, with the partially-filled\nwindow intact.\n\nThe trade-off is resilience versus efficiency. Checkpointing every 1 second\nis expensive - Flink has to serialize and persist the entire state that\noften. Checkpointing every 10 minutes means you could lose up to 10 minutes\nof progress on failure. 10 seconds is a reasonable default for most jobs.\n\nSubmit the job:\n\n```bash\ndocker compose exec jobmanager ./bin/flink run \\\n    -py /opt/src/job/pass_through_job.py \\\n    --pyFiles /opt/src -d\n```\n\n```\nJob has been submitted with JobID 663cff6811b65e97fc1e068d641401f4\n```\n\nCheck the Flink UI at [http://localhost:8081](http://localhost:8081) - you should\nsee a running job.\n\nSince the job uses `latest-offset`, it's waiting for new messages. Send data:\n\n```bash\nuv run python src/producers/producer.py\n```\n\nQuery PostgreSQL:\n\n```sql\nSELECT count(*) FROM processed_events;\n```\n\nCompare this to our Python consumer approach - same result, but Flink handles\ncheckpointing, offset management, and PostgreSQL writes automatically.\n\n\n## Offsets - earliest vs latest\n\nWhen Flink connects to Kafka, it needs to know where to start reading. This\nis the `scan.startup.mode` setting:\n\n| Mode | Behavior |\n|---|---|\n| `latest-offset` | Only read messages arriving after the job starts |\n| `earliest-offset` | Read everything from the beginning of the topic |\n| `timestamp` | Start from a specific point in time |\n\n`earliest` is typically used for backfilling or restating data - you're\nusing Flink to process data that's been sitting in Kafka for a while, not\nreal-time data. `latest` is the more common production setting - the job\nstarts up and only processes new events as people click buttons on your\nwebsite or whatever event feed you're consuming.\n\nOur pass-through job uses `latest-offset`. Let's see what happens with\n`earliest-offset`:\n\n1. 
Cancel the running job from the Flink UI (click on the job, then Cancel)\n2. Clear the table:\n   ```sql\n   TRUNCATE processed_events;\n   ```\n3. Edit `src/job/pass_through_job.py` - change both offset settings:\n   ```\n   'scan.startup.mode' = 'earliest-offset',\n   'properties.auto.offset.reset' = 'earliest',\n   ```\n4. Resubmit:\n   ```bash\n   docker compose exec jobmanager ./bin/flink run \\\n       -py /opt/src/job/pass_through_job.py \\\n       --pyFiles /opt/src -d\n   ```\n5. Wait 15 seconds, then check:\n   ```sql\n   SELECT count(*) FROM processed_events;\n   ```\n\nFlink reads all messages from the topic - including data from previous producer\nruns. If you ran the producer twice before, you'll see ~2000 rows (duplicates\nof everything already processed).\n\nWhy duplicates? Checkpoints are scoped to a specific job instance. When you\ncancel and resubmit, it's a brand new job that knows nothing about previous\ncheckpoints. With `earliest-offset`, it starts from scratch. The offset\nsetting only matters at startup - once the job is running, checkpointing\ntakes over and tracks progress. But if you kill the job and create a new\none, those checkpoints are gone.\n\nThere is a third option - `timestamp` mode. If your job was running fine\nuntil 2:00 PM and then crashed, you can restart it from exactly 2:00 PM.\nThis is useful for recovering from failures without reprocessing everything\nfrom the beginning or missing the data that arrived while the job was down.\n\nA common production pattern (Lambda architecture): run your streaming job with\n`latest-offset` for real-time results, and if it goes down, use a separate\nbatch job to backfill the gap. This way the streaming job stays fast and you\ndon't lose data.\n\n> Change the offset back to `latest-offset` when you're done experimenting.\n\n\n## Aggregation with tumbling windows\n\nNow let's do something our plain Python consumer can't easily do - windowed\naggregation. 
We'll count taxi trips and sum revenue by pickup location per hour.\n\nFirst, cancel any running jobs. Then create the aggregation table in PostgreSQL:\n\n```sql\nCREATE TABLE processed_events_aggregated (\n    window_start TIMESTAMP,\n    PULocationID INTEGER,\n    num_trips BIGINT,\n    total_revenue DOUBLE PRECISION,\n    PRIMARY KEY (window_start, PULocationID)\n);\n```\n\nTwo important design choices:\n\n1. `PULocationID` is included - we group by both time window and pickup\n   location, so both appear in the output.\n2. `PRIMARY KEY` - enables upsert behavior. When Flink sends updated counts\n   for the same window, PostgreSQL updates the existing row instead of creating\n   a duplicate. This matters because late-arriving events can cause Flink to\n   re-evaluate a window it already emitted results for. With upsert, the\n   corrected count replaces the old one automatically.\n\nNow create `src/job/aggregation_job.py`:\n\n```python\nfrom pyflink.datastream import StreamExecutionEnvironment\nfrom pyflink.table import EnvironmentSettings, StreamTableEnvironment\n\n\ndef create_events_source_kafka(t_env):\n    table_name = \"events\"\n    source_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            trip_distance DOUBLE,\n            total_amount DOUBLE,\n            tpep_pickup_datetime BIGINT,\n            event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),\n            WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND\n        ) WITH (\n            'connector' = 'kafka',\n            'properties.bootstrap.servers' = 'redpanda:29092',\n            'topic' = 'rides',\n            'scan.startup.mode' = 'earliest-offset',\n            'properties.auto.offset.reset' = 'earliest',\n            'format' = 'json'\n        );\n        \"\"\"\n    t_env.execute_sql(source_ddl)\n    return table_name\n\n\ndef create_events_aggregated_sink(t_env):\n    table_name 
= 'processed_events_aggregated'\n    sink_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            window_start TIMESTAMP(3),\n            PULocationID INT,\n            num_trips BIGINT,\n            total_revenue DOUBLE,\n            PRIMARY KEY (window_start, PULocationID) NOT ENFORCED\n        ) WITH (\n            'connector' = 'jdbc',\n            'url' = 'jdbc:postgresql://postgres:5432/postgres',\n            'table-name' = '{table_name}',\n            'username' = 'postgres',\n            'password' = 'postgres',\n            'driver' = 'org.postgresql.Driver'\n        );\n        \"\"\"\n    t_env.execute_sql(sink_ddl)\n    return table_name\n\n\ndef log_aggregation():\n    env = StreamExecutionEnvironment.get_execution_environment()\n    env.enable_checkpointing(10 * 1000)\n    env.set_parallelism(3)\n\n    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()\n    t_env = StreamTableEnvironment.create(env, environment_settings=settings)\n\n    try:\n        source_table = create_events_source_kafka(t_env)\n        aggregated_table = create_events_aggregated_sink(t_env)\n\n        t_env.execute_sql(f\"\"\"\n        INSERT INTO {aggregated_table}\n        SELECT\n            window_start,\n            PULocationID,\n            COUNT(*) AS num_trips,\n            SUM(total_amount) AS total_revenue\n        FROM TABLE(\n            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '1' HOUR)\n        )\n        GROUP BY window_start, PULocationID;\n\n        \"\"\").wait()\n\n    except Exception as e:\n        print(\"Writing records from Kafka to JDBC failed:\", str(e))\n\n\nif __name__ == '__main__':\n    log_aggregation()\n```\n\nThe Kafka source table has two new lines compared to the pass-through job:\n\n- `event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3)` - a computed\n  column that converts epoch milliseconds to a timestamp. 
The `3` tells\n  Flink the input is in epoch milliseconds (`0` would mean seconds).\n- `WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND` -\n  tells Flink when to publish window results.\n\nThe window defines WHAT you're counting - a 1-hour bucket of taxi trips.\nBut in a stream, events keep arriving. How does Flink know when to stop\nwaiting and publish the count for the 2 PM - 3 PM hour? It can't just\nlook at the clock because some events arrive late. Without a trigger,\nFlink would accumulate data forever and never write anything to PostgreSQL.\n\nThe watermark is that trigger. It tells Flink when to publish. In the SQL:\n\n```\nWATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '5' SECOND\n                                                   ^^^^^^^^^^^^^^^^^^^\n                                                   patience = 5 seconds\n```\n\nThe watermark is always 5 seconds behind the latest event timestamp Flink\nhas seen. When the watermark passes the end of a window, Flink publishes\nthat window's results. The 5 seconds is patience for stragglers - events\nthat happened before the window ended but arrived a few seconds late.\n\nThree pieces working together:\n\n- Window = what bucket to count into (1 hour)\n- Watermark = when to publish the result (the trigger)\n- Upsert (PRIMARY KEY) = safety net that corrects the result if something\n  arrives after publishing\n\nHere's a concrete example. Two taxi pickups in East Village (PU=79) with\na 10-second window and 5-second watermark. 
Event A is on time, Event B is\n8 seconds late (the rider's phone lost signal in a tunnel).\n\nEvent B arrives late, but Flink hasn't published yet - both events counted:\n\n```mermaid\nsequenceDiagram\n    participant P as Producer\n    participant K as Kafka\n    participant F as Flink\n    participant PG as PostgreSQL\n\n    P->>K: Event A (ts=14:00:07, on time)\n    K->>F: Event A\n    Note over F: watermark = 00:02<br/>window [00:00, 00:10) not published yet<br/>A added to window\n\n    Note over P: 5 seconds pass, phone reconnects\n\n    P->>K: Event B (ts=14:00:04, 8s late)\n    K->>F: Event B\n    Note over F: watermark = 00:07<br/>window [00:00, 00:10) still not published<br/>B added to window\n\n    Note over F: more events arrive<br/>watermark reaches 00:10<br/>time to publish\n\n    F->>PG: INSERT (window=00:00, PU=79, trips=2)\n    Note over PG: both events counted\n```\n\nEvent B arrived late, but within Flink's patience window. Flink hadn't\npublished the result yet, so B was included in the count.\n\nNow what if Event B were 20 seconds late - arriving after Flink already\npublished?\n\n```mermaid\nsequenceDiagram\n    participant P as Producer\n    participant K as Kafka\n    participant F as Flink\n    participant PG as PostgreSQL\n\n    P->>K: Event A (ts=14:00:07, on time)\n    K->>F: Event A\n    Note over F: A added to window [00:00, 00:10)\n\n    Note over F: watermark reaches 00:10<br/>time to publish\n\n    F->>PG: INSERT (window=00:00, PU=79, trips=1)\n    Note over PG: published with trips=1\n\n    Note over P: 20 seconds later, phone reconnects\n\n    P->>K: Event B (ts=14:00:04, 20s late)\n    K->>F: Event B\n    Note over F: window [00:00, 00:10) already published<br/>but B still belongs to it\n\n    F->>PG: UPDATE (window=00:00, PU=79, trips=2)\n    Note over PG: upsert via PRIMARY KEY<br/>corrected from 1 to 2\n```\n\nFlink already published trips=1, but when Event B finally arrives, the\nPRIMARY KEY lets Flink send a correction. 
PostgreSQL updates the row\nfrom 1 to 2. Without the PRIMARY KEY (an append-only sink), Event B\nwould be lost - Flink can't re-open a published window in append mode.\n\nThe trade-off is latency vs completeness. A larger watermark means more\npatience for late events, but you wait longer before seeing any results.\n5 seconds is a reasonable default. In production, you'd tune this based\non how out-of-order your data actually is.\n\nOther differences from the pass-through job:\n\n- The sink has a `PRIMARY KEY` with `NOT ENFORCED` - this enables upsert\n  behavior in the Flink JDBC connector.\n- `earliest-offset` - reads all existing data from Kafka.\n- `env.set_parallelism(3)` - runs 3 copies processing data in parallel.\n- The `TUMBLE` function creates fixed-size, non-overlapping windows.\n  `DESCRIPTOR(event_timestamp)` must reference the column with the `WATERMARK`\n  defined on it, and `INTERVAL '1' HOUR` sets the window size.\n\nSubmit and test:\n\n```bash\ndocker compose exec jobmanager ./bin/flink run \\\n    -py /opt/src/job/aggregation_job.py \\\n    --pyFiles /opt/src -d\n```\n\nSend data:\n\n```bash\nuv run python src/producers/producer.py\n```\n\nWait ~15 seconds for the windows to close, then check:\n\n```sql\nSELECT window_start, count(*) as locations, sum(num_trips) as total_trips,\n       round(sum(total_revenue)::numeric, 2) as revenue\nFROM processed_events_aggregated\nGROUP BY window_start\nORDER BY window_start;\n```\n\n```\n     window_start     | locations | total_trips | revenue\n----------------------+-----------+-------------+---------\n 2025-11-01 00:00:00  |        ...\n 2025-11-01 01:00:00  |        ...\n ...\n```\n\nThe 1000 taxi trips were grouped into 1-hour tumbling windows by pickup\nlocation. 
Each row shows how many locations had trips in that hour and the\ntotal number of trips.\n\nTry this with a plain Python consumer - you'd need to implement the windowing\nlogic, handle late events, manage state, and write the upsert SQL yourself.\nWith Flink, it's a SQL query.\n\n\n## Late events and upserts\n\nThe CSV producer sends events in order, so the watermark never has to\nhandle late arrivals. Let's use a real-time producer that generates\nsynthetic events with occasional delays to see what happens.\n\nDownload and run the real-time producer:\n\n```bash\nPREFIX=\"https://raw.githubusercontent.com/DataTalksClub/data-engineering-zoomcamp/main/07-streaming/workshop\"\nwget ${PREFIX}/src/producers/producer_realtime.py -P src/producers/\n```\n\n```bash\nuv run python src/producers/producer_realtime.py\n```\n\nIt generates random taxi trips with current timestamps, but ~20% of events\nare sent with a timestamp 3-10 seconds in the past (simulating network\ndelays). The output labels each event:\n\n```\n  on time   -> PU=79 ts=14:23:05\n  on time   -> PU=107 ts=14:23:05\n  LATE (8s) -> PU=234 ts=14:22:58\n  on time   -> PU=48 ts=14:23:06\n```\n\nWith our 5-second watermark and 1-hour windows, no events will be dropped -\neven an event 10 seconds late lands well within the current hour window.\nBut the watermark + upsert behavior is still visible: Flink first emits\nwindow results when the watermark passes the window end, then late events\nupdate those results via the PRIMARY KEY.\n\nTo see this in action, open two terminals:\n\nTerminal 1 - run the real-time producer:\n\n```bash\nuv run python src/producers/producer_realtime.py\n```\n\nTerminal 2 - watch aggregation counts change:\n\n```bash\nwatch -n 1 'PGPASSWORD=postgres docker compose exec postgres psql -U postgres -d postgres -c \"SELECT window_start, sum(num_trips) as trips, round(sum(total_revenue)::numeric, 2) as revenue FROM processed_events_aggregated GROUP BY window_start ORDER BY 
window_start;\"'\n```\n\nYou'll see the counts for older windows increase as late events arrive\nand update the aggregation via upsert. This is why we set up the PRIMARY\nKEY - without it, late events would either be dropped or create duplicates.\n\n\n## Understanding window types\n\nWe used tumbling windows above. Flink supports three main types:\n\n### Tumbling windows\n\nFixed-size, non-overlapping. Every event belongs to exactly one window.\nIf you come from the batch world, tumbling windows are the most familiar -\nthey just cut up your data into fixed segments. It's essentially a way to\nspeed up batch processing.\n\n```\n|  Window 1  |  Window 2  |  Window 3  |\n|  1 hour    |  1 hour    |  1 hour    |\n```\n\nUse case: counting trips per hour, daily revenue summaries.\n\n### Sliding windows\n\nFixed-size, overlapping. An event can belong to multiple windows. When\npeople think of a 1-hour window, they usually picture 00:00-01:00. But\nthere's also 00:15-01:15, 00:30-01:30 - those are also 1-hour windows, just\nstarting at different points. Sliding windows capture all of them.\n\n```\n|--- Window 1 (1 hour) ---|\n      |--- Window 2 (1 hour) ---|\n            |--- Window 3 (1 hour) ---|\n      <- 15 min slide ->\n```\n\n```sql\nHOP(TABLE events, DESCRIPTOR(event_timestamp), INTERVAL '15' MINUTE, INTERVAL '1' HOUR)\n```\n\nUse case: finding peaks and valleys - \"what was our peak traffic in any\n1-hour window?\" These overlapping windows let you find the moment in time\nwhere you have the highest or lowest values. Good for min-maxing, moving\naverages, and surge detection (e.g., ride-share surge pricing).\n\n### Session windows\n\nDynamic windows based on inactivity gaps. 
Unlike tumbling and sliding\nwindows, the window size isn't fixed - the window doesn't close at a\nspecified time, it closes after a specified amount of inactivity.\n\n```\n|--events--| gap |--events------| gap |--events--|\n| Session 1|     |  Session 2   |     | Session 3|\n```\n\nUse case: grouping user behavior together. Imagine a user logs into an app,\nclicks a bunch of buttons, leaves for 2 minutes, then comes back - that's\nstill technically the same session. You set a session gap (say, 30 minutes\nof inactivity) and Flink groups all the events within that session together.\nSessionization is very powerful for behavioral analytics.\n\n\n## Cleanup\n\nStop and remove all containers:\n\n```bash\ndocker compose down\n```\n\nTo also remove the PostgreSQL data volume:\n\n```bash\ndocker compose down -v\n```\n\n\n## Q&A\n\nQuestions and answers from the\n[2025 stream with Zach Wilson](https://www.youtube.com/watch?v=P2loELMUUeI).\n\n### What happens when a Flink job dies and restarts? Does it reprocess everything?\n\nThe `earliest` offset setting is only for the initial startup. If the job\nrestarts (not re-submitted as a new job), it uses checkpointing to resume\nfrom the last snapshot. Without checkpointing, you either reprocess\neverything (with `earliest`) or skip data (with `latest`).\n\nThe catch: checkpoints are scoped to a specific job instance. If you\ncompletely kill a job and submit a new one, the new job has no knowledge of\nthe previous checkpoints. To preserve state across redeployments, restart\nthe existing job rather than creating a new one.\n\n### Why can't we just use Kafka consumers? What does Flink actually add?\n\nFor simple pass-through (read a message, write it somewhere), a Kafka\nconsumer is fine. 
For anything involving time windows, watermarks,\ncheckpointing, or parallel processing, Flink saves you from building all\nthat yourself.\n\nYou can do windowing, watermarking, late data handling, and job recovery\nwith a plain consumer - go ahead and manage it yourself. But as Zach puts\nit: \"good luck.\" With a plain consumer, you'd also need to track\ncheckpoints yourself - save the latest processed timestamp to a file or\ndatabase and manage it on every restart. Flink keeps the state for you.\n\nIt's like asking \"why use Spark when you can use Pandas?\" You can, but\nPandas won't work at higher scale in a distributed way.\n\n### What happens with events delayed beyond the watermark (the \"tunnel\" scenario)?\n\nThere are two types of lateness. The watermark handles acceptable lateness -\nsmall delays where events arrive a few seconds late. For events arriving\nmuch later (like after a 5-minute tunnel), Flink has an allowed lateness\nparameter.\n\nBy default, allowed lateness is zero - events arriving after the watermark\ncloses a window are discarded. If you set allowed lateness to 10 minutes,\nFlink will go back, find the old closed window, create a new aggregation\nwith the late event, and send it to the sink as a brand new record. This\nmeans you need deduplication logic on the sink side (a primary key with\nupsert behavior - exactly what we set up in the aggregation section).\n\nThe trade-off: allowed lateness requires Flink to hold all those windows\non disk for the duration of the tolerance.\n\n### When do we actually need streaming? For many things micro-batch is enough.\n\nThe key question: is something going to happen in real time on the other\nside? If there is an automated process that will change something based on\nthe data, streaming is a great choice. 
If a human is just looking at data,\nreal-time is unnecessary and micro-batch is easier to maintain.\n\nIn 10 years as a data engineer, Zach had literally two use cases that\ngenuinely needed streaming - Netflix fraud/security detection (5 minutes of\ndelay means 5 more minutes of a hacked account) and Airbnb surge pricing\n(supply and demand change rapidly). Everything else was daily batch, or\nhourly/every-15-minute micro-batch for lower-latency needs.\n\nBefore committing to streaming, consider the operational cost. A streaming\njob runs 24/7 - if it breaks at 3 AM, someone needs to fix it. If you're\nthe only person on the team who understands Flink, you'll be on-call for\nit forever. Talk to your manager before implementing streaming - you'll\nneed to teach your entire team before you can share the on-call burden.\n\n### Spark Streaming vs Flink Streaming?\n\nThey are fundamentally different today but will likely converge. The key\ndifference: Spark Streaming is micro-batch - it pulses every 15-30 seconds,\npulling data in small batches (pull architecture). Flink is genuine\ncontinuous processing - events flow through as they arrive (push\narchitecture). For most use cases the difference is negligible, but Flink\nhas lower latency for truly real-time needs.\n\nFor micro-batch intervals, Zach finds every-5-minutes too frequent with\nSpark because startup alone takes about a minute, making the\noverhead-to-work ratio poor. His sweet spots are hourly and every 15\nminutes.\n\n### How does job submission work in production?\n\nIn this workshop we mount local files into Docker and submit jobs with\n`docker compose exec` - that's a development convenience. In production,\njob submission looks different depending on the deployment:\n\n- Managed services (Amazon Managed Service for Apache Flink, formerly\n  Kinesis Data Analytics, or Confluent Cloud) - you upload a JAR or\n  Python zip through a web console or CLI. 
The service handles the cluster.\n- Self-hosted Flink on Kubernetes - you typically build a Docker image with\n  your job code baked in, or use the Flink Kubernetes Operator, which pulls\n  job artifacts from S3/GCS at startup.\n- Standalone Flink cluster - you use the `flink run` CLI pointing to a\n  local file or an HTTP/S3 URL. CI/CD pipelines often upload the job\n  artifact to S3 and then call `flink run` with that URL.\n\nThe common pattern: your code lives in git, CI builds an artifact (JAR,\nPython zip, or Docker image), pushes it to a registry or object store, and\nthen triggers the Flink cluster to pick it up.\n"
  },
  {
    "path": "07-streaming/workshop/docker-compose.yml",
    "content": "services:\n  redpanda:\n    image: redpandadata/redpanda:v25.3.9\n    command:\n      - redpanda\n      - start\n      - --smp\n      - '1'\n      - --reserve-memory\n      - 0M\n      - --overprovisioned\n      - --node-id\n      - '1'\n      - --kafka-addr\n      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092\n      - --advertise-kafka-addr\n      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092\n      - --pandaproxy-addr\n      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082\n      - --advertise-pandaproxy-addr\n      - PLAINTEXT://redpanda:28082,OUTSIDE://localhost:8082\n      - --rpc-addr\n      - 0.0.0.0:33145\n      - --advertise-rpc-addr\n      - redpanda:33145\n    ports:\n      - 8082:8082\n      - 9092:9092\n      - 28082:28082\n      - 29092:29092\n\n  jobmanager:\n    build:\n      context: .\n      dockerfile: ./Dockerfile.flink\n    image: pyflink-workshop\n    pull_policy: never\n    expose:\n      - \"6123\"\n    ports:\n      - \"8081:8081\"\n    volumes:\n      - ./:/opt/flink/usrlib\n      - ./src/:/opt/src\n    command: jobmanager\n    environment:\n      - |\n        FLINK_PROPERTIES=\n        jobmanager.rpc.address: jobmanager\n        jobmanager.memory.process.size: 1600m\n\n  taskmanager:\n    image: pyflink-workshop\n    pull_policy: never\n    expose:\n      - \"6121\"\n      - \"6122\"\n    volumes:\n      - ./:/opt/flink/usrlib\n      - ./src/:/opt/src\n    depends_on:\n      - jobmanager\n    command: taskmanager --taskmanager.registration.timeout 5 min\n    environment:\n      - |\n        FLINK_PROPERTIES=\n        jobmanager.rpc.address: jobmanager\n        taskmanager.memory.process.size: 1728m\n        taskmanager.numberOfTaskSlots: 15\n        parallelism.default: 3\n\n  postgres:\n    image: postgres:18\n    restart: on-failure\n    environment:\n      - POSTGRES_DB=postgres\n      - POSTGRES_USER=postgres\n      - POSTGRES_PASSWORD=postgres\n    ports:\n      - \"5432:5432\"\n"
  },
  {
    "path": "07-streaming/workshop/flink-config.yaml",
    "content": "# Custom Flink config for PyFlink workshop.\n# Original: https://github.com/apache/flink/blob/release-2.2/flink-dist/src/main/resources/config.yaml\n# Changes from default:\n#   1. Added taskmanager.memory.jvm-metaspace.size: 512m (PyFlink needs more metaspace)\n#   2. Removed --add-exports=jdk.compiler/... from env.java.opts.all\n#      (jdk.compiler module is not present in the JRE, causing warnings on every command)\n\nblob:\n  server:\n    port: '6124'\ntaskmanager:\n  memory:\n    process:\n      size: 1728m\n    jvm-metaspace:\n      size: 512m  # added for PyFlink\n  bind-host: 0.0.0.0\n  numberOfTaskSlots: 15\njobmanager:\n  execution:\n    failover-strategy: region\n  rpc:\n    address: jobmanager\n    port: 6123\n  memory:\n    process:\n      size: 1600m\n  bind-host: 0.0.0.0\nquery:\n  server:\n    port: '6125'\nparallelism:\n  default: 1\nrest:\n  address: 0.0.0.0\nenv:\n  java:\n    opts:\n      all: >-\n        --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED\n        --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED\n        --add-opens=java.base/java.lang=ALL-UNNAMED\n        --add-opens=java.base/java.net=ALL-UNNAMED\n        --add-opens=java.base/java.io=ALL-UNNAMED\n        --add-opens=java.base/java.nio=ALL-UNNAMED\n        --add-opens=java.base/sun.nio.ch=ALL-UNNAMED\n        --add-opens=java.base/java.lang.reflect=ALL-UNNAMED\n        --add-opens=java.base/java.text=ALL-UNNAMED\n        --add-opens=java.base/java.time=ALL-UNNAMED\n        --add-opens=java.base/java.util=ALL-UNNAMED\n        --add-opens=java.base/java.util.concurrent=ALL-UNNAMED\n        --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED\n        --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED\n"
  },
  {
    "path": "07-streaming/workshop/live/.gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[codz]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py.cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# UV\n#   Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.\n#   This is especially recommended for binary packages to ensure reproducibility, and is more\n#   commonly ignored for libraries.\n#uv.lock\n\n# poetry\n#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in 
version control.\n#   This is especially recommended for binary packages to ensure reproducibility, and is more\n#   commonly ignored for libraries.\n#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control\n#poetry.lock\n#poetry.toml\n\n# pdm\n#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.\n#   pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.\n#   https://pdm-project.org/en/latest/usage/project/#working-with-version-control\n#pdm.lock\n#pdm.toml\n.pdm-python\n.pdm-build/\n\n# pixi\n#   Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.\n#pixi.lock\n#   Pixi creates a virtual environment in the .pixi directory, just like venv module creates one\n#   in the .venv directory. It is recommended not to include this directory in version control.\n.pixi\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.envrc\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# PyCharm\n#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can\n#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore\n#  and can be added to the global gitignore or merged into this file.  
For a more nuclear\n#  option (not recommended) you can uncomment the following to ignore the entire idea folder.\n#.idea/\n\n# Abstra\n# Abstra is an AI-powered process automation framework.\n# Ignore directories containing user credentials, local state, and settings.\n# Learn more at https://abstra.io/docs\n.abstra/\n\n# Visual Studio Code\n#  Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore \n#  that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore\n#  and can be added to the global gitignore or merged into this file. However, if you prefer, \n#  you could uncomment the following to ignore the entire vscode folder\n# .vscode/\n\n# Ruff stuff:\n.ruff_cache/\n\n# PyPI configuration file\n.pypirc\n\n# Cursor\n#  Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to\n#  exclude from AI features like autocomplete and code analysis. Recommended for sensitive data\n#  refer to https://docs.cursor.com/context/ignore-files\n.cursorignore\n.cursorindexingignore\n\n# Marimo\nmarimo/_static/\nmarimo/_lsp/\n__marimo__/\n"
  },
  {
    "path": "07-streaming/workshop/live/.python-version",
    "content": "3.12\n"
  },
  {
    "path": "07-streaming/workshop/live/Dockerfile.flink",
    "content": "FROM flink:2.2.0-scala_2.12-java17\n\nCOPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/\n\n# ref: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker\n\nWORKDIR /opt/pyflink\nCOPY pyproject.flink.toml pyproject.toml\nRUN uv python install 3.12 && uv sync\nENV PATH=\"/opt/pyflink/.venv/bin:$PATH\"\n\n# Download connector libraries\n\nWORKDIR /opt/flink/lib\nRUN wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/2.2.0/flink-json-2.2.0.jar; \\\n    wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/4.0.1-2.0/flink-sql-connector-kafka-4.0.1-2.0.jar; \\\n    wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-core/4.0.0-2.0/flink-connector-jdbc-core-4.0.0-2.0.jar; \\\n    wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc-postgres/4.0.0-2.0/flink-connector-jdbc-postgres-4.0.0-2.0.jar; \\\n    wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.10/postgresql-42.7.10.jar\n\nCOPY flink-config.yaml /opt/flink/conf/config.yaml\n\nWORKDIR /opt/flink\n"
  },
  {
    "path": "07-streaming/workshop/live/README.md",
    "content": "# streaming-workshop"
  },
  {
    "path": "07-streaming/workshop/live/docker-compose.yaml",
    "content": "services:\n  redpanda:\n    image: redpandadata/redpanda:v25.3.9\n    command:\n      - redpanda\n      - start\n      - --smp\n      - '1'\n      - --reserve-memory\n      - 0M\n      - --overprovisioned\n      - --node-id\n      - '1'\n      - --kafka-addr\n      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092\n      - --advertise-kafka-addr\n      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092\n      - --pandaproxy-addr\n      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082\n      - --advertise-pandaproxy-addr\n      - PLAINTEXT://redpanda:28082,OUTSIDE://localhost:8082\n      - --rpc-addr\n      - 0.0.0.0:33145\n      - --advertise-rpc-addr\n      - redpanda:33145\n    ports:\n      - 8082:8082\n      - 9092:9092\n      - 28082:28082\n      - 29092:29092\n\n  postgres:\n    image: postgres:18\n    restart: on-failure\n    environment:\n      - POSTGRES_DB=postgres\n      - POSTGRES_USER=postgres\n      - POSTGRES_PASSWORD=postgres\n    ports:\n      - \"5432:5432\"\n\n  jobmanager:\n    build:\n      context: .\n      dockerfile: ./Dockerfile.flink\n    image: pyflink-workshop\n    pull_policy: never\n    expose:\n      - \"6123\"\n    ports:\n      - \"8081:8081\"\n    volumes:\n      - ./:/opt/flink/usrlib\n      - ./src/:/opt/src\n    command: jobmanager\n    environment:\n      - |\n        FLINK_PROPERTIES=\n        jobmanager.rpc.address: jobmanager\n        jobmanager.memory.process.size: 1600m\n\n  taskmanager:\n    image: pyflink-workshop\n    pull_policy: never\n    expose:\n      - \"6121\"\n      - \"6122\"\n    volumes:\n      - ./:/opt/flink/usrlib\n      - ./src/:/opt/src\n    depends_on:\n      - jobmanager\n    command: taskmanager --taskmanager.registration.timeout 5 min\n    environment:\n      - |\n        FLINK_PROPERTIES=\n        jobmanager.rpc.address: jobmanager\n        taskmanager.memory.process.size: 1728m\n        taskmanager.numberOfTaskSlots: 15\n        parallelism.default: 3"
  },
  {
    "path": "07-streaming/workshop/live/flink-config.yaml",
    "content": "# Custom Flink config for PyFlink workshop.\n# Original: https://github.com/apache/flink/blob/release-2.2/flink-dist/src/main/resources/config.yaml\n# Changes from default:\n#   1. Added taskmanager.memory.jvm-metaspace.size: 512m (PyFlink needs more metaspace)\n#   2. Removed --add-exports=jdk.compiler/... from env.java.opts.all\n#      (jdk.compiler module is not present in the JRE, causing warnings on every command)\n\nblob:\n  server:\n    port: '6124'\ntaskmanager:\n  memory:\n    process:\n      size: 1728m\n    jvm-metaspace:\n      size: 512m  # added for PyFlink\n  bind-host: 0.0.0.0\n  numberOfTaskSlots: 15\njobmanager:\n  execution:\n    failover-strategy: region\n  rpc:\n    address: jobmanager\n    port: 6123\n  memory:\n    process:\n      size: 1600m\n  bind-host: 0.0.0.0\nquery:\n  server:\n    port: '6125'\nparallelism:\n  default: 1\nrest:\n  address: 0.0.0.0\nenv:\n  java:\n    opts:\n      all: >-\n        --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED\n        --add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED\n        --add-opens=java.base/java.lang=ALL-UNNAMED\n        --add-opens=java.base/java.net=ALL-UNNAMED\n        --add-opens=java.base/java.io=ALL-UNNAMED\n        --add-opens=java.base/java.nio=ALL-UNNAMED\n        --add-opens=java.base/sun.nio.ch=ALL-UNNAMED\n        --add-opens=java.base/java.lang.reflect=ALL-UNNAMED\n        --add-opens=java.base/java.text=ALL-UNNAMED\n        --add-opens=java.base/java.time=ALL-UNNAMED\n        --add-opens=java.base/java.util=ALL-UNNAMED\n        --add-opens=java.base/java.util.concurrent=ALL-UNNAMED\n        --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED\n        --add-opens=java.base/java.util.concurrent.locks=ALL-UNNAMED\n"
  },
  {
    "path": "07-streaming/workshop/live/main.py",
    "content": "def main():\n    print(\"Hello from streaming-workshop!\")\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "07-streaming/workshop/live/notebooks/consumer_db.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"c77749d8\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from kafka import KafkaConsumer\\n\",\n    \"\\n\",\n    \"server = 'localhost:9092'\\n\",\n    \"topic_name = 'rides'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"74dcdffe\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from models import Ride, ride_deserializer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"00726e41\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"consumer = KafkaConsumer(\\n\",\n    \"    topic_name,\\n\",\n    \"    bootstrap_servers=[server],\\n\",\n    \"    auto_offset_reset='earliest',\\n\",\n    \"    group_id='rides-database',\\n\",\n    \"    value_deserializer=ride_deserializer\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"a2cf7106\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import psycopg2\\n\",\n    \"\\n\",\n    \"conn = psycopg2.connect(\\n\",\n    \"    host='localhost',\\n\",\n    \"    port=5432,\\n\",\n    \"    database='postgres',\\n\",\n    \"    user='postgres',\\n\",\n    \"    password='postgres'\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"conn.autocommit = True\\n\",\n    \"cur = conn.cursor()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"id\": \"f0902406\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Listening to rides and writing to PostgreSQL...\\n\",\n      \"Inserted 100 rows...\\n\",\n      \"Inserted 200 rows...\\n\",\n      \"Inserted 300 rows...\\n\",\n      \"Inserted 400 rows...\\n\",\n      \"Inserted 500 rows...\\n\",\n      \"Inserted 600 rows...\\n\",\n      \"Inserted 700 
rows...\\n\",\n      \"Inserted 800 rows...\\n\",\n      \"Inserted 900 rows...\\n\",\n      \"Inserted 1000 rows...\\n\"\n     ]\n    },\n    {\n     \"ename\": \"KeyboardInterrupt\",\n     \"evalue\": \"\",\n     \"output_type\": \"error\",\n     \"traceback\": [\n      \"\\u001b[31m---------------------------------------------------------------------------\\u001b[39m\",\n      \"\\u001b[31mKeyboardInterrupt\\u001b[39m                         Traceback (most recent call last)\",\n      \"\\u001b[36mCell\\u001b[39m\\u001b[36m \\u001b[39m\\u001b[32mIn[5]\\u001b[39m\\u001b[32m, line 6\\u001b[39m\\n\\u001b[32m      3\\u001b[39m \\u001b[38;5;28mprint\\u001b[39m(\\u001b[33mf\\u001b[39m\\u001b[33m\\\"\\u001b[39m\\u001b[33mListening to \\u001b[39m\\u001b[38;5;132;01m{\\u001b[39;00mtopic_name\\u001b[38;5;132;01m}\\u001b[39;00m\\u001b[33m and writing to PostgreSQL...\\u001b[39m\\u001b[33m\\\"\\u001b[39m)\\n\\u001b[32m      5\\u001b[39m count = \\u001b[32m0\\u001b[39m\\n\\u001b[32m----> \\u001b[39m\\u001b[32m6\\u001b[39m \\u001b[38;5;28;43;01mfor\\u001b[39;49;00m\\u001b[43m \\u001b[49m\\u001b[43mmessage\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[38;5;129;43;01min\\u001b[39;49;00m\\u001b[43m \\u001b[49m\\u001b[43mconsumer\\u001b[49m\\u001b[43m:\\u001b[49m\\n\\u001b[32m      7\\u001b[39m \\u001b[43m    \\u001b[49m\\u001b[43mride\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[43m=\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[43mmessage\\u001b[49m\\u001b[43m.\\u001b[49m\\u001b[43mvalue\\u001b[49m\\n\\u001b[32m      8\\u001b[39m \\u001b[43m    \\u001b[49m\\u001b[43mpickup_dt\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[43m=\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[43mdatetime\\u001b[49m\\u001b[43m.\\u001b[49m\\u001b[43mfromtimestamp\\u001b[49m\\u001b[43m(\\u001b[49m\\u001b[43mride\\u001b[49m\\u001b[43m.\\u001b[49m\\u001b[43mtpep_pickup_datetime\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[43m/\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[32;43m1000\\u001b[39;49m\\u001b[43m)\\u001b[49m\\n\",\n      
\"\\u001b[36mFile \\u001b[39m\\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:1213\\u001b[39m, in \\u001b[36mKafkaConsumer.__next__\\u001b[39m\\u001b[34m(self)\\u001b[39m\\n\\u001b[32m   1211\\u001b[39m     \\u001b[38;5;28mself\\u001b[39m._iterator = \\u001b[38;5;28mself\\u001b[39m._message_generator_v2()\\n\\u001b[32m   1212\\u001b[39m \\u001b[38;5;28;01mtry\\u001b[39;00m:\\n\\u001b[32m-> \\u001b[39m\\u001b[32m1213\\u001b[39m     \\u001b[38;5;28;01mreturn\\u001b[39;00m \\u001b[38;5;28;43mnext\\u001b[39;49m\\u001b[43m(\\u001b[49m\\u001b[38;5;28;43mself\\u001b[39;49m\\u001b[43m.\\u001b[49m\\u001b[43m_iterator\\u001b[49m\\u001b[43m)\\u001b[49m\\n\\u001b[32m   1214\\u001b[39m \\u001b[38;5;28;01mexcept\\u001b[39;00m \\u001b[38;5;167;01mStopIteration\\u001b[39;00m:\\n\\u001b[32m   1215\\u001b[39m     \\u001b[38;5;28mself\\u001b[39m._iterator = \\u001b[38;5;28;01mNone\\u001b[39;00m\\n\",\n      \"\\u001b[36mFile \\u001b[39m\\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:1185\\u001b[39m, in \\u001b[36mKafkaConsumer._message_generator_v2\\u001b[39m\\u001b[34m(self)\\u001b[39m\\n\\u001b[32m   1183\\u001b[39m \\u001b[38;5;28;01mdef\\u001b[39;00m\\u001b[38;5;250m \\u001b[39m\\u001b[34m_message_generator_v2\\u001b[39m(\\u001b[38;5;28mself\\u001b[39m):\\n\\u001b[32m   1184\\u001b[39m     timeout_ms = \\u001b[32m1000\\u001b[39m * \\u001b[38;5;28mmax\\u001b[39m(\\u001b[32m0\\u001b[39m, \\u001b[38;5;28mself\\u001b[39m._consumer_timeout - time.time())\\n\\u001b[32m-> \\u001b[39m\\u001b[32m1185\\u001b[39m     record_map = \\u001b[38;5;28;43mself\\u001b[39;49m\\u001b[43m.\\u001b[49m\\u001b[43mpoll\\u001b[49m\\u001b[43m(\\u001b[49m\\u001b[43mtimeout_ms\\u001b[49m\\u001b[43m=\\u001b[49m\\u001b[43mtimeout_ms\\u001b[49m\\u001b[43m,\\u001b[49m\\u001b[43m 
\\u001b[49m\\u001b[43mupdate_offsets\\u001b[49m\\u001b[43m=\\u001b[49m\\u001b[38;5;28;43;01mFalse\\u001b[39;49;00m\\u001b[43m)\\u001b[49m\\n\\u001b[32m   1186\\u001b[39m     \\u001b[38;5;28;01mfor\\u001b[39;00m tp, records \\u001b[38;5;129;01min\\u001b[39;00m six.iteritems(record_map):\\n\\u001b[32m   1187\\u001b[39m         \\u001b[38;5;66;03m# Generators are stateful, and it is possible that the tp / records\\u001b[39;00m\\n\\u001b[32m   1188\\u001b[39m         \\u001b[38;5;66;03m# here may become stale during iteration -- i.e., we seek to a\\u001b[39;00m\\n\\u001b[32m   1189\\u001b[39m         \\u001b[38;5;66;03m# different offset, pause consumption, or lose assignment.\\u001b[39;00m\\n\\u001b[32m   1190\\u001b[39m         \\u001b[38;5;28;01mfor\\u001b[39;00m record \\u001b[38;5;129;01min\\u001b[39;00m records:\\n\\u001b[32m   1191\\u001b[39m             \\u001b[38;5;66;03m# is_fetchable(tp) should handle assignment changes and offset\\u001b[39;00m\\n\\u001b[32m   1192\\u001b[39m             \\u001b[38;5;66;03m# resets; for all other changes (e.g., seeks) we'll rely on the\\u001b[39;00m\\n\\u001b[32m   1193\\u001b[39m             \\u001b[38;5;66;03m# outer function destroying the existing iterator/generator\\u001b[39;00m\\n\\u001b[32m   1194\\u001b[39m             \\u001b[38;5;66;03m# via self._iterator = None\\u001b[39;00m\\n\",\n      \"\\u001b[36mFile \\u001b[39m\\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:708\\u001b[39m, in \\u001b[36mKafkaConsumer.poll\\u001b[39m\\u001b[34m(self, timeout_ms, max_records, update_offsets)\\u001b[39m\\n\\u001b[32m    706\\u001b[39m timer = Timer(timeout_ms)\\n\\u001b[32m    707\\u001b[39m \\u001b[38;5;28;01mwhile\\u001b[39;00m \\u001b[38;5;129;01mnot\\u001b[39;00m \\u001b[38;5;28mself\\u001b[39m._closed:\\n\\u001b[32m--> \\u001b[39m\\u001b[32m708\\u001b[39m     records = 
\\u001b[38;5;28;43mself\\u001b[39;49m\\u001b[43m.\\u001b[49m\\u001b[43m_poll_once\\u001b[49m\\u001b[43m(\\u001b[49m\\u001b[43mtimer\\u001b[49m\\u001b[43m,\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[43mmax_records\\u001b[49m\\u001b[43m,\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[43mupdate_offsets\\u001b[49m\\u001b[43m=\\u001b[49m\\u001b[43mupdate_offsets\\u001b[49m\\u001b[43m)\\u001b[49m\\n\\u001b[32m    709\\u001b[39m     \\u001b[38;5;28;01mif\\u001b[39;00m records:\\n\\u001b[32m    710\\u001b[39m         \\u001b[38;5;28;01mreturn\\u001b[39;00m records\\n\",\n      \"\\u001b[36mFile \\u001b[39m\\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/consumer/group.py:757\\u001b[39m, in \\u001b[36mKafkaConsumer._poll_once\\u001b[39m\\u001b[34m(self, timer, max_records, update_offsets)\\u001b[39m\\n\\u001b[32m    754\\u001b[39m     log.debug(\\u001b[33m'\\u001b[39m\\u001b[33mpoll: do not have all fetch positions...\\u001b[39m\\u001b[33m'\\u001b[39m)\\n\\u001b[32m    755\\u001b[39m     poll_timeout_ms = \\u001b[38;5;28mmin\\u001b[39m(poll_timeout_ms, \\u001b[38;5;28mself\\u001b[39m.config[\\u001b[33m'\\u001b[39m\\u001b[33mretry_backoff_ms\\u001b[39m\\u001b[33m'\\u001b[39m])\\n\\u001b[32m--> \\u001b[39m\\u001b[32m757\\u001b[39m \\u001b[38;5;28;43mself\\u001b[39;49m\\u001b[43m.\\u001b[49m\\u001b[43m_client\\u001b[49m\\u001b[43m.\\u001b[49m\\u001b[43mpoll\\u001b[49m\\u001b[43m(\\u001b[49m\\u001b[43mtimeout_ms\\u001b[49m\\u001b[43m=\\u001b[49m\\u001b[43mpoll_timeout_ms\\u001b[49m\\u001b[43m)\\u001b[49m\\n\\u001b[32m    758\\u001b[39m \\u001b[38;5;66;03m# after the long poll, we should check whether the group needs to rebalance\\u001b[39;00m\\n\\u001b[32m    759\\u001b[39m \\u001b[38;5;66;03m# prior to returning data so that the group can stabilize faster\\u001b[39;00m\\n\\u001b[32m    760\\u001b[39m \\u001b[38;5;28;01mif\\u001b[39;00m \\u001b[38;5;28mself\\u001b[39m._coordinator.need_rejoin():\\n\",\n      \"\\u001b[36mFile 
\\u001b[39m\\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/client_async.py:685\\u001b[39m, in \\u001b[36mKafkaClient.poll\\u001b[39m\\u001b[34m(self, timeout_ms, future)\\u001b[39m\\n\\u001b[32m    678\\u001b[39m         timeout = \\u001b[38;5;28mmin\\u001b[39m(\\n\\u001b[32m    679\\u001b[39m             user_timeout_ms,\\n\\u001b[32m    680\\u001b[39m             metadata_timeout_ms,\\n\\u001b[32m    681\\u001b[39m             idle_connection_timeout_ms,\\n\\u001b[32m    682\\u001b[39m             request_timeout_ms)\\n\\u001b[32m    683\\u001b[39m         timeout = \\u001b[38;5;28mmax\\u001b[39m(\\u001b[32m0\\u001b[39m, timeout)  \\u001b[38;5;66;03m# avoid negative timeouts\\u001b[39;00m\\n\\u001b[32m--> \\u001b[39m\\u001b[32m685\\u001b[39m     \\u001b[38;5;28;43mself\\u001b[39;49m\\u001b[43m.\\u001b[49m\\u001b[43m_poll\\u001b[49m\\u001b[43m(\\u001b[49m\\u001b[43mtimeout\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[43m/\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[32;43m1000\\u001b[39;49m\\u001b[43m)\\u001b[49m\\n\\u001b[32m    687\\u001b[39m \\u001b[38;5;66;03m# called without the lock to avoid deadlock potential\\u001b[39;00m\\n\\u001b[32m    688\\u001b[39m \\u001b[38;5;66;03m# if handlers need to acquire locks\\u001b[39;00m\\n\\u001b[32m    689\\u001b[39m responses.extend(\\u001b[38;5;28mself\\u001b[39m._fire_pending_completed_requests())\\n\",\n      \"\\u001b[36mFile \\u001b[39m\\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/client_async.py:781\\u001b[39m, in \\u001b[36mKafkaClient._poll\\u001b[39m\\u001b[34m(self, timeout)\\u001b[39m\\n\\u001b[32m    778\\u001b[39m         \\u001b[38;5;28;01mcontinue\\u001b[39;00m\\n\\u001b[32m    780\\u001b[39m     \\u001b[38;5;28mself\\u001b[39m._idle_expiry_manager.update(conn.node_id)\\n\\u001b[32m--> \\u001b[39m\\u001b[32m781\\u001b[39m     
\\u001b[38;5;28mself\\u001b[39m._pending_completion.extend(\\u001b[43mconn\\u001b[49m\\u001b[43m.\\u001b[49m\\u001b[43mrecv\\u001b[49m\\u001b[43m(\\u001b[49m\\u001b[43m)\\u001b[49m)\\n\\u001b[32m    783\\u001b[39m \\u001b[38;5;66;03m# Check for additional pending SSL bytes\\u001b[39;00m\\n\\u001b[32m    784\\u001b[39m \\u001b[38;5;28;01mif\\u001b[39;00m \\u001b[38;5;28mself\\u001b[39m.config[\\u001b[33m'\\u001b[39m\\u001b[33msecurity_protocol\\u001b[39m\\u001b[33m'\\u001b[39m] \\u001b[38;5;129;01min\\u001b[39;00m (\\u001b[33m'\\u001b[39m\\u001b[33mSSL\\u001b[39m\\u001b[33m'\\u001b[39m, \\u001b[33m'\\u001b[39m\\u001b[33mSASL_SSL\\u001b[39m\\u001b[33m'\\u001b[39m):\\n\\u001b[32m    785\\u001b[39m     \\u001b[38;5;66;03m# TODO: optimize\\u001b[39;00m\\n\",\n      \"\\u001b[36mFile \\u001b[39m\\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/conn.py:1131\\u001b[39m, in \\u001b[36mBrokerConnection.recv\\u001b[39m\\u001b[34m(self)\\u001b[39m\\n\\u001b[32m   1126\\u001b[39m \\u001b[38;5;28;01mdef\\u001b[39;00m\\u001b[38;5;250m \\u001b[39m\\u001b[34mrecv\\u001b[39m(\\u001b[38;5;28mself\\u001b[39m):\\n\\u001b[32m   1127\\u001b[39m \\u001b[38;5;250m    \\u001b[39m\\u001b[33;03m\\\"\\\"\\\"Non-blocking network receive.\\u001b[39;00m\\n\\u001b[32m   1128\\u001b[39m \\n\\u001b[32m   1129\\u001b[39m \\u001b[33;03m    Return list of (response, future) tuples\\u001b[39;00m\\n\\u001b[32m   1130\\u001b[39m \\u001b[33;03m    \\\"\\\"\\\"\\u001b[39;00m\\n\\u001b[32m-> \\u001b[39m\\u001b[32m1131\\u001b[39m     responses = \\u001b[38;5;28;43mself\\u001b[39;49m\\u001b[43m.\\u001b[49m\\u001b[43m_recv\\u001b[49m\\u001b[43m(\\u001b[49m\\u001b[43m)\\u001b[49m\\n\\u001b[32m   1132\\u001b[39m     \\u001b[38;5;28;01mif\\u001b[39;00m \\u001b[38;5;129;01mnot\\u001b[39;00m responses \\u001b[38;5;129;01mand\\u001b[39;00m \\u001b[38;5;28mself\\u001b[39m.requests_timed_out():\\n\\u001b[32m   1133\\u001b[39m         timed_out = 
\\u001b[38;5;28mself\\u001b[39m.timed_out_ifrs()\\n\",\n      \"\\u001b[36mFile \\u001b[39m\\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/conn.py:1202\\u001b[39m, in \\u001b[36mBrokerConnection._recv\\u001b[39m\\u001b[34m(self)\\u001b[39m\\n\\u001b[32m   1200\\u001b[39m recvd_data = \\u001b[33mb\\u001b[39m\\u001b[33m'\\u001b[39m\\u001b[33m'\\u001b[39m.join(recvd)\\n\\u001b[32m   1201\\u001b[39m \\u001b[38;5;28;01mif\\u001b[39;00m \\u001b[38;5;28mself\\u001b[39m._sensors:\\n\\u001b[32m-> \\u001b[39m\\u001b[32m1202\\u001b[39m     \\u001b[38;5;28;43mself\\u001b[39;49m\\u001b[43m.\\u001b[49m\\u001b[43m_sensors\\u001b[49m\\u001b[43m.\\u001b[49m\\u001b[43mbytes_received\\u001b[49m\\u001b[43m.\\u001b[49m\\u001b[43mrecord\\u001b[49m\\u001b[43m(\\u001b[49m\\u001b[38;5;28;43mlen\\u001b[39;49m\\u001b[43m(\\u001b[49m\\u001b[43mrecvd_data\\u001b[49m\\u001b[43m)\\u001b[49m\\u001b[43m)\\u001b[49m\\n\\u001b[32m   1204\\u001b[39m \\u001b[38;5;66;03m# We need to keep the lock through protocol receipt\\u001b[39;00m\\n\\u001b[32m   1205\\u001b[39m \\u001b[38;5;66;03m# so that we ensure that the processed byte order is the\\u001b[39;00m\\n\\u001b[32m   1206\\u001b[39m \\u001b[38;5;66;03m# same as the received byte order\\u001b[39;00m\\n\\u001b[32m   1207\\u001b[39m \\u001b[38;5;28;01mtry\\u001b[39;00m:\\n\",\n      \"\\u001b[36mFile \\u001b[39m\\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/metrics/stats/sensor.py:77\\u001b[39m, in \\u001b[36mSensor.record\\u001b[39m\\u001b[34m(self, value, time_ms)\\u001b[39m\\n\\u001b[32m     74\\u001b[39m \\u001b[38;5;28;01mwith\\u001b[39;00m \\u001b[38;5;28mself\\u001b[39m._lock:  \\u001b[38;5;66;03m# XXX high volume, might be performance issue\\u001b[39;00m\\n\\u001b[32m     75\\u001b[39m     \\u001b[38;5;66;03m# increment all the stats\\u001b[39;00m\\n\\u001b[32m     76\\u001b[39m     \\u001b[38;5;28;01mfor\\u001b[39;00m stat \\u001b[38;5;129;01min\\u001b[39;00m 
\\u001b[38;5;28mself\\u001b[39m._stats:\\n\\u001b[32m---> \\u001b[39m\\u001b[32m77\\u001b[39m         \\u001b[43mstat\\u001b[49m\\u001b[43m.\\u001b[49m\\u001b[43mrecord\\u001b[49m\\u001b[43m(\\u001b[49m\\u001b[38;5;28;43mself\\u001b[39;49m\\u001b[43m.\\u001b[49m\\u001b[43m_config\\u001b[49m\\u001b[43m,\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[43mvalue\\u001b[49m\\u001b[43m,\\u001b[49m\\u001b[43m \\u001b[49m\\u001b[43mtime_ms\\u001b[49m\\u001b[43m)\\u001b[49m\\n\\u001b[32m     78\\u001b[39m     \\u001b[38;5;28mself\\u001b[39m._check_quotas(time_ms)\\n\\u001b[32m     79\\u001b[39m \\u001b[38;5;28;01mfor\\u001b[39;00m parent \\u001b[38;5;129;01min\\u001b[39;00m \\u001b[38;5;28mself\\u001b[39m._parents:\\n\",\n      \"\\u001b[36mFile \\u001b[39m\\u001b[32m/workspaces/streaming-workshop/.venv/lib/python3.12/site-packages/kafka/metrics/stats/rate.py:49\\u001b[39m, in \\u001b[36mRate.record\\u001b[39m\\u001b[34m(self, config, value, time_ms)\\u001b[39m\\n\\u001b[32m     46\\u001b[39m \\u001b[38;5;28;01mdef\\u001b[39;00m\\u001b[38;5;250m \\u001b[39m\\u001b[34munit_name\\u001b[39m(\\u001b[38;5;28mself\\u001b[39m):\\n\\u001b[32m     47\\u001b[39m     \\u001b[38;5;28;01mreturn\\u001b[39;00m TimeUnit.get_name(\\u001b[38;5;28mself\\u001b[39m._unit)\\n\\u001b[32m---> \\u001b[39m\\u001b[32m49\\u001b[39m \\u001b[38;5;28;01mdef\\u001b[39;00m\\u001b[38;5;250m \\u001b[39m\\u001b[34mrecord\\u001b[39m(\\u001b[38;5;28mself\\u001b[39m, config, value, time_ms):\\n\\u001b[32m     50\\u001b[39m     \\u001b[38;5;28mself\\u001b[39m._stat.record(config, value, time_ms)\\n\\u001b[32m     52\\u001b[39m \\u001b[38;5;28;01mdef\\u001b[39;00m\\u001b[38;5;250m \\u001b[39m\\u001b[34mmeasure\\u001b[39m(\\u001b[38;5;28mself\\u001b[39m, config, now):\\n\",\n      \"\\u001b[31mKeyboardInterrupt\\u001b[39m: \"\n     ]\n    }\n   ],\n   \"source\": [\n    \"from datetime import datetime\\n\",\n    \"\\n\",\n    \"print(f\\\"Listening to {topic_name} and writing to PostgreSQL...\\\")\\n\",\n    \"\\n\",\n    
\"count = 0\\n\",\n    \"for message in consumer:\\n\",\n    \"    ride = message.value\\n\",\n    \"    pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)\\n\",\n    \"    cur.execute(\\n\",\n    \"        \\\"\\\"\\\"INSERT INTO processed_events\\n\",\n    \"           (PULocationID, DOLocationID, trip_distance, total_amount, pickup_datetime)\\n\",\n    \"           VALUES (%s, %s, %s, %s, %s)\\\"\\\"\\\",\\n\",\n    \"        (ride.PULocationID, ride.DOLocationID,\\n\",\n    \"         ride.trip_distance, ride.total_amount, pickup_dt)\\n\",\n    \"    )\\n\",\n    \"    count += 1\\n\",\n    \"    if count % 100 == 0:\\n\",\n    \"        print(f\\\"Inserted {count} rows...\\\")\\n\",\n    \"\\n\",\n    \"consumer.close()\\n\",\n    \"cur.close()\\n\",\n    \"conn.close()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"66840c80\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"2bec0472\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"streaming-workshop\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.12.1\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "07-streaming/workshop/live/notebooks/models.py",
    "content": "import json\nimport dataclasses\n\nfrom dataclasses import dataclass\n\n\n# One taxi trip record, as sent to and read from the 'rides' topic.\n@dataclass\nclass Ride:\n    PULocationID: int\n    DOLocationID: int\n    trip_distance: float\n    total_amount: float\n    tpep_pickup_datetime: int  # epoch milliseconds\n\n\n# Build a Ride from a DataFrame row; the pickup Timestamp becomes epoch milliseconds.\ndef ride_from_row(row):\n    return Ride(\n        PULocationID=int(row['PULocationID']),\n        DOLocationID=int(row['DOLocationID']),\n        trip_distance=float(row['trip_distance']),\n        total_amount=float(row['total_amount']),\n        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),\n    )\n\n\n# Encode a Ride as UTF-8 JSON bytes (used as the Kafka value_serializer).\ndef ride_serializer(ride):\n    ride_dict = dataclasses.asdict(ride)\n    ride_json = json.dumps(ride_dict).encode('utf-8')\n    return ride_json\n\n\n# Decode UTF-8 JSON bytes back into a Ride (used as the Kafka value_deserializer).\ndef ride_deserializer(data):\n    json_str = data.decode('utf-8')\n    ride_dict = json.loads(json_str)\n    return Ride(**ride_dict)\n"
  },
  {
    "path": "07-streaming/workshop/live/notebooks/producer.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"eebfcff0\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"1e3c198b\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet'\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"id\": \"2113c0a9\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"columns = ['PULocationID', 'DOLocationID', 'trip_distance', 'total_amount', 'tpep_pickup_datetime']\\n\",\n    \"df = pd.read_parquet(url, columns=columns).head(1000)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 39,\n   \"id\": \"05ed66d7\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from models import Ride, ride_from_row, ride_serializer\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 40,\n   \"id\": \"26950bac\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"Ride(PULocationID=142, DOLocationID=237, trip_distance=2.28, total_amount=24.94, tpep_pickup_datetime=1761958147000)\"\n      ]\n     },\n     \"execution_count\": 40,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"ride = ride_from_row(df.iloc[1])\\n\",\n    \"ride\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 41,\n   \"id\": \"05cfce95\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from kafka import KafkaProducer\\n\",\n    \"\\n\",\n    \"server = 'localhost:9092'\\n\",\n    \"\\n\",\n    \"producer = KafkaProducer(\\n\",\n    \"    bootstrap_servers=[server],\\n\",\n    \"    value_serializer=ride_serializer\\n\",\n    \")\"\n   ]\n  },\n  {\n  
 \"cell_type\": \"code\",\n   \"execution_count\": 46,\n   \"id\": \"21f5fff3\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"topic_name = 'rides'\\n\",\n    \"\\n\",\n    \"producer.send(topic_name, value=ride)\\n\",\n    \"producer.flush()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 48,\n   \"id\": \"b17a175a\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Sent: Ride(PULocationID=43, DOLocationID=186, trip_distance=1.68, total_amount=22.15, tpep_pickup_datetime=1761956005000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=237, trip_distance=2.28, total_amount=24.94, tpep_pickup_datetime=1761958147000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=238, trip_distance=2.7, total_amount=25.62, tpep_pickup_datetime=1761955639000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=261, trip_distance=12.87, total_amount=86.14, tpep_pickup_datetime=1761955200000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=37, trip_distance=8.4, total_amount=48.65, tpep_pickup_datetime=1761956330000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=100, trip_distance=0.85, total_amount=16.45, tpep_pickup_datetime=1761956471000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=170, trip_distance=3.01, total_amount=25.85, tpep_pickup_datetime=1761955651000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=144, trip_distance=3.82, total_amount=57.54, tpep_pickup_datetime=1761958012000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=161, trip_distance=0.89, total_amount=12.95, tpep_pickup_datetime=1761958619000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=162, trip_distance=2.28, total_amount=38.68, tpep_pickup_datetime=1761955843000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=88, trip_distance=3.3, total_amount=44.0, 
tpep_pickup_datetime=1761955203000)\\n\",\n      \"Sent: Ride(PULocationID=88, DOLocationID=148, trip_distance=1.5, total_amount=19.55, tpep_pickup_datetime=1761957833000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=236, trip_distance=4.7, total_amount=47.65, tpep_pickup_datetime=1761958682000)\\n\",\n      \"Sent: Ride(PULocationID=87, DOLocationID=255, trip_distance=5.61, total_amount=38.85, tpep_pickup_datetime=1761958368000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=43, trip_distance=3.9, total_amount=46.55, tpep_pickup_datetime=1761955553000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=262, trip_distance=1.14, total_amount=14.9, tpep_pickup_datetime=1761956024000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=24, trip_distance=0.6, total_amount=9.12, tpep_pickup_datetime=1761955398000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=147, trip_distance=4.3, total_amount=29.2, tpep_pickup_datetime=1761956395000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=137, trip_distance=3.0, total_amount=32.75, tpep_pickup_datetime=1761957955000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.69, total_amount=11.5, tpep_pickup_datetime=1761955872000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=265, trip_distance=15.47, total_amount=106.63, tpep_pickup_datetime=1761955521000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=125, trip_distance=1.29, total_amount=22.26, tpep_pickup_datetime=1761955760000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=79, trip_distance=1.66, total_amount=32.34, tpep_pickup_datetime=1761957539000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.25, total_amount=22.25, tpep_pickup_datetime=1761958533000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=249, trip_distance=2.68, total_amount=48.68, tpep_pickup_datetime=1761956184000)\\n\",\n      \"Sent: Ride(PULocationID=4, 
DOLocationID=48, trip_distance=3.16, total_amount=33.15, tpep_pickup_datetime=1761958409000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=24, trip_distance=2.8, total_amount=24.55, tpep_pickup_datetime=1761956650000)\\n\",\n      \"Sent: Ride(PULocationID=143, DOLocationID=169, trip_distance=7.45, total_amount=44.04, tpep_pickup_datetime=1761956178000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=142, trip_distance=2.02, total_amount=17.8, tpep_pickup_datetime=1761957454000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=90, trip_distance=3.46, total_amount=35.7, tpep_pickup_datetime=1761957360000)\\n\",\n      \"Sent: Ride(PULocationID=50, DOLocationID=263, trip_distance=2.89, total_amount=26.46, tpep_pickup_datetime=1761956052000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=68, trip_distance=1.2, total_amount=27.3, tpep_pickup_datetime=1761956041000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.2, total_amount=13.02, tpep_pickup_datetime=1761957285000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=233, trip_distance=2.57, total_amount=26.15, tpep_pickup_datetime=1761957826000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=233, trip_distance=1.4, total_amount=24.75, tpep_pickup_datetime=1761956818000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=170, trip_distance=0.3, total_amount=13.0, tpep_pickup_datetime=1761957934000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=137, trip_distance=0.7, total_amount=17.7, tpep_pickup_datetime=1761958462000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=236, trip_distance=0.92, total_amount=13.8, tpep_pickup_datetime=1761956649000)\\n\",\n      \"Sent: Ride(PULocationID=43, DOLocationID=151, trip_distance=2.21, total_amount=17.4, tpep_pickup_datetime=1761957030000)\\n\",\n      \"Sent: Ride(PULocationID=151, DOLocationID=116, trip_distance=2.62, total_amount=21.75, 
tpep_pickup_datetime=1761957624000)\\n\",\n      \"Sent: Ride(PULocationID=164, DOLocationID=114, trip_distance=1.58, total_amount=35.25, tpep_pickup_datetime=1761957104000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=74, trip_distance=6.51, total_amount=52.08, tpep_pickup_datetime=1761958025000)\\n\",\n      \"Sent: Ride(PULocationID=166, DOLocationID=262, trip_distance=3.19, total_amount=27.24, tpep_pickup_datetime=1761956016000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=238, trip_distance=0.46, total_amount=12.12, tpep_pickup_datetime=1761956726000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=166, trip_distance=1.2, total_amount=17.16, tpep_pickup_datetime=1761957605000)\\n\",\n      \"Sent: Ride(PULocationID=66, DOLocationID=246, trip_distance=4.4, total_amount=30.35, tpep_pickup_datetime=1761955924000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=239, trip_distance=3.5, total_amount=29.85, tpep_pickup_datetime=1761958430000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=161, trip_distance=0.66, total_amount=14.25, tpep_pickup_datetime=1761956248000)\\n\",\n      \"Sent: Ride(PULocationID=43, DOLocationID=239, trip_distance=1.13, total_amount=16.38, tpep_pickup_datetime=1761956861000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=166, trip_distance=1.22, total_amount=13.32, tpep_pickup_datetime=1761957749000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=48, trip_distance=11.29, total_amount=81.18, tpep_pickup_datetime=1761958324000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=262, trip_distance=0.86, total_amount=14.2, tpep_pickup_datetime=1761955542000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=249, trip_distance=1.47, total_amount=24.78, tpep_pickup_datetime=1761958550000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=164, trip_distance=0.52, total_amount=20.47, tpep_pickup_datetime=1761955095000)\\n\",\n      \"Sent: 
Ride(PULocationID=164, DOLocationID=142, trip_distance=3.99, total_amount=38.22, tpep_pickup_datetime=1761955769000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=164, trip_distance=1.03, total_amount=16.35, tpep_pickup_datetime=1761955355000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=141, trip_distance=2.47, total_amount=27.75, tpep_pickup_datetime=1761956835000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=237, trip_distance=1.6, total_amount=18.45, tpep_pickup_datetime=1761958690000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=75, trip_distance=2.11, total_amount=20.52, tpep_pickup_datetime=1761955763000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=216, trip_distance=4.7, total_amount=24.3, tpep_pickup_datetime=1761957345000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=170, trip_distance=2.08, total_amount=31.5, tpep_pickup_datetime=1761958185000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=4, trip_distance=0.9, total_amount=19.72, tpep_pickup_datetime=1761957574000)\\n\",\n      \"Sent: Ride(PULocationID=4, DOLocationID=233, trip_distance=2.2, total_amount=22.05, tpep_pickup_datetime=1761958544000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=209, trip_distance=1.1, total_amount=19.7, tpep_pickup_datetime=1761958275000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=238, trip_distance=4.98, total_amount=45.75, tpep_pickup_datetime=1761955208000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=164, trip_distance=9.43, total_amount=59.54, tpep_pickup_datetime=1761955156000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=141, trip_distance=1.14, total_amount=17.22, tpep_pickup_datetime=1761957361000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=68, trip_distance=1.43, total_amount=23.21, tpep_pickup_datetime=1761955824000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=0.36, total_amount=12.15, 
tpep_pickup_datetime=1761956775000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=186, trip_distance=0.01, total_amount=-10.85, tpep_pickup_datetime=1761957695000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=186, trip_distance=0.01, total_amount=10.85, tpep_pickup_datetime=1761957695000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=265, trip_distance=4.31, total_amount=90.81, tpep_pickup_datetime=1761958059000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=141, trip_distance=3.92, total_amount=42.42, tpep_pickup_datetime=1761956086000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=173, trip_distance=7.98, total_amount=50.25, tpep_pickup_datetime=1761958460000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=48, trip_distance=3.03, total_amount=35.44, tpep_pickup_datetime=1761956112000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=140, trip_distance=0.54, total_amount=13.86, tpep_pickup_datetime=1761956344000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=1.46, total_amount=20.5, tpep_pickup_datetime=1761957032000)\\n\",\n      \"Sent: Ride(PULocationID=229, DOLocationID=48, trip_distance=1.8, total_amount=26.45, tpep_pickup_datetime=1761957521000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=233, trip_distance=0.91, total_amount=17.85, tpep_pickup_datetime=1761956210000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=113, trip_distance=1.61, total_amount=27.3, tpep_pickup_datetime=1761957229000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.33, total_amount=19.25, tpep_pickup_datetime=1761957297000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=265, trip_distance=45.7, total_amount=284.39, tpep_pickup_datetime=1761956656000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=231, trip_distance=1.75, total_amount=20.55, tpep_pickup_datetime=1761956254000)\\n\",\n      \"Sent: 
Ride(PULocationID=231, DOLocationID=229, trip_distance=4.32, total_amount=45.78, tpep_pickup_datetime=1761957567000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=224, trip_distance=1.99, total_amount=22.26, tpep_pickup_datetime=1761955509000)\\n\",\n      \"Sent: Ride(PULocationID=224, DOLocationID=141, trip_distance=2.85, total_amount=29.82, tpep_pickup_datetime=1761956184000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=239, trip_distance=1.32, total_amount=18.84, tpep_pickup_datetime=1761957687000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=1.0, total_amount=17.75, tpep_pickup_datetime=1761956921000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.29, total_amount=23.75, tpep_pickup_datetime=1761957629000)\\n\",\n      \"Sent: Ride(PULocationID=125, DOLocationID=186, trip_distance=1.81, total_amount=31.15, tpep_pickup_datetime=1761956520000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=249, trip_distance=1.14, total_amount=21.25, tpep_pickup_datetime=1761958570000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=237, trip_distance=10.18, total_amount=69.71, tpep_pickup_datetime=1761957894000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=43, trip_distance=2.16, total_amount=17.1, tpep_pickup_datetime=1761956402000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=1.7, total_amount=18.0, tpep_pickup_datetime=1761955253000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=7, trip_distance=4.5, total_amount=32.34, tpep_pickup_datetime=1761955970000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.4, total_amount=27.55, tpep_pickup_datetime=1761955608000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=244, trip_distance=10.5, total_amount=57.05, tpep_pickup_datetime=1761957376000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=255, trip_distance=7.32, 
total_amount=56.35, tpep_pickup_datetime=1761955670000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=261, trip_distance=6.49, total_amount=43.85, tpep_pickup_datetime=1761958702000)\\n\",\n      \"Sent: Ride(PULocationID=45, DOLocationID=97, trip_distance=3.2, total_amount=23.45, tpep_pickup_datetime=1761956745000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=186, trip_distance=2.9, total_amount=24.85, tpep_pickup_datetime=1761958647000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=0.97, total_amount=23.94, tpep_pickup_datetime=1761956073000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=237, trip_distance=1.73, total_amount=22.26, tpep_pickup_datetime=1761957100000)\\n\",\n      \"Sent: Ride(PULocationID=45, DOLocationID=79, trip_distance=1.32, total_amount=22.26, tpep_pickup_datetime=1761957425000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.64, total_amount=24.72, tpep_pickup_datetime=1761958384000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.6, total_amount=23.1, tpep_pickup_datetime=1761956460000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=41, trip_distance=7.6, total_amount=49.4, tpep_pickup_datetime=1761957518000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=233, trip_distance=1.1, total_amount=22.3, tpep_pickup_datetime=1761956203000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=75, trip_distance=2.7, total_amount=23.2, tpep_pickup_datetime=1761957674000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=125, trip_distance=3.43, total_amount=49.14, tpep_pickup_datetime=1761957040000)\\n\",\n      \"Sent: Ride(PULocationID=209, DOLocationID=90, trip_distance=3.28, total_amount=31.15, tpep_pickup_datetime=1761956343000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=125, trip_distance=2.08, total_amount=34.86, tpep_pickup_datetime=1761958287000)\\n\",\n      \"Sent: 
Ride(PULocationID=48, DOLocationID=113, trip_distance=2.01, total_amount=41.58, tpep_pickup_datetime=1761956311000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.15, total_amount=10.44, tpep_pickup_datetime=1761955368000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=48, trip_distance=1.4, total_amount=22.75, tpep_pickup_datetime=1761957364000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=68, trip_distance=2.4, total_amount=24.8, tpep_pickup_datetime=1761956279000)\\n\",\n      \"Sent: Ride(PULocationID=262, DOLocationID=90, trip_distance=4.3, total_amount=52.5, tpep_pickup_datetime=1761956236000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=246, trip_distance=0.72, total_amount=13.86, tpep_pickup_datetime=1761956524000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.38, total_amount=12.55, tpep_pickup_datetime=1761957465000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.95, total_amount=17.22, tpep_pickup_datetime=1761958160000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=236, trip_distance=4.25, total_amount=46.62, tpep_pickup_datetime=1761955309000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=75, trip_distance=0.83, total_amount=11.5, tpep_pickup_datetime=1761955738000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=137, trip_distance=2.0, total_amount=20.75, tpep_pickup_datetime=1761957902000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=140, trip_distance=0.49, total_amount=10.1, tpep_pickup_datetime=1761956499000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.3, total_amount=17.16, tpep_pickup_datetime=1761957396000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=249, trip_distance=3.49, total_amount=30.05, tpep_pickup_datetime=1761958009000)\\n\",\n      \"Sent: Ride(PULocationID=87, DOLocationID=229, trip_distance=6.31, total_amount=40.74, 
tpep_pickup_datetime=1761958328000)\\n\",\n      \"Sent: Ride(PULocationID=50, DOLocationID=68, trip_distance=1.9, total_amount=20.55, tpep_pickup_datetime=1761955870000)\\n\",\n      \"Sent: Ride(PULocationID=87, DOLocationID=49, trip_distance=3.71, total_amount=25.55, tpep_pickup_datetime=1761957107000)\\n\",\n      \"Sent: Ride(PULocationID=97, DOLocationID=256, trip_distance=3.39, total_amount=27.6, tpep_pickup_datetime=1761958421000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=236, trip_distance=5.79, total_amount=40.95, tpep_pickup_datetime=1761955769000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=143, trip_distance=3.23, total_amount=23.7, tpep_pickup_datetime=1761957945000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=232, trip_distance=1.6, total_amount=19.25, tpep_pickup_datetime=1761956045000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=229, trip_distance=0.7, total_amount=13.85, tpep_pickup_datetime=1761956383000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.3, total_amount=13.4, tpep_pickup_datetime=1761957107000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=263, trip_distance=0.7, total_amount=12.8, tpep_pickup_datetime=1761957502000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=229, trip_distance=1.7, total_amount=18.9, tpep_pickup_datetime=1761958191000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=148, trip_distance=1.04, total_amount=24.78, tpep_pickup_datetime=1761958597000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=265, trip_distance=0.0, total_amount=124.25, tpep_pickup_datetime=1761956790000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=229, trip_distance=2.57, total_amount=28.14, tpep_pickup_datetime=1761956518000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=48, trip_distance=1.06, total_amount=19.57, tpep_pickup_datetime=1761956704000)\\n\",\n      \"Sent: Ride(PULocationID=68, 
DOLocationID=68, trip_distance=0.32, total_amount=13.02, tpep_pickup_datetime=1761955678000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=162, trip_distance=1.73, total_amount=26.46, tpep_pickup_datetime=1761956006000)\\n\",\n      \"Sent: Ride(PULocationID=74, DOLocationID=236, trip_distance=1.55, total_amount=17.16, tpep_pickup_datetime=1761958603000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=230, trip_distance=2.16, total_amount=33.69, tpep_pickup_datetime=1761955305000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=231, trip_distance=1.99, total_amount=31.5, tpep_pickup_datetime=1761955569000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.04, total_amount=19.95, tpep_pickup_datetime=1761956993000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=170, trip_distance=1.53, total_amount=19.95, tpep_pickup_datetime=1761958077000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=234, trip_distance=2.63, total_amount=25.62, tpep_pickup_datetime=1761958784000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=141, trip_distance=4.28, total_amount=38.22, tpep_pickup_datetime=1761958147000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=230, trip_distance=3.19, total_amount=42.42, tpep_pickup_datetime=1761955883000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.38, total_amount=18.9, tpep_pickup_datetime=1761958302000)\\n\",\n      \"Sent: Ride(PULocationID=261, DOLocationID=186, trip_distance=3.2, total_amount=34.0, tpep_pickup_datetime=1761957372000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=68, trip_distance=0.85, total_amount=26.81, tpep_pickup_datetime=1761955469000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=237, trip_distance=3.15, total_amount=33.8, tpep_pickup_datetime=1761956993000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=114, trip_distance=1.4, total_amount=31.5, 
tpep_pickup_datetime=1761955287000)\\n\",\n      \"Sent: Ride(PULocationID=114, DOLocationID=230, trip_distance=2.8, total_amount=31.05, tpep_pickup_datetime=1761956826000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=43, trip_distance=0.55, total_amount=13.02, tpep_pickup_datetime=1761955330000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=143, trip_distance=0.7, total_amount=13.8, tpep_pickup_datetime=1761955615000)\\n\",\n      \"Sent: Ride(PULocationID=87, DOLocationID=141, trip_distance=6.07, total_amount=34.55, tpep_pickup_datetime=1761956252000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=186, trip_distance=2.8, total_amount=33.18, tpep_pickup_datetime=1761956258000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=48, trip_distance=1.54, total_amount=21.35, tpep_pickup_datetime=1761958653000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=48, trip_distance=1.3, total_amount=17.35, tpep_pickup_datetime=1761957861000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=141, trip_distance=1.6, total_amount=23.95, tpep_pickup_datetime=1761958511000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=4.26, total_amount=39.9, tpep_pickup_datetime=1761956321000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=164, trip_distance=2.85, total_amount=28.14, tpep_pickup_datetime=1761958363000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=140, trip_distance=3.86, total_amount=42.42, tpep_pickup_datetime=1761956960000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=68, trip_distance=4.44, total_amount=34.02, tpep_pickup_datetime=1761958780000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=234, trip_distance=2.0, total_amount=28.45, tpep_pickup_datetime=1761956085000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=48, trip_distance=2.1, total_amount=37.2, tpep_pickup_datetime=1761957500000)\\n\",\n      \"Sent: Ride(PULocationID=140, 
DOLocationID=159, trip_distance=4.96, total_amount=30.4, tpep_pickup_datetime=1761956905000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=237, trip_distance=1.48, total_amount=21.42, tpep_pickup_datetime=1761955928000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.0, total_amount=16.45, tpep_pickup_datetime=1761956435000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=0.6, total_amount=13.85, tpep_pickup_datetime=1761957093000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=229, trip_distance=1.5, total_amount=24.75, tpep_pickup_datetime=1761957404000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=224, trip_distance=4.26, total_amount=50.85, tpep_pickup_datetime=1761955719000)\\n\",\n      \"Sent: Ride(PULocationID=224, DOLocationID=233, trip_distance=1.07, total_amount=17.22, tpep_pickup_datetime=1761958748000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=162, trip_distance=2.38, total_amount=34.86, tpep_pickup_datetime=1761956435000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=24, trip_distance=3.81, total_amount=28.55, tpep_pickup_datetime=1761958415000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=246, trip_distance=0.77, total_amount=15.05, tpep_pickup_datetime=1761956032000)\\n\",\n      \"Sent: Ride(PULocationID=262, DOLocationID=262, trip_distance=0.0, total_amount=8.0, tpep_pickup_datetime=1761956477000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=0.8, total_amount=13.5, tpep_pickup_datetime=1761957782000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=238, trip_distance=2.0, total_amount=18.84, tpep_pickup_datetime=1761958251000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=141, trip_distance=1.78, total_amount=18.09, tpep_pickup_datetime=1761955958000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=74, trip_distance=2.49, total_amount=21.6, 
tpep_pickup_datetime=1761956602000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=87, trip_distance=2.61, total_amount=24.78, tpep_pickup_datetime=1761958149000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=237, trip_distance=1.84, total_amount=17.15, tpep_pickup_datetime=1761955513000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=262, trip_distance=1.05, total_amount=15.48, tpep_pickup_datetime=1761956227000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=163, trip_distance=0.45, total_amount=15.54, tpep_pickup_datetime=1761954619000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=68, trip_distance=1.8, total_amount=49.98, tpep_pickup_datetime=1761955361000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=114, trip_distance=1.88, total_amount=31.94, tpep_pickup_datetime=1761958526000)\\n\",\n      \"Sent: Ride(PULocationID=164, DOLocationID=87, trip_distance=4.4, total_amount=57.3, tpep_pickup_datetime=1761956034000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=237, trip_distance=0.89, total_amount=15.39, tpep_pickup_datetime=1761958187000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=229, trip_distance=4.88, total_amount=43.73, tpep_pickup_datetime=1761958624000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=151, trip_distance=1.0, total_amount=14.6, tpep_pickup_datetime=1761956087000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=261, trip_distance=5.0, total_amount=50.15, tpep_pickup_datetime=1761957798000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=148, trip_distance=0.39, total_amount=23.8, tpep_pickup_datetime=1761955384000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=223, trip_distance=8.64, total_amount=60.06, tpep_pickup_datetime=1761956689000)\\n\",\n      \"Sent: Ride(PULocationID=256, DOLocationID=107, trip_distance=3.25, total_amount=40.35, tpep_pickup_datetime=1761955754000)\\n\",\n      \"Sent: 
Ride(PULocationID=107, DOLocationID=68, trip_distance=1.47, total_amount=18.55, tpep_pickup_datetime=1761957878000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=7, trip_distance=4.12, total_amount=31.8, tpep_pickup_datetime=1761955283000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=113, trip_distance=9.36, total_amount=75.5, tpep_pickup_datetime=1761957063000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=158, trip_distance=0.96, total_amount=23.94, tpep_pickup_datetime=1761957835000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=230, trip_distance=1.28, total_amount=20.56, tpep_pickup_datetime=1761955072000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=107, trip_distance=1.7, total_amount=24.95, tpep_pickup_datetime=1761956114000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=137, trip_distance=0.64, total_amount=13.63, tpep_pickup_datetime=1761957625000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=107, trip_distance=1.17, total_amount=27.3, tpep_pickup_datetime=1761958042000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=68, trip_distance=3.48, total_amount=37.38, tpep_pickup_datetime=1761956556000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=75, trip_distance=3.25, total_amount=25.62, tpep_pickup_datetime=1761955328000)\\n\",\n      \"Sent: Ride(PULocationID=43, DOLocationID=237, trip_distance=1.06, total_amount=18.35, tpep_pickup_datetime=1761956831000)\\n\",\n      \"Sent: Ride(PULocationID=43, DOLocationID=236, trip_distance=2.03, total_amount=22.26, tpep_pickup_datetime=1761957676000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=246, trip_distance=0.54, total_amount=16.55, tpep_pickup_datetime=1761955938000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=236, trip_distance=3.49, total_amount=35.7, tpep_pickup_datetime=1761956552000)\\n\",\n      \"Sent: Ride(PULocationID=75, DOLocationID=238, trip_distance=1.23, total_amount=12.9, 
tpep_pickup_datetime=1761958460000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=229, trip_distance=10.1, total_amount=72.24, tpep_pickup_datetime=1761956382000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=230, trip_distance=1.0, total_amount=17.15, tpep_pickup_datetime=1761958646000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=177, trip_distance=9.77, total_amount=54.55, tpep_pickup_datetime=1761957231000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=262, trip_distance=1.9, total_amount=21.35, tpep_pickup_datetime=1761958531000)\\n\",\n      \"Sent: Ride(PULocationID=125, DOLocationID=239, trip_distance=4.72, total_amount=47.46, tpep_pickup_datetime=1761955991000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=262, trip_distance=2.02, total_amount=20.5, tpep_pickup_datetime=1761958263000)\\n\",\n      \"Sent: Ride(PULocationID=224, DOLocationID=231, trip_distance=3.16, total_amount=27.65, tpep_pickup_datetime=1761957209000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=90, trip_distance=3.67, total_amount=34.02, tpep_pickup_datetime=1761956212000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=90, trip_distance=0.18, total_amount=19.74, tpep_pickup_datetime=1761957726000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=79, trip_distance=0.98, total_amount=27.3, tpep_pickup_datetime=1761958631000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=90, trip_distance=4.7, total_amount=40.74, tpep_pickup_datetime=1761956974000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=68, trip_distance=0.85, total_amount=19.55, tpep_pickup_datetime=1761955241000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=161, trip_distance=1.37, total_amount=24.94, tpep_pickup_datetime=1761956360000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=232, trip_distance=2.09, total_amount=22.26, tpep_pickup_datetime=1761956603000)\\n\",\n      \"Sent: Ride(PULocationID=142, 
DOLocationID=78, trip_distance=8.9, total_amount=35.0, tpep_pickup_datetime=1761957013000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=236, trip_distance=7.09, total_amount=43.65, tpep_pickup_datetime=1761955507000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=2.0, total_amount=18.81, tpep_pickup_datetime=1761957857000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=50, trip_distance=1.81, total_amount=20.58, tpep_pickup_datetime=1761958528000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=113, trip_distance=1.25, total_amount=24.78, tpep_pickup_datetime=1761956381000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=107, trip_distance=1.44, total_amount=24.78, tpep_pickup_datetime=1761957581000)\\n\",\n      \"Sent: Ride(PULocationID=50, DOLocationID=90, trip_distance=2.4, total_amount=33.2, tpep_pickup_datetime=1761955786000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=0.6, total_amount=17.05, tpep_pickup_datetime=1761957365000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=68, trip_distance=0.8, total_amount=16.75, tpep_pickup_datetime=1761958508000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=246, trip_distance=2.2, total_amount=27.3, tpep_pickup_datetime=1761955399000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=90, trip_distance=1.5, total_amount=18.45, tpep_pickup_datetime=1761957057000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=107, trip_distance=1.1, total_amount=24.8, tpep_pickup_datetime=1761957738000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=234, trip_distance=2.5, total_amount=34.0, tpep_pickup_datetime=1761956224000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=186, trip_distance=1.2, total_amount=18.9, tpep_pickup_datetime=1761955938000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=166, trip_distance=6.1, total_amount=43.3, 
tpep_pickup_datetime=1761957437000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=239, trip_distance=0.7, total_amount=15.5, tpep_pickup_datetime=1761956589000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=113, trip_distance=0.37, total_amount=17.15, tpep_pickup_datetime=1761957323000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.2, total_amount=24.8, tpep_pickup_datetime=1761956590000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=107, trip_distance=1.0, total_amount=20.65, tpep_pickup_datetime=1761957748000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=90, trip_distance=1.34, total_amount=24.15, tpep_pickup_datetime=1761958718000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=137, trip_distance=2.4, total_amount=27.78, tpep_pickup_datetime=1761956073000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=232, trip_distance=3.1, total_amount=34.85, tpep_pickup_datetime=1761955678000)\\n\",\n      \"Sent: Ride(PULocationID=232, DOLocationID=263, trip_distance=5.4, total_amount=36.55, tpep_pickup_datetime=1761957916000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=43, trip_distance=0.51, total_amount=12.96, tpep_pickup_datetime=1761957600000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=1.02, total_amount=17.88, tpep_pickup_datetime=1761958142000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=75, trip_distance=3.62, total_amount=27.3, tpep_pickup_datetime=1761957421000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.54, total_amount=11.4, tpep_pickup_datetime=1761957775000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=140, trip_distance=0.89, total_amount=13.02, tpep_pickup_datetime=1761958544000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=17, trip_distance=7.47, total_amount=65.94, tpep_pickup_datetime=1761956096000)\\n\",\n      \"Sent: Ride(PULocationID=79, 
DOLocationID=170, trip_distance=1.8, total_amount=23.94, tpep_pickup_datetime=1761956686000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=229, trip_distance=3.43, total_amount=34.02, tpep_pickup_datetime=1761955666000)\\n\",\n      \"Sent: Ride(PULocationID=229, DOLocationID=262, trip_distance=1.62, total_amount=17.05, tpep_pickup_datetime=1761957259000)\\n\",\n      \"Sent: Ride(PULocationID=262, DOLocationID=237, trip_distance=1.32, total_amount=18.0, tpep_pickup_datetime=1761957709000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=0.9, total_amount=16.32, tpep_pickup_datetime=1761958396000)\\n\",\n      \"Sent: Ride(PULocationID=4, DOLocationID=107, trip_distance=1.3, total_amount=19.7, tpep_pickup_datetime=1761957455000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=170, trip_distance=0.93, total_amount=16.38, tpep_pickup_datetime=1761955645000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=164, trip_distance=1.12, total_amount=24.15, tpep_pickup_datetime=1761955455000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=236, trip_distance=0.39, total_amount=11.11, tpep_pickup_datetime=1761957759000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=231, trip_distance=1.82, total_amount=24.95, tpep_pickup_datetime=1761955330000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=162, trip_distance=6.21, total_amount=43.05, tpep_pickup_datetime=1761956607000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.27, total_amount=11.75, tpep_pickup_datetime=1761956212000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=43, trip_distance=1.46, total_amount=15.0, tpep_pickup_datetime=1761958020000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=141, trip_distance=0.5, total_amount=14.95, tpep_pickup_datetime=1761956740000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=87, trip_distance=4.1, total_amount=51.35, 
tpep_pickup_datetime=1761958460000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=116, trip_distance=18.5, total_amount=96.24, tpep_pickup_datetime=1761957126000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=17, trip_distance=9.96, total_amount=61.39, tpep_pickup_datetime=1761957438000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=170, trip_distance=4.21, total_amount=34.55, tpep_pickup_datetime=1761957583000)\\n\",\n      \"Sent: Ride(PULocationID=4, DOLocationID=263, trip_distance=3.7, total_amount=31.5, tpep_pickup_datetime=1761956232000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=263, trip_distance=0.0, total_amount=8.0, tpep_pickup_datetime=1761957635000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=229, trip_distance=1.9, total_amount=27.3, tpep_pickup_datetime=1761958640000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=164, trip_distance=1.05, total_amount=15.75, tpep_pickup_datetime=1761956099000)\\n\",\n      \"Sent: Ride(PULocationID=164, DOLocationID=170, trip_distance=1.13, total_amount=15.75, tpep_pickup_datetime=1761956710000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=256, trip_distance=5.72, total_amount=60.9, tpep_pickup_datetime=1761957394000)\\n\",\n      \"Sent: Ride(PULocationID=255, DOLocationID=49, trip_distance=4.19, total_amount=46.08, tpep_pickup_datetime=1761956379000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=148, trip_distance=2.1, total_amount=29.76, tpep_pickup_datetime=1761955716000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=249, trip_distance=1.3, total_amount=18.9, tpep_pickup_datetime=1761957598000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=87, trip_distance=2.2, total_amount=26.45, tpep_pickup_datetime=1761958322000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.35, total_amount=15.54, tpep_pickup_datetime=1761956616000)\\n\",\n      \"Sent: Ride(PULocationID=164, 
DOLocationID=246, trip_distance=1.21, total_amount=15.75, tpep_pickup_datetime=1761955343000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=246, trip_distance=0.83, total_amount=19.95, tpep_pickup_datetime=1761955909000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=263, trip_distance=4.38, total_amount=34.02, tpep_pickup_datetime=1761958565000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=107, trip_distance=16.64, total_amount=93.44, tpep_pickup_datetime=1761956975000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=233, trip_distance=1.44, total_amount=18.45, tpep_pickup_datetime=1761955379000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=246, trip_distance=2.65, total_amount=35.55, tpep_pickup_datetime=1761956312000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=229, trip_distance=0.71, total_amount=13.86, tpep_pickup_datetime=1761958144000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=79, trip_distance=3.44, total_amount=40.74, tpep_pickup_datetime=1761957545000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=90, trip_distance=0.6, total_amount=17.22, tpep_pickup_datetime=1761955431000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=1.1, total_amount=19.85, tpep_pickup_datetime=1761956413000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=62, trip_distance=8.16, total_amount=53.55, tpep_pickup_datetime=1761956385000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=79, trip_distance=1.11, total_amount=26.46, tpep_pickup_datetime=1761955381000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=4, trip_distance=0.77, total_amount=21.42, tpep_pickup_datetime=1761956586000)\\n\",\n      \"Sent: Ride(PULocationID=4, DOLocationID=107, trip_distance=1.11, total_amount=18.06, tpep_pickup_datetime=1761957433000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=68, trip_distance=2.16, total_amount=28.95, 
tpep_pickup_datetime=1761956224000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=90, trip_distance=0.8, total_amount=19.74, tpep_pickup_datetime=1761957722000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=246, trip_distance=1.77, total_amount=21.95, tpep_pickup_datetime=1761958441000)\\n\",\n      \"Sent: Ride(PULocationID=87, DOLocationID=79, trip_distance=2.87, total_amount=34.02, tpep_pickup_datetime=1761957364000)\\n\",\n      \"Sent: Ride(PULocationID=43, DOLocationID=140, trip_distance=2.05, total_amount=17.15, tpep_pickup_datetime=1761956821000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=263, trip_distance=0.6, total_amount=13.8, tpep_pickup_datetime=1761957600000)\\n\",\n      \"Sent: Ride(PULocationID=43, DOLocationID=249, trip_distance=3.17, total_amount=35.82, tpep_pickup_datetime=1761958522000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=170, trip_distance=0.7, total_amount=17.75, tpep_pickup_datetime=1761955603000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=8.9, total_amount=64.74, tpep_pickup_datetime=1761958130000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=142, trip_distance=1.9, total_amount=23.9, tpep_pickup_datetime=1761956855000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=41, trip_distance=2.4, total_amount=22.2, tpep_pickup_datetime=1761958017000)\\n\",\n      \"Sent: Ride(PULocationID=166, DOLocationID=151, trip_distance=0.71, total_amount=11.7, tpep_pickup_datetime=1761955955000)\\n\",\n      \"Sent: Ride(PULocationID=166, DOLocationID=243, trip_distance=5.58, total_amount=34.32, tpep_pickup_datetime=1761956353000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=236, trip_distance=0.64, total_amount=12.12, tpep_pickup_datetime=1761955402000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=170, trip_distance=3.58, total_amount=34.02, tpep_pickup_datetime=1761955588000)\\n\",\n      \"Sent: Ride(PULocationID=161, 
DOLocationID=236, trip_distance=1.3, total_amount=17.2, tpep_pickup_datetime=1761956232000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=74, trip_distance=7.55, total_amount=64.26, tpep_pickup_datetime=1761958722000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=90, trip_distance=1.0, total_amount=24.78, tpep_pickup_datetime=1761955226000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=137, trip_distance=1.51, total_amount=26.95, tpep_pickup_datetime=1761956284000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=263, trip_distance=2.0, total_amount=19.15, tpep_pickup_datetime=1761958009000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=1.0, total_amount=19.74, tpep_pickup_datetime=1761955806000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=233, trip_distance=0.15, total_amount=19.75, tpep_pickup_datetime=1761956644000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=256, trip_distance=4.57, total_amount=45.05, tpep_pickup_datetime=1761955504000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=161, trip_distance=0.67, total_amount=13.02, tpep_pickup_datetime=1761956393000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=158, trip_distance=0.07, total_amount=10.15, tpep_pickup_datetime=1761956927000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=107, trip_distance=0.99, total_amount=33.69, tpep_pickup_datetime=1761957291000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=100, trip_distance=1.46, total_amount=20.95, tpep_pickup_datetime=1761956040000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=233, trip_distance=1.13, total_amount=25.05, tpep_pickup_datetime=1761956974000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=68, trip_distance=2.89, total_amount=24.13, tpep_pickup_datetime=1761956628000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=233, trip_distance=1.9, total_amount=23.1, 
tpep_pickup_datetime=1761957676000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=68, trip_distance=0.82, total_amount=13.95, tpep_pickup_datetime=1761956420000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=142, trip_distance=3.3, total_amount=36.54, tpep_pickup_datetime=1761957077000)\\n\",\n      \"Sent: Ride(PULocationID=143, DOLocationID=79, trip_distance=7.5, total_amount=59.05, tpep_pickup_datetime=1761956774000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.08, total_amount=16.38, tpep_pickup_datetime=1761957693000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=263, trip_distance=4.19, total_amount=35.44, tpep_pickup_datetime=1761958288000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=229, trip_distance=0.52, total_amount=15.54, tpep_pickup_datetime=1761955463000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=238, trip_distance=2.32, total_amount=23.04, tpep_pickup_datetime=1761956228000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.48, total_amount=12.12, tpep_pickup_datetime=1761956375000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=140, trip_distance=1.2, total_amount=16.32, tpep_pickup_datetime=1761956646000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=142, trip_distance=4.38, total_amount=38.41, tpep_pickup_datetime=1761955989000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.42, total_amount=15.05, tpep_pickup_datetime=1761958537000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.3, total_amount=11.55, tpep_pickup_datetime=1761957747000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=249, trip_distance=0.98, total_amount=19.74, tpep_pickup_datetime=1761958266000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.86, total_amount=20.58, tpep_pickup_datetime=1761955096000)\\n\",\n      \"Sent: Ride(PULocationID=107, 
DOLocationID=229, trip_distance=2.06, total_amount=23.94, tpep_pickup_datetime=1761956339000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=141, trip_distance=0.89, total_amount=15.54, tpep_pickup_datetime=1761955468000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=262, trip_distance=0.97, total_amount=11.5, tpep_pickup_datetime=1761956055000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=158, trip_distance=5.59, total_amount=45.78, tpep_pickup_datetime=1761956318000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=68, trip_distance=1.17, total_amount=19.57, tpep_pickup_datetime=1761958429000)\\n\",\n      \"Sent: Ride(PULocationID=24, DOLocationID=75, trip_distance=1.38, total_amount=15.0, tpep_pickup_datetime=1761956097000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=224, trip_distance=0.83, total_amount=19.25, tpep_pickup_datetime=1761958771000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=48, trip_distance=1.34, total_amount=24.78, tpep_pickup_datetime=1761955846000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=41, trip_distance=3.75, total_amount=31.4, tpep_pickup_datetime=1761957073000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=263, trip_distance=2.33, total_amount=23.04, tpep_pickup_datetime=1761958753000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=61, trip_distance=7.04, total_amount=58.36, tpep_pickup_datetime=1761957360000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=163, trip_distance=1.1, total_amount=21.4, tpep_pickup_datetime=1761955669000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=261, trip_distance=5.9, total_amount=52.5, tpep_pickup_datetime=1761956719000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=233, trip_distance=1.16, total_amount=16.45, tpep_pickup_datetime=1761956976000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=107, trip_distance=0.98, total_amount=18.06, 
tpep_pickup_datetime=1761957777000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=24, trip_distance=3.75, total_amount=26.25, tpep_pickup_datetime=1761957045000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=170, trip_distance=2.2, total_amount=21.17, tpep_pickup_datetime=1761957744000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=48, trip_distance=1.2, total_amount=18.55, tpep_pickup_datetime=1761955558000)\\n\",\n      \"Sent: Ride(PULocationID=74, DOLocationID=42, trip_distance=2.21, total_amount=15.3, tpep_pickup_datetime=1761955487000)\\n\",\n      \"Sent: Ride(PULocationID=151, DOLocationID=238, trip_distance=0.71, total_amount=11.7, tpep_pickup_datetime=1761957229000)\\n\",\n      \"Sent: Ride(PULocationID=143, DOLocationID=151, trip_distance=2.15, total_amount=19.58, tpep_pickup_datetime=1761958183000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=230, trip_distance=1.9, total_amount=21.4, tpep_pickup_datetime=1761957928000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=230, trip_distance=0.3, total_amount=11.55, tpep_pickup_datetime=1761958796000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=233, trip_distance=2.3, total_amount=31.05, tpep_pickup_datetime=1761957266000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=262, trip_distance=3.0, total_amount=19.25, tpep_pickup_datetime=1761958773000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=50, trip_distance=1.26, total_amount=23.05, tpep_pickup_datetime=1761958750000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=186, trip_distance=0.59, total_amount=15.54, tpep_pickup_datetime=1761956784000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=48, trip_distance=1.39, total_amount=16.45, tpep_pickup_datetime=1761957244000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=238, trip_distance=2.04, total_amount=25.62, tpep_pickup_datetime=1761958076000)\\n\",\n      \"Sent: Ride(PULocationID=68, 
DOLocationID=141, trip_distance=3.93, total_amount=40.74, tpep_pickup_datetime=1761957260000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=4, trip_distance=1.98, total_amount=36.54, tpep_pickup_datetime=1761957796000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=112, trip_distance=3.96, total_amount=49.91, tpep_pickup_datetime=1761956053000)\\n\",\n      \"Sent: Ride(PULocationID=211, DOLocationID=137, trip_distance=2.41, total_amount=29.25, tpep_pickup_datetime=1761955544000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=141, trip_distance=1.48, total_amount=18.06, tpep_pickup_datetime=1761958664000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=138, trip_distance=11.3, total_amount=71.29, tpep_pickup_datetime=1761956418000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=11.04, total_amount=74.46, tpep_pickup_datetime=1761958340000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=79, trip_distance=1.82, total_amount=44.1, tpep_pickup_datetime=1761957003000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=237, trip_distance=0.64, total_amount=14.2, tpep_pickup_datetime=1761955771000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=162, trip_distance=1.19, total_amount=19.74, tpep_pickup_datetime=1761956127000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=80, trip_distance=5.8, total_amount=52.43, tpep_pickup_datetime=1761956989000)\\n\",\n      \"Sent: Ride(PULocationID=262, DOLocationID=87, trip_distance=6.6, total_amount=66.8, tpep_pickup_datetime=1761955277000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=141, trip_distance=14.9, total_amount=81.0, tpep_pickup_datetime=1761956893000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=230, trip_distance=2.16, total_amount=34.02, tpep_pickup_datetime=1761955901000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=236, trip_distance=2.67, total_amount=23.65, 
tpep_pickup_datetime=1761957546000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=179, trip_distance=3.9, total_amount=29.0, tpep_pickup_datetime=1761955359000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=113, trip_distance=1.0, total_amount=17.85, tpep_pickup_datetime=1761955818000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=48, trip_distance=3.0, total_amount=34.85, tpep_pickup_datetime=1761956723000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=113, trip_distance=0.87, total_amount=29.95, tpep_pickup_datetime=1761955542000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=236, trip_distance=3.63, total_amount=41.56, tpep_pickup_datetime=1761957744000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=170, trip_distance=1.92, total_amount=25.75, tpep_pickup_datetime=1761956455000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=48, trip_distance=1.12, total_amount=21.42, tpep_pickup_datetime=1761957962000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.2, total_amount=15.05, tpep_pickup_datetime=1761957985000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=263, trip_distance=2.0, total_amount=16.45, tpep_pickup_datetime=1761958639000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=1.4, total_amount=16.35, tpep_pickup_datetime=1761956025000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=141, trip_distance=2.8, total_amount=27.3, tpep_pickup_datetime=1761958394000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=186, trip_distance=1.2, total_amount=24.05, tpep_pickup_datetime=1761956352000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=142, trip_distance=2.1, total_amount=22.25, tpep_pickup_datetime=1761957923000)\\n\",\n      \"Sent: Ride(PULocationID=50, DOLocationID=246, trip_distance=1.25, total_amount=16.38, tpep_pickup_datetime=1761958378000)\\n\",\n      \"Sent: Ride(PULocationID=246, 
DOLocationID=75, trip_distance=5.25, total_amount=37.35, tpep_pickup_datetime=1761958709000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=170, trip_distance=0.59, total_amount=14.65, tpep_pickup_datetime=1761958613000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=161, trip_distance=1.56, total_amount=18.9, tpep_pickup_datetime=1761956052000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=7, trip_distance=3.47, total_amount=31.5, tpep_pickup_datetime=1761956801000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=107, trip_distance=0.54, total_amount=14.15, tpep_pickup_datetime=1761955290000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=229, trip_distance=1.92, total_amount=30.65, tpep_pickup_datetime=1761955702000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=232, trip_distance=5.7, total_amount=56.7, tpep_pickup_datetime=1761955310000)\\n\",\n      \"Sent: Ride(PULocationID=34, DOLocationID=263, trip_distance=8.81, total_amount=54.9, tpep_pickup_datetime=1761956216000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=263, trip_distance=1.8, total_amount=17.15, tpep_pickup_datetime=1761956756000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=233, trip_distance=2.3, total_amount=23.2, tpep_pickup_datetime=1761957345000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=163, trip_distance=1.2, total_amount=17.2, tpep_pickup_datetime=1761958050000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=42, trip_distance=3.5, total_amount=24.8, tpep_pickup_datetime=1761958674000)\\n\",\n      \"Sent: Ride(PULocationID=143, DOLocationID=81, trip_distance=13.98, total_amount=62.65, tpep_pickup_datetime=1761957611000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=263, trip_distance=7.04, total_amount=47.46, tpep_pickup_datetime=1761958101000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=230, trip_distance=0.28, total_amount=11.55, 
tpep_pickup_datetime=1761955440000)\\n\",\n      \"Sent: Ride(PULocationID=265, DOLocationID=265, trip_distance=0.0, total_amount=71.0, tpep_pickup_datetime=1761957594000)\\n\",\n      \"Sent: Ride(PULocationID=265, DOLocationID=265, trip_distance=0.0, total_amount=85.2, tpep_pickup_datetime=1761957770000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=87, trip_distance=4.0, total_amount=47.35, tpep_pickup_datetime=1761958153000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=231, trip_distance=0.3, total_amount=12.15, tpep_pickup_datetime=1761955983000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=1.9, total_amount=21.35, tpep_pickup_datetime=1761958793000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=238, trip_distance=1.6, total_amount=17.0, tpep_pickup_datetime=1761956346000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=0.8, total_amount=13.8, tpep_pickup_datetime=1761956873000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=263, trip_distance=2.3, total_amount=22.25, tpep_pickup_datetime=1761957322000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=66, trip_distance=4.8, total_amount=31.85, tpep_pickup_datetime=1761954981000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=239, trip_distance=4.01, total_amount=37.38, tpep_pickup_datetime=1761955907000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=249, trip_distance=4.0, total_amount=52.5, tpep_pickup_datetime=1761955461000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=236, trip_distance=5.02, total_amount=42.95, tpep_pickup_datetime=1761958553000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=262, trip_distance=5.8, total_amount=40.75, tpep_pickup_datetime=1761957082000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=68, trip_distance=3.74, total_amount=34.56, tpep_pickup_datetime=1761956465000)\\n\",\n      \"Sent: Ride(PULocationID=237, 
DOLocationID=75, trip_distance=1.98, total_amount=15.7, tpep_pickup_datetime=1761955693000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=158, trip_distance=2.0, total_amount=24.8, tpep_pickup_datetime=1761957491000)\\n\",\n      \"Sent: Ride(PULocationID=43, DOLocationID=170, trip_distance=1.8, total_amount=23.9, tpep_pickup_datetime=1761955415000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=113, trip_distance=1.0, total_amount=24.75, tpep_pickup_datetime=1761956410000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=68, trip_distance=1.4, total_amount=25.45, tpep_pickup_datetime=1761957751000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=229, trip_distance=1.5, total_amount=16.75, tpep_pickup_datetime=1761956645000)\\n\",\n      \"Sent: Ride(PULocationID=229, DOLocationID=263, trip_distance=1.9, total_amount=19.7, tpep_pickup_datetime=1761957347000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=79, trip_distance=3.7, total_amount=36.5, tpep_pickup_datetime=1761957967000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=4, trip_distance=2.35, total_amount=45.04, tpep_pickup_datetime=1761955695000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=230, trip_distance=3.06, total_amount=34.02, tpep_pickup_datetime=1761958084000)\\n\",\n      \"Sent: Ride(PULocationID=151, DOLocationID=164, trip_distance=3.53, total_amount=37.38, tpep_pickup_datetime=1761956326000)\\n\",\n      \"Sent: Ride(PULocationID=164, DOLocationID=79, trip_distance=1.62, total_amount=24.05, tpep_pickup_datetime=1761958296000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=141, trip_distance=1.78, total_amount=22.35, tpep_pickup_datetime=1761956815000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=161, trip_distance=3.12, total_amount=37.45, tpep_pickup_datetime=1761956190000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=137, trip_distance=0.71, total_amount=14.7, 
tpep_pickup_datetime=1761958680000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=239, trip_distance=4.03, total_amount=36.15, tpep_pickup_datetime=1761957513000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=239, trip_distance=1.43, total_amount=17.7, tpep_pickup_datetime=1761955297000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=263, trip_distance=4.9, total_amount=33.85, tpep_pickup_datetime=1761955712000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=249, trip_distance=4.3, total_amount=55.0, tpep_pickup_datetime=1761957577000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=234, trip_distance=0.47, total_amount=15.7, tpep_pickup_datetime=1761955763000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=2.07, total_amount=29.82, tpep_pickup_datetime=1761956324000)\\n\",\n      \"Sent: Ride(PULocationID=232, DOLocationID=232, trip_distance=0.07, total_amount=8.75, tpep_pickup_datetime=1761957625000)\\n\",\n      \"Sent: Ride(PULocationID=232, DOLocationID=224, trip_distance=1.45, total_amount=19.74, tpep_pickup_datetime=1761957883000)\\n\",\n      \"Sent: Ride(PULocationID=224, DOLocationID=229, trip_distance=1.92, total_amount=20.58, tpep_pickup_datetime=1761958559000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=255, trip_distance=4.34, total_amount=54.18, tpep_pickup_datetime=1761955952000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=249, trip_distance=1.0, total_amount=40.74, tpep_pickup_datetime=1761955307000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=148, trip_distance=1.88, total_amount=22.97, tpep_pickup_datetime=1761957572000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=25, trip_distance=2.62, total_amount=24.78, tpep_pickup_datetime=1761958769000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=263, trip_distance=1.95, total_amount=15.75, tpep_pickup_datetime=1761955485000)\\n\",\n      \"Sent: 
Ride(PULocationID=261, DOLocationID=13, trip_distance=0.83, total_amount=16.83, tpep_pickup_datetime=1761958153000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=261, trip_distance=1.9, total_amount=20.75, tpep_pickup_datetime=1761955780000)\\n\",\n      \"Sent: Ride(PULocationID=45, DOLocationID=170, trip_distance=3.1, total_amount=39.9, tpep_pickup_datetime=1761957020000)\\n\",\n      \"Sent: Ride(PULocationID=243, DOLocationID=116, trip_distance=2.99, total_amount=21.72, tpep_pickup_datetime=1761955822000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=137, trip_distance=1.41, total_amount=27.55, tpep_pickup_datetime=1761956665000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=41, trip_distance=3.29, total_amount=25.45, tpep_pickup_datetime=1761955794000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=170, trip_distance=1.17, total_amount=24.78, tpep_pickup_datetime=1761955535000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=107, trip_distance=0.91, total_amount=23.94, tpep_pickup_datetime=1761957348000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=100, trip_distance=0.93, total_amount=18.06, tpep_pickup_datetime=1761958539000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=107, trip_distance=1.65, total_amount=28.85, tpep_pickup_datetime=1761956687000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=113, trip_distance=1.3, total_amount=24.78, tpep_pickup_datetime=1761955497000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=100, trip_distance=2.19, total_amount=30.45, tpep_pickup_datetime=1761957097000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=24, trip_distance=5.13, total_amount=44.1, tpep_pickup_datetime=1761957239000)\\n\",\n      \"Sent: Ride(PULocationID=264, DOLocationID=90, trip_distance=1.4, total_amount=41.55, tpep_pickup_datetime=1761957062000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=10.22, 
total_amount=65.24, tpep_pickup_datetime=1761955208000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=234, trip_distance=1.9, total_amount=23.1, tpep_pickup_datetime=1761956752000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.0, total_amount=-8.75, tpep_pickup_datetime=1761957929000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.0, total_amount=8.75, tpep_pickup_datetime=1761957929000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=75, trip_distance=5.69, total_amount=35.95, tpep_pickup_datetime=1761958337000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=148, trip_distance=0.8, total_amount=22.25, tpep_pickup_datetime=1761955358000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=170, trip_distance=2.0, total_amount=25.6, tpep_pickup_datetime=1761956269000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=262, trip_distance=3.6, total_amount=30.65, tpep_pickup_datetime=1761957679000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=0.7, total_amount=11.1, tpep_pickup_datetime=1761958752000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=238, trip_distance=2.0, total_amount=22.31, tpep_pickup_datetime=1761957155000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=229, trip_distance=1.5, total_amount=19.85, tpep_pickup_datetime=1761958389000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=144, trip_distance=1.58, total_amount=30.66, tpep_pickup_datetime=1761955512000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=143, trip_distance=1.2, total_amount=18.06, tpep_pickup_datetime=1761956341000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=141, trip_distance=2.19, total_amount=23.94, tpep_pickup_datetime=1761957714000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=233, trip_distance=1.26, total_amount=25.25, tpep_pickup_datetime=1761956020000)\\n\",\n      \"Sent: 
Ride(PULocationID=137, DOLocationID=79, trip_distance=1.64, total_amount=27.65, tpep_pickup_datetime=1761956631000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.91, total_amount=14.45, tpep_pickup_datetime=1761958456000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=68, trip_distance=2.5, total_amount=29.82, tpep_pickup_datetime=1761955599000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=87, trip_distance=4.67, total_amount=50.82, tpep_pickup_datetime=1761957483000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=164, trip_distance=1.85, total_amount=23.94, tpep_pickup_datetime=1761958438000)\\n\",\n      \"Sent: Ride(PULocationID=143, DOLocationID=162, trip_distance=1.3, total_amount=13.65, tpep_pickup_datetime=1761956802000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=158, trip_distance=3.0, total_amount=41.65, tpep_pickup_datetime=1761957357000)\\n\",\n      \"Sent: Ride(PULocationID=37, DOLocationID=143, trip_distance=8.61, total_amount=65.88, tpep_pickup_datetime=1761956548000)\\n\",\n      \"Sent: Ride(PULocationID=87, DOLocationID=148, trip_distance=1.67, total_amount=28.14, tpep_pickup_datetime=1761956271000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.93, total_amount=18.9, tpep_pickup_datetime=1761957806000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=236, trip_distance=3.45, total_amount=26.45, tpep_pickup_datetime=1761958735000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=87, trip_distance=3.04, total_amount=26.05, tpep_pickup_datetime=1761958064000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=236, trip_distance=1.77, total_amount=20.5, tpep_pickup_datetime=1761955394000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=265, trip_distance=6.26, total_amount=92.1, tpep_pickup_datetime=1761956388000)\\n\",\n      \"Sent: Ride(PULocationID=152, DOLocationID=82, trip_distance=7.39, total_amount=50.94, 
tpep_pickup_datetime=1761957298000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.38, total_amount=19.25, tpep_pickup_datetime=1761958530000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=90, trip_distance=0.83, total_amount=18.9, tpep_pickup_datetime=1761956852000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=237, trip_distance=1.47, total_amount=14.3, tpep_pickup_datetime=1761955429000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=233, trip_distance=0.55, total_amount=19.74, tpep_pickup_datetime=1761956098000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=75, trip_distance=6.2, total_amount=53.35, tpep_pickup_datetime=1761956596000)\\n\",\n      \"Sent: Ride(PULocationID=164, DOLocationID=231, trip_distance=2.93, total_amount=35.55, tpep_pickup_datetime=1761958267000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=246, trip_distance=0.73, total_amount=14.25, tpep_pickup_datetime=1761957445000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=48, trip_distance=1.48, total_amount=23.94, tpep_pickup_datetime=1761955900000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.6, total_amount=20.41, tpep_pickup_datetime=1761955372000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=137, trip_distance=1.6, total_amount=25.9, tpep_pickup_datetime=1761956526000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=233, trip_distance=1.4, total_amount=22.05, tpep_pickup_datetime=1761958784000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=234, trip_distance=0.4, total_amount=14.7, tpep_pickup_datetime=1761955781000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=2.1, total_amount=44.9, tpep_pickup_datetime=1761956134000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=87, trip_distance=5.1, total_amount=53.34, tpep_pickup_datetime=1761956209000)\\n\",\n      \"Sent: Ride(PULocationID=262, 
DOLocationID=145, trip_distance=1.73, total_amount=16.45, tpep_pickup_datetime=1761955736000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=158, trip_distance=0.68, total_amount=21.95, tpep_pickup_datetime=1761955728000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=249, trip_distance=0.22, total_amount=15.54, tpep_pickup_datetime=1761956324000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=229, trip_distance=2.49, total_amount=24.95, tpep_pickup_datetime=1761955362000)\\n\",\n      \"Sent: Ride(PULocationID=229, DOLocationID=140, trip_distance=1.08, total_amount=15.54, tpep_pickup_datetime=1761956228000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=163, trip_distance=4.6, total_amount=47.85, tpep_pickup_datetime=1761955422000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=237, trip_distance=1.41, total_amount=19.74, tpep_pickup_datetime=1761958654000)\\n\",\n      \"Sent: Ride(PULocationID=229, DOLocationID=68, trip_distance=3.5, total_amount=26.25, tpep_pickup_datetime=1761958702000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=48, trip_distance=1.1, total_amount=24.8, tpep_pickup_datetime=1761955292000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=90, trip_distance=1.3, total_amount=21.45, tpep_pickup_datetime=1761956729000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=249, trip_distance=0.5, total_amount=20.55, tpep_pickup_datetime=1761957601000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=137, trip_distance=2.7, total_amount=36.3, tpep_pickup_datetime=1761958362000)\\n\",\n      \"Sent: Ride(PULocationID=232, DOLocationID=87, trip_distance=1.54, total_amount=22.29, tpep_pickup_datetime=1761957878000)\\n\",\n      \"Sent: Ride(PULocationID=87, DOLocationID=148, trip_distance=1.66, total_amount=29.82, tpep_pickup_datetime=1761958604000)\\n\",\n      \"Sent: Ride(PULocationID=65, DOLocationID=49, trip_distance=0.9, total_amount=11.4, 
tpep_pickup_datetime=1761956206000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=249, trip_distance=1.43, total_amount=21.95, tpep_pickup_datetime=1761958184000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=186, trip_distance=1.66, total_amount=26.15, tpep_pickup_datetime=1761955322000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=68, trip_distance=0.75, total_amount=16.05, tpep_pickup_datetime=1761956736000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=164, trip_distance=1.11, total_amount=21.42, tpep_pickup_datetime=1761957396000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=2.22, total_amount=49.85, tpep_pickup_datetime=1761956131000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=68, trip_distance=11.1, total_amount=79.9, tpep_pickup_datetime=1761955488000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=246, trip_distance=0.9, total_amount=33.45, tpep_pickup_datetime=1761958334000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=74, trip_distance=2.9, total_amount=23.6, tpep_pickup_datetime=1761955491000)\\n\",\n      \"Sent: Ride(PULocationID=74, DOLocationID=244, trip_distance=3.7, total_amount=22.2, tpep_pickup_datetime=1761956515000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=209, trip_distance=7.04, total_amount=51.66, tpep_pickup_datetime=1761957054000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=256, trip_distance=7.1, total_amount=61.95, tpep_pickup_datetime=1761955765000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=1.18, total_amount=17.22, tpep_pickup_datetime=1761955439000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=161, trip_distance=2.44, total_amount=24.78, tpep_pickup_datetime=1761956004000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=236, trip_distance=1.22, total_amount=17.22, tpep_pickup_datetime=1761957596000)\\n\",\n      \"Sent: Ride(PULocationID=236, 
DOLocationID=142, trip_distance=1.74, total_amount=18.84, tpep_pickup_datetime=1761958175000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=229, trip_distance=3.03, total_amount=37.38, tpep_pickup_datetime=1761955292000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=75, trip_distance=2.3, total_amount=20.58, tpep_pickup_datetime=1761957307000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.27, total_amount=20.15, tpep_pickup_datetime=1761956056000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=243, trip_distance=9.04, total_amount=55.02, tpep_pickup_datetime=1761956831000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=142, trip_distance=0.69, total_amount=15.86, tpep_pickup_datetime=1761955394000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=100, trip_distance=1.52, total_amount=21.42, tpep_pickup_datetime=1761957177000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=163, trip_distance=0.76, total_amount=14.44, tpep_pickup_datetime=1761957724000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.84, total_amount=13.8, tpep_pickup_datetime=1761958711000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=113, trip_distance=1.8, total_amount=34.45, tpep_pickup_datetime=1761957122000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=233, trip_distance=0.38, total_amount=10.85, tpep_pickup_datetime=1761957862000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=233, trip_distance=1.63, total_amount=30.66, tpep_pickup_datetime=1761955249000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=79, trip_distance=3.63, total_amount=35.25, tpep_pickup_datetime=1761957099000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=87, trip_distance=1.2, total_amount=19.75, tpep_pickup_datetime=1761956635000)\\n\",\n      \"Sent: Ride(PULocationID=209, DOLocationID=170, trip_distance=4.5, total_amount=39.05, 
tpep_pickup_datetime=1761957901000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=249, trip_distance=0.27, total_amount=15.05, tpep_pickup_datetime=1761955403000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=239, trip_distance=3.5, total_amount=48.3, tpep_pickup_datetime=1761956136000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=170, trip_distance=2.16, total_amount=24.78, tpep_pickup_datetime=1761957922000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=166, trip_distance=1.0, total_amount=13.0, tpep_pickup_datetime=1761958113000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.56, total_amount=12.96, tpep_pickup_datetime=1761956773000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=74, trip_distance=1.91, total_amount=17.25, tpep_pickup_datetime=1761957198000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=161, trip_distance=9.31, total_amount=70.26, tpep_pickup_datetime=1761958431000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=243, trip_distance=10.4, total_amount=54.25, tpep_pickup_datetime=1761957210000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=231, trip_distance=1.17, total_amount=21.42, tpep_pickup_datetime=1761956227000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.57, total_amount=19.55, tpep_pickup_datetime=1761958599000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=145, trip_distance=1.97, total_amount=22.26, tpep_pickup_datetime=1761956297000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=209, trip_distance=1.4, total_amount=18.9, tpep_pickup_datetime=1761958782000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=163, trip_distance=0.96, total_amount=13.95, tpep_pickup_datetime=1761958014000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=234, trip_distance=1.35, total_amount=19.55, tpep_pickup_datetime=1761958413000)\\n\",\n      \"Sent: 
Ride(PULocationID=100, DOLocationID=145, trip_distance=4.79, total_amount=42.26, tpep_pickup_datetime=1761956569000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=152, trip_distance=7.57, total_amount=57.79, tpep_pickup_datetime=1761957412000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=163, trip_distance=0.64, total_amount=12.25, tpep_pickup_datetime=1761957651000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=1.16, total_amount=15.75, tpep_pickup_datetime=1761958651000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=80, trip_distance=3.27, total_amount=24.85, tpep_pickup_datetime=1761956017000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=70, trip_distance=1.16, total_amount=24.09, tpep_pickup_datetime=1761958172000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=252, trip_distance=6.74, total_amount=49.63, tpep_pickup_datetime=1761958767000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=3.51, total_amount=37.38, tpep_pickup_datetime=1761955439000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=75, trip_distance=2.72, total_amount=18.55, tpep_pickup_datetime=1761958052000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=151, trip_distance=0.53, total_amount=12.12, tpep_pickup_datetime=1761957724000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=48, trip_distance=2.82, total_amount=38.22, tpep_pickup_datetime=1761955250000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=74, trip_distance=5.21, total_amount=37.38, tpep_pickup_datetime=1761957341000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=144, trip_distance=4.22, total_amount=42.73, tpep_pickup_datetime=1761956330000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=142, trip_distance=21.73, total_amount=87.69, tpep_pickup_datetime=1761956883000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=141, trip_distance=4.5, 
total_amount=46.6, tpep_pickup_datetime=1761955852000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=262, trip_distance=3.44, total_amount=33.18, tpep_pickup_datetime=1761956238000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=229, trip_distance=1.63, total_amount=19.74, tpep_pickup_datetime=1761958132000)\\n\",\n      \"Sent: Ride(PULocationID=229, DOLocationID=236, trip_distance=1.45, total_amount=16.38, tpep_pickup_datetime=1761958766000)\\n\",\n      \"Sent: Ride(PULocationID=114, DOLocationID=107, trip_distance=1.3, total_amount=21.35, tpep_pickup_datetime=1761955718000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=79, trip_distance=1.0, total_amount=20.65, tpep_pickup_datetime=1761956865000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=48, trip_distance=3.6, total_amount=29.05, tpep_pickup_datetime=1761957987000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=238, trip_distance=2.1, total_amount=21.4, tpep_pickup_datetime=1761957886000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=141, trip_distance=3.71, total_amount=32.34, tpep_pickup_datetime=1761958179000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=170, trip_distance=1.92, total_amount=28.14, tpep_pickup_datetime=1761958413000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=162, trip_distance=9.95, total_amount=76.61, tpep_pickup_datetime=1761956513000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=107, trip_distance=2.2, total_amount=24.15, tpep_pickup_datetime=1761957310000)\\n\",\n      \"Sent: Ride(PULocationID=164, DOLocationID=17, trip_distance=8.9, total_amount=67.49, tpep_pickup_datetime=1761956182000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.33, total_amount=17.16, tpep_pickup_datetime=1761956654000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=230, trip_distance=1.8, total_amount=22.25, tpep_pickup_datetime=1761957156000)\\n\",\n      \"Sent: 
Ride(PULocationID=236, DOLocationID=48, trip_distance=2.3, total_amount=28.35, tpep_pickup_datetime=1761958637000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=79, trip_distance=4.6, total_amount=53.34, tpep_pickup_datetime=1761957453000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=249, trip_distance=1.3, total_amount=30.45, tpep_pickup_datetime=1761955450000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=238, trip_distance=4.5, total_amount=50.0, tpep_pickup_datetime=1761957468000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=246, trip_distance=1.83, total_amount=24.75, tpep_pickup_datetime=1761957761000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=148, trip_distance=2.5, total_amount=32.85, tpep_pickup_datetime=1761955826000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=249, trip_distance=1.9, total_amount=24.15, tpep_pickup_datetime=1761958126000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=143, trip_distance=1.98, total_amount=25.36, tpep_pickup_datetime=1761955308000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=246, trip_distance=2.17, total_amount=23.45, tpep_pickup_datetime=1761955684000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.79, total_amount=28.98, tpep_pickup_datetime=1761955920000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=237, trip_distance=2.55, total_amount=23.35, tpep_pickup_datetime=1761957446000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=1.66, total_amount=17.2, tpep_pickup_datetime=1761958370000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=255, trip_distance=5.69, total_amount=52.15, tpep_pickup_datetime=1761956236000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=249, trip_distance=1.0, total_amount=15.05, tpep_pickup_datetime=1761958649000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=43, trip_distance=4.24, total_amount=42.45, 
tpep_pickup_datetime=1761957905000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=256, trip_distance=16.31, total_amount=87.85, tpep_pickup_datetime=1761957025000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=231, trip_distance=11.65, total_amount=66.33, tpep_pickup_datetime=1761958443000)\\n\",\n      \"Sent: Ride(PULocationID=88, DOLocationID=261, trip_distance=0.43, total_amount=18.06, tpep_pickup_datetime=1761956169000)\\n\",\n      \"Sent: Ride(PULocationID=261, DOLocationID=186, trip_distance=5.42, total_amount=51.66, tpep_pickup_datetime=1761957049000)\\n\",\n      \"Sent: Ride(PULocationID=50, DOLocationID=68, trip_distance=0.93, total_amount=15.54, tpep_pickup_datetime=1761957235000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=13, trip_distance=3.6, total_amount=-36.05, tpep_pickup_datetime=1761958091000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=13, trip_distance=3.6, total_amount=36.05, tpep_pickup_datetime=1761958091000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=37, trip_distance=12.66, total_amount=83.09, tpep_pickup_datetime=1761956632000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=0.69, total_amount=14.65, tpep_pickup_datetime=1761955727000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=148, trip_distance=1.23, total_amount=25.62, tpep_pickup_datetime=1761956213000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=1.04, total_amount=16.75, tpep_pickup_datetime=1761957350000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=231, trip_distance=1.63, total_amount=19.15, tpep_pickup_datetime=1761958074000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=255, trip_distance=3.38, total_amount=49.14, tpep_pickup_datetime=1761956923000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=113, trip_distance=0.28, total_amount=16.38, tpep_pickup_datetime=1761956238000)\\n\",\n      \"Sent: 
Ride(PULocationID=113, DOLocationID=161, trip_distance=2.63, total_amount=29.75, tpep_pickup_datetime=1761956822000)\\n\",\n      \"Sent: Ride(PULocationID=43, DOLocationID=158, trip_distance=5.39, total_amount=36.25, tpep_pickup_datetime=1761956931000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=186, trip_distance=1.33, total_amount=17.75, tpep_pickup_datetime=1761958537000)\\n\",\n      \"Sent: Ride(PULocationID=50, DOLocationID=263, trip_distance=4.11, total_amount=26.25, tpep_pickup_datetime=1761955276000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=144, trip_distance=1.07, total_amount=24.06, tpep_pickup_datetime=1761955616000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=158, trip_distance=2.15, total_amount=27.3, tpep_pickup_datetime=1761956698000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=249, trip_distance=0.58, total_amount=25.03, tpep_pickup_datetime=1761957879000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=246, trip_distance=1.7, total_amount=25.6, tpep_pickup_datetime=1761958272000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=75, trip_distance=4.3, total_amount=36.5, tpep_pickup_datetime=1761955917000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=68, trip_distance=0.73, total_amount=17.22, tpep_pickup_datetime=1761955730000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=243, trip_distance=8.62, total_amount=52.5, tpep_pickup_datetime=1761957246000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=163, trip_distance=1.9, total_amount=20.6, tpep_pickup_datetime=1761956718000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=234, trip_distance=2.3, total_amount=29.8, tpep_pickup_datetime=1761957364000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=79, trip_distance=3.15, total_amount=48.56, tpep_pickup_datetime=1761955429000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.09, total_amount=13.02, 
tpep_pickup_datetime=1761957978000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=2.07, total_amount=23.94, tpep_pickup_datetime=1761955360000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=237, trip_distance=0.43, total_amount=12.12, tpep_pickup_datetime=1761956193000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=239, trip_distance=1.27, total_amount=15.48, tpep_pickup_datetime=1761956860000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=229, trip_distance=3.11, total_amount=28.98, tpep_pickup_datetime=1761957265000)\\n\",\n      \"Sent: Ride(PULocationID=229, DOLocationID=79, trip_distance=2.55, total_amount=32.45, tpep_pickup_datetime=1761958387000)\\n\",\n      \"Sent: Ride(PULocationID=226, DOLocationID=7, trip_distance=1.1, total_amount=14.75, tpep_pickup_datetime=1761958172000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=137, trip_distance=0.75, total_amount=16.35, tpep_pickup_datetime=1761955477000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=236, trip_distance=2.06, total_amount=19.85, tpep_pickup_datetime=1761957775000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=145, trip_distance=16.5, total_amount=90.45, tpep_pickup_datetime=1761958648000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=107, trip_distance=1.0, total_amount=21.42, tpep_pickup_datetime=1761955658000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=170, trip_distance=0.71, total_amount=12.95, tpep_pickup_datetime=1761956423000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=170, trip_distance=0.0, total_amount=-19.25, tpep_pickup_datetime=1761956816000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=170, trip_distance=0.0, total_amount=19.25, tpep_pickup_datetime=1761956816000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=224, trip_distance=1.0, total_amount=23.95, tpep_pickup_datetime=1761956447000)\\n\",\n      \"Sent: 
Ride(PULocationID=224, DOLocationID=79, trip_distance=0.9, total_amount=17.2, tpep_pickup_datetime=1761957800000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=170, trip_distance=1.7, total_amount=18.9, tpep_pickup_datetime=1761958452000)\\n\",\n      \"Sent: Ride(PULocationID=114, DOLocationID=113, trip_distance=0.8, total_amount=19.15, tpep_pickup_datetime=1761955618000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=7, trip_distance=5.5, total_amount=56.7, tpep_pickup_datetime=1761957926000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=48, trip_distance=1.8, total_amount=19.15, tpep_pickup_datetime=1761956880000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=164, trip_distance=2.4, total_amount=29.05, tpep_pickup_datetime=1761957816000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=237, trip_distance=2.28, total_amount=24.15, tpep_pickup_datetime=1761957069000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=164, trip_distance=0.62, total_amount=16.38, tpep_pickup_datetime=1761957817000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=238, trip_distance=4.1, total_amount=36.15, tpep_pickup_datetime=1761958381000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=125, trip_distance=0.99, total_amount=23.75, tpep_pickup_datetime=1761955374000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=158, trip_distance=2.28, total_amount=43.26, tpep_pickup_datetime=1761956336000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=237, trip_distance=0.88, total_amount=12.25, tpep_pickup_datetime=1761956351000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=263, trip_distance=0.58, total_amount=12.12, tpep_pickup_datetime=1761956881000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.34, total_amount=10.81, tpep_pickup_datetime=1761957100000)\\n\",\n      \"Sent: Ride(PULocationID=262, DOLocationID=140, trip_distance=1.3, total_amount=18.0, 
tpep_pickup_datetime=1761957520000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=236, trip_distance=0.51, total_amount=10.4, tpep_pickup_datetime=1761958424000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=79, trip_distance=1.71, total_amount=21.35, tpep_pickup_datetime=1761955817000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=79, trip_distance=0.77, total_amount=19.25, tpep_pickup_datetime=1761957630000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=230, trip_distance=2.52, total_amount=44.94, tpep_pickup_datetime=1761955596000)\\n\",\n      \"Sent: Ride(PULocationID=116, DOLocationID=166, trip_distance=1.05, total_amount=13.52, tpep_pickup_datetime=1761956317000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=230, trip_distance=11.01, total_amount=82.02, tpep_pickup_datetime=1761956511000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=45, trip_distance=4.35, total_amount=-38.85, tpep_pickup_datetime=1761958593000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=45, trip_distance=4.35, total_amount=38.85, tpep_pickup_datetime=1761958593000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=263, trip_distance=0.82, total_amount=13.7, tpep_pickup_datetime=1761956267000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=74, trip_distance=0.84, total_amount=13.8, tpep_pickup_datetime=1761956776000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=186, trip_distance=1.4, total_amount=18.9, tpep_pickup_datetime=1761958274000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=28, trip_distance=6.35, total_amount=50.47, tpep_pickup_datetime=1761955330000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=262, trip_distance=8.44, total_amount=63.48, tpep_pickup_datetime=1761957553000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=48, trip_distance=1.11, total_amount=22.26, tpep_pickup_datetime=1761956847000)\\n\",\n      \"Sent: Ride(PULocationID=48, 
DOLocationID=87, trip_distance=6.66, total_amount=50.15, tpep_pickup_datetime=1761957913000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=151, trip_distance=1.41, total_amount=16.32, tpep_pickup_datetime=1761956114000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=0.76, total_amount=10.8, tpep_pickup_datetime=1761957059000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=238, trip_distance=1.01, total_amount=13.8, tpep_pickup_datetime=1761957693000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=263, trip_distance=1.4, total_amount=18.0, tpep_pickup_datetime=1761958383000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=1.11, total_amount=14.64, tpep_pickup_datetime=1761958732000)\\n\",\n      \"Sent: Ride(PULocationID=261, DOLocationID=229, trip_distance=5.7, total_amount=39.55, tpep_pickup_datetime=1761957098000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=148, trip_distance=1.38, total_amount=29.05, tpep_pickup_datetime=1761956006000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=141, trip_distance=4.33, total_amount=44.1, tpep_pickup_datetime=1761957811000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=238, trip_distance=19.53, total_amount=98.88, tpep_pickup_datetime=1761956173000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=43, trip_distance=2.5, total_amount=37.2, tpep_pickup_datetime=1761957579000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=68, trip_distance=10.49, total_amount=73.3, tpep_pickup_datetime=1761955850000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=48, trip_distance=1.54, total_amount=33.25, tpep_pickup_datetime=1761955446000)\\n\",\n      \"Sent: Ride(PULocationID=50, DOLocationID=97, trip_distance=5.7, total_amount=56.65, tpep_pickup_datetime=1761958001000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=148, trip_distance=1.43, total_amount=38.22, 
tpep_pickup_datetime=1761956136000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=36, trip_distance=3.85, total_amount=37.45, tpep_pickup_datetime=1761958185000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.09, total_amount=15.15, tpep_pickup_datetime=1761955232000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=249, trip_distance=1.32, total_amount=23.45, tpep_pickup_datetime=1761955586000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=13, trip_distance=3.02, total_amount=39.06, tpep_pickup_datetime=1761956901000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.7, total_amount=12.95, tpep_pickup_datetime=1761956438000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=244, trip_distance=5.9, total_amount=38.9, tpep_pickup_datetime=1761956922000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=68, trip_distance=1.07, total_amount=19.74, tpep_pickup_datetime=1761954560000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=263, trip_distance=5.34, total_amount=49.14, tpep_pickup_datetime=1761955328000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=237, trip_distance=1.92, total_amount=19.25, tpep_pickup_datetime=1761958528000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=10, trip_distance=2.32, total_amount=26.76, tpep_pickup_datetime=1761958227000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=162, trip_distance=2.4, total_amount=26.45, tpep_pickup_datetime=1761956453000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=48, trip_distance=1.3, total_amount=23.2, tpep_pickup_datetime=1761957702000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.1, total_amount=17.05, tpep_pickup_datetime=1761958724000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=90, trip_distance=1.8, total_amount=27.3, tpep_pickup_datetime=1761955601000)\\n\",\n      \"Sent: Ride(PULocationID=90, 
DOLocationID=88, trip_distance=3.1, total_amount=41.6, tpep_pickup_datetime=1761956876000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=48, trip_distance=1.5, total_amount=26.45, tpep_pickup_datetime=1761955465000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=164, trip_distance=1.3, total_amount=20.85, tpep_pickup_datetime=1761956661000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=50, trip_distance=0.5, total_amount=15.5, tpep_pickup_datetime=1761957974000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=239, trip_distance=1.64, total_amount=18.06, tpep_pickup_datetime=1761957850000)\\n\",\n      \"Sent: Ride(PULocationID=229, DOLocationID=4, trip_distance=2.84, total_amount=39.06, tpep_pickup_datetime=1761958097000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=164, trip_distance=1.09, total_amount=17.15, tpep_pickup_datetime=1761958351000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=88, trip_distance=3.3, total_amount=35.7, tpep_pickup_datetime=1761957349000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=229, trip_distance=1.0, total_amount=17.0, tpep_pickup_datetime=1761955657000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=223, trip_distance=5.7, total_amount=55.0, tpep_pickup_datetime=1761957596000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=4, trip_distance=0.54, total_amount=16.05, tpep_pickup_datetime=1761956220000)\\n\",\n      \"Sent: Ride(PULocationID=4, DOLocationID=246, trip_distance=2.39, total_amount=35.7, tpep_pickup_datetime=1761956807000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=140, trip_distance=4.82, total_amount=42.42, tpep_pickup_datetime=1761957016000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=231, trip_distance=1.29, total_amount=18.06, tpep_pickup_datetime=1761958584000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=238, trip_distance=2.3, total_amount=20.65, 
tpep_pickup_datetime=1761955264000)\\n\",\n      \"Sent: Ride(PULocationID=143, DOLocationID=263, trip_distance=2.0, total_amount=18.85, tpep_pickup_datetime=1761956722000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=107, trip_distance=4.0, total_amount=34.0, tpep_pickup_datetime=1761957356000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=249, trip_distance=1.3, total_amount=24.95, tpep_pickup_datetime=1761958780000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=229, trip_distance=1.0, total_amount=18.05, tpep_pickup_datetime=1761955689000)\\n\",\n      \"Sent: Ride(PULocationID=229, DOLocationID=107, trip_distance=1.7, total_amount=32.15, tpep_pickup_datetime=1761956227000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=79, trip_distance=0.9, total_amount=27.55, tpep_pickup_datetime=1761958453000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.35, total_amount=23.94, tpep_pickup_datetime=1761955756000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=48, trip_distance=0.7, total_amount=16.38, tpep_pickup_datetime=1761956588000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=100, trip_distance=1.02, total_amount=15.75, tpep_pickup_datetime=1761957508000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=140, trip_distance=2.04, total_amount=22.75, tpep_pickup_datetime=1761958204000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=166, trip_distance=2.9, total_amount=23.35, tpep_pickup_datetime=1761956399000)\\n\",\n      \"Sent: Ride(PULocationID=151, DOLocationID=116, trip_distance=2.6, total_amount=22.55, tpep_pickup_datetime=1761957485000)\\n\",\n      \"Sent: Ride(PULocationID=209, DOLocationID=158, trip_distance=2.7, total_amount=26.95, tpep_pickup_datetime=1761955527000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=249, trip_distance=1.3, total_amount=32.8, tpep_pickup_datetime=1761957154000)\\n\",\n      \"Sent: Ride(PULocationID=48, 
DOLocationID=48, trip_distance=0.3, total_amount=10.15, tpep_pickup_datetime=1761955372000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=112, trip_distance=5.3, total_amount=39.9, tpep_pickup_datetime=1761955882000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=262, trip_distance=1.43, total_amount=-15.0, tpep_pickup_datetime=1761957456000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=263, trip_distance=1.43, total_amount=15.0, tpep_pickup_datetime=1761957456000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=48, trip_distance=0.7, total_amount=22.95, tpep_pickup_datetime=1761956063000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=161, trip_distance=10.23, total_amount=79.5, tpep_pickup_datetime=1761955406000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=87, trip_distance=4.43, total_amount=50.05, tpep_pickup_datetime=1761956344000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=113, trip_distance=0.5, total_amount=13.85, tpep_pickup_datetime=1761956358000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=144, trip_distance=0.9, total_amount=15.55, tpep_pickup_datetime=1761956781000)\\n\",\n      \"Sent: Ride(PULocationID=261, DOLocationID=236, trip_distance=9.4, total_amount=59.35, tpep_pickup_datetime=1761958586000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=140, trip_distance=0.87, total_amount=13.7, tpep_pickup_datetime=1761955339000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=7, trip_distance=3.94, total_amount=27.65, tpep_pickup_datetime=1761956046000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=107, trip_distance=2.58, total_amount=34.02, tpep_pickup_datetime=1761958777000)\\n\",\n      \"Sent: Ride(PULocationID=24, DOLocationID=74, trip_distance=1.82, total_amount=17.52, tpep_pickup_datetime=1761957474000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.77, total_amount=23.1, 
tpep_pickup_datetime=1761955737000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=87, trip_distance=3.99, total_amount=40.74, tpep_pickup_datetime=1761957025000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=113, trip_distance=0.66, total_amount=17.22, tpep_pickup_datetime=1761956685000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=170, trip_distance=1.08, total_amount=21.42, tpep_pickup_datetime=1761957289000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=48, trip_distance=1.59, total_amount=25.79, tpep_pickup_datetime=1761958164000)\\n\",\n      \"Sent: Ride(PULocationID=226, DOLocationID=181, trip_distance=9.59, total_amount=47.3, tpep_pickup_datetime=1761955971000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=246, trip_distance=0.55, total_amount=12.85, tpep_pickup_datetime=1761956644000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=265, trip_distance=2.75, total_amount=102.97, tpep_pickup_datetime=1761958631000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=13, trip_distance=1.22, total_amount=17.22, tpep_pickup_datetime=1761955724000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=140, trip_distance=3.69, total_amount=26.25, tpep_pickup_datetime=1761958367000)\\n\",\n      \"Sent: Ride(PULocationID=4, DOLocationID=144, trip_distance=0.95, total_amount=19.25, tpep_pickup_datetime=1761955694000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=79, trip_distance=0.82, total_amount=19.74, tpep_pickup_datetime=1761956737000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=49, trip_distance=6.77, total_amount=54.69, tpep_pickup_datetime=1761957574000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=41, trip_distance=2.43, total_amount=22.2, tpep_pickup_datetime=1761957454000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=148, trip_distance=9.41, total_amount=64.7, tpep_pickup_datetime=1761958000000)\\n\",\n      \"Sent: 
Ride(PULocationID=100, DOLocationID=211, trip_distance=3.51, total_amount=40.74, tpep_pickup_datetime=1761955201000)\\n\",\n      \"Sent: Ride(PULocationID=211, DOLocationID=158, trip_distance=1.21, total_amount=22.29, tpep_pickup_datetime=1761957218000)\\n\",\n      \"Sent: Ride(PULocationID=4, DOLocationID=239, trip_distance=5.75, total_amount=40.43, tpep_pickup_datetime=1761955800000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=262, trip_distance=2.3, total_amount=23.2, tpep_pickup_datetime=1761955464000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=79, trip_distance=4.2, total_amount=53.35, tpep_pickup_datetime=1761956651000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=233, trip_distance=1.18, total_amount=18.06, tpep_pickup_datetime=1761956725000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=79, trip_distance=1.34, total_amount=32.34, tpep_pickup_datetime=1761957567000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=263, trip_distance=1.73, total_amount=17.4, tpep_pickup_datetime=1761957998000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=163, trip_distance=2.2, total_amount=23.86, tpep_pickup_datetime=1761955626000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=141, trip_distance=1.11, total_amount=18.9, tpep_pickup_datetime=1761956732000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=68, trip_distance=0.4, total_amount=13.02, tpep_pickup_datetime=1761955846000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=144, trip_distance=4.01, total_amount=48.15, tpep_pickup_datetime=1761956664000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=107, trip_distance=5.4, total_amount=50.45, tpep_pickup_datetime=1761956704000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=33, trip_distance=4.16, total_amount=44.1, tpep_pickup_datetime=1761957441000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=196, trip_distance=8.15, total_amount=49.99, 
tpep_pickup_datetime=1761958446000)\\n\",\n      \"Sent: Ride(PULocationID=13, DOLocationID=107, trip_distance=4.97, total_amount=37.38, tpep_pickup_datetime=1761955283000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=140, trip_distance=2.6, total_amount=22.75, tpep_pickup_datetime=1761956605000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=143, trip_distance=1.47, total_amount=18.84, tpep_pickup_datetime=1761957687000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=79, trip_distance=1.59, total_amount=24.75, tpep_pickup_datetime=1761955644000)\\n\",\n      \"Sent: Ride(PULocationID=4, DOLocationID=113, trip_distance=0.99, total_amount=19.74, tpep_pickup_datetime=1761956917000)\\n\",\n      \"Sent: Ride(PULocationID=114, DOLocationID=4, trip_distance=0.88, total_amount=17.45, tpep_pickup_datetime=1761957745000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=263, trip_distance=3.05, total_amount=34.86, tpep_pickup_datetime=1761956135000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=237, trip_distance=0.79, total_amount=14.2, tpep_pickup_datetime=1761957780000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=90, trip_distance=3.1, total_amount=28.98, tpep_pickup_datetime=1761958374000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=262, trip_distance=2.46, total_amount=24.78, tpep_pickup_datetime=1761957461000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=79, trip_distance=6.25, total_amount=77.44, tpep_pickup_datetime=1761956039000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=137, trip_distance=1.4, total_amount=27.3, tpep_pickup_datetime=1761956403000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=170, trip_distance=0.8, total_amount=19.7, tpep_pickup_datetime=1761957702000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=229, trip_distance=0.8, total_amount=11.55, tpep_pickup_datetime=1761958542000)\\n\",\n      \"Sent: Ride(PULocationID=68, 
DOLocationID=42, trip_distance=6.31, total_amount=50.82, tpep_pickup_datetime=1761955373000)\\n\",\n      \"Sent: Ride(PULocationID=41, DOLocationID=263, trip_distance=1.92, total_amount=19.68, tpep_pickup_datetime=1761958166000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=209, trip_distance=6.1, total_amount=39.35, tpep_pickup_datetime=1761957889000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=164, trip_distance=0.97, total_amount=16.45, tpep_pickup_datetime=1761955699000)\\n\",\n      \"Sent: Ride(PULocationID=164, DOLocationID=231, trip_distance=2.31, total_amount=37.19, tpep_pickup_datetime=1761956454000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=236, trip_distance=1.53, total_amount=18.84, tpep_pickup_datetime=1761955789000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=148, trip_distance=0.0, total_amount=8.75, tpep_pickup_datetime=1761957945000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=68, trip_distance=2.8, total_amount=28.15, tpep_pickup_datetime=1761957966000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=87, trip_distance=0.7, total_amount=12.55, tpep_pickup_datetime=1761958490000)\\n\",\n      \"Sent: Ride(PULocationID=66, DOLocationID=37, trip_distance=3.77, total_amount=31.0, tpep_pickup_datetime=1761958073000)\\n\",\n      \"Sent: Ride(PULocationID=24, DOLocationID=262, trip_distance=2.63, total_amount=24.72, tpep_pickup_datetime=1761958756000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=186, trip_distance=3.0, total_amount=42.45, tpep_pickup_datetime=1761956480000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=231, trip_distance=1.12, total_amount=14.89, tpep_pickup_datetime=1761955444000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=148, trip_distance=1.18, total_amount=29.75, tpep_pickup_datetime=1761955863000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=234, trip_distance=1.1, total_amount=20.25, 
tpep_pickup_datetime=1761957725000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=246, trip_distance=1.9, total_amount=26.7, tpep_pickup_datetime=1761956765000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=142, trip_distance=1.04, total_amount=17.31, tpep_pickup_datetime=1761957754000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=107, trip_distance=0.75, total_amount=14.25, tpep_pickup_datetime=1761955857000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=163, trip_distance=2.13, total_amount=19.95, tpep_pickup_datetime=1761956490000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=142, trip_distance=0.4, total_amount=13.02, tpep_pickup_datetime=1761957827000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=113, trip_distance=0.38, total_amount=12.95, tpep_pickup_datetime=1761957395000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=151, trip_distance=7.92, total_amount=69.3, tpep_pickup_datetime=1761957823000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=163, trip_distance=1.44, total_amount=19.25, tpep_pickup_datetime=1761956112000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=256, trip_distance=5.77, total_amount=51.45, tpep_pickup_datetime=1761955515000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=236, trip_distance=0.33, total_amount=12.12, tpep_pickup_datetime=1761957509000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=4, trip_distance=1.05, total_amount=-19.25, tpep_pickup_datetime=1761957971000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=4, trip_distance=1.05, total_amount=19.25, tpep_pickup_datetime=1761957971000)\\n\",\n      \"Sent: Ride(PULocationID=88, DOLocationID=33, trip_distance=3.68, total_amount=28.14, tpep_pickup_datetime=1761956756000)\\n\",\n      \"Sent: Ride(PULocationID=33, DOLocationID=25, trip_distance=0.98, total_amount=13.0, tpep_pickup_datetime=1761957514000)\\n\",\n      \"Sent: Ride(PULocationID=148, 
DOLocationID=211, trip_distance=0.85, total_amount=18.65, tpep_pickup_datetime=1761955144000)\\n\",\n      \"Sent: Ride(PULocationID=211, DOLocationID=97, trip_distance=3.25, total_amount=31.15, tpep_pickup_datetime=1761956474000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.26, total_amount=18.06, tpep_pickup_datetime=1761958254000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=1.43, total_amount=15.3, tpep_pickup_datetime=1761955827000)\\n\",\n      \"Sent: Ride(PULocationID=50, DOLocationID=90, trip_distance=2.31, total_amount=27.3, tpep_pickup_datetime=1761956610000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=3.37, total_amount=31.5, tpep_pickup_datetime=1761957803000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=113, trip_distance=1.03, total_amount=20.58, tpep_pickup_datetime=1761955629000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=170, trip_distance=0.11, total_amount=9.45, tpep_pickup_datetime=1761957550000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=68, trip_distance=1.7, total_amount=21.33, tpep_pickup_datetime=1761957416000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=231, trip_distance=2.27, total_amount=33.45, tpep_pickup_datetime=1761958648000)\\n\",\n      \"Sent: Ride(PULocationID=90, DOLocationID=233, trip_distance=1.76, total_amount=27.33, tpep_pickup_datetime=1761956324000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=186, trip_distance=2.52, total_amount=35.7, tpep_pickup_datetime=1761955460000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=236, trip_distance=4.01, total_amount=28.95, tpep_pickup_datetime=1761955447000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=127, trip_distance=15.53, total_amount=81.65, tpep_pickup_datetime=1761955352000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=246, trip_distance=2.33, total_amount=28.98, 
tpep_pickup_datetime=1761955667000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=100, trip_distance=0.23, total_amount=13.1, tpep_pickup_datetime=1761955396000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=151, trip_distance=2.59, total_amount=22.65, tpep_pickup_datetime=1761956438000)\\n\",\n      \"Sent: Ride(PULocationID=143, DOLocationID=262, trip_distance=2.21, total_amount=25.4, tpep_pickup_datetime=1761958058000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=234, trip_distance=3.02, total_amount=52.32, tpep_pickup_datetime=1761955791000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=4, trip_distance=1.64, total_amount=34.86, tpep_pickup_datetime=1761958476000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=163, trip_distance=2.58, total_amount=25.62, tpep_pickup_datetime=1761956710000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=0.79, total_amount=20.25, tpep_pickup_datetime=1761956022000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=114, trip_distance=0.61, total_amount=17.45, tpep_pickup_datetime=1761957040000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=249, trip_distance=2.04, total_amount=32.34, tpep_pickup_datetime=1761957906000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=260, trip_distance=6.9, total_amount=50.09, tpep_pickup_datetime=1761956327000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=48, trip_distance=0.94, total_amount=15.54, tpep_pickup_datetime=1761957874000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=143, trip_distance=1.88, total_amount=17.85, tpep_pickup_datetime=1761953770000)\\n\",\n      \"Sent: Ride(PULocationID=143, DOLocationID=236, trip_distance=2.75, total_amount=26.4, tpep_pickup_datetime=1761954614000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=48, trip_distance=2.65, total_amount=34.02, tpep_pickup_datetime=1761956414000)\\n\",\n      \"Sent: Ride(PULocationID=170, 
DOLocationID=229, trip_distance=1.5, total_amount=22.25, tpep_pickup_datetime=1761955922000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=246, trip_distance=1.2, total_amount=15.69, tpep_pickup_datetime=1761958070000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=152, trip_distance=8.0, total_amount=58.38, tpep_pickup_datetime=1761956935000)\\n\",\n      \"Sent: Ride(PULocationID=166, DOLocationID=116, trip_distance=1.21, total_amount=11.96, tpep_pickup_datetime=1761956038000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.75, total_amount=18.8, tpep_pickup_datetime=1761957587000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=90, trip_distance=1.78, total_amount=26.46, tpep_pickup_datetime=1761955155000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=107, trip_distance=2.28, total_amount=28.44, tpep_pickup_datetime=1761956921000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=162, trip_distance=1.15, total_amount=16.38, tpep_pickup_datetime=1761958446000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=75, trip_distance=1.63, total_amount=18.0, tpep_pickup_datetime=1761955216000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=186, trip_distance=1.14, total_amount=17.31, tpep_pickup_datetime=1761956032000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=90, trip_distance=0.76, total_amount=26.25, tpep_pickup_datetime=1761958542000)\\n\",\n      \"Sent: Ride(PULocationID=151, DOLocationID=116, trip_distance=2.17, total_amount=20.04, tpep_pickup_datetime=1761957776000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=79, trip_distance=4.19, total_amount=47.35, tpep_pickup_datetime=1761957071000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=164, trip_distance=11.3, total_amount=69.85, tpep_pickup_datetime=1761958772000)\\n\",\n      \"Sent: Ride(PULocationID=164, DOLocationID=158, trip_distance=1.9, total_amount=26.85, 
tpep_pickup_datetime=1761958643000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=263, trip_distance=1.1, total_amount=15.45, tpep_pickup_datetime=1761956554000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=234, trip_distance=3.1, total_amount=33.15, tpep_pickup_datetime=1761957289000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=134, trip_distance=6.05, total_amount=41.65, tpep_pickup_datetime=1761956033000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=239, trip_distance=11.5, total_amount=82.86, tpep_pickup_datetime=1761958551000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=234, trip_distance=1.72, total_amount=26.25, tpep_pickup_datetime=1761956681000)\\n\",\n      \"Sent: Ride(PULocationID=87, DOLocationID=61, trip_distance=6.25, total_amount=41.45, tpep_pickup_datetime=1761957608000)\\n\",\n      \"Sent: Ride(PULocationID=4, DOLocationID=80, trip_distance=3.17, total_amount=25.55, tpep_pickup_datetime=1761958324000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=137, trip_distance=3.33, total_amount=31.45, tpep_pickup_datetime=1761955496000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=13, trip_distance=6.92, total_amount=64.26, tpep_pickup_datetime=1761957268000)\\n\",\n      \"Sent: Ride(PULocationID=249, DOLocationID=170, trip_distance=2.41, total_amount=41.58, tpep_pickup_datetime=1761955218000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=113, trip_distance=0.62, total_amount=23.94, tpep_pickup_datetime=1761955586000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=263, trip_distance=4.62, total_amount=49.14, tpep_pickup_datetime=1761956998000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=246, trip_distance=2.12, total_amount=25.45, tpep_pickup_datetime=1761956111000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=50, trip_distance=0.39, total_amount=22.14, tpep_pickup_datetime=1761957432000)\\n\",\n      \"Sent: Ride(PULocationID=48, 
DOLocationID=142, trip_distance=0.9, total_amount=14.7, tpep_pickup_datetime=1761958636000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=75, trip_distance=0.65, total_amount=13.8, tpep_pickup_datetime=1761957376000)\\n\",\n      \"Sent: Ride(PULocationID=141, DOLocationID=263, trip_distance=0.67, total_amount=11.5, tpep_pickup_datetime=1761958327000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=68, trip_distance=0.4, total_amount=13.02, tpep_pickup_datetime=1761956732000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=41, trip_distance=4.87, total_amount=-36.05, tpep_pickup_datetime=1761957722000)\\n\",\n      \"Sent: Ride(PULocationID=186, DOLocationID=74, trip_distance=4.87, total_amount=36.05, tpep_pickup_datetime=1761957722000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=239, trip_distance=5.79, total_amount=53.34, tpep_pickup_datetime=1761957109000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=79, trip_distance=0.99, total_amount=19.95, tpep_pickup_datetime=1761956287000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=137, trip_distance=1.73, total_amount=23.35, tpep_pickup_datetime=1761957380000)\\n\",\n      \"Sent: Ride(PULocationID=152, DOLocationID=151, trip_distance=1.39, total_amount=14.16, tpep_pickup_datetime=1761955467000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=107, trip_distance=3.63, total_amount=32.34, tpep_pickup_datetime=1761956364000)\\n\",\n      \"Sent: Ride(PULocationID=231, DOLocationID=263, trip_distance=6.34, total_amount=52.5, tpep_pickup_datetime=1761955711000)\\n\",\n      \"Sent: Ride(PULocationID=87, DOLocationID=74, trip_distance=8.5, total_amount=54.15, tpep_pickup_datetime=1761955795000)\\n\",\n      \"Sent: Ride(PULocationID=74, DOLocationID=42, trip_distance=2.2, total_amount=17.5, tpep_pickup_datetime=1761957673000)\\n\",\n      \"Sent: Ride(PULocationID=80, DOLocationID=34, trip_distance=4.3, total_amount=26.75, 
tpep_pickup_datetime=1761957260000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=112, trip_distance=4.4, total_amount=44.95, tpep_pickup_datetime=1761956555000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=68, trip_distance=2.0, total_amount=47.45, tpep_pickup_datetime=1761955568000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=114, trip_distance=1.7, total_amount=27.25, tpep_pickup_datetime=1761958278000)\\n\",\n      \"Sent: Ride(PULocationID=162, DOLocationID=179, trip_distance=3.86, total_amount=33.69, tpep_pickup_datetime=1761956794000)\\n\",\n      \"Sent: Ride(PULocationID=264, DOLocationID=229, trip_distance=1.38, total_amount=18.9, tpep_pickup_datetime=1761958759000)\\n\",\n      \"Sent: Ride(PULocationID=79, DOLocationID=233, trip_distance=2.18, total_amount=31.94, tpep_pickup_datetime=1761956161000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=262, trip_distance=1.85, total_amount=19.74, tpep_pickup_datetime=1761957605000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=170, trip_distance=1.47, total_amount=21.25, tpep_pickup_datetime=1761956512000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=140, trip_distance=2.02, total_amount=20.58, tpep_pickup_datetime=1761957479000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=164, trip_distance=1.9, total_amount=24.63, tpep_pickup_datetime=1761958326000)\\n\",\n      \"Sent: Ride(PULocationID=50, DOLocationID=252, trip_distance=15.65, total_amount=106.85, tpep_pickup_datetime=1761956198000)\\n\",\n      \"Sent: Ride(PULocationID=170, DOLocationID=141, trip_distance=2.03, total_amount=18.9, tpep_pickup_datetime=1761958441000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=243, trip_distance=7.66, total_amount=65.94, tpep_pickup_datetime=1761956564000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=262, trip_distance=2.73, total_amount=19.9, tpep_pickup_datetime=1761955822000)\\n\",\n      \"Sent: 
Ride(PULocationID=262, DOLocationID=162, trip_distance=2.05, total_amount=-16.45, tpep_pickup_datetime=1761956792000)\\n\",\n      \"Sent: Ride(PULocationID=262, DOLocationID=162, trip_distance=2.05, total_amount=16.45, tpep_pickup_datetime=1761956792000)\\n\",\n      \"Sent: Ride(PULocationID=229, DOLocationID=229, trip_distance=0.41, total_amount=12.48, tpep_pickup_datetime=1761957355000)\\n\",\n      \"Sent: Ride(PULocationID=68, DOLocationID=233, trip_distance=2.05, total_amount=24.35, tpep_pickup_datetime=1761956966000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=232, trip_distance=2.58, total_amount=34.86, tpep_pickup_datetime=1761955867000)\\n\",\n      \"Sent: Ride(PULocationID=232, DOLocationID=238, trip_distance=7.27, total_amount=45.21, tpep_pickup_datetime=1761957725000)\\n\",\n      \"Sent: Ride(PULocationID=132, DOLocationID=10, trip_distance=3.38, total_amount=24.31, tpep_pickup_datetime=1761957541000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=234, trip_distance=0.5, total_amount=20.58, tpep_pickup_datetime=1761955758000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=236, trip_distance=4.53, total_amount=48.3, tpep_pickup_datetime=1761957112000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=142, trip_distance=1.09, total_amount=17.22, tpep_pickup_datetime=1761955648000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=100, trip_distance=2.5, total_amount=27.3, tpep_pickup_datetime=1761956205000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=68, trip_distance=1.06, total_amount=20.58, tpep_pickup_datetime=1761957629000)\\n\",\n      \"Sent: Ride(PULocationID=125, DOLocationID=151, trip_distance=6.01, total_amount=38.85, tpep_pickup_datetime=1761956924000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=107, trip_distance=0.77, total_amount=16.38, tpep_pickup_datetime=1761955684000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=164, trip_distance=1.0, 
total_amount=18.8, tpep_pickup_datetime=1761955373000)\\n\",\n      \"Sent: Ride(PULocationID=164, DOLocationID=137, trip_distance=1.0, total_amount=23.2, tpep_pickup_datetime=1761956086000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=141, trip_distance=1.2, total_amount=12.95, tpep_pickup_datetime=1761956985000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=141, trip_distance=1.8, total_amount=19.75, tpep_pickup_datetime=1761958331000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=230, trip_distance=0.7, total_amount=16.05, tpep_pickup_datetime=1761956008000)\\n\",\n      \"Sent: Ride(PULocationID=161, DOLocationID=263, trip_distance=1.8, total_amount=21.4, tpep_pickup_datetime=1761956913000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=142, trip_distance=2.2, total_amount=19.8, tpep_pickup_datetime=1761957773000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=263, trip_distance=2.68, total_amount=24.78, tpep_pickup_datetime=1761955217000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.56, total_amount=11.62, tpep_pickup_datetime=1761955998000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=83, trip_distance=5.16, total_amount=31.85, tpep_pickup_datetime=1761957529000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=246, trip_distance=2.6, total_amount=39.9, tpep_pickup_datetime=1761957214000)\\n\",\n      \"Sent: Ride(PULocationID=211, DOLocationID=114, trip_distance=0.8, total_amount=15.65, tpep_pickup_datetime=1761956007000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=158, trip_distance=1.9, total_amount=24.78, tpep_pickup_datetime=1761956443000)\\n\",\n      \"Sent: Ride(PULocationID=158, DOLocationID=87, trip_distance=3.82, total_amount=28.45, tpep_pickup_datetime=1761957437000)\\n\",\n      \"Sent: Ride(PULocationID=87, DOLocationID=107, trip_distance=4.34, total_amount=40.69, tpep_pickup_datetime=1761958760000)\\n\",\n      \"Sent: 
Ride(PULocationID=114, DOLocationID=137, trip_distance=2.0, total_amount=25.6, tpep_pickup_datetime=1761957565000)\\n\",\n      \"Sent: Ride(PULocationID=265, DOLocationID=265, trip_distance=0.0, total_amount=96.0, tpep_pickup_datetime=1761958241000)\\n\",\n      \"Sent: Ride(PULocationID=230, DOLocationID=144, trip_distance=3.5, total_amount=32.55, tpep_pickup_datetime=1761956658000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=160, trip_distance=7.45, total_amount=40.45, tpep_pickup_datetime=1761958726000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=256, trip_distance=6.03, total_amount=68.46, tpep_pickup_datetime=1761955214000)\\n\",\n      \"Sent: Ride(PULocationID=24, DOLocationID=152, trip_distance=1.3, total_amount=10.4, tpep_pickup_datetime=1761956874000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=236, trip_distance=0.43, total_amount=11.75, tpep_pickup_datetime=1761956620000)\\n\",\n      \"Sent: Ride(PULocationID=236, DOLocationID=263, trip_distance=1.15, total_amount=16.32, tpep_pickup_datetime=1761956897000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=238, trip_distance=1.99, total_amount=19.1, tpep_pickup_datetime=1761957478000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=24, trip_distance=2.3, total_amount=19.7, tpep_pickup_datetime=1761955265000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=239, trip_distance=0.8, total_amount=13.8, tpep_pickup_datetime=1761956041000)\\n\",\n      \"Sent: Ride(PULocationID=142, DOLocationID=262, trip_distance=2.1, total_amount=20.5, tpep_pickup_datetime=1761956804000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=226, trip_distance=4.0, total_amount=23.45, tpep_pickup_datetime=1761957750000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=143, trip_distance=1.53, total_amount=17.95, tpep_pickup_datetime=1761958179000)\\n\",\n      \"Sent: Ride(PULocationID=144, DOLocationID=231, trip_distance=1.2, total_amount=18.9, 
tpep_pickup_datetime=1761955678000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=229, trip_distance=2.66, total_amount=30.66, tpep_pickup_datetime=1761958375000)\\n\",\n      \"Sent: Ride(PULocationID=113, DOLocationID=107, trip_distance=0.75, total_amount=21.42, tpep_pickup_datetime=1761955353000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=88, trip_distance=4.28, total_amount=36.05, tpep_pickup_datetime=1761957278000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=262, trip_distance=0.67, total_amount=13.5, tpep_pickup_datetime=1761956709000)\\n\",\n      \"Sent: Ride(PULocationID=148, DOLocationID=79, trip_distance=1.5, total_amount=23.1, tpep_pickup_datetime=1761955608000)\\n\",\n      \"Sent: Ride(PULocationID=41, DOLocationID=238, trip_distance=1.73, total_amount=19.68, tpep_pickup_datetime=1761955822000)\\n\",\n      \"Sent: Ride(PULocationID=163, DOLocationID=140, trip_distance=1.44, total_amount=18.11, tpep_pickup_datetime=1761958756000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=137, trip_distance=0.49, total_amount=12.55, tpep_pickup_datetime=1761957049000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=68, trip_distance=1.48, total_amount=23.05, tpep_pickup_datetime=1761957555000)\\n\",\n      \"Sent: Ride(PULocationID=166, DOLocationID=41, trip_distance=1.1, total_amount=11.1, tpep_pickup_datetime=1761955442000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=140, trip_distance=1.8, total_amount=18.85, tpep_pickup_datetime=1761956742000)\\n\",\n      \"Sent: Ride(PULocationID=140, DOLocationID=237, trip_distance=0.7, total_amount=14.31, tpep_pickup_datetime=1761957607000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=107, trip_distance=2.8, total_amount=30.65, tpep_pickup_datetime=1761958306000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=239, trip_distance=2.94, total_amount=38.22, tpep_pickup_datetime=1761958211000)\\n\",\n      \"Sent: Ride(PULocationID=48, 
DOLocationID=42, trip_distance=4.7, total_amount=34.85, tpep_pickup_datetime=1761955550000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=236, trip_distance=0.9, total_amount=13.8, tpep_pickup_datetime=1761957769000)\\n\",\n      \"Sent: Ride(PULocationID=238, DOLocationID=114, trip_distance=6.12, total_amount=56.35, tpep_pickup_datetime=1761955775000)\\n\",\n      \"Sent: Ride(PULocationID=138, DOLocationID=232, trip_distance=9.37, total_amount=55.4, tpep_pickup_datetime=1761955488000)\\n\",\n      \"Sent: Ride(PULocationID=232, DOLocationID=137, trip_distance=1.93, total_amount=22.05, tpep_pickup_datetime=1761957174000)\\n\",\n      \"Sent: Ride(PULocationID=137, DOLocationID=263, trip_distance=2.72, total_amount=20.65, tpep_pickup_datetime=1761958356000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=263, trip_distance=4.6, total_amount=39.9, tpep_pickup_datetime=1761956440000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=0.6, total_amount=11.92, tpep_pickup_datetime=1761958168000)\\n\",\n      \"Sent: Ride(PULocationID=263, DOLocationID=141, trip_distance=1.04, total_amount=14.64, tpep_pickup_datetime=1761955790000)\\n\",\n      \"Sent: Ride(PULocationID=237, DOLocationID=75, trip_distance=1.93, total_amount=18.7, tpep_pickup_datetime=1761956503000)\\n\",\n      \"Sent: Ride(PULocationID=233, DOLocationID=107, trip_distance=0.76, total_amount=21.42, tpep_pickup_datetime=1761956475000)\\n\",\n      \"Sent: Ride(PULocationID=262, DOLocationID=48, trip_distance=3.12, total_amount=28.14, tpep_pickup_datetime=1761955204000)\\n\",\n      \"Sent: Ride(PULocationID=48, DOLocationID=68, trip_distance=0.45, total_amount=13.02, tpep_pickup_datetime=1761956902000)\\n\",\n      \"Sent: Ride(PULocationID=246, DOLocationID=186, trip_distance=0.7, total_amount=13.95, tpep_pickup_datetime=1761957726000)\\n\",\n      \"Sent: Ride(PULocationID=100, DOLocationID=164, trip_distance=0.15, total_amount=9.45, 
tpep_pickup_datetime=1761958138000)\\n\",\n      \"Sent: Ride(PULocationID=234, DOLocationID=113, trip_distance=0.43, total_amount=18.06, tpep_pickup_datetime=1761958439000)\\n\",\n      \"Sent: Ride(PULocationID=239, DOLocationID=116, trip_distance=3.3, total_amount=28.35, tpep_pickup_datetime=1761955931000)\\n\",\n      \"Sent: Ride(PULocationID=107, DOLocationID=79, trip_distance=0.9, total_amount=18.9, tpep_pickup_datetime=1761958296000)\\n\",\n      \"took 11.49 seconds\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import time\\n\",\n    \"\\n\",\n    \"t0 = time.time()\\n\",\n    \"\\n\",\n    \"for _, row in df.iterrows():\\n\",\n    \"    ride = ride_from_row(row)\\n\",\n    \"    producer.send(topic_name, value=ride)\\n\",\n    \"    print(f\\\"Sent: {ride}\\\")\\n\",\n    \"    time.sleep(0.01)\\n\",\n    \"\\n\",\n    \"producer.flush()\\n\",\n    \"\\n\",\n    \"t1 = time.time()\\n\",\n    \"print(f'took {(t1 - t0):.2f} seconds')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"a1ca66fe\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"streaming-workshop\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.12.1\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "07-streaming/workshop/live/pyproject.flink.toml",
    "content": "[project]\nname = \"pyflink-workshop\"\nversion = \"0.1.0\"\nrequires-python = \">=3.12\"\ndependencies = [\n    \"apache-flink==2.2.0\",\n]\n"
  },
  {
    "path": "07-streaming/workshop/live/pyproject.toml",
    "content": "[project]\nname = \"streaming-workshop\"\nversion = \"0.1.0\"\ndescription = \"Add your description here\"\nreadme = \"README.md\"\nrequires-python = \">=3.12\"\ndependencies = [\n    \"kafka-python>=2.3.0\",\n    \"pandas>=3.0.1\",\n    \"psycopg2-binary>=2.9.11\",\n    \"pyarrow>=23.0.1\",\n]\n\n[dependency-groups]\ndev = [\n    \"jupyter>=1.1.1\",\n]\n"
  },
  {
    "path": "07-streaming/workshop/live/src/job/aggregation_job.py",
    "content": "from pyflink.datastream import StreamExecutionEnvironment\nfrom pyflink.table import EnvironmentSettings, StreamTableEnvironment\n\n\ndef create_events_source_kafka(t_env):\n    table_name = \"events\"\n    source_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            trip_distance DOUBLE,\n            total_amount DOUBLE,\n            tpep_pickup_datetime BIGINT,\n            event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),\n            WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND\n        ) WITH (\n            'connector' = 'kafka',\n            'properties.bootstrap.servers' = 'redpanda:29092',\n            'topic' = 'rides',\n            'scan.startup.mode' = 'earliest-offset',\n            'properties.auto.offset.reset' = 'earliest',\n            'format' = 'json'\n        );\n        \"\"\"\n    t_env.execute_sql(source_ddl)\n    return table_name\n\n\ndef create_events_aggregated_sink(t_env):\n    table_name = 'processed_events_aggregated'\n    sink_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            window_start TIMESTAMP(3),\n            PULocationID INT,\n            num_trips BIGINT,\n            total_revenue DOUBLE,\n            PRIMARY KEY (window_start, PULocationID) NOT ENFORCED\n        ) WITH (\n            'connector' = 'jdbc',\n            'url' = 'jdbc:postgresql://postgres:5432/postgres',\n            'table-name' = '{table_name}',\n            'username' = 'postgres',\n            'password' = 'postgres',\n            'driver' = 'org.postgresql.Driver'\n        );\n        \"\"\"\n    t_env.execute_sql(sink_ddl)\n    return table_name\n\n\ndef log_aggregation():\n    env = StreamExecutionEnvironment.get_execution_environment()\n    env.enable_checkpointing(10 * 1000)\n    env.set_parallelism(3)\n\n    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()\n    t_env = 
StreamTableEnvironment.create(env, environment_settings=settings)\n\n    try:\n        source_table = create_events_source_kafka(t_env)\n        aggregated_table = create_events_aggregated_sink(t_env)\n\n        t_env.execute_sql(f\"\"\"\n        INSERT INTO {aggregated_table}\n        SELECT\n            window_start,\n            PULocationID,\n            COUNT(*) AS num_trips,\n            SUM(total_amount) AS total_revenue\n        FROM TABLE(\n            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '1' HOUR)\n        )\n        GROUP BY window_start, PULocationID;\n\n        \"\"\").wait()\n\n    except Exception as e:\n        print(\"Writing records from Kafka to JDBC failed:\", str(e))\n\n\nif __name__ == '__main__':\n    log_aggregation()"
  },
  {
    "path": "07-streaming/workshop/live/src/job/pass_through_job.py",
    "content": "\nfrom pyflink.datastream import StreamExecutionEnvironment\nfrom pyflink.table import EnvironmentSettings, StreamTableEnvironment\n\n\ndef create_events_source_kafka(t_env):\n    table_name = \"events\"\n    source_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            trip_distance DOUBLE,\n            total_amount DOUBLE,\n            tpep_pickup_datetime BIGINT\n        ) WITH (\n            'connector' = 'kafka',\n            'properties.bootstrap.servers' = 'redpanda:29092',\n            'topic' = 'rides',\n            'scan.startup.mode' = 'latest-offset',\n            'properties.auto.offset.reset' = 'latest',\n            'format' = 'json'\n        );\n        \"\"\"\n    t_env.execute_sql(source_ddl)\n    return table_name\n\n\ndef create_processed_events_sink_postgres(t_env):\n    table_name = 'processed_events'\n    sink_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            trip_distance DOUBLE,\n            total_amount DOUBLE,\n            pickup_datetime TIMESTAMP\n        ) WITH (\n            'connector' = 'jdbc',\n            'url' = 'jdbc:postgresql://postgres:5432/postgres',\n            'table-name' = '{table_name}',\n            'username' = 'postgres',\n            'password' = 'postgres',\n            'driver' = 'org.postgresql.Driver'\n        );\n        \"\"\"\n    t_env.execute_sql(sink_ddl)\n    return table_name\n\n\ndef log_processing():\n    env = StreamExecutionEnvironment.get_execution_environment()\n    env.enable_checkpointing(10 * 1000)  # checkpoint every 10 seconds\n\n    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()\n    t_env = StreamTableEnvironment.create(env, environment_settings=settings)\n\n    source_table = create_events_source_kafka(t_env)\n    postgres_sink = create_processed_events_sink_postgres(t_env)\n\n    
t_env.execute_sql(\n        f\"\"\"\n        INSERT INTO {postgres_sink}\n        SELECT\n            PULocationID,\n            DOLocationID,\n            trip_distance,\n            total_amount,\n            TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3) as pickup_datetime\n        FROM {source_table}\n        \"\"\"\n    ).wait()\n\nif __name__ == '__main__':\n    log_processing()"
  },
  {
    "path": "07-streaming/workshop/live/src/producers/models.py",
    "content": "import json\nimport dataclasses\n\nfrom dataclasses import dataclass\n\n\n@dataclass\nclass Ride:\n    PULocationID: int\n    DOLocationID: int\n    trip_distance: float\n    total_amount: float\n    tpep_pickup_datetime: int  # epoch milliseconds\n\n\ndef ride_from_row(row):\n    return Ride(\n        PULocationID=int(row['PULocationID']),\n        DOLocationID=int(row['DOLocationID']),\n        trip_distance=float(row['trip_distance']),\n        total_amount=float(row['total_amount']),\n        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),\n    )\n\n\ndef ride_serializer(ride):\n    ride_dict = dataclasses.asdict(ride)\n    ride_json = json.dumps(ride_dict).encode('utf-8')\n    return ride_json\n\n\ndef ride_deserializer(data):\n    json_str = data.decode('utf-8')\n    ride_dict = json.loads(json_str)\n    return Ride(**ride_dict)\n"
  },
  {
    "path": "07-streaming/workshop/live/src/producers/producer_realtime.py",
    "content": "import dataclasses\nimport json\nimport random\nimport sys\nimport time\nfrom datetime import datetime, timezone\nfrom pathlib import Path\n\nsys.path.insert(0, str(Path(__file__).parent.parent))\n\nfrom kafka import KafkaProducer\nfrom models import Ride\n\n# Top pickup locations from the actual NYC yellow taxi data.\n# PULocationID is a taxi zone ID (1-263) defined by the NYC TLC.\n# See https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv\nPICKUP_LOCATIONS = [\n    79,   # East Village, Manhattan\n    107,  # Gramercy, Manhattan\n    48,   # Clinton East (Hell's Kitchen), Manhattan\n    132,  # JFK Airport\n    234,  # Union Sq, Manhattan\n    148,  # Lower East Side, Manhattan\n    249,  # West Village, Manhattan\n    68,   # East Chelsea, Manhattan\n    90,   # Flatiron, Manhattan\n    263,  # Yorkville West, Manhattan\n    138,  # LaGuardia Airport\n    230,  # Times Sq/Theatre District, Manhattan\n    161,  # Midtown Center, Manhattan\n    162,  # Midtown East, Manhattan\n    170,  # Murray Hill, Manhattan\n    237,  # Upper East Side South, Manhattan\n    239,  # Upper West Side South, Manhattan\n    186,  # Penn Station/Madison Sq West, Manhattan\n    164,  # Midtown South, Manhattan\n    236,  # Upper East Side North, Manhattan\n]\n\nDROPOFF_LOCATIONS = PICKUP_LOCATIONS  # same pool for simplicity\n\n\ndef make_ride(delay_seconds=0):\n    now_ms = int(time.time() * 1000) - delay_seconds * 1000\n    return Ride(\n        PULocationID=random.choice(PICKUP_LOCATIONS),\n        DOLocationID=random.choice(DROPOFF_LOCATIONS),\n        trip_distance=round(random.uniform(0.5, 20.0), 2),\n        total_amount=round(random.uniform(5.0, 100.0), 2),\n        tpep_pickup_datetime=now_ms,\n    )\n\n\ndef ride_serializer(ride):\n    return json.dumps(dataclasses.asdict(ride)).encode('utf-8')\n\n\nserver = 'localhost:9092'\nproducer = KafkaProducer(\n    bootstrap_servers=[server],\n    value_serializer=ride_serializer,\n)\n\ntopic_name = 
'rides'\ncount = 0\n\nprint(\"Sending events (Ctrl+C to stop)...\")\nprint()\n\ntry:\n    while True:\n        # ~20% chance of a late event (3-10 seconds old)\n        if random.random() < 0.2:\n            delay = random.randint(3, 10)\n            ride = make_ride(delay_seconds=delay)\n            ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc)\n            print(f\"  LATE ({delay}s) -> PU={ride.PULocationID} ts={ts:%H:%M:%S}\")\n        else:\n            ride = make_ride()\n            ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc)\n            print(f\"  on time   -> PU={ride.PULocationID} ts={ts:%H:%M:%S}\")\n\n        producer.send(topic_name, value=ride)\n        count += 1\n        time.sleep(0.5)\n\nexcept KeyboardInterrupt:\n    producer.flush()\n    print(f\"\\nSent {count} events\")\n"
  },
  {
    "path": "07-streaming/workshop/pyproject.flink.toml",
    "content": "[project]\nname = \"pyflink-workshop\"\nversion = \"0.1.0\"\nrequires-python = \">=3.12\"\ndependencies = [\n    \"apache-flink==2.2.0\",\n]\n"
  },
  {
    "path": "07-streaming/workshop/pyproject.toml",
    "content": "[project]\nname = \"workshop\"\nversion = \"0.1.0\"\ndescription = \"PyFlink Stream Processing Workshop\"\nrequires-python = \">=3.12\"\ndependencies = [\n    \"kafka-python>=2.3.0\",\n    \"pandas>=2.2.0\",\n    \"psycopg2-binary>=2.9.11\",\n    \"pyarrow>=19.0.0\",\n]\n"
  },
  {
    "path": "07-streaming/workshop/src/consumers/consumer.py",
    "content": "import sys\nfrom datetime import datetime\nfrom pathlib import Path\n\nsys.path.insert(0, str(Path(__file__).parent.parent))\n\nfrom kafka import KafkaConsumer\nfrom models import ride_deserializer\n\nserver = 'localhost:9092'\ntopic_name = 'rides'\n\nconsumer = KafkaConsumer(\n    topic_name,\n    bootstrap_servers=[server],\n    auto_offset_reset='earliest',\n    group_id='rides-console',\n    value_deserializer=ride_deserializer\n)\n\nprint(f\"Listening to {topic_name}...\")\n\ncount = 0\nfor message in consumer:\n    ride = message.value\n    pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)\n    print(f\"Received: PU={ride.PULocationID}, DO={ride.DOLocationID}, \"\n          f\"distance={ride.trip_distance}, amount=${ride.total_amount:.2f}, \"\n          f\"pickup={pickup_dt}\")\n    count += 1\n    if count >= 10:\n        print(f\"\\n... received {count} messages so far (stopping after 10 for demo)\")\n        break\n\nconsumer.close()\n"
  },
  {
    "path": "07-streaming/workshop/src/consumers/consumer_postgres.py",
    "content": "import sys\nfrom datetime import datetime\nfrom pathlib import Path\n\nsys.path.insert(0, str(Path(__file__).parent.parent))\n\nimport psycopg2\nfrom kafka import KafkaConsumer\nfrom models import ride_deserializer\n\nserver = 'localhost:9092'\ntopic_name = 'rides'\n\n# Connect to PostgreSQL\nconn = psycopg2.connect(\n    host='localhost',\n    port=5432,\n    database='postgres',\n    user='postgres',\n    password='postgres'\n)\nconn.autocommit = True\ncur = conn.cursor()\n\nconsumer = KafkaConsumer(\n    topic_name,\n    bootstrap_servers=[server],\n    auto_offset_reset='earliest',\n    group_id='rides-to-postgres',\n    value_deserializer=ride_deserializer\n)\n\nprint(f\"Listening to {topic_name} and writing to PostgreSQL...\")\n\ncount = 0\nfor message in consumer:\n    ride = message.value\n    pickup_dt = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000)\n    cur.execute(\n        \"\"\"INSERT INTO processed_events\n           (PULocationID, DOLocationID, trip_distance, total_amount, pickup_datetime)\n           VALUES (%s, %s, %s, %s, %s)\"\"\",\n        (ride.PULocationID, ride.DOLocationID,\n         ride.trip_distance, ride.total_amount, pickup_dt)\n    )\n    count += 1\n    if count % 100 == 0:\n        print(f\"Inserted {count} rows...\")\n\nconsumer.close()\ncur.close()\nconn.close()\n"
  },
  {
    "path": "07-streaming/workshop/src/job/aggregation_job.py",
    "content": "from pyflink.datastream import StreamExecutionEnvironment\nfrom pyflink.table import EnvironmentSettings, StreamTableEnvironment\n\n\ndef create_events_aggregated_sink(t_env):\n    table_name = 'processed_events_aggregated'\n    sink_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            window_start TIMESTAMP(3),\n            PULocationID INT,\n            num_trips BIGINT,\n            total_revenue DOUBLE,\n            PRIMARY KEY (window_start, PULocationID) NOT ENFORCED\n        ) WITH (\n            'connector' = 'jdbc',\n            'url' = 'jdbc:postgresql://postgres:5432/postgres',\n            'table-name' = '{table_name}',\n            'username' = 'postgres',\n            'password' = 'postgres',\n            'driver' = 'org.postgresql.Driver'\n        );\n        \"\"\"\n    t_env.execute_sql(sink_ddl)\n    return table_name\n\ndef create_events_source_kafka(t_env):\n    table_name = \"events\"\n    source_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            trip_distance DOUBLE,\n            total_amount DOUBLE,\n            tpep_pickup_datetime BIGINT,\n            event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),\n            WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND\n        ) WITH (\n            'connector' = 'kafka',\n            'properties.bootstrap.servers' = 'redpanda:29092',\n            'topic' = 'rides',\n            'scan.startup.mode' = 'earliest-offset',\n            'properties.auto.offset.reset' = 'earliest',\n            'format' = 'json'\n        );\n        \"\"\"\n    t_env.execute_sql(source_ddl)\n    return table_name\n\n\ndef log_aggregation():\n    # Set up the execution environment\n    env = StreamExecutionEnvironment.get_execution_environment()\n    env.enable_checkpointing(10 * 1000)\n    env.set_parallelism(3)\n\n    # Set up the table environment\n    settings = 
EnvironmentSettings.new_instance().in_streaming_mode().build()\n    t_env = StreamTableEnvironment.create(env, environment_settings=settings)\n\n    try:\n        # Create Kafka table\n        source_table = create_events_source_kafka(t_env)\n        aggregated_table = create_events_aggregated_sink(t_env)\n\n        t_env.execute_sql(f\"\"\"\n        INSERT INTO {aggregated_table}\n        SELECT\n            window_start,\n            PULocationID,\n            COUNT(*) AS num_trips,\n            SUM(total_amount) AS total_revenue\n        FROM TABLE(\n            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '1' HOUR)\n        )\n        GROUP BY window_start, PULocationID;\n\n        \"\"\").wait()\n\n    except Exception as e:\n        print(\"Writing records from Kafka to JDBC failed:\", str(e))\n\n\nif __name__ == '__main__':\n    log_aggregation()\n"
  },
  {
    "path": "07-streaming/workshop/src/job/aggregation_job_demo.py",
    "content": "\"\"\"\nDemo aggregation job with 10-second tumbling windows.\n\nUse with producer_realtime.py to observe watermark behavior:\n- Watermark = event_timestamp - 5 seconds\n- Late events (<=5s) arrive before the watermark closes the window -> included\n- Late events (>5s) may arrive after the watermark closes the window -> dropped\n\"\"\"\n\nfrom pyflink.datastream import StreamExecutionEnvironment\nfrom pyflink.table import EnvironmentSettings, StreamTableEnvironment\n\n\ndef create_events_source_kafka(t_env):\n    table_name = \"events\"\n    source_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            trip_distance DOUBLE,\n            total_amount DOUBLE,\n            tpep_pickup_datetime BIGINT,\n            event_timestamp AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),\n            WATERMARK for event_timestamp as event_timestamp - INTERVAL '5' SECOND\n        ) WITH (\n            'connector' = 'kafka',\n            'properties.bootstrap.servers' = 'redpanda:29092',\n            'topic' = 'rides',\n            'scan.startup.mode' = 'latest-offset',\n            'properties.auto.offset.reset' = 'latest',\n            'format' = 'json'\n        );\n        \"\"\"\n    t_env.execute_sql(source_ddl)\n    return table_name\n\n\ndef create_events_aggregated_sink(t_env):\n    table_name = 'processed_events_aggregated'\n    sink_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            window_start TIMESTAMP(3),\n            PULocationID INT,\n            num_trips BIGINT,\n            total_revenue DOUBLE,\n            PRIMARY KEY (window_start, PULocationID) NOT ENFORCED\n        ) WITH (\n            'connector' = 'jdbc',\n            'url' = 'jdbc:postgresql://postgres:5432/postgres',\n            'table-name' = '{table_name}',\n            'username' = 'postgres',\n            'password' = 'postgres',\n            'driver' = 'org.postgresql.Driver'\n        );\n    
    \"\"\"\n    t_env.execute_sql(sink_ddl)\n    return table_name\n\n\ndef log_aggregation():\n    env = StreamExecutionEnvironment.get_execution_environment()\n    env.enable_checkpointing(10 * 1000)\n    env.set_parallelism(1)\n\n    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()\n    t_env = StreamTableEnvironment.create(env, environment_settings=settings)\n\n    try:\n        source_table = create_events_source_kafka(t_env)\n        aggregated_table = create_events_aggregated_sink(t_env)\n\n        # 10-second tumbling windows (instead of 1 hour) so we can\n        # observe windows closing and late events being dropped\n        t_env.execute_sql(f\"\"\"\n        INSERT INTO {aggregated_table}\n        SELECT\n            window_start,\n            PULocationID,\n            COUNT(*) AS num_trips,\n            SUM(total_amount) AS total_revenue\n        FROM TABLE(\n            TUMBLE(TABLE {source_table}, DESCRIPTOR(event_timestamp), INTERVAL '10' SECOND)\n        )\n        GROUP BY window_start, PULocationID;\n\n        \"\"\").wait()\n\n    except Exception as e:\n        print(\"Writing records from Kafka to JDBC failed:\", str(e))\n\n\nif __name__ == '__main__':\n    log_aggregation()\n"
  },
  {
    "path": "07-streaming/workshop/src/job/pass_through_job.py",
    "content": "from pyflink.datastream import StreamExecutionEnvironment\nfrom pyflink.table import EnvironmentSettings, StreamTableEnvironment\n\n\ndef create_processed_events_sink_postgres(t_env):\n    table_name = 'processed_events'\n    sink_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            trip_distance DOUBLE,\n            total_amount DOUBLE,\n            pickup_datetime TIMESTAMP\n        ) WITH (\n            'connector' = 'jdbc',\n            'url' = 'jdbc:postgresql://postgres:5432/postgres',\n            'table-name' = '{table_name}',\n            'username' = 'postgres',\n            'password' = 'postgres',\n            'driver' = 'org.postgresql.Driver'\n        );\n        \"\"\"\n    t_env.execute_sql(sink_ddl)\n    return table_name\n\n\ndef create_events_source_kafka(t_env):\n    table_name = \"events\"\n    source_ddl = f\"\"\"\n        CREATE TABLE {table_name} (\n            PULocationID INTEGER,\n            DOLocationID INTEGER,\n            trip_distance DOUBLE,\n            total_amount DOUBLE,\n            tpep_pickup_datetime BIGINT\n        ) WITH (\n            'connector' = 'kafka',\n            'properties.bootstrap.servers' = 'redpanda:29092',\n            'topic' = 'rides',\n            'scan.startup.mode' = 'latest-offset',\n            'properties.auto.offset.reset' = 'latest',\n            'format' = 'json'\n        );\n        \"\"\"\n    t_env.execute_sql(source_ddl)\n    return table_name\n\ndef log_processing():\n    # Set up the execution environment\n    env = StreamExecutionEnvironment.get_execution_environment()\n    env.enable_checkpointing(10 * 1000)\n\n    # Set up the table environment\n    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()\n    t_env = StreamTableEnvironment.create(env, environment_settings=settings)\n    try:\n        # Create Kafka table\n        source_table = 
create_events_source_kafka(t_env)\n        postgres_sink = create_processed_events_sink_postgres(t_env)\n        # write records to postgres\n        t_env.execute_sql(\n            f\"\"\"\n                    INSERT INTO {postgres_sink}\n                    SELECT\n                        PULocationID,\n                        DOLocationID,\n                        trip_distance,\n                        total_amount,\n                        TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3) as pickup_datetime\n                    FROM {source_table}\n                    \"\"\"\n        ).wait()\n\n    except Exception as e:\n        print(\"Writing records from Kafka to JDBC failed:\", str(e))\n\n\nif __name__ == '__main__':\n    log_processing()\n"
  },
  {
    "path": "07-streaming/workshop/src/models.py",
    "content": "import json\nfrom dataclasses import dataclass\n\n\n@dataclass\nclass Ride:\n    PULocationID: int\n    DOLocationID: int\n    trip_distance: float\n    total_amount: float\n    tpep_pickup_datetime: int  # epoch milliseconds\n\n\ndef ride_from_row(row):\n    return Ride(\n        PULocationID=int(row['PULocationID']),\n        DOLocationID=int(row['DOLocationID']),\n        trip_distance=float(row['trip_distance']),\n        total_amount=float(row['total_amount']),\n        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),\n    )\n\n\ndef ride_deserializer(data):\n    json_str = data.decode('utf-8')\n    ride_dict = json.loads(json_str)\n    return Ride(**ride_dict)\n"
  },
  {
    "path": "07-streaming/workshop/src/producers/producer.py",
    "content": "import dataclasses\nimport json\nimport sys\nimport time\nfrom pathlib import Path\n\nsys.path.insert(0, str(Path(__file__).parent.parent))\n\nimport pandas as pd\nfrom kafka import KafkaProducer\nfrom models import Ride, ride_from_row\n\n# Download NYC yellow taxi trip data (first 1000 rows)\nurl = \"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet\"\ncolumns = ['PULocationID', 'DOLocationID', 'trip_distance', 'total_amount', 'tpep_pickup_datetime']\ndf = pd.read_parquet(url, columns=columns).head(1000)\n\ndef ride_serializer(ride):\n    ride_dict = dataclasses.asdict(ride)\n    json_str = json.dumps(ride_dict)\n    return json_str.encode('utf-8')\n\nserver = 'localhost:9092'\n\nproducer = KafkaProducer(\n    bootstrap_servers=[server],\n    value_serializer=ride_serializer\n)\nt0 = time.time()\n\ntopic_name = 'rides'\n\nfor _, row in df.iterrows():\n    ride = ride_from_row(row)\n    producer.send(topic_name, value=ride)\n    print(f\"Sent: {ride}\")\n    time.sleep(0.01)\n\nproducer.flush()\n\nt1 = time.time()\nprint(f'took {(t1 - t0):.2f} seconds')\n"
  },
  {
    "path": "07-streaming/workshop/src/producers/producer_realtime.py",
    "content": "import dataclasses\nimport json\nimport random\nimport sys\nimport time\nfrom datetime import datetime, timezone\nfrom pathlib import Path\n\nsys.path.insert(0, str(Path(__file__).parent.parent))\n\nfrom kafka import KafkaProducer\nfrom models import Ride\n\n# Top pickup locations from the actual NYC yellow taxi data.\n# PULocationID is a taxi zone ID (1-263) defined by the NYC TLC.\n# See https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv\nPICKUP_LOCATIONS = [\n    79,   # East Village, Manhattan\n    107,  # Gramercy, Manhattan\n    48,   # Clinton East (Hell's Kitchen), Manhattan\n    132,  # JFK Airport\n    234,  # Union Sq, Manhattan\n    148,  # Lower East Side, Manhattan\n    249,  # West Village, Manhattan\n    68,   # East Chelsea, Manhattan\n    90,   # Flatiron, Manhattan\n    263,  # Yorkville West, Manhattan\n    138,  # LaGuardia Airport\n    230,  # Times Sq/Theatre District, Manhattan\n    161,  # Midtown Center, Manhattan\n    162,  # Midtown East, Manhattan\n    170,  # Murray Hill, Manhattan\n    237,  # Upper East Side South, Manhattan\n    239,  # Upper West Side South, Manhattan\n    186,  # Penn Station/Madison Sq West, Manhattan\n    164,  # Midtown South, Manhattan\n    236,  # Upper East Side North, Manhattan\n]\n\nDROPOFF_LOCATIONS = PICKUP_LOCATIONS  # same pool for simplicity\n\n\ndef make_ride(delay_seconds=0):\n    now_ms = int(time.time() * 1000) - delay_seconds * 1000\n    return Ride(\n        PULocationID=random.choice(PICKUP_LOCATIONS),\n        DOLocationID=random.choice(DROPOFF_LOCATIONS),\n        trip_distance=round(random.uniform(0.5, 20.0), 2),\n        total_amount=round(random.uniform(5.0, 100.0), 2),\n        tpep_pickup_datetime=now_ms,\n    )\n\n\ndef ride_serializer(ride):\n    return json.dumps(dataclasses.asdict(ride)).encode('utf-8')\n\n\nserver = 'localhost:9092'\nproducer = KafkaProducer(\n    bootstrap_servers=[server],\n    value_serializer=ride_serializer,\n)\n\ntopic_name = 
'rides'\ncount = 0\n\nprint(\"Sending events (Ctrl+C to stop)...\")\nprint()\n\ntry:\n    while True:\n        # ~20% chance of a late event (3-10 seconds old)\n        if random.random() < 0.2:\n            delay = random.randint(3, 10)\n            ride = make_ride(delay_seconds=delay)\n            ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc)\n            print(f\"  LATE ({delay}s) -> PU={ride.PULocationID} ts={ts:%H:%M:%S}\")\n        else:\n            ride = make_ride()\n            ts = datetime.fromtimestamp(ride.tpep_pickup_datetime / 1000, tz=timezone.utc)\n            print(f\"  on time   -> PU={ride.PULocationID} ts={ts:%H:%M:%S}\")\n\n        producer.send(topic_name, value=ride)\n        count += 1\n        time.sleep(0.5)\n\nexcept KeyboardInterrupt:\n    producer.flush()\n    print(f\"\\nSent {count} events\")\n"
  },
  {
    "path": "README.md",
    "content": "<p align=\"center\">\n  <img width=\"100%\" src=\"/images/architecture/arch_v5_workshops.png\" alt=\"Data Engineering Zoomcamp Overview\">\n</p>\n\n<h1 align=\"center\">\n    <strong>Data Engineering Zoomcamp: A Free 9-Week Course on Data Engineering Fundamentals</strong>\n</h1>\n\n<p align=\"center\">\nMaster the fundamentals of data engineering by building an end-to-end data pipeline from scratch. Gain hands-on experience with industry-standard tools and best practices.\n</p>\n\n<p align=\"center\">\n<a href=\"https://airtable.com/shr6oVXeQvSI5HuWD\"><img src=\"https://user-images.githubusercontent.com/875246/185755203-17945fd1-6b64-46f2-8377-1011dcb1a444.png\" height=\"50\" /></a>\n</p>\n\n<p align=\"center\">\n<a href=\"https://datatalks.club/slack.html\">Join Slack</a> •\n<a href=\"https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG\">#course-data-engineering Channel</a> •\n<a href=\"https://t.me/dezoomcamp\">Telegram Announcements</a> •\n<a href=\"https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb\">Course Playlist</a> •\n<a href=\"https://datatalks.club/faq/data-engineering-zoomcamp.html\">FAQ</a>\n</p>\n\n## How to Enroll\n\n### 2026 Cohort\n- **Start Date**: 12 January 2026\n- **Register Here**: [Sign up](https://airtable.com/shr6oVXeQvSI5HuWD)\n\n### Self-Paced Learning\nAll course materials are freely available for independent study. Follow these steps:\n1. Watch the course videos.\n2. Join the [Slack community](https://datatalks.club/slack.html).\n3. 
Refer to the [FAQ document](https://datatalks.club/faq/data-engineering-zoomcamp.html) for guidance.\n\n## Syllabus Overview\nThe course consists of structured modules, hands-on workshops, and a final project to reinforce your learning.\n\n### **Prerequisites**\nTo get the most out of this course, you should have:\n- Basic coding experience\n- Familiarity with SQL\n- Experience with Python (helpful but not required)\n\nNo prior data engineering experience is necessary.\n\n### **Modules**\n\n#### [Module 1: Containerization and Infrastructure as Code](01-docker-terraform/)\n- Introduction to GCP\n- Docker and Docker Compose\n- Running PostgreSQL with Docker\n- Infrastructure setup with Terraform\n- Homework\n\n#### [Module 2: Workflow Orchestration](02-workflow-orchestration/)\n- Data Lakes and Workflow Orchestration\n- Workflow orchestration with Kestra\n- Homework\n\n#### [Workshop 1: Data Ingestion](cohorts/2026/workshops/dlt.md)\n- API reading and pipeline scalability\n- Data normalization and incremental loading\n- Homework\n\n#### [Module 3: Data Warehousing](03-data-warehouse/)\n- Introduction to BigQuery\n- Partitioning, clustering, and best practices\n- Machine learning in BigQuery\n\n#### [Module 4: Analytics Engineering](04-analytics-engineering/)\n- Analytics Engineering and Data Modeling\n- dbt (data build tool) with DuckDB & BigQuery\n- Testing, documentation, and deployment\n\n#### [Module 5: Data Platforms](05-data-platforms/)\n- Building end-to-end data pipelines with Bruin\n- Data ingestion, transformation, and quality\n- Deployment to cloud (BigQuery)\n\n#### [Module 6: Batch Processing](06-batch/)\n- Introduction to Apache Spark\n- DataFrames and SQL\n- Internals of GroupBy and Joins\n\n#### [Module 7: Streaming](07-streaming/)\n- Introduction to Kafka\n- Kafka Streams and KSQL\n- Schema management with Avro\n\n#### [Final Project](projects/)\n- Apply all concepts learned in a real-world scenario\n- Peer review and feedback process\n\n## 
Testimonials\n> Thank you for what you do! The Data Engineering Zoomcamp gave me skills that helped me land my first tech job.\n> \n> — [Tim Claytor](https://www.linkedin.com/in/claytor/) ([Source](https://www.linkedin.com/feed/update/urn:li:activity:7396882073308938240?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7396882073308938240%2C7396889959711793152%29&dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287396889959711793152%2Curn%3Ali%3Aactivity%3A7396882073308938240%29))\n\n> Three months might seem like a long time, but the growth and learning during this period are truly remarkable. It was a great experience with a lot of learning, connecting with like-minded people from all around the world, and having fun. I must admit, this was really hard. But the feeling of accomplishment and learning made it all worthwhile. And I would do it again!\n>\n> — [Nevenka Lukic](https://www.linkedin.com/in/nevenka-lukic/) ([Source](https://www.linkedin.com/posts/nevenka-lukic_data-engineering-zoomcamp-final-project-activity-7181985646033461248-Lc1O?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4))\n\n> One of the significant things I inferred from the Zoomcamp is to prioritize fundamentals and principles over ever-evolving tools and tech stacks. Hugely grateful to Alexey Grigorev for putting together this incredible course and offering it for free.\n>\n> — [Siddhartha Gogoi](https://www.linkedin.com/in/siddhartha-gogoi/) ([Source](https://www.linkedin.com/posts/activity-7325692407675604992-XSKI?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4))\n\n> Such a fun deep dive into data engineering, cloud automation, and orchestration. I learned so much along the way. 
Big shoutout to Alexey Grigorev and the DataTalksClub team for the opportunity and guidance throughout the 3 months of the free course.\n>\n> — [Assitan NIARE](https://www.linkedin.com/in/assitan-niar%C3%A9-data/) ([Source](https://www.linkedin.com/posts/activity-7317441554023874561-E3wm?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4))\n\n> If you’re serious about breaking into data engineering, start here. The repo’s structure, community, and hands-on focus make it unparalleled.\n> \n> — [Wady Osama](https://www.linkedin.com/in/wadyosama/) ([Source](https://www.linkedin.com/posts/wadyosama_dataengineering-zoomcamp-dezoomcamp-activity-7292126824711520258-puJm?utm_source=share&utm_medium=member_desktop&rcm=ACoAADJu9vMBW6iyIYswCQnN6t8UJLkXH2tQPi4))\n\n## Community & Support\n\n### **Getting Help on Slack**\nJoin the [`#course-data-engineering`](https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG) channel on [DataTalks.Club Slack](https://datatalks.club/slack.html) for discussions, troubleshooting, and networking.\n\nTo keep discussions organized:\n- Follow [our guidelines](asking-questions.md) when posting questions.\n- Review the [community guidelines](https://datatalks.club/slack/guidelines.html).\n\n## Meet the Instructors\n\n- [Alexey Grigorev](https://linkedin.com/in/agrigorev)\n- [Michael Shoemaker](https://www.linkedin.com/in/michaelshoemaker1/)\n- [Will Russell](https://www.linkedin.com/in/wrussell1999/)\n- [Anna Geller](https://www.linkedin.com/in/anna-geller-12a86811a/)\n- [Juan Manuel Perafan](https://www.linkedin.com/in/jmperafan/)\n- [Arsalan Noorafkan](https://www.linkedin.com/in/arsalan0/)\n\nPast instructors:\n\n- [Victoria Perez Mola](https://www.linkedin.com/in/victoriaperezmola/)\n- [Ankush Khanna](https://linkedin.com/in/ankushkhanna2)\n- [Sejal Vaidya](https://www.linkedin.com/in/vaidyasejal/)\n- [Irem Erturk](https://www.linkedin.com/in/iremerturk/)\n- [Luis 
Oliveira](https://www.linkedin.com/in/lgsoliveira/)\n- [Zach Wilson](https://www.linkedin.com/in/eczachly)\n\n## Sponsors & Supporters\nA special thanks to our course sponsors for making this initiative possible!\n\n<p align=\"center\">\n  <a href=\"https://kestra.io/\">\n    <img height=\"120\" src=\"images/kestra.svg\">\n  </a>\n</p>\n\n<p align=\"center\">\n  <a href=\"https://getbruin.com/\">\n    <img height=\"110\" src=\"images/bruin.svg\">\n  </a>\n</p>\n\n\n<p align=\"center\">\n  <a href=\"https://dlthub.com/\">\n    <img height=\"90\" src=\"images/dlthub.png\">\n  </a>\n</p>\n\nInterested in supporting our community? Reach out to [alexey@datatalks.club](mailto:alexey@datatalks.club).\n\n## About DataTalks.Club\n\n<p align=\"center\">\n  <img width=\"40%\" src=\"https://github.com/user-attachments/assets/1243a44a-84c8-458d-9439-aaf6f3a32d89\" alt=\"DataTalks.Club\">\n</p>\n\n<p align=\"center\">\n<a href=\"https://datatalks.club/\">DataTalks.Club</a> is a global online community of data enthusiasts. It's a place to discuss data, learn, share knowledge, ask and answer questions, and support each other.\n</p>\n\n<p align=\"center\">\n<a href=\"https://datatalks.club/\">Website</a> •\n<a href=\"https://datatalks.club/slack.html\">Join Slack Community</a> •\n<a href=\"https://us19.campaign-archive.com/home/?u=0d7822ab98152f5afc118c176&id=97178021aa\">Newsletter</a> •\n<a href=\"http://lu.ma/dtc-events\">Upcoming Events</a> •\n<a href=\"https://www.youtube.com/@DataTalksClub/featured\">YouTube</a> •\n<a href=\"https://github.com/DataTalksClub\">GitHub</a> •\n<a href=\"https://www.linkedin.com/company/datatalks-club/\">LinkedIn</a> •\n<a href=\"https://twitter.com/DataTalksClub\">Twitter</a>\n</p>\n\nAll the activity at DataTalks.Club mainly happens on [Slack](https://datatalks.club/slack.html). 
We post updates there and discuss different aspects of data, career questions, and more.\n\nAt DataTalksClub, we organize online events, community activities, and free courses. You can learn more about what we do at [DataTalksClub Community Navigation](https://www.notion.so/DataTalksClub-Community-Navigation-bf070ad27ba44bf6bbc9222082f0e5a8?pvs=21).\n\n"
  },
  {
    "path": "after-sign-up.md",
    "content": "## Thank you!\n\nThanks for signing up for the course.\n\nThe process of adding you to the mailing list is not automated yet, \nbut you will hear from us closer to the course start. \n\nTo make sure you don't miss any announcements\n\n- Register in [DataTalks.Club's Slack](https://datatalks.club/slack.html) and\n  join the [`#course-data-engineering`](https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG) channel\n- Join the [course Telegram channel with announcements](https://t.me/dezoomcamp)\n- Subscribe to [DataTalks.Club's YouTube channel](https://www.youtube.com/c/DataTalksClub) and check \n  [the course playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\nSee you in January!\n"
  },
  {
    "path": "asking-questions.md",
    "content": "## Asking questions\n\nIf you have any questions, ask them \nin the [`#course-data-engineering`](https://app.slack.com/client/T01ATQK62F8/C01FABYF2RG) channel in [DataTalks.Club](https://datatalks.club) slack.\n\nTo keep our discussion in Slack more organized, we ask you to follow these suggestions:\n\n* First, review How to troubleshoot issues listed below.\n* Before asking a question, check the [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html).\n* Before asking a question review the [Slack Guidelines](#Ask-in-Slack).\n* If somebody helped you with your problem and it's not in [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html), please add it there.\n  It'll help other students.\n* Zed Shaw (of the Learn the Hard Way series) has [a great post on how to help others help you](https://learncodethehardway.com/blog/03-how-to-ask-for-help/)\n* Check [Stackoverflow guide on asking](https://stackoverflow.com/help/how-to-ask)\n  \n### How to troubleshoot issues\n\nThe first step is to try to solve the issue on you own; get used to solving problems. This will be a real life skill you need when employed.\n\n1. What does the error say? There will often be a description of the error or instructions on what is needed, I have even seen a link to the solution. Does it reference a specific line of your code?\n2. Restart the application or server/pc. \n3. Google it. It is going to be rare that you are the first to have the problem, someone out there has posted the issue and likely the solution. Search using: **technology** **problem statement**. Example: `pgcli error column c.relhasoids does not exist`. \n    * There are often different solutions for the same problem due to variation in environments. \n4. Check the tech’s documentation. Use its search if available or use the browser's search function. \n5. Try uninstall (this may remove the bad actor) and reinstall of application or re-implementation of action. 
Don’t forget to restart the server/pc for reinstalls.\n    * Sometimes reinstalling fails to resolve the issue but works if you uninstall first.\n6. Ask in Slack\n7. Take a break and come back to it later. You will be amazed at how often you figure out the solution after letting your brain rest. Get some fresh air, work out, play a video game, watch a TV show, whatever allows your brain to not think about it for a little while or even until the next day. \n8. Remember that technology issues in real life sometimes take days or even weeks to resolve.\n\n### Asking in Slack\n\n* Before asking a question, check the [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html).\n* DO NOT use screenshots, and especially don’t post pictures taken with a phone.\n* DO NOT tag instructors; it may discourage others from helping you.\n* Copy and paste errors; if an error is long, post it in a reply to your thread. \n* Use ``` for formatting your code.\n* Use the same thread for the conversation (that means replying to your own thread). \n* DO NOT create multiple posts to discuss the issue.\n* You may create a new post if the issue reemerges down the road. Be sure to describe what has changed in the environment.\n* In the same thread, provide additional information about the steps you have taken toward a resolution.\n  \n\n\n"
  },
  {
    "path": "awesome-data-engineering.md",
    "content": "Have you found any cool resources about data engineering? Put them here\n\n## Learning Data Engineering\n\n### Courses\n\n* [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp) by DataTalks.Club (free)\n* [Big Data Platforms, Autumn 2022: Introduction to Big Data Processing Frameworks](https://big-data-platforms-22.mooc.fi/) by the University of Helsinki (free)   \n* [Awesome Data Engineering Learning Path](https://awesomedataengineering.com/)\n\n\n### Books\n\n* [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)\n* [Big Data: Principles and Best Practices of Scalable Realtime Data Systems by Nathan Marz, James Warren](https://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343)\n* [Practical DataOps: Delivering Agile Data Science at Scale by Harvinder Atwal](https://www.amazon.com/Practical-DataOps-Delivering-Agile-Science/dp/1484251032)\n* [Data Pipelines Pocket Reference: Moving and Processing Data for Analytics by James Densmore](https://www.amazon.com/Data-Pipelines-Pocket-Reference-Processing/dp/1492087831)\n* [Best books for data engineering](https://awesomedataengineering.com/data_engineering_best_books)\n* [Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis, Matt Housley](https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302)\n\n\n### Introduction to Data Engineering Terms\n\n* [https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html](https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html) \n\n\n### Data engineering in practice\n\nConference talks from companies, blog posts, etc\n\n* [Uber Data Archives](https://eng.uber.com/category/articles/uberdata/) (Uber engineering blog)\n* [Data Engineering Weekly (DE-focused 
substack)](https://www.dataengineeringweekly.com/)\n* [Seattle Data Guy (DE-focused substack)](https://seattledataguy.substack.com/) \n\n\n## Doing Data Engineering\n\n### Coding & Python\n\n* [CS50's Introduction to Computer Science | edX](https://www.edx.org/course/introduction-computer-science-harvardx-cs50x) (course)\n* [Python for Everybody Specialization](https://www.coursera.org/specializations/python) (course)\n* [Practical Python programming](https://github.com/dabeaz-course/practical-python/blob/master/Notes/Contents.md)\n\n\n### SQL\n\n* [Intro to SQL: Querying and managing data | Khan Academy](https://www.khanacademy.org/computing/computer-programming/sql) \n* [Mode SQL Tutorial](https://mode.com/sql-tutorial/)\n* [Use The Index, Luke](https://use-the-index-luke.com/) (SQL Indexing and Tuning e-Book)\n* [SQL Performance Explained](https://sql-performance-explained.com/) (book)\n\n\n### Workflow orchestration\n\n* [What is DAG?](https://youtu.be/1Yh5S-S6wsI) (video) \n* [Airflow, Prefect, and Dagster: An Inside Look](https://towardsdatascience.com/airflow-prefect-and-dagster-an-inside-look-6074781c9b77) (blog post) \n* [Open-Source Spotlight - Prefect - Kevin Kho](https://www.youtube.com/watch?v=ISLV9JyqF1w) (video) \n* [Prefect as a Data Engineering Project Workflow Tool, with Mary Clair Thompson (Duke) - 11/6/2020](https://youtu.be/HuwA4wLQtCM) (video) \n\n\n### ETL and ELT\n\n* [ETL vs. ELT: What’s the Difference?](https://rivery.io/blog/etl-vs-elt/) (blog post) (print version)\n\n### Data lakes\n\n* [An Introduction to Modern Data Lake Storage Layers (Hudi, Iceberg, Delta Lake)](https://dacort.dev/posts/modern-data-lake-storage-layers/) (blog post) \n* [Lake House Architecture @ Halodoc: Data Platform 2.0](https://blogs.halodoc.io/lake-house-architecture-halodoc-data-platform-2-0/amp/) (blog post) \n\n\n### Data warehousing\n\n\n* [Guide to Data Warehousing. 
Short and comprehensive information… | by Tomas Peluritis](https://medium.com/towards-data-science/guide-to-data-warehousing-6fdcf30b6fbe) (blog post) \n* [Snowflake, Redshift, BigQuery, and Others: Cloud Data Warehouse Tools Compared](https://www.altexsoft.com/blog/snowflake-redshift-bigquery-data-warehouse-tools/) (blog post)\n\n\n### Streaming\n\n\n* Building Streaming Analytics: The Journey and Learnings - Maxim Lukichev\n\n### DataOps\n\n* [DataOps 101 with Lars Albertsson – DataTalks.Club](https://datatalks.club/podcast/s02e11-dataops.html) (podcast)\n\n\n### Monitoring and observability \n\n* [Data Observability: The Next Frontier of Data Engineering with Barr Moses](https://datatalks.club/podcast/s03e03-data-observability.html) (podcast)\n\n\n### Analytics engineering\n\n* [Analytics Engineer: New Role in a Data Team with Victoria Perez Mola](https://datatalks.club/podcast/s03e11-analytics-engineer.html) (podcast)\n* [Modern Data Stack for Analytics Engineering - Kyle Shannon](https://www.youtube.com/watch?v=UmIZIkeOfi0) (video) \n* [Analytics Engineering vs Data Engineering | RudderStack Blog](https://www.rudderstack.com/blog/analytics-engineering-vs-data-engineering) (blog post)\n* [Learn the Fundamentals of Analytics Engineering with dbt](https://courses.getdbt.com/courses/fundamentals) (course)\n\n\n### Data mesh\n\n* [Data Mesh in Practice - Max Schultze](https://www.youtube.com/watch?v=ekEc8D_D3zY) (video)\n\n### Cloud\n\n* [https://acceldataio.medium.com/data-engineering-best-practices-how-netflix-keeps-its-data-infrastructure-cost-effective-dee310bcc910](https://acceldataio.medium.com/data-engineering-best-practices-how-netflix-keeps-its-data-infrastructure-cost-effective-dee310bcc910) \n\n\n### Reverse ETL\n\n* TODO: What is reverse ETL?\n* [https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html](https://datatalks.club/podcast/s05e02-data-engineering-acronyms.html) \n* [Open-Source Spotlight - Grouparoo - Brian 
Leonard](https://www.youtube.com/watch?v=hswlcgQZYuw) (video) \n* [Open-Source Spotlight - Castled.io (Reverse ETL) - Arun Thulasidharan](https://www.youtube.com/watch?v=iW0XhltAUJ8) (video) \n\n## Career in Data Engineering\n\n* [From Data Science to Data Engineering with Ellen König – DataTalks.Club](https://datatalks.club/podcast/s07e08-from-data-science-to-data-engineering.html) (podcast)\n* [Big Data Engineer vs Data Scientist with Roksolana Diachuk – DataTalks.Club](https://datatalks.club/podcast/s04e03-big-data-engineer-vs-data-scientist.html) (podcast)\n* [What Skills Do You Need to Become a Data Engineer](https://www.linkedin.com/pulse/what-skills-do-you-need-become-data-engineer-peng-wang/) (blog post) \n* [The future history of Data Engineering](https://groupby1.substack.com/p/data-engineering?s=r) (blog post) \n* [What Skills Do Data Engineers Need](https://www.theseattledataguy.com/what-skills-do-data-engineers-need/) (blog post)\n\n### Data Engineering Management \n\n* [Becoming a Data Engineering Manager with Rahul Jain – DataTalks.Club](https://datatalks.club/podcast/s07e07-becoming-a-data-engineering-manager.html) (podcast)\n\n## Data engineering projects\n\n* [How To Start A Data Engineering Project - With Data Engineering Project Ideas](https://www.youtube.com/watch?v=WpN47Jddo7I) (video)\n* [Data Engineering Project for Beginners - Batch edition](https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/) (blog post)\n* [Building a Data Engineering Project in 20 Minutes](https://www.sspaeti.com/blog/data-engineering-project-in-twenty-minutes/) (blog post)\n* [Automating Nike Run Club Data Analysis with Python, Airflow and Google Data Studio | by Rich Martin | Medium](https://medium.com/@rich_23525/automating-nike-run-club-data-analysis-with-python-airflow-and-google-data-studio-3c9556478926) (blog post)\n\n\n## Data Engineering Resources \n\n### Blogs\n\n* [Start Data 
Engineering](https://www.startdataengineering.com/)\n\n### Podcasts\n\n* [The Data Engineering Podcast](https://www.dataengineeringpodcast.com/)\n* [DataTalks.Club Podcast](https://datatalks.club/podcast.html) (only some episodes are about data engineering) \n\n### Communities\n\n* [DataTalks.Club](https://datatalks.club/)\n* [/r/dataengineering](https://www.reddit.com/r/dataengineering) \n\n\n### Meetups\n\n* [Sydney Data Engineers](https://sydneydataengineers.github.io/) \n\n### People to follow on Twitter and LinkedIn\n\n* TODO\n\n### YouTube channels\n\n* [Karolina Sowinska - YouTube](https://www.youtube.com/channel/UCAxnMry1lETl47xQWABvH7g)\n* [Seattle Data Guy - YouTube](https://www.youtube.com/c/SeattleDataGuy) \n* [Andreas Kretz - YouTube](https://www.youtube.com/c/andreaskayy) \n* [DataTalksClub - YouTube](https://youtube.com/c/datatalksclub) (only some videos are about data engineering) \n\n### Resource aggregators\n\n* [Reading List](https://www.scling.com/reading-list/) by Lars Albertsson\n* [GitHub - igorbarinov/awesome-data-engineering](https://github.com/igorbarinov/awesome-data-engineering) (focus is more on tools)\n* [GitHub - DataExpert-io/data-engineer-handbook](https://github.com/DataExpert-io/data-engineer-handbook) (contains tools, blogs, and more)\n\n\n\n## License\n\nThis work is licensed under a Creative Commons Attribution 4.0 International License.\n\nCC BY 4.0\n"
  },
  {
    "path": "certificates.md",
    "content": "## Getting your certificate\n\nCongratulations on finishing the course!\n\nYou can find your certificate in your enrollment profile (you need to be logged in):\n\n* For the 2025 edition, it's https://courses.datatalks.club/de-zoomcamp-2025/enrollment\n\nIf you can't find a certificate in your profile, it means you didn't pass the project.\nIf you believe it's a mistake, write in the course channel in Slack.\n\n\n## Adding to LinkedIn\n\nYou can add your certificate to LinkedIn:\n\n* Log in to your LinkedIn account, then go to your profile.\n* On the right, in the \"Add profile\" section dropdown, choose \"Background\" and then select the drop-down triangle next to \"Licenses & Certifications\".\n* In \"Name\", enter \"Data Engineering Zoomcamp\".\n* In \"Issuing Organization\", enter \"DataTalksClub\".\n* (Optional) In \"Issue Date\", enter the time when the certificate was created.\n* (Optional) Select the checkbox This certification does not expire. \n* Put your certificate ID.\n* In \"Certification URL\", enter the URL for your certificate.\n\n[Adapted from here](https://support.edx.org/hc/en-us/articles/206501938-How-can-I-add-my-certificate-to-my-LinkedIn-profile-)\n"
  },
  {
    "path": "cohorts/2022/README.md",
    "content": "\n### 2022 Cohort\n\n* **Start**: 17 January 2022\n* **Registration link**: https://airtable.com/shr6oVXeQvSI5HuWD\n* [Leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vR9oQiYnAVvzL4dagnhvp0sngqagF0AceD0FGjhS-dnzMTBzNQIal3-hOgkTibVQvfuqbQ69b0fvRnf/pubhtml)\n"
  },
  {
    "path": "cohorts/2022/project.md",
    "content": "## Course Project\n\nThe goal of this project is to apply everything we learned\nin this course and build an end-to-end data pipeline.\n\nRemember that to pass the project, you must evaluate 3 peers. If you don't do that, your project can't be considered compelete.\n\n\n### Submitting \n\n#### Project Cohort #2\n\nProject:\n\n* Form: https://forms.gle/JECXB9jYQ1vBXbsw6\n* Deadline: 2 May, 22:00 CET\n\nPeer reviewing:\n\n* Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vShnv8T4iY_5NA8h0nySIS8Wzr-DZGGigEikIW4ZMSi9HlvhaEB4RhwmepVIuIUGaQHS90r5iHR2YXV/pubhtml?gid=964123374&single=true)\n* Form: https://forms.gle/Pb2fBwYLQ3GGFsaK6\n* Deadline: 9 May, 22:00 CET\n\n\n#### Project Cohort #1\n\nProject:\n\n* Form: https://forms.gle/6aeVcEVJipqR2BqC8\n* Deadline: 4 April, 22:00 CET\n\nPeer reviewing:\n\n* Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vShnv8T4iY_5NA8h0nySIS8Wzr-DZGGigEikIW4ZMSi9HlvhaEB4RhwmepVIuIUGaQHS90r5iHR2YXV/pubhtml)\n* Form: https://forms.gle/AZ62bXMp4SGcVUmK7\n* Deadline: 11 April, 22:00 CET\n\nProject feedback: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vRcVCkO-jes5mbPAcikn9X_s2laJ1KhsO8aibHYQxxKqdCUYMVTEJLJQdM8C5aAUWKFl_0SJW4rme7H/pubhtml)\n"
  },
  {
    "path": "cohorts/2022/week_1_basics_n_setup/homework.md",
    "content": "## Week 1 Homework\n\nIn this homework we'll prepare the environment \nand practice with terraform and SQL\n\n\n## Question 1. Google Cloud SDK\n\nInstall Google Cloud SDK. What's the version you have? \n\nTo get the version, run `gcloud --version`\n\n## Google Cloud account \n\nCreate an account in Google Cloud and create a project.\n\n\n## Question 2. Terraform \n\nNow install terraform and go to the terraform directory (`week_1_basics_n_setup/1_terraform_gcp/terraform`)\n\nAfter that, run\n\n* `terraform init`\n* `terraform plan`\n* `terraform apply` \n\nApply the plan and copy the output (after running `apply`) to the form.\n\nIt should be the entire output - from the moment you typed `terraform init` to the very end.\n\n## Prepare Postgres \n\nRun Postgres and load data as shown in the videos\n\nWe'll use the yellow taxi trips from January 2021:\n\n```bash\nwget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv\n```\n\nYou will also need the dataset with zones:\n\n```bash \nwget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv\n```\n\nDownload this data and put it to Postgres\n\n## Question 3. Count records \n\nHow many taxi trips were there on January 15?\n\nConsider only trips that started on January 15.\n\n\n## Question 4. Largest tip for each day\n\nFind the largest tip for each day. \nOn which day it was the largest tip in January?\n\nUse the pick up time for your calculations.\n\n(note: it's not a typo, it's \"tip\", not \"trip\")\n\n\n## Question 5. Most popular destination\n\nWhat was the most popular destination for passengers picked up \nin central park on January 14?\n\nUse the pick up time for your calculations.\n\nEnter the zone name (not id). If the zone name is unknown (missing), write \"Unknown\" \n\n\n## Question 6. 
Most expensive locations\n\nWhat's the pickup-dropoff pair with the largest \naverage price for a ride (calculated based on `total_amount`)?\n\nEnter two zone names separated by a slash\n\nFor example:\n\n\"Jamaica Bay / Clinton East\"\n\nIf any of the zone names are unknown (missing), write \"Unknown\". For example, \"Unknown / Clinton East\". \n\n\n## Submitting the solutions\n\n* Form for submitting: https://forms.gle/yGQrkgRdVbiFs8Vd7\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: 26 January (Wednesday), 22:00 CET\n\n\n## Solution\n\nHere is the solution to questions 3-6: [video](https://www.youtube.com/watch?v=HxHqH2ARfxM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/README.md",
    "content": "## Week 2: Data Ingestion\n\n### Data Lake (GCS)\n\n* What is a Data Lake\n* ELT vs. ETL\n* Alternatives to components (S3/HDFS, Redshift, Snowflake etc.)\n\n:movie_camera: [Video](https://www.youtube.com/watch?v=W3Zm6rjOq70&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n[Slides](https://docs.google.com/presentation/d/1RkH-YhBz2apIjYZAxUz2Uks4Pt51-fVWVN9CcH9ckyY/edit?usp=sharing)\n\n\n### Introduction to Workflow orchestration\n\n* What is an Orchestration Pipeline?\n* What is a DAG?\n* [Video](https://www.youtube.com/watch?v=0yK7LXwYeD0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n\n### Setting up Airflow locally\n\n* Setting up Airflow with Docker-Compose\n* [Video](https://www.youtube.com/watch?v=lqDMzReAtrw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n* More information in the [airflow folder](airflow)\n\nIf you want to run a lighter version of Airflow with fewer services, check this [video](https://www.youtube.com/watch?v=A1p5LQ0zzaQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb). It's optional.\n\n\n### Ingesting data to GCP with Airflow\n\n* Extraction: Download and unpack the data\n* Pre-processing: Convert this raw data to parquet\n* Upload the parquet files to GCS\n* Create an external table in BigQuery\n* [Video](https://www.youtube.com/watch?v=9ksX9REfL8w&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=19)\n\n### Ingesting data to Local Postgres with Airflow\n\n* Converting the ingestion script for loading data to Postgres to Airflow DAG\n* [Video](https://www.youtube.com/watch?v=s2U8MWJH5xA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n\n### Transfer service (AWS -> GCP)\n\nMoving files from AWS to GCP.\n\nYou will need an AWS account for this. 
This section is optional\n\n* [Video 1](https://www.youtube.com/watch?v=rFOFTfD1uGk&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n* [Video 2](https://www.youtube.com/watch?v=VhmmbqpIzeI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n\n### Homework \n\nIn the homework, you'll create a few DAGs for processing the NY Taxi data for 2019-2021\n\nMore information [here](homework.md)\n\n\n## Community notes\n\nDid you take notes? You can share them here.\n\n* [Notes from Alvaro Navas](https://github.com/ziritrion/dataeng-zoomcamp/blob/main/notes/2_data_ingestion.md)\n* [Notes from Aaron Wright](https://github.com/ABZ-Aaron/DataEngineerZoomCamp/blob/master/week_2_data_ingestion/README.md)\n* [Notes from Abd](https://itnadigital.notion.site/Week-2-Data-Ingestion-ec2d0d36c0664bc4b8be6a554b2765fd)\n* [Blog post by Isaac Kargar](https://kargarisaac.github.io/blog/data%20engineering/jupyter/2022/01/25/data-engineering-w2.html)\n* [Blog, notes, walkthroughs by Sandy Behrens](https://learningdataengineering540969211.wordpress.com/2022/01/30/week-2-de-zoomcamp-2-3-2-ingesting-data-to-gcp-with-airflow/)\n* [Notes from Apurva Hegde](https://github.com/apuhegde/Airflow-LocalExecutor-In-Docker#readme)\n* [Notes from Vincenzo Galante](https://binchentso.notion.site/Data-Talks-Club-Data-Engineering-Zoomcamp-8699af8e7ff94ec49e6f9bdec8eb69fd)\n* Add your notes here (above this line)\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/.env_example",
    "content": "# Custom\nCOMPOSE_PROJECT_NAME=dtc-de\nGOOGLE_APPLICATION_CREDENTIALS=/.google/credentials/google_credentials.json\nAIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json\n# AIRFLOW_UID=\nGCP_PROJECT_ID=\nGCP_GCS_BUCKET=\n\n# Postgres\nPOSTGRES_USER=airflow\nPOSTGRES_PASSWORD=airflow\nPOSTGRES_DB=airflow\n\n# Airflow\nAIRFLOW__CORE__EXECUTOR=LocalExecutor\nAIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=10\n\nAIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}\nAIRFLOW_CONN_METADATA_DB=postgres+psycopg2://airflow:airflow@postgres:5432/airflow\nAIRFLOW_VAR__METADATA_DB_SCHEMA=airflow\n\n_AIRFLOW_WWW_USER_CREATE=True\n_AIRFLOW_WWW_USER_USERNAME=${_AIRFLOW_WWW_USER_USERNAME:airflow}\n_AIRFLOW_WWW_USER_PASSWORD=${_AIRFLOW_WWW_USER_PASSWORD:airflow}\n\nAIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True\nAIRFLOW__CORE__LOAD_EXAMPLES=False\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/1_setup_official.md",
    "content": "## Setup (Official)\n\n### Pre-Reqs\n\n1. For the sake of standardization across this workshop's config,\n    rename your gcp-service-accounts-credentials file to `google_credentials.json` & store it in your `$HOME` directory\n    ``` bash\n        cd ~ && mkdir -p ~/.google/credentials/\n        mv <path/to/your/service-account-authkeys>.json ~/.google/credentials/google_credentials.json\n    ```\n\n2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to minimum 5GB\n(ideally 8GB). If enough memory is not allocated, it might lead to airflow-webserver continuously restarting.\n\n3. Python version: 3.7+\n\n\n### Airflow Setup\n\n1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in)\n\n2. **Set the Airflow user**:\n\n    On Linux, the quick-start needs to know your host user-id and needs to have group id set to 0. \n    Otherwise the files created in `dags`, `logs` and `plugins` will be created with root user. \n    You have to make sure to configure them for the docker-compose:\n\n    ```bash\n    mkdir -p ./dags ./logs ./plugins\n    echo -e \"AIRFLOW_UID=$(id -u)\" > .env\n    ```\n\n    On Windows you will probably also need it. If you use MINGW/GitBash, execute the same command. \n\n    To get rid of the warning (\"AIRFLOW_UID is not set\"), you can create `.env` file with\n    this content:\n\n    ```\n    AIRFLOW_UID=50000\n    ```\n\n   \n3. **Import the official docker setup file** from the latest Airflow version:\n   ```shell\n   curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'\n   ```\n   \n4. It could be overwhelming to see a lot of services in here. \n   But this is only a quick-start template, and as you proceed you'll figure out which unused services can be removed.\n   Eg. [Here's](docker-compose-nofrills.yml) a no-frills version of that template.\n\n5. 
**Docker Build**:\n\n    When you want to run Airflow locally, you might want to use an extended image, \n    containing some additional dependencies - for example you might add new python packages, \n    or upgrade airflow providers to a later version.\n    \n    Create a `Dockerfile` pointing to Airflow version you've just downloaded, \n    such as `apache/airflow:2.2.3`, as the base image,\n       \n    And customize this `Dockerfile` by:\n    * Adding your custom packages to be installed. The one we'll need the most is `gcloud` to connect with the GCS bucket/Data Lake.\n    * Also, integrating `requirements.txt` to install libraries via  `pip install`\n\n6. **Docker Compose**:\n\n    Back in your `docker-compose.yaml`:\n   * In `x-airflow-common`: \n     * Remove the `image` tag, to replace it with your `build` from your Dockerfile, as shown\n     * Mount your `google_credentials` in `volumes` section as read-only\n     * Set environment variables: `GCP_PROJECT_ID`, `GCP_GCS_BUCKET`, `GOOGLE_APPLICATION_CREDENTIALS` & `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`, as per your config.\n   * Change `AIRFLOW__CORE__LOAD_EXAMPLES` to `false` (optional)\n\n7. 
Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose.yaml](./docker-compose.yaml) should look.\n\n\n## Problems\n\n### `File /.google/credentials/google_credentials.json was not found`\n\nFirst, make sure you have your credentials in your `$HOME/.google/credentials`.\nMaybe you missed the step and didn't copy your JSON file with credentials there?\nAlso, make sure the file name is `google_credentials.json`.\n\nSecond, check that docker-compose can correctly map this directory to the airflow worker.\n\nExecute `docker ps` to see the list of docker containers running on your host machine and find the ID of the airflow worker.\n\nThen execute `bash` on this container:\n\n```bash\ndocker exec -it <container-ID> bash\n```\n\nNow check if the file with credentials is actually there:\n\n```bash\nls -lh /.google/credentials/\n```\n\nIf it's empty, docker-compose couldn't map the folder with credentials. \nIn this case, try changing the mapping to the absolute path to this folder:\n\n```yaml\n  volumes:\n    - ./dags:/opt/airflow/dags\n    - ./logs:/opt/airflow/logs\n    - ./plugins:/opt/airflow/plugins\n    # here: ----------------------------\n    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro\n    # -----------------------------------\n```\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/2_setup_nofrills.md",
    "content": "## Setup (No-frills)\n\n### Pre-Reqs\n\n1. For the sake of standardization across this workshop's config,\n    rename your gcp-service-accounts-credentials file to `google_credentials.json` & store it in your `$HOME` directory\n    ``` bash\n        cd ~ && mkdir -p ~/.google/credentials/\n        mv <path/to/your/service-account-authkeys>.json ~/.google/credentials/google_credentials.json\n    ```\n\n2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to minimum 4GB\n(ideally 8GB). If enough memory is not allocated, it might lead to airflow-webserver continuously restarting.\n\n3. Python version: 3.7+\n\n\n### Airflow Setup\n\n1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in)\n   \n2. **Set the Airflow user**:\n\n    On Linux, the quick-start needs to know your host user-id and needs to have group id set to 0. \n    Otherwise the files created in `dags`, `logs` and `plugins` will be created with root user. \n    You have to make sure to configure them for the docker-compose:\n\n    ```bash\n    mkdir -p ./dags ./logs ./plugins\n    echo -e \"AIRFLOW_UID=$(id -u)\" >> .env\n    ```\n\n    On Windows you will probably also need it. If you use MINGW/GitBash, execute the same command. \n\n    To get rid of the warning (\"AIRFLOW_UID is not set\"), you can create `.env` file with\n    this content:\n\n    ```\n    AIRFLOW_UID=50000\n    ```\n\n3. **Docker Build**:\n\n    When you want to run Airflow locally, you might want to use an extended image, \n    containing some additional dependencies - for example you might add new python packages, \n    or upgrade airflow providers to a later version.\n    \n    Create a `Dockerfile` pointing to the latest Airflow version such as `apache/airflow:2.2.3`, for the base image,\n       \n    And customize this `Dockerfile` by:\n    * Adding your custom packages to be installed. 
The one we'll need the most is `gcloud` to connect with the GCS bucket (Data Lake).\n    * Also, integrating `requirements.txt` to install libraries via `pip install`\n\n4. Copy [docker-compose-nofrills.yml](docker-compose-nofrills.yml), [.env_example](.env_example) & [entrypoint.sh](scripts/entrypoint.sh) from this repo.\n    The changes from the official setup are:\n    * Removal of `redis` queue, `worker`, `triggerer`, `flower` & `airflow-init` services, \n    and changing from `CeleryExecutor` (multi-node) mode to `LocalExecutor` (single-node) mode \n    * Inclusion of `.env` for better parametrization & flexibility\n    * Inclusion of a simple `entrypoint.sh` in the `webserver` container, responsible for initializing the database and creating the login user (admin).\n    * Updated `Dockerfile` to grant execute permissions on `scripts/entrypoint.sh`\n        \n5. `.env`:\n    * Rebuild your `.env` file by making a copy of `.env_example` (but make sure your `AIRFLOW_UID` remains):\n        ```shell\n        mv .env_example .env\n        ```\n    * Set environment variables `AIRFLOW_UID`, `GCP_PROJECT_ID` & `GCP_GCS_BUCKET`, as per your config.\n    * Optionally, if your `google_credentials.json` is stored somewhere else, such as a path like `$HOME/.gc`, \n    modify the env-vars (`GOOGLE_APPLICATION_CREDENTIALS`, `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`) and `volumes` path in `docker-compose-nofrills.yml`\n\n6. 
Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose-nofrills](./docker-compose-nofrills.yml) should look.\n\n\n## Problems\n\n### `no-frills setup does not work for me - WSL/Windows user `\n\nIf you are running Docker in Windows/WSL/WSL2 and you have encountered some `ModuleNotFoundError` or low performance issues, take a look at this [Airflow & WSL2 gist](https://gist.github.com/nervuzz/d1afe81116cbfa3c834634ebce7f11c5) focused entirely on troubleshooting possible problems.\n\n### `File /.google/credentials/google_credentials.json was not found`\n\nFirst, make sure you have your credentials in your `$HOME/.google/credentials`.\nMaybe you missed the step and didn't copy your JSON file with credentials there?\nAlso, make sure the file name is `google_credentials.json`.\n\nSecond, check that docker-compose can correctly map this directory to the airflow worker.\n\nExecute `docker ps` to see the list of docker containers running on your host machine and find the ID of the airflow worker.\n\nThen execute `bash` on this container:\n\n```bash\ndocker exec -it <container-ID> bash\n```\n\nNow check if the file with credentials is actually there:\n\n```bash\nls -lh /.google/credentials/\n```\n\nIf it's empty, docker-compose couldn't map the folder with credentials. \nIn this case, try changing the mapping to the absolute path to this folder:\n\n```yaml\n  volumes:\n    - ./dags:/opt/airflow/dags\n    - ./logs:/opt/airflow/logs\n    - ./plugins:/opt/airflow/plugins\n    # here: ----------------------------\n    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro\n    # -----------------------------------\n```\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/Dockerfile",
    "content": "# First-time build can take up to 10 mins.\n\nFROM apache/airflow:2.2.3\n\nENV AIRFLOW_HOME=/opt/airflow\n\nUSER root\nRUN apt-get update -qq && apt-get install vim -qqq\n# git gcc g++ -qqq\n\nCOPY requirements.txt .\nRUN pip install --no-cache-dir -r requirements.txt\n\n# Ref: https://airflow.apache.org/docs/docker-stack/recipes.html\n\nSHELL [\"/bin/bash\", \"-o\", \"pipefail\", \"-e\", \"-u\", \"-x\", \"-c\"]\n\nARG CLOUD_SDK_VERSION=322.0.0\nENV GCLOUD_HOME=/home/google-cloud-sdk\n\nENV PATH=\"${GCLOUD_HOME}/bin/:${PATH}\"\n\nRUN DOWNLOAD_URL=\"https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz\" \\\n    && TMP_DIR=\"$(mktemp -d)\" \\\n    && curl -fL \"${DOWNLOAD_URL}\" --output \"${TMP_DIR}/google-cloud-sdk.tar.gz\" \\\n    && mkdir -p \"${GCLOUD_HOME}\" \\\n    && tar xzf \"${TMP_DIR}/google-cloud-sdk.tar.gz\" -C \"${GCLOUD_HOME}\" --strip-components=1 \\\n    && \"${GCLOUD_HOME}/install.sh\" \\\n       --bash-completion=false \\\n       --path-update=false \\\n       --usage-reporting=false \\\n       --quiet \\\n    && rm -rf \"${TMP_DIR}\" \\\n    && gcloud --version\n\nWORKDIR $AIRFLOW_HOME\n\nCOPY scripts scripts\n# Recursive, so that the files inside scripts/ (e.g. entrypoint.sh) become executable, not just the directory\nRUN chmod -R +x scripts\n\nUSER $AIRFLOW_UID\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/README.md",
    "content": "### Concepts\n\n [Airflow Concepts and Architecture](docs/1_concepts.md)\n\n### Workflow\n\n ![](docs/gcs_ingestion_dag.png)\n \n### Setup - Official Version\n (For the section on the Custom/Lightweight setup, scroll down)\n\n #### Setup\n  [Airflow Setup with Docker, through official guidelines](1_setup_official.md)\n\n #### Execution\n \n  1. Build the image (only the first time, or when there's any change in the `Dockerfile`; takes ~15 mins the first time):\n     ```shell\n     docker-compose build\n     ```\n   \n     or (for legacy versions)\n   \n     ```shell\n     docker build .\n     ```\n\n 2. Initialize the Airflow scheduler, DB, and other config\n    ```shell\n    docker-compose up airflow-init\n    ```\n\n 3. Start all the services:\n    ```shell\n    docker-compose up\n    ```\n\n 4. In another terminal, run `docker-compose ps` to see which containers are up & running (there should be 7, matching the services in your docker-compose file).\n\n 5. Log in to the Airflow web UI on `localhost:8080` with default creds: `airflow/airflow`\n\n 6. Run your DAG on the Web Console.\n\n 7. On finishing your run or to shut down the container/s:\n    ```shell\n    docker-compose down\n    ```\n\n    To stop and delete containers, delete volumes with database data, and remove downloaded images, run:\n    ```\n    docker-compose down --volumes --rmi all\n    ```\n\n    or\n    ```\n    docker-compose down --volumes --remove-orphans\n    ```\n       \n### Setup - Custom No-Frills Version (Lightweight)\nThis is a quick, simple & less memory-intensive setup of Airflow that works on a LocalExecutor.\n\n  #### Setup\n  [Airflow Setup with Docker, customized](2_setup_nofrills.md)\n\n  #### Execution\n  \n  1. 
Stop and delete containers, and delete volumes with database data & downloaded images (from the previous setup):\n    ```\n    docker-compose down --volumes --rmi all\n    ```\n\n   or\n    ```\n    docker-compose down --volumes --remove-orphans\n    ```\n    \n   Or, if you need to clear your system of any pre-cached Docker issues:\n    ```\n    docker system prune\n    ```\n    \n   Also, empty the airflow `logs` directory.\n    \n  2. Build the image (only the first time, or when there's any change in the `Dockerfile`):\n  Takes ~5-10 mins the first time\n    ```shell\n    docker-compose -f docker-compose-nofrills.yml build\n    ```\n    or (for legacy versions)\n    ```shell\n    docker build .\n    ```\n\n  3. Start all the services (no need to initialize separately):\n    ```shell\n    docker-compose -f docker-compose-nofrills.yml up\n    ```\n\n  4. In another terminal, run `docker ps` to see which containers are up & running (there should be 3, matching the services in your docker-compose file).\n\n  5. Log in to the Airflow web UI on `localhost:8080` with creds: `admin/admin` (explicit creation of admin user was required)\n\n  6. Run your DAG on the Web Console.\n\n  7. On finishing your run or to shut down the container/s:\n    ```shell\n    docker-compose -f docker-compose-nofrills.yml down\n    ```\n    \n### Setup - Taken from DE Zoomcamp 2.3.4 - Optional: Lightweight Local Setup for Airflow\n\nUse the docker-compose_2.3.4.yaml file (and rename it to docker-compose.yaml). Don't forget to replace the variables `GCP_PROJECT_ID` and `GCP_GCS_BUCKET`.\n\n### Future Enhancements\n* Deploy a self-hosted Airflow setup on a Kubernetes cluster, or use a Managed Airflow (Cloud Composer) service by GCP\n\n### References\nFor more info, check out these official docs:\n   * https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html\n   * https://airflow.apache.org/docs/docker-stack/build.html\n   * https://airflow.apache.org/docs/docker-stack/recipes.html\n\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/dags/data_ingestion_gcs_dag.py",
    "content": "import os\nimport logging\n\nfrom airflow import DAG\nfrom airflow.utils.dates import days_ago\nfrom airflow.operators.bash import BashOperator\nfrom airflow.operators.python import PythonOperator\n\nfrom google.cloud import storage\nfrom airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator\nimport pyarrow.csv as pv\nimport pyarrow.parquet as pq\n\nPROJECT_ID = os.environ.get(\"GCP_PROJECT_ID\")\nBUCKET = os.environ.get(\"GCP_GCS_BUCKET\")\n\ndataset_file = \"yellow_tripdata_2021-01.csv\"\ndataset_url = f\"https://s3.amazonaws.com/nyc-tlc/trip+data/{dataset_file}\"\npath_to_local_home = os.environ.get(\"AIRFLOW_HOME\", \"/opt/airflow/\")\nparquet_file = dataset_file.replace('.csv', '.parquet')\nBIGQUERY_DATASET = os.environ.get(\"BIGQUERY_DATASET\", 'trips_data_all')\n\n\ndef format_to_parquet(src_file):\n    if not src_file.endswith('.csv'):\n        logging.error(\"Can only accept source files in CSV format, for the moment\")\n        return\n    table = pv.read_csv(src_file)\n    pq.write_table(table, src_file.replace('.csv', '.parquet'))\n\n\n# NOTE: takes 20 mins, at an upload speed of 800kbps. 
Faster if your internet has a better upload speed\ndef upload_to_gcs(bucket, object_name, local_file):\n    \"\"\"\n    Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python\n    :param bucket: GCS bucket name\n    :param object_name: target path & file-name\n    :param local_file: source path & file-name\n    :return:\n    \"\"\"\n    # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.\n    # (Ref: https://github.com/googleapis/python-storage/issues/74)\n    storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB\n    storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB\n    # End of Workaround\n\n    client = storage.Client()\n    bucket = client.bucket(bucket)\n\n    blob = bucket.blob(object_name)\n    blob.upload_from_filename(local_file)\n\n\ndefault_args = {\n    \"owner\": \"airflow\",\n    \"start_date\": days_ago(1),\n    \"depends_on_past\": False,\n    \"retries\": 1,\n}\n\n# NOTE: DAG declaration - using a Context Manager (an implicit way)\nwith DAG(\n    dag_id=\"data_ingestion_gcs_dag\",\n    schedule_interval=\"@daily\",\n    default_args=default_args,\n    catchup=False,\n    max_active_runs=1,\n    tags=['dtc-de'],\n) as dag:\n\n    download_dataset_task = BashOperator(\n        task_id=\"download_dataset_task\",\n        bash_command=f\"curl -sSL {dataset_url} > {path_to_local_home}/{dataset_file}\"\n    )\n\n    format_to_parquet_task = PythonOperator(\n        task_id=\"format_to_parquet_task\",\n        python_callable=format_to_parquet,\n        op_kwargs={\n            \"src_file\": f\"{path_to_local_home}/{dataset_file}\",\n        },\n    )\n\n    # TODO: Homework - research and try XCOM to communicate output values between 2 tasks/operators\n    local_to_gcs_task = PythonOperator(\n        task_id=\"local_to_gcs_task\",\n        python_callable=upload_to_gcs,\n        op_kwargs={\n            \"bucket\": BUCKET,\n            \"object_name\": f\"raw/{parquet_file}\",\n      
      \"local_file\": f\"{path_to_local_home}/{parquet_file}\",\n        },\n    )\n\n    bigquery_external_table_task = BigQueryCreateExternalTableOperator(\n        task_id=\"bigquery_external_table_task\",\n        table_resource={\n            \"tableReference\": {\n                \"projectId\": PROJECT_ID,\n                \"datasetId\": BIGQUERY_DATASET,\n                \"tableId\": \"external_table\",\n            },\n            \"externalDataConfiguration\": {\n                \"sourceFormat\": \"PARQUET\",\n                \"sourceUris\": [f\"gs://{BUCKET}/raw/{parquet_file}\"],\n            },\n        },\n    )\n\n    download_dataset_task >> format_to_parquet_task >> local_to_gcs_task >> bigquery_external_table_task\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/dags_local/data_ingestion_local.py",
    "content": "import os\n\nfrom datetime import datetime\n\nfrom airflow import DAG\n\nfrom airflow.operators.bash import BashOperator\nfrom airflow.operators.python import PythonOperator\n\nfrom ingest_script import ingest_callable\n\n\nAIRFLOW_HOME = os.environ.get(\"AIRFLOW_HOME\", \"/opt/airflow/\")\n\n\nPG_HOST = os.getenv('PG_HOST')\nPG_USER = os.getenv('PG_USER')\nPG_PASSWORD = os.getenv('PG_PASSWORD')\nPG_PORT = os.getenv('PG_PORT')\nPG_DATABASE = os.getenv('PG_DATABASE')\n\n\nlocal_workflow = DAG(\n    \"LocalIngestionDag\",\n    schedule_interval=\"0 6 2 * *\",\n    start_date=datetime(2021, 1, 1)\n)\n\n\nURL_PREFIX = 'https://s3.amazonaws.com/nyc-tlc/trip+data' \nURL_TEMPLATE = URL_PREFIX + '/yellow_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.csv'\nOUTPUT_FILE_TEMPLATE = AIRFLOW_HOME + '/output_{{ execution_date.strftime(\\'%Y-%m\\') }}.csv'\nTABLE_NAME_TEMPLATE = 'yellow_taxi_{{ execution_date.strftime(\\'%Y_%m\\') }}'\n\nwith local_workflow:\n    wget_task = BashOperator(\n        task_id='wget',\n        bash_command=f'curl -sSL {URL_TEMPLATE} > {OUTPUT_FILE_TEMPLATE}'\n    )\n\n    ingest_task = PythonOperator(\n        task_id=\"ingest\",\n        python_callable=ingest_callable,\n        op_kwargs=dict(\n            user=PG_USER,\n            password=PG_PASSWORD,\n            host=PG_HOST,\n            port=PG_PORT,\n            db=PG_DATABASE,\n            table_name=TABLE_NAME_TEMPLATE,\n            csv_file=OUTPUT_FILE_TEMPLATE\n        ),\n    )\n\n    wget_task >> ingest_task"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/dags_local/ingest_script.py",
    "content": "import os\n\nfrom time import time\n\nimport pandas as pd\nfrom sqlalchemy import create_engine\n\n\ndef ingest_callable(user, password, host, port, db, table_name, csv_file, execution_date):\n    print(table_name, csv_file, execution_date)\n\n    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')\n    engine.connect()\n\n    print('connection established successfully, inserting data...')\n\n    t_start = time()\n    df_iter = pd.read_csv(csv_file, iterator=True, chunksize=100000)\n\n    df = next(df_iter)\n\n    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)\n    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)\n\n    df.head(n=0).to_sql(name=table_name, con=engine, if_exists='replace')\n\n    df.to_sql(name=table_name, con=engine, if_exists='append')\n\n    t_end = time()\n    print('inserted the first chunk, took %.3f second' % (t_end - t_start))\n\n    while True: \n        t_start = time()\n\n        try:\n            df = next(df_iter)\n        except StopIteration:\n            print(\"completed\")\n            break\n\n        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)\n        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)\n\n        df.to_sql(name=table_name, con=engine, if_exists='append')\n\n        t_end = time()\n\n        print('inserted another chunk, took %.3f second' % (t_end - t_start))\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/docker-compose-nofrills.yml",
    "content": "version: '3'\nservices:\n    postgres:\n        image: postgres:13\n        env_file:\n            - .env\n        volumes:\n            - postgres-db-volume:/var/lib/postgresql/data\n        healthcheck:\n            test: [\"CMD\", \"pg_isready\", \"-U\", \"airflow\"]\n            interval: 5s\n            retries: 5\n        restart: always\n\n    scheduler:\n        build: .\n        command: scheduler\n        restart: on-failure\n        depends_on:\n            - postgres\n        env_file:\n            - .env\n        volumes:\n            - ./dags:/opt/airflow/dags\n            - ./logs:/opt/airflow/logs\n            - ./plugins:/opt/airflow/plugins\n            - ./scripts:/opt/airflow/scripts\n            - ~/.google/credentials/:/.google/credentials\n\n\n    webserver:\n        build: .\n        entrypoint: ./scripts/entrypoint.sh\n        restart: on-failure\n        depends_on:\n            - postgres\n            - scheduler\n        env_file:\n            - .env\n        volumes:\n            - ./dags:/opt/airflow/dags\n            - ./logs:/opt/airflow/logs\n            - ./plugins:/opt/airflow/plugins\n            - ~/.google/credentials/:/.google/credentials:ro\n            - ./scripts:/opt/airflow/scripts\n\n        user: \"${AIRFLOW_UID:-50000}:0\"\n        ports:\n            - \"8080:8080\"\n        healthcheck:\n            test: [ \"CMD-SHELL\", \"[ -f /home/airflow/airflow-webserver.pid ]\" ]\n            interval: 30s\n            timeout: 30s\n            retries: 3\n\nvolumes:\n  postgres-db-volume:"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/docker-compose.yaml",
    "content": "# Licensed to the Apache Software Foundation (ASF) under one\n# or more contributor license agreements.  See the NOTICE file\n# distributed with this work for additional information\n# regarding copyright ownership.  The ASF licenses this file\n# to you under the Apache License, Version 2.0 (the\n# \"License\"); you may not use this file except in compliance\n# with the License.  You may obtain a copy of the License at\n#\n#   http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing,\n# software distributed under the License is distributed on an\n# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, either express or implied.  See the License for the\n# specific language governing permissions and limitations\n# under the License.\n#\n\n# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.\n#\n# WARNING: This configuration is for local development. Do not use it in a production deployment.\n#\n# This configuration supports basic configuration using environment variables or an .env file\n# The following variables are supported:\n#\n# AIRFLOW_IMAGE_NAME           - Docker image name used to run Airflow.\n#                                Default: apache/airflow:2.2.3\n# AIRFLOW_UID                  - User ID in Airflow containers\n#                                Default: 50000\n# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode\n#\n# _AIRFLOW_WWW_USER_USERNAME   - Username for the administrator account (if requested).\n#                                Default: airflow\n# _AIRFLOW_WWW_USER_PASSWORD   - Password for the administrator account (if requested).\n#                                Default: airflow\n# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.\n#                                Default: ''\n#\n# Feel free to modify this file to suit your 
needs.\n---\nversion: '3'\nx-airflow-common:\n  &airflow-common\n  # In order to add custom dependencies or upgrade provider packages you can use your extended image.\n  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml\n  # and uncomment the \"build\" line below, Then run `docker-compose build` to build the images.\n  build:\n    context: .\n    dockerfile: ./Dockerfile\n  environment:\n    &airflow-common-env\n    AIRFLOW__CORE__EXECUTOR: CeleryExecutor\n    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow\n    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow\n    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0\n    AIRFLOW__CORE__FERNET_KEY: ''\n    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'\n    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'\n    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'\n    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}\n    GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json\n    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'\n\n    # TODO: Please change GCP_PROJECT_ID & GCP_GCS_BUCKET, as per your config\n    GCP_PROJECT_ID: 'pivotal-surfer-336713'\n    GCP_GCS_BUCKET: 'dtc_data_lake_pivotal-surfer-336713'\n\n  volumes:\n    - ./dags:/opt/airflow/dags\n    - ./logs:/opt/airflow/logs\n    - ./plugins:/opt/airflow/plugins\n    - ~/.google/credentials/:/.google/credentials:ro\n\n  user: \"${AIRFLOW_UID:-50000}:0\"\n  depends_on:\n    &airflow-common-depends-on\n    redis:\n      condition: service_healthy\n    postgres:\n      condition: service_healthy\n\nservices:\n  postgres:\n    image: postgres:13\n    environment:\n      POSTGRES_USER: airflow\n      POSTGRES_PASSWORD: airflow\n      POSTGRES_DB: airflow\n    volumes:\n      - 
postgres-db-volume:/var/lib/postgresql/data\n    healthcheck:\n      test: [\"CMD\", \"pg_isready\", \"-U\", \"airflow\"]\n      interval: 5s\n      retries: 5\n    restart: always\n\n  redis:\n    image: redis:latest\n    expose:\n      - 6379\n    healthcheck:\n      test: [\"CMD\", \"redis-cli\", \"ping\"]\n      interval: 5s\n      timeout: 30s\n      retries: 50\n    restart: always\n\n  airflow-webserver:\n    <<: *airflow-common\n    command: webserver\n    ports:\n      - 8080:8080\n    healthcheck:\n      test: [\"CMD\", \"curl\", \"--fail\", \"http://localhost:8080/health\"]\n      interval: 10s\n      timeout: 10s\n      retries: 5\n    restart: always\n    depends_on:\n      <<: *airflow-common-depends-on\n      airflow-init:\n        condition: service_completed_successfully\n\n  airflow-scheduler:\n    <<: *airflow-common\n    command: scheduler\n    healthcheck:\n      test: [\"CMD-SHELL\", 'airflow jobs check --job-type SchedulerJob --hostname \"$${HOSTNAME}\"']\n      interval: 10s\n      timeout: 10s\n      retries: 5\n    restart: always\n    depends_on:\n      <<: *airflow-common-depends-on\n      airflow-init:\n        condition: service_completed_successfully\n\n  airflow-worker:\n    <<: *airflow-common\n    command: celery worker\n    healthcheck:\n      test:\n        - \"CMD-SHELL\"\n        - 'celery --app airflow.executors.celery_executor.app inspect ping -d \"celery@$${HOSTNAME}\"'\n      interval: 10s\n      timeout: 10s\n      retries: 5\n    environment:\n      <<: *airflow-common-env\n      # Required to handle warm shutdown of the celery workers properly\n      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation\n      DUMB_INIT_SETSID: \"0\"\n    restart: always\n    depends_on:\n      <<: *airflow-common-depends-on\n      airflow-init:\n        condition: service_completed_successfully\n\n  airflow-triggerer:\n    <<: *airflow-common\n    command: triggerer\n    healthcheck:\n      test: 
[\"CMD-SHELL\", 'airflow jobs check --job-type TriggererJob --hostname \"$${HOSTNAME}\"']\n      interval: 10s\n      timeout: 10s\n      retries: 5\n    restart: always\n    depends_on:\n      <<: *airflow-common-depends-on\n      airflow-init:\n        condition: service_completed_successfully\n\n  airflow-init:\n    <<: *airflow-common\n    entrypoint: /bin/bash\n    # yamllint disable rule:line-length\n    command:\n      - -c\n      - |\n        function ver() {\n          printf \"%04d%04d%04d%04d\" $${1//./ }\n        }\n        airflow_version=$$(gosu airflow airflow version)\n        airflow_version_comparable=$$(ver $${airflow_version})\n        min_airflow_version=2.2.0\n        min_airflow_version_comparable=$$(ver $${min_airflow_version})\n        if (( airflow_version_comparable < min_airflow_version_comparable )); then\n          echo\n          echo -e \"\\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\\e[0m\"\n          echo \"The minimum Airflow version supported: $${min_airflow_version}. 
Only use this or higher!\"\n          echo\n          exit 1\n        fi\n        if [[ -z \"${AIRFLOW_UID}\" ]]; then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: AIRFLOW_UID not set!\\e[0m\"\n          echo \"If you are on Linux, you SHOULD follow the instructions below to set \"\n          echo \"AIRFLOW_UID environment variable, otherwise files will be owned by root.\"\n          echo \"For other operating systems you can get rid of the warning with manually created .env file:\"\n          echo \"    See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user\"\n          echo\n        fi\n        one_meg=1048576\n        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))\n        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)\n        disk_available=$$(df / | tail -1 | awk '{print $$4}')\n        warning_resources=\"false\"\n        if (( mem_available < 4000 )) ; then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: Not enough memory available for Docker.\\e[0m\"\n          echo \"At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))\"\n          echo\n          warning_resources=\"true\"\n        fi\n        if (( cpus_available < 2 )); then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\\e[0m\"\n          echo \"At least 2 CPUs recommended. You have $${cpus_available}\"\n          echo\n          warning_resources=\"true\"\n        fi\n        if (( disk_available < one_meg * 10 )); then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\\e[0m\"\n          echo \"At least 10 GBs recommended. 
You have $$(numfmt --to iec $$((disk_available * 1024 )))\"\n          echo\n          warning_resources=\"true\"\n        fi\n        if [[ $${warning_resources} == \"true\" ]]; then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\\e[0m\"\n          echo \"Please follow the instructions to increase amount of resources available:\"\n          echo \"   https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin\"\n          echo\n        fi\n        mkdir -p /sources/logs /sources/dags /sources/plugins\n        chown -R \"${AIRFLOW_UID}:0\" /sources/{logs,dags,plugins}\n        exec /entrypoint airflow version\n    # yamllint enable rule:line-length\n    environment:\n      <<: *airflow-common-env\n      _AIRFLOW_DB_UPGRADE: 'true'\n      _AIRFLOW_WWW_USER_CREATE: 'true'\n      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}\n      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}\n    user: \"0:0\"\n    volumes:\n      - .:/sources\n\n  airflow-cli:\n    <<: *airflow-common\n    profiles:\n      - debug\n    environment:\n      <<: *airflow-common-env\n      CONNECTION_CHECK_MAX_COUNT: \"0\"\n    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252\n    command:\n      - bash\n      - -c\n      - airflow\n\n  flower:\n    <<: *airflow-common\n    command: celery flower\n    ports:\n      - 5555:5555\n    healthcheck:\n      test: [\"CMD\", \"curl\", \"--fail\", \"http://localhost:5555/\"]\n      interval: 10s\n      timeout: 10s\n      retries: 5\n    restart: always\n    depends_on:\n      <<: *airflow-common-depends-on\n      airflow-init:\n        condition: service_completed_successfully\n\nvolumes:\n  postgres-db-volume:\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/docker-compose_2.3.4.yaml",
    "content": "# Licensed to the Apache Software Foundation (ASF) under one\n# or more contributor license agreements.  See the NOTICE file\n# distributed with this work for additional information\n# regarding copyright ownership.  The ASF licenses this file\n# to you under the Apache License, Version 2.0 (the\n# \"License\"); you may not use this file except in compliance\n# with the License.  You may obtain a copy of the License at\n#\n#   http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing,\n# software distributed under the License is distributed on an\n# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, either express or implied.  See the License for the\n# specific language governing permissions and limitations\n# under the License.\n#\n\n# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.\n#\n# WARNING: This configuration is for local development. Do not use it in a production deployment.\n#\n# This configuration supports basic configuration using environment variables or an .env file\n# The following variables are supported:\n#\n# AIRFLOW_IMAGE_NAME           - Docker image name used to run Airflow.\n#                                Default: apache/airflow:2.2.3\n# AIRFLOW_UID                  - User ID in Airflow containers\n#                                Default: 50000\n# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode\n#\n# _AIRFLOW_WWW_USER_USERNAME   - Username for the administrator account (if requested).\n#                                Default: airflow\n# _AIRFLOW_WWW_USER_PASSWORD   - Password for the administrator account (if requested).\n#                                Default: airflow\n# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.\n#                                Default: ''\n#\n# Feel free to modify this file to suit your 
needs.\n---\nversion: '3'\nx-airflow-common:\n  &airflow-common\n  # In order to add custom dependencies or upgrade provider packages you can use your extended image.\n  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml\n  # and uncomment the \"build\" line below, Then run `docker-compose build` to build the images.\n  build:\n    context: .\n    dockerfile: ./Dockerfile\n  environment:\n    &airflow-common-env\n    AIRFLOW__CORE__EXECUTOR: LocalExecutor\n    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow\n    AIRFLOW__CORE__FERNET_KEY: ''\n    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'\n    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'\n    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'\n    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}\n    GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json\n    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'\n    GCP_PROJECT_ID: 'abc'\n    GCP_GCS_BUCKET: \"abc\"\n\n  volumes:\n    - ./dags:/opt/airflow/dags\n    - ./logs:/opt/airflow/logs\n    - ./plugins:/opt/airflow/plugins\n    - ~/.google/credentials/:/.google/credentials:ro\n\n  user: \"${AIRFLOW_UID:-50000}:0\"\n  depends_on:\n    &airflow-common-depends-on\n    postgres:\n      condition: service_healthy\n\nservices:\n  postgres:\n    image: postgres:13\n    environment:\n      POSTGRES_USER: airflow\n      POSTGRES_PASSWORD: airflow\n      POSTGRES_DB: airflow\n    volumes:\n      - postgres-db-volume:/var/lib/postgresql/data\n    healthcheck:\n      test: [\"CMD\", \"pg_isready\", \"-U\", \"airflow\"]\n      interval: 5s\n      retries: 5\n    restart: always\n\n  airflow-webserver:\n    <<: *airflow-common\n    command: webserver\n    ports:\n      - 8080:8080\n    healthcheck:\n      test: [\"CMD\", \"curl\", 
\"--fail\", \"http://localhost:8080/health\"]\n      interval: 10s\n      timeout: 10s\n      retries: 5\n    restart: always\n    depends_on:\n      <<: *airflow-common-depends-on\n      airflow-init:\n        condition: service_completed_successfully\n\n  airflow-scheduler:\n    <<: *airflow-common\n    command: scheduler\n    healthcheck:\n      test: [\"CMD-SHELL\", 'airflow jobs check --job-type SchedulerJob --hostname \"$${HOSTNAME}\"']\n      interval: 10s\n      timeout: 10s\n      retries: 5\n    restart: always\n    depends_on:\n      <<: *airflow-common-depends-on\n      airflow-init:\n        condition: service_completed_successfully\n\n  airflow-init:\n    <<: *airflow-common\n    entrypoint: /bin/bash\n    # yamllint disable rule:line-length\n    command:\n      - -c\n      - |\n        function ver() {\n          printf \"%04d%04d%04d%04d\" $${1//./ }\n        }\n        airflow_version=$$(gosu airflow airflow version)\n        airflow_version_comparable=$$(ver $${airflow_version})\n        min_airflow_version=2.2.0\n        min_airflow_version_comparable=$$(ver $${min_airflow_version})\n        if (( airflow_version_comparable < min_airflow_version_comparable )); then\n          echo\n          echo -e \"\\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\\e[0m\"\n          echo \"The minimum Airflow version supported: $${min_airflow_version}. 
Only use this or higher!\"\n          echo\n          exit 1\n        fi\n        if [[ -z \"${AIRFLOW_UID}\" ]]; then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: AIRFLOW_UID not set!\\e[0m\"\n          echo \"If you are on Linux, you SHOULD follow the instructions below to set \"\n          echo \"AIRFLOW_UID environment variable, otherwise files will be owned by root.\"\n          echo \"For other operating systems you can get rid of the warning with manually created .env file:\"\n          echo \"    See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user\"\n          echo\n        fi\n        one_meg=1048576\n        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))\n        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)\n        disk_available=$$(df / | tail -1 | awk '{print $$4}')\n        warning_resources=\"false\"\n        if (( mem_available < 4000 )) ; then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: Not enough memory available for Docker.\\e[0m\"\n          echo \"At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))\"\n          echo\n          warning_resources=\"true\"\n        fi\n        if (( cpus_available < 2 )); then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\\e[0m\"\n          echo \"At least 2 CPUs recommended. You have $${cpus_available}\"\n          echo\n          warning_resources=\"true\"\n        fi\n        if (( disk_available < one_meg * 10 )); then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\\e[0m\"\n          echo \"At least 10 GBs recommended. 
You have $$(numfmt --to iec $$((disk_available * 1024 )))\"\n          echo\n          warning_resources=\"true\"\n        fi\n        if [[ $${warning_resources} == \"true\" ]]; then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\\e[0m\"\n          echo \"Please follow the instructions to increase amount of resources available:\"\n          echo \"   https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin\"\n          echo\n        fi\n        mkdir -p /sources/logs /sources/dags /sources/plugins\n        chown -R \"${AIRFLOW_UID}:0\" /sources/{logs,dags,plugins}\n        exec /entrypoint airflow version\n    # yamllint enable rule:line-length\n    environment:\n      <<: *airflow-common-env\n      _AIRFLOW_DB_UPGRADE: 'true'\n      _AIRFLOW_WWW_USER_CREATE: 'true'\n      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}\n      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}\n    user: \"0:0\"\n    volumes:\n      - .:/sources\n\n  airflow-cli:\n    <<: *airflow-common\n    profiles:\n      - debug\n    environment:\n      <<: *airflow-common-env\n      CONNECTION_CHECK_MAX_COUNT: \"0\"\n    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252\n    command:\n      - bash\n      - -c\n      - airflow\n\nvolumes:\n  postgres-db-volume:\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/docs/1_concepts.md",
    "content": "## Airflow concepts\n\n\n### Airflow architecture\n![](arch-diag-airflow.png)\n\nRef: https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html\n\n* **Web server**:\nGUI to inspect, trigger and debug the behaviour of DAGs and tasks. \nAvailable at http://localhost:8080.\n\n* **Scheduler**:\nResponsible for scheduling jobs. Handles both triggering & scheduled workflows, submits Tasks to the executor to run, monitors all tasks and DAGs, and\nthen triggers the task instances once their dependencies are complete.\n\n* **Worker**:\nThis component executes the tasks given by the scheduler.\n\n* **Metadata database (postgres)**:\nBackend to the Airflow environment. Used by the scheduler, executor and webserver to store state.\n\n* **Other components** (seen in docker-compose services):\n    * `redis`: Message broker that forwards messages from scheduler to worker.\n    * `flower`: The flower app for monitoring the environment. It is available at http://localhost:5555.\n    * `airflow-init`: initialization service (customized as per this design)\n\nAll these services allow you to run Airflow with CeleryExecutor. \nFor more information, see [Architecture Overview](https://airflow.apache.org/docs/apache-airflow/stable/concepts/overview.html).\n\n\n### Project Structure:\n\n* `./dags` - `DAG_FOLDER` for DAG files (use `./dags_local` for the local ingestion DAG)\n* `./logs` - contains logs from task execution and scheduler.\n* `./plugins` - for custom plugins\n\n\n### Workflow components\n\n* `DAG`: Directed acyclic graph, specifies the dependencies between a set of tasks with explicit execution order, and has a beginning as well as an end. (Hence, “acyclic”)\n    * `DAG Structure`: DAG Definition, Tasks (eg. Operators), Task Dependencies (control flow: `>>` or `<<` )\n    \n* `Task`: a defined unit of work (aka, operators in Airflow). 
The Tasks themselves describe what to do, be it fetching data, running analysis, triggering other systems, or more.\n    * Common Types: Operators (used in this workshop), Sensors, TaskFlow decorators\n    * Sub-classes of Airflow's BaseOperator\n\n* `DAG Run`: individual execution/run of a DAG\n    * scheduled or triggered\n\n* `Task Instance`: an individual run of a single task. Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc.\n    * Ideally, a task should flow from `none`, to `scheduled`, to `queued`, to `running`, and finally to `success`.\n\n\n### References\n\nhttps://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html\n\nhttps://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html\n\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/extras/data_ingestion_gcs_dag_ex2.py",
    "content": "import os\nfrom datetime import datetime\n\nfrom airflow import DAG\nfrom airflow.utils.dates import days_ago\nfrom airflow.operators.bash import BashOperator\nfrom airflow.operators.python import PythonOperator\nfrom google.cloud import storage\n\nPROJECT_ID = os.environ.get(\"GCP_PROJECT_ID\", \"pivotal-surfer-336713\")\nBUCKET = os.environ.get(\"GCP_GCS_BUCKET\", \"dtc_data_lake_pivotal-surfer-336713\")\n\ndataset_file = \"yellow_tripdata_2021-01.csv\"\ndataset_url = f\"https://s3.amazonaws.com/nyc-tlc/trip+data/{dataset_file}\"\npath_to_local_home = os.environ.get(\"AIRFLOW_HOME\", \"/opt/airflow/\")\npath_to_creds = f\"{path_to_local_home}/google_credentials.json\"\n\ndefault_args = {\n    \"owner\": \"airflow\",\n    \"start_date\": days_ago(1),\n    \"depends_on_past\": False,\n    \"retries\": 1,\n}\n\n\n# # Takes 15-20 mins to run. Good case for using Spark (distributed processing, in place of chunks)\n# def upload_to_gcs(bucket, object_name, local_file):\n#     \"\"\"\n#     Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python\n#     :param bucket: GCS bucket name\n#     :param object_name: target path & file-name\n#     :param local_file: source path & file-name\n#     :return:\n#     \"\"\"\n#     # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload link.\n#     # (Ref: https://github.com/googleapis/python-storage/issues/74)\n#     storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB\n#     storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB\n#\n#     client = storage.Client()\n#     bucket = client.bucket(bucket)\n#\n#     blob = bucket.blob(object_name)\n#     # blob.chunk_size = 5 * 1024 * 1024\n#     blob.upload_from_filename(local_file)\n\n\nwith DAG(\n    dag_id=\"data_ingestion_gcs_dag\",\n    schedule_interval=\"@daily\",\n    default_args=default_args,\n    catchup=True,\n    max_active_runs=1,\n) as dag:\n\n    # Takes ~2 mins, depending upon your internet's 
download speed\n    download_dataset_task = BashOperator(\n        task_id=\"download_dataset_task\",\n        bash_command=f\"curl -sS {dataset_url} > {path_to_local_home}/{dataset_file}\"    # \"&& unzip {zip_file} && rm {zip_file}\"\n    )\n\n    # # APPROACH 1: (takes 20 mins, at an upload speed of 800Kbps. Faster if your internet has a better upload speed)\n    # upload_to_gcs_task = PythonOperator(\n    #     task_id=\"upload_to_gcs_task\",\n    #     python_callable=upload_to_gcs,\n    #     op_kwargs={\n    #         \"bucket\": BUCKET,\n    #         \"object_name\": f\"raw/{dataset_file}\",\n    #         \"local_file\": f\"{path_to_local_home}/{dataset_file}\",\n    #\n    #     },\n    # )\n\n    # OR APPROACH 2: (takes 20 mins, at an upload speed of 800Kbps. Faster if your internet has a better upload speed)\n    # Ref: https://cloud.google.com/blog/products/gcp/optimizing-your-cloud-storage-performance-google-cloud-performance-atlas\n    upload_to_gcs_task = BashOperator(\n        task_id=\"upload_to_gcs_task\",\n        bash_command=f\"gcloud auth activate-service-account --key-file={path_to_creds} && \\\n        gsutil -m cp {path_to_local_home}/{dataset_file} gs://{BUCKET}\",\n\n    )\n\n    download_dataset_task >> upload_to_gcs_task"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/extras/web_to_gcs.sh",
    "content": "dataset_url=${dataset_url}\ndataset_file=${dataset_file}\npath_to_local_file=${path_to_local_file}\npath_to_creds=${path_to_creds}\n\ncurl -sS \"$dataset_url\" > $path_to_local_file/$dataset_file\ngcloud auth activate-service-account --key-file=$path_to_creds\ngsutil -m cp $path_to_local_file/$dataset_file gs://$BUCKET\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/requirements.txt",
    "content": "apache-airflow-providers-google\npyarrow\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/airflow/scripts/entrypoint.sh",
    "content": "#!/usr/bin/env bash\nexport GOOGLE_APPLICATION_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS}\nexport AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=${AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT}\n\nairflow db upgrade\n\nairflow users create -r Admin -u admin -p admin -e admin@example.com -f admin -l airflow\n# \"$_AIRFLOW_WWW_USER_USERNAME\" -p \"$_AIRFLOW_WWW_USER_PASSWORD\"\n\nairflow webserver\n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/homework/homework.md",
    "content": "## Week 2 Homework\n\nIn this homework, we'll prepare data for the next week. We'll need\nto put these datasets to our data lake:\n\n* For the lessons, we'll need the Yellow taxi dataset (years 2019 and 2020)\n* For the homework, we'll need FHV Data (for-hire vehicles, for 2019 only)\n\nYou can find all the URLs on [the dataset page](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)\n\n\nIn this homework, we will:\n\n* Modify the DAG we created during the lessons for transferring the yellow taxi data\n* Create a new dag for transferring the FHV data\n* Create another dag for the Zones data\n\n\nIf you don't have access to GCP, you can do that locally and ingest data to Postgres \ninstead. If you have access to GCP, you don't need to do it for local Postgres -\nonly if you want.\n\nAlso note that for this homework we don't need the last step - creating a table in GCP.\nAfter putting all the files to the datalake, we'll create the tables in Week 3.\n\n\n\n## Question 1: Start date for the Yellow taxi data (1 point)\n\nYou'll need to parametrize the DAG for processing the yellow taxi data that\nwe created in the videos. \n\nWhat should be the start date for this dag?\n\n* 2019-01-01\n* 2020-01-01\n* 2021-01-01\n* days_ago(1)\n\n\n## Question 2: Frequency for the Yellow taxi data (1 point)\n\nHow often do we need to run this DAG?\n\n* Daily\n* Monthly\n* Yearly\n* Once\n\n\n## Re-running the DAGs for past dates\n\nTo execute your DAG for past dates, try this:\n\n* First, delete your DAG from the web interface (the bin icon)\n* Set the `catchup` parameter to `True`\n* Be careful with running a lot of jobs in parallel - your system may not like it. 
Don't set it higher than 3: `max_active_runs=3`\n* Rename the DAG to something like `data_ingestion_gcs_dag_v02` \n* Execute it from the Airflow GUI (the play button)\n\n\nAlso, there's no data for the recent months, but `curl` will exit successfully.\nTo make it fail on 404, add the `-f` flag:\n\n```bash\ncurl -sSLf { URL } > { LOCAL_PATH }\n```\n\nWhen you run this for all the data, the temporary files will be saved in Docker and will consume your \ndisk space. If it causes problems for you, add another step in your DAG that cleans everything up.\nIt could be a bash operator that runs this command:\n\n```bash\nrm name-of-csv-file.csv name-of-parquet-file.parquet\n```\n\n\n## Question 3: DAG for FHV Data (2 points)\n\nNow create another DAG - for uploading the FHV data. \n\nWe will need three steps: \n\n* Download the data\n* Parquetize it \n* Upload to GCS\n\nIf you don't have a GCP account, for local ingestion you'll need two steps:\n\n* Download the data\n* Ingest to Postgres\n\nUse the same frequency and the start date as for the yellow taxi dataset\n\nQuestion: how many DAG runs are green for data in 2019 after finishing everything? \n\nNote: when processing the data for 2020-01 you probably will get an error. It's up \nto you to decide what to do with it - for Week 3 homework we won't need 2020 data.\n\n\n## Question 4: DAG for Zones (2 points)\n\n\nCreate the final DAG - for Zones:\n\n* Download it\n* Parquetize \n* Upload to GCS\n\n(Or two steps for local ingestion: download -> ingest to postgres)\n\nHow often does it need to run?\n\n* Daily\n* Monthly\n* Yearly\n* Once\n\n\n## Submitting the solutions\n\n* Form for submitting: https://forms.gle/ViWS8pDf2tZD4zSu5\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: February 7, 17:00 CET \n"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/homework/solution.py",
    "content": "import os\nimport logging\n\nfrom datetime import datetime\n\nfrom airflow import DAG\nfrom airflow.utils.dates import days_ago\nfrom airflow.operators.bash import BashOperator\nfrom airflow.operators.python import PythonOperator\n\nfrom google.cloud import storage\n\nimport pyarrow.csv as pv\nimport pyarrow.parquet as pq\n\n\nPROJECT_ID = os.environ.get(\"GCP_PROJECT_ID\")\nBUCKET = os.environ.get(\"GCP_GCS_BUCKET\")\nAIRFLOW_HOME = os.environ.get(\"AIRFLOW_HOME\", \"/opt/airflow/\")\n\n\ndef format_to_parquet(src_file, dest_file):\n    if not src_file.endswith('.csv'):\n        logging.error(\"Can only accept source files in CSV format, for the moment\")\n        return\n    table = pv.read_csv(src_file)\n    pq.write_table(table, dest_file)\n\n\ndef upload_to_gcs(bucket, object_name, local_file):\n    client = storage.Client()\n    bucket = client.bucket(bucket)\n    blob = bucket.blob(object_name)\n    blob.upload_from_filename(local_file)\n\n\ndefault_args = {\n    \"owner\": \"airflow\",\n    #\"start_date\": days_ago(1),\n    \"depends_on_past\": False,\n    \"retries\": 1,\n}\n\n\ndef donwload_parquetize_upload_dag(\n    dag,\n    url_template,\n    local_csv_path_template,\n    local_parquet_path_template,\n    gcs_path_template\n):\n    with dag:\n        download_dataset_task = BashOperator(\n            task_id=\"download_dataset_task\",\n            bash_command=f\"curl -sSLf {url_template} > {local_csv_path_template}\"\n        )\n\n        format_to_parquet_task = PythonOperator(\n            task_id=\"format_to_parquet_task\",\n            python_callable=format_to_parquet,\n            op_kwargs={\n                \"src_file\": local_csv_path_template,\n                \"dest_file\": local_parquet_path_template\n            },\n        )\n\n        local_to_gcs_task = PythonOperator(\n            task_id=\"local_to_gcs_task\",\n            python_callable=upload_to_gcs,\n            op_kwargs={\n                \"bucket\": BUCKET,\n 
               \"object_name\": gcs_path_template,\n                \"local_file\": local_parquet_path_template,\n            },\n        )\n\n        rm_task = BashOperator(\n            task_id=\"rm_task\",\n            bash_command=f\"rm {local_csv_path_template} {local_parquet_path_template}\"\n        )\n\n        download_dataset_task >> format_to_parquet_task >> local_to_gcs_task >> rm_task\n\n\n\nURL_PREFIX = 'https://s3.amazonaws.com/nyc-tlc/trip+data'\n\nYELLOW_TAXI_URL_TEMPLATE = URL_PREFIX + '/yellow_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.csv'\nYELLOW_TAXI_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/yellow_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.csv'\nYELLOW_TAXI_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/yellow_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.parquet'\nYELLOW_TAXI_GCS_PATH_TEMPLATE = \"raw/yellow_tripdata/{{ execution_date.strftime(\\'%Y\\') }}/yellow_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.parquet\"\n\n\nyellow_taxi_data_dag = DAG(\n    dag_id=\"yellow_taxi_data_v2\",\n    schedule_interval=\"0 6 2 * *\",\n    start_date=datetime(2019, 1, 1),\n    default_args=default_args,\n    catchup=True,\n    max_active_runs=3,\n    tags=['dtc-de'],\n)\n\ndonwload_parquetize_upload_dag(\n    dag=yellow_taxi_data_dag,\n    url_template=YELLOW_TAXI_URL_TEMPLATE,\n    local_csv_path_template=YELLOW_TAXI_CSV_FILE_TEMPLATE,\n    local_parquet_path_template=YELLOW_TAXI_PARQUET_FILE_TEMPLATE,\n    gcs_path_template=YELLOW_TAXI_GCS_PATH_TEMPLATE\n)\n\n# https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2021-01.csv\n\nGREEN_TAXI_URL_TEMPLATE = URL_PREFIX + '/green_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.csv'\nGREEN_TAXI_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/green_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.csv'\nGREEN_TAXI_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/green_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.parquet'\nGREEN_TAXI_GCS_PATH_TEMPLATE = \"raw/green_tripdata/{{ 
execution_date.strftime(\\'%Y\\') }}/green_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.parquet\"\n\ngreen_taxi_data_dag = DAG(\n    dag_id=\"green_taxi_data_v1\",\n    schedule_interval=\"0 7 2 * *\",\n    start_date=datetime(2019, 1, 1),\n    default_args=default_args,\n    catchup=True,\n    max_active_runs=3,\n    tags=['dtc-de'],\n)\n\ndonwload_parquetize_upload_dag(\n    dag=green_taxi_data_dag,\n    url_template=GREEN_TAXI_URL_TEMPLATE,\n    local_csv_path_template=GREEN_TAXI_CSV_FILE_TEMPLATE,\n    local_parquet_path_template=GREEN_TAXI_PARQUET_FILE_TEMPLATE,\n    gcs_path_template=GREEN_TAXI_GCS_PATH_TEMPLATE\n)\n\n\n# https://nyc-tlc.s3.amazonaws.com/trip+data/fhv_tripdata_2021-01.csv\n\nFHV_TAXI_URL_TEMPLATE = URL_PREFIX + '/fhv_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.csv'\nFHV_TAXI_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/fhv_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.csv'\nFHV_TAXI_PARQUET_FILE_TEMPLATE = AIRFLOW_HOME + '/fhv_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.parquet'\nFHV_TAXI_GCS_PATH_TEMPLATE = \"raw/fhv_tripdata/{{ execution_date.strftime(\\'%Y\\') }}/fhv_tripdata_{{ execution_date.strftime(\\'%Y-%m\\') }}.parquet\"\n\nfhv_taxi_data_dag = DAG(\n    dag_id=\"hfv_taxi_data_v1\",\n    schedule_interval=\"0 8 2 * *\",\n    start_date=datetime(2019, 1, 1),\n    end_date=datetime(2020, 1, 1),\n    default_args=default_args,\n    catchup=True,\n    max_active_runs=3,\n    tags=['dtc-de'],\n)\n\ndonwload_parquetize_upload_dag(\n    dag=fhv_taxi_data_dag,\n    url_template=FHV_TAXI_URL_TEMPLATE,\n    local_csv_path_template=FHV_TAXI_CSV_FILE_TEMPLATE,\n    local_parquet_path_template=FHV_TAXI_PARQUET_FILE_TEMPLATE,\n    gcs_path_template=FHV_TAXI_GCS_PATH_TEMPLATE\n)\n\n\n# https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv\n\nZONES_URL_TEMPLATE = 'https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv'\nZONES_CSV_FILE_TEMPLATE = AIRFLOW_HOME + '/taxi_zone_lookup.csv'\nZONES_PARQUET_FILE_TEMPLATE = 
AIRFLOW_HOME + '/taxi_zone_lookup.parquet'\nZONES_GCS_PATH_TEMPLATE = \"raw/taxi_zone/taxi_zone_lookup.parquet\"\n\nzones_data_dag = DAG(\n    dag_id=\"zones_data_v1\",\n    schedule_interval=\"@once\",\n    start_date=days_ago(1),\n    default_args=default_args,\n    catchup=True,\n    max_active_runs=3,\n    tags=['dtc-de'],\n)\n\ndonwload_parquetize_upload_dag(\n    dag=zones_data_dag,\n    url_template=ZONES_URL_TEMPLATE,\n    local_csv_path_template=ZONES_CSV_FILE_TEMPLATE,\n    local_parquet_path_template=ZONES_PARQUET_FILE_TEMPLATE,\n    gcs_path_template=ZONES_GCS_PATH_TEMPLATE\n)"
  },
  {
    "path": "cohorts/2022/week_2_data_ingestion/transfer_service/README.md",
    "content": "## Generate AWS Access key\n- Login in to AWS account  \n- Search for IAM\n  ![aws iam](../../images/aws/iam.png)\n- Click on `Manage access key`\n- Click on `Create New Access Key`\n- Download the csv, your access key and secret would be in that csv (Please note that once lost secret cannot be recovered)\n\n## Transfer service\nhttps://console.cloud.google.com/transfer/cloud/jobs\n\n\n"
  },
  {
    "path": "cohorts/2022/week_3_data_warehouse/airflow/.env_example",
    "content": "# Custom\nCOMPOSE_PROJECT_NAME=dtc-de\nGOOGLE_APPLICATION_CREDENTIALS=/.google/credentials/google_credentials.json\nAIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json\n# AIRFLOW_UID=\nGCP_PROJECT_ID=\nGCP_GCS_BUCKET=\n\n# Postgres\nPOSTGRES_USER=airflow\nPOSTGRES_PASSWORD=airflow\nPOSTGRES_DB=airflow\n\n# Airflow\nAIRFLOW__CORE__EXECUTOR=LocalExecutor\nAIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=10\n\nAIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres:5432/${POSTGRES_DB}\nAIRFLOW_CONN_METADATA_DB=postgres+psycopg2://airflow:airflow@postgres:5432/airflow\nAIRFLOW_VAR__METADATA_DB_SCHEMA=airflow\n\n_AIRFLOW_WWW_USER_CREATE=True\n_AIRFLOW_WWW_USER_USERNAME=${_AIRFLOW_WWW_USER_USERNAME:airflow}\n_AIRFLOW_WWW_USER_PASSWORD=${_AIRFLOW_WWW_USER_PASSWORD:airflow}\n\nAIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True\nAIRFLOW__CORE__LOAD_EXAMPLES=False\n"
  },
  {
    "path": "cohorts/2022/week_3_data_warehouse/airflow/1_setup_official.md",
    "content": "## Setup (Official)\n\n### Pre-Reqs\n\n1. For the sake of standardization across this workshop's config,\n    rename your gcp-service-accounts-credentials file to `google_credentials.json` & store it in your `$HOME` directory\n    ``` bash\n        cd ~ && mkdir -p ~/.google/credentials/\n        mv <path/to/your/service-account-authkeys>.json ~/.google/credentials/google_credentials.json\n    ```\n\n2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to minimum 5GB\n(ideally 8GB). If enough memory is not allocated, it might lead to airflow-webserver continuously restarting.\n\n3. Python version: 3.7+\n\n\n### Airflow Setup\n\n1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in)\n\n2. **Set the Airflow user**:\n\n    On Linux, the quick-start needs to know your host user-id and needs to have group id set to 0. \n    Otherwise the files created in `dags`, `logs` and `plugins` will be created with root user. \n    You have to make sure to configure them for the docker-compose:\n\n    ```bash\n    mkdir -p ./dags ./logs ./plugins\n    echo -e \"AIRFLOW_UID=$(id -u)\" > .env\n    ```\n\n    On Windows you will probably also need it. If you use MINGW/GitBash, execute the same command. \n\n    To get rid of the warning (\"AIRFLOW_UID is not set\"), you can create `.env` file with\n    this content:\n\n    ```\n    AIRFLOW_UID=50000\n    ```\n\n3. **Import the official docker setup file** from the latest Airflow version:\n   ```shell\n   curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'\n   ```\n   \n4. It could be overwhelming to see a lot of services in here. \n   But this is only a quick-start template, and as you proceed you'll figure out which unused services can be removed.\n   Eg. [Here's](docker-compose-nofrills.yml) a no-frills version of that template.\n\n\n5. 
**Docker Build**:\n\n    When you want to run Airflow locally, you might want to use an extended image, \n    containing some additional dependencies - for example you might add new python packages, \n    or upgrade airflow providers to a later version.\n    \n    Create a `Dockerfile` pointing to Airflow version you've just downloaded, \n    such as `apache/airflow:2.2.3`, as the base image,\n       \n    And customize this `Dockerfile` by:\n    * Adding your custom packages to be installed. The one we'll need the most is `gcloud` to connect with the GCS bucket/Data Lake.\n    * Also, integrating `requirements.txt` to install libraries via  `pip install`\n\n\n6. **Docker Compose**:\n\n    Back in your `docker-compose.yaml`:\n   * In `x-airflow-common`: \n     * Remove the `image` tag, to replace it with your `build` from your Dockerfile, as shown\n     * Mount your `google_credentials` in `volumes` section as read-only\n     * Set environment variables: `GCP_PROJECT_ID`, `GCP_GCS_BUCKET`, `GOOGLE_APPLICATION_CREDENTIALS` & `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`, as per your config.\n\n   * Change `AIRFLOW__CORE__LOAD_EXAMPLES` to `false` (optional)\n\n7. 
Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose.yaml](./docker-compose.yaml) should look.\n\n\n## Problems\n\n### `File /.google/credentials/google_credentials.json was not found`\n\nFirst, make sure you have your credentials in your `$HOME/.google/credentials`.\nMaybe you missed the step and didn't copy your credentials JSON there?\nAlso, make sure the file name is `google_credentials.json`.\n\nSecond, check that docker-compose can correctly map this directory to the Airflow worker.\n\nExecute `docker ps` to see the list of docker containers running on your host machine and find the ID of the Airflow worker.\n\nThen execute `bash` on this container:\n\n```bash\ndocker exec -it <container-ID> bash\n```\n\nNow check if the file with credentials is actually there:\n\n```bash\nls -lh /.google/credentials/\n```\n\nIf it's empty, docker-compose couldn't map the folder with credentials. \nIn this case, try changing it to the absolute path to this folder:\n\n```yaml\n  volumes:\n    - ./dags:/opt/airflow/dags\n    - ./logs:/opt/airflow/logs\n    - ./plugins:/opt/airflow/plugins\n    # here: ----------------------------\n    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro\n    # -----------------------------------\n```\n"
  },
  {
    "path": "cohorts/2022/week_3_data_warehouse/airflow/2_setup_nofrills.md",
    "content": "## Setup (No-frills)\n\n### Pre-Reqs\n\n1. For the sake of standardization across this workshop's config,\n    rename your gcp-service-accounts-credentials file to `google_credentials.json` & store it in your `$HOME` directory\n    ``` bash\n        cd ~ && mkdir -p ~/.google/credentials/\n        mv <path/to/your/service-account-authkeys>.json ~/.google/credentials/google_credentials.json\n    ```\n\n2. You may need to upgrade your docker-compose version to v2.x+, and set the memory for your Docker Engine to minimum 4GB\n(ideally 8GB). If enough memory is not allocated, it might lead to airflow-webserver continuously restarting.\n\n3. Python version: 3.7+\n\n\n### Airflow Setup\n\n1. Create a new sub-directory called `airflow` in your `project` dir (such as the one we're currently in)\n   \n2. **Set the Airflow user**:\n\n    On Linux, the quick-start needs to know your host user-id and needs to have group id set to 0. \n    Otherwise the files created in `dags`, `logs` and `plugins` will be created with root user. \n    You have to make sure to configure them for the docker-compose:\n\n    ```bash\n    mkdir -p ./dags ./logs ./plugins\n    echo -e \"AIRFLOW_UID=$(id -u)\" >> .env\n    ```\n\n    On Windows you will probably also need it. If you use MINGW/GitBash, execute the same command. \n\n    To get rid of the warning (\"AIRFLOW_UID is not set\"), you can create `.env` file with\n    this content:\n\n    ```\n    AIRFLOW_UID=50000\n    ```\n\n3. **Docker Build**:\n\n    When you want to run Airflow locally, you might want to use an extended image, \n    containing some additional dependencies - for example you might add new python packages, \n    or upgrade airflow providers to a later version.\n    \n    Create a `Dockerfile` pointing to the latest Airflow version such as `apache/airflow:2.2.3`, for the base image,\n       \n    And customize this `Dockerfile` by:\n    * Adding your custom packages to be installed. 
The one we'll need the most is `gcloud` to connect with the GCS bucket (Data Lake).\n    * Also, integrating `requirements.txt` to install libraries via  `pip install`\n\n4. Copy [docker-compose-nofrills.yml](docker-compose-nofrills.yml), [.env_example](.env_example) & [entrypoint.sh](scripts/entrypoint.sh) from this repo.\n    The changes from the official setup are:\n    * Removal of `redis` queue, `worker`, `triggerer`, `flower` & `airflow-init` services, \n    and changing from `CeleryExecutor` (multi-node) mode to `LocalExecutor` (single-node) mode \n    * Inclusion of `.env` for better parametrization & flexibility\n    * Inclusion of a simple `entrypoint.sh` in the `webserver` container, responsible for initializing the database and creating the login user (admin).\n    * Updated `Dockerfile` to grant permissions on executing `scripts/entrypoint.sh`\n        \n5. `.env`:\n    * Rebuild your `.env` file by making a copy of `.env_example` (but make sure your `AIRFLOW_UID` remains):\n        ```shell\n        mv .env_example .env\n        ```\n    * Set environment variables `AIRFLOW_UID`, `GCP_PROJECT_ID` & `GCP_GCS_BUCKET`, as per your config.\n    * Optionally, if your `google_credentials.json` is stored somewhere else, such as a path like `$HOME/.gc`, \n    modify the env-vars (`GOOGLE_APPLICATION_CREDENTIALS`, `AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT`) and `volumes` path in `docker-compose-nofrills.yml`\n\n6. 
Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose-nofrills](./docker-compose-nofrills.yml) should look.\n\n\n## Problems\n\n### `no-frills setup does not work for me - WSL/Windows user `\n\nIf you are running Docker in Windows/WSL/WSL2 and you have encountered some `ModuleNotFoundError` or low performance issues,\ntake a look at this [Airflow & WSL2 gist](https://gist.github.com/nervuzz/d1afe81116cbfa3c834634ebce7f11c5) focused entirely on troubleshooting possible problems.\n\n### `File /.google/credentials/google_credentials.json was not found`\n\nFirst, make sure you have your credentials in your `$HOME/.google/credentials`.\nMaybe you missed the step and didn't copy your credentials JSON there?\nAlso, make sure the file name is `google_credentials.json`.\n\nSecond, check that docker-compose can correctly map this directory to the Airflow worker.\n\nExecute `docker ps` to see the list of docker containers running on your host machine and find the ID of the Airflow worker.\n\nThen execute `bash` on this container:\n\n```bash\ndocker exec -it <container-ID> bash\n```\n\nNow check if the file with credentials is actually there:\n\n```bash\nls -lh /.google/credentials/\n```\n\nIf it's empty, docker-compose couldn't map the folder with credentials. \nIn this case, try changing it to the absolute path to this folder:\n\n```yaml\n  volumes:\n    - ./dags:/opt/airflow/dags\n    - ./logs:/opt/airflow/logs\n    - ./plugins:/opt/airflow/plugins\n    # here: ----------------------------\n    - c:/Users/alexe/.google/credentials/:/.google/credentials:ro\n    # -----------------------------------\n```\n"
  },
  {
    "path": "cohorts/2022/week_3_data_warehouse/airflow/README.md",
    "content": "### Concepts\n\n [Airflow Concepts and Architecture](../week_2_data_ingestion/airflow/docs/1_concepts.md)\n\n### Workflow\n\n ![](docs/gcs_2_bq_dag_graph_view.png)\n \n ![](docs/gcs_2_bq_dag_tree_view.png)\n \n### Setup - Official Version\n (For the section on the Custom/Lightweight setup, scroll down)\n\n #### Setup\n  [Airflow Setup with Docker, through official guidelines](1_setup_official.md)\n\n #### Execution\n \n  1. Build the image (only first-time, or when there's any change in the `Dockerfile`, takes ~15 mins for the first-time):\n     ```shell\n     docker-compose build\n     ```\n   \n     or (for legacy versions)\n   \n     ```shell\n     docker build .\n     ```\n\n 2. Initialize the Airflow scheduler, DB, and other config\n    ```shell\n    docker-compose up airflow-init\n    ```\n\n 3. Kick up the all the services from the container:\n    ```shell\n    docker-compose up\n    ```\n\n 4. In another terminal, run `docker-compose ps` to see which containers are up & running (there should be 7, matching with the services in your docker-compose file).\n\n 5. Login to Airflow web UI on `localhost:8080` with default creds: `airflow/airflow`\n\n 6. Run your DAG on the Web Console.\n\n 7. On finishing your run or to shut down the container/s:\n    ```shell\n    docker-compose down\n    ```\n\n    To stop and delete containers, delete volumes with database data, and download images, run:\n    ```\n    docker-compose down --volumes --rmi all\n    ```\n\n    or\n    ```\n    docker-compose down --volumes --remove-orphans\n    ```\n       \n### Setup - Custom No-Frills Version (Lightweight)\nThis is a quick, simple & less memory-intensive setup of Airflow that works on a LocalExecutor.\n\n  #### Setup\n  [Airflow Setup with Docker, customized](2_setup_nofrills.md)\n\n  #### Execution\n  \n  1. 
Stop and delete containers, delete volumes with database data, and delete downloaded images (from the previous setup):\n    ```\n    docker-compose down --volumes --rmi all\n    ```\n\n   or\n    ```\n    docker-compose down --volumes --remove-orphans\n    ```\n    \n   Or, if you need to clear your system of any pre-cached Docker issues:\n    ```\n    docker system prune\n    ```\n    \n   Also, empty the airflow `logs` directory.\n    \n  2. Build the image (only the first time, or when there's any change in the `Dockerfile`):\n  Takes ~5-10 mins the first time\n    ```shell\n    docker-compose build\n    ```\n    or (for legacy versions)\n    ```shell\n    docker build .\n    ```\n\n  3. Kick up all the services from the container (no need to specially initialize):\n    ```shell\n    docker-compose -f docker-compose-nofrills.yml up\n    ```\n\n  4. In another terminal, run `docker ps` to see which containers are up & running (there should be 3, matching with the services in your docker-compose file).\n\n  5. Log in to the Airflow web UI on `localhost:8080` with creds: `admin/admin` (explicit creation of admin user was required)\n\n  6. Run your DAG on the Web Console.\n\n  7. On finishing your run or to shut down the container/s:\n    ```shell\n    docker-compose down\n    ```\n    \n   \n\n### Future Enhancements\n* Deploy a self-hosted Airflow setup on a Kubernetes cluster, or use a Managed Airflow (Cloud Composer) service by GCP\n\n### References\nFor more info, check out these official docs:\n   * https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html\n   * https://airflow.apache.org/docs/docker-stack/build.html\n   * https://airflow.apache.org/docs/docker-stack/recipes.html\n\n"
  },
  {
    "path": "cohorts/2022/week_3_data_warehouse/airflow/dags/gcs_to_bq_dag.py",
    "content": "import os\nimport logging\n\nfrom airflow import DAG\nfrom airflow.utils.dates import days_ago\nfrom airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator, BigQueryInsertJobOperator\nfrom airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator\n\nPROJECT_ID = os.environ.get(\"GCP_PROJECT_ID\")\nBUCKET = os.environ.get(\"GCP_GCS_BUCKET\")\n\npath_to_local_home = os.environ.get(\"AIRFLOW_HOME\", \"/opt/airflow/\")\nBIGQUERY_DATASET = os.environ.get(\"BIGQUERY_DATASET\", 'trips_data_all')\n\nDATASET = \"tripdata\"\nCOLOUR_RANGE = {'yellow': 'tpep_pickup_datetime', 'green': 'lpep_pickup_datetime'}\nINPUT_PART = \"raw\"\nINPUT_FILETYPE = \"parquet\"\n\ndefault_args = {\n    \"owner\": \"airflow\",\n    \"start_date\": days_ago(1),\n    \"depends_on_past\": False,\n    \"retries\": 1,\n}\n\n# NOTE: DAG declaration - using a Context Manager (an implicit way)\nwith DAG(\n    dag_id=\"gcs_2_bq_dag\",\n    schedule_interval=\"@daily\",\n    default_args=default_args,\n    catchup=False,\n    max_active_runs=1,\n    tags=['dtc-de'],\n) as dag:\n\n    for colour, ds_col in COLOUR_RANGE.items():\n        move_files_gcs_task = GCSToGCSOperator(\n            task_id=f'move_{colour}_{DATASET}_files_task',\n            source_bucket=BUCKET,\n            source_object=f'{INPUT_PART}/{colour}_{DATASET}*.{INPUT_FILETYPE}',\n            destination_bucket=BUCKET,\n            destination_object=f'{colour}/{colour}_{DATASET}',\n            move_object=True\n        )\n\n        bigquery_external_table_task = BigQueryCreateExternalTableOperator(\n            task_id=f\"bq_{colour}_{DATASET}_external_table_task\",\n            table_resource={\n                \"tableReference\": {\n                    \"projectId\": PROJECT_ID,\n                    \"datasetId\": BIGQUERY_DATASET,\n                    \"tableId\": f\"{colour}_{DATASET}_external_table\",\n                },\n                
\"externalDataConfiguration\": {\n                    \"autodetect\": \"True\",\n                    \"sourceFormat\": f\"{INPUT_FILETYPE.upper()}\",\n                    \"sourceUris\": [f\"gs://{BUCKET}/{colour}/*\"],\n                },\n            },\n        )\n\n        CREATE_BQ_TBL_QUERY = (\n            f\"CREATE OR REPLACE TABLE {BIGQUERY_DATASET}.{colour}_{DATASET} \\\n            PARTITION BY DATE({ds_col}) \\\n            AS \\\n            SELECT * FROM {BIGQUERY_DATASET}.{colour}_{DATASET}_external_table;\"\n        )\n\n        # Create a partitioned table from external table\n        bq_create_partitioned_table_job = BigQueryInsertJobOperator(\n            task_id=f\"bq_create_{colour}_{DATASET}_partitioned_table_task\",\n            configuration={\n                \"query\": {\n                    \"query\": CREATE_BQ_TBL_QUERY,\n                    \"useLegacySql\": False,\n                }\n            }\n        )\n\n        move_files_gcs_task >> bigquery_external_table_task >> bq_create_partitioned_table_job\n"
  },
  {
    "path": "cohorts/2022/week_3_data_warehouse/airflow/docker-compose-nofrills.yml",
    "content": "version: '3'\nservices:\n    postgres:\n        image: postgres:13\n        env_file:\n            - .env\n        volumes:\n            - postgres-db-volume:/var/lib/postgresql/data\n        healthcheck:\n            test: [\"CMD\", \"pg_isready\", \"-U\", \"airflow\"]\n            interval: 5s\n            retries: 5\n        restart: always\n\n    scheduler:\n        build: .\n        command: scheduler\n        restart: on-failure\n        depends_on:\n            - postgres\n        env_file:\n            - .env\n        volumes:\n            - ./dags:/opt/airflow/dags\n            - ./logs:/opt/airflow/logs\n            - ./plugins:/opt/airflow/plugins\n            - ./scripts:/opt/airflow/scripts\n            - ~/.google/credentials/:/.google/credentials:ro\n\n\n    webserver:\n        build: .\n        entrypoint: ./scripts/entrypoint.sh\n        restart: on-failure\n        depends_on:\n            - postgres\n            - scheduler\n        env_file:\n            - .env\n        volumes:\n            - ./dags:/opt/airflow/dags\n            - ./logs:/opt/airflow/logs\n            - ./plugins:/opt/airflow/plugins\n            - ~/.google/credentials/:/.google/credentials:ro\n            - ./scripts:/opt/airflow/scripts\n\n        user: \"${AIRFLOW_UID:-50000}:0\"\n        ports:\n            - \"8080:8080\"\n        healthcheck:\n            test: [ \"CMD-SHELL\", \"[ -f /home/airflow/airflow-webserver.pid ]\" ]\n            interval: 30s\n            timeout: 30s\n            retries: 3\n\nvolumes:\n  postgres-db-volume:"
  },
  {
    "path": "cohorts/2022/week_3_data_warehouse/airflow/docker-compose.yaml",
    "content": "# Licensed to the Apache Software Foundation (ASF) under one\n# or more contributor license agreements.  See the NOTICE file\n# distributed with this work for additional information\n# regarding copyright ownership.  The ASF licenses this file\n# to you under the Apache License, Version 2.0 (the\n# \"License\"); you may not use this file except in compliance\n# with the License.  You may obtain a copy of the License at\n#\n#   http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing,\n# software distributed under the License is distributed on an\n# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n# KIND, either express or implied.  See the License for the\n# specific language governing permissions and limitations\n# under the License.\n#\n\n# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.\n#\n# WARNING: This configuration is for local development. Do not use it in a production deployment.\n#\n# This configuration supports basic configuration using environment variables or an .env file\n# The following variables are supported:\n#\n# AIRFLOW_IMAGE_NAME           - Docker image name used to run Airflow.\n#                                Default: apache/airflow:2.2.3\n# AIRFLOW_UID                  - User ID in Airflow containers\n#                                Default: 50000\n# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode\n#\n# _AIRFLOW_WWW_USER_USERNAME   - Username for the administrator account (if requested).\n#                                Default: airflow\n# _AIRFLOW_WWW_USER_PASSWORD   - Password for the administrator account (if requested).\n#                                Default: airflow\n# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.\n#                                Default: ''\n#\n# Feel free to modify this file to suit your 
needs.\n---\nversion: '3'\nx-airflow-common:\n  &airflow-common\n  # In order to add custom dependencies or upgrade provider packages you can use your extended image.\n  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml\n  # and uncomment the \"build\" line below, Then run `docker-compose build` to build the images.\n  build:\n    context: .\n    dockerfile: ./Dockerfile\n  environment:\n    &airflow-common-env\n    AIRFLOW__CORE__EXECUTOR: LocalExecutor\n    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow\n#    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow\n#    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0\n    AIRFLOW__CORE__FERNET_KEY: ''\n    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'\n    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'\n    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'\n    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}\n    GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json\n    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'\n\n    # TODO: Please change GCP_PROJECT_ID & GCP_GCS_BUCKET, as per your config\n    GCP_PROJECT_ID: 'pivotal-surfer-336713'\n    GCP_GCS_BUCKET: 'dtc_data_lake_pivotal-surfer-336713'\n\n  volumes:\n    - ./dags:/opt/airflow/dags\n    - ./logs:/opt/airflow/logs\n    - ./plugins:/opt/airflow/plugins\n    - ~/.google/credentials/:/.google/credentials:ro\n\n  user: \"${AIRFLOW_UID:-50000}:0\"\n  depends_on:\n    &airflow-common-depends-on\n#    redis:\n#      condition: service_healthy\n    postgres:\n      condition: service_healthy\n\nservices:\n  postgres:\n    image: postgres:13\n    environment:\n      POSTGRES_USER: airflow\n      POSTGRES_PASSWORD: airflow\n      POSTGRES_DB: airflow\n    volumes:\n      - 
postgres-db-volume:/var/lib/postgresql/data\n    healthcheck:\n      test: [\"CMD\", \"pg_isready\", \"-U\", \"airflow\"]\n      interval: 5s\n      retries: 5\n    restart: always\n\n#  redis:\n#    image: redis:latest\n#    expose:\n#      - 6379\n#    healthcheck:\n#      test: [\"CMD\", \"redis-cli\", \"ping\"]\n#      interval: 5s\n#      timeout: 30s\n#      retries: 50\n#    restart: always\n\n  airflow-webserver:\n    <<: *airflow-common\n    command: webserver\n    ports:\n      - 8080:8080\n    healthcheck:\n      test: [\"CMD\", \"curl\", \"--fail\", \"http://localhost:8080/health\"]\n      interval: 10s\n      timeout: 10s\n      retries: 5\n    restart: always\n    depends_on:\n      <<: *airflow-common-depends-on\n      airflow-init:\n        condition: service_completed_successfully\n\n  airflow-scheduler:\n    <<: *airflow-common\n    command: scheduler\n    healthcheck:\n      test: [\"CMD-SHELL\", 'airflow jobs check --job-type SchedulerJob --hostname \"$${HOSTNAME}\"']\n      interval: 10s\n      timeout: 10s\n      retries: 5\n    restart: always\n    depends_on:\n      <<: *airflow-common-depends-on\n      airflow-init:\n        condition: service_completed_successfully\n\n#  airflow-worker:\n#    <<: *airflow-common\n#    command: celery worker\n#    healthcheck:\n#      test:\n#        - \"CMD-SHELL\"\n#        - 'celery --app airflow.executors.celery_executor.app inspect ping -d \"celery@$${HOSTNAME}\"'\n#      interval: 10s\n#      timeout: 10s\n#      retries: 5\n#    environment:\n#      <<: *airflow-common-env\n#      # Required to handle warm shutdown of the celery workers properly\n#      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation\n#      DUMB_INIT_SETSID: \"0\"\n#    restart: always\n#    depends_on:\n#      <<: *airflow-common-depends-on\n#      airflow-init:\n#        condition: service_completed_successfully\n#\n#  airflow-triggerer:\n#    <<: *airflow-common\n#    command: triggerer\n#    
healthcheck:\n#      test: [\"CMD-SHELL\", 'airflow jobs check --job-type TriggererJob --hostname \"$${HOSTNAME}\"']\n#      interval: 10s\n#      timeout: 10s\n#      retries: 5\n#    restart: always\n#    depends_on:\n#      <<: *airflow-common-depends-on\n#      airflow-init:\n#        condition: service_completed_successfully\n\n  airflow-init:\n    <<: *airflow-common\n    entrypoint: /bin/bash\n    # yamllint disable rule:line-length\n    command:\n      - -c\n      - |\n        function ver() {\n          printf \"%04d%04d%04d%04d\" $${1//./ }\n        }\n        airflow_version=$$(gosu airflow airflow version)\n        airflow_version_comparable=$$(ver $${airflow_version})\n        min_airflow_version=2.2.0\n        min_airflow_version_comparable=$$(ver $${min_airflow_version})\n        if (( airflow_version_comparable < min_airflow_version_comparable )); then\n          echo\n          echo -e \"\\033[1;31mERROR!!!: Too old Airflow version $${airflow_version}!\\e[0m\"\n          echo \"The minimum Airflow version supported: $${min_airflow_version}. 
Only use this or higher!\"\n          echo\n          exit 1\n        fi\n        if [[ -z \"${AIRFLOW_UID}\" ]]; then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: AIRFLOW_UID not set!\\e[0m\"\n          echo \"If you are on Linux, you SHOULD follow the instructions below to set \"\n          echo \"AIRFLOW_UID environment variable, otherwise files will be owned by root.\"\n          echo \"For other operating systems you can get rid of the warning with manually created .env file:\"\n          echo \"    See: https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#setting-the-right-airflow-user\"\n          echo\n        fi\n        one_meg=1048576\n        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))\n        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)\n        disk_available=$$(df / | tail -1 | awk '{print $$4}')\n        warning_resources=\"false\"\n        if (( mem_available < 4000 )) ; then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: Not enough memory available for Docker.\\e[0m\"\n          echo \"At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))\"\n          echo\n          warning_resources=\"true\"\n        fi\n        if (( cpus_available < 2 )); then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\\e[0m\"\n          echo \"At least 2 CPUs recommended. You have $${cpus_available}\"\n          echo\n          warning_resources=\"true\"\n        fi\n        if (( disk_available < one_meg * 10 )); then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\\e[0m\"\n          echo \"At least 10 GBs recommended. 
You have $$(numfmt --to iec $$((disk_available * 1024 )))\"\n          echo\n          warning_resources=\"true\"\n        fi\n        if [[ $${warning_resources} == \"true\" ]]; then\n          echo\n          echo -e \"\\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\\e[0m\"\n          echo \"Please follow the instructions to increase amount of resources available:\"\n          echo \"   https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#before-you-begin\"\n          echo\n        fi\n        mkdir -p /sources/logs /sources/dags /sources/plugins\n        chown -R \"${AIRFLOW_UID}:0\" /sources/{logs,dags,plugins}\n        exec /entrypoint airflow version\n    # yamllint enable rule:line-length\n    environment:\n      <<: *airflow-common-env\n      _AIRFLOW_DB_UPGRADE: 'true'\n      _AIRFLOW_WWW_USER_CREATE: 'true'\n      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}\n      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}\n    user: \"0:0\"\n    volumes:\n      - .:/sources\n\n  airflow-cli:\n    <<: *airflow-common\n    profiles:\n      - debug\n    environment:\n      <<: *airflow-common-env\n      CONNECTION_CHECK_MAX_COUNT: \"0\"\n    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252\n    command:\n      - bash\n      - -c\n      - airflow\n\n#  flower:\n#    <<: *airflow-common\n#    command: celery flower\n#    ports:\n#      - 5555:5555\n#    healthcheck:\n#      test: [\"CMD\", \"curl\", \"--fail\", \"http://localhost:5555/\"]\n#      interval: 10s\n#      timeout: 10s\n#      retries: 5\n#    restart: always\n#    depends_on:\n#      <<: *airflow-common-depends-on\n#      airflow-init:\n#        condition: service_completed_successfully\n\nvolumes:\n  postgres-db-volume:\n"
  },
  {
    "path": "cohorts/2022/week_3_data_warehouse/airflow/scripts/entrypoint.sh",
    "content": "#!/usr/bin/env bash\nexport GOOGLE_APPLICATION_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS}\nexport AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=${AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT}\n\nairflow db upgrade\n\nairflow users create -r Admin -u admin -p admin -e admin@example.com -f admin -l airflow\n# \"$_AIRFLOW_WWW_USER_USERNAME\" -p \"$_AIRFLOW_WWW_USER_PASSWORD\"\n\nairflow webserver\n"
  },
  {
    "path": "cohorts/2022/week_5_batch_processing/homework.md",
    "content": "## Week 5 Homework\n\nIn this homework we'll put what we learned about Spark\ninto practice.\n\nWe'll use the high-volume for-hire vehicles (HVFHV) dataset for that.\n\n## Question 1. Install Spark and PySpark\n\n* Install Spark\n* Run PySpark\n* Create a local Spark session \n* Execute `spark.version`\n\nWhat's the output?\n\n\n## Question 2. HVFHV February 2021\n\nDownload the HVFHV data for February 2021:\n\n```bash\nwget https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2021-02.csv\n```\n\nRead it with Spark using the same schema as we did \nin the lessons. We will use this dataset for all\nthe remaining questions.\n\nRepartition it to 24 partitions and save it to\nparquet.\n\nWhat's the size of the folder with results (in MB)?\n\n\n## Question 3. Count records \n\nHow many taxi trips were there on February 15?\n\nConsider only trips that started on February 15.\n\n\n## Question 4. Longest trip for each day\n\nNow calculate the duration for each trip.\n\nTrip starting on which day was the longest? \n\n\n## Question 5. Most frequent `dispatching_base_num`\n\nNow find the most frequently occurring `dispatching_base_num` \nin this dataset.\n\nHow many stages does this Spark job have?\n\n> Note: the answer may depend on how you write the query,\n> so there are multiple correct answers. \n> Select the one you have.\n\n\n## Question 6. Most common location pair\n\nFind the most common pickup-dropoff pair. \n\nFor example:\n\n\"Jamaica Bay / Clinton East\"\n\nEnter two zone names separated by a slash\n\nIf any of the zone names are unknown (missing), use \"Unknown\". For example, \"Unknown / Clinton East\". \n\n\n## Bonus question. Join type\n\n(not graded) \n\nFor finding the answer to Q6, you'll need to perform a join.\n\nWhat type of join is it?\n\nAnd how many stages does your Spark job have?\n\n\n## Submitting the solutions\n\n* Form for submitting: https://forms.gle/dBkVK9yT8cSMDwuw7\n* You can submit your homework multiple times. 
In this case, only the last submission will be used. \n\nDeadline: 07 March (Monday), 22:00 CET\n"
  },
  {
    "path": "cohorts/2022/week_6_stream_processing/homework.md",
    "content": "## Week 6 Homework\n[Form](https://forms.gle/mSzfpPCXskWCabeu5)\n\nThe homework is mostly theoretical. The last question asks for a link to working code; please keep in mind that this\nquestion is not scored.\n\nDeadline: 14 March, 22:00 CET"
  },
  {
    "path": "cohorts/2023/README.md",
    "content": "## Data Engineering Zoomcamp 2023 Cohort\n\n* [Launch stream with course overview](https://www.youtube.com/watch?v=-zpVha7bw5A)\n* [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html)\n* [Public Leaderboard](leaderboard.md) and [Private Leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vTbL00GcdQp0bJt9wf1ROltMq7s3qyxl-NYF7Pvk79Jfxgwfn9dNWmPD_yJHTDq_Wzvps8EIr6cOKWm/pubhtml)\n* [Course Playlist: Only 2023 Live videos & homeworks](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW)\n\n[**Week 1: Introduction & Prerequisites**](week_1_docker_sql/)\n\n* [Homework SQL](week_1_docker_sql/homework.md) and [solution](https://www.youtube.com/watch?v=KIh_9tZiroA)\n* [Homework Terraform](week_1_terraform/homework.md)\n* [Office hours](https://www.youtube.com/watch?v=RVTryVvSyw4&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW)\n\n[**Week 2: Workflow Orchestration**](week_2_workflow_orchestration)\n\n* [Homework](week_2_workflow_orchestration/homework.md)\n* [Office hours part 1](https://www.youtube.com/watch?v=a_nmLHb8hzw&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW) and [part 2](https://www.youtube.com/watch?v=PK8yyMY54Vk&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW&index=7) \n\n[**Week 3: Data Warehouse**](week_3_data_warehouse)\n\n* [Homework](week_3_data_warehouse/homework.md)\n* [Office hours](https://www.youtube.com/watch?v=QXfmtJp3bXE&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW)\n\n[**Week 4: Analytics Engineering**](week_4_analytics_engineering/)\n\n* [Homework](week_4_analytics_engineering/homework.md)\n* [PipeRider + dbt Workshop](workshops/piperider.md)\n* [Office hours](https://www.youtube.com/watch?v=ODYg_r72qaE&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW)\n\n[**Week 5: Batch processing**](week_5_batch_processing/)\n\n* [Homework](week_5_batch_processing/homework.md)\n* [Office hours](https://www.youtube.com/watch?v=5_69yL2PPYI&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW)\n\n[**Week 6: Stream 
Processing**](week_6_stream_processing)\n\n* [Homework](week_6_stream_processing/homework.md)\n\n\n[**Week 7, 8 & 9: Project**](project.md)\n\nMore information [here](project.md)\n"
  },
  {
    "path": "cohorts/2023/leaderboard.md",
    "content": "## Leaderboard \n\nThis is the top [100 leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vTbL00GcdQp0bJt9wf1ROltMq7s3qyxl-NYF7Pvk79Jfxgwfn9dNWmPD_yJHTDq_Wzvps8EIr6cOKWm/pubhtml)\nof participants of Data Engineering Zoomcamp 2023 edition!\n\n<table>\n<tr>\n  <th>Name</th>\n  <th>Project</th>\n  <th>Social</th>\n  <th>Links and comments</th>\n</tr>\n<tr>\n<td>Katharina Eichinger</td>\n<td><a href=\"https://github.com/PandaKata/dezoomcamp-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/katharina-eichinger/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/PandaKata\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Alia Hamwi</td>\n<td><a href=\"https://github.com/AliaHa3/data-engineering-zoomcamp-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/alia-hamwi/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/AliaHa3\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Emmanuel Ikpesu</td>\n<td><a href=\"https://github.com/uchiharon/DataTalksClub_de-zoomcamp_CapStone_Project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/emmanuel-ikpesu-393708132/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/uchiharon\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More 
info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://medium.com/@emmanarutops2/automating-data-pipelines-using-prefect-block-98d9b16f16bc\">Automating Data Pipelines Using Prefect Block</a></li>\n</ul></details></td>\n</tr>\n<tr>\n<td>Sanya Syed</td>\n<td><a href=\"https://github.com/sanyassyed/sf_eviction\">Project</a></td>\n<td> <a href=\"http://linkedin.com/in/sanyasy\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/sanyassyed\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://resume.creddle.io/resume/1so01cu6gx7\">My Resume</a></li>\n</ul>\n\n> I am excited about the prospect of securing a challenging role as a Data Engineer, where I can utilise my skills and expertise to contribute meaningfully to an organisation's data-driven initiatives. 
</details></td>\n</tr>\n<tr>\n<td>Aminu Lawal</td>\n<td><a href=\"https://github.com/zabull1/cycling_DE_project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/aminu-lawal-600920100/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/zabull1\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Lisa Reiber</td>\n<td><a href=\"https://github.com/lisallreiber/biketheft_berlin\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/lisareiber/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/lisallreiber\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://lookerstudio.google.com/u/2/reporting/8a06d083-e46f-403a-bcb0-d3ff23434e24/page/p_nmv21l7w4c\">Project Dashboard</a></li>\n</ul>\n\n> always happy to connect with other data enthusiasts over topics like low-budget data engineering solutions for non-profits or AI solutions for non-profits</details></td>\n</tr>\n<tr>\n<td>Vincenzo Galante</td>\n<td><a href=\"https://lookerstudio.google.com/u/0/reporting/ebdf68e1-27f7-435b-8add-a4018681f801/page/BkBJD\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/galantevincenzo/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/VincenzoGalante\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" 
/></a></td>\n<td><details>\n<summary>More info</summary>\n\n\n\n> Thank you for having this course!</details></td>\n</tr>\n<tr>\n<td>Grzegorz Gątkowski </td>\n<td><a href=\"https://github.com/GrzegorzGatkowski/Air_Pollution_Pipeline\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/grzegorz-g%C4%85tkowski-811727125/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/GrzegorzGatkowski\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Matt Young</td>\n<td><a href=\"https://github.com/directdetour/BeerReviewsDataPipeline\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/matt-young-11377720/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/directdetour\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://twitter.com/ymatty\">Twitter</a></li>\n</ul>\n\n> Experienced Developer | Cloud & Data Enthusiast | Open to Cloud & Data Engineering Roles 🌩️\n➜ C#, SQL, JavaScript, Python | BI, Data Analytics | AWS, Azure, GCP\n\nPassionate about data pipelines, storage, and processing. Excited to implement advanced cloud solutions and enable data-driven insights. Seeking Data Engineering opportunities to leverage my extensive SQL/Data Analytics experience and to transition into the world of cloud-based data solutions. Let's connect and collaborate on innovative data projects! 
#DataEngineering #CloudTechnology</details></td>\n</tr>\n<tr>\n<td>Sam Hatley</td>\n<td><a href=\"https://github.com/sam-hatley/real-estate-data\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/samhatley/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/sam-hatley\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Evan Hofmeister</td>\n<td><a href=\"https://github.com/EvanHofmeister/Housing-Wealth-Pipeline\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/evanhofmeister/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/EvanHofmeister\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Barys Kazarkin</td>\n<td><a href=\"https://github.com/KazarkinBarys/Data_Engineering_Zoomcamp_Project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/barys-kazarkin-b9904b203/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/KazarkinBarys\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Joshua Ati</td>\n<td><a href=\"https://github.com/joshuaati/DE_airline_pipeline\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/joshua-ati-460750110/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/joshuaati\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Oleg Agapov</td>\n<td><a href=\"https://github.com/oleg-agapov/de-zoomcamp-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/oagapov/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/oleg-agapov/\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://twitter.com/oleg_agapov_\">Twitter</a></li>\n<li><a href=\"https://olegagapov.com/\">Website</a></li>\n</ul></details></td>\n</tr>\n<tr>\n<td>Mikhail Kuklin</td>\n<td><a href=\"https://github.com/MikhailKuklin/data-pipeline-COVID19-monitoring\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/mikhail-kuklin-194a9544/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/MikhailKuklin\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://mikhailkuklin.wordpress.com\">Personal webpage</a></li>\n</ul></details></td>\n</tr>\n<tr>\n<td>Emmanuel Letremble</td>\n<td><a href=\"https://github.com/Valkea/DE_bootcamp_project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/letremble\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Valkea\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://valkea.github.io\">Portfolio</a></li>\n</ul>\n\n> Thanks to the DataTalks.Club for completing my Full Stack & Machine Learning skill sets with some extra DE knowledge.</details></td>\n</tr>\n<tr>\n<td>Victor Kuang</td>\n<td><a href=\"https://github.com/vykuang/toronto-service-calls-2023\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/vykuang/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/vykuang\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Antonis Angelakis</td>\n<td><a href=\"https://github.com/angeanto/dezoomcamp-project-youtube\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/antonios-angelakis-249899101\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/angeanto\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Christian Ruiz</td>\n<td></td>\n<td></td>\n<td></td>\n</tr>\n<tr>\n<td>Alex Pilugin</td>\n<td><a href=\"https://github.com/skipper-com/dtc_de_course_project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/alexander-pilugin/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/skipper-com?tab=repositories\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Ahmad Rizky</td>\n<td><a href=\"https://linktr.ee/ahmdxrzky\">Project</a></td>\n<td> <a href=\"https://linkedin.com/in/ahmdxrzky\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/ahmdxrzky\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Juan Francisco Hernandez Hernandez </td>\n<td><a href=\"https://github.com/JuanPacoHernandez/TelecommDescriptive-Analysis\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/juan-paco-hernandez/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/JuanPacoHernandez\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\n\n> Thanks to Data Talks Club, it was amazing learning for me as a Career changer.</details></td>\n</tr>\n<tr>\n<td>Iurii Chernigin</td>\n<td><a href=\"https://github.com/iurii-chernigin/audio-streaming-data-platform\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/iurii-chernigin/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/iurii-chernigin\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Franklyne Kibet</td>\n<td></td>\n<td></td>\n<td></td>\n</tr>\n<tr>\n<td>Federico Zambelli</td>\n<td><a 
href=\"https://github.com/wtfzambo/subreddit-analytics\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/fzambo/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/wtfzambo\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Marilina Orihuela</td>\n<td><a href=\"https://github.com/mary435/MLA_Dashboard\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/marilina-orihuela/?locale=en_US\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/mary435\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Alejandro R. 
Mármol Ruiz</td>\n<td><a href=\"https://github.com/marmola90/dezoomcampam\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/alejandro-marmol-81a998167/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/marmola90\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Daniel Takeshi</td>\n<td><a href=\"https://github.com/danietakeshi/de-zoomcamp-2023/tree/main/project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/daniel-takeshi\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/danietakeshi\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Xia He-Bleinagel</td>\n<td><a href=\"https://github.com/Data-Think-2021/DE-Final-Project-CO2\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/xia-he-bleinagel-51773585/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Data-Think-2021\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://xiahe-bleinagel.com/\">Personal website</a></li>\n</ul></details></td>\n</tr>\n<tr>\n<td>Thorsten Foltz</td>\n<td></td>\n<td> <a href=\"https://www.linkedin.com/in/thorsten-foltz-a91481127/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" 
/></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Danh Vo</td>\n<td><a href=\"https://github.com/datavadoz/eu-airbnb\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/0798a811b\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/datavadoz\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Joseph Ologunja</td>\n<td><a href=\"https://github.com/Joseun/data-engineering-zoomcamp/tree/main/cohorts/2023/week_7_project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/josephologunja/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Joseun\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Roman Zabolotin</td>\n<td><a href=\"https://github.com/rzabolotin/de_zoomcamp_2023_project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/rzabolotin/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/rzabolotin\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Aditya Gupta </td>\n<td><a href=\"https://github.com/itsadityagupta/yelposphere\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/itsadityagupta\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/itsadityagupta\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://peerlist.io/itsadityagupta\">Portfolio</a></li>\n</ul></details></td>\n</tr>\n<tr>\n<td>Vladimir Bugaevskii</td>\n<td><a href=\"https://github.com/vbugaevskii/de-zoomcamp-cycling-2023\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/vbugaevskii/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/vbugaevskii\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Fozan Talat</td>\n<td><a href=\"https://github.com/Fozan-Talat/divvy-bikeshare-de-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/fozan-talat/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Fozan-Talat\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Alain Boisvert</td>\n<td><a href=\"https://github.com/boisalai/twitter-dashboard\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/alain-boisvert-98b058156/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/boisalai\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>reneboy garcia</td>\n<td><a href=\"https://github.com/reneboygarcia/capstone_project_mongodb.git\">Project</a></td>\n<td> <a 
href=\"http://www.linkedin.com/in/eboygarcia\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/reneboygarcia\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\n\n> \"Success is not always about the grand achievements; it's about the small victories that accumulate over time.\" - Unknown</details></td>\n</tr>\n<tr>\n<td>Svetlana Kononova</td>\n<td></td>\n<td></td>\n<td></td>\n</tr>\n<tr>\n<td>Dmitrii Nikolaev</td>\n<td></td>\n<td> <a href=\"https://www.linkedin.com/in/dnnikolaev/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/melvinru\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://t.me/melvinru\">DN Telegram</a></li>\n</ul></details></td>\n</tr>\n<tr>\n<td>Francis Romio</td>\n<td><a href=\"https://github.com/romiof/brazil-weather\">Project</a></td>\n<td> <a href=\"https://br.linkedin.com/in/francisromio\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/romiof\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Saul Acevedo</td>\n<td><a href=\"https://github.com/seacevedo/Solana-Pipeline\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/saul-acevedo-739b17122\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/seacevedo\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Alina Li</td>\n<td><a href=\"https://github.com/alinali87/de-zoomcamp-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/alinali87/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Alexander Eryuzhev</td>\n<td><a href=\"https://github.com/aeryuzhev/de-zoomcamp-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/alexander-eryuzhev/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Paul Nwosu</td>\n<td><a href=\"https://github.com/paulonye/Cloudrunjobs\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/nwosu-paul-1b7b2218b/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/paulonye\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li>https://medium.com/@nwosupaul141/serverless-deployment-of-a-prefect-data-pipeline-on-google-cloud-run-8c48765f2480</li>\n</ul></details></td>\n</tr>\n<tr>\n<td>Param mirani </td>\n<td><a href=\"https://github.com/Param-29/stock-data-pipeline\">Project</a></td>\n<td> <a href=\"https://in.linkedin.com/in/param-mirani\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" 
height=\"16em\" /></a> <a href=\"https://github.com/Param-29\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Oscar Garcia - ozkary</td>\n<td><a href=\"https://github.com/ozkary/data-engineering-mta-turnstile/\">Project</a></td>\n<td> <a href=\"https://github.com/ozkary\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://twitter.com/ozkary\">Twitter</a>  * <a href=\"https://www.youtube.com/channel/UCpaqmBQr8YE6ikLXXyt8D7g\">You Tube</a> * <a href=\"https://www.ozkary.com\">blog</a></li>\n</ul></details></td>\n</tr>\n<tr>\n<td>Hector Torres</td>\n<td><a href=\"https://github.com/hdt94/dtc-de-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/hdt94/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/hdt94/\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://twitter.com/hdt94\">Twitter @hdt94</a></li>\n</ul>\n\n> Currently looking for a position as data engineer</details></td>\n</tr>\n<tr>\n<td>Dewi Nurfitri Oktaviani</td>\n<td><a href=\"https://github.com/oktavianidewi/github-data-pipeline\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/dewi-nurfitri-oktaviani-6b450b22/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/oktavianidewi\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://medium.com/@oktavianidewi\">medium</a></li>\n</ul></details></td>\n</tr>\n<tr>\n<td>Ryno Marx</td>\n<td></td>\n<td> <a href=\"https://www.linkedin.com/in/ryno-m-402a58120\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Hidir Cem Altun</td>\n<td></td>\n<td> <a href=\"https://www.linkedin.com/in/hidir-cem-altun-914aaa65/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/HCA97\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Francis Mark Cayco</td>\n<td><a href=\"https://github.com/PeteCastle/League-of-Legends-Analytics\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/francis-mark-cayco-33511a190/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/PeteCastle\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Adrian Baumann</td>\n<td><a href=\"https://github.com/adrian-baumann/dwd-temp-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/adrianbaumann/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/adrian-baumann\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Vladislav Garist</td>\n<td><a href=\"https://github.com/garistvlad/data-engineering-zoomcamp/tree/main/week-7\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/vgarist/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/garistvlad\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Gerald Ooi</td>\n<td></td>\n<td> <a href=\"https://www.linkedin.com/in/geraldooi/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Roman</td>\n<td></td>\n<td> <a href=\"https://www.linkedin.com/in/roman-yakovlev-86b2b4130\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/romanyakovlev\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Aleksandr Krasnov</td>\n<td></td>\n<td> <a href=\"https://www.linkedin.com/in/aleksandr-krasnov/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://www.linkedin.com/in/aleksandr-krasnov/\">Open to work</a></li>\n</ul></details></td>\n</tr>\n<tr>\n<td>Jaesung Ryu</td>\n<td><a href=\"https://github.com/Haebuk/GHArchive-Data-Pipeline-Project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/jaesungryu\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Haebuk\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>António Damião Rodrigues</td>\n<td><a href=\"https://github.com/adamiaonr/de-zoomcamp-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/adamiaonrod/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/adamiaonr\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Alicia Escontrela</td>\n<td><a href=\"https://github.com/aliescont/dezoomcamp-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/alicia-escontrela/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/aliescont\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Chalermdej Lematavekul</td>\n<td><a href=\"https://github.com/Chalermdej-l/Final_Project_FredETE\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/chalermdej-l/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Chalermdej-l?tab=repositories\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\n\n> Thank you so much for the course. 
Learn so many thing from here.</details></td>\n</tr>\n<tr>\n<td>Muhammed Jimoh</td>\n<td><a href=\"https://github.com/Manny-97/DE-ZOOMCAMP-PROJECT\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/%F0%9F%91%A8%F0%9F%8F%BE%E2%80%8D%F0%9F%92%BB-muhammed-jimoh-45120a14a/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Manny-97\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Bartosz Skłodowski</td>\n<td><a href=\"https://github.com/bartoszsklodowski/de_zoomcamp_project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/bartosz-sk%C5%82odowski/?locale=en_US\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/bartoszsklodowski\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Daniel Rigney</td>\n<td><a href=\"https://github.com/danielyrigney/USDA-Data-Pipeline\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/daniel-rigney-data/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/danielyrigney\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Daniel Gheorghita</td>\n<td><a href=\"https://github.com/daniel-gheorghita/dezoomcamp/tree/main/7_project_Belgium_housing_market\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/daniel-gheorghita-4a59903a/\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/daniel-gheorghita\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Daniel Gheorghita</td>\n<td><a href=\"https://github.com/daniel-gheorghita/belgian_housing_buy_vs_rent\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/daniel-gheorghita-4a59903a/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/daniel-gheorghita\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Niel Kemp</td>\n<td></td>\n<td> <a href=\"https://www.linkedin.com/in/nielkemp/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Shahmir</td>\n<td><a href=\"https://github.com/Light2Dark/quality-of-life\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/shahmir-varqha\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Light2Dark\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\nLinks:\n\n<ul>\n<li><a href=\"https://smolwaffle.com\">Portfolio</a></li>\n</ul>\n\n> I've added a bunch of new features since the reviews! 
Check it out</details></td>\n</tr>\n<tr>\n<td>Matt Bertrand</td>\n<td><a href=\"https://github.com/mbertrand/eo-climate-pipeline\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/bertrandmatt/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/mbertrand\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Nikolay Galkov</td>\n<td><a href=\"https://github.com/ngalkov/DEZoomcamp_project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/nikolay-galkov/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/ngalkov\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Hiroko Sakai</td>\n<td><a href=\"https://github.com/hirobo/world-earthquake\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/hirokos/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/hirobo\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Rohit Joshi</td>\n<td><a href=\"https://github.com/Rohitjoshi07/FHVDataAnalysis\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/rohit-joshi09\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/RohitJoshi07\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Valerii Bazyrov</td>\n<td></td>\n<td> <a href=\"https://www.linkedin.com/in/lantenak/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/lantenak\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Juan Pablo Ricapito</td>\n<td><a href=\"https://github.com/EzicStar/BA-turnstiles-pipeline\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/juan-pablo-ricapito-112332186/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/EzicStar\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Ashraf Omara</td>\n<td><a href=\"https://github.com/AshrafOmara12/Ukraine-Conflict-Twitter-Data-Pipeline\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/ashraf-omara-48294a106/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/AshrafOmara12\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\n\n> I need to thank all of the data club community for this amazing contribution. 
</details></td>\n</tr>\n<tr>\n<td>Wasawat Boonyarittikit</td>\n<td><a href=\"https://github.com/ChungWasawat/dtc_de_project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/wasawat-boonyarittikit-b1698b179/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/ChungWasawat\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td></td>\n</tr>\n<tr>\n<td>Fedor Faizov</td>\n<td><a href=\"https://github.com/Fedrpi/de-zoomcamp-bandcamp-project\">Project</a></td>\n<td> <a href=\"https://www.linkedin.com/in/fedor-faizov-a75b32245/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Fedrpi\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n<td><details>\n<summary>More info</summary>\n\n\n\n> Absolutly amazing course <3 </details></td>\n\n</tr>\n</table>\n"
  },
  {
    "path": "cohorts/2023/project.md",
    "content": "## Course Project\n\nThe goal of this project is to apply everything we learned\nin this course and build an end-to-end data pipeline.\n\nYou will have two attempts to submit your project. If you don't have \ntime to submit your project by the end of attempt #1 (you started the \ncourse late, you have vacation plans, life/work got in the way, etc.)\nor you fail your first attempt, \nthen you will have a second chance to submit your project as attempt\n#2. \n\nThere are only two attempts.\n\nRemember that to pass the project, you must evaluate 3 peers. If you don't do that,\nyour project can't be considered complete.\n\nTo find the projects assigned to you, use the peer review assignments link \nand find your hash in the first column. You will see three rows: you need to evaluate \neach of these projects. For each project, you need to submit the form once,\nso in total, you will make three submissions. \n\n\n### Submitting\n\n#### Project Attempt #1\n\nProject:\n\n* Form: https://forms.gle/zTJiVYSmCgsENj6y8\n* Deadline: 10 April, 22:00 CET\n\nPeer reviewing:\n\n* Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vRYQ0A9C7AkRK-YPSFhqaRMmuPR97QPfl2PjI8n11l5jntc6YMHIJXVVS0GQNqAYIGwzyevyManDB08/pubhtml?gid=0&single=true) (\"project-01\" sheet)\n* Form: https://forms.gle/1bxmgR8yPwV359zb7\n* Deadline: 17 April, 22:00 CET\n\nProject feedback: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vQuMt9m1XlPrCACqnsFTXTV_KGiSnsl9UjL7kdTMsLJ8DLu3jNJlPzoUKG6baxc8APeEQ8RaSP1U2VX/pubhtml?gid=27207346&single=true) (\"project-01\" sheet)\n\n#### Project Attempt #2\n\nProject:\n\n* Form: https://forms.gle/gCXUSYBm1KgMKXVm8\n* Deadline: 4 May, 22:00 CET\n\nPeer reviewing:\n\n* Peer review assignments: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vRYQ0A9C7AkRK-YPSFhqaRMmuPR97QPfl2PjI8n11l5jntc6YMHIJXVVS0GQNqAYIGwzyevyManDB08/pubhtml?gid=303437788&single=true) (\"project-02\" sheet)\n* Form: 
https://forms.gle/2x5MT4xxczR8isy37\n* Deadline: 11 May, 22:00 CET\n\nProject feedback: [link](https://docs.google.com/spreadsheets/d/e/2PACX-1vQuMt9m1XlPrCACqnsFTXTV_KGiSnsl9UjL7kdTMsLJ8DLu3jNJlPzoUKG6baxc8APeEQ8RaSP1U2VX/pubhtml?gid=246029638&single=true)\n\n### Evaluation criteria\n\nSee [here](../../week_7_project/README.md)\n\n\n### Misc\n\nTo get the hash for your project, use this function to hash your email:\n\n```python\nfrom hashlib import sha1\n\ndef compute_hash(email):\n    # Lowercase the email first so the hash does not depend on letter case\n    return sha1(email.lower().encode('utf-8')).hexdigest()\n```\n\nOr use [this website](http://www.sha1-online.com/). \n"
  },
  {
    "path": "cohorts/2023/week_1_docker_sql/homework.md",
    "content": "## Week 1 Homework\n\nIn this homework we'll prepare the environment \nand practice with Docker and SQL\n\n\n## Question 1. Knowing docker tags\n\nRun the command to get information on Docker \n\n```docker --help```\n\nNow run the command to get help on the \"docker build\" command\n\nWhich tag has the following text? - *Write the image ID to the file* \n\n- `--imageid string`\n- `--iidfile string`\n- `--idimage string`\n- `--idfile string`\n\n\n## Question 2. Understanding docker first run \n\nRun docker with the python:3.9 image in an interactive mode and the entrypoint of bash.\nNow check the python modules that are installed ( use pip list). \nHow many python packages/modules are installed?\n\n- 1\n- 6\n- 3\n- 7\n\n# Prepare Postgres\n\nRun Postgres and load data as shown in the videos\nWe'll use the green taxi trips from January 2019:\n\n```wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz```\n\nYou will also need the dataset with zones:\n\n```wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv```\n\nDownload this data and put it into Postgres (with jupyter notebooks or with a pipeline)\n\n\n## Question 3. Count records \n\nHow many taxi trips were totally made on January 15?\n\nTip: started and finished on 2019-01-15. \n\nRemember that `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns are in the format timestamp (date and hour+min+sec) and not in date.\n\n- 20689\n- 20530\n- 17630\n- 21090\n\n## Question 4. Largest trip for each day\n\nWhich was the day with the largest trip distance\nUse the pick up time for your calculations.\n\n- 2019-01-18\n- 2019-01-28\n- 2019-01-15\n- 2019-01-10\n\n## Question 5. The number of passengers\n\nIn 2019-01-01 how many trips had 2 and 3 passengers?\n \n- 2: 1282 ; 3: 266\n- 2: 1532 ; 3: 126\n- 2: 1282 ; 3: 254\n- 2: 1282 ; 3: 274\n\n\n## Question 6. 
Largest tip\n\nFor the passengers picked up in the Astoria Zone which was the drop off zone that had the largest tip?\nWe want the name of the zone, not the id.\n\nNote: it's not a typo, it's `tip` , not `trip`\n\n- Central Park\n- Jamaica\n- South Ozone Park\n- Long Island City/Queens Plaza\n\n\n## Submitting the solutions\n\n* Form for submitting: [form](https://forms.gle/EjphSkR1b3nsdojv7)\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: 30 January (Monday), 22:00 CET\n\n\n## Solution\n\nSee here: https://www.youtube.com/watch?v=KIh_9tZiroA\n"
  },
  {
    "path": "cohorts/2023/week_1_terraform/homework.md",
    "content": "## Week 1 Homework\n\nIn this homework we'll prepare the environment by creating resources in GCP with Terraform.\n\nIn your VM on GCP install Terraform. Copy the files from the course repo\n[here](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup/1_terraform_gcp/terraform) to your VM.\n\nModify the files as necessary to create a GCP Bucket and Big Query Dataset.\n\n\n## Question 1. Creating Resources\n\nAfter updating the main.tf and variable.tf files run:\n\n```\nterraform apply\n```\n\nPaste the output of this command into the homework submission form.\n\n\n## Submitting the solutions\n\n* Form for submitting: [form](https://forms.gle/S57Xs3HL9nB3YTzj9)\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: 30 January (Monday), 22:00 CET\n\n"
  },
  {
    "path": "cohorts/2023/week_2_workflow_orchestration/README.md",
    "content": "## Week 2: Workflow Orchestration\n\nPython code from videos is linked [below](#code-repository).\n\nAlso, if you find the commands too small to view in Kalise's videos, here's the [transcript with code for the second Prefect video](https://github.com/discdiver/prefect-zoomcamp/tree/main/flows/01_start) and the [fifth Prefect video](https://github.com/discdiver/prefect-zoomcamp/tree/main/flows/03_deployments).\n\n### Data Lake (GCS)\n\n* What is a Data Lake\n* ELT vs. ETL\n* Alternatives to components (S3/HDFS, Redshift, Snowflake etc.)\n* [Video](https://www.youtube.com/watch?v=W3Zm6rjOq70&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n* [Slides](https://docs.google.com/presentation/d/1RkH-YhBz2apIjYZAxUz2Uks4Pt51-fVWVN9CcH9ckyY/edit?usp=sharing)\n\n\n### 1. Introduction to Workflow orchestration\n\n* What is orchestration?\n* Workflow orchestrators vs. other types of orchestrators\n* Core features of a workflow orchestration tool\n* Different types of workflow orchestration tools that currently exist \n\n:movie_camera: [Video](https://www.youtube.com/watch?v=8oLs6pzHp68&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n\n### 2. Introduction to Prefect concepts\n\n* What is Prefect?\n* Installing Prefect\n* Prefect flow\n* Creating an ETL\n* Prefect task\n* Blocks and collections\n* Orion UI\n\n:movie_camera: [Video](https://www.youtube.com/watch?v=cdtN6dhp708&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n### 3. ETL with GCP & Prefect\n\n* Flow 1: Putting data to Google Cloud Storage \n\n:movie_camera: [Video](https://www.youtube.com/watch?v=W-rMz_2GwqQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n\n### 4. From Google Cloud Storage to Big Query\n\n* Flow 2: From GCS to BigQuery\n\n:movie_camera: [Video](https://www.youtube.com/watch?v=Cx5jt-V5sgE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n### 5. 
Parametrizing Flow & Deployments \n\n* Parametrizing the script from your flow\n* Parameter validation with Pydantic\n* Creating a deployment locally\n* Setting up Prefect Agent\n* Running the flow\n* Notifications\n\n:movie_camera: [Video](https://www.youtube.com/watch?v=QrDxPjX10iw&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n### 6. Schedules & Docker Storage with Infrastructure\n\n* Scheduling a deployment\n* Flow code storage\n* Running tasks in Docker\n\n:movie_camera: [Video](https://www.youtube.com/watch?v=psNSzqTsi-s&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n### 7. Prefect Cloud and Additional Resources \n\n\n* Using Prefect Cloud instead of local Prefect\n* Workspaces\n* Running flows on GCP\n\n:movie_camera: [Video](https://www.youtube.com/watch?v=gGC23ZK7lr8&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n* [Prefect docs](https://docs.prefect.io/)\n* [Prefect Discourse](https://discourse.prefect.io/)\n* [Prefect Cloud](https://app.prefect.cloud/)\n* [Prefect Slack](https://prefect-community.slack.com)\n\n### Code repository\n\n[Code from videos](https://github.com/discdiver/prefect-zoomcamp) (with a few minor enhancements)\n\n### Homework \nHomework can be found [here](./homework.md).\n\n## Community notes\n\nDid you take notes? 
You can share them here.\n\n* [Blog by Marcos Torregrosa (Prefect)](https://www.n4gash.com/2023/data-engineering-zoomcamp-semana-2/)\n* [Notes from Victor Padilha](https://github.com/padilha/de-zoomcamp/tree/master/week2)\n* [Notes by Alain Boisvert](https://github.com/boisalai/de-zoomcamp-2023/blob/main/week2.md)\n* [Notes by Candace Williams](https://github.com/teacherc/de_zoomcamp_candace2023/blob/main/week_2/week2_notes.md)\n* [Notes from Xia He-Bleinagel](https://xiahe-bleinagel.com/2023/02/week-2-data-engineering-zoomcamp-notes-prefect/)\n* [Notes from froukje](https://github.com/froukje/de-zoomcamp/blob/main/week_2_workflow_orchestration/notes/notes_week_02.md)\n* [Notes from Balaji](https://github.com/Balajirvp/DE-Zoomcamp/blob/main/Week%202/Detailed%20Week%202%20Notes.ipynb)\n* More on [Pandas vs SQL, Prefect capabilities, and testing your data](https://medium.com/@verazabeida/zoomcamp-2023-week-3-7f27bb8c483f), by Vera\n* Add your notes here (above this line)\n"
  },
  {
    "path": "cohorts/2023/week_2_workflow_orchestration/homework.md",
    "content": "## Week 2 Homework\n\nThe goal of this homework is to familiarise users with workflow orchestration and observation. \n\n\n## Question 1. Load January 2020 data\n\nUsing the `etl_web_to_gcs.py` flow that loads taxi data into GCS as a guide, create a flow that loads the green taxi CSV dataset for January 2020 into GCS and run it. Look at the logs to find out how many rows the dataset has.\n\nHow many rows does that dataset have?\n\n* 447,770\n* 766,792\n* 299,234\n* 822,132\n\n\n## Question 2. Scheduling with Cron\n\nCron is a common scheduling specification for workflows. \n\nUsing the flow in `etl_web_to_gcs.py`, create a deployment to run on the first of every month at 5am UTC. What’s the cron schedule for that?\n\n- `0 5 1 * *`\n- `0 0 5 1 *`\n- `5 * 1 0 *`\n- `* * 5 1 0`\n\n\n## Question 3. Loading data to BigQuery \n\nUsing `etl_gcs_to_bq.py` as a starting point, modify the script for extracting data from GCS and loading it into BigQuery. This new script should not fill or remove rows with missing values. (The script is really just doing the E and L parts of ETL).\n\nThe main flow should print the total number of rows processed by the script. Set the flow decorator to log the print statement.\n\nParametrize the entrypoint flow to accept a list of months, a year, and a taxi color. \n\nMake any other necessary changes to the code for it to function as required.\n\nCreate a deployment for this flow to run in a local subprocess with local flow code storage (the defaults).\n\nMake sure you have the parquet data files for Yellow taxi data for Feb. 2019 and March 2019 loaded in GCS. Run your deployment to append this data to your BiqQuery table. How many rows did your flow code process?\n\n- 14,851,920\n- 12,282,990\n- 27,235,753\n- 11,338,483\n\n\n\n## Question 4. Github Storage Block\n\nUsing the `web_to_gcs` script from the videos as a guide, you want to store your flow code in a GitHub repository for collaboration with your team. 
Prefect can look in the GitHub repo to find your flow code and read it. Create a GitHub storage block from the UI or in Python code and use that in your Deployment instead of storing your flow code locally or baking your flow code into a Docker image. \n\nNote that you will have to push your code to GitHub, Prefect will not push it for you.\n\nRun your deployment in a local subprocess (the default if you don’t specify an infrastructure). Use the Green taxi data for the month of November 2020.\n\nHow many rows were processed by the script?\n\n- 88,019\n- 192,297\n- 88,605\n- 190,225\n\n\n\n## Question 5. Email or Slack notifications\n\nQ5. It’s often helpful to be notified when something with your dataflow doesn’t work as planned. Choose one of the options below for creating email or slack notifications.\n\nThe hosted Prefect Cloud lets you avoid running your own server and has Automations that allow you to get notifications when certain events occur or don’t occur. \n\nCreate a free forever Prefect Cloud account at app.prefect.cloud and connect your workspace to it following the steps in the UI when you sign up. \n\nSet up an Automation that will send yourself an email when a flow run completes. Run the deployment used in Q4 for the Green taxi data for April 2019. Check your email to see the notification.\n\nAlternatively, use a Prefect Cloud Automation or a self-hosted Orion server Notification to get notifications in a Slack workspace via an incoming webhook. \n\nJoin my temporary Slack workspace with [this link](https://join.slack.com/t/temp-notify/shared_invite/zt-1odklt4wh-hH~b89HN8MjMrPGEaOlxIw). 400 people can use this link and it expires in 90 days. \n\nIn the Prefect Cloud UI create an [Automation](https://docs.prefect.io/ui/automations) or in the Prefect Orion UI create a [Notification](https://docs.prefect.io/ui/notifications/) to send a Slack message when a flow run enters a Completed state. 
Here is the Webhook URL to use: https://hooks.slack.com/services/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\n\nTest the functionality.\n\nAlternatively, you can grab the webhook URL from your own Slack workspace and Slack App that you create. \n\n\nHow many rows were processed by the script?\n\n- `125,268`\n- `377,922`\n- `728,390`\n- `514,392`\n\n\n## Question 6. Secrets\n\nPrefect Secret blocks provide secure, encrypted storage in the database and obfuscation in the UI. Create a secret block in the UI that stores a fake 10-digit password to connect to a third-party service. Once you’ve created your block in the UI, how many characters are shown as asterisks (*) on the next page of the UI?\n\n- 5\n- 6\n- 8\n- 10\n\n\n## Submitting the solutions\n\n* Form for submitting: https://forms.gle/PY8mBEGXJ1RvmTM97\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: 8 February (Wednesday), 22:00 CET\n\n\n## Solution\n\n* Video: https://youtu.be/L04lvYqNlc0\n* Code: https://github.com/discdiver/prefect-zoomcamp/tree/main/flows/04_homework\n"
  },
  {
    "path": "cohorts/2023/week_3_data_warehouse/homework.md",
    "content": "## Week 3 Homework\n<b><u>Important Note:</b></u> <p>You can load the data however you would like, but keep the files in .GZ Format. \nIf you are using orchestration such as Airflow or Prefect do not load the data into Big Query using the orchestrator.</br> \nStop with loading the files into a bucket. </br></br>\n<u>NOTE:</u> You can use the CSV option for the GZ files when creating an External Table</br>\n\n<b>SETUP:</b></br>\nCreate an external table using the fhv 2019 data. </br>\nCreate a table in BQ using the fhv 2019 data (do not partition or cluster this table). </br>\nData can be found here: https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv </p>\n\n## Question 1:\nWhat is the count for fhv vehicle records for year 2019?\n- 65,623,481\n- 43,244,696\n- 22,978,333\n- 13,942,414\n\n## Question 2:\nWrite a query to count the distinct number of affiliated_base_number for the entire dataset on both the tables.</br> \nWhat is the estimated amount of data that will be read when this query is executed on the External Table and the Table?\n\n- 25.2 MB for the External Table and 100.87MB for the BQ Table\n- 225.82 MB for the External Table and 47.60MB for the BQ Table\n- 0 MB for the External Table and 0MB for the BQ Table\n- 0 MB for the External Table and 317.94MB for the BQ Table \n\n\n## Question 3:\nHow many records have both a blank (null) PUlocationID and DOlocationID in the entire dataset?\n- 717,748\n- 1,215,687\n- 5\n- 20,332\n\n## Question 4:\nWhat is the best strategy to optimize the table if query always filter by pickup_datetime and order by affiliated_base_number?\n- Cluster on pickup_datetime Cluster on affiliated_base_number\n- Partition by pickup_datetime Cluster on affiliated_base_number\n- Partition by pickup_datetime Partition by affiliated_base_number\n- Partition by affiliated_base_number Cluster on pickup_datetime\n\n## Question 5:\nImplement the optimized solution you chose for question 4. 
Write a query to retrieve the distinct affiliated_base_number between pickup_datetime 2019/03/01 and 2019/03/31 (inclusive).</br> \nUse the BQ table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 4 and note the estimated bytes processed. What are these values? Choose the answer which most closely matches.\n- 12.82 MB for non-partitioned table and 647.87 MB for the partitioned table\n- 647.87 MB for non-partitioned table and 23.06 MB for the partitioned table\n- 582.63 MB for non-partitioned table and 0 MB for the partitioned table\n- 646.25 MB for non-partitioned table and 646.25 MB for the partitioned table\n\n\n## Question 6: \nWhere is the data stored in the External Table you created?\n\n- Big Query\n- GCP Bucket\n- Container Registry\n- Big Table\n\n\n## Question 7:\nIt is best practice in Big Query to always cluster your data:\n- True\n- False\n\n\n## (Not required) Question 8:\nA better format to store these files may be parquet. Create a data pipeline to download the gzip files and convert them into parquet. Upload the files to your GCP Bucket and create an External and BQ Table. \n\n\nNote: Column types for all files used in an External Table must have the same datatype. While an External Table may be created and shown in the side panel in Big Query, this will need to be validated by running a count query on the External Table to check if any errors occur. \n \n## Submitting the solutions\n\n* Form for submitting: https://forms.gle/rLdvQW2igsAT73HTA\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: 13 February (Monday), 22:00 CET\n\n\n## Solution\n\nSolution: https://www.youtube.com/watch?v=j8r2OigKBWE\n"
  },
  {
    "path": "cohorts/2023/week_4_analytics_engineering/homework.md",
    "content": "## Week 4 Homework \n\nIn this homework, we'll use the models developed during the week 4 videos and enhance the already presented dbt project using the already loaded Taxi data for fhv vehicles for year 2019 in our DWH.\n\nThis means that in this homework we use the following data [Datasets list](https://github.com/DataTalksClub/nyc-tlc-data/)\n* Yellow taxi data - Years 2019 and 2020\n* Green taxi data - Years 2019 and 2020 \n* fhv data - Year 2019. \n\nWe will use the data loaded for:\n\n* Building a source table: `stg_fhv_tripdata`\n* Building a fact table: `fact_fhv_trips`\n* Create a dashboard \n\nIf you don't have access to GCP, you can do this locally using the ingested data from your Postgres database\ninstead. If you have access to GCP, you don't need to do it for local Postgres -\nonly if you want to.\n\n> **Note**: if your answer doesn't match exactly, select the closest option \n\n### Question 1: \n\n**What is the count of records in the model fact_trips after running all models with the test run variable disabled and filtering for 2019 and 2020 data only (pickup datetime)?** \n\nYou'll need to have completed the [\"Build the first dbt models\"](https://www.youtube.com/watch?v=UVI30Vxzd6c) video and have been able to run the models via the CLI. \nYou should find the views and models for querying in your DWH.\n\n- 41648442\n- 51648442\n- 61648442\n- 71648442\n\n\n### Question 2: \n\n**What is the distribution between service type filtering by years 2019 and 2020 data as done in the videos?**\n\nYou will need to complete \"Visualising the data\" videos, either using [google data studio](https://www.youtube.com/watch?v=39nLTs74A3E) or [metabase](https://www.youtube.com/watch?v=BnLkrA7a6gM). 
\n\n- 89.9/10.1\n- 94/6\n- 76.3/23.7\n- 99.1/0.9\n\n\n\n### Question 3: \n\n**What is the count of records in the model stg_fhv_tripdata after running all models with the test run variable disabled (:false)?**  \n\nCreate a staging model for the fhv data for 2019 and do not add a deduplication step. Run it via the CLI without limits (is_test_run: false).\nFilter records with pickup time in year 2019.\n\n- 33244696\n- 43244696\n- 53244696\n- 63244696\n\n\n### Question 4: \n\n**What is the count of records in the model fact_fhv_trips after running all dependencies with the test run variable disabled (:false)?**  \n\nCreate a core model for the stg_fhv_tripdata joining with dim_zones.\nSimilar to what we've done in fact_trips, keep only records with known pickup and dropoff locations entries for pickup and dropoff locations. \nRun it via the CLI without limits (is_test_run: false) and filter records with pickup time in year 2019.\n\n- 12998722\n- 22998722\n- 32998722\n- 42998722\n\n### Question 5: \n\n**What is the month with the biggest amount of rides after building a tile for the fact_fhv_trips table?**\n\nCreate a dashboard with some tiles that you find interesting to explore the data. One tile should show the amount of trips per month, as done in the videos for fact_trips, based on the fact_fhv_trips table.\n\n- March\n- April\n- January\n- December\n\n\n\n## Submitting the solutions\n\n* Form for submitting: https://forms.gle/6A94GPutZJTuT5Y16\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: 25 February (Saturday), 22:00 CET\n\n\n## Solution\n\n* Video: https://www.youtube.com/watch?v=I_K0lNu9WQw&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW\n* Answers:\n  * Question 1: 61648442,\n  * Question 2: 89.9/10.1\n  * Question 3: 43244696\n  * Question 4: 22998722\n  * Question 5: January\n"
  },
  {
    "path": "cohorts/2023/week_5_batch_processing/homework.md",
    "content": "## Week 5 Homework \n\nIn this homework we'll put what we learned about Spark in practice.\n\nFor this homework we will be using the FHVHV 2021-06 data found here. [FHVHV Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-06.csv.gz )\n\n\n### Question 1: \n\n**Install Spark and PySpark** \n\n- Install Spark\n- Run PySpark\n- Create a local spark session\n- Execute spark.version.\n\nWhat's the output?\n- 3.3.2\n- 2.1.4\n- 1.2.3\n- 5.4\n</br></br>\n\n\n### Question 2: \n\n**HVFHW June 2021**\n\nRead it with Spark using the same schema as we did in the lessons.</br> \nWe will use this dataset for all the remaining questions.</br>\nRepartition it to 12 partitions and save it to parquet.</br>\nWhat is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? Select the answer which most closely matches.</br>\n\n\n- 2MB\n- 24MB\n- 100MB\n- 250MB\n</br></br>\n\n\n### Question 3: \n\n**Count records**  \n\nHow many taxi trips were there on June 15?</br></br>\nConsider only trips that started on June 15.</br>\n\n- 308,164\n- 12,856\n- 452,470\n- 50,982\n</br></br>\n\n\n### Question 4: \n\n**Longest trip for each day**  \n\nNow calculate the duration for each trip.</br>\nHow long was the longest trip in Hours?</br>\n\n- 66.87 Hours\n- 243.44 Hours\n- 7.68 Hours\n- 3.32 Hours\n</br></br>\n\n### Question 5: \n\n**User Interface**\n\n Spark’s User Interface which shows application's dashboard runs on which local port?</br>\n\n- 80\n- 443\n- 4040\n- 8080\n</br></br>\n\n\n### Question 6: \n\n**Most frequent pickup location zone**\n\nLoad the zone lookup data into a temp view in Spark</br>\n[Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv)</br>\n\nUsing the zone lookup data and the fhvhv June 2021 data, what is the name of the most frequent pickup location zone?</br>\n\n- East Chelsea\n- Astoria\n- Union Sq\n- Crown Heights 
North\n</br></br>\n\n\n\n\n## Submitting the solutions\n\n* Form for submitting: https://forms.gle/EcSvDs6vp64gcGuD8\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: 06 March (Monday), 22:00 CET\n\n\n## Solution\n\n* Video: https://www.youtube.com/watch?v=ldoDIT32pJs\n* Answers:\n  * Question 1: 3.3.2\n  * Question 2: 24MB\n  * Question 3: 452,470\n  * Question 4: 66.87 Hours\n  * Question 5: 4040\n  * Question 6: Crown Heights North\n"
  },
  {
    "path": "cohorts/2023/week_6_stream_processing/client.properties",
    "content": "# Required connection configs for Kafka producer, consumer, and admin\nbootstrap.servers=<CONFLUENT CLOUD KAFKA BROKER>:9092\nsecurity.protocol=SASL_SSL\nsasl.mechanisms=PLAIN\nsasl.username=<CONFLUENT CLOUD API USER NAME>\nsasl.password=<CONFLUENT CLOUD API PASSWORD>\n\n# Best practice for higher availability in librdkafka clients prior to 1.7\nsession.timeout.ms=45000"
  },
  {
    "path": "cohorts/2023/week_6_stream_processing/homework.md",
    "content": "## Week 6 Homework \n\nIn this homework, there will be two sections, the first session focus on theoretical questions related to Kafka \nand streaming concepts and the second session asks to create a small streaming application using preferred \nprogramming language (Python or Java).\n\n### Question 1: \n\n**Please select the statements that are correct**\n\n- Kafka Node is responsible to store topics [x]\n- Zookeeper is removed from Kafka cluster starting from version 4.0 [x]\n- Retention configuration ensures the messages not get lost over specific period of time. [x]\n- Group-Id ensures the messages are distributed to associated consumers [x]\n\n\n### Question 2: \n\n**Please select the Kafka concepts that support reliability and availability**\n\n- Topic Replication [x]\n- Topic Partioning\n- Consumer Group Id\n- Ack All [x]\n\n\n\n### Question 3: \n\n**Please select the Kafka concepts that support scaling**  \n\n- Topic Replication\n- Topic Paritioning [x]\n- Consumer Group Id [x]\n- Ack All\n\n\n### Question 4: \n\n**Please select the attributes that are good candidates for partitioning key. 
\nConsider the cardinality of the field you have selected and the scaling aspects of your application**  \n\n- payment_type [x]\n- vendor_id [x]\n- passenger_count\n- total_amount\n- tpep_pickup_datetime\n- tpep_dropoff_datetime\n\n\n### Question 5: \n\n**Which of the configurations below must be provided for a Kafka consumer but are not needed for a Kafka producer?**\n\n- Deserializer Configuration [x]\n- Topics Subscription [x]\n- Bootstrap Server \n- Group-Id [x]\n- Offset [x]\n- Cluster Key and Cluster-Secret\n\n\n### Question 6:\n\nPlease implement a streaming application that finds the popularity of PUlocationID across the green and fhv trip datasets.\nPlease use the datasets [fhv_tripdata_2019-01.csv.gz](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv) \nand [green_tripdata_2019-01.csv.gz](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green)\n\nPS: If you run into memory-related issues, you can use a smaller portion of these two datasets; \nit is not necessary to find the exact numbers.\n\nYour code should include the following:\n1. A producer that reads the CSV files and publishes rides to the corresponding Kafka topics (such as rides_green, rides_fhv)\n2. A PySpark streaming application that reads the two Kafka topics,\n   writes both of them to the topic rides_all, and applies aggregations to find the most popular pickup location.\n\n   \n## Submitting the solutions\n\n* Form for submitting: https://forms.gle/rK7268U92mHJBpmW7\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: 13 March (Monday), 22:00 CET\n\n\n## Solution\n\nWe will publish the solution here after the deadline.\n\nFor Question 6, make sure to:\n\n1) Download fhv_tripdata_2019-01.csv and green_tripdata_2019-01.csv to resources/fhv_tripdata \nand resources/green_tripdata respectively. PS: you need to unzip the compressed files.\n\n2) Update the client.properties settings using your Confluent Cloud API keys and cluster. 
\n3) Create the topics (all_rides, fhv_taxi_rides, green_taxi_rides) in the Confluent Cloud UI.\n\n4) Run the producers for the two datasets\n```\npython3 producer_confluent.py --type green\npython3 producer_confluent.py --type fhv\n```\n\n5) Run the PySpark streaming application\n```\n./spark-submit.sh streaming_confluent.py\n```\n\n\n"
  },
  {
    "path": "cohorts/2023/week_6_stream_processing/producer_confluent.py",
    "content": "from confluent_kafka import Producer\n\nimport argparse\nimport csv\nfrom typing import Dict\nfrom time import sleep\n\nfrom settings import CONFLUENT_CLOUD_CONFIG, \\\n    GREEN_TAXI_TOPIC, FHV_TAXI_TOPIC, \\\n    GREEN_TRIP_DATA_PATH, FHV_TRIP_DATA_PATH\n\n\nclass RideCSVProducer:\n    def __init__(self, probs: Dict, ride_type: str):\n\n        self.producer = Producer(**probs)\n        self.ride_type = ride_type\n\n    def parse_row(self, row):\n        if self.ride_type == 'green':\n            record = f'{row[5]}, {row[6]}'  # PULocationID, DOLocationID\n            key = str(row[0])  # vendor_id\n        elif self.ride_type == 'fhv':\n            record = f'{row[3]}, {row[4]}'  # PULocationID, DOLocationID,\n            key = str(row[0])  # dispatching_base_num\n        return key, record\n\n    def read_records(self, resource_path: str):\n        records, ride_keys = [], []\n        with open(resource_path, 'r') as f:\n            reader = csv.reader(f)\n            header = next(reader)  # skip the header\n            for row in reader:\n                key, record = self.parse_row(row)\n                ride_keys.append(key)\n                records.append(record)\n        return zip(ride_keys, records)\n\n    def publish(self, records: [str, str], topic: str):\n        for key_value in records:\n            key, value = key_value\n            try:\n                self.producer.poll(0)\n                self.producer.produce(topic=topic, key=key, value=value)\n                print(f\"Producing record for <key: {key}, value:{value}>\")\n            except KeyboardInterrupt:\n                break\n            except BufferError as bfer:\n                self.producer.poll(0.1)\n            except Exception as e:\n                print(f\"Exception while producing record - {value}: {e}\")\n\n        self.producer.flush()\n        sleep(10)\n\n\nif __name__ == \"__main__\":\n    parser = argparse.ArgumentParser(description='Kafka Consumer')\n  
  parser.add_argument('--type', type=str, default='green', choices=['green', 'fhv'])  # reject unknown ride types early\n    args = parser.parse_args()\n\n    if args.type == 'green':\n        kafka_topic = GREEN_TAXI_TOPIC\n        data_path = GREEN_TRIP_DATA_PATH\n    elif args.type == 'fhv':\n        kafka_topic = FHV_TAXI_TOPIC\n        data_path = FHV_TRIP_DATA_PATH\n\n    producer = RideCSVProducer(ride_type=args.type, probs=CONFLUENT_CLOUD_CONFIG)\n    ride_records = producer.read_records(resource_path=data_path)\n    producer.publish(records=ride_records, topic=kafka_topic)\n"
  },
  {
    "path": "cohorts/2023/week_6_stream_processing/settings.py",
    "content": "import pyspark.sql.types as T\n\nGREEN_TRIP_DATA_PATH = './resources/green_tripdata/green_tripdata_2019-01.csv'\nFHV_TRIP_DATA_PATH = './resources/fhv_tripdata/fhv_tripdata_2019-01.csv'\nBOOTSTRAP_SERVERS = 'localhost:9092'\n\nRIDES_TOPIC = 'all_rides'\nFHV_TAXI_TOPIC = 'fhv_taxi_rides'\nGREEN_TAXI_TOPIC = 'green_taxi_rides'\n\nALL_RIDE_SCHEMA = T.StructType(\n    [T.StructField(\"PUlocationID\", T.StringType()),\n     T.StructField(\"DOlocationID\", T.StringType()),\n     ])\n\n\ndef read_ccloud_config(config_file):\n    conf = {}\n    with open(config_file) as fh:\n        for line in fh:\n            line = line.strip()\n            if len(line) != 0 and line[0] != \"#\":\n                parameter, value = line.strip().split('=', 1)\n                conf[parameter] = value.strip()\n    return conf\n\n\nCONFLUENT_CLOUD_CONFIG = read_ccloud_config('client_original.properties')\n"
  },
  {
    "path": "cohorts/2023/week_6_stream_processing/spark-submit.sh",
    "content": "# Submit Python code to SparkMaster\n\nif [ $# -lt 1 ]\nthen\n\techo \"Usage: $0 <pyspark-job.py> [ executor-memory ]\"\n\techo \"(specify memory in string format such as \\\"512M\\\" or \\\"2G\\\")\"\n\texit 1\nfi\nPYTHON_JOB=$1\n\nif [ -z $2 ]\nthen\n\tEXEC_MEM=\"1G\"\nelse\n\tEXEC_MEM=$2\nfi\nspark-submit --master spark://localhost:7077 --num-executors 2 \\\n\t           --executor-memory $EXEC_MEM --executor-cores 1 \\\n             --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,org.apache.spark:spark-avro_2.12:3.3.1,org.apache.spark:spark-streaming-kafka-0-10_2.12:3.3.1 \\\n             $PYTHON_JOB"
  },
  {
    "path": "cohorts/2023/week_6_stream_processing/streaming_confluent.py",
    "content": "from pyspark.sql import SparkSession\nimport pyspark.sql.functions as F\n\nfrom settings import CONFLUENT_CLOUD_CONFIG, GREEN_TAXI_TOPIC, FHV_TAXI_TOPIC, RIDES_TOPIC, ALL_RIDE_SCHEMA\n\n\ndef read_from_kafka(consume_topic: str):\n    # Spark Streaming DataFrame, connect to Kafka topic served at host in bootrap.servers option\n\n    df_stream = spark \\\n        .readStream \\\n        .format(\"kafka\") \\\n        .option(\"kafka.bootstrap.servers\", CONFLUENT_CLOUD_CONFIG['bootstrap.servers']) \\\n        .option(\"subscribe\", consume_topic) \\\n        .option(\"startingOffsets\", \"earliest\") \\\n        .option(\"checkpointLocation\", \"checkpoint\") \\\n        .option(\"kafka.security.protocol\", \"SASL_SSL\") \\\n        .option(\"kafka.sasl.mechanism\", \"PLAIN\") \\\n        .option(\"kafka.sasl.jaas.config\",\n                f\"\"\"org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{CONFLUENT_CLOUD_CONFIG['sasl.username']}\" password=\"{CONFLUENT_CLOUD_CONFIG['sasl.password']}\";\"\"\") \\\n        .option(\"failOnDataLoss\", False) \\\n        .load()\n\n    return df_stream\n\n\ndef parse_rides(df, schema):\n    \"\"\" take a Spark Streaming df and parse value col based on <schema>, return streaming df cols in schema \"\"\"\n    assert df.isStreaming is True, \"DataFrame doesn't receive streaming data\"\n\n    df = df.selectExpr(\"CAST(key AS STRING)\", \"CAST(value AS STRING)\")\n    # split attributes to nested array in one Column\n    col = F.split(df['value'], ', ')\n\n    # expand col to multiple top-level columns\n    for idx, field in enumerate(schema):\n        df = df.withColumn(field.name, col.getItem(idx).cast(field.dataType))\n\n    df = df.na.drop()\n\n    df.printSchema()\n\n    return df.select([field.name for field in schema])\n\n\ndef sink_console(df, output_mode: str = 'complete', processing_time: str = '5 seconds'):\n    query = df.writeStream \\\n        .outputMode(output_mode) \\\n       
 .trigger(processingTime=processing_time) \\\n        .format(\"console\") \\\n        .option(\"truncate\", False) \\\n        .start()\n    # awaitTermination() returns None, so call it separately and return the StreamingQuery itself\n    query.awaitTermination()\n    return query  # pyspark.sql.streaming.StreamingQuery\n\n\ndef sink_kafka(df, topic, output_mode: str = 'complete'):\n    query = df.writeStream \\\n        .format(\"kafka\") \\\n        .option(\"kafka.bootstrap.servers\", CONFLUENT_CLOUD_CONFIG['bootstrap.servers']) \\\n        .outputMode(output_mode) \\\n        .option(\"topic\", topic) \\\n        .option(\"checkpointLocation\", \"checkpoint\") \\\n        .option(\"kafka.security.protocol\", \"SASL_SSL\") \\\n        .option(\"kafka.sasl.mechanism\", \"PLAIN\") \\\n        .option(\"kafka.sasl.jaas.config\",\n                f\"\"\"org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{CONFLUENT_CLOUD_CONFIG['sasl.username']}\" password=\"{CONFLUENT_CLOUD_CONFIG['sasl.password']}\";\"\"\") \\\n        .option(\"failOnDataLoss\", False) \\\n        .start()\n    return query\n\n\ndef op_groupby(df, column_names):\n    df_aggregation = df.groupBy(column_names).count()\n    return df_aggregation\n\n\nif __name__ == \"__main__\":\n    spark = SparkSession.builder.appName('streaming-homework').getOrCreate()\n    spark.sparkContext.setLogLevel('WARN')\n\n    # Step 1: Consume GREEN_TAXI_TOPIC and FHV_TAXI_TOPIC\n    df_green_rides = read_from_kafka(consume_topic=GREEN_TAXI_TOPIC)\n    df_fhv_rides = read_from_kafka(consume_topic=FHV_TAXI_TOPIC)\n\n    # Step 2: Publish green and fhv rides to RIDES_TOPIC\n    kafka_sink_green_query = sink_kafka(df=df_green_rides, topic=RIDES_TOPIC, output_mode='append')\n    kafka_sink_fhv_query = sink_kafka(df=df_fhv_rides, topic=RIDES_TOPIC, output_mode='append')\n\n    # Step 3: Read RIDES_TOPIC and parse it in ALL_RIDE_SCHEMA\n    df_all_rides = read_from_kafka(consume_topic=RIDES_TOPIC)\n    df_all_rides = parse_rides(df_all_rides, ALL_RIDE_SCHEMA)\n\n    # Step 4: Apply 
Aggregation on the all_rides\n    df_pu_location_count = op_groupby(df_all_rides, ['PULocationID'])\n    df_pu_location_count = df_pu_location_count.sort(F.col('count').desc())\n\n    # Step 5: Sink Aggregation Streams to Console\n    console_sink_pu_location = sink_console(df_pu_location_count, output_mode='complete')\n"
  },
  {
    "path": "cohorts/2023/workshops/piperider.md",
    "content": "\n## Workshop: Maximizing Confidence in Your Data Model Changes with dbt and PipeRider\n\nTo learn how to use PipeRider together with dbt for detecting changes in model and data, sign up for a workshop\n\n- Video: https://www.youtube.com/watch?v=O-tyUOQccSs\n- Repository: https://github.com/InfuseAI/taxi_rides_ny_duckdb\n\n\n## Homework\n\nThe following questions follow on from the original Week 4 homework, and so use the same data as required by those questions:\n\nhttps://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2023/week_4_analytics_engineering/homework.md\n\nYellow taxi data - Years 2019 and 2020\nGreen taxi data - Years 2019 and 2020\nfhv data - Year 2019.\n\n### Question 1:\n\nWhat is the distribution between vendor id filtering by years 2019 and 2020 data?\n\nYou will need to run PipeRider and check the report\n\n* 70.1/29.6/0.5\n* 60.1/39.5/0.4\n* 90.2/9.5/0.3\n* 80.1/19.7/0.2\n\n### Question 2:\n\nWhat is the composition of total amount (positive/zero/negative) filtering by years 2019 and 2020 data?\n\nYou will need to run PipeRider and check the report\n\n\n* 51.4M/15K/48.6K\n* 21.4M/5K/248.6K\n* 61.4M/25K/148.6K\n* 81.4M/35K/14.6K\n\n### Question 3:\n\nWhat is the numeric statistics (average/standard deviation/min/max/sum) of trip distances filtering by years 2019 and 2020 data?\n\nYou will need to run PipeRider and check the report\n\n\n* 1.95/35.43/0/16.3K/151.5M\n* 3.95/25.43/23.88/267.3K/281.5M\n* 5.95/75.43/-63.88/67.3K/81.5M\n* 2.95/35.43/-23.88/167.3K/181.5M\n\n\n\n## Submitting the solutions\n\n* Form for submitting: https://forms.gle/WyLQHBu1DNwNTfqe8\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: 20 March, 22:00 CET\n\n\n## Solution\n\nVideo: https://www.youtube.com/watch?v=inNrUys7W8U&list=PL3MmuxUbc_hJjEePXIdE-LVUx_1ZZjYGW\n"
  },
  {
    "path": "cohorts/2024/01-docker-terraform/homework.md",
    "content": "## Module 1 Homework\n\nATTENTION: At the very end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or shell commands), please include these directly in the README file of your repository.\n\n## Docker & SQL\n\nIn this homework we'll prepare the environment \nand practice with Docker and SQL\n\n\n## Question 1. Knowing docker tags\n\nRun the command to get information on Docker \n\n```docker --help```\n\nNow run the command to get help on the \"docker build\" command:\n\n```docker build --help```\n\nDo the same for \"docker run\".\n\nWhich tag has the following text? - *Automatically remove the container when it exits* \n\n- `--delete`\n- `--rc`\n- `--rmc`\n- `--rm`\n\n\n## Question 2. Understanding docker first run \n\nRun docker with the python:3.9 image in an interactive mode and the entrypoint of bash.\nNow check the python modules that are installed ( use ```pip list``` ). \n\nWhat is version of the package *wheel* ?\n\n- 0.42.0\n- 1.0.0\n- 23.0.1\n- 58.1.0\n\n\n# Prepare Postgres\n\nRun Postgres and load data as shown in the videos\nWe'll use the green taxi trips from September 2019:\n\n```wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-09.csv.gz```\n\nYou will also need the dataset with zones:\n\n```wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv```\n\nDownload this data and put it into Postgres (with jupyter notebooks or with a pipeline)\n\n\n## Question 3. Count records \n\nHow many taxi trips were totally made on September 18th 2019?\n\nTip: started and finished on 2019-09-18. 
\n\nRemember that the `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns are timestamps (date and hour+min+sec), not dates.\n\n- 15767\n- 15612\n- 15859\n- 89009\n\n## Question 4. Longest trip for each day\n\nWhich was the pick up day with the longest trip distance?\nUse the pick up time for your calculations.\n\nTip: For every trip on a single day, we only care about the trip with the longest distance. \n\n- 2019-09-18\n- 2019-09-16\n- 2019-09-26\n- 2019-09-21\n\n\n## Question 5. Three biggest pick up Boroughs\n\nConsider trips with `lpep_pickup_datetime` on '2019-09-18', ignoring those where the Borough is Unknown.\n\nWhich 3 pick up Boroughs had the highest total_amount?\n \n- \"Brooklyn\" \"Manhattan\" \"Queens\"\n- \"Bronx\" \"Brooklyn\" \"Manhattan\"\n- \"Bronx\" \"Manhattan\" \"Queens\" \n- \"Brooklyn\" \"Queens\" \"Staten Island\"\n\n\n## Question 6. Largest tip\n\nFor the passengers picked up in September 2019 in the zone named Astoria, which drop off zone had the largest tip?\nWe want the name of the zone, not the id.\n\nNote: it's not a typo, it's `tip`, not `trip`\n\n- Central Park\n- Jamaica\n- JFK Airport\n- Long Island City/Queens Plaza\n\n\n\n## Terraform\n\nIn this section of the homework, we'll prepare the environment by creating resources in GCP with Terraform.\n\nIn your VM on GCP/Laptop/GitHub Codespace install Terraform. \nCopy the files from the course repo\n[here](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/01-docker-terraform/1_terraform_gcp/terraform) to your VM/Laptop/GitHub Codespace.\n\nModify the files as necessary to create a GCP Bucket and Big Query Dataset.\n\n\n## Question 7. 
Creating Resources\n\nAfter updating the main.tf and variable.tf files run:\n\n```\nterraform apply\n```\n\nPaste the output of this command into the homework submission form.\n\n\n## Submitting the solutions\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw01\n* You can submit your homework multiple times. In this case, only the last submission will be used. \n\nDeadline: 29 January, 23:00 CET\n"
  },
  {
    "path": "cohorts/2024/01-docker-terraform/solutions.md",
    "content": "## Question 1. Knowing docker tags\n```\n❯ docker run --help | grep \"Automatically remove\"\n--rm                               Automatically remove\n```\n\n- `|` pipe operator redirects the previous command output as an input to the command after the operator\n- `docker run --help` -----> outputs `|` ---------> inputs to `grep \"Automatically remove\"`\n- `grep` allows you to search through text\n  \nAnswer: `--rm`\n\n\n## Question 2. Understanding docker first run\n\n- Run python:3.9 image with `docker run -it python:3.9 bash`\n- Since you opened with `it` tag, the container will be interactive`\n- Since the docker command ends with `bash`, the entrypoint into the container will be `bash`\n\n```shell\nroot@root: docker run -it python:3.9 bash\nroot@b67c6949422a:/# pip list\nPackage    Version\n---------- -------\npip        23.0.1\nsetuptools 58.1.0\nwheel      0.45.1\n```\n\nSince it's been a while since 2024 cohort, your wheel version might differ and may not be in the options provided.\n\nAnswer: For me it was `0.45.1`\n\n\n## Question 3. Count records\n\n- Trips that started and finished on 2019-09-18\n- Format timestamp(date and hour+min+sec) to date.\n\n```sql\nSELECT COUNT(*) FROM \"csv_green_tripdata_2019_09\"\nWHERE DATE(\"lpep_pickup_datetime\") = '2019-09-18' AND\n      DATE(\"lpep_dropoff_datetime\") = '2019-09-18';\n```\n```\n+-------+\n| count |\n|-------|\n| 15612 |\n+-------+\n```\n\nAnswer: `15612`\n\n\n## Question 4. Longest trip for each day\n```sql\nSELECT\n    DATE(\"lpep_pickup_datetime\") AS \"pickup_date\",\n    MAX(\"trip_distance\") AS \"longest_trip\"\nFROM\n    \"csv_green_tripdata_2019_09\"\nGROUP BY\n    DATE(\"lpep_pickup_datetime\")\nORDER BY\n    \"longest_trip\" DESC\nLIMIT 1;\n```\n```\n+-------------+--------------+\n| pickup_date | longest_trip |\n|-------------+--------------|\n| 2019-09-26  | 341.64       |\n+-------------+--------------+\n```\n\nAnswer: `2019-09-26`\n\n\n## Question 5. 
Three biggest pickup zones\n```sql\nSELECT\n    \"zone\".\"Zone\",\n    ROUND(SUM((\"total_amount\")::NUMERIC), 3) AS \"total_amount\"\nFROM\n    \"csv_green_tripdata_2019_09\"\nINNER JOIN\n    \"zone\" ON \"csv_green_tripdata_2019_09\".\"PULocationID\" = \"zone\".\"LocationID\"\nWHERE\n    DATE(\"lpep_pickup_datetime\") = '2019-09-18'\nGROUP BY\n    \"zone\".\"Zone\"\nORDER BY\n    \"total_amount\" DESC\nLIMIT 3;\n```\n```\n+---------------------+--------------+\n| Zone                | total_amount |\n|---------------------+--------------|\n| East Harlem North   | 17893.060    |\n| East Harlem South   | 17152.160    |\n| Morningside Heights | 11259.680    |\n+---------------------+--------------+\n```\n\nAnswer: `East Harlem North, East Harlem South, Morningside Heights`\n\n\n## Question 6. Largest tip\n```sql\nSELECT\n    puz.\"Zone\" AS pickup_zone,\n    doz.\"Zone\" AS dropoff_zone,\n    g.\"tip_amount\"\nFROM\n    \"csv_green_tripdata_2019_09\" g\nINNER JOIN\n    \"zone\" puz ON g.\"PULocationID\" = puz.\"LocationID\"\nINNER JOIN\n    \"zone\" doz ON g.\"DOLocationID\" = doz.\"LocationID\"\nWHERE\n    puz.\"Zone\" = 'Astoria'\nORDER BY\n    g.\"tip_amount\" DESC\nLIMIT 1;\n```\n\n```\n+-------------+--------------+------------+\n| pickup_zone | dropoff_zone | tip_amount |\n|-------------+--------------+------------|\n| Astoria     | JFK Airport  | 62.31      |\n+-------------+--------------+------------+\n```\n\nAnswer: `JFK Airport`\n\n\n## Question 7. Terraform Workflow\n\n> self-explanatory\n"
  },
  {
    "path": "cohorts/2024/02-workflow-orchestration/README.md",
    "content": "> [!NOTE]  \n>If you're looking for Airflow videos from the 2022 edition, check the [2022 cohort folder](../cohorts/2022/week_2_data_ingestion/). \n>\n>If you're looking for Prefect videos from the 2023 edition, check the [2023 cohort folder](../cohorts/2023/week_2_data_ingestion/).\n\n# Week 2: Workflow Orchestration\n\nWelcome to Week 2 of the Data Engineering Zoomcamp! 🚀😤 This week, we'll be covering workflow orchestration with Mage.\n\nMage is an open-source, hybrid framework for transforming and integrating data. ✨\n\nThis week, you'll learn how to use the Mage platform to author and share _magical_ data pipelines. This will all be covered in the course, but if you'd like to learn a bit more about Mage, check out our docs [here](https://docs.mage.ai/introduction/overview). \n\n* [2.2.1 - 📯 Intro to Orchestration](#221----intro-to-orchestration)\n* [2.2.2 - 🧙‍♂️ Intro to Mage](#222---%EF%B8%8F-intro-to-mage)\n* [2.2.3 - 🐘 ETL: API to Postgres](#223----etl-api-to-postgres)\n* [2.2.4 - 🤓 ETL: API to GCS](#224----etl-api-to-gcs)\n* [2.2.5 - 🔍 ETL: GCS to BigQuery](#225----etl-gcs-to-bigquery)\n* [2.2.6 - 👨‍💻 Parameterized Execution](#226----parameterized-execution)\n* [2.2.7 - 🤖 Deployment (Optional)](#227----deployment-optional)\n* [2.2.8 - 🗒️ Homework](#228---️-homework)\n* [2.2.9 - 👣 Next Steps](#229----next-steps)\n\n## 📕 Course Resources\n\n### 2.2.1 - 📯 Intro to Orchestration\n\nIn this section, we'll cover the basics of workflow orchestration. We'll discuss what it is, why it's important, and how it can be used to build data pipelines.\n\nVideos\n- 2.2.1a - What is Orchestration?\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/Li8-MWHhTbo)](https://youtu.be/Li8-MWHhTbo&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=17)\n\nResources\n- [Slides](https://docs.google.com/presentation/d/17zSxG5Z-tidmgY-9l7Al1cPmz4Slh4VPK6o2sryFYvw/)\n\n### 2.2.2 - 🧙‍♂️ Intro to Mage\n\nIn this section, we'll introduce the Mage platform. 
We'll cover what makes Mage different from other orchestrators, the fundamental concepts behind Mage, and how to get started. To cap it off, we'll spin Mage up via Docker 🐳 and run a simple pipeline.\n\nVideos\n- 2.2.2a - What is Mage?\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/AicKRcK3pa4)](https://youtu.be/AicKRcK3pa4&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=18)\n\n- 2.2.2b - Configuring Mage\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/tNiV7Wp08XE)](https://youtu.be/tNiV7Wp08XE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=19)\n\n- 2.2.2c - A Simple Pipeline\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/stI-gg4QBnI)](https://youtu.be/stI-gg4QBnI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=20)\n\nResources\n- [Getting Started Repo](https://github.com/mage-ai/mage-zoomcamp)\n- [Slides](https://docs.google.com/presentation/d/1y_5p3sxr6Xh1RqE6N8o2280gUzAdiic2hPhYUUD6l88/)\n\n### 2.2.3 - 🐘 ETL: API to Postgres\n\nHooray! Mage is up and running. Now, let's build a _real_ pipeline. In this section, we'll build a simple ETL pipeline that loads data from an API into a Postgres database. Our database will be built using Docker— it will be running locally, but it's the same as if it were running in the cloud.\n\nVideos\n- 2.2.3a - Configuring Postgres\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/pmhI-ezd3BE)](https://youtu.be/pmhI-ezd3BE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=21)\n\n- 2.2.3b - Writing an ETL Pipeline : API to postgres\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/Maidfe7oKLs)](https://youtu.be/Maidfe7oKLs&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=22)\n\n\n### 2.2.4 - 🤓 ETL: API to GCS\n\nOk, so we've written data _locally_ to a database, but what about the cloud? In this tutorial, we'll walk through the process of using Mage to extract, transform, and load data from an API to Google Cloud Storage (GCS). 
\n\nWe'll cover both writing _partitioned_ and _unpartitioned_ data to GCS and discuss _why_ you might want to do one over the other. Many data teams start with extracting data from a source and writing it to a data lake _before_ loading it to a structured data source, like a database.\n\nVideos\n- 2.2.4a - Configuring GCP\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/00LP360iYvE)](https://youtu.be/00LP360iYvE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=23)\n\n- 2.2.4b - Writing an ETL Pipeline : API to GCS\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/w0XmcASRUnc)](https://youtu.be/w0XmcASRUnc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=24)\n\nResources\n- [DTC Zoomcamp GCP Setup](../01-docker-terraform/1_terraform_gcp/2_gcp_overview.md)\n\n### 2.2.5 - 🔍 ETL: GCS to BigQuery\n\nNow that we've written data to GCS, let's load it into BigQuery. In this section, we'll walk through the process of using Mage to load our data from GCS to BigQuery. This closely mirrors a very common data engineering workflow: loading data from a data lake into a data warehouse.\n\nVideos\n- 2.2.5a - Writing an ETL Pipeline : GCS to BigQuery\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/JKp_uzM-XsM)](https://youtu.be/JKp_uzM-XsM&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=25)\n\n### 2.2.6 - 👨‍💻 Parameterized Execution\n\nBy now you're familiar with building pipelines, but what about adding parameters? In this video, we'll discuss some built-in runtime variables that exist in Mage and show you how to define your own! We'll also cover how to use these variables to parameterize your pipelines. 
Finally, we'll talk about what it means to *backfill* a pipeline and how to do it in Mage.\n\nVideos\n- 2.2.6a - Parameterized Execution\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/H0hWjWxB-rg)](https://youtu.be/H0hWjWxB-rg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=26)\n\n\n- 2.2.6b - Backfills\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/ZoeC6Ag5gQc)](https://youtu.be/ZoeC6Ag5gQc&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=27)\n\nResources\n- [Mage Variables Overview](https://docs.mage.ai/development/variables/overview)\n- [Mage Runtime Variables](https://docs.mage.ai/getting-started/runtime-variable)\n\n### 2.2.7 - 🤖 Deployment (Optional)\n\nIn this section, we'll cover deploying Mage using Terraform and Google Cloud. This section is optional— it's not *necessary* to learn Mage, but it might be helpful if you're interested in creating a fully deployed project. If you're using Mage in your final project, you'll need to deploy it to the cloud.\n\nVideos\n- 2.2.7a - Deployment Prerequisites\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/zAwAX5sxqsg)](https://youtu.be/zAwAX5sxqsg&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=28)\n\n- 2.2.7b - Google Cloud Permissions\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/O_H7DCmq2rA)](https://youtu.be/O_H7DCmq2rA&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=29)\n\n- 2.2.7c - Deploying to Google Cloud - Part 1\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/9A872B5hb_0)](https://youtu.be/9A872B5hb_0&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=30)\n\n- 2.2.7d - Deploying to Google Cloud - Part 2\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/0YExsb2HgLI)](https://youtu.be/0YExsb2HgLI&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=31)\n\nResources\n- [Installing Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli)\n- [Installing `gcloud` CLI](https://cloud.google.com/sdk/docs/install)\n- [Mage Terraform 
Templates](https://github.com/mage-ai/mage-ai-terraform-templates)\n\nAdditional Mage Guides\n- [Terraform](https://docs.mage.ai/production/deploying-to-cloud/using-terraform)\n- [Deploying to GCP with Terraform](https://docs.mage.ai/production/deploying-to-cloud/gcp/setup)\n\n### 2.2.8 - 🗒️ Homework \n\nWe've prepared a short exercise to test you on what you've learned this week. You can find the homework [here](../cohorts/2024/02-workflow-orchestration/homework.md). This follows closely from the contents of the course and shouldn't take more than an hour or two to complete. 😄\n\n### 2.2.9 - 👣 Next Steps\n\nCongratulations! You've completed Week 2 of the Data Engineering Zoomcamp. We hope you've enjoyed learning about Mage and that you're excited to use it in your final project. If you have any questions, feel free to reach out to us on Slack. Be sure to check out our \"Next Steps\" video for some inspiration for the rest of your journey 😄.\n\nVideos\n- 2.2.9 - Next Steps\n\n[![](https://markdown-videos-api.jorgenkh.no/youtube/uUtj7N0TleQ)](https://youtu.be/uUtj7N0TleQ&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=32)\n\nResources\n- [Slides](https://docs.google.com/presentation/d/1yN-e22VNwezmPfKrZkgXQVrX5owDb285I2HxHWgmAEQ/edit#slide=id.g262fb0d2905_0_12)\n\n### 📑 Additional Resources\n\n- [Mage Docs](https://docs.mage.ai/)\n- [Mage Guides](https://docs.mage.ai/guides)\n- [Mage Slack](https://www.mage.ai/chat)\n\n\n# Community notes\n\nDid you take notes? 
You can share them here:\n\n## 2024 notes\n\n* [2024 Videos transcripts week 2](https://drive.google.com/drive/folders/1yxT0uMMYKa6YOxanh91wGqmQUMS7yYW7?usp=sharing) by Maria Fisher\n* [Notes from Jonah Oliver](https://www.jonahboliver.com/blog/de-zc-w2)\n* [Notes from Linda](https://github.com/inner-outer-space/de-zoomcamp-2024/blob/main/2-workflow-orchestration/readme.md)\n* [Notes from Kirill](https://github.com/kirill505/data-engineering-zoomcamp/blob/main/02-workflow-orchestration/README.md)\n* [Notes from Zharko](https://www.zharconsulting.com/contents/data/data-engineering-bootcamp-2024/week-2-ingesting-data-with-mage/)\n* Add your notes above this line\n\n## 2023 notes\n\nSee [here](../cohorts/2023/week_2_workflow_orchestration#community-notes)\n\n\n## 2022 notes\n\nSee [here](../cohorts/2022/week_2_data_ingestion#community-notes)\n"
  },
  {
    "path": "cohorts/2024/02-workflow-orchestration/homework.md",
    "content": "## Module 2 Homework\n\nATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include these directly in the README file of your repository.\n\n> In case you don't get one option exactly, select the closest one \n\nFor the homework, we'll be working with the _green_ taxi dataset located here:\n\n`https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download`\n\nTo get a `wget`-able link, use this prefix (note that the link itself gives 404):\n\n`https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/`\n\n### Assignment\n\nThe goal will be to construct an ETL pipeline that loads the data, performs some transformations, and writes the data to a database (and Google Cloud!).\n\n- Create a new pipeline, call it `green_taxi_etl`\n- Add a data loader block and use Pandas to read data for the final quarter of 2020 (months `10`, `11`, `12`).\n  - You can use the same datatypes and date parsing methods shown in the course.\n  - `BONUS`: load the final three months using a for loop and `pd.concat`\n- Add a transformer block and perform the following:\n  - Remove rows where the passenger count is equal to 0 _and_ the trip distance is equal to zero.\n  - Create a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date.\n  - Rename columns in Camel Case to Snake Case, e.g. `VendorID` to `vendor_id`.\n  - Add three assertions:\n    - `vendor_id` is one of the existing values in the column (currently)\n    - `passenger_count` is greater than 0\n    - `trip_distance` is greater than 0\n- Using a Postgres data exporter (SQL or Python), write the dataset to a table called `green_taxi` in a schema `mage`. 
Replace the table if it already exists.\n- Write your data as Parquet files to a bucket in GCP, partitioned by `lpep_pickup_date`. Use the `pyarrow` library!\n- Schedule your pipeline to run daily at 5AM UTC.\n\n### Questions\n\n## Question 1. Data Loading\n\nOnce the dataset is loaded, what's the shape of the data?\n\n* 266,855 rows x 20 columns\n* 544,898 rows x 18 columns\n* 544,898 rows x 20 columns\n* 133,744 rows x 20 columns\n\n## Question 2. Data Transformation\n\nAfter filtering the dataset to rows where the passenger count is greater than 0 _and_ the trip distance is greater than zero, how many rows are left?\n\n* 544,897 rows\n* 266,855 rows\n* 139,370 rows\n* 266,856 rows\n\n## Question 3. Data Transformation\n\nWhich of the following creates a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date?\n\n* `data = data['lpep_pickup_datetime'].date`\n* `data('lpep_pickup_date') = data['lpep_pickup_datetime'].date`\n* `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt.date`\n* `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt().date()`\n\n## Question 4. Data Transformation\n\nWhat are the existing values of `VendorID` in the dataset?\n\n* 1, 2, or 3\n* 1 or 2\n* 1, 2, 3, 4\n* 1\n\n## Question 5. Data Transformation\n\nHow many columns need to be renamed to snake case?\n\n* 3\n* 6\n* 2\n* 4\n\n## Question 6. Data Exporting\n\nOnce exported, how many partitions (folders) are present in Google Cloud?\n\n* 96\n* 56\n* 67\n* 108\n\n## Submitting the solutions\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw2\n* Check the link above to see the due date\n  \n## Solution\n\nWill be added after the due date\n"
  },
  {
    "path": "cohorts/2024/03-data-warehouse/homework.md",
    "content": "## Module 3 Homework\n\nSolution: https://www.youtube.com/watch?v=8g_lRKaC9ro\n\nATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or shell commands), please include these directly in the README file of your repository.\n\n<b><u>Important Note:</b></u> <p> For this homework we will be using the 2022 Green Taxi Trip Record Parquet Files from the New York\nCity Taxi Data found here: </br> https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page </br>\nIf you are using orchestration such as Mage, Airflow or Prefect do not load the data into Big Query using the orchestrator.</br> \nStop with loading the files into a bucket. </br></br>\n<u>NOTE:</u> You will need to use the PARQUET option files when creating an External Table</br>\n\n<b>SETUP:</b></br>\nCreate an external table using the Green Taxi Trip Records Data for 2022. </br>\nCreate a table in BQ using the Green Taxi Trip Records for 2022 (do not partition or cluster this table). 
</br>\n</p>\n\n## Question 1:\nWhat is the count of records for the 2022 Green Taxi Data?\n- 65,623,481\n- 840,402\n- 1,936,423\n- 253,647\n\n## Question 2:\nWrite a query to count the number of distinct PULocationIDs for the entire dataset on both tables.</br> \nWhat is the estimated amount of data that will be read when this query is executed on the External Table and the Table?\n\n- 0 MB for the External Table and 6.41MB for the Materialized Table\n- 18.82 MB for the External Table and 47.60 MB for the Materialized Table\n- 0 MB for the External Table and 0MB for the Materialized Table\n- 2.14 MB for the External Table and 0MB for the Materialized Table\n\n\n## Question 3:\nHow many records have a fare_amount of 0?\n- 12,488\n- 128,219\n- 112\n- 1,622\n\n## Question 4:\nWhat is the best strategy to make an optimized table in Big Query if your query will always order the results by PUlocationID and filter based on lpep_pickup_datetime? (Create a new table with this strategy)\n- Cluster on lpep_pickup_datetime Partition by PUlocationID\n- Partition by lpep_pickup_datetime  Cluster on PUlocationID\n- Partition by lpep_pickup_datetime and Partition by PUlocationID\n- Cluster on lpep_pickup_datetime and Cluster on PUlocationID\n\n## Question 5:\nWrite a query to retrieve the distinct PULocationIDs with lpep_pickup_datetime between\n06/01/2022 and 06/30/2022 (inclusive)</br>\n\nUse the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 4 and note the estimated bytes processed. What are these values? 
</br>\n\nChoose the answer which most closely matches.</br> \n\n- 22.82 MB for non-partitioned table and 647.87 MB for the partitioned table\n- 12.82 MB for non-partitioned table and 1.12 MB for the partitioned table\n- 5.63 MB for non-partitioned table and 0 MB for the partitioned table\n- 10.31 MB for non-partitioned table and 10.31 MB for the partitioned table\n\n\n## Question 6: \nWhere is the data stored in the External Table you created?\n\n- Big Query\n- GCP Bucket\n- Big Table\n- Container Registry\n\n\n## Question 7:\nIt is best practice in Big Query to always cluster your data:\n- True\n- False\n\n\n## (Bonus: Not worth points) Question 8:\nNo Points: Write a `SELECT count(*)` query FROM the materialized table you created. How many bytes does it estimate will be read? Why?\n\n \n## Submitting the solutions\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw3\n\n\n"
  },
  {
    "path": "cohorts/2024/04-analytics-engineering/homework.md",
    "content": "## Module 4 Homework \n\nIn this homework, we'll use the models developed during the week 4 videos and enhance the already presented dbt project using the already loaded Taxi data for fhv vehicles for year 2019 in our DWH.\n\nThis means that in this homework we use the following data [Datasets list](https://github.com/DataTalksClub/nyc-tlc-data/)\n* Yellow taxi data - Years 2019 and 2020\n* Green taxi data - Years 2019 and 2020 \n* fhv data - Year 2019. \n\nWe will use the data loaded for:\n\n* Building a source table: `stg_fhv_tripdata`\n* Building a fact table: `fact_fhv_trips`\n* Create a dashboard \n\nIf you don't have access to GCP, you can do this locally using the ingested data from your Postgres database\ninstead. If you have access to GCP, you don't need to do it for local Postgres - only if you want to.\n\n> **Note**: if your answer doesn't match exactly, select the closest option \n\n### Question 1: \n\n**What happens when we execute dbt build --vars '{'is_test_run':'true'}'**\nYou'll need to have completed the [\"Build the first dbt models\"](https://www.youtube.com/watch?v=UVI30Vxzd6c) video. \n- It's the same as running *dbt build*\n- It applies a _limit 100_ to all of our models\n- It applies a _limit 100_ only to our staging models\n- Nothing\n\n### Question 2: \n\n**What is the code that our CI job will run? Where is this code coming from?**  \n\n- The code that has been merged into the main branch\n- The code that is behind the creation object on the dbt_cloud_pr_ schema\n- The code from any development branch that has been opened based on main\n- The code from the development branch we are requesting to merge to main\n\n\n### Question 3 (2 points)\n\n**What is the count of records in the model fact_fhv_trips after running all dependencies with the test run variable disabled (:false)?**  \nCreate a staging model for the fhv data, similar to the ones made for yellow and green data. 
Add an additional filter to keep only records with a pickup time in 2019.\nDo not add a deduplication step. Run these models without limits (is_test_run: false).\n\nCreate a core model similar to fact_trips, but selecting from stg_fhv_tripdata and joining with dim_zones.\nSimilar to what we've done in fact_trips, keep only records with known pickup and dropoff locations. \nRun the dbt model without limits (is_test_run: false).\n\n- 12998722\n- 22998722\n- 32998722\n- 42998722\n\n### Question 4 (2 points)\n\n**Which service had the most rides in July 2019, after building a tile for the fact_fhv_trips table alongside the fact_trips tile as seen in the videos?**\n\nCreate a dashboard with some tiles that you find interesting to explore the data. One tile should show the number of trips per month, as done in the videos for fact_trips, including the fact_fhv_trips data.\n\n- FHV\n- Green\n- Yellow\n- FHV and Green\n\n\n## Submitting the solutions\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw4\n\nDeadline: 22 February (Thursday), 22:00 CET\n\n\n## Solution\n\n* Video: https://youtu.be/3OPggh5Rca8\n* Answers:\n  * Question 1: It applies a _limit 100_ only to our staging models\n  * Question 2: The code from the development branch we are requesting to merge to main\n  * Question 3: 22998722\n  * Question 4: Yellow\n"
  },
  {
    "path": "cohorts/2024/05-batch/homework.md",
    "content": "## Module 5 Homework \n\nSolution: https://www.youtube.com/watch?v=YtddC7vJOgQ\n\nIn this homework we'll put what we learned about Spark in practice.\n\nFor this homework we will be using the FHV 2019-10 data found here. [FHV Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-10.csv.gz)\n\n### Question 1: \n\n**Install Spark and PySpark** \n\n- Install Spark\n- Run PySpark\n- Create a local spark session\n- Execute spark.version.\n\nWhat's the output?\n\n> [!NOTE]\n> To install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/setup/pyspark.md)\n\n### Question 2: \n\n**FHV October 2019**\n\nRead the October 2019 FHV into a Spark Dataframe with a schema as we did in the lessons.\n\nRepartition the Dataframe to 6 partitions and save it to parquet.\n\nWhat is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? Select the answer which most closely matches.\n\n- 1MB\n- 6MB\n- 25MB\n- 87MB\n\n\n\n### Question 3: \n\n**Count records** \n\nHow many taxi trips were there on the 15th of October?\n\nConsider only trips that started on the 15th of October.\n\n- 108,164\n- 12,856\n- 452,470\n- 62,610\n\n> [!IMPORTANT]\n> Be aware of columns order when defining schema\n\n### Question 4: \n\n**Longest trip for each day** \n\nWhat is the length of the longest trip in the dataset in hours?\n\n- 631,152.50 Hours\n- 243.44 Hours\n- 7.68 Hours\n- 3.32 Hours\n\n\n\n### Question 5: \n\n**User Interface**\n\nSpark’s User Interface which shows the application's dashboard runs on which local port?\n\n- 80\n- 443\n- 4040\n- 8080\n\n\n\n### Question 6: \n\n**Least frequent pickup location zone**\n\nLoad the zone lookup data into a temp view in Spark</br>\n[Zone Data](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv)\n\nUsing the zone lookup data and the FHV October 2019 data, what is the name of 
the LEAST frequent pickup location Zone?<br>\n\n- East Chelsea\n- Jamaica Bay\n- Union Sq\n- Crown Heights North\n\n\n## Submitting the solutions\n\n- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw5\n- Deadline: See the website\n"
  },
  {
    "path": "cohorts/2024/06-streaming/docker-compose.yml",
    "content": "version: '3.7'\nservices:\n  # Redpanda cluster\n  redpanda-1:\n    image: docker.redpanda.com/vectorized/redpanda:v22.3.5\n    container_name: redpanda-1\n    command:\n      - redpanda\n      - start\n      - --smp\n      - '1'\n      - --reserve-memory\n      - 0M\n      - --overprovisioned\n      - --node-id\n      - '1'\n      - --kafka-addr\n      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092\n      - --advertise-kafka-addr\n      - PLAINTEXT://redpanda-1:29092,OUTSIDE://localhost:9092\n      - --pandaproxy-addr\n      - PLAINTEXT://0.0.0.0:28082,OUTSIDE://0.0.0.0:8082\n      - --advertise-pandaproxy-addr\n      - PLAINTEXT://redpanda-1:28082,OUTSIDE://localhost:8082\n      - --rpc-addr\n      - 0.0.0.0:33145\n      - --advertise-rpc-addr\n      - redpanda-1:33145\n    ports:\n      # - 8081:8081\n      - 8082:8082\n      - 9092:9092\n      - 28082:28082\n      - 29092:29092"
  },
  {
    "path": "cohorts/2024/06-streaming/homework.md",
    "content": "## Module 6 Homework \n\nIn this homework, we're going to extend Module 5 Homework and learn about streaming with PySpark.\n\nInstead of Kafka, we will use Redpanda, which is a drop-in\nreplacement for Kafka. \n\nEnsure you have the following set up (you should already have it if you did the previous homework and the module):\n\n- Docker (see [module 1](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/01-docker-terraform))\n- PySpark (see [module 5](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/05-batch/setup))\n\nFor this homework we will be using the files from Module 5 homework:\n\n- Green 2019-10 data from [here](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz)\n\n\n\n## Start Redpanda\n\nLet's start Redpanda in a Docker container. \n\nThere's a `docker-compose.yml` file in the homework folder (taken from [here](https://github.com/redpanda-data-blog/2023-python-gsg/blob/main/docker-compose.yml))\n\nCopy this file to your homework directory and run\n\n```bash\ndocker-compose up\n```\n\n(Add `-d` if you want to run in detached mode)\n\n\n## Question 1: Redpanda version\n\nNow let's find out the version of Redpanda. \n\nFor that, check the output of the command `rpk help` _inside the container_. The name of the container is `redpanda-1`.\n\nFind out what you need to execute based on the `help` output.\n\nWhat's the version, based on the output of the command you executed? (copy the entire version)\n\n\n## Question 2. Creating a topic\n\nBefore we can send data to the Redpanda server, we\nneed to create a topic. We also do it with the `rpk`\ncommand we used previously for figuring out the version of \nRedpanda.\n\nRead the output of `help` and based on it, create a topic with the name `test-topic` \n\nWhat's the output of the command for creating a topic? Include the entire output in your answer.\n\n\n## Question 3. 
Connecting to the Kafka server\n\nWe need to make sure we can connect to the server, so that\nwe can later send some data to its topics.\n\nFirst, let's install the Kafka client library (it's up to you whether\nto use a separate virtual environment for that):\n\n```bash\npip install kafka-python\n```\n\nYou can start a Jupyter notebook in your solution folder or\ncreate a script.\n\nLet's try to connect to our server:\n\n```python\nimport json\nimport time \n\nfrom kafka import KafkaProducer\n\ndef json_serializer(data):\n    return json.dumps(data).encode('utf-8')\n\nserver = 'localhost:9092'\n\nproducer = KafkaProducer(\n    bootstrap_servers=[server],\n    value_serializer=json_serializer\n)\n\nproducer.bootstrap_connected()\n```\n\nProvided that you can connect to the server, what's the output\nof the last command?\n\n\n## Question 4. Sending data to the stream\n\nNow we're ready to send some test data:\n\n```python\nt0 = time.time()\n\ntopic_name = 'test-topic'\n\nfor i in range(10):\n    message = {'number': i}\n    producer.send(topic_name, value=message)\n    print(f\"Sent: {message}\")\n    time.sleep(0.05)\n\nproducer.flush()\n\nt1 = time.time()\nprint(f'took {(t1 - t0):.2f} seconds')\n```\n\nHow much time did it take? 
Where did it spend most of the time?\n\n* Sending the messages\n* Flushing\n* Both took approximately the same amount of time\n\n(Don't remove `time.sleep` when answering this question)\n\n\n## Reading data with `rpk`\n\nYou can see the messages that you send to the topic\nwith `rpk`:\n\n```bash\nrpk topic consume test-topic\n```\n\nRun the command above and send the messages one more time to \nsee them\n\n\n## Sending the taxi data\n\nNow let's send our actual data:\n\n* Read the green csv.gz file\n* We will only need these columns:\n  * `'lpep_pickup_datetime',`\n  * `'lpep_dropoff_datetime',`\n  * `'PULocationID',`\n  * `'DOLocationID',`\n  * `'passenger_count',`\n  * `'trip_distance',`\n  * `'tip_amount'`\n\nIterate over the records in the dataframe\n\n```python\nfor row in df_green.itertuples(index=False):\n    row_dict = {col: getattr(row, col) for col in row._fields}\n    print(row_dict)\n    break\n\n    # TODO implement sending the data here\n```\n\nNote: this way of iterating over the records is more efficient compared\nto `iterrows`\n\n\n## Question 5: Sending the Trip Data\n\n* Create a topic `green-trips` and send the data there\n* How much time in seconds did it take? (You can round it to a whole number)\n* Make sure you don't include sleeps in your code\n\n\n## Creating the PySpark consumer\n\nNow let's read the data with PySpark. 
\n\nSpark needs a library (jar) to be able to connect to Kafka, \nso we need to tell PySpark that it needs to use it:\n\n```python\nimport pyspark\nfrom pyspark.sql import SparkSession\n\npyspark_version = pyspark.__version__\nkafka_jar_package = f\"org.apache.spark:spark-sql-kafka-0-10_2.12:{pyspark_version}\"\n\nspark = SparkSession \\\n    .builder \\\n    .master(\"local[*]\") \\\n    .appName(\"GreenTripsConsumer\") \\\n    .config(\"spark.jars.packages\", kafka_jar_package) \\\n    .getOrCreate()\n```\n\nNow we can connect to the stream:\n\n```python\ngreen_stream = spark \\\n    .readStream \\\n    .format(\"kafka\") \\\n    .option(\"kafka.bootstrap.servers\", \"localhost:9092\") \\\n    .option(\"subscribe\", \"green-trips\") \\\n    .option(\"startingOffsets\", \"earliest\") \\\n    .load()\n```\n\nIn order to test that we can consume from the stream, \nlet's look at the first record there. \n\nIn Spark streaming, the stream is represented as a sequence of \nsmall batches, each batch being a small RDD (or a small dataframe).\n\nSo we can execute a function over each mini-batch.\nLet's run `take(1)` there to see what we have in the stream:\n\n```python\ndef peek(mini_batch, batch_id):\n    first_row = mini_batch.take(1)\n\n    if first_row:\n        print(first_row[0])\n\nquery = green_stream.writeStream.foreachBatch(peek).start()\n```\n\nYou should see a record like this:\n\n```\nRow(key=None, value=bytearray(b'{\"lpep_pickup_datetime\": \"2019-10-01 00:26:02\", \"lpep_dropoff_datetime\": \"2019-10-01 00:39:58\", \"PULocationID\": 112, \"DOLocationID\": 196, \"passenger_count\": 1.0, \"trip_distance\": 5.88, \"tip_amount\": 0.0}'), topic='green-trips', partition=0, offset=0, timestamp=datetime.datetime(2024, 3, 12, 22, 42, 9, 411000), timestampType=0)\n```\n\nNow let's stop the query, so it doesn't keep consuming messages\nfrom the stream:\n\n```python\nquery.stop()\n```\n\n## Question 6. 
Parsing the data\n\nThe data is JSON, but currently it's in binary format. We need\nto parse it and turn it into a streaming dataframe with proper\ncolumns.\n\nAs with batch PySpark, we define the schema:\n\n```python\nfrom pyspark.sql import types\n\nschema = types.StructType() \\\n    .add(\"lpep_pickup_datetime\", types.StringType()) \\\n    .add(\"lpep_dropoff_datetime\", types.StringType()) \\\n    .add(\"PULocationID\", types.IntegerType()) \\\n    .add(\"DOLocationID\", types.IntegerType()) \\\n    .add(\"passenger_count\", types.DoubleType()) \\\n    .add(\"trip_distance\", types.DoubleType()) \\\n    .add(\"tip_amount\", types.DoubleType())\n```\n\nAnd apply this schema:\n\n```python\nfrom pyspark.sql import functions as F\n\ngreen_stream = green_stream \\\n  .select(F.from_json(F.col(\"value\").cast('STRING'), schema).alias(\"data\")) \\\n  .select(\"data.*\")\n```\n\nHow does the record look after parsing? Copy the output. \n\n\n## Question 7: Most popular destination\n\nNow let's finally do some streaming analytics. We will\nsee what the most popular destination currently is, \nbased on our stream of data (which ideally we should \nhave sent with delays like we did in workshop 2)\n\n\nThis is how you can do it:\n\n* Add a column \"timestamp\" using the `current_timestamp` function\n* Group by:\n  * 5 minutes window based on the timestamp column (`F.window(F.col(\"timestamp\"), \"5 minutes\")`)\n  * `\"DOLocationID\"`\n* Order by count\n\nYou can print the output to the console using this \ncode\n\n```python\nquery = popular_destinations \\\n    .writeStream \\\n    .outputMode(\"complete\") \\\n    .format(\"console\") \\\n    .option(\"truncate\", \"false\") \\\n    .start()\n\nquery.awaitTermination()\n```\n\nWrite the most popular destination. Your answer should be *either* the zone ID or the zone name of this destination. 
(You will need to re-send the data for this to work)\n\n\n## Submitting the solutions\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/hw6\n\n\n## Solution\n\nWe will publish the solution here after the deadline.\n\n\n"
  },
  {
    "path": "cohorts/2024/README.md",
    "content": "## Data Engineering Zoomcamp 2024 Cohort\n\n* [Pre-launch Q&A stream](https://www.youtube.com/watch?v=91b8u9GmqB4)\n* [Launch stream with course overview](https://www.youtube.com/live/AtRhA-NfS24?si=5JzA_E8BmJjiLi8l)\n* [Deadline calendar](https://docs.google.com/spreadsheets/d/e/2PACX-1vQACMLuutV5rvXg5qICuJGL-yZqIV0FBD84CxPdC5eZHf8TfzB-CJT_3Mo7U7oGVTXmSihPgQxuuoku/pubhtml)\n* [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html)\n* Course Playlist: Only 2024 Live videos & homeworks (TODO)\n* [Public Leaderboard of Top-100 Participants](leaderboard.md)\n\n\n[**Module 1: Introduction & Prerequisites**](01-docker-terraform/)\n\n* [Homework](01-docker-terraform/homework.md)\n\n\n[**Module 2: Workflow Orchestration**](02-workflow-orchestration)\n\n* [Homework](02-workflow-orchestration/homework.md)\n* Office hours\n\n[**Workshop 1: Data Ingestion**](workshops/dlt.md)\n\n* Workshop with dlt\n* [Homework](workshops/dlt.md)\n\n\n[**Module 3: Data Warehouse**](03-data-warehouse)\n\n* [Homework](03-data-warehouse/homework.md)\n\n\n[**Module 4: Analytics Engineering**](04-analytics-engineering/)\n\n* [Homework](04-analytics-engineering/homework.md)\n\n\n[**Module 5: Batch processing**](05-batch/)\n\n* [Homework](05-batch/homework.md)\n\n\n[**Module 6: Stream Processing**](06-streaming)\n\n* [Homework](06-streaming/homework.md)\n\n\n[**Project**](project.md)\n\nMore information [here](project.md)\n"
  },
  {
    "path": "cohorts/2024/leaderboard.md",
    "content": "## Leaderboard \n\nThis is the top [100 leaderboard](https://courses.datatalks.club/de-zoomcamp-2024/leaderboard)\nof participants of Data Engineering Zoomcamp 2024 edition!\n\n<table>\n<tr>\n  <th>Name</th>\n  <th>Projects</th>\n  <th>Social</th>\n  <th>Comments</th>\n</tr>\n<tr>\n  <td>Ashraf Mohammad</td>\n  <td><a href=\"https://github.com/Ashraf1395/customer_retention_analytics\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a><a href=\"https://github.com/Ashraf1395/supply_chain_finance.git\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"www.linkedin.com/in/ashraf1395\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"www.github.com/Ashraf1395\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nReally Recommend this bootcamp , if you want to get hands on data engineering experience.     
My two Capstone project: www.github.com/Ashraf1395/supply_chain_finance, www.github.com/Ashraf1395/customer_retention_analytics\n</details></td>\n</tr>\n<tr>\n  <td>Jorge Vladimir Abrego Arevalo</td>\n  <td><a href=\"https://github.com/JorgeAbrego/weather_stream_project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a><a href=\"https://github.com/JorgeAbrego/capital_bikeshare_project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/jorge-abrego/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/JorgeAbrego\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Purnendu Shekhar Shukla</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Krishna Anand</td>\n  <td><a href=\"https://github.com/anandaiml19/DE_Zoomcamp_Project2/tree/main\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a><a href=\"https://github.com/anandaiml19/Data-Engineering-Zoomcamp-Project1\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/krishna-anand-v-g-70bba623/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/anandaiml19\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  
<td></td>\n</tr>\n<tr>\n  <td>Abhijit Chakraborty</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Hekmatullah Sajid</td>\n  <td><a href=\"https://github.com/hekmatullah-sajid/EcoEnergy-Germany\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/hekmatullah-sajid/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/hekmatullah-sajid\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Lottie Jane Pollard</td>\n  <td><a href=\"https://github.com/LottieJaneDev/usgs_earthquake_data_pipeline\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/lottiejanedev/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/LottieJaneDev/usgs_earthquake_data_pipeline\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>AviAnna</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Ketut Garjita</td>\n  <td><a href=\"https://github.com/garjita63/dezoomcamp2024-project1\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/ketutgarjitadba/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> 
<a href=\"https://github.com/garjita63\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nI would like to express my thanks and appreciation to the Data Talks Club for organizing this excellent Data Engineering Zoomcamp training. This made me valuable experience in deepening new knowledge for me even though previously I had mostly worked as a Database Administrator for various platform databases. Thank you also to the community (datatalks-club.slack.com), especially slack course-data-engineering, as well as other slack communities such as mageai.slack.com.\n</details></td>\n</tr>\n<tr>\n  <td>Diogo Costa</td>\n  <td><a href=\"https://github.com/techwithcosta/youtube-ai-analytics\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/costadms/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/techwithcosta\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nGreat course! 
Check out my YouTube channel: https://www.youtube.com/@TechWithCosta\n</details></td>\n</tr>\n<tr>\n  <td>Francisco Ortiz Tena</td>\n  <td><a href=\"https://github.com/FranciscoOrtizTena/de_zoomcamp_project_01/\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/francisco-ortiz-tena/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/FranciscoOrtizTena\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nIt is an awesome course!\n</details></td>\n</tr>\n<tr>\n  <td>Nevenka Lukic</td>\n  <td><a href=\"https://github.com/nenalukic/air-quality-project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/nevenka-lukic/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/nenalukic\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nThis DE Zoomcamp was fantastic learning and networking experiences. 
Many thanks to organizers and big recommendations to anyone!\n</details></td>\n</tr>\n<tr>\n  <td>Mukhammad Sofyan Rizka Akbar</td>\n  <td><a href=\"https://github.com/SofyanAkbar94/Project-DE-Zoomcamp-2024\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://id.linkedin.com/in/m-sofyan-r-a-aa00a4118\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/SofyanAkbar94/\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nThanks for providing this course, especially for Alexey and other Datatalk hosts and I hope I can join ML, ML Ops, and LLM Zoomcamp. See you soon :)\n</details></td>\n</tr>\n<tr>\n  <td>Mahmoud Mahdy Zaky</td>\n  <td><a href=\"https://github.com/MahmoudMahdy448/Football-Data-Analytics/tree/main\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/mahmoud-mahdy-zaky\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/MahmoudMahdy448\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Brilliant Pancake</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Jobert M. 
Gutierrez</td>\n  <td><a href=\"https://github.com/bizzaccelerator/Footballers-transfers-Insights.git\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"www.linkedin.com/in/jobertgutierrez\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/bizzaccelerator\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Olusegun Samson Ayeni</td>\n  <td><a href=\"https://github.com/iamraphson/IMDB-pipeline-project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a><a href=\"https://github.com/iamraphson/DE-2024-project-book-recommendation\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/iamraphson/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/iamraphson\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Lily Chau</td>\n  <td><a href=\"https://github.com/lilychau1/uk-power-analytics\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a><a href=\"https://github.com/lilychau1/uk-power-analytics/tree/main\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a 
href=\"www.linkedin.com/in/lilychau1\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/lilychau1\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nBig thank you to Alexey and all other speakers. This is one of the best online learning platforms I have ever come across.\n</details></td>\n</tr>\n<tr>\n  <td>Aleksandr Kolmakov</td>\n  <td><a href=\"https://github.com/Feanaur/marine-species-analytics\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a><a href=\"https://github.com/Feanaur/marine-species-analytics\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/aleksandr-kolmakov/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/alex-kolmakov\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Kang Zhi Yong</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Eduardo Muñoz Sala</td>\n  <td><a href=\"https://github.com/edumunozsala/GDELT-Events-Data-Eng-Project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/edumunozsala/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a 
href=\"https://github.com/edumunozsala\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Kirill Bazarov</td>\n  <td><a href=\"https://github.com/kirill505/de-zoomcamp-project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/kirill-bazarov-66ba3152\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/kirill505\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Shayan Shafiee Moghadam</td>\n  <td><a href=\"https://github.com/shayansm2/DE-zoomcamp-playground/tree/de-zoomcamp-2nd-project/github-events-analyzer\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a><a href=\"https://github.com/shayansm2/tech-career-explorer/tree/de-zoomcamp-project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/shayan-shafiee-moghadam/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/shayansm2\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Landry N.</td>\n  <td><a href=\"https://github.com/drux31/capstone-dezoomcamp\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://github.com/drux31\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nThanks for the awsome course.\n</details></td>\n</tr>\n<tr>\n  <td>Condescending Austin</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Lee Durbin</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Loving Einstein</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Carlos Vecina Tebar</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Abiodun Oki</td>\n  <td></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/okibaba/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Okibaba\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nthoroughly enjoyed the course, great work Alexey & course team!\n</details></td>\n</tr>\n<tr>\n  <td>Jimoh</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Sleepy Villani</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Ella Cinders</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Max Lutz</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Jessica De Silva</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Daniel Okello</td>\n  <td></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/okellodaniel/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a 
href=\"https://github.com/okellodaniel\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Kirill Sitnikov</td>\n  <td><a href=\"https://github.com/Siddha911/Citibike-data-project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"Siddha911\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nThank you Alexey and all DTC team! I’m so glad that I knew about your courses and projects!\n</details></td>\n</tr>\n<tr>\n  <td>edumad</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Duy Quoc Vo</td>\n  <td><a href=\"https://github.com/voduyquoc/air_pollution_tracking\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/voduyquoc/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/voduyquoc\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nNA\n</details></td>\n</tr>\n<tr>\n  <td>Xiang Li</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Sugeng Wahyudi</td>\n  <td><a href=\"https://github.com/Gengsu07/DEGengsuProject\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/sugeng-wahyudi-8a3939132/\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Gengsu07\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nThanks a lot, this was amazing. Can't miss another course and zoomcamp from datatalks.club\n</details></td>\n</tr>\n<tr>\n  <td>Anatolii Kryvko</td>\n  <td><a href=\"https://github.com/Nogromi/ukraine-vaccinations/tree/master\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/anatolii-kryvko-69b538107/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Nogromi\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>David Vanegas</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Honey Badger</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Abdelrahman Kamal</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Jean Paul Rodriguez</td>\n  <td><a href=\"https://github.com/jeanpaulrd1/de-zc-final-project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/jean-paul-rodriguez\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/jeanpaulrd1/de-zc-final-project\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Eager Pasteur</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Damian Pszczoła</td>\n  <td><a href=\"https://github.com/d4mp3/GLDAS-Data-Pipeline\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/damian-pszczo%C5%82a-7aba54241/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/d4mp3\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>ManPrat</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>forrest_parnassus</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Ramazan Abylkassov</td>\n  <td><a href=\"https://github.com/ramazanabylkassov/aviation_stack_project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/ramazan-abylkassov-23965097/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/ramazanabylkassov\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nLook mom, I am on leaderboard!\n</details></td>\n</tr>\n<tr>\n  <td>Digamber Deshmukh</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Andrew Lee</td>\n  <td><a 
href=\"https://github.com/wndrlxx/ca-trademarks-data-pipeline\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Matt R</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Raul Antonio Catacora Grundy</td>\n  <td><a href=\"https://github.com/Cerpint4xt/data-engineering-all-news-project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/raul-catacora-grundy-208315236/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Cerpint4xt\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nI just want to thank everyone, all the instructors, collaborators for creating this amazing set of resources and such a solid community based on sharing and caring. 
Many many thanks and shout out to you guys\n</details></td>\n</tr>\n<tr>\n  <td>Ranga H.</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Salma Gouda</td>\n  <td><a href=\"https://github.com/salmagouda/data-engineering-capstone/tree/main\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://linkedin.com/in/salmagouda\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/salmagouda\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Artsiom Turevich</td>\n  <td><a href=\"https://github.com/aturevich/zoomcamp_de_project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/artsiom-turevich/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"a.turevich\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nA long time ago in a galaxy far, far away...\n</details></td>\n</tr>\n<tr>\n  <td>Abhirup Ghosh</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Sonny Pham</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Peter Tran</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Ritika Tilwalia</td>\n  <td><a href=\"https://github.com/rtilwalia/Fashion-Campus-Orders\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" 
/></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/ritika-tilwalia/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/rtilwalia\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Eager Yalow</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Dave Samaniego</td>\n  <td><a href=\"https://github.com/nishiikata/de-zoomcamp-2024-mage-capstone\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/dave-s-32545014a\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/nishiikata\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nThank you DataTalksClub for the course. 
It was challenging learning many new things, but I had fun along the way too!\n</details></td>\n</tr>\n<tr>\n  <td>Lucid Keldysh</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Isaac Ndirangu Muturi</td>\n  <td><a href=\"https://github.com/Isaac-Ndirangu-Muturi-749/End_to_end_data_pipeline--Optimizing_Online_Retail_Analytics_with_Data_and_Analytics_Engineering\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/isaac-muturi-3b6b2b237\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Isaac-Ndirangu-Muturi-749\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nAmazing learning experience\n</details></td>\n</tr>\n<tr>\n  <td>Agitated Wing</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Hanaa HAMMAD</td>\n  <td></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/hanaahammad/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/hanaahammad\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nGrateful to this great course\n</details></td>\n</tr>\n<tr>\n  <td>Jonah Oliver</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Paul Emilio Arizpe Colorado</td>\n  <td><a href=\"https://github.com/kiramishima/crimes_in_mexico_city_analysis\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  
<td> <a href=\"https://www.linkedin.com/in/parizpe/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/kiramishima\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nDataTalksClub brought me the opportunity to learn data engineering. Thanks for all :D\n</details></td>\n</tr>\n<tr>\n  <td>Asma-Chloë FARAH</td>\n  <td><a href=\"https://github.com/AsmaChloe/traffic_counting_paris\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/asma-chloefarah/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/AsmaChloe\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nThank you for this amazing zoomcamp ! 
It was really fun !\n</details></td>\n</tr>\n<tr>\n  <td>Happy Feistel</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Luca Pugliese</td>\n  <td><a href=\"https://github.com/lucapug/nyc-bike-analytics\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/lucapug/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/lucapug\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nit has been a crowdlearning experience! starting in thousands of us. 359 graduated in the end. Proud to have classified 59th. Thanks to all.\n</details></td>\n</tr>\n<tr>\n  <td>Jake Maund</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Aditya Phulallwar</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Dave Wilson</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Haitham Hussein Hamad</td>\n  <td><a href=\"https://github.com/haithamhamad2/kaggle-survey\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/haitham-hamad-8926b415/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/haithamhamad2\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Alexandre Bergere aka Rocket</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>TOGBAN COKOUVI Joyce Elvis 
Mahoutondji</td>\n  <td><a href=\"https://github.com/lvsuno/Github_data_analysis\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/elvistogban/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/lvsuno/Github_data_analysis\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Sad Robinson</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Tetiana Omelchenko</td>\n  <td><a href=\"https://github.com/TOmelchenko/LifeExpectancyProject\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"www.linkedin.com/in/tetiana-omelchenko-35177379\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/TOmelchenko/LifeExpectancyProject\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Amanda Kershaw</td>\n  <td><a href=\"https://github.com/ANKershaw/youtube_video_ranks\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/amandalnkershaw\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/ANKershaw/youtube_video_ranks\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nThis course was incredibly rewarding and absolutely worth the effort.\n</details></td>\n</tr>\n<tr>\n  <td>Kristjan Sert</td>\n  <td><a href=\"https://github.com/KrisSert/cadaster-ee\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/kristjan-sert-043396131/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/KrisSert\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Murad Arfanyan</td>\n  <td><a href=\"https://github.com/murkenson/movies_tv_shows_data_pipeline\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/murad-arfanyan-846786176/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/murkenson\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Ecstatic Hofstadter</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Chung Huu Tin</td>\n  <td><a href=\"https://github.com/TinChung41/US-Accidents-Analysis-zoomcamp-project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"linkedin.com/in/huu-tin-chung\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/TinChung41\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Zen Mayer</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Zhastay Yeltay</td>\n  <td><a href=\"https://github.com/yelzha/tengrinews-open-project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/yelzha/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/yelzha\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\n;)\n</details></td>\n</tr>\n<tr>\n  <td>AV3NII</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Sebastian Alejandro Peralta Casafranca</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Relaxed Williams</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>George Mouratos</td>\n  <td><a href=\"https://github.com/Gimour/Datatalks_final_project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/gmouratos/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/Gimour/DataTalks, https://github.com/Gimour/Datatalks_final_project\"><img 
src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\n-\n</details></td>\n</tr>\n<tr>\n  <td>mhmed ahmed rjb</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Frosty Jackson</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>WANJOHI</td>\n  <td><a href=\"https://github.com/DE-ZoomCamp/Flood-Monitoring\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://github.com/DE-ZoomCamp/Flood-Monitoring\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Ighorr Holstrom</td>\n  <td><a href=\"https://github.com/askeladden31/air_raids_data/\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/ighorr-holstrom/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/askeladden31\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Jesse Delzio</td>\n  <td></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/delzioj\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/delzio\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Khalil El Daou</td>\n  <td><a 
href=\"https://github.com/khalileldoau/global-news-engagement-on-social-media\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/khalil-el-daou-177a8b114?utm_source=share&utm_campaign=share_via&utm_content=profile&utm_medium=android_app\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/khalileldoau\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td><details>\n<summary>comment</summary>\nAlready made a post about the zoomcamp\n</details></td>\n</tr>\n<tr>\n  <td>Juan Rojas</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Gonçalo</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Muhamad Farikhin</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Bold Lederberg</td>\n  <td></a></td>\n  <td></td>\n  <td></td>\n</tr>\n<tr>\n  <td>Taras Shalaiko</td>\n  <td><a href=\"https://github.com/tarasenya/dezoomcamp_final_project\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></a></td>\n  <td> <a href=\"https://www.linkedin.com/in/taras-shalaiko-30114a107/\"><img src=\"https://user-images.githubusercontent.com/875246/192300614-2ce22ed5-bbc4-4684-8098-d8128d71aac5.png\" height=\"16em\" /></a> <a href=\"https://github.com/tarasenya\"><img src=\"https://user-images.githubusercontent.com/875246/192300611-a606521b-cb76-4090-be8e-7cc21752b996.png\" height=\"16em\" /></a></td>\n  <td></td>\n</tr>\n</table>\n"
  },
  {
    "path": "cohorts/2024/project.md",
    "content": "## Course Project\n\nThe goal of this project is to apply everything we learned\nin this course and build an end-to-end data pipeline.\n\nYou will have two attempts to submit your project. If you don't have \ntime to submit your project by the end of attempt #1 (you started the \ncourse late, you have vacation plans, life/work got in the way, etc.)\nor you fail your first attempt, \nthen you will have a second chance to submit your project as attempt\n#2. \n\nThere are only two attempts.\n\nRemember that to pass the project, you must evaluate 3 peers. If you don't do that,\nyour project can't be considered complete.\n\nTo find the projects assigned to you, use the peer review assignments link \nand find your hash in the first column. You will see three rows: you need to evaluate \neach of these projects. For each project, you need to submit the form once,\nso in total, you will make three submissions. \n\n\n### Submitting\n\n#### Project Attempt #1\n\n* Project: https://courses.datatalks.club/de-zoomcamp-2024/project/project1\n* Review: https://courses.datatalks.club/de-zoomcamp-2024/project/project1/eval\n\n#### Project Attempt #2\n\n* Project: https://courses.datatalks.club/de-zoomcamp-2024/project/project2\n* Review: https://courses.datatalks.club/de-zoomcamp-2024/project/project2/eval\n\n> **Important**: update your \"Certificate name\" here: https://courses.datatalks.club/de-zoomcamp-2024/enrollment -\nthis is what we will use when generating certificates for you.\n\n### Evaluation criteria\n\nSee [here](../../week_7_project/README.md)\n\n\n"
  },
  {
    "path": "cohorts/2024/workshops/dlt.md",
    "content": "# Data ingestion with dlt\n\n​In this hands-on workshop, we’ll learn how to build data ingestion pipelines.\n\n​We’ll cover the following steps:\n\n* ​Extracting data from APIs, or files.\n* ​Normalizing and loading data\n* ​Incremental loading\n\n​By the end of this workshop, you’ll be able to write data pipelines like a senior data engineer: Quickly, concisely, scalable, and self-maintaining.\n\nVideo: https://www.youtube.com/live/oLXhBM7nf2Q\n\n--- \n\n# Navigation\n\n* [Workshop content](dlt_resources/data_ingestion_workshop.md)\n* [Workshop notebook](dlt_resources/workshop.ipynb)\n* [Homework starter notebook](dlt_resources/homework_starter.ipynb)\n\n# Resources\n\n- Website and community: Visit our [docs](https://dlthub.com/docs/intro), discuss on our slack (Link at top of docs).\n- Course colab: [Notebook](https://colab.research.google.com/drive/1kLyD3AL-tYf_HqCXYnA3ZLwHGpzbLmoj#scrollTo=5aPjk0O3S_Ag&forceEdit=true&sandboxMode=true).\n- dlthub [community Slack](https://dlthub.com/community).\n\n---\n\n# Teacher\n\nWelcome to the data talks club data engineering zoomcamp, the data ingestion workshop.\n\n- My name is [Adrian](https://www.linkedin.com/in/data-team/), and I work in the data field since 2012\n    - I built many data warehouses some lakes, and a few data teams\n    - 10 years into my career I started working on dlt “data load tool”, which is an open source library to enable data engineers to build faster and better.\n    - I started working on dlt because data engineering is one of the few areas of software engineering where we do not have developer tools to do our work.\n    - Building better pipelines would require more code re-use - we cannot all just build perfect pipelines from scratch every time.\n    - And so dlt was born, a library that automates the tedious part of data ingestion: Loading, schema management, data type detection, scalability, self healing, scalable extraction… you get the idea - essentially a data 
engineer’s “one stop shop” for best practice data pipelining.\n    - Due to its **simplicity** of use, dlt enables **laymen** to\n        - Build pipelines 5-10x faster than without it\n        - Build self-healing, self-maintaining pipelines with all the best practices of data engineers. Automating schema changes removes the bulk of maintenance efforts.\n        - Govern your pipelines with schema evolution alerts and data contracts.\n        - Generally develop pipelines like a senior, commercial data engineer.\n\n--- \n\n# Course\nYou can find the course file [here](./dlt_resources/data_ingestion_workshop.md).\n\nThe course has 3 parts:\n- [Extraction Section](./dlt_resources/data_ingestion_workshop.md#extracting-data): In this section we will learn about scalable extraction\n- [Normalisation Section](./dlt_resources/data_ingestion_workshop.md#normalisation): In this section we will learn to prepare data for loading\n- [Loading Section](./dlt_resources/data_ingestion_workshop.md#incremental-loading): Here we will learn about incremental loading modes\n\n---\n\n# Homework\n\nThe [linked colab notebook](https://colab.research.google.com/drive/1Te-AT0lfh0GpChg1Rbd0ByEKOHYtWXfm#scrollTo=wLF4iXf-NR7t&forceEdit=true&sandboxMode=true) offers a few exercises to practice what you learned today.\n\n\n#### Question 1: What is the sum of the outputs of the generator for limit = 5?\n- **A**: 10.23433234744176\n- **B**: 7.892332347441762\n- **C**: 8.382332347441762\n- **D**: 9.123332347441762\n\n#### Question 2: What is the 13th number yielded by the generator?\n- **A**: 4.236551275463989\n- **B**: 3.605551275463989\n- **C**: 2.345551275463989\n- **D**: 5.678551275463989\n\n#### Question 3: Append the 2 generators. After correctly appending the data, calculate the sum of all ages of people.\n- **A**: 353\n- **B**: 365\n- **C**: 378\n- **D**: 390\n\n#### Question 4: Merge the 2 generators using the ID column. 
Calculate the sum of ages of all the people loaded as described above.\n- **A**: 215\n- **B**: 266\n- **C**: 241\n- **D**: 258\n\nSubmit the solution here: https://courses.datatalks.club/de-zoomcamp-2024/homework/workshop1\n\n--- \n# Next steps\n\nAs you are learning the various concepts of data engineering, \nconsider creating a portfolio project that will further your own knowledge.\n\nBy demonstrating the ability to deliver end to end, you will have an easier time finding your first role. \nThis will help regardless of whether your hiring manager reviews your project, largely because you will have a better \nunderstanding and will be able to talk the talk.\n\nHere are some example projects that others did with dlt:\n- Serverless dlt-dbt on cloud functions: [Article](https://docs.getdbt.com/blog/serverless-dlt-dbt-stack)\n- Bird finder: [Part 1](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-i), [Part 2](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-ii)\n- Event ingestion on GCP: [Article and repo](https://dlthub.com/docs/blog/streaming-pubsub-json-gcp)\n- Event ingestion on AWS: [Article and repo](https://dlthub.com/docs/blog/dlt-aws-taktile-blog)\n- Or see one of the many demos created by our working students: [Hacker news](https://dlthub.com/docs/blog/hacker-news-gpt-4-dashboard-demo), \n[GA4 events](https://dlthub.com/docs/blog/ga4-internal-dashboard-demo), \n[an E-Commerce](https://dlthub.com/docs/blog/postgresql-bigquery-metabase-demo), \n[google sheets](https://dlthub.com/docs/blog/google-sheets-to-data-warehouse-pipeline), \n[Motherduck](https://dlthub.com/docs/blog/dlt-motherduck-demo), \n[MongoDB + Holistics](https://dlthub.com/docs/blog/MongoDB-dlt-Holistics), \n[Deepnote](https://dlthub.com/docs/blog/deepnote-women-wellness-violence-tends), \n[Prefect](https://dlthub.com/docs/blog/dlt-prefect),\n[PowerBI vs GoodData vs 
Metabase](https://dlthub.com/docs/blog/semantic-modeling-tools-comparison),\n[Dagster](https://dlthub.com/docs/blog/dlt-dagster),\n[Ingesting events via GCP webhooks](https://dlthub.com/docs/blog/dlt-webhooks-on-cloud-functions-for-event-capture),\n[SAP to Snowflake replication](https://dlthub.com/docs/blog/sap-hana-to-snowflake-demo-blog),\n[Read emails and send summary to Slack with AI and Kestra](https://dlthub.com/docs/blog/dlt-kestra-demo-blog),\n[Mode + dlt capabilities](https://dlthub.com/docs/blog/dlt-mode-blog),\n[dbt on cloud functions](https://dlthub.com/docs/blog/dlt-dbt-runner-on-cloud-functions)\n- If you want to use dlt in your project, [check this list of public APIs](https://dlthub.com/docs/blog/practice-api-sources)\n\n\nIf you create a personal project, consider submitting it to our blog - we will be happy to showcase it. Just drop us a line in the dlt Slack.\n\n\n\n**And don't forget, if you like dlt**\n- **Give us a [GitHub Star!](https://github.com/dlt-hub/dlt)**\n- **Join our [Slack community](https://dlthub.com/community)**\n\n\n# Notes\n\n* Add your notes here\n"
  },
  {
    "path": "cohorts/2024/workshops/dlt_resources/data_ingestion_workshop.md",
    "content": "# Intro\n\nWhat is data loading, or data ingestion?\n\nData ingestion is the process of extracting data from a producer, transporting it to a convenient environment, and preparing it for usage by normalising it, sometimes cleaning, and adding metadata.\n\n### “A wild dataset magically appears!”\n\nIn many data science teams, data magically appears - because the engineer loads it.\n\n- Sometimes the format in which it appears is structured, and with explicit schema\n    - In that case, they can go straight to using it; Examples: Parquet, Avro, or table in a db,\n- Sometimes the format is weakly typed and without explicit schema, such as csv, json\n    - in which case some extra normalisation or cleaning might be needed before usage\n\n> 💡 **What is a schema?** The schema specifies the expected format and structure of data within a document or data store, defining the allowed keys, their data types, and any constraints or relationships.\n\n\n### Be the magician! 😎\n\nSince you are here to learn about data engineering, you will be the one making datasets magically appear. \n\nHere’s what you need to learn to build pipelines\n\n- Extracting data\n- Normalising, cleaning, adding metadata such as schema and types\n- and Incremental loading, which is vital for fast, cost effective data refreshes.\n\n### What else does a data engineer do? 
What are we not learning, and what are we learning?\n\n- It might seem simplistic, but in fact a data engineer’s main goal is to ensure data flows from source systems to analytical destinations.\n- So besides building pipelines, running pipelines and fixing pipelines, a data engineer may also focus on optimising data storage, ensuring data quality and integrity, implementing effective data governance practices, and continuously refining data architecture to meet the evolving needs of the organisation.\n- Ultimately, a data engineer's role extends beyond the mechanical aspects of pipeline development, encompassing the strategic management and enhancement of the entire data lifecycle.\n- This workshop focuses on building robust, scalable, self-maintaining pipelines, with built-in governance - in other words, best practices applied.\n\n# Extracting data\n\n### The considerations of extracting data\n\nIn this section we will learn about extracting data from source systems, and what to care about when doing so.\n\nMost data is stored behind an API:\n\n- Sometimes that’s a RESTful API for some business application, returning records of data.\n- Sometimes the API returns a secure file path to something like a JSON or Parquet file in a bucket that enables you to grab the data in bulk.\n- Sometimes the API is something else (MongoDB, SQL, other databases or applications) and will generally return records as JSON - the most common interchange format.\n\nAs an engineer, you will need to build pipelines that “just work”. \n\nSo here’s what you need to consider on extraction, to prevent the pipelines from breaking, and to keep them running smoothly.\n\n- Hardware limits: During this course we will cover how to navigate the challenges of managing memory.\n- Network limits: Sometimes networks can fail. We can’t fix what could go wrong but we can retry network jobs until they succeed. For example, the dlt library offers a requests “replacement” that has built-in retries. 
[Docs](https://dlthub.com/docs/reference/performance#using-the-built-in-requests-client). We won’t focus on this during the course but you can read the docs on your own.\n- Source api limits: Each source might have some limits such as how many requests you can do per second. We would call these “rate limits”. Read each source’s docs carefully to understand how to navigate these obstacles. You can find some examples of how to wait for rate limits in our verified sources repositories\n    - examples: [Zendesk](https://developer.zendesk.com/api-reference/introduction/rate-limits/), [Shopify](https://shopify.dev/docs/api/usage/rate-limits)\n\n### Extracting data without hitting hardware limits\n\nWhat kind of limits could you hit on your machine? In the case of data extraction, the only limits are memory and storage. This refers to the RAM or virtual memory, and the disk, or physical storage.\n\n### **Managing memory.**\n\n- Many data pipelines run on serverless functions or on orchestrators that delegate the workloads to clusters of small workers.\n- These systems have a small memory or share it between multiple workers - so filling the memory is BAAAD: It might lead to not only your pipeline crashing, but crashing the entire container or machine that might be shared with other worker processes, taking them down too.\n- The same can be said about disk - in most cases your disk is sufficient, but in some cases it’s not. For those cases, mounting an external drive mapped to a storage bucket is the way to go. 
Airflow for example supports a “data” folder that is used just like a local folder but can be mapped to a bucket for unlimited capacity.\n\n### So how do we avoid filling the memory?\n\n- We often do not know the volume of data upfront\n- And we cannot scale dynamically or infinitely on hardware during runtime\n- So the answer is: Control the max memory you use\n\n### Control the max memory used by streaming the data\n\nStreaming here refers to processing the data event by event or chunk by chunk instead of doing bulk operations. \n\nLet’s look at some classic examples of streaming where data is transferred chunk by chunk or event by event\n\n- Between an audio broadcaster and an in-browser audio player\n- Between a server and a local video player\n- Between a smart home device or IoT device and your phone\n- between google maps and your navigation app\n- Between instagram live and your followers\n\nWhat do data engineers do? We usually stream the data between buffers, such as \n\n- from API to local file\n- from webhooks to event queues\n- from event queue (Kafka, SQS) to Bucket\n\n### Streaming in python via generators\n\nLet’s focus on how we build most data pipelines:\n\n- To process data in a stream in python, we use generators, which are functions that can return multiple times - by allowing multiple returns, the data can be released as it’s produced, as stream, instead of returning it all at once as a batch.\n\nTake the following theoretical example: \n\n- We search twitter for “cat pictures”. We do not know how many pictures will be returned - maybe 10, maybe 10.000.000. Will they fit in memory? Who knows.\n- So to grab this data without running out of memory, we would use a python generator.\n- What’s a generator? In simple words, it’s a function that can return multiple times. 
Here’s an example of a regular function, and how that function looks if written as a generator.\n\n### Generator examples:\n\nLet’s look at a regular returning function, and how we can re-write it as a generator.\n\n**Regular function collects data in memory.** Here you can see how data is collected row by row in a list called `data` before it is returned. This will break if we have more data than memory.\n\n```python\n# paginated_get is a placeholder for any function that yields rows page by page\ndef search_twitter(query):\n\tdata = []\n\tfor row in paginated_get(query):\n\t\tdata.append(row)\n\treturn data\n\n# Collect all the cat picture data\nfor row in search_twitter(\"cat pictures\"):\n\t# Once collected,\n\t# print row by row\n\tprint(row)\n```\n\nWhen calling `for row in search_twitter(\"cat pictures\"):` all the data must first be downloaded before the first record is returned.\n\nLet’s see how we could rewrite this as a generator.\n\n**Generator for streaming the data.** The memory usage here is minimal.\n\nAs you can see, in the modified function, we yield each row as we get the data, without collecting it into memory. We can then run this generator and handle the data item by item.\n\n```python\ndef search_twitter(query):\n\tfor row in paginated_get(query):\n\t\tyield row\n\n# Get one row at a time\nfor row in search_twitter(\"cat pictures\"):\n\t# print the row\n\tprint(row)\n\t# do something with the row such as cleaning it and writing it to a buffer\n\t# continue requesting and printing data\n```\n\nWhen calling `for row in search_twitter(\"cat pictures\"):` the function only runs until the first data item is yielded, before printing - so we do not need to wait long for the first value. It will then continue until there is no more data to get.\n\nIf we wanted to get all the values at once from a generator instead of one by one, we would need to first “run” the generator and collect the data. 
For example, if we wanted to get all the data in memory we could do `data = list(search_twitter(\"cat pictures\"))` which would run the generator and collect all the data in a list before continuing.\n\n## 3 Extraction examples:\n\n### Example 1: Grabbing data from an api\n\n> 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab or in your local setup.\n\n\nFor these purposes we created an api that can serve the data you are already familiar with, the NYC taxi dataset.\n\nThe api documentation is as follows:\n\n- There is a limited number of records behind the api\n- The data can be requested page by page, each page containing 1000 records\n- If we request a page with no data, we will get a successful response with no data\n- so this means that when we get an empty page, we know there is no more data and we can stop requesting pages - this is a common way to paginate but not the only one - each api may be different.\n- details:\n    - method: get\n    - url: `https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api`\n    - parameters: `page`  integer. Represents the page number you are requesting. Defaults to 1.\n    \n\nSo how do we design our requester? \n\n- We need to request page by page until we get no more data. At this point, we do not know how much data is behind the api.\n- It could be 1000 records or it could be 10GB of records. 
So let’s grab the data with a generator to avoid having to fit an undetermined amount of data into RAM.\n\nIn this approach to grabbing data from apis, we have pros and cons:\n\n- Pros: **Easy memory management** thanks to api returning events/pages\n- Cons: **Low throughput**, due to the data transfer being constrained via an API.\n\n```python\nimport requests\n\nBASE_API_URL = \"https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api\"\n\n# I call this a paginated getter\n# as it's a function that gets data\n# and also paginates until there is no more data\n# by yielding pages, we \"microbatch\", which speeds up downstream processing\n\ndef paginated_getter():\n    page_number = 1\n\n    while True:\n        # Set the query parameters\n        params = {'page': page_number}\n\n        # Make the GET request to the API\n        response = requests.get(BASE_API_URL, params=params)\n        response.raise_for_status()  # Raise an HTTPError for bad responses\n        page_json = response.json()\n        print(f'got page number {page_number} with {len(page_json)} records')\n\n        # if the page has records, yield it and move on to the next page\n        if page_json:\n            yield page_json\n            page_number += 1\n        else:\n            # No more data, break the loop\n            break\n\nif __name__ == '__main__':\n    # Use the generator to iterate over pages\n    for page_data in paginated_getter():\n        # Process each page as needed\n        print(page_data)\n```\n\n### Example 2: Grabbing the same data from file - simple download\n\n\n> 💡 This part is demonstrative, so you do not need to follow along; just pay attention.\n\n\n- Why am I showing you this? So that when you do this in the future, you will remember there is a best practice you can apply for scalability.\n\nSome apis respond with files instead of pages of data. The reason for this is simple: Throughput and cost. 
A restful api that returns data has to read the data from storage and process and return it to you by some logic - If this data is large, this costs time, money and creates a bottleneck. \n\nA better way is to offer the data as files that someone can download from storage directly, without going through the restful api layer. This is common for apis that offer large volumes of data, such as ad impressions data.\n\nIn this example, we grab exactly the same data as we did in the API example above, but now we get it from the underlying file instead of going through the API.\n\n- Pros: **High throughput**\n- Cons: **Memory** is used to hold all the data\n\nThis is how the code could look. As you can see in this case our `data`and  `parsed_data` variables hold the entire file data in memory before returning it. Not great.\n\n```python\nimport requests\nimport json\n\nurl = \"https://storage.googleapis.com/dtc_zoomcamp_api/yellow_tripdata_2009-06.jsonl\"\n\ndef download_and_read_jsonl(url):\n    response = requests.get(url)\n    response.raise_for_status()  # Raise an HTTPError for bad responses\n    data = response.text.splitlines()\n    parsed_data = [json.loads(line) for line in data]\n    return parsed_data\n   \n\ndownloaded_data = download_and_read_jsonl(url)\n\nif downloaded_data:\n    # Process or print the downloaded data as needed\n    print(downloaded_data[:5])  # Print the first 5 entries as an example\n```\n\n### Example 3: Same file, streaming download\n\n\n> 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab\n\nOk, downloading files is simple, but what if we want to do a stream download?\n\nThat’s possible too - in effect giving us the best of both worlds. In this case we prepared a jsonl file which is already split into lines making our code simple. 
But json (not jsonl) files could also be downloaded in this fashion, for example using the `ijson` library.\n\nWhat are the pros and cons of this method of grabbing data?\n\nPros: **High throughput, easy memory management,** because we are downloading a file directly and processing it row by row\n\nCons: **Difficult to do for columnar file formats**, as entire blocks need to be downloaded before they can be deserialised to rows. Sometimes, the code is complex too.\n\nHere’s what the code looks like - in a jsonl file each line is a json document, or a “row” of data, so we yield them as they get downloaded. This allows us to download one row and process it before getting the next row.\n\n```python\nimport requests\nimport json\n\ndef download_and_yield_rows(url):\n    response = requests.get(url, stream=True)\n    response.raise_for_status()  # Raise an HTTPError for bad responses\n\n    for line in response.iter_lines():\n        if line:\n            yield json.loads(line)\n\n# Replace the URL with your actual URL\nurl = \"https://storage.googleapis.com/dtc_zoomcamp_api/yellow_tripdata_2009-06.jsonl\"\n\n# Use the generator to iterate over rows with minimal memory usage\nfor row in download_and_yield_rows(url):\n    # Process each row as needed\n    print(row)\n```\n\nIn the colab notebook you can also find a code snippet to load the data - but we will load some data later in the course and you can explore the colab on your own after the course. \n\nWhat is worth keeping in mind at this point is that our loader library that we will use later, `dlt` (data load tool), will respect the streaming concept of the generator and will process it in an efficient way, keeping memory usage low and using parallelism where possible.\n\nLet’s move over to the Colab notebook and run examples 2 and 3, compare them, and finally load examples 1 and 3 to DuckDB.\n\n# Normalising data\n\nYou often hear that data people spend most of their time “cleaning” data. What does this mean? 
\n\nLet’s look granularly into what people consider data cleaning. \n\nUsually we have 2 parts: \n\n- Normalising data without changing its meaning,\n- and filtering data for a use case, which changes its meaning.\n\n### Part of what we often call data cleaning is just metadata work:\n\n- Add types (string to number, string to timestamp, etc)\n- Rename columns: Ensure column names follow a supported standard downstream - such as no strange characters in the names.\n- Flatten nested dictionaries: Bring nested dictionary values into the top dictionary row\n- Unnest lists or arrays into child tables: Arrays or lists cannot be flattened into their parent record, so if we want flat data we need to break them out into separate tables.\n- We will look at a practical example next, as these concepts can be difficult to visualise from text.\n\n### **Why prepare data? why not use json as is?**\n\n- We do not easily know what is inside a json document due to lack of schema\n- Types are not enforced between rows of json - we could have one record where age is `25`and another where age is `twenty five` , and another where it’s `25.00`.  Or in some systems, you might have a dictionary for a single record, but a list of dicts for multiple records. This could easily lead to applications downstream breaking.\n- We cannot just use json data easily, for example we would need to convert strings to time if we want to do a daily aggregation.\n- Reading json loads more data into memory, as the whole document is scanned - while in parquet or databases we can scan a single column of a document. 
This causes costs and slowness.\n- Json is not fast to aggregate - columnar formats are.\n- Json is not fast to search.\n- Basically json is designed as a \"lowest common denominator format\" for \"interchange\" / data transfer and is unsuitable for direct analytical usage.\n\n### Practical example\n\n\n> 💡 This is the bread and butter of data engineers pulling data, so follow along in the colab notebook.\n\nIn the case of the NY taxi rides data, the dataset is quite clean - so let’s instead use a small example of more complex data. Let’s assume we know some information about passengers and stops.\n\nFor this example we modified the dataset as follows\n\n- We added nested dictionaries\n    \n    ```json\n    \"coordinates\": {\n                \"start\": {\n                    \"lon\": -73.787442,\n                    \"lat\": 40.641525\n                    },\n    ```\n    \n- We added nested lists\n    \n    ```json\n    \"passengers\": [\n                {\"name\": \"John\", \"rating\": 4.9},\n                {\"name\": \"Jack\", \"rating\": 3.9}\n                  ],\n    ```\n    \n- We added a record hash that gives us an unique id for the record, for easy identification\n    \n    ```json\n    \"record_hash\": \"b00361a396177a9cb410ff61f20015ad\",\n    ```\n    \n\nWe want to load this data to a database. 
How do we want to clean the data?\n\n- We want to flatten dictionaries into the base row\n- We want to flatten lists into a separate table\n- We want to convert time strings into time type\n\n```python\ndata = [\n    {\n        \"vendor_name\": \"VTS\",\n\t\t\"record_hash\": \"b00361a396177a9cb410ff61f20015ad\",\n        \"time\": {\n            \"pickup\": \"2009-06-14 23:23:00\",\n            \"dropoff\": \"2009-06-14 23:48:00\"\n        },\n        \"Trip_Distance\": 17.52,\n        \"coordinates\": {\n            \"start\": {\n                \"lon\": -73.787442,\n                \"lat\": 40.641525\n            },\n            \"end\": {\n                \"lon\": -73.980072,\n                \"lat\": 40.742963\n            }\n        },\n        \"Rate_Code\": None,\n        \"store_and_forward\": None,\n        \"Payment\": {\n            \"type\": \"Credit\",\n            \"amt\": 20.5,\n            \"surcharge\": 0,\n            \"mta_tax\": None,\n            \"tip\": 9,\n            \"tolls\": 4.15,\n\t\t\t\"status\": \"booked\"\n        },\n        \"Passenger_Count\": 2,\n        \"passengers\": [\n            {\"name\": \"John\", \"rating\": 4.9},\n            {\"name\": \"Jack\", \"rating\": 3.9}\n        ],\n        \"Stops\": [\n            {\"lon\": -73.6, \"lat\": 40.6},\n            {\"lon\": -73.5, \"lat\": 40.5}\n        ]\n    },\n]\n```\n\nNow let’s normalise this data.\n\n## Introducing dlt\n\ndlt is a python library created for the purpose of assisting data engineers to build simpler, faster and more robust pipelines with minimal effort. \n\nYou can think of dlt as a loading tool that implements the best practices of data pipelines enabling you to just “use” those best practices in your own pipelines, in a declarative way. 
\n\nThis enables you to stop reinventing the flat tyre, and leverage dlt to build pipelines much faster than if you did everything from scratch.\n\ndlt automates much of the tedious work a data engineer would do, and does it in a way that is robust. dlt can handle things like:\n\n- Schema: Inferring and evolving schema, alerting on changes, using schemas as data contracts.\n- Typing data, flattening structures, renaming columns to fit database standards. In our example we will pass the “data” you can see above and see it normalised.\n- Processing a stream of events/rows without filling memory. This includes extraction from generators.\n- Loading to a variety of dbs or file formats.\n\nLet’s use it to load our nested json to duckdb:\n\nHere’s how you would do that on your local machine. I will walk you through before showing you in colab as well.\n\nFirst, install dlt\n\n```bash\n# Make sure you are using Python 3.8-3.11 and have pip installed\n# spin up a venv\npython -m venv ./env\nsource ./env/bin/activate\n# pip install (quoted so that shells such as zsh do not expand the brackets)\npip install \"dlt[duckdb]\"\n```\n\nNext, grab your data from above and run this snippet\n\n- here we define a pipeline, which is a connection to a destination\n- and we run the pipeline, printing the outcome\n\n```python\nimport dlt\n\n# define the connection to load to. \n# We now use duckdb, but you can switch to Bigquery later\npipeline = dlt.pipeline(pipeline_name=\"taxi_data\",\n                        destination='duckdb',\n                        dataset_name='taxi_rides')\n\n# run the pipeline with default settings, and capture the outcome\ninfo = pipeline.run(data, \n                    table_name=\"users\", \n                    write_disposition=\"replace\")\n\n# show the outcome\nprint(info)\n```\n\nIf you are running dlt locally you can use the built in streamlit app by running the cli command with the pipeline name we chose above.\n\n```bash\ndlt pipeline taxi_data show\n```\n\nOr explore the data in the linked colab notebook. 
I’ll switch to it now to show you the data.\n\n# Incremental loading\n\nIncremental loading means that as we update our datasets with the new data, we would only load the new data, as opposed to making a full copy of a source’s data all over again and replacing the old version.\n\nBy loading incrementally, our pipelines run faster and cheaper.\n\n- Incremental loading goes hand in hand with incremental extraction and state, two concepts which we will not delve into during this workshop\n    - `State` is information that keeps track of what was loaded, to know what else remains to be loaded. dlt stores the state at the destination in a separate table.\n    - Incremental extraction refers to only requesting the increment of data that we need, and not more. This is tightly connected to the state to determine the exact chunk that needs to be extracted and loaded.\n- You can learn more about incremental extraction and state by reading the dlt docs on how to do it.\n\n### dlt currently supports 2 ways of loading incrementally:\n\n1. Append: \n    - We can use this for immutable or stateless events (data that doesn’t change), such as taxi rides - For example,  every day there are new rides, and we could load the new ones only instead of the entire history.\n    - We could also use this to load different versions of stateful data, for example for creating a “slowly changing dimension” table for auditing changes. For example, if we load a list of cars and their colors every day, and one day one car changes color, we need both sets of data to be able to discern that a change happened.\n2. 
Merge: \n    - We can use this to update data that changes.\n    - For example, a taxi ride could have a payment status, which is originally “booked” but could later be changed into “paid”, “rejected” or “cancelled”.\n\nHere is how you can think about which method to use:\n\n![Incremental Loading](./incremental_loading.png)\n\n* If you want to keep track of when changes occur in stateful data (slowly changing dimension) then you will need to append the data.\n\n### Let’s do a merge example together:\n\n\n> 💡 This is the bread and butter of data engineers pulling data, so follow along.\n\n\n- In our previous example, the payment status was “booked”; it has now changed to “cancelled”. Perhaps Jack likes to defraud taxi drivers and that explains his low rating. Besides the ride status change, both passengers also got their ratings lowered.\n- The merge operation replaces an old record with a new one based on a key. The key could consist of multiple fields or a single unique id. We will use the `record_hash` that we created for simplicity. If you do not have a unique key, you could create one deterministically out of several fields, such as by concatenating the data and hashing it.\n- A merge operation replaces rows; it does not update them. If you want to update only parts of a row, you would have to load the new data by appending it and doing a custom transformation to combine the old and new data.\n\nIn this example, the ratings of the 2 passengers got lowered and we need to update the values. 
We do it by using the merge write disposition, replacing the records identified by `record_hash` present in the new data.\n\n```python\ndata = [\n    {\n        \"vendor_name\": \"VTS\",\n        \"record_hash\": \"b00361a396177a9cb410ff61f20015ad\",\n        \"time\": {\n            \"pickup\": \"2009-06-14 23:23:00\",\n            \"dropoff\": \"2009-06-14 23:48:00\"\n        },\n        \"Trip_Distance\": 17.52,\n        \"coordinates\": {\n            \"start\": {\n                \"lon\": -73.787442,\n                \"lat\": 40.641525\n            },\n            \"end\": {\n                \"lon\": -73.980072,\n                \"lat\": 40.742963\n            }\n        },\n        \"Rate_Code\": None,\n        \"store_and_forward\": None,\n        \"Payment\": {\n            \"type\": \"Credit\",\n            \"amt\": 20.5,\n            \"surcharge\": 0,\n            \"mta_tax\": None,\n            \"tip\": 9,\n            \"tolls\": 4.15,\n            \"status\": \"cancelled\"\n        },\n        \"Passenger_Count\": 2,\n        \"passengers\": [\n            {\"name\": \"John\", \"rating\": 4.4},\n            {\"name\": \"Jack\", \"rating\": 3.6}\n        ],\n        \"Stops\": [\n            {\"lon\": -73.6, \"lat\": 40.6},\n            {\"lon\": -73.5, \"lat\": 40.5}\n        ]\n    },\n]\n\n# define the connection to load to. \n# We now use duckdb, but you can switch to Bigquery later\n# reuse the pipeline name from the first load so the merge targets the same local duckdb database\npipeline = dlt.pipeline(pipeline_name=\"taxi_data\",\n                        destination='duckdb',\n                        dataset_name='taxi_rides')\n\n# run the pipeline with the merge write disposition, and capture the outcome\ninfo = pipeline.run(data,\n                    table_name=\"users\",\n                    write_disposition=\"merge\",\n                    merge_key=\"record_hash\")\n\n# show the outcome\nprint(info)\n```\n\nAs you can see in your notebook, the payment status and both passengers’ ratings were updated after running the code.\n\n### What’s next?\n\n- You could change the destination to parquet + local file system or storage bucket. 
See the colab bonus section.\n- You could change the destination to BigQuery. Destination & credential setup docs: https://dlthub.com/docs/dlt-ecosystem/destinations/, https://dlthub.com/docs/walkthroughs/add_credentials (or see the colab bonus section).\n- You could use a decorator to convert the generator into a customised dlt resource: https://dlthub.com/docs/general-usage/resource\n- You can deep dive into building more complex pipelines by following the guides:\n    - https://dlthub.com/docs/walkthroughs\n    - https://dlthub.com/docs/build-a-pipeline-tutorial\n- You can join our [Slack community](https://dlthub.com/community) and engage with us there."
  },
  {
    "path": "cohorts/2024/workshops/dlt_resources/homework_solution.ipynb",
    "content": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"name\": \"python3\",\n      \"display_name\": \"Python 3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    }\n  },\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# **Homework**: Data talks club data engineering zoomcamp Data loading workshop\\n\",\n        \"\\n\",\n        \"Hello folks, let's practice what we learned - Loading data with the best practices of data engineering.\\n\",\n        \"\\n\",\n        \"Here are the exercises we will do\\n\",\n        \"\\n\",\n        \"\\n\"\n      ],\n      \"metadata\": {\n        \"id\": \"mrTFv5nPClXh\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# 1. Use a generator\\n\",\n        \"\\n\",\n        \"Remember the concept of generator? Let's practice using them to futher our understanding of how they work.\\n\",\n        \"\\n\",\n        \"Let's define a generator and then run it as practice.\\n\",\n        \"\\n\",\n        \"**Answer the following questions:**\\n\",\n        \"\\n\",\n        \"- **Question 1: What is the sum of the outputs of the generator for limit = 5?**\\n\",\n        \"- **Question 2: What is the 13th number yielded**\\n\",\n        \"\\n\",\n        \"I suggest practicing these questions without GPT as the purpose is to further your learning.\"\n      ],\n      \"metadata\": {\n        \"id\": \"wLF4iXf-NR7t\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"def square_root_generator(limit):\\n\",\n        \"    n = 1\\n\",\n        \"    while n <= limit:\\n\",\n        \"        yield n ** 0.5\\n\",\n        \"        n += 1\\n\",\n        \"\\n\",\n        \"# Example usage:\\n\",\n        \"limit = 5\\n\",\n        \"generator = square_root_generator(limit)\\n\",\n        
\"\\n\",\n        \"for sqrt_value in generator:\\n\",\n        \"    print(sqrt_value)\\n\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"wLng-bDJN4jf\",\n        \"outputId\": \"547683cb-5f56-4815-a903-d0d9578eb1f9\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"1.0\\n\",\n            \"1.4142135623730951\\n\",\n            \"1.7320508075688772\\n\",\n            \"2.0\\n\",\n            \"2.23606797749979\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [],\n      \"metadata\": {\n        \"id\": \"xbe3q55zN43j\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# 2. Append a generator to a table with existing data\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"Below you have 2 generators. You will be tasked to load them to duckdb and answer some questions from the data\\n\",\n        \"\\n\",\n        \"1. Load the first generator and calculate the sum of ages of all people. Make sure to only load it once.\\n\",\n        \"2. Append the second generator to the same table as the first.\\n\",\n        \"3. 
**After correctly appending the data, calculate the sum of all ages of people.**\\n\",\n        \"\\n\",\n        \"\\n\"\n      ],\n      \"metadata\": {\n        \"id\": \"vjWhILzGJMpK\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"2MoaQcdLBEk6\",\n        \"outputId\": \"d2b93dc1-d83f-44ea-aeff-fdf51d75f7aa\"\n      },\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"{'ID': 1, 'Name': 'Person_1', 'Age': 26, 'City': 'City_A'}\\n\",\n            \"{'ID': 2, 'Name': 'Person_2', 'Age': 27, 'City': 'City_A'}\\n\",\n            \"{'ID': 3, 'Name': 'Person_3', 'Age': 28, 'City': 'City_A'}\\n\",\n            \"{'ID': 4, 'Name': 'Person_4', 'Age': 29, 'City': 'City_A'}\\n\",\n            \"{'ID': 5, 'Name': 'Person_5', 'Age': 30, 'City': 'City_A'}\\n\",\n            \"{'ID': 3, 'Name': 'Person_3', 'Age': 33, 'City': 'City_B', 'Occupation': 'Job_3'}\\n\",\n            \"{'ID': 4, 'Name': 'Person_4', 'Age': 34, 'City': 'City_B', 'Occupation': 'Job_4'}\\n\",\n            \"{'ID': 5, 'Name': 'Person_5', 'Age': 35, 'City': 'City_B', 'Occupation': 'Job_5'}\\n\",\n            \"{'ID': 6, 'Name': 'Person_6', 'Age': 36, 'City': 'City_B', 'Occupation': 'Job_6'}\\n\",\n            \"{'ID': 7, 'Name': 'Person_7', 'Age': 37, 'City': 'City_B', 'Occupation': 'Job_7'}\\n\",\n            \"{'ID': 8, 'Name': 'Person_8', 'Age': 38, 'City': 'City_B', 'Occupation': 'Job_8'}\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"def people_1():\\n\",\n        \"    for i in range(1, 6):\\n\",\n        \"        yield {\\\"ID\\\": i, \\\"Name\\\": f\\\"Person_{i}\\\", \\\"Age\\\": 25 + i, \\\"City\\\": \\\"City_A\\\"}\\n\",\n        \"\\n\",\n        \"for person in people_1():\\n\",\n        \"    
print(person)\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"def people_2():\\n\",\n        \"    for i in range(3, 9):\\n\",\n        \"        yield {\\\"ID\\\": i, \\\"Name\\\": f\\\"Person_{i}\\\", \\\"Age\\\": 30 + i, \\\"City\\\": \\\"City_B\\\", \\\"Occupation\\\": f\\\"Job_{i}\\\"}\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"for person in people_2():\\n\",\n        \"    print(person)\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [],\n      \"metadata\": {\n        \"id\": \"vtdTIm4fvQCN\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# 3. Merge a generator\\n\",\n        \"\\n\",\n        \"Re-use the generators from Exercise 2.\\n\",\n        \"\\n\",\n        \"A table's primary key needs to be created from the start, so load your data to a new table with primary key ID.\\n\",\n        \"\\n\",\n        \"Load your first generator first, and then load the second one with merge. Since they have overlapping IDs, some of the records from the first load should be replaced by the ones from the second load.\\n\",\n        \"\\n\",\n        \"After loading, you should have a total of 8 records, and ID 3 should have age 33.\\n\",\n        \"\\n\",\n        \"Question: **Calculate the sum of ages of all the people loaded as described above.**\\n\"\n      ],\n      \"metadata\": {\n        \"id\": \"pY4cFAWOSwN1\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# Solution: First make sure that the following modules are installed:\"\n      ],\n      \"metadata\": {\n        \"id\": \"kKB2GTB9oVjr\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"%%capture\\n\",\n        \"# Install the dependencies (the %%capture cell magic must be the first line of the cell)\\n\",\n        \"!pip install dlt[duckdb]\"\n      ],\n      \"metadata\": {\n        \"id\": \"xTVvtyqrfVNq\"\n      },\n      \"execution_count\": null,\n      \"outputs\": []\n    },\n    
{\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# Solutions\\n\",\n        \"\\n\",\n        \"You can use these solutions to self check your results, or to check how the answer can be obtained if you get stuck.\"\n      ],\n      \"metadata\": {\n        \"id\": \"kUG4DNYGb5dF\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"\\n\"\n      ],\n      \"metadata\": {\n        \"id\": \"ks6Sh_jBJWdh\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"## Solution 1\"\n      ],\n      \"metadata\": {\n        \"id\": \"U61tgQaYb8Yt\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"def sum_of_generator_outputs(generator, limit):\\n\",\n        \"    return sum(next(generator) for _ in range(limit))\\n\",\n        \"\\n\",\n        \"# Example usage:\\n\",\n        \"limit_1 = 5\\n\",\n        \"generator_1 = square_root_generator(limit_1)\\n\",\n        \"result_1 = sum_of_generator_outputs(generator_1, limit_1)\\n\",\n        \"print(f\\\"The sum of the outputs for limit={limit_1} is: {result_1}\\\")\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"def nth_yielded_number(generator, n):\\n\",\n        \"    for _ in range(n - 1):\\n\",\n        \"        next(generator)\\n\",\n        \"    return next(generator)\\n\",\n        \"\\n\",\n        \"# Example usage:\\n\",\n        \"n = 13\\n\",\n        \"generator_2 = square_root_generator(n)\\n\",\n        \"result_2 = nth_yielded_number(generator_2, n)\\n\",\n        \"print(f\\\"The {n}th number yielded is: {result_2}\\\")\\n\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"Roc3y_lSTSfn\",\n        \"outputId\": \"f03d348e-cdfa-44d0-e5f2-276db6af1cf5\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        
{\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"The sum of the outputs for limit=5 is: 8.382332347441762\\n\",\n            \"The 13th number yielded is: 3.605551275463989\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"## Solution 2: Append a generator\\n\",\n        \"\\n\",\n        \"Load your first generator first, and then load the second one using the \\\"append\\\" operation. Since they have overlapping IDs, some records will appear multiple times.\\n\",\n        \"\\n\",\n        \"After loading, you should have a total of 11 records.\\n\",\n        \"\\n\",\n        \"Question: Calculate the sum of ages of all the people loaded as described above\"\n      ],\n      \"metadata\": {\n        \"id\": \"M3PJYca2TIw8\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"# Importing the DLT library\\n\",\n        \"import dlt\\n\",\n        \"\\n\",\n        \"# Create a DLT pipeline for the first generator `people_1`\\n\",\n        \"# The pipeline is set to load data into a DuckDB database with the dataset named 'people'\\n\",\n        \"people_1_pipeline = dlt.pipeline(destination='duckdb', dataset_name='people')\\n\",\n        \"\\n\",\n        \"# Run the pipeline for the first generator, creating or replacing the table 'people'\\n\",\n        \"info = people_1_pipeline.run(people_1(),\\n\",\n        \"                             table_name=\\\"people\\\",\\n\",\n        \"                             write_disposition=\\\"replace\\\")\\n\",\n        \"\\n\",\n        \"print(f\\\"{info}\\\\n\\\\n\\\")\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"# Create a second DLT pipeline for the generator `people_2`, targeting the same DuckDB database and dataset\\n\",\n        \"people_2_pipeline = dlt.pipeline(destination='duckdb', dataset_name='people')\\n\",\n        \"\\n\",\n       
 \"# Run the second pipeline, appending data from `people_2` to the existing 'people' table\\n\",\n        \"info = people_2_pipeline.run(people_2(),\\n\",\n        \"                             table_name=\\\"people\\\",\\n\",\n        \"                             write_disposition=\\\"append\\\")\\n\",\n        \"\\n\",\n        \"print(f\\\"{info}\\\\n\\\\n\\\")\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"# Importing the DuckDB library\\n\",\n        \"import duckdb\\n\",\n        \"\\n\",\n        \"# Connect to the DuckDB database created by the first generator\\n\",\n        \"conn = duckdb.connect(f\\\"{people_1_pipeline.pipeline_name}.duckdb\\\")\\n\",\n        \"\\n\",\n        \"# Setting the search path to the dataset 'people' and displaying available tables\\n\",\n        \"conn.sql(f\\\"SET search_path = '{people_1_pipeline.dataset_name}'\\\")\\n\",\n        \"print('Loaded tables: ')\\n\",\n        \"display(conn.sql(\\\"show tables\\\"))\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"# Fetching the appended data from the 'people' table and displaying it\\n\",\n        \"data = conn.sql(\\\"SELECT * FROM people\\\").df()\\n\",\n        \"display(data)\\n\",\n        \"\\n\",\n        \"# Calculate the sum of ages from the combined data of `people_1` and `people_2` in the 'people' table\\n\",\n        \"sum_of_ages_p1_p2 = conn.execute(\\\"SELECT SUM(age) FROM people\\\").fetchone()[0]\\n\",\n        \"print(f\\\"\\\\n\\\\nSum of ages from generators `people_1()` and `people_2()` combined: {sum_of_ages_p1_p2}\\\")\\n\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 841\n        },\n        \"id\": \"0u2mtndkTLpk\",\n        \"outputId\": \"d5d253de-4502-42bf-ac89-08e0a7065d85\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n   
         \"Pipeline dlt_colab_kernel_launcher load step completed in 0.59 seconds\\n\",\n            \"1 load package(s) were loaded to destination duckdb and into dataset people\\n\",\n            \"The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\\n\",\n            \"Load package 1706029306.7456656 is LOADED and contains no failed jobs\\n\",\n            \"\\n\",\n            \"\\n\",\n            \"Pipeline dlt_colab_kernel_launcher load step completed in 0.43 seconds\\n\",\n            \"1 load package(s) were loaded to destination duckdb and into dataset people\\n\",\n            \"The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\\n\",\n            \"Load package 1706029307.9851513 is LOADED and contains no failed jobs\\n\",\n            \"\\n\",\n            \"\\n\",\n            \"Loaded tables: \\n\"\n          ]\n        },\n        {\n          \"output_type\": \"display_data\",\n          \"data\": {\n            \"text/plain\": [\n              \"┌─────────────────────┐\\n\",\n              \"│        name         │\\n\",\n              \"│       varchar       │\\n\",\n              \"├─────────────────────┤\\n\",\n              \"│ _dlt_loads          │\\n\",\n              \"│ _dlt_pipeline_state │\\n\",\n              \"│ _dlt_version        │\\n\",\n              \"│ people              │\\n\",\n              \"└─────────────────────┘\"\n            ]\n          },\n          \"metadata\": {}\n        },\n        {\n          \"output_type\": \"display_data\",\n          \"data\": {\n            \"text/plain\": [\n              \"    id      name  age    city        _dlt_load_id         _dlt_id occupation\\n\",\n              \"0    1  Person_1   26  City_A  1706029306.7456656  An8WyXL43/J1GQ       None\\n\",\n              \"1    2  Person_2   27  City_A  1706029306.7456656  ZGI1S72CddPbJQ       None\\n\",\n              \"2    3  Person_3   
28  City_A  1706029306.7456656  +z4Pm5oCykL2Vg       None\\n\",\n              \"3    4  Person_4   29  City_A  1706029306.7456656  0Vfr36JHZ34OJA       None\\n\",\n              \"4    5  Person_5   30  City_A  1706029306.7456656  aA+9WOclw3YWpg       None\\n\",\n              \"5    3  Person_3   33  City_B  1706029307.9851513  mEegoM7n4XujYw      Job_3\\n\",\n              \"6    4  Person_4   34  City_B  1706029307.9851513  FPrsrzXgz+E9Fw      Job_4\\n\",\n              \"7    5  Person_5   35  City_B  1706029307.9851513  ZaAOBa5EEqXU1Q      Job_5\\n\",\n              \"8    6  Person_6   36  City_B  1706029307.9851513  gmcktDnX6y4Fmg      Job_6\\n\",\n              \"9    7  Person_7   37  City_B  1706029307.9851513  960gdVKySsa4JA      Job_7\\n\",\n              \"10   8  Person_8   38  City_B  1706029307.9851513  +su5IfZQyFEsEw      Job_8\"\n            ],\n            \"text/html\": [\n              \"\\n\",\n              \"  <div id=\\\"df-164dc4c0-056c-460d-b99f-0582206da3c6\\\" class=\\\"colab-df-container\\\">\\n\",\n              \"    <div>\\n\",\n              \"<style scoped>\\n\",\n              \"    .dataframe tbody tr th:only-of-type {\\n\",\n              \"        vertical-align: middle;\\n\",\n              \"    }\\n\",\n              \"\\n\",\n              \"    .dataframe tbody tr th {\\n\",\n              \"        vertical-align: top;\\n\",\n              \"    }\\n\",\n              \"\\n\",\n              \"    .dataframe thead th {\\n\",\n              \"        text-align: right;\\n\",\n              \"    }\\n\",\n              \"</style>\\n\",\n              \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n              \"  <thead>\\n\",\n              \"    <tr style=\\\"text-align: right;\\\">\\n\",\n              \"      <th></th>\\n\",\n              \"      <th>id</th>\\n\",\n              \"      <th>name</th>\\n\",\n              \"      <th>age</th>\\n\",\n              \"      <th>city</th>\\n\",\n              
\"      <th>_dlt_load_id</th>\\n\",\n              \"      <th>_dlt_id</th>\\n\",\n              \"      <th>occupation</th>\\n\",\n              \"    </tr>\\n\",\n              \"  </thead>\\n\",\n              \"  <tbody>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>0</th>\\n\",\n              \"      <td>1</td>\\n\",\n              \"      <td>Person_1</td>\\n\",\n              \"      <td>26</td>\\n\",\n              \"      <td>City_A</td>\\n\",\n              \"      <td>1706029306.7456656</td>\\n\",\n              \"      <td>An8WyXL43/J1GQ</td>\\n\",\n              \"      <td>None</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>1</th>\\n\",\n              \"      <td>2</td>\\n\",\n              \"      <td>Person_2</td>\\n\",\n              \"      <td>27</td>\\n\",\n              \"      <td>City_A</td>\\n\",\n              \"      <td>1706029306.7456656</td>\\n\",\n              \"      <td>ZGI1S72CddPbJQ</td>\\n\",\n              \"      <td>None</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>2</th>\\n\",\n              \"      <td>3</td>\\n\",\n              \"      <td>Person_3</td>\\n\",\n              \"      <td>28</td>\\n\",\n              \"      <td>City_A</td>\\n\",\n              \"      <td>1706029306.7456656</td>\\n\",\n              \"      <td>+z4Pm5oCykL2Vg</td>\\n\",\n              \"      <td>None</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>3</th>\\n\",\n              \"      <td>4</td>\\n\",\n              \"      <td>Person_4</td>\\n\",\n              \"      <td>29</td>\\n\",\n              \"      <td>City_A</td>\\n\",\n              \"      <td>1706029306.7456656</td>\\n\",\n              \"      <td>0Vfr36JHZ34OJA</td>\\n\",\n              \"      <td>None</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n            
  \"      <th>4</th>\\n\",\n              \"      <td>5</td>\\n\",\n              \"      <td>Person_5</td>\\n\",\n              \"      <td>30</td>\\n\",\n              \"      <td>City_A</td>\\n\",\n              \"      <td>1706029306.7456656</td>\\n\",\n              \"      <td>aA+9WOclw3YWpg</td>\\n\",\n              \"      <td>None</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>5</th>\\n\",\n              \"      <td>3</td>\\n\",\n              \"      <td>Person_3</td>\\n\",\n              \"      <td>33</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>1706029307.9851513</td>\\n\",\n              \"      <td>mEegoM7n4XujYw</td>\\n\",\n              \"      <td>Job_3</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>6</th>\\n\",\n              \"      <td>4</td>\\n\",\n              \"      <td>Person_4</td>\\n\",\n              \"      <td>34</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>1706029307.9851513</td>\\n\",\n              \"      <td>FPrsrzXgz+E9Fw</td>\\n\",\n              \"      <td>Job_4</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>7</th>\\n\",\n              \"      <td>5</td>\\n\",\n              \"      <td>Person_5</td>\\n\",\n              \"      <td>35</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>1706029307.9851513</td>\\n\",\n              \"      <td>ZaAOBa5EEqXU1Q</td>\\n\",\n              \"      <td>Job_5</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>8</th>\\n\",\n              \"      <td>6</td>\\n\",\n              \"      <td>Person_6</td>\\n\",\n              \"      <td>36</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>1706029307.9851513</td>\\n\",\n              
\"      <td>gmcktDnX6y4Fmg</td>\\n\",\n              \"      <td>Job_6</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>9</th>\\n\",\n              \"      <td>7</td>\\n\",\n              \"      <td>Person_7</td>\\n\",\n              \"      <td>37</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>1706029307.9851513</td>\\n\",\n              \"      <td>960gdVKySsa4JA</td>\\n\",\n              \"      <td>Job_7</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>10</th>\\n\",\n              \"      <td>8</td>\\n\",\n              \"      <td>Person_8</td>\\n\",\n              \"      <td>38</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>1706029307.9851513</td>\\n\",\n              \"      <td>+su5IfZQyFEsEw</td>\\n\",\n              \"      <td>Job_8</td>\\n\",\n              \"    </tr>\\n\",\n              \"  </tbody>\\n\",\n              \"</table>\\n\",\n              \"</div>\\n\",\n              \"  </div>\\n\"\n            ]\n          },\n          \"metadata\": {}\n        },\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"\\n\",\n            \"\\n\",\n            \"Sum of ages from generators `people_1()` and `people_2()` combined: 353\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"## Solution 3: Merge a generator\\n\",\n        \"\\n\",\n        \"A table's primary key needs to be created from the start, so load your data to a new table with primary key ID.\\n\",\n        \"\\n\",\n        \"Load your first generator first, and then load the second one with merge. 
Since they have overlapping IDs, some of the records from the first load should be replaced by the ones from the second load.\\n\",\n        \"\\n\",\n        \"After loading, you should have a total of 8 records, and ID 3 should have age 33.\"\n      ],\n      \"metadata\": {\n        \"id\": \"G-T-jR9qlzdB\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"import dlt\\n\",\n        \"\\n\",\n        \"# Set up a DLT pipeline.\\n\",\n        \"# Currently using DuckDB for local testing, but it can be switched to BigQuery for production.\\n\",\n        \"generators_pipeline = dlt.pipeline(destination='duckdb', dataset_name='people_merge')\\n\",\n        \"\\n\",\n        \"# Load data from the first generator `people_1` into the 'people_merged' table.\\n\",\n        \"# This operation will replace any existing data in the table.\\n\",\n        \"# A primary key 'ID' is specified so that the later merge can match records on it.\\n\",\n        \"info = generators_pipeline.run(people_1(),\\n\",\n        \"                               table_name=\\\"people_merged\\\",\\n\",\n        \"                               write_disposition=\\\"replace\\\",\\n\",\n        \"                               primary_key=\\\"ID\\\")\\n\",\n        \"\\n\",\n        \"# Print metadata of the loading process for the first generator.\\n\",\n        \"print(f\\\"{info}\\\\n\\\\n\\\")\\n\",\n        \"\\n\",\n        \"# Load data from the second generator `people_2` into the same 'people_merged' table.\\n\",\n        \"# This operation will merge the new data with existing data based on the primary key 'ID'.\\n\",\n        \"info = generators_pipeline.run(people_2(),\\n\",\n        \"                               table_name=\\\"people_merged\\\",\\n\",\n        \"                               write_disposition=\\\"merge\\\",\\n\",\n        \"                               primary_key=\\\"ID\\\")\\n\",\n        \"\\n\",\n        \"# Print metadata of the loading 
process for the second generator.\\n\",\n        \"print(f\\\"{info}\\\\n\\\\n\\\")\\n\",\n        \"\\n\",\n        \"import duckdb\\n\",\n        \"\\n\",\n        \"# Establish a connection to the DuckDB database created by the pipeline.\\n\",\n        \"conn = duckdb.connect(f\\\"{generators_pipeline.pipeline_name}.duckdb\\\")\\n\",\n        \"\\n\",\n        \"# Set the search path to the dataset 'people_merge' and display the available tables.\\n\",\n        \"conn.sql(f\\\"SET search_path = '{generators_pipeline.dataset_name}'\\\")\\n\",\n        \"print('Loaded tables: ')\\n\",\n        \"display(conn.sql(\\\"show tables\\\"))\\n\",\n        \"\\n\",\n        \"# Display the merged data from the 'people_merged' table.\\n\",\n        \"print(\\\"\\\\n\\\\n\\\\nData from the 'people_merged' table:\\\")\\n\",\n        \"data = conn.sql(\\\"SELECT * FROM people_merged\\\").df()\\n\",\n        \"display(data)\\n\",\n        \"\\n\",\n        \"# Calculate and display the sum of ages from the merged data in 'people_merged' table.\\n\",\n        \"sum_of_ages_p1_p2 = conn.execute(\\\"SELECT SUM(age) FROM people_merged\\\").fetchone()[0]\\n\",\n        \"print(f\\\"\\\\n\\\\nSum of ages of people in generator `people_1()` merged with generator `people_2()` is: {sum_of_ages_p1_p2}\\\")\\n\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\",\n          \"height\": 773\n        },\n        \"id\": \"rXR-IN85kBtq\",\n        \"outputId\": \"c74a7ab7-aa77-4445-c2bc-e782054a7201\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"Pipeline dlt_colab_kernel_launcher load step completed in 0.24 seconds\\n\",\n            \"1 load package(s) were loaded to destination duckdb and into dataset people_merge\\n\",\n            \"The duckdb destination used 
duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\\n\",\n            \"Load package 1706030294.0522 is LOADED and contains no failed jobs\\n\",\n            \"\\n\",\n            \"\\n\",\n            \"Pipeline dlt_colab_kernel_launcher load step completed in 0.42 seconds\\n\",\n            \"1 load package(s) were loaded to destination duckdb and into dataset people_merge\\n\",\n            \"The duckdb destination used duckdb:////content/dlt_colab_kernel_launcher.duckdb location to store data\\n\",\n            \"Load package 1706030294.7037766 is LOADED and contains no failed jobs\\n\",\n            \"\\n\",\n            \"\\n\",\n            \"Loaded tables: \\n\"\n          ]\n        },\n        {\n          \"output_type\": \"display_data\",\n          \"data\": {\n            \"text/plain\": [\n              \"┌─────────────────────┐\\n\",\n              \"│        name         │\\n\",\n              \"│       varchar       │\\n\",\n              \"├─────────────────────┤\\n\",\n              \"│ _dlt_loads          │\\n\",\n              \"│ _dlt_pipeline_state │\\n\",\n              \"│ _dlt_version        │\\n\",\n              \"│ people_merged       │\\n\",\n              \"│ people_v2           │\\n\",\n              \"└─────────────────────┘\"\n            ]\n          },\n          \"metadata\": {}\n        },\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"\\n\",\n            \"\\n\",\n            \"\\n\",\n            \"Data from the 'people_merged' table:\\n\"\n          ]\n        },\n        {\n          \"output_type\": \"display_data\",\n          \"data\": {\n            \"text/plain\": [\n              \"   id      name  age    city occupation        _dlt_load_id         _dlt_id\\n\",\n              \"0   8  Person_8   38  City_B      Job_8  1706030294.7037766  Q1k+DIAjXLL7cg\\n\",\n              \"1   4  Person_4   34  City_B      Job_4  
1706030294.7037766  ewlZ3LjULEchiQ\\n\",\n              \"2   5  Person_5   35  City_B      Job_5  1706030294.7037766  X+LfQEa/X8GU9w\\n\",\n              \"3   7  Person_7   37  City_B      Job_7  1706030294.7037766  lQT0h7IL7E/wxg\\n\",\n              \"4   3  Person_3   33  City_B      Job_3  1706030294.7037766  gRBswCo8B/DJmw\\n\",\n              \"5   6  Person_6   36  City_B      Job_6  1706030294.7037766  M3IbNKfZZCtbcQ\"\n            ],\n            \"text/html\": [\n              \"\\n\",\n              \"  <div id=\\\"df-2f5274be-509c-41be-924d-49590376474d\\\" class=\\\"colab-df-container\\\">\\n\",\n              \"    <div>\\n\",\n              \"<style scoped>\\n\",\n              \"    .dataframe tbody tr th:only-of-type {\\n\",\n              \"        vertical-align: middle;\\n\",\n              \"    }\\n\",\n              \"\\n\",\n              \"    .dataframe tbody tr th {\\n\",\n              \"        vertical-align: top;\\n\",\n              \"    }\\n\",\n              \"\\n\",\n              \"    .dataframe thead th {\\n\",\n              \"        text-align: right;\\n\",\n              \"    }\\n\",\n              \"</style>\\n\",\n              \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n              \"  <thead>\\n\",\n              \"    <tr style=\\\"text-align: right;\\\">\\n\",\n              \"      <th></th>\\n\",\n              \"      <th>id</th>\\n\",\n              \"      <th>name</th>\\n\",\n              \"      <th>age</th>\\n\",\n              \"      <th>city</th>\\n\",\n              \"      <th>occupation</th>\\n\",\n              \"      <th>_dlt_load_id</th>\\n\",\n              \"      <th>_dlt_id</th>\\n\",\n              \"    </tr>\\n\",\n              \"  </thead>\\n\",\n              \"  <tbody>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>0</th>\\n\",\n              \"      <td>8</td>\\n\",\n              \"      <td>Person_8</td>\\n\",\n              \"      
<td>38</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>Job_8</td>\\n\",\n              \"      <td>1706030294.7037766</td>\\n\",\n              \"      <td>Q1k+DIAjXLL7cg</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>1</th>\\n\",\n              \"      <td>4</td>\\n\",\n              \"      <td>Person_4</td>\\n\",\n              \"      <td>34</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>Job_4</td>\\n\",\n              \"      <td>1706030294.7037766</td>\\n\",\n              \"      <td>ewlZ3LjULEchiQ</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>2</th>\\n\",\n              \"      <td>5</td>\\n\",\n              \"      <td>Person_5</td>\\n\",\n              \"      <td>35</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>Job_5</td>\\n\",\n              \"      <td>1706030294.7037766</td>\\n\",\n              \"      <td>X+LfQEa/X8GU9w</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>3</th>\\n\",\n              \"      <td>7</td>\\n\",\n              \"      <td>Person_7</td>\\n\",\n              \"      <td>37</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>Job_7</td>\\n\",\n              \"      <td>1706030294.7037766</td>\\n\",\n              \"      <td>lQT0h7IL7E/wxg</td>\\n\",\n              \"    </tr>\\n\",\n              \"    <tr>\\n\",\n              \"      <th>4</th>\\n\",\n              \"      <td>3</td>\\n\",\n              \"      <td>Person_3</td>\\n\",\n              \"      <td>33</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>Job_3</td>\\n\",\n              \"      <td>1706030294.7037766</td>\\n\",\n              \"      <td>gRBswCo8B/DJmw</td>\\n\",\n              \"    </tr>\\n\",\n              \"    
<tr>\\n\",\n              \"      <th>5</th>\\n\",\n              \"      <td>6</td>\\n\",\n              \"      <td>Person_6</td>\\n\",\n              \"      <td>36</td>\\n\",\n              \"      <td>City_B</td>\\n\",\n              \"      <td>Job_6</td>\\n\",\n              \"      <td>1706030294.7037766</td>\\n\",\n              \"      <td>M3IbNKfZZCtbcQ</td>\\n\",\n              \"    </tr>\\n\",\n              \"  </tbody>\\n\",\n              \"</table>\\n\",\n              \"</div>\\n\",\n              \"    <div class=\\\"colab-df-buttons\\\">\\n\",\n              \"\\n\",\n              \"  <div class=\\\"colab-df-container\\\">\\n\",\n              \"    <button class=\\\"colab-df-convert\\\" onclick=\\\"convertToInteractive('df-2f5274be-509c-41be-924d-49590376474d')\\\"\\n\",\n              \"            title=\\\"Convert this dataframe to an interactive table.\\\"\\n\",\n              \"            style=\\\"display:none;\\\">\\n\",\n              \"\\n\",\n              \"  <svg xmlns=\\\"http://www.w3.org/2000/svg\\\" height=\\\"24px\\\" viewBox=\\\"0 -960 960 960\\\">\\n\",\n              \"    <path d=\\\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\\\"/>\\n\",\n              \"  </svg>\\n\",\n              \"    </button>\\n\",\n              \"\\n\",\n              \"  <style>\\n\",\n              \"    .colab-df-container {\\n\",\n              \"      display:flex;\\n\",\n              \"      gap: 12px;\\n\",\n              \"    }\\n\",\n              \"\\n\",\n              \"    .colab-df-convert {\\n\",\n              \"      background-color: #E8F0FE;\\n\",\n              \"      border: none;\\n\",\n              \"      border-radius: 50%;\\n\",\n              \"      cursor: pointer;\\n\",\n              \"      display: none;\\n\",\n              \"      fill: 
#1967D2;\\n\",\n              \"      height: 32px;\\n\",\n              \"      padding: 0 0 0 0;\\n\",\n              \"      width: 32px;\\n\",\n              \"    }\\n\",\n              \"\\n\",\n              \"    .colab-df-convert:hover {\\n\",\n              \"      background-color: #E2EBFA;\\n\",\n              \"      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\\n\",\n              \"      fill: #174EA6;\\n\",\n              \"    }\\n\",\n              \"\\n\",\n              \"    .colab-df-buttons div {\\n\",\n              \"      margin-bottom: 4px;\\n\",\n              \"    }\\n\",\n              \"\\n\",\n              \"    [theme=dark] .colab-df-convert {\\n\",\n              \"      background-color: #3B4455;\\n\",\n              \"      fill: #D2E3FC;\\n\",\n              \"    }\\n\",\n              \"\\n\",\n              \"    [theme=dark] .colab-df-convert:hover {\\n\",\n              \"      background-color: #434B5C;\\n\",\n              \"      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\\n\",\n              \"      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\\n\",\n              \"      fill: #FFFFFF;\\n\",\n              \"    }\\n\",\n              \"  </style>\\n\",\n              \"\\n\",\n              \"    <script>\\n\",\n              \"      const buttonEl =\\n\",\n              \"        document.querySelector('#df-2f5274be-509c-41be-924d-49590376474d button.colab-df-convert');\\n\",\n              \"      buttonEl.style.display =\\n\",\n              \"        google.colab.kernel.accessAllowed ? 
'block' : 'none';\\n\",\n              \"\\n\",\n              \"      async function convertToInteractive(key) {\\n\",\n              \"        const element = document.querySelector('#df-2f5274be-509c-41be-924d-49590376474d');\\n\",\n              \"        const dataTable =\\n\",\n              \"          await google.colab.kernel.invokeFunction('convertToInteractive',\\n\",\n              \"                                                    [key], {});\\n\",\n              \"        if (!dataTable) return;\\n\",\n              \"\\n\",\n              \"        const docLinkHtml = 'Like what you see? Visit the ' +\\n\",\n              \"          '<a target=\\\"_blank\\\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\\n\",\n              \"          + ' to learn more about interactive tables.';\\n\",\n              \"        element.innerHTML = '';\\n\",\n              \"        dataTable['output_type'] = 'display_data';\\n\",\n              \"        await google.colab.output.renderOutput(dataTable, element);\\n\",\n              \"        const docLink = document.createElement('div');\\n\",\n              \"        docLink.innerHTML = docLinkHtml;\\n\",\n              \"        element.appendChild(docLink);\\n\",\n              \"      }\\n\",\n              \"    </script>\\n\",\n              \"  </div>\\n\",\n              \"\\n\",\n              \"\\n\",\n              \"<div id=\\\"df-59a3fb69-8001-41be-ac63-c616dc356aab\\\">\\n\",\n              \"  <button class=\\\"colab-df-quickchart\\\" onclick=\\\"quickchart('df-59a3fb69-8001-41be-ac63-c616dc356aab')\\\"\\n\",\n              \"            title=\\\"Suggest charts\\\"\\n\",\n              \"            style=\\\"display:none;\\\">\\n\",\n              \"\\n\",\n              \"<svg xmlns=\\\"http://www.w3.org/2000/svg\\\" height=\\\"24px\\\"viewBox=\\\"0 0 24 24\\\"\\n\",\n              \"     width=\\\"24px\\\">\\n\",\n              \"    <g>\\n\",\n    
          \"        <path d=\\\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\\\"/>\\n\",\n              \"    </g>\\n\",\n              \"</svg>\\n\",\n              \"  </button>\\n\",\n              \"\\n\",\n              \"<style>\\n\",\n              \"  .colab-df-quickchart {\\n\",\n              \"      --bg-color: #E8F0FE;\\n\",\n              \"      --fill-color: #1967D2;\\n\",\n              \"      --hover-bg-color: #E2EBFA;\\n\",\n              \"      --hover-fill-color: #174EA6;\\n\",\n              \"      --disabled-fill-color: #AAA;\\n\",\n              \"      --disabled-bg-color: #DDD;\\n\",\n              \"  }\\n\",\n              \"\\n\",\n              \"  [theme=dark] .colab-df-quickchart {\\n\",\n              \"      --bg-color: #3B4455;\\n\",\n              \"      --fill-color: #D2E3FC;\\n\",\n              \"      --hover-bg-color: #434B5C;\\n\",\n              \"      --hover-fill-color: #FFFFFF;\\n\",\n              \"      --disabled-bg-color: #3B4455;\\n\",\n              \"      --disabled-fill-color: #666;\\n\",\n              \"  }\\n\",\n              \"\\n\",\n              \"  .colab-df-quickchart {\\n\",\n              \"    background-color: var(--bg-color);\\n\",\n              \"    border: none;\\n\",\n              \"    border-radius: 50%;\\n\",\n              \"    cursor: pointer;\\n\",\n              \"    display: none;\\n\",\n              \"    fill: var(--fill-color);\\n\",\n              \"    height: 32px;\\n\",\n              \"    padding: 0;\\n\",\n              \"    width: 32px;\\n\",\n              \"  }\\n\",\n              \"\\n\",\n              \"  .colab-df-quickchart:hover {\\n\",\n              \"    background-color: var(--hover-bg-color);\\n\",\n              \"    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\\n\",\n              \"    fill: var(--button-hover-fill-color);\\n\",\n   
           \"  }\\n\",\n              \"\\n\",\n              \"  .colab-df-quickchart-complete:disabled,\\n\",\n              \"  .colab-df-quickchart-complete:disabled:hover {\\n\",\n              \"    background-color: var(--disabled-bg-color);\\n\",\n              \"    fill: var(--disabled-fill-color);\\n\",\n              \"    box-shadow: none;\\n\",\n              \"  }\\n\",\n              \"\\n\",\n              \"  .colab-df-spinner {\\n\",\n              \"    border: 2px solid var(--fill-color);\\n\",\n              \"    border-color: transparent;\\n\",\n              \"    border-bottom-color: var(--fill-color);\\n\",\n              \"    animation:\\n\",\n              \"      spin 1s steps(1) infinite;\\n\",\n              \"  }\\n\",\n              \"\\n\",\n              \"  @keyframes spin {\\n\",\n              \"    0% {\\n\",\n              \"      border-color: transparent;\\n\",\n              \"      border-bottom-color: var(--fill-color);\\n\",\n              \"      border-left-color: var(--fill-color);\\n\",\n              \"    }\\n\",\n              \"    20% {\\n\",\n              \"      border-color: transparent;\\n\",\n              \"      border-left-color: var(--fill-color);\\n\",\n              \"      border-top-color: var(--fill-color);\\n\",\n              \"    }\\n\",\n              \"    30% {\\n\",\n              \"      border-color: transparent;\\n\",\n              \"      border-left-color: var(--fill-color);\\n\",\n              \"      border-top-color: var(--fill-color);\\n\",\n              \"      border-right-color: var(--fill-color);\\n\",\n              \"    }\\n\",\n              \"    40% {\\n\",\n              \"      border-color: transparent;\\n\",\n              \"      border-right-color: var(--fill-color);\\n\",\n              \"      border-top-color: var(--fill-color);\\n\",\n              \"    }\\n\",\n              \"    60% {\\n\",\n              \"      border-color: transparent;\\n\",\n     
         \"      border-right-color: var(--fill-color);\\n\",\n              \"    }\\n\",\n              \"    80% {\\n\",\n              \"      border-color: transparent;\\n\",\n              \"      border-right-color: var(--fill-color);\\n\",\n              \"      border-bottom-color: var(--fill-color);\\n\",\n              \"    }\\n\",\n              \"    90% {\\n\",\n              \"      border-color: transparent;\\n\",\n              \"      border-bottom-color: var(--fill-color);\\n\",\n              \"    }\\n\",\n              \"  }\\n\",\n              \"</style>\\n\",\n              \"\\n\",\n              \"  <script>\\n\",\n              \"    async function quickchart(key) {\\n\",\n              \"      const quickchartButtonEl =\\n\",\n              \"        document.querySelector('#' + key + ' button');\\n\",\n              \"      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\\n\",\n              \"      quickchartButtonEl.classList.add('colab-df-spinner');\\n\",\n              \"      try {\\n\",\n              \"        const charts = await google.colab.kernel.invokeFunction(\\n\",\n              \"            'suggestCharts', [key], {});\\n\",\n              \"      } catch (error) {\\n\",\n              \"        console.error('Error during call to suggestCharts:', error);\\n\",\n              \"      }\\n\",\n              \"      quickchartButtonEl.classList.remove('colab-df-spinner');\\n\",\n              \"      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\\n\",\n              \"    }\\n\",\n              \"    (() => {\\n\",\n              \"      let quickchartButtonEl =\\n\",\n              \"        document.querySelector('#df-59a3fb69-8001-41be-ac63-c616dc356aab button');\\n\",\n              \"      quickchartButtonEl.style.display =\\n\",\n              \"        google.colab.kernel.accessAllowed ? 
'block' : 'none';\\n\",\n              \"    })();\\n\",\n              \"  </script>\\n\",\n              \"</div>\\n\",\n              \"    </div>\\n\",\n              \"  </div>\\n\"\n            ]\n          },\n          \"metadata\": {}\n        },\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"\\n\",\n            \"\\n\",\n            \"Sum of ages of people in generator `people_1()` merged with generator `people_2()` is: 213\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [],\n      \"metadata\": {\n        \"id\": \"TApfkuNKtlt3\"\n      },\n      \"execution_count\": null,\n      \"outputs\": []\n    }\n  ]\n}"
  },
  {
    "path": "cohorts/2024/workshops/dlt_resources/homework_starter.ipynb",
    "content": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"name\": \"python3\",\n      \"display_name\": \"Python 3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    }\n  },\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# **Homework**: Data talks club data engineering zoomcamp Data loading workshop\\n\",\n        \"\\n\",\n        \"Hello folks, let's practice what we learned - Loading data with the best practices of data engineering.\\n\",\n        \"\\n\",\n        \"Here are the exercises we will do\\n\",\n        \"\\n\",\n        \"\\n\"\n      ],\n      \"metadata\": {\n        \"id\": \"mrTFv5nPClXh\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# 1. Use a generator\\n\",\n        \"\\n\",\n        \"Remember the concept of generator? Let's practice using them to futher our understanding of how they work.\\n\",\n        \"\\n\",\n        \"Let's define a generator and then run it as practice.\\n\",\n        \"\\n\",\n        \"**Answer the following questions:**\\n\",\n        \"\\n\",\n        \"- **Question 1: What is the sum of the outputs of the generator for limit = 5?**\\n\",\n        \"- **Question 2: What is the 13th number yielded**\\n\",\n        \"\\n\",\n        \"I suggest practicing these questions without GPT as the purpose is to further your learning.\"\n      ],\n      \"metadata\": {\n        \"id\": \"wLF4iXf-NR7t\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"def square_root_generator(limit):\\n\",\n        \"    n = 1\\n\",\n        \"    while n <= limit:\\n\",\n        \"        yield n ** 0.5\\n\",\n        \"        n += 1\\n\",\n        \"\\n\",\n        \"# Example usage:\\n\",\n        \"limit = 5\\n\",\n        \"generator = square_root_generator(limit)\\n\",\n        
\"\\n\",\n        \"for sqrt_value in generator:\\n\",\n        \"    print(sqrt_value)\\n\"\n      ],\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"wLng-bDJN4jf\",\n        \"outputId\": \"547683cb-5f56-4815-a903-d0d9578eb1f9\"\n      },\n      \"execution_count\": null,\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"1.0\\n\",\n            \"1.4142135623730951\\n\",\n            \"1.7320508075688772\\n\",\n            \"2.0\\n\",\n            \"2.23606797749979\\n\"\n          ]\n        }\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [],\n      \"metadata\": {\n        \"id\": \"xbe3q55zN43j\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# 2. Append a generator to a table with existing data\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"Below you have 2 generators. You will be tasked to load them to duckdb and answer some questions from the data\\n\",\n        \"\\n\",\n        \"1. Load the first generator and calculate the sum of ages of all people. Make sure to only load it once.\\n\",\n        \"2. Append the second generator to the same table as the first.\\n\",\n        \"3. 
**After correctly appending the data, calculate the sum of all ages of people.**\\n\",\n        \"\\n\",\n        \"\\n\"\n      ],\n      \"metadata\": {\n        \"id\": \"vjWhILzGJMpK\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"2MoaQcdLBEk6\",\n        \"outputId\": \"d2b93dc1-d83f-44ea-aeff-fdf51d75f7aa\"\n      },\n      \"outputs\": [\n        {\n          \"output_type\": \"stream\",\n          \"name\": \"stdout\",\n          \"text\": [\n            \"{'ID': 1, 'Name': 'Person_1', 'Age': 26, 'City': 'City_A'}\\n\",\n            \"{'ID': 2, 'Name': 'Person_2', 'Age': 27, 'City': 'City_A'}\\n\",\n            \"{'ID': 3, 'Name': 'Person_3', 'Age': 28, 'City': 'City_A'}\\n\",\n            \"{'ID': 4, 'Name': 'Person_4', 'Age': 29, 'City': 'City_A'}\\n\",\n            \"{'ID': 5, 'Name': 'Person_5', 'Age': 30, 'City': 'City_A'}\\n\",\n            \"{'ID': 3, 'Name': 'Person_3', 'Age': 33, 'City': 'City_B', 'Occupation': 'Job_3'}\\n\",\n            \"{'ID': 4, 'Name': 'Person_4', 'Age': 34, 'City': 'City_B', 'Occupation': 'Job_4'}\\n\",\n            \"{'ID': 5, 'Name': 'Person_5', 'Age': 35, 'City': 'City_B', 'Occupation': 'Job_5'}\\n\",\n            \"{'ID': 6, 'Name': 'Person_6', 'Age': 36, 'City': 'City_B', 'Occupation': 'Job_6'}\\n\",\n            \"{'ID': 7, 'Name': 'Person_7', 'Age': 37, 'City': 'City_B', 'Occupation': 'Job_7'}\\n\",\n            \"{'ID': 8, 'Name': 'Person_8', 'Age': 38, 'City': 'City_B', 'Occupation': 'Job_8'}\\n\"\n          ]\n        }\n      ],\n      \"source\": [\n        \"def people_1():\\n\",\n        \"    for i in range(1, 6):\\n\",\n        \"        yield {\\\"ID\\\": i, \\\"Name\\\": f\\\"Person_{i}\\\", \\\"Age\\\": 25 + i, \\\"City\\\": \\\"City_A\\\"}\\n\",\n        \"\\n\",\n        \"for person in people_1():\\n\",\n        \"    
print(person)\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"def people_2():\\n\",\n        \"    for i in range(3, 9):\\n\",\n        \"        yield {\\\"ID\\\": i, \\\"Name\\\": f\\\"Person_{i}\\\", \\\"Age\\\": 30 + i, \\\"City\\\": \\\"City_B\\\", \\\"Occupation\\\": f\\\"Job_{i}\\\"}\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"for person in people_2():\\n\",\n        \"    print(person)\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [],\n      \"metadata\": {\n        \"id\": \"vtdTIm4fvQCN\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# 3. Merge a generator\\n\",\n        \"\\n\",\n        \"Re-use the generators from Exercise 2.\\n\",\n        \"\\n\",\n        \"A table's primary key needs to be created from the start, so load your data to a new table with primary key ID.\\n\",\n        \"\\n\",\n        \"Load your first generator first, and then load the second one with merge. Since they have overlapping IDs, some of the records from the first load should be replaced by the ones from the second load.\\n\",\n        \"\\n\",\n        \"After loading, you should have a total of 8 records, and ID 3 should have age 33.\\n\",\n        \"\\n\",\n        \"Question: **Calculate the sum of ages of all the people loaded as described above.**\\n\"\n      ],\n      \"metadata\": {\n        \"id\": \"pY4cFAWOSwN1\"\n      }\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"# Solution: First make sure that the following modules are installed:\"\n      ],\n      \"metadata\": {\n        \"id\": \"kKB2GTB9oVjr\"\n      }\n    },\n    {\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"%%capture\\n\",\n        \"# Install the dependencies (the %%capture cell magic must be the first line of the cell)\\n\",\n        \"!pip install dlt[duckdb]\"\n      ],\n      \"metadata\": {\n        \"id\": \"xTVvtyqrfVNq\"\n      },\n      \"execution_count\": null,\n      \"outputs\": []\n    },\n    
{\n      \"cell_type\": \"code\",\n      \"source\": [\n        \"# to do: homework :)\"\n      ],\n      \"metadata\": {\n        \"id\": \"a2-PRBAkGC2K\"\n      },\n      \"execution_count\": null,\n      \"outputs\": []\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"source\": [\n        \"Questions? Difficulties? We are here to help.\\n\",\n        \"- DTC data engineering course channel: https://datatalks-club.slack.com/archives/C01FABYF2RG\\n\",\n        \"- dlt's DTC cohort channel: https://dlthub-community.slack.com/archives/C06GAEX2VNX\"\n      ],\n      \"metadata\": {\n        \"id\": \"PoTJu4kbGG0z\"\n      }\n    }\n  ]\n}"
  },
  {
    "path": "cohorts/2024/workshops/rising-wave.md",
    "content": "<p align=\"center\">\n  <picture>\n    <source srcset=\"https://github.com/risingwavelabs/risingwave/blob/main/.github/RisingWave-logo-dark.svg\" width=\"500px\" media=\"(prefers-color-scheme: dark)\">\n    <img src=\"https://github.com/risingwavelabs/risingwave/blob/main/.github/RisingWave-logo-light.svg\" width=\"500px\">\n  </picture>\n</p>\n\n\n</div>\n\n<p align=\"center\">\n  <a\n    href=\"https://docs.risingwave.com/\"\n    target=\"_blank\"\n  ><b>Documentation</b></a>&nbsp;&nbsp;&nbsp;📑&nbsp;&nbsp;&nbsp;\n  <a\n    href=\"https://tutorials.risingwave.com/\"\n    target=\"_blank\"\n  ><b>Hands-on Tutorials</b></a>&nbsp;&nbsp;&nbsp;🎯&nbsp;&nbsp;&nbsp;\n  <a\n    href=\"https://cloud.risingwave.com/\"\n    target=\"_blank\"\n  ><b>RisingWave Cloud</b></a>&nbsp;&nbsp;&nbsp;🚀&nbsp;&nbsp;&nbsp;\n  <a\n    href=\"https://risingwave.com/slack\"\n    target=\"_blank\"\n  >\n    <b>Get Instant Help</b>\n  </a>\n</p>\n<div align=\"center\">\n  <a\n    href=\"https://risingwave.com/slack\"\n    target=\"_blank\"\n  >\n    <img alt=\"Slack\" src=\"https://badgen.net/badge/Slack/Join%20RisingWave/0abd59?icon=slack\" />\n  </a>\n  <a\n    href=\"https://twitter.com/risingwavelabs\"\n    target=\"_blank\"\n  >\n    <img alt=\"X\" src=\"https://img.shields.io/twitter/follow/risingwavelabs\" />\n  </a>\n  <a\n    href=\"https://www.youtube.com/@risingwave-labs\"\n    target=\"_blank\"\n  >\n    <img alt=\"YouTube\" src=\"https://img.shields.io/youtube/channel/views/UCsHwdyBRxBpmkA5RRd0YNEA\" />\n  </a>\n</div>\n\n## Stream processing with RisingWave\n\nIn this hands-on workshop, we’ll learn how to process real-time streaming data using SQL in RisingWave. The system we’ll use is [RisingWave](https://github.com/risingwavelabs/risingwave), an open-source SQL database for processing and managing streaming data. 
RisingWave’s user experience should feel familiar, as it’s fully wire-compatible with PostgreSQL.\n\n![RisingWave](https://raw.githubusercontent.com/risingwavelabs/risingwave-docs/main/docs/images/new_archi_grey.png)\n\n\n\nWe’ll cover the following topics in this workshop: \n\n- Why Stream Processing?\n- Stateless computation (Filters, Projections)\n- Stateful Computation (Aggregations, Joins)\n- Data Ingestion and Delivery\n\nRisingWave in 10 Minutes:\nhttps://tutorials.risingwave.com/docs/intro\n\nWorkshop video:\n\n<a href=\"https://youtube.com/live/L2BHFnZ6XjE\">\n  <img src=\"https://markdown-videos-api.jorgenkh.no/youtube/L2BHFnZ6XjE\" />\n</a>\n\n[Project Repository](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04)\n\n## Homework\n\n**Please set up the environment as described in [Getting Started](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04?tab=readme-ov-file#getting-started) and in the [Homework](https://github.com/risingwavelabs/risingwave-data-talks-workshop-2024-03-04/blob/main/homework.md#setting-up) setup section first.**\n\n### Question 0\n\n_This question is just a warm-up to introduce the dynamic filter pattern; please attempt it before viewing its solution._\n\nWhat are the dropoff taxi zones at the latest dropoff times?\n\nFor this part, we will use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/).\n\n<details>\n<summary>Solution</summary>\n\n```sql\nCREATE MATERIALIZED VIEW latest_dropoff_time AS\n    WITH t AS (\n        SELECT MAX(tpep_dropoff_datetime) AS latest_dropoff_time\n        FROM trip_data\n    )\n    SELECT taxi_zone.Zone as taxi_zone, latest_dropoff_time\n    FROM t,\n            trip_data\n    JOIN taxi_zone\n        ON trip_data.DOLocationID = taxi_zone.location_id\n    WHERE trip_data.tpep_dropoff_datetime = t.latest_dropoff_time;\n\n--    taxi_zone    | latest_dropoff_time\n-- ----------------+---------------------\n--  Midtown Center | 
2022-01-03 17:24:54\n-- (1 row)\n```\n\n</details>\n\n### Question 1\n\nCreate a materialized view to compute the average, min and max trip time **between each pair of taxi zones**.\n\nNote that we do not consider `a->b` and `b->a` as the same trip pair.\nFor example, you would treat the following trip pairs as different pairs:\n```plaintext\nYorkville East -> Steinway\nSteinway -> Yorkville East\n```\n\nFrom this MV, find the pair of taxi zones with the highest average trip time.\nYou may need to use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/) for this.\n\nBonus (no marks): Create an MV which can identify anomalies in the data. For example, a pair of zones where the average trip time is 1 minute,\nbut the max trip time is 10 or 20 minutes.\n\nOptions:\n1. Yorkville East, Steinway\n2. Murray Hill, Midwood\n3. East Flatbush/Farragut, East Harlem North\n4. Midtown Center, University Heights/Morris Heights\n\nP.S. Trip times between taxi zones are not treated as symmetric, i.e. `A -> B` and `B -> A` are considered different trips. This applies to subsequent questions as well.\n\n### Question 2\n\nRecreate the MV(s) from Question 1 to also find the **number of trips** for the pair of taxi zones with the highest average trip time.\n\nOptions:\n1. 5\n2. 3\n3. 10\n4. 1\n\n### Question 3\n\nIn the 17 hours leading up to the latest pickup time, what are the top 3 busiest zones in terms of number of pickups?\nFor example, if the latest pickup time is 2020-01-01 17:00:00,\nthen the query should return the top 3 busiest zones from 2020-01-01 00:00:00 to 2020-01-01 17:00:00.\n\nHINT: You can use the [dynamic filter pattern](https://docs.risingwave.com/docs/current/sql-pattern-dynamic-filters/)\nto create a filter condition based on the latest pickup time.\n\nNOTE: For this question, `17 hours` was picked to ensure we have enough data to work with.\n\nOptions:\n1. 
Clinton East, Upper East Side North, Penn Station\n2. LaGuardia Airport, Lincoln Square East, JFK Airport\n3. Midtown Center, Upper East Side South, Upper East Side North\n4. LaGuardia Airport, Midtown Center, Upper East Side North\n\n\n## Submitting the solutions\n\n- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2024/homework/workshop2\n- Deadline: 11 March (Monday), 23:00 CET \n\n## Rewards 🥳\n\nEveryone who completes the homework will get a pen and a sticker, and 5 lucky winners will receive a T-shirt and other secret surprises!\nWe encourage you to share your achievements from this workshop on your socials and look forward to your submissions 😁\n\n- Follow us on **LinkedIn**: https://www.linkedin.com/company/risingwave\n- Follow us on **GitHub**: https://github.com/risingwavelabs/risingwave\n- Join us on **Slack**: https://risingwave-labs.com/slack\n\nSee you around!\n\n\n## Solution\n"
  },
  {
    "path": "cohorts/2025/01-docker-terraform/homework.md",
    "content": "# Module 1 Homework: Docker & SQL\n\nIn this homework we'll prepare the environment and practice\nDocker and SQL\n\nWhen submitting your homework, you will also need to include\na link to your GitHub repository or other public code-hosting\nsite.\n\nThis repository should contain the code for solving the homework. \n\nWhen your solution has SQL or shell commands and not code\n(e.g. python files) file format, include them directly in\nthe README file of your repository.\n\n\n## Question 1. Understanding docker first run \n\nRun docker with the `python:3.12.8` image in an interactive mode, use the entrypoint `bash`.\n\nWhat's the version of `pip` in the image?\n\n- 24.3.1\n- 24.2.1\n- 23.3.1\n- 23.2.1\n\n\n## Question 2. Understanding Docker networking and docker-compose\n\nGiven the following `docker-compose.yaml`, what is the `hostname` and `port` that **pgadmin** should use to connect to the postgres database?\n\n```yaml\nservices:\n  db:\n    container_name: postgres\n    image: postgres:17-alpine\n    environment:\n      POSTGRES_USER: 'postgres'\n      POSTGRES_PASSWORD: 'postgres'\n      POSTGRES_DB: 'ny_taxi'\n    ports:\n      - '5433:5432'\n    volumes:\n      - vol-pgdata:/var/lib/postgresql/data\n\n  pgadmin:\n    container_name: pgadmin\n    image: dpage/pgadmin4:latest\n    environment:\n      PGADMIN_DEFAULT_EMAIL: \"pgadmin@pgadmin.com\"\n      PGADMIN_DEFAULT_PASSWORD: \"pgadmin\"\n    ports:\n      - \"8080:80\"\n    volumes:\n      - vol-pgadmin_data:/var/lib/pgadmin  \n\nvolumes:\n  vol-pgdata:\n    name: vol-pgdata\n  vol-pgadmin_data:\n    name: vol-pgadmin_data\n```\n\n- postgres:5433\n- localhost:5432\n- db:5433\n- postgres:5432\n- db:5432\n\nIf there are more than one answers, select only one of them\n\n##  Prepare Postgres\n\nRun Postgres and load data as shown in the videos\nWe'll use the green taxi trips from October 2019:\n\n```bash\nwget 
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz\n```\n\nYou will also need the dataset with zones:\n\n```bash\nwget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv\n```\n\nDownload this data and put it into Postgres.\n\nYou can use the code from the course. It's up to you whether\nyou want to use Jupyter or a python script.\n\n## Question 3. Trip Segmentation Count\n\nDuring the period of October 1st 2019 (inclusive) and November 1st 2019 (exclusive), how many trips, **respectively**, happened:\n1. Up to 1 mile\n2. In between 1 (exclusive) and 3 miles (inclusive),\n3. In between 3 (exclusive) and 7 miles (inclusive),\n4. In between 7 (exclusive) and 10 miles (inclusive),\n5. Over 10 miles \n\nAnswers:\n\n- 104,802;  197,670;  110,612;  27,831;  35,281\n- 104,802;  198,924;  109,603;  27,678;  35,189\n- 104,793;  201,407;  110,612;  27,831;  35,281\n- 104,793;  202,661;  109,603;  27,678;  35,189\n- 104,838;  199,013;  109,645;  27,688;  35,202\n\n\n## Question 4. Longest trip for each day\n\nWhich was the pick up day with the longest trip distance?\nUse the pick up time for your calculations.\n\nTip: For every day, we only care about one single trip with the longest distance. \n\n- 2019-10-11\n- 2019-10-24\n- 2019-10-26\n- 2019-10-31\n\n\n## Question 5. Three biggest pickup zones\n\nWhich were the top pickup locations with over 13,000 in\n`total_amount` (across all trips) for 2019-10-18?\n\nConsider only `lpep_pickup_datetime` when filtering by date.\n \n- East Harlem North, East Harlem South, Morningside Heights\n- East Harlem North, Morningside Heights\n- Morningside Heights, Astoria Park, East Harlem South\n- Bedford, East Harlem North, Astoria Park\n\n\n## Question 6. 
Largest tip\n\nFor the passengers picked up in October 2019 in the zone\nnamed \"East Harlem North\", which was the drop-off zone that had\nthe largest tip?\n\nNote: it's `tip`, not `trip`\n\nWe need the name of the zone, not the ID.\n\n- Yorkville West\n- JFK Airport\n- East Harlem North\n- East Harlem South\n\n\n## Terraform\n\nIn this section of the homework, we'll prepare the environment by creating resources in GCP with Terraform.\n\nIn your VM on GCP/Laptop/GitHub Codespace install Terraform. \nCopy the files from the course repo\n[here](../../../01-docker-terraform/1_terraform_gcp/terraform) to your VM/Laptop/GitHub Codespace.\n\nModify the files as necessary to create a GCP Bucket and BigQuery Dataset.\n\n\n## Question 7. Terraform Workflow\n\nWhich of the following sequences, **respectively**, describes the workflow for: \n1. Downloading the provider plugins and setting up backend,\n2. Generating proposed changes and auto-executing the plan\n3. Removing all resources managed by Terraform\n\nAnswers:\n- terraform import, terraform apply -y, terraform destroy\n- terraform init, terraform plan -auto-apply, terraform rm\n- terraform init, terraform run -auto-approve, terraform destroy\n- terraform init, terraform apply -auto-approve, terraform destroy\n- terraform import, terraform apply -y, terraform rm\n\n\n## Submitting the solutions\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw1\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/README.md",
    "content": "# Workflow Orchestration\n\nWelcome to Module 2 of the Data Engineering Zoomcamp! This week, we’ll dive into workflow orchestration using [Kestra](https://go.kestra.io/de-zoomcamp/github). \n\nKestra is an open-source, event-driven orchestration platform that simplifies building both scheduled and event-driven workflows. By adopting Infrastructure as Code practices for data and process orchestration, Kestra enables you to build reliable workflows with just a few lines of YAML.\n\n> [!NOTE]  \n>You can find all videos for this week in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist).\n\n---\n\n# Course Structure\n\n## 1. Conceptual Material: Introduction to Orchestration and Kestra\n\nIn this section, you’ll learn the foundations of workflow orchestration, its importance, and how Kestra fits into the orchestration landscape.\n\n### Videos\n- **2.2.1 - Introduction to Workflow Orchestration**  \n  [![2.2.1 - Workflow Orchestration Introduction](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FNp6QmmcgLCs)](https://youtu.be/Np6QmmcgLCs)\n\n- **2.2.2 - Learn the Concepts of Kestra**  \n  [![Learn Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fo79n-EVpics)](https://youtu.be/o79n-EVpics)\n\n### Resources\n- [Quickstart Guide](https://go.kestra.io/de-zoomcamp/quickstart)\n- [Install Kestra with Docker Compose](https://go.kestra.io/de-zoomcamp/docker-compose)\n- [Tutorial](https://go.kestra.io/de-zoomcamp/tutorial)\n- [What is an Orchestrator?](https://go.kestra.io/de-zoomcamp/what-is-an-orchestrator)\n\n---\n\n## 2. Hands-On Coding Project: Build Data Pipelines with Kestra\n\nThis week, we're gonna build ETL pipelines for Yellow and Green Taxi data from NYC’s Taxi and Limousine Commission (TLC). You will:\n1. Extract data from [CSV files](https://github.com/DataTalksClub/nyc-tlc-data/releases).\n2. Load it into Postgres or Google Cloud (GCS + BigQuery).\n3. 
Explore scheduling and backfilling workflows.\n\n>[!NOTE] \nIf you’re using the PostgreSQL and PgAdmin docker setup from Module 1 for this week’s Kestra Workflow Orchestration exercise, ensure your PostgreSQL image version is 15 or later (preferably the latest). The MERGE statement, introduced in PostgreSQL 15, won’t work on earlier versions and will likely cause syntax errors in your kestra flows.\n\n### File Structure\n\nThe project is organized as follows:\n```\n.\n├── flows/\n│   ├── 01_getting_started_data_pipeline.yaml\n│   ├── 02_postgres_taxi.yaml\n│   ├── 02_postgres_taxi_scheduled.yaml\n│   ├── 03_postgres_dbt.yaml\n│   ├── 04_gcp_kv.yaml\n│   ├── 05_gcp_setup.yaml\n│   ├── 06_gcp_taxi.yaml\n│   ├── 06_gcp_taxi_scheduled.yaml\n│   └── 07_gcp_dbt.yaml\n```\n\n### Setup Kestra\n\nWe'll set up Kestra using Docker Compose containing one container for the Kestra server and another for the Postgres database:\n\n```bash\ncd 02-workflow-orchestration/docker/combined\ndocker compose up -d\n```\n\nOnce the container starts, you can access the Kestra UI at [http://localhost:8080](http://localhost:8080).\n\nIf you prefer to add flows programmatically using Kestra's API, run the following commands:\n\n```bash\ncurl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/01_getting_started_data_pipeline.yaml\ncurl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/02_postgres_taxi.yaml\ncurl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/02_postgres_taxi_scheduled.yaml\ncurl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/03_postgres_dbt.yaml\ncurl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/04_gcp_kv.yaml\ncurl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/05_gcp_setup.yaml\ncurl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/06_gcp_taxi.yaml\ncurl -X POST http://localhost:8080/api/v1/flows/import -F 
fileUpload=@flows/06_gcp_taxi_scheduled.yaml\ncurl -X POST http://localhost:8080/api/v1/flows/import -F fileUpload=@flows/07_gcp_dbt.yaml\n```\n\n---\n\n## 3. ETL Pipelines in Kestra: Detailed Walkthrough\n\n### Getting Started Pipeline\n\nThis introductory flow is added just to demonstrate a simple data pipeline which extracts data via HTTP REST API, transforms that data in Python and then queries it using DuckDB. For this stage, a new separate Postgres database is created for the exercises. \n\n**Note:** Check that `pgAdmin` isn't running on the same ports as Kestra. If so, check out the [FAQ](#troubleshooting-tips) at the bottom of the README.\n\n### Videos\n\n- **2.2.3 - Create an ETL Pipeline with Postgres in Kestra**   \n  [![Create an ETL Pipeline with Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FOkfLX28Ecjg%3Fsi%3DvKbIyWo1TtjpNnvt)](https://youtu.be/OkfLX28Ecjg?si=vKbIyWo1TtjpNnvt)\n- **2.2.4 - Manage Scheduling and Backfills using Postgres in Kestra**  \n  [![Manage Scheduling and Backfills using Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F_-li_z97zog%3Fsi%3DG6jZbkfJb3GAyqrd)](https://youtu.be/_-li_z97zog?si=G6jZbkfJb3GAyqrd)\n- **2.2.5 - Transform Data with dbt and Postgres in Kestra**  \n  [![Transform Data with dbt and Postgres in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FZLp2N6p2JjE%3Fsi%3DtWhcvq5w4lO8v1_p)](https://youtu.be/ZLp2N6p2JjE?si=tWhcvq5w4lO8v1_p)\n\n\n```mermaid\ngraph LR\n  Extract[Extract Data via HTTP REST API] --> Transform[Transform Data in Python]\n  Transform --> Query[Query Data with DuckDB]\n```\n\nAdd the flow [`01_getting_started_data_pipeline.yaml`](flows/01_getting_started_data_pipeline.yaml) from the UI if you haven't already and execute it to see the results. 
Inspect the Gantt and Logs tabs to understand the flow execution.\n\n### Local DB: Load Taxi Data to Postgres\n\nBefore we start loading data to GCP, we'll first play with the Yellow and Green Taxi data using a local Postgres database running in a Docker container. We'll create a new Postgres database for these examples using this [Docker Compose file](docker/postgres/docker-compose.yml). Download it into a new directory, navigate to it and run the following command to start it:\n\n```bash\ndocker compose up -d\n```\n\nThe flow will extract CSV data partitioned by year and month, create tables, load data to the monthly table, and finally merge the data to the final destination table.\n\n```mermaid\ngraph LR\n  Start[Select Year & Month] --> SetLabel[Set Labels]\n  SetLabel --> Extract[Extract CSV Data]\n  Extract -->|Taxi=Yellow| YellowFinalTable[Create Yellow Final Table]:::yellow\n  Extract -->|Taxi=Green| GreenFinalTable[Create Green Final Table]:::green\n  YellowFinalTable --> YellowMonthlyTable[Create Yellow Monthly Table]:::yellow\n  GreenFinalTable --> GreenMonthlyTable[Create Green Monthly Table]:::green\n  YellowMonthlyTable --> YellowCopyIn[Load Data to Monthly Table]:::yellow\n  GreenMonthlyTable --> GreenCopyIn[Load Data to Monthly Table]:::green\n  YellowCopyIn --> YellowMerge[Merge Yellow Data]:::yellow\n  GreenCopyIn --> GreenMerge[Merge Green Data]:::green\n\n  classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px;\n  classDef green fill:#32CD32,stroke:#000,stroke-width:1px;\n```\n\nThe flow code: [`02_postgres_taxi.yaml`](flows/02_postgres_taxi.yaml).\n\n\n> [!NOTE]  \n> The NYC Taxi and Limousine Commission (TLC) Trip Record Data provided on the [nyc.gov](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) website is currently available only in a Parquet format, but this is NOT the dataset we're going to use in this course. 
For the purpose of this course, we'll use the **CSV files** available [here on GitHub](https://github.com/DataTalksClub/nyc-tlc-data/releases). This is because the Parquet format can be challenging to understand by newcomers, and we want to make the course as accessible as possible — the CSV format can be easily introspected using tools like Excel or Google Sheets, or even a simple text editor.\n\n### Local DB: Learn Scheduling and Backfills\n\nWe can now schedule the same pipeline shown above to run daily at 9 AM UTC. We'll also demonstrate how to backfill the data pipeline to run on historical data.\n\nNote: given the large dataset, we'll backfill only data for the green taxi dataset for the year 2019.\n\nThe flow code: [`02_postgres_taxi_scheduled.yaml`](flows/02_postgres_taxi_scheduled.yaml).\n\n### Local DB: Orchestrate dbt Models (Optional)\n\nNow that we have raw data ingested into a local Postgres database, we can use dbt to transform the data into meaningful insights. The flow will sync the dbt models from Git to Kestra and run the `dbt build` command to build the models.\n\n```mermaid\ngraph LR\n  Start[Select dbt command] --> Sync[Sync Namespace Files]\n  Sync --> DbtBuild[Run dbt CLI]\n```\n\nThis gives you a quick showcase of dbt inside of Kestra so the homework tasks do not depend on it. The course will go into more detail of dbt in [Week 4](../04-analytics-engineering).\n\nThe flow code: [`03_postgres_dbt.yaml`](flows/03_postgres_dbt.yaml).\n\n### Resources\n- [pgAdmin Download](https://www.pgadmin.org/download/)\n- [Postgres DB Docker Compose](docker/postgres/docker-compose.yml)\n\n---\n\n## 4. ETL Pipelines in Kestra: Google Cloud Platform\n\nNow that you've learned how to build ETL pipelines locally using Postgres, we are ready to move to the cloud. In this section, we'll load the same Yellow and Green Taxi data to Google Cloud Platform (GCP) using: \n1. Google Cloud Storage (GCS) as a data lake  \n2. 
BigQuery as a data warehouse.\n\n### Videos\n\n- **2.2.6 - Create an ETL Pipeline with GCS and BigQuery in Kestra**  \n  [![Create an ETL Pipeline with BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FnKqjjLJ7YXs)](https://youtu.be/nKqjjLJ7YXs)\n- **2.2.7 - Manage Scheduling and Backfills using BigQuery in Kestra**   \n  [![Manage Scheduling and Backfills using BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FDoaZ5JWEkH0)](https://youtu.be/DoaZ5JWEkH0)\n- **2.2.8 - Transform Data with dbt and BigQuery in Kestra**   \n  [![Transform Data with dbt and BigQuery in Kestra](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FeF_EdV4A1Wk)](https://youtu.be/eF_EdV4A1Wk)\n\n### Setup Google Cloud Platform (GCP)\n\nBefore we start loading data to GCP, we need to set up the Google Cloud Platform. \n\nFirst, adjust the following flow [`04_gcp_kv.yaml`](flows/04_gcp_kv.yaml) to include your service account, GCP project ID, BigQuery dataset and GCS bucket name (_along with their location_) as KV Store values:\n- GCP_CREDS\n- GCP_PROJECT_ID\n- GCP_LOCATION\n- GCP_BUCKET_NAME\n- GCP_DATASET.\n\n\n> [!WARNING]  \n> The `GCP_CREDS` service account contains sensitive information. Ensure you keep it secure and do not commit it to Git. 
Keep it as secure as your passwords.\n\n### Create GCP Resources\n\nIf you haven't already created the GCS bucket and BigQuery dataset in the first week of the course, you can use this flow to create them: [`05_gcp_setup.yaml`](flows/05_gcp_setup.yaml).\n\n\n### GCP Workflow: Load Taxi Data to BigQuery\n\n```mermaid\ngraph LR\n  SetLabel[Set Labels] --> Extract[Extract CSV Data]\n  Extract --> UploadToGCS[Upload Data to GCS]\n  UploadToGCS -->|Taxi=Yellow| BQYellowTripdata[Main Yellow Tripdata Table]:::yellow\n  UploadToGCS -->|Taxi=Green| BQGreenTripdata[Main Green Tripdata Table]:::green\n  BQYellowTripdata --> BQYellowTableExt[External Table]:::yellow\n  BQGreenTripdata --> BQGreenTableExt[External Table]:::green\n  BQYellowTableExt --> BQYellowTableTmp[Monthly Table]:::yellow\n  BQGreenTableExt --> BQGreenTableTmp[Monthly Table]:::green\n  BQYellowTableTmp --> BQYellowMerge[Merge to Main Table]:::yellow\n  BQGreenTableTmp --> BQGreenMerge[Merge to Main Table]:::green\n  BQYellowMerge --> PurgeFiles[Purge Files]\n  BQGreenMerge --> PurgeFiles[Purge Files]\n\n  classDef yellow fill:#FFD700,stroke:#000,stroke-width:1px;\n  classDef green fill:#32CD32,stroke:#000,stroke-width:1px;\n```\n\nThe flow code: [`06_gcp_taxi.yaml`](flows/06_gcp_taxi.yaml).\n\n### GCP Workflow: Schedule and Backfill Full Dataset\n\nWe can now schedule the same pipeline shown above to run daily at 9 AM UTC for the green dataset and at 10 AM UTC for the yellow dataset. You can backfill historical data directly from the Kestra UI.\n\nSince we now process data in a cloud environment with infinitely scalable storage and compute, we can backfill the entire dataset for both the yellow and green taxi data without the risk of running out of resources on our local machine.\n\nThe flow code: [`06_gcp_taxi_scheduled.yaml`](flows/06_gcp_taxi_scheduled.yaml).\n\n### GCP Workflow: Orchestrate dbt Models (Optional)\n\nNow that we have raw data ingested into BigQuery, we can use dbt to transform that data. 
The flow will sync the dbt models from Git to Kestra and run the `dbt build` command to build the models:\n\n```mermaid\ngraph LR\n  Start[Select dbt command] --> Sync[Sync Namespace Files]\n  Sync --> Build[Run dbt Build Command]\n```\n\nThis gives you a quick showcase of dbt inside of Kestra so the homework tasks do not depend on it. The course will go into more detail of dbt in [Week 4](../04-analytics-engineering).\n\nThe flow code: [`07_gcp_dbt.yaml`](flows/07_gcp_dbt.yaml).\n\n---\n\n## 5. Bonus: Deploy to the Cloud (Optional)\n\nNow that we've got our ETL pipeline working both locally and in the cloud, we can deploy Kestra to the cloud so it can continue to orchestrate our ETL pipelines monthly with our configured schedules. We'll cover how you can install Kestra on Google Cloud in Production, and automatically sync and deploy your workflows from a Git repository.\n\nNote: When committing your workflows to Kestra, make sure your workflow doesn't contain any sensitive information. You can use [Secrets](https://go.kestra.io/de-zoomcamp/secret) and the [KV Store](https://go.kestra.io/de-zoomcamp/kv-store) to keep sensitive data out of your workflow logic.\n\n### Videos\n\n- **2.2.9 - Deploy Workflows to the Cloud with Git**   \n  [![Deploy Workflows to the Cloud with Git](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fl-wC71tI3co)](https://youtu.be/l-wC71tI3co)\n\nResources\n\n- [Install Kestra on Google Cloud](https://go.kestra.io/de-zoomcamp/gcp-install)\n- [Moving from Development to Production](https://go.kestra.io/de-zoomcamp/dev-to-prod)\n- [Using Git in Kestra](https://go.kestra.io/de-zoomcamp/git)\n- [Deploy Flows with GitHub Actions](https://go.kestra.io/de-zoomcamp/deploy-github-actions)\n\n## 6. 
Additional Resources 📚\n\n- Check [Kestra Docs](https://go.kestra.io/de-zoomcamp/docs)\n- Explore our [Blueprints](https://go.kestra.io/de-zoomcamp/blueprints) library\n- Browse over 600 [plugins](https://go.kestra.io/de-zoomcamp/plugins) available in Kestra\n- Give us a star on [GitHub](https://go.kestra.io/de-zoomcamp/github)\n- Join our [Slack community](https://go.kestra.io/de-zoomcamp/slack) if you have any questions\n- Find all the videos in this [YouTube Playlist](https://go.kestra.io/de-zoomcamp/yt-playlist)\n\n\n### Troubleshooting tips\n\nIf you face any issues with Kestra flows in Module 2, make sure to use the following Docker images/ports:\n- `kestra/kestra:latest` is correct = latest stable release, while `kestra/kestra:develop` is incorrect as this is a bleeding-edge development version that might contain bugs\n- `postgres:latest` — make sure to use Postgres image, which uses **PostgreSQL 15** or higher\n- If you run `pgAdmin` or something else on port 8080, you can adjust Kestra docker-compose to use a different port, e.g. change port mapping to 18080 instead of 8080, and then access Kestra UI in your browser from http://localhost:18080/ instead of from http://localhost:8080/\n\nIf you're using Linux, you might encounter `Connection Refused` errors when connecting to the Postgres DB from within Kestra. This is because `host.docker.internal` works differently on Linux. Using the modified Docker Compose file below, you can run both Kestra and its dedicated Postgres DB, as well as the Postgres DB for the exercises all together. You can access it within Kestra by referring to the container name `postgres_zoomcamp` instead of `host.docker.internal` in `pluginDefaults`. This applies to pgAdmin as well. 
If you'd prefer to keep it in separate Docker Compose files, you'll need to setup a Docker network so that they can communicate with each other.\n\n<details>\n<summary>Docker Compose Example</summary>\n\nThis Docker Compose has the Zoomcamp DB container and pgAdmin container added to it, so it's all in one file.\n\nChanges include:\n- New `volume` for the Zoomcamp DB container\n- Zoomcamp DB container is added and renamed to prevent clashes with the Kestra DB container\n- Depends on condition is added to make sure Kestra is running before it starts\n- pgAdmin is added and running on Port 8085 so it doesn't clash wit Kestra which uses 8080 and 8081\n\n```yaml\nvolumes:\n  postgres-data:\n    driver: local\n  kestra-data:\n    driver: local\n  zoomcamp-data:\n    driver: local\n\nservices:\n  postgres:\n    image: postgres\n    volumes:\n      - postgres-data:/var/lib/postgresql/data\n    environment:\n      POSTGRES_DB: kestra\n      POSTGRES_USER: kestra\n      POSTGRES_PASSWORD: k3str4\n    healthcheck:\n      test: [\"CMD-SHELL\", \"pg_isready -d $${POSTGRES_DB} -U $${POSTGRES_USER}\"]\n      interval: 30s\n      timeout: 10s\n      retries: 10\n\n  kestra:\n    image: kestra/kestra:latest\n    pull_policy: always\n    # Note that this setup with a root user is intended for development purpose.\n    # Our base image runs without root, but the Docker Compose implementation needs root to access the Docker socket\n    # To run Kestra in a rootless mode in production, see: https://kestra.io/docs/installation/podman-compose\n    user: \"root\"\n    command: server standalone\n    volumes:\n      - kestra-data:/app/storage\n      - /var/run/docker.sock:/var/run/docker.sock\n      - /tmp/kestra-wd:/tmp/kestra-wd\n    environment:\n      KESTRA_CONFIGURATION: |\n        datasources:\n          postgres:\n            url: jdbc:postgresql://postgres:5432/kestra\n            driverClassName: org.postgresql.Driver\n            username: kestra\n            password: k3str4\n 
       kestra:\n          server:\n            basicAuth:\n              enabled: false\n              username: \"admin@kestra.io\" # it must be a valid email address\n              password: kestra\n          repository:\n            type: postgres\n          storage:\n            type: local\n            local:\n              basePath: \"/app/storage\"\n          queue:\n            type: postgres\n          tasks:\n            tmpDir:\n              path: /tmp/kestra-wd/tmp\n          url: http://localhost:8080/\n    ports:\n      - \"8080:8080\"\n      - \"8081:8081\"\n    depends_on:\n      postgres:\n        condition: service_started\n    \n  postgres_zoomcamp:\n    image: postgres\n    environment:\n      POSTGRES_USER: kestra\n      POSTGRES_PASSWORD: k3str4\n      POSTGRES_DB: postgres-zoomcamp\n    ports:\n      - \"5432:5432\"\n    volumes:\n      - zoomcamp-data:/var/lib/postgresql/data\n    depends_on:\n      kestra:\n        condition: service_started\n\n  pgadmin:\n    image: dpage/pgadmin4\n    environment:\n      - PGADMIN_DEFAULT_EMAIL=admin@admin.com\n      - PGADMIN_DEFAULT_PASSWORD=root\n    ports:\n      - \"8085:80\"\n    depends_on:\n      postgres_zoomcamp:\n        condition: service_started\n```\n\n</details>\n\nIf you are still facing any issues, stop and remove your existing Kestra + Postgres containers and start them again using `docker-compose up -d`. 
If this doesn't help, post your question on the DataTalksClub Slack or on Kestra's Slack http://kestra.io/slack.\n\n- **DE Zoomcamp FAQ - PostgresDB Setup and Installing pgAdmin**   \n  [![DE Zoomcamp FAQ - PostgresDB Setup and Installing pgAdmin](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2FywAPYNYFaB4%3Fsi%3D5X9AD0nFAT2WLWgS)](https://youtu.be/ywAPYNYFaB4?si=5X9AD0nFAT2WLWgS)\n- **DE Zoomcamp FAQ - Ports and Images**  \n  [![DE Zoomcamp FAQ - Ports and Images](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2Fl2M2mW76RIU%3Fsi%3DoqyZ7KUaI27vi90V)](https://youtu.be/l2M2mW76RIU?si=oqyZ7KUaI27vi90V)\n- **DE Zoomcamp FAQ - Docker Setup**  \n  [![DE Zoomcamp FAQ - Docker Setup](https://markdown-videos-api.jorgenkh.no/url?url=https%3A%2F%2Fyoutu.be%2F73g6qJN0HcM)](https://youtu.be/73g6qJN0HcM)\n\n\n\nIf you encounter errors similar to:\n```\nBigQueryError{reason=invalid, location=null, \nmessage=Error while reading table: kestra-sandbox.zooomcamp.yellow_tripdata_2020_01, \nerror message: CSV table references column position 17, but line contains only 14 columns.; \nline_number: 2103925 byte_offset_to_start_of_line: 194863028 \ncolumn_index: 17 column_name: \"congestion_surcharge\" column_type: NUMERIC \nFile: gs://anna-geller/yellow_tripdata_2020-01.csv}\n```\n\nIt means that the CSV file you're trying to load into BigQuery has a mismatch in the number of columns between the external source table (i.e. file in GCS) and the destination table in BigQuery. This can happen when, due to network/transfer issues, the file is not fully downloaded from GitHub or not correctly uploaded to GCS. The error suggests schema issues, but that's not the case. Simply rerun the entire execution, including redownloading the CSV file and reuploading it to GCS. 
This should resolve the issue.\n\n---\n\n## Homework \n\nSee the [2025 cohort folder](../cohorts/2025/02-workflow-orchestration/homework.md)\n\n\n---\n\n# Community notes\n\nDid you take notes? You can share them by creating a PR to this file! \n\n* [Notes from Manuel Guerra](https://github.com/ManuelGuerra1987/data-engineering-zoomcamp-notes/blob/main/2_Workflow-Orchestration-(Kestra)/README.md)\n* [Notes from Horeb Seidou](https://spotted-hardhat-eea.notion.site/Week-2-Workflow-Orchestration-17129780dc4a80148debf61e6453fffe)\n* [Notes from Livia](https://docs.google.com/document/d/1Y_QMonvEtFPbXIzmdpCSVsKNC1BWAHFBA1mpK9qaZko/edit?usp=sharing)\n* [2025 Gitbook Notes from Tinker0425](https://data-engineering-zoomcamp-2025-t.gitbook.io/tinker0425/module-2/introduction-to-module-2)\n* [Notes from Mercy Markus: Linux/Fedora Tweaks and Tips](https://mercymarkus.com/posts/2025/series/dtc-dez-jan-2025/dtc-dez-2025-module-2/)\n* Add your notes above this line\n\n---\n\n# Previous Cohorts\n\n* 2022: [notes](../cohorts/2022/week_2_data_ingestion#community-notes) and [videos](../cohorts/2022/week_2_data_ingestion)\n* 2023: [notes](../cohorts/2023/week_2_workflow_orchestration#community-notes) and [videos](../cohorts/2023/week_2_workflow_orchestration)\n* 2024: [notes](../cohorts/2024/02-workflow-orchestration#community-notes) and [videos](../cohorts/2024/02-workflow-orchestration)\n\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/flows/01_getting_started_data_pipeline.yaml",
    "content": "id: 01_getting_started_data_pipeline\nnamespace: zoomcamp\n\ninputs:\n  - id: columns_to_keep\n    type: ARRAY\n    itemType: STRING\n    defaults:\n      - brand\n      - price\n\ntasks:\n  - id: extract\n    type: io.kestra.plugin.core.http.Download\n    uri: https://dummyjson.com/products\n\n  - id: transform\n    type: io.kestra.plugin.scripts.python.Script\n    containerImage: python:3.11-alpine\n    inputFiles:\n      data.json: \"{{outputs.extract.uri}}\"\n    outputFiles:\n      - \"*.json\"\n    env:\n      COLUMNS_TO_KEEP: \"{{inputs.columns_to_keep}}\"\n    script: |\n      import json\n      import os\n\n      columns_to_keep_str = os.getenv(\"COLUMNS_TO_KEEP\")\n      columns_to_keep = json.loads(columns_to_keep_str)\n\n      with open(\"data.json\", \"r\") as file:\n          data = json.load(file)\n\n      filtered_data = [\n          {column: product.get(column, \"N/A\") for column in columns_to_keep}\n          for product in data[\"products\"]\n      ]\n\n      with open(\"products.json\", \"w\") as file:\n          json.dump(filtered_data, file, indent=4)\n\n  - id: query\n    type: io.kestra.plugin.jdbc.duckdb.Query\n    inputFiles:\n      products.json: \"{{outputs.transform.outputFiles['products.json']}}\"\n    sql: |\n      INSTALL json;\n      LOAD json;\n      SELECT brand, round(avg(price), 2) as avg_price\n      FROM read_json_auto('{{workingDir}}/products.json')\n      GROUP BY brand\n      ORDER BY avg_price DESC;\n    fetchType: STORE\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/flows/02_postgres_taxi.yaml",
    "content": "id: 02_postgres_taxi\nnamespace: zoomcamp\ndescription: |\n  The CSV Data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases\n\ninputs:\n  - id: taxi\n    type: SELECT\n    displayName: Select taxi type\n    values: [yellow, green]\n    defaults: yellow\n\n  - id: year\n    type: SELECT\n    displayName: Select year\n    values: [\"2019\", \"2020\"]\n    defaults: \"2019\"\n\n  - id: month\n    type: SELECT\n    displayName: Select month\n    values: [\"01\", \"02\", \"03\", \"04\", \"05\", \"06\", \"07\", \"08\", \"09\", \"10\", \"11\", \"12\"]\n    defaults: \"01\"\n\nvariables:\n  file: \"{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv\"\n  staging_table: \"public.{{inputs.taxi}}_tripdata_staging\"\n  table: \"public.{{inputs.taxi}}_tripdata\"\n  data: \"{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}\"\n\ntasks:\n  - id: set_label\n    type: io.kestra.plugin.core.execution.Labels\n    labels:\n      file: \"{{render(vars.file)}}\"\n      taxi: \"{{inputs.taxi}}\"\n\n  - id: extract\n    type: io.kestra.plugin.scripts.shell.Commands\n    outputFiles:\n      - \"*.csv\"\n    taskRunner:\n      type: io.kestra.plugin.core.runner.Process\n    commands:\n      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}\n\n  - id: if_yellow_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'yellow'}}\"\n    then:\n      - id: yellow_create_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              tpep_pickup_datetime   timestamp,\n              tpep_dropoff_datetime  timestamp,\n              passenger_count        
integer,\n              trip_distance          double precision,\n              RatecodeID             text,\n              store_and_fwd_flag     text,\n              PULocationID           text,\n              DOLocationID           text,\n              payment_type           integer,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              congestion_surcharge   double precision\n          );\n\n      - id: yellow_create_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              tpep_pickup_datetime   timestamp,\n              tpep_dropoff_datetime  timestamp,\n              passenger_count        integer,\n              trip_distance          double precision,\n              RatecodeID             text,\n              store_and_fwd_flag     text,\n              PULocationID           text,\n              DOLocationID           text,\n              payment_type           integer,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              congestion_surcharge   double precision\n          );\n\n      - id: yellow_truncate_staging_table\n  
      type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          TRUNCATE TABLE {{render(vars.staging_table)}};\n\n      - id: yellow_copy_in_to_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.CopyIn\n        format: CSV\n        from: \"{{render(vars.data)}}\"\n        table: \"{{render(vars.staging_table)}}\"\n        header: true\n        columns: [VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge]\n\n      - id: yellow_add_unique_id_and_filename\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          UPDATE {{render(vars.staging_table)}}\n          SET \n            unique_row_id = md5(\n              COALESCE(CAST(VendorID AS text), '') ||\n              COALESCE(CAST(tpep_pickup_datetime AS text), '') || \n              COALESCE(CAST(tpep_dropoff_datetime AS text), '') || \n              COALESCE(PULocationID, '') || \n              COALESCE(DOLocationID, '') || \n              COALESCE(CAST(fare_amount AS text), '') || \n              COALESCE(CAST(trip_distance AS text), '')      \n            ),\n            filename = '{{render(vars.file)}}';\n\n      - id: yellow_merge_data\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          MERGE INTO {{render(vars.table)}} AS T\n          USING {{render(vars.staging_table)}} AS S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (\n              unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,\n              passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID,\n              DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,\n              improvement_surcharge, total_amount, 
congestion_surcharge\n            )\n            VALUES (\n              S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime,\n              S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID,\n              S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount,\n              S.improvement_surcharge, S.total_amount, S.congestion_surcharge\n            );\n\n  - id: if_green_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'green'}}\"\n    then:\n      - id: green_create_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              lpep_pickup_datetime   timestamp,\n              lpep_dropoff_datetime  timestamp,\n              store_and_fwd_flag     text,\n              RatecodeID             text,\n              PULocationID           text,\n              DOLocationID           text,\n              passenger_count        integer,\n              trip_distance          double precision,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              ehail_fee              double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              payment_type           integer,\n              trip_type              integer,\n              congestion_surcharge   double precision\n          );\n\n      - id: green_create_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n 
         CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              lpep_pickup_datetime   timestamp,\n              lpep_dropoff_datetime  timestamp,\n              store_and_fwd_flag     text,\n              RatecodeID             text,\n              PULocationID           text,\n              DOLocationID           text,\n              passenger_count        integer,\n              trip_distance          double precision,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              ehail_fee              double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              payment_type           integer,\n              trip_type              integer,\n              congestion_surcharge   double precision\n          );\n\n      - id: green_truncate_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          TRUNCATE TABLE {{render(vars.staging_table)}};\n\n      - id: green_copy_in_to_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.CopyIn\n        format: CSV\n        from: \"{{render(vars.data)}}\"\n        table: \"{{render(vars.staging_table)}}\"\n        header: true\n        columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge]\n\n      - id: green_add_unique_id_and_filename\n        type: 
io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          UPDATE {{render(vars.staging_table)}}\n          SET \n            unique_row_id = md5(\n              COALESCE(CAST(VendorID AS text), '') ||\n              COALESCE(CAST(lpep_pickup_datetime AS text), '') || \n              COALESCE(CAST(lpep_dropoff_datetime AS text), '') || \n              COALESCE(PULocationID, '') || \n              COALESCE(DOLocationID, '') || \n              COALESCE(CAST(fare_amount AS text), '') || \n              COALESCE(CAST(trip_distance AS text), '')      \n            ),\n            filename = '{{render(vars.file)}}';\n\n      - id: green_merge_data\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          MERGE INTO {{render(vars.table)}} AS T\n          USING {{render(vars.staging_table)}} AS S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (\n              unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime,\n              store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count,\n              trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee,\n              improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge\n            )\n            VALUES (\n              S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime,\n              S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count,\n              S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee,\n              S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge\n            );\n  \n  - id: purge_files\n    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles\n    description: This will remove output files. 
If you'd like to explore Kestra outputs, disable this task.\n\npluginDefaults:\n  - type: io.kestra.plugin.jdbc.postgresql\n    values:\n      url: jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp\n      username: kestra\n      password: k3str4\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/flows/02_postgres_taxi_scheduled.yaml",
    "content": "id: 02_postgres_taxi_scheduled\nnamespace: zoomcamp\ndescription: |\n  Best to add a label `backfill:true` from the UI to track executions created via a backfill.\n  CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases\n\nconcurrency:\n  limit: 1\n\ninputs:\n  - id: taxi\n    type: SELECT\n    displayName: Select taxi type\n    values: [yellow, green]\n    defaults: yellow\n\nvariables:\n  file: \"{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv\"\n  staging_table: \"public.{{inputs.taxi}}_tripdata_staging\"\n  table: \"public.{{inputs.taxi}}_tripdata\"\n  data: \"{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}\"\n\ntasks:\n  - id: set_label\n    type: io.kestra.plugin.core.execution.Labels\n    labels:\n      file: \"{{render(vars.file)}}\"\n      taxi: \"{{inputs.taxi}}\"\n\n  - id: extract\n    type: io.kestra.plugin.scripts.shell.Commands\n    outputFiles:\n      - \"*.csv\"\n    taskRunner:\n      type: io.kestra.plugin.core.runner.Process\n    commands:\n      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}\n\n  - id: if_yellow_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'yellow'}}\"\n    then:\n      - id: yellow_create_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              tpep_pickup_datetime   timestamp,\n              tpep_dropoff_datetime  timestamp,\n              passenger_count        integer,\n              trip_distance          double precision,\n              RatecodeID             text,\n              store_and_fwd_flag     text,\n              PULocationID       
    text,\n              DOLocationID           text,\n              payment_type           integer,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              congestion_surcharge   double precision\n          );\n\n      - id: yellow_create_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              tpep_pickup_datetime   timestamp,\n              tpep_dropoff_datetime  timestamp,\n              passenger_count        integer,\n              trip_distance          double precision,\n              RatecodeID             text,\n              store_and_fwd_flag     text,\n              PULocationID           text,\n              DOLocationID           text,\n              payment_type           integer,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              congestion_surcharge   double precision\n          );\n\n      - id: yellow_truncate_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          TRUNCATE TABLE {{render(vars.staging_table)}};\n\n      - id: yellow_copy_in_to_staging_table\n        
type: io.kestra.plugin.jdbc.postgresql.CopyIn\n        format: CSV\n        from: \"{{render(vars.data)}}\"\n        table: \"{{render(vars.staging_table)}}\"\n        header: true\n        columns: [VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge]\n\n      - id: yellow_add_unique_id_and_filename\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          UPDATE {{render(vars.staging_table)}}\n          SET \n            unique_row_id = md5(\n              COALESCE(CAST(VendorID AS text), '') ||\n              COALESCE(CAST(tpep_pickup_datetime AS text), '') || \n              COALESCE(CAST(tpep_dropoff_datetime AS text), '') || \n              COALESCE(PULocationID, '') || \n              COALESCE(DOLocationID, '') || \n              COALESCE(CAST(fare_amount AS text), '') || \n              COALESCE(CAST(trip_distance AS text), '')      \n            ),\n            filename = '{{render(vars.file)}}';\n\n      - id: yellow_merge_data\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          MERGE INTO {{render(vars.table)}} AS T\n          USING {{render(vars.staging_table)}} AS S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (\n              unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime,\n              passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID,\n              DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,\n              improvement_surcharge, total_amount, congestion_surcharge\n            )\n            VALUES (\n              S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime,\n              S.passenger_count, 
S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID,\n              S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount,\n              S.improvement_surcharge, S.total_amount, S.congestion_surcharge\n            );\n\n  - id: if_green_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'green'}}\"\n    then:\n      - id: green_create_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               text,\n              lpep_pickup_datetime   timestamp,\n              lpep_dropoff_datetime  timestamp,\n              store_and_fwd_flag     text,\n              RatecodeID             text,\n              PULocationID           text,\n              DOLocationID           text,\n              passenger_count        integer,\n              trip_distance          double precision,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              ehail_fee              double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              payment_type           integer,\n              trip_type              integer,\n              congestion_surcharge   double precision\n          );\n\n      - id: green_create_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          CREATE TABLE IF NOT EXISTS {{render(vars.staging_table)}} (\n              unique_row_id          text,\n              filename               text,\n              VendorID               
text,\n              lpep_pickup_datetime   timestamp,\n              lpep_dropoff_datetime  timestamp,\n              store_and_fwd_flag     text,\n              RatecodeID             text,\n              PULocationID           text,\n              DOLocationID           text,\n              passenger_count        integer,\n              trip_distance          double precision,\n              fare_amount            double precision,\n              extra                  double precision,\n              mta_tax                double precision,\n              tip_amount             double precision,\n              tolls_amount           double precision,\n              ehail_fee              double precision,\n              improvement_surcharge  double precision,\n              total_amount           double precision,\n              payment_type           integer,\n              trip_type              integer,\n              congestion_surcharge   double precision\n          );\n\n      - id: green_truncate_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          TRUNCATE TABLE {{render(vars.staging_table)}};\n\n      - id: green_copy_in_to_staging_table\n        type: io.kestra.plugin.jdbc.postgresql.CopyIn\n        format: CSV\n        from: \"{{render(vars.data)}}\"\n        table: \"{{render(vars.staging_table)}}\"\n        header: true\n        columns: [VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge]\n\n      - id: green_add_unique_id_and_filename\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          UPDATE {{render(vars.staging_table)}}\n          SET \n            unique_row_id = md5(\n              COALESCE(CAST(VendorID AS text), '') ||\n              
COALESCE(CAST(lpep_pickup_datetime AS text), '') || \n              COALESCE(CAST(lpep_dropoff_datetime AS text), '') || \n              COALESCE(PULocationID, '') || \n              COALESCE(DOLocationID, '') || \n              COALESCE(CAST(fare_amount AS text), '') || \n              COALESCE(CAST(trip_distance AS text), '')      \n            ),\n            filename = '{{render(vars.file)}}';\n\n      - id: green_merge_data\n        type: io.kestra.plugin.jdbc.postgresql.Queries\n        sql: |\n          MERGE INTO {{render(vars.table)}} AS T\n          USING {{render(vars.staging_table)}} AS S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (\n              unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime,\n              store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count,\n              trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee,\n              improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge\n            )\n            VALUES (\n              S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime,\n              S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count,\n              S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee,\n              S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge\n            );\n  \n  - id: purge_files\n    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles\n    description: To avoid cluttering your storage, we will remove the downloaded files\n\npluginDefaults:\n  - type: io.kestra.plugin.jdbc.postgresql\n    values:\n      url: jdbc:postgresql://host.docker.internal:5432/postgres-zoomcamp\n      username: kestra\n      password: k3str4\n\ntriggers:\n  - id: green_schedule\n    type: 
io.kestra.plugin.core.trigger.Schedule\n    cron: \"0 9 1 * *\"\n    inputs:\n      taxi: green\n\n  - id: yellow_schedule\n    type: io.kestra.plugin.core.trigger.Schedule\n    cron: \"0 10 1 * *\"\n    inputs:\n      taxi: yellow\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/flows/03_postgres_dbt.yaml",
    "content": "id: 03_postgres_dbt\nnamespace: zoomcamp\ninputs:\n  - id: dbt_command\n    type: SELECT\n    allowCustomValue: true\n    defaults: dbt build\n    values:\n      - dbt build\n      - dbt debug # use when running the first time to validate DB connection\ntasks:\n  - id: sync\n    type: io.kestra.plugin.git.SyncNamespaceFiles\n    url: https://github.com/DataTalksClub/data-engineering-zoomcamp\n    branch: main\n    namespace: \"{{ flow.namespace }}\"\n    gitDirectory: 04-analytics-engineering/taxi_rides_ny\n    dryRun: false\n    # disabled: true # this Git Sync is needed only when running it the first time, afterwards the task can be disabled\n\n  - id: dbt-build\n    type: io.kestra.plugin.dbt.cli.DbtCLI\n    env:\n      DBT_DATABASE: postgres-zoomcamp\n      DBT_SCHEMA: public\n    namespaceFiles:\n      enabled: true\n    containerImage: ghcr.io/kestra-io/dbt-postgres:latest\n    taskRunner:\n      type: io.kestra.plugin.scripts.runner.docker.Docker\n      networkMode: host\n    commands:\n      - dbt deps\n      - \"{{ inputs.dbt_command }}\"\n    storeManifest:\n      key: manifest.json\n      namespace: \"{{ flow.namespace }}\"\n    profiles: |\n      default:\n        outputs:\n          dev:\n            type: postgres\n            host: host.docker.internal\n            user: kestra\n            password: k3str4\n            port: 5432\n            dbname: postgres-zoomcamp\n            schema: public\n            threads: 8\n            connect_timeout: 10\n            priority: interactive\n        target: dev\ndescription: |\n  Note that you need to adjust the models/staging/schema.yml file to match your database and schema. Select and edit that Namespace File from the UI. Save and run this flow. 
Once https://github.com/DataTalksClub/data-engineering-zoomcamp/pull/565/files is merged, you can ignore this note as it will be dynamically adjusted based on env variables.\n  ```yaml\n  sources:\n    - name: staging\n      database: postgres-zoomcamp\n      schema: public\n  ```\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/flows/04_gcp_kv.yaml",
    "content": "id: 04_gcp_kv\nnamespace: zoomcamp\n\ntasks:\n  - id: gcp_project_id\n    type: io.kestra.plugin.core.kv.Set\n    key: GCP_PROJECT_ID\n    kvType: STRING\n    value: kestra-sandbox # TODO replace with your project id\n\n  - id: gcp_location\n    type: io.kestra.plugin.core.kv.Set\n    key: GCP_LOCATION\n    kvType: STRING\n    value: europe-west2\n\n  - id: gcp_bucket_name\n    type: io.kestra.plugin.core.kv.Set\n    key: GCP_BUCKET_NAME\n    kvType: STRING\n    value: your-name-kestra # TODO make sure it's globally unique!\n\n  - id: gcp_dataset\n    type: io.kestra.plugin.core.kv.Set\n    key: GCP_DATASET\n    kvType: STRING\n    value: zoomcamp\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/flows/05_gcp_setup.yaml",
    "content": "id: 05_gcp_setup\nnamespace: zoomcamp\n\ntasks:\n  - id: create_gcs_bucket\n    type: io.kestra.plugin.gcp.gcs.CreateBucket\n    ifExists: SKIP\n    storageClass: REGIONAL\n    name: \"{{kv('GCP_BUCKET_NAME')}}\" # make sure it's globally unique!\n\n  - id: create_bq_dataset\n    type: io.kestra.plugin.gcp.bigquery.CreateDataset\n    name: \"{{kv('GCP_DATASET')}}\"\n    ifExists: SKIP\n\npluginDefaults:\n  - type: io.kestra.plugin.gcp\n    values:\n      serviceAccount: \"{{kv('GCP_CREDS')}}\"\n      projectId: \"{{kv('GCP_PROJECT_ID')}}\"\n      location: \"{{kv('GCP_LOCATION')}}\"\n      bucket: \"{{kv('GCP_BUCKET_NAME')}}\"\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/flows/06_gcp_taxi.yaml",
    "content": "id: 06_gcp_taxi\nnamespace: zoomcamp\ndescription: |\n  The CSV Data used in the course: https://github.com/DataTalksClub/nyc-tlc-data/releases\n\ninputs:\n  - id: taxi\n    type: SELECT\n    displayName: Select taxi type\n    values: [yellow, green]\n    defaults: green\n\n  - id: year\n    type: SELECT\n    displayName: Select year\n    values: [\"2019\", \"2020\"]\n    defaults: \"2019\"\n    allowCustomValue: true # allows you to type 2021 from the UI for the homework 🤗\n\n  - id: month\n    type: SELECT\n    displayName: Select month\n    values: [\"01\", \"02\", \"03\", \"04\", \"05\", \"06\", \"07\", \"08\", \"09\", \"10\", \"11\", \"12\"]\n    defaults: \"01\"\n\nvariables:\n  file: \"{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv\"\n  gcs_file: \"gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}\"\n  table: \"{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{inputs.year}}_{{inputs.month}}\"\n  data: \"{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}\"\n\ntasks:\n  - id: set_label\n    type: io.kestra.plugin.core.execution.Labels\n    labels:\n      file: \"{{render(vars.file)}}\"\n      taxi: \"{{inputs.taxi}}\"\n\n  - id: extract\n    type: io.kestra.plugin.scripts.shell.Commands\n    outputFiles:\n      - \"*.csv\"\n    taskRunner:\n      type: io.kestra.plugin.core.runner.Process\n    commands:\n      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}\n\n  - id: upload_to_gcs\n    type: io.kestra.plugin.gcp.gcs.Upload\n    from: \"{{render(vars.data)}}\"\n    to: \"{{render(vars.gcs_file)}}\"\n\n  - id: if_yellow_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'yellow'}}\"\n    then:\n      - id: bq_yellow_tripdata\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE TABLE IF NOT EXISTS 
`{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata`\n          (\n              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),\n              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),      \n              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 
1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          PARTITION BY DATE(tpep_pickup_datetime);\n\n      - id: bq_yellow_table_ext\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`\n          (\n              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 
1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. 
Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          OPTIONS (\n              format = 'CSV',\n              uris = ['{{render(vars.gcs_file)}}'],\n              skip_leading_rows = 1,\n              ignore_unknown_values = TRUE\n          );\n\n      - id: bq_yellow_table_tmp\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`\n          AS\n          SELECT\n            MD5(CONCAT(\n              COALESCE(CAST(VendorID AS STRING), \"\"),\n              COALESCE(CAST(tpep_pickup_datetime AS STRING), \"\"),\n              COALESCE(CAST(tpep_dropoff_datetime AS STRING), \"\"),\n              COALESCE(CAST(PULocationID AS STRING), \"\"),\n              COALESCE(CAST(DOLocationID AS STRING), \"\")\n            )) AS unique_row_id,\n            \"{{render(vars.file)}}\" AS filename,\n            *\n          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;\n\n      - id: bq_yellow_merge\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          MERGE 
INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T\n          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge)\n            VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge);\n\n  - id: if_green_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'green'}}\"\n    then:\n      - id: bq_green_tripdata\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata`\n          (\n              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),\n              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),      \n              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 
1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. 
Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              ehail_fee NUMERIC,\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          PARTITION BY DATE(lpep_pickup_datetime);\n\n      - id: bq_green_table_ext\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`\n          (\n              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 
1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. 
Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              ehail_fee NUMERIC,\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          OPTIONS (\n              format = 'CSV',\n              uris = ['{{render(vars.gcs_file)}}'],\n              skip_leading_rows = 1,\n              ignore_unknown_values = TRUE\n          );\n\n      - id: bq_green_table_tmp\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`\n          AS\n          SELECT\n            MD5(CONCAT(\n              COALESCE(CAST(VendorID AS STRING), \"\"),\n              COALESCE(CAST(lpep_pickup_datetime AS STRING), \"\"),\n              COALESCE(CAST(lpep_dropoff_datetime AS STRING), \"\"),\n              COALESCE(CAST(PULocationID AS STRING), \"\"),\n              COALESCE(CAST(DOLocationID AS STRING), \"\")\n            )) AS unique_row_id,\n            \"{{render(vars.file)}}\" AS filename,\n            *\n          FROM 
`{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;\n\n      - id: bq_green_merge\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T\n          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge)\n            VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge);\n\n  - id: purge_files\n    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles\n    description: If you'd like to explore Kestra outputs, disable it.\n    disabled: false\n\npluginDefaults:\n  - type: io.kestra.plugin.gcp\n    values:\n      serviceAccount: \"{{kv('GCP_CREDS')}}\"\n      projectId: \"{{kv('GCP_PROJECT_ID')}}\"\n      location: \"{{kv('GCP_LOCATION')}}\"\n      bucket: \"{{kv('GCP_BUCKET_NAME')}}\"\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/flows/06_gcp_taxi_scheduled.yaml",
    "content": "\nid: 06_gcp_taxi_scheduled\nnamespace: zoomcamp\ndescription: |\n  Best to add a label `backfill:true` from the UI to track executions created via a backfill.\n  CSV data used here comes from: https://github.com/DataTalksClub/nyc-tlc-data/releases\n\ninputs:\n  - id: taxi\n    type: SELECT\n    displayName: Select taxi type\n    values: [yellow, green]\n    defaults: green\n\nvariables:\n  file: \"{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv\"\n  gcs_file: \"gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}\"\n  table: \"{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy_MM')}}\"\n  data: \"{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ (trigger.date | date('yyyy-MM')) ~ '.csv']}}\"\n\ntasks:\n  - id: set_label\n    type: io.kestra.plugin.core.execution.Labels\n    labels:\n      file: \"{{render(vars.file)}}\"\n      taxi: \"{{inputs.taxi}}\"\n\n  - id: extract\n    type: io.kestra.plugin.scripts.shell.Commands\n    outputFiles:\n      - \"*.csv\"\n    taskRunner:\n      type: io.kestra.plugin.core.runner.Process\n    commands:\n      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{inputs.taxi}}/{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}\n\n  - id: upload_to_gcs\n    type: io.kestra.plugin.gcp.gcs.Upload\n    from: \"{{render(vars.data)}}\"\n    to: \"{{render(vars.gcs_file)}}\"\n\n  - id: if_yellow_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'yellow'}}\"\n    then:\n      - id: bq_yellow_tripdata\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata`\n          (\n              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),\n              filename STRING OPTIONS (description = 'The source filename from which the trip data was 
loaded.'),      \n              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. 
Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          PARTITION BY DATE(tpep_pickup_datetime);\n\n      - id: bq_yellow_table_ext\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`\n          (\n              VendorID STRING OPTIONS (description = 'A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              tpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              tpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              passenger_count INTEGER OPTIONS (description = 'The number of passengers in the vehicle. 
This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. TRUE = store and forward trip, FALSE = not a store and forward trip'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. 
The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          OPTIONS (\n              format = 'CSV',\n              uris = ['{{render(vars.gcs_file)}}'],\n              skip_leading_rows = 1,\n              ignore_unknown_values = TRUE\n          );\n\n      - id: bq_yellow_table_tmp\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`\n          AS\n          SELECT\n            MD5(CONCAT(\n              COALESCE(CAST(VendorID AS STRING), \"\"),\n              COALESCE(CAST(tpep_pickup_datetime AS STRING), \"\"),\n              COALESCE(CAST(tpep_dropoff_datetime AS STRING), \"\"),\n              COALESCE(CAST(PULocationID AS STRING), \"\"),\n              COALESCE(CAST(DOLocationID AS STRING), \"\")\n            )) AS unique_row_id,\n            \"{{render(vars.file)}}\" AS filename,\n            *\n          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;\n\n      - id: bq_yellow_merge\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T\n          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, store_and_fwd_flag, PULocationID, DOLocationID, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge)\n            VALUES (S.unique_row_id, S.filename, S.VendorID, 
S.tpep_pickup_datetime, S.tpep_dropoff_datetime, S.passenger_count, S.trip_distance, S.RatecodeID, S.store_and_fwd_flag, S.PULocationID, S.DOLocationID, S.payment_type, S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.improvement_surcharge, S.total_amount, S.congestion_surcharge);\n\n  - id: if_green_taxi\n    type: io.kestra.plugin.core.flow.If\n    condition: \"{{inputs.taxi == 'green'}}\"\n    then:\n      - id: bq_green_tripdata\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE TABLE IF NOT EXISTS `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata`\n          (\n              unique_row_id BYTES OPTIONS (description = 'A unique identifier for the trip, generated by hashing key trip attributes.'),\n              filename STRING OPTIONS (description = 'The source filename from which the trip data was loaded.'),      \n              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 
1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              ehail_fee NUMERIC,\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 
1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 1= Street-hail 2= Dispatch'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          PARTITION BY DATE(lpep_pickup_datetime);\n\n      - id: bq_green_table_ext\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`\n          (\n              VendorID STRING OPTIONS (description = 'A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.'),\n              lpep_pickup_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was engaged'),\n              lpep_dropoff_datetime TIMESTAMP OPTIONS (description = 'The date and time when the meter was disengaged'),\n              store_and_fwd_flag STRING OPTIONS (description = 'This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka \"store and forward,\" because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip'),\n              RatecodeID STRING OPTIONS (description = 'The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride'),\n              PULocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was engaged'),\n              DOLocationID STRING OPTIONS (description = 'TLC Taxi Zone in which the taximeter was disengaged'),\n              passenger_count INT64 OPTIONS (description = 'The number of passengers in the vehicle. 
This is a driver-entered value.'),\n              trip_distance NUMERIC OPTIONS (description = 'The elapsed trip distance in miles reported by the taximeter.'),\n              fare_amount NUMERIC OPTIONS (description = 'The time-and-distance fare calculated by the meter'),\n              extra NUMERIC OPTIONS (description = 'Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges'),\n              mta_tax NUMERIC OPTIONS (description = '$0.50 MTA tax that is automatically triggered based on the metered rate in use'),\n              tip_amount NUMERIC OPTIONS (description = 'Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.'),\n              tolls_amount NUMERIC OPTIONS (description = 'Total amount of all tolls paid in trip.'),\n              ehail_fee NUMERIC,\n              improvement_surcharge NUMERIC OPTIONS (description = '$0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.'),\n              total_amount NUMERIC OPTIONS (description = 'The total amount charged to passengers. Does not include cash tips.'),\n              payment_type INTEGER OPTIONS (description = 'A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip'),\n              trip_type STRING OPTIONS (description = 'A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned based on the metered rate in use but can be altered by the driver. 
1= Street-hail 2= Dispatch'),\n              congestion_surcharge NUMERIC OPTIONS (description = 'Congestion surcharge applied to trips in congested zones')\n          )\n          OPTIONS (\n              format = 'CSV',\n              uris = ['{{render(vars.gcs_file)}}'],\n              skip_leading_rows = 1,\n              ignore_unknown_values = TRUE\n          );\n\n      - id: bq_green_table_tmp\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`\n          AS\n          SELECT\n            MD5(CONCAT(\n              COALESCE(CAST(VendorID AS STRING), \"\"),\n              COALESCE(CAST(lpep_pickup_datetime AS STRING), \"\"),\n              COALESCE(CAST(lpep_dropoff_datetime AS STRING), \"\"),\n              COALESCE(CAST(PULocationID AS STRING), \"\"),\n              COALESCE(CAST(DOLocationID AS STRING), \"\")\n            )) AS unique_row_id,\n            \"{{render(vars.file)}}\" AS filename,\n            *\n          FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;\n\n      - id: bq_green_merge\n        type: io.kestra.plugin.gcp.bigquery.Query\n        sql: |\n          MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.green_tripdata` T\n          USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S\n          ON T.unique_row_id = S.unique_row_id\n          WHEN NOT MATCHED THEN\n            INSERT (unique_row_id, filename, VendorID, lpep_pickup_datetime, lpep_dropoff_datetime, store_and_fwd_flag, RatecodeID, PULocationID, DOLocationID, passenger_count, trip_distance, fare_amount, extra, mta_tax, tip_amount, tolls_amount, ehail_fee, improvement_surcharge, total_amount, payment_type, trip_type, congestion_surcharge)\n            VALUES (S.unique_row_id, S.filename, S.VendorID, S.lpep_pickup_datetime, S.lpep_dropoff_datetime, S.store_and_fwd_flag, S.RatecodeID, S.PULocationID, S.DOLocationID, S.passenger_count, S.trip_distance, 
S.fare_amount, S.extra, S.mta_tax, S.tip_amount, S.tolls_amount, S.ehail_fee, S.improvement_surcharge, S.total_amount, S.payment_type, S.trip_type, S.congestion_surcharge);\n\n  - id: purge_files\n    type: io.kestra.plugin.core.storage.PurgeCurrentExecutionFiles\n    description: To avoid cluttering your storage, we will remove the downloaded files\n\npluginDefaults:\n  - type: io.kestra.plugin.gcp\n    values:\n      serviceAccount: \"{{kv('GCP_CREDS')}}\"\n      projectId: \"{{kv('GCP_PROJECT_ID')}}\"\n      location: \"{{kv('GCP_LOCATION')}}\"\n      bucket: \"{{kv('GCP_BUCKET_NAME')}}\"\n\ntriggers:\n  - id: green_schedule\n    type: io.kestra.plugin.core.trigger.Schedule\n    cron: \"0 9 1 * *\"\n    inputs:\n      taxi: green\n\n  - id: yellow_schedule\n    type: io.kestra.plugin.core.trigger.Schedule\n    cron: \"0 10 1 * *\"\n    inputs:\n      taxi: yellow\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/flows/07_gcp_dbt.yaml",
    "content": "id: 07_gcp_dbt\nnamespace: zoomcamp\ninputs:\n  - id: dbt_command\n    type: SELECT\n    allowCustomValue: true\n    defaults: dbt build\n    values:\n      - dbt build\n      - dbt debug # use when running the first time to validate DB connection\n\ntasks:\n  - id: sync\n    type: io.kestra.plugin.git.SyncNamespaceFiles\n    url: https://github.com/DataTalksClub/data-engineering-zoomcamp\n    branch: main\n    namespace: \"{{flow.namespace}}\"\n    gitDirectory: 04-analytics-engineering/taxi_rides_ny\n    dryRun: false\n    # disabled: true # this Git Sync is needed only when running it the first time, afterwards the task can be disabled\n\n  - id: dbt-build\n    type: io.kestra.plugin.dbt.cli.DbtCLI\n    env:\n      DBT_DATABASE: \"{{kv('GCP_PROJECT_ID')}}\"\n      DBT_SCHEMA: \"{{kv('GCP_DATASET')}}\"\n    namespaceFiles:\n      enabled: true\n    containerImage: ghcr.io/kestra-io/dbt-bigquery:latest\n    taskRunner:\n      type: io.kestra.plugin.scripts.runner.docker.Docker\n    inputFiles:\n      sa.json: \"{{kv('GCP_CREDS')}}\"\n    commands:\n      - dbt deps\n      - \"{{ inputs.dbt_command }}\"\n    storeManifest:\n      key: manifest.json\n      namespace: \"{{ flow.namespace }}\"\n    profiles: |\n      default:\n        outputs:\n          dev:\n            type: bigquery\n            dataset: \"{{kv('GCP_DATASET')}}\"\n            project: \"{{kv('GCP_PROJECT_ID')}}\"\n            location: \"{{kv('GCP_LOCATION')}}\"\n            keyfile: sa.json\n            method: service-account\n            priority: interactive\n            threads: 16\n            timeout_seconds: 300\n            fixed_retries: 1\n        target: dev\ndescription: |\n  Note that you need to adjust the models/staging/schema.yml file to match your database and schema. Select and edit that Namespace File from the UI. Save and run this flow. 
Once https://github.com/DataTalksClub/data-engineering-zoomcamp/pull/565/files is merged, you can ignore this note as it will be dynamically adjusted based on env variables.\n  ```yaml\n  sources:\n    - name: staging\n      database: kestra-sandbox \n      schema: zoomcamp\n  ```\n"
  },
  {
    "path": "cohorts/2025/02-workflow-orchestration/homework.md",
    "content": "## Module 2 Homework\n\nATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include these directly in the README file of your repository.\n\n> In case you don't get one option exactly, select the closest one \n\nFor the homework, we'll be working with the _green_ taxi dataset located here:\n\n`https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download`\n\nTo get a `wget`-able link, use this prefix (note that the link itself gives 404):\n\n`https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/`\n\n### Assignment\n\nSo far in the course, we processed data for the year 2019 and 2020. Your task is to extend the existing flows to include data for the year 2021.\n\n![homework datasets](../../../02-workflow-orchestration/images/homework.png)\n\nAs a hint, Kestra makes that process really easy:\n1. You can leverage the backfill functionality in the [scheduled flow](../../../02-workflow-orchestration/flows/06_gcp_taxi_scheduled.yaml) to backfill the data for the year 2021. Just make sure to select the time period for which data exists i.e. from `2021-01-01` to `2021-07-31`. Also, make sure to do the same for both `yellow` and `green` taxi data (select the right service in the `taxi` input).\n2. Alternatively, run the flow manually for each of the seven months of 2021 for both `yellow` and `green` taxi data. Challenge for you: find out how to loop over the combination of Year-Month and `taxi`-type using `ForEach` task which triggers the flow for each combination using a `Subflow` task.\n\n### Quiz Questions\n\nComplete the Quiz shown below. 
It’s a set of 6 multiple-choice questions to test your understanding of workflow orchestration, Kestra and ETL pipelines for data lakes and warehouses.\n\n1) Within the execution for `Yellow` Taxi data for the year `2020` and month `12`: what is the uncompressed file size (i.e. the output file `yellow_tripdata_2020-12.csv` of the `extract` task)?\n- 128.3 MiB\n- 134.5 MiB\n- 364.7 MiB\n- 692.6 MiB\n\n2) What is the rendered value of the variable `file` when the input `taxi` is set to `green`, `year` is set to `2020`, and `month` is set to `04` during execution?\n- `{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv` \n- `green_tripdata_2020-04.csv`\n- `green_tripdata_04_2020.csv`\n- `green_tripdata_2020.csv`\n\n3) How many rows are there for the `Yellow` Taxi data for all CSV files in the year 2020?\n- 13,537,299\n- 24,648,499\n- 18,324,219\n- 29,430,127\n\n4) How many rows are there for the `Green` Taxi data for all CSV files in the year 2020?\n- 5,327,301\n- 936,199\n- 1,734,051\n- 1,342,034\n\n5) How many rows are there for the `Yellow` Taxi data for the March 2021 CSV file?\n- 1,428,092\n- 706,911\n- 1,925,152\n- 2,561,031\n\n6) How would you configure the timezone to New York in a Schedule trigger?\n- Add a `timezone` property set to `EST` in the `Schedule` trigger configuration  \n- Add a `timezone` property set to `America/New_York` in the `Schedule` trigger configuration\n- Add a `timezone` property set to `UTC-5` in the `Schedule` trigger configuration\n- Add a `location` property set to `New_York` in the `Schedule` trigger configuration  \n\n\n## Submitting the solutions\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw2\n* Check the link above to see the due date\n\n## Solution\n\nWill be added after the due date\n"
  },
  {
    "path": "cohorts/2025/03-data-warehouse/DLT_upload_to_GCP.ipynb",
    "content": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"aC2QnhmKxpq1\"\n      },\n      \"source\": [\n        \"**Please set up your credentials JSON as a GCP_CREDENTIALS secret**\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 2,\n      \"metadata\": {\n        \"id\": \"UsUZobVduL7l\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"import os\\n\",\n        \"from google.colab import userdata\\n\",\n        \"\\n\",\n        \"os.environ[\\\"DESTINATION__CREDENTIALS\\\"] = userdata.get('GCP_CREDENTIALS')\\n\",\n        \"os.environ[\\\"BUCKET_URL\\\"] = \\\"gs://your_bucket_url\\\"\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 1,\n      \"metadata\": {\n        \"id\": \"mPBzsEgyjsBo\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"%%capture\\n\",\n        \"# Install for production\\n\",\n        \"!pip install 'dlt[bigquery,gs]'\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 1,\n      \"metadata\": {\n        \"id\": \"evdUsDNbkCTk\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"%%capture\\n\",\n        \"# Install for testing\\n\",\n        \"!pip install dlt[duckdb]\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 2,\n      \"metadata\": {\n        \"id\": \"lYh7r1mTf4uo\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"import dlt\\n\",\n        \"import requests\\n\",\n        \"import pandas as pd\\n\",\n        \"from dlt.destinations import filesystem\\n\",\n        \"from io import BytesIO\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"76zT1PzAgs7A\"\n      },\n      \"source\": [\n        \"Ingesting parquet files to GCS.\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 
null,\n      \"metadata\": {\n        \"id\": \"xya0215jsnsb\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"# Define a dlt source to download and process Parquet files as resources\\n\",\n        \"@dlt.source(name=\\\"rides\\\")\\n\",\n        \"def download_parquet():\\n\",\n        \"     for month in range(1,7):\\n\",\n        \"      file_name = f\\\"yellow_tripdata_2024-0{month}.parquet\\\"\\n\",\n        \"\\n\",\n        \"      url = f\\\"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-0{month}.parquet\\\"\\n\",\n        \"      response = requests.get(url)\\n\",\n        \"\\n\",\n        \"      df = pd.read_parquet(BytesIO(response.content))\\n\",\n        \"\\n\",\n        \"      # Return the dataframe as a dlt resource for ingestion\\n\",\n        \"      yield dlt.resource(df, name=file_name)\\n\",\n        \"\\n\",\n        \"# Initialize the pipeline\\n\",\n        \"pipeline = dlt.pipeline(\\n\",\n        \"    pipeline_name=\\\"rides_pipeline\\\",\\n\",\n        \"    destination=filesystem(\\n\",\n        \"      layout=\\\"{schema_name}/{table_name}.{ext}\\\"\\n\",\n        \"    ),\\n\",\n        \"    dataset_name=\\\"rides_dataset\\\"\\n\",\n        \")\\n\",\n        \"\\n\",\n        \"# Run the pipeline to write the Parquet files to the filesystem destination (the bucket configured above)\\n\",\n        \"load_info = pipeline.run(\\n\",\n        \"    download_parquet(),\\n\",\n        \"    loader_file_format=\\\"parquet\\\"\\n\",\n        \"    )\\n\",\n        \"\\n\",\n        \"# Print the results\\n\",\n        \"print(load_info)\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"S0310FT-gy_P\"\n      },\n      \"source\": [\n        \"Ingesting data to Database\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": 
\"1_3K97w1c2v2\",\n        \"outputId\": \"4b2d26bf-2814-46fa-f80d-7a2e17417a95\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"# Define a dlt resource to download and process Parquet files as single table\\n\",\n        \"@dlt.resource(name=\\\"rides\\\", write_disposition=\\\"replace\\\")\\n\",\n        \"def download_parquet():\\n\",\n        \"     for month in range(1,7):\\n\",\n        \"      url = f\\\"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-0{month}.parquet\\\"\\n\",\n        \"      response = requests.get(url)\\n\",\n        \"\\n\",\n        \"      df = pd.read_parquet(BytesIO(response.content))\\n\",\n        \"\\n\",\n        \"      # Return the dataframe as a dlt resource for ingestion\\n\",\n        \"      yield df\\n\",\n        \"\\n\",\n        \"# Initialize the pipeline\\n\",\n        \"pipeline = dlt.pipeline(\\n\",\n        \"    pipeline_name=\\\"rides_pipeline\\\",\\n\",\n        \"    destination=\\\"duckdb\\\",  # Use DuckDB for testing\\n\",\n        \"    # destination=\\\"bigquery\\\",  # Use BigQuery for production\\n\",\n        \"    dataset_name=\\\"rides_dataset\\\"\\n\",\n        \")\\n\",\n        \"\\n\",\n        \"# Run the pipeline to load Parquet data into DuckDB\\n\",\n        \"info = pipeline.run(download_parquet)\\n\",\n        \"\\n\",\n        \"# Print the results\\n\",\n        \"print(info)\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"gDcLjzLtooBV\",\n        \"outputId\": \"74ff2de7-2f2e-41b9-a681-3dc5887f6eed\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"import duckdb\\n\",\n        \"conn = duckdb.connect(f\\\"{pipeline.pipeline_name}.duckdb\\\")\\n\",\n        \"\\n\",\n        \"# Set search path to the dataset\\n\",\n        \"conn.sql(f\\\"SET search_path = 
'{pipeline.dataset_name}'\\\")\\n\",\n        \"\\n\",\n        \"# Describe the dataset to see loaded tables\\n\",\n        \"res = conn.sql(\\\"DESCRIBE\\\").df()\\n\",\n        \"print(res)\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"VVJy8JoerI2P\",\n        \"outputId\": \"3f8c7fee-a9ee-4fd4-ec75-153ca60bd36f\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"# provide a resource name to query a table of that name\\n\",\n        \"with pipeline.sql_client() as client:\\n\",\n        \"    with client.execute_query(f\\\"SELECT count(1) FROM rides\\\") as cursor:\\n\",\n        \"        data = cursor.df()\\n\",\n        \"print(data)\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"Python 3\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    }\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "cohorts/2025/03-data-warehouse/homework.md",
    "content": "## Module 3 Homework\n\nATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. \nThis repository should contain your code for solving the homework. If your solution includes code that is not in file format (such as SQL queries or \nshell commands), please include these directly in the README file of your repository.\n\n<b><u>Important Note:</u></b> <p> For this homework we will be using the Yellow Taxi Trip Records for **January 2024 - June 2024 NOT the entire year of data** \nParquet Files from the New York\nCity Taxi Data found here: </br> https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page </br>\nIf you are using orchestration such as Kestra, Mage, Airflow or Prefect etc. do not load the data into Big Query using the orchestrator.</br> \nStop after loading the files into a bucket. </br></br>\n\n**Load Script:** You can manually download the parquet files and upload them to your GCS Bucket or you can use the linked script [here](./load_yellow_taxi_data.py):<br>\nYou will simply need to generate a Service Account with GCS Admin Privileges or be authenticated with the Google SDK and update the bucket name in the script to the name of your bucket<br>\nNothing is foolproof so make sure that all 6 files show in your GCS Bucket before beginning.</br><br>\n\n<u>NOTE:</u> You will need to use the PARQUET option when creating an External Table</br>\n\n<b>BIG QUERY SETUP:</b></br>\nCreate an external table using the Yellow Taxi Trip Records. </br>\nCreate a (regular/materialized) table in BQ using the Yellow Taxi Trip Records (do not partition or cluster this table). 
</br>\n</p>\n\n## Question 1:\nWhat is the count of records for the 2024 Yellow Taxi Data?\n- 65,623\n- 840,402\n- 20,332,093\n- 85,431,289\n\n\n## Question 2:\nWrite a query to count the distinct number of PULocationIDs for the entire dataset on both tables.</br> \nWhat is the **estimated amount** of data that will be read when this query is executed on the External Table and the Table?\n\n- 18.82 MB for the External Table and 47.60 MB for the Materialized Table\n- 0 MB for the External Table and 155.12 MB for the Materialized Table\n- 2.14 GB for the External Table and 0MB for the Materialized Table\n- 0 MB for the External Table and 0MB for the Materialized Table\n\n## Question 3:\nWrite a query to retrieve the PULocationID from the table (not the external table) in BigQuery. Now write a query to retrieve the PULocationID and DOLocationID on the same table. Why is the estimated number of bytes different?\n- BigQuery is a columnar database, and it only scans the specific columns requested in the query. 
Querying two columns (PULocationID, DOLocationID) requires \nreading more data than querying one column (PULocationID), leading to a higher estimated number of bytes processed.\n- BigQuery duplicates data across multiple storage partitions, so selecting two columns instead of one requires scanning the table twice, \ndoubling the estimated bytes processed.\n- BigQuery automatically caches the first queried column, so adding a second column increases processing time but does not affect the estimated bytes scanned.\n- When selecting multiple columns, BigQuery performs an implicit join operation between them, increasing the estimated bytes processed\n\n## Question 4:\nHow many records have a fare_amount of 0?\n- 128,210\n- 546,578\n- 20,188,016\n- 8,333\n\n## Question 5:\nWhat is the best strategy to make an optimized table in Big Query if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID (Create a new table with this strategy)\n- Partition by tpep_dropoff_datetime and Cluster on VendorID\n- Cluster on by tpep_dropoff_datetime and Cluster on VendorID\n- Cluster on tpep_dropoff_datetime Partition by VendorID\n- Partition by tpep_dropoff_datetime and Partition by VendorID\n\n\n## Question 6:\nWrite a query to retrieve the distinct VendorIDs between tpep_dropoff_datetime\n2024-03-01 and 2024-03-15 (inclusive)</br>\n\nUse the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 5 and note the estimated bytes processed. What are these values? 
</br>\n\nChoose the answer which most closely matches.</br> \n\n- 12.47 MB for non-partitioned table and 326.42 MB for the partitioned table\n- 310.24 MB for non-partitioned table and 26.84 MB for the partitioned table\n- 5.87 MB for non-partitioned table and 0 MB for the partitioned table\n- 310.31 MB for non-partitioned table and 285.64 MB for the partitioned table\n\n\n## Question 7: \nWhere is the data stored in the External Table you created?\n\n- Big Query\n- Container Registry\n- GCP Bucket\n- Big Table\n\n## Question 8:\nIt is best practice in Big Query to always cluster your data:\n- True\n- False\n\n\n## (Bonus: Not worth points) Question 9:\nNo Points: Write a `SELECT count(*)` query FROM the materialized table you created. How many bytes does it estimate will be read? Why?\n\n\n## Submitting the solutions\n\nForm for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw3\n\n## Solution\n\nSolution: https://www.youtube.com/watch?v=wpLmImIUlPg\n"
  },
  {
    "path": "cohorts/2025/03-data-warehouse/load_yellow_taxi_data.py",
    "content": "import os\nimport sys\nimport urllib.request\nfrom concurrent.futures import ThreadPoolExecutor\nfrom google.cloud import storage\nfrom google.api_core.exceptions import NotFound, Forbidden\nimport time\n\n\n# Change this to your bucket name\nBUCKET_NAME = \"dezoomcamp_hw3_2025\"\n\n# If you authenticated through the GCP SDK you can comment out these two lines\nCREDENTIALS_FILE = \"gcs.json\"\nclient = storage.Client.from_service_account_json(CREDENTIALS_FILE)\n# If commented initialize client with the following\n# client = storage.Client(project='zoomcamp-mod3-datawarehouse')\n\n\nBASE_URL = \"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-\"\nMONTHS = [f\"{i:02d}\" for i in range(1, 7)]\nDOWNLOAD_DIR = \".\"\n\nCHUNK_SIZE = 8 * 1024 * 1024\n\nos.makedirs(DOWNLOAD_DIR, exist_ok=True)\n\nbucket = client.bucket(BUCKET_NAME)\n\n\ndef download_file(month):\n    url = f\"{BASE_URL}{month}.parquet\"\n    file_path = os.path.join(DOWNLOAD_DIR, f\"yellow_tripdata_2024-{month}.parquet\")\n\n    try:\n        print(f\"Downloading {url}...\")\n        urllib.request.urlretrieve(url, file_path)\n        print(f\"Downloaded: {file_path}\")\n        return file_path\n    except Exception as e:\n        print(f\"Failed to download {url}: {e}\")\n        return None\n\n\ndef create_bucket(bucket_name):\n    try:\n        # Get bucket details\n        bucket = client.get_bucket(bucket_name)\n\n        # Check if the bucket belongs to the current project\n        project_bucket_ids = [bckt.id for bckt in client.list_buckets()]\n        if bucket_name in project_bucket_ids:\n            print(\n                f\"Bucket '{bucket_name}' exists and belongs to your project. 
Proceeding...\"\n            )\n        else:\n            print(\n                f\"A bucket with the name '{bucket_name}' already exists, but it does not belong to your project.\"\n            )\n            sys.exit(1)\n\n    except NotFound:\n        # If the bucket doesn't exist, create it\n        bucket = client.create_bucket(bucket_name)\n        print(f\"Created bucket '{bucket_name}'\")\n    except Forbidden:\n        # If the request is forbidden, it means the bucket exists but you don't have access to see details\n        print(\n            f\"A bucket with the name '{bucket_name}' exists, but it is not accessible. Bucket name is taken. Please try a different bucket name.\"\n        )\n        sys.exit(1)\n\n\ndef verify_gcs_upload(blob_name):\n    return storage.Blob(bucket=bucket, name=blob_name).exists(client)\n\n\ndef upload_to_gcs(file_path, max_retries=3):\n    blob_name = os.path.basename(file_path)\n    blob = bucket.blob(blob_name)\n    blob.chunk_size = CHUNK_SIZE\n\n    create_bucket(BUCKET_NAME)\n\n    for attempt in range(max_retries):\n        try:\n            print(f\"Uploading {file_path} to {BUCKET_NAME} (Attempt {attempt + 1})...\")\n            blob.upload_from_filename(file_path)\n            print(f\"Uploaded: gs://{BUCKET_NAME}/{blob_name}\")\n\n            if verify_gcs_upload(blob_name):\n                print(f\"Verification successful for {blob_name}\")\n                return\n            else:\n                print(f\"Verification failed for {blob_name}, retrying...\")\n        except Exception as e:\n            print(f\"Failed to upload {file_path} to GCS: {e}\")\n\n        time.sleep(5)\n\n    print(f\"Giving up on {file_path} after {max_retries} attempts.\")\n\n\nif __name__ == \"__main__\":\n    create_bucket(BUCKET_NAME)\n\n    with ThreadPoolExecutor(max_workers=4) as executor:\n        file_paths = list(executor.map(download_file, MONTHS))\n\n    with ThreadPoolExecutor(max_workers=4) as executor:\n        
executor.map(upload_to_gcs, filter(None, file_paths))  # Remove None values\n\n    print(\"All files processed and verified.\")\n"
  },
  {
    "path": "cohorts/2025/04-analytics-engineering/homework.md",
    "content": "## Module 4 Homework\n\nFor this homework, you will need the following datasets:\n* [Green Taxi dataset (2019 and 2020)](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green)\n* [Yellow Taxi dataset (2019 and 2020)](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/yellow)\n* [For Hire Vehicle dataset (2019)](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv)\n\n### Before you start\n\n1. Make sure you, **at least**, have them in GCS with an External Table **OR** a Native Table - use whichever method you prefer to accomplish that (Workflow Orchestration with [pandas-gbq](https://cloud.google.com/bigquery/docs/samples/bigquery-pandas-gbq-to-gbq-simple), [dlt for gcs](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem), [dlt for BigQuery](https://dlthub.com/docs/dlt-ecosystem/destinations/bigquery), [gsutil](https://cloud.google.com/storage/docs/gsutil), etc.)\n2. You should have exactly `7,778,101` records in your Green Taxi table\n3. You should have exactly `109,047,518` records in your Yellow Taxi table\n4. You should have exactly `43,244,696` records in your FHV table\n5. Build the staging models for green/yellow as shown [here](../../../04-analytics-engineering/taxi_rides_ny/models/staging/)\n6. 
Build the dimension/fact for taxi_trips joining with `dim_zones`  as shown in [here](../../../04-analytics-engineering/taxi_rides_ny/models/core/fact_trips.sql)\n\n**Note**: If you don't have access to GCP, you can spin up a local Postgres instance and ingest the datasets above\n\n\n### Question 1: Understanding dbt model resolution\n\nProvided you've got the following sources.yaml\n```yaml\nversion: 2\n\nsources:\n  - name: raw_nyc_tripdata\n    database: \"{{ env_var('DBT_BIGQUERY_PROJECT', 'dtc_zoomcamp_2025') }}\"\n    schema:   \"{{ env_var('DBT_BIGQUERY_SOURCE_DATASET', 'raw_nyc_tripdata') }}\"\n    tables:\n      - name: ext_green_taxi\n      - name: ext_yellow_taxi\n```\n\nwith the following env variables setup where `dbt` runs:\n```shell\nexport DBT_BIGQUERY_PROJECT=myproject\nexport DBT_BIGQUERY_DATASET=my_nyc_tripdata\n```\n\nWhat does this .sql model compile to?\n```sql\nselect * \nfrom {{ source('raw_nyc_tripdata', 'ext_green_taxi' ) }}\n```\n\n- `select * from dtc_zoomcamp_2025.raw_nyc_tripdata.ext_green_taxi`\n- `select * from dtc_zoomcamp_2025.my_nyc_tripdata.ext_green_taxi`\n- `select * from myproject.raw_nyc_tripdata.ext_green_taxi`\n- `select * from myproject.my_nyc_tripdata.ext_green_taxi`\n- `select * from dtc_zoomcamp_2025.raw_nyc_tripdata.green_taxi`\n\n\n### Question 2: dbt Variables & Dynamic Models\n\nSay you have to modify the following dbt_model (`fct_recent_taxi_trips.sql`) to enable Analytics Engineers to dynamically control the date range. 
 \n\n- In development, you want to process only **the last 7 days of trips**\n- In production, you need to process **the last 30 days** for analytics\n\n```sql\nselect *\nfrom {{ ref('fact_taxi_trips') }}\nwhere pickup_datetime >= CURRENT_DATE - INTERVAL '30' DAY\n```\n\nWhat would you change to accomplish that in such a way that command line arguments take precedence over ENV_VARs, which take precedence over the DEFAULT value?\n\n- Add `ORDER BY pickup_datetime DESC` and `LIMIT {{ var(\"days_back\", 30) }}`\n- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ var(\"days_back\", 30) }}' DAY`\n- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ env_var(\"DAYS_BACK\", \"30\") }}' DAY`\n- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ var(\"days_back\", env_var(\"DAYS_BACK\", \"30\")) }}' DAY`\n- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ env_var(\"DAYS_BACK\", var(\"days_back\", \"30\")) }}' DAY`\n\n\n### Question 3: dbt Data Lineage and Execution\n\nConsidering the data lineage below **and** that taxi_zone_lookup is the **only** materialization built (from a .csv seed file):\n\n![image](./homework_q2.png)\n\nSelect the option that does **NOT** apply for materializing `fct_taxi_monthly_zone_revenue`:\n\n- `dbt run`\n- `dbt run --select +models/core/dim_taxi_trips.sql+ --target prod`\n- `dbt run --select +models/core/fct_taxi_monthly_zone_revenue.sql`\n- `dbt run --select +models/core/`\n- `dbt run --select models/staging/+`\n\n\n### Question 4: dbt Macros and Jinja\n\nConsider you're dealing with sensitive data (e.g.: [PII](https://en.wikipedia.org/wiki/Personal_data)), that is **only available to your team and a select few individuals**, in the `raw layer` of your DWH (e.g: a specific BigQuery dataset or PostgreSQL schema), \n\n - Among other things, you decide to obfuscate/mask that data through your staging models, and make it available in a 
different schema (a `staging layer`) for other Data/Analytics Engineers to explore\n\n- And **optionally**, yet  another layer (`service layer`), where you'll build your dimension (`dim_`) and fact (`fct_`) tables (assuming the [Star Schema dimensional modeling](https://www.databricks.com/glossary/star-schema)) for Dashboarding and for Tech Product Owners/Managers\n\nYou decide to make a macro to wrap a logic around it:\n\n```sql\n{% macro resolve_schema_for(model_type) -%}\n\n    {%- set target_env_var = 'DBT_BIGQUERY_TARGET_DATASET'  -%}\n    {%- set stging_env_var = 'DBT_BIGQUERY_STAGING_DATASET' -%}\n\n    {%- if model_type == 'core' -%} {{- env_var(target_env_var) -}}\n    {%- else -%}                    {{- env_var(stging_env_var, env_var(target_env_var)) -}}\n    {%- endif -%}\n\n{%- endmacro %}\n```\n\nAnd use on your staging, dim_ and fact_ models as:\n```sql\n{{ config(\n    schema=resolve_schema_for('core'), \n) }}\n```\n\nThat all being said, regarding macro above, **select all statements that are true to the models using it**:\n- Setting a value for  `DBT_BIGQUERY_TARGET_DATASET` env var is mandatory, or it'll fail to compile\n- Setting a value for `DBT_BIGQUERY_STAGING_DATASET` env var is mandatory, or it'll fail to compile\n- When using `core`, it materializes in the dataset defined in `DBT_BIGQUERY_TARGET_DATASET`\n- When using `stg`, it materializes in the dataset defined in `DBT_BIGQUERY_STAGING_DATASET`, or defaults to `DBT_BIGQUERY_TARGET_DATASET`\n- When using `staging`, it materializes in the dataset defined in `DBT_BIGQUERY_STAGING_DATASET`, or defaults to `DBT_BIGQUERY_TARGET_DATASET`\n\n\n## Serious SQL\n\nAlright, in module 1, you had a SQL refresher, so now let's build on top of that with some serious SQL.\n\nThese are not meant to be easy - but they'll boost your SQL and Analytics skills to the next level.  
\nSo, without further ado, let's get started...\n\nYou might want to add some new dimensions `year` (e.g.: 2019, 2020), `quarter` (1, 2, 3, 4), `year_quarter` (e.g.: `2019/Q1`, `2019/Q2`), and `month` (e.g.: 1, 2, ..., 12), **extracted from pickup_datetime**, to your `fct_taxi_trips` OR `dim_taxi_trips.sql` models to facilitate filtering your queries\n\n\n### Question 5: Taxi Quarterly Revenue Growth\n\n1. Create a new model `fct_taxi_trips_quarterly_revenue.sql`\n2. Compute the Quarterly Revenues for each year based on `total_amount`\n3. Compute the Quarterly YoY (Year-over-Year) revenue growth \n  * e.g.: In 2020/Q1, Green Taxi had -12.34% revenue growth compared to 2019/Q1\n  * e.g.: In 2020/Q4, Yellow Taxi had +34.56% revenue growth compared to 2019/Q4\n\n***Important Note: The Year-over-Year (YoY) growth percentages provided in the examples are purely illustrative. You will not be able to reproduce these exact values using the datasets provided for this homework.***\n\nConsidering the YoY Growth in 2020, which were the yearly quarters with the best (or least bad) and worst results for green and yellow?\n\n- green: {best: 2020/Q2, worst: 2020/Q1}, yellow: {best: 2020/Q2, worst: 2020/Q1}\n- green: {best: 2020/Q2, worst: 2020/Q1}, yellow: {best: 2020/Q3, worst: 2020/Q4}\n- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q2, worst: 2020/Q1}\n- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q1, worst: 2020/Q2}\n- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q3, worst: 2020/Q4}\n\n\n### Question 6: P97/P95/P90 Taxi Monthly Fare\n\n1. Create a new model `fct_taxi_trips_monthly_fare_p95.sql`\n2. Filter out invalid entries (keep only rows with `fare_amount > 0`, `trip_distance > 0`, and `payment_type_description in ('Cash', 'Credit card')`)\n3. 
Compute the **continuous percentile** of `fare_amount` partitioning by service_type, year, and month\n\nNow, what are the values of `p97`, `p95`, `p90` for Green Taxi and Yellow Taxi, in April 2020?\n\n- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 52.0, p95: 37.0, p90: 25.5}\n- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 31.5, p95: 25.5, p90: 19.0}\n- green: {p97: 40.0, p95: 33.0, p90: 24.5}, yellow: {p97: 52.0, p95: 37.0, p90: 25.5}\n- green: {p97: 40.0, p95: 33.0, p90: 24.5}, yellow: {p97: 31.5, p95: 25.5, p90: 19.0}\n- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 52.0, p95: 25.5, p90: 19.0}\n\n\n### Question 7: Top #Nth longest P90 travel time Location for FHV\n\nPrerequisites:\n* Create a staging model for FHV Data (2019), and **DO NOT** add a deduplication step, just keep the entries where `dispatching_base_num is not null`\n* Create a core model for FHV Data (`dim_fhv_trips.sql`) joining with `dim_zones`. Similar to what has been done [here](../../../04-analytics-engineering/taxi_rides_ny/models/core/fact_trips.sql)\n* Add some new dimensions `year` (e.g.: 2019) and `month` (e.g.: 1, 2, ..., 12), based on `pickup_datetime`, to the core model to facilitate filtering for your queries\n\nNow...\n1. Create a new model `fct_fhv_monthly_zone_traveltime_p90.sql`\n2. For each record in `dim_fhv_trips.sql`, compute the [timestamp_diff](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timestamp_diff) in seconds between dropoff_datetime and pickup_datetime - we'll call it `trip_duration` for this exercise\n3. 
Compute the **continuous** `p90` of `trip_duration` partitioning by year, month, pickup_location_id, and dropoff_location_id\n\nFor the Trips that **respectively** started from `Newark Airport`, `SoHo`, and `Yorkville East`, in November 2019, what are the **dropoff_zones** with the 2nd longest p90 trip_duration?\n\n- LaGuardia Airport, Chinatown, Garment District\n- LaGuardia Airport, Park Slope, Clinton East\n- LaGuardia Airport, Saint Albans, Howard Beach\n- LaGuardia Airport, Rosedale, Bath Beach\n- LaGuardia Airport, Yorkville East, Greenpoint\n\n\n## Submitting the solutions\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw4\n\n\n## Solution \n\n* To be published after deadline\n"
  },
  {
    "path": "cohorts/2025/05-batch/homework.md",
    "content": "# Module 5 Homework\n\nIn this homework we'll put what we learned about Spark in practice.\n\nFor this homework we will be using the Yellow 2024-10 data from the official website: \n\n```bash\nwget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet\n```\n\n\n## Question 1: Install Spark and PySpark\n\n- Install Spark\n- Run PySpark\n- Create a local spark session\n- Execute spark.version.\n\nWhat's the output?\n\n> [!NOTE]\n> To install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/05-batch/setup/pyspark.md)\n\n\n## Question 2: Yellow October 2024\n\nRead the October 2024 Yellow into a Spark Dataframe.\n\nRepartition the Dataframe to 4 partitions and save it to parquet.\n\nWhat is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? Select the answer which most closely matches.\n\n- 6MB\n- 25MB\n- 75MB\n- 100MB\n\n\n## Question 3: Count records \n\nHow many taxi trips were there on the 15th of October?\n\nConsider only trips that started on the 15th of October.\n\n- 85,567\n- 105,567\n- 125,567\n- 145,567\n\n\n## Question 4: Longest trip\n\nWhat is the length of the longest trip in the dataset in hours?\n\n- 122\n- 142\n- 162\n- 182\n\n\n## Question 5: User Interface\n\nSpark’s User Interface which shows the application's dashboard runs on which local port?\n\n- 80\n- 443\n- 4040\n- 8080\n\n\n\n## Question 6: Least frequent pickup location zone\n\nLoad the zone lookup data into a temp view in Spark:\n\n```bash\nwget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv\n```\n\nUsing the zone lookup data and the Yellow October 2024 data, what is the name of the LEAST frequent pickup location Zone?\n\n- Governor's Island/Ellis Island/Liberty Island\n- Arden Heights\n- Rikers Island\n- Jamaica Bay\n\n\n## Submitting the solutions\n\n- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw5\n- 
Deadline: See the website\n"
  },
  {
    "path": "cohorts/2025/06-streaming/homework/homework.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"a63a4585-8a6b-4446-9b63-8c5d5d0b80fc\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"True\"\n      ]\n     },\n     \"execution_count\": 1,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import json\\n\",\n    \"\\n\",\n    \"from kafka import KafkaProducer\\n\",\n    \"\\n\",\n    \"def json_serializer(data):\\n\",\n    \"    return json.dumps(data).encode('utf-8')\\n\",\n    \"\\n\",\n    \"server = 'localhost:9092'\\n\",\n    \"\\n\",\n    \"producer = KafkaProducer(\\n\",\n    \"    bootstrap_servers=[server],\\n\",\n    \"    value_serializer=json_serializer\\n\",\n    \")\\n\",\n    \"\\n\",\n    \"producer.bootstrap_connected()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"id\": \"78bd28f9-66cb-4532-bf03-bb3fe90655b5\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"--2025-03-07 19:27:06--  https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz\\n\",\n      \"Resolving github.com (github.com)... 140.82.121.3\\n\",\n      \"Connecting to github.com (github.com)|140.82.121.3|:443... connected.\\n\",\n      \"HTTP request sent, awaiting response... 
302 Found\\n\",\n      \"Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/ea580e9e-555c-4bd0-ae73-43051d8e7c0b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250307%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250307T182706Z&X-Amz-Expires=300&X-Amz-Signature=6b8f2f603fe86515be24510f3f30bcf93c932b551769e5121fb0cbdf58e9b767&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dgreen_tripdata_2019-10.csv.gz&response-content-type=application%2Foctet-stream [following]\\n\",\n      \"--2025-03-07 19:27:07--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/513814948/ea580e9e-555c-4bd0-ae73-43051d8e7c0b?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250307%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250307T182706Z&X-Amz-Expires=300&X-Amz-Signature=6b8f2f603fe86515be24510f3f30bcf93c932b551769e5121fb0cbdf58e9b767&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dgreen_tripdata_2019-10.csv.gz&response-content-type=application%2Foctet-stream\\n\",\n      \"Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...\\n\",\n      \"Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... connected.\\n\",\n      \"HTTP request sent, awaiting response... 200 OK\\n\",\n      \"Length: 8262584 (7.9M) [application/octet-stream]\\n\",\n      \"Saving to: 'green_tripdata_2019-10.csv.gz'\\n\",\n      \"\\n\",\n      \"     0K .......... .......... .......... .......... ..........  0% 1.08M 7s\\n\",\n      \"    50K .......... .......... .......... .......... ..........  1% 2.93M 5s\\n\",\n      \"   100K .......... .......... .......... .......... ..........  1% 3.15M 4s\\n\",\n      \"   150K .......... .......... .......... .......... ..........  
2% 6.40M 3s\\n\",\n      \"   200K .......... .......... .......... .......... ..........  3% 5.41M 3s\\n\",\n      \"   250K .......... .......... .......... .......... ..........  3% 7.09M 3s\\n\",\n      \"   300K .......... .......... .......... .......... ..........  4% 4.84M 2s\\n\",\n      \"   350K .......... .......... .......... .......... ..........  4% 7.74M 2s\\n\",\n      \"   400K .......... .......... .......... .......... ..........  5% 20.4M 2s\\n\",\n      \"   450K .......... .......... .......... .......... ..........  6% 10.9M 2s\\n\",\n      \"   500K .......... .......... .......... .......... ..........  6% 5.03M 2s\\n\",\n      \"   550K .......... .......... .......... .......... ..........  7%  139M 2s\\n\",\n      \"   600K .......... .......... .......... .......... ..........  8% 11.8M 2s\\n\",\n      \"   650K .......... .......... .......... .......... ..........  8%  333M 1s\\n\",\n      \"   700K .......... .......... .......... .......... ..........  9% 6.83M 1s\\n\",\n      \"   750K .......... .......... .......... .......... ..........  9% 14.7M 1s\\n\",\n      \"   800K .......... .......... .......... .......... .......... 10% 4.41M 1s\\n\",\n      \"   850K .......... .......... .......... .......... .......... 11% 6.43M 1s\\n\",\n      \"   900K .......... .......... .......... .......... .......... 11%  292M 1s\\n\",\n      \"   950K .......... .......... .......... .......... .......... 12% 2.94M 1s\\n\",\n      \"  1000K .......... .......... .......... .......... .......... 13%  372M 1s\\n\",\n      \"  1050K .......... .......... .......... .......... .......... 13%  166M 1s\\n\",\n      \"  1100K .......... .......... .......... .......... .......... 14% 8.69M 1s\\n\",\n      \"  1150K .......... .......... .......... .......... .......... 14%  269M 1s\\n\",\n      \"  1200K .......... .......... .......... .......... .......... 15% 22.0M 1s\\n\",\n      \"  1250K .......... .......... .......... .......... 
.......... 16% 2.57M 1s\\n\",\n      \"  1300K .......... .......... .......... .......... .......... 16% 69.2M 1s\\n\",\n      \"  1350K .......... .......... .......... .......... .......... 17% 4.57M 1s\\n\",\n      \"  1400K .......... .......... .......... .......... .......... 17% 65.4M 1s\\n\",\n      \"  1450K .......... .......... .......... .......... .......... 18%  180M 1s\\n\",\n      \"  1500K .......... .......... .......... .......... .......... 19% 5.49M 1s\\n\",\n      \"  1550K .......... .......... .......... .......... .......... 19%  114M 1s\\n\",\n      \"  1600K .......... .......... .......... .......... .......... 20% 7.88M 1s\\n\",\n      \"  1650K .......... .......... .......... .......... .......... 21% 6.59M 1s\\n\",\n      \"  1700K .......... .......... .......... .......... .......... 21% 73.7M 1s\\n\",\n      \"  1750K .......... .......... .......... .......... .......... 22% 14.9M 1s\\n\",\n      \"  1800K .......... .......... .......... .......... .......... 22% 4.31M 1s\\n\",\n      \"  1850K .......... .......... .......... .......... .......... 23% 1.87M 1s\\n\",\n      \"  1900K .......... .......... .......... .......... .......... 24% 92.4M 1s\\n\",\n      \"  1950K .......... .......... .......... .......... .......... 24% 49.0M 1s\\n\",\n      \"  2000K .......... .......... .......... .......... .......... 25% 13.5M 1s\\n\",\n      \"  2050K .......... .......... .......... .......... .......... 26% 6.24M 1s\\n\",\n      \"  2100K .......... .......... .......... .......... .......... 26% 67.6M 1s\\n\",\n      \"  2150K .......... .......... .......... .......... .......... 27% 79.1M 1s\\n\",\n      \"  2200K .......... .......... .......... .......... .......... 27% 4.86M 1s\\n\",\n      \"  2250K .......... .......... .......... .......... .......... 28% 94.8M 1s\\n\",\n      \"  2300K .......... .......... .......... .......... .......... 29% 4.48M 1s\\n\",\n      \"  2350K .......... .......... .......... 
.......... .......... 29% 7.86M 1s\\n\",\n      \"  2400K .......... .......... .......... .......... .......... 30% 27.3M 1s\\n\",\n      \"  2450K .......... .......... .......... .......... .......... 30% 3.10M 1s\\n\",\n      \"  2500K .......... .......... .......... .......... .......... 31% 64.7M 1s\\n\",\n      \"  2550K .......... .......... .......... .......... .......... 32% 82.8M 1s\\n\",\n      \"  2600K .......... .......... .......... .......... .......... 32% 10.8M 1s\\n\",\n      \"  2650K .......... .......... .......... .......... .......... 33% 90.0M 1s\\n\",\n      \"  2700K .......... .......... .......... .......... .......... 34% 5.29M 1s\\n\",\n      \"  2750K .......... .......... .......... .......... .......... 34% 56.3M 1s\\n\",\n      \"  2800K .......... .......... .......... .......... .......... 35% 5.53M 1s\\n\",\n      \"  2850K .......... .......... .......... .......... .......... 35%  135M 1s\\n\",\n      \"  2900K .......... .......... .......... .......... .......... 36% 3.52M 1s\\n\",\n      \"  2950K .......... .......... .......... .......... .......... 37% 34.8M 1s\\n\",\n      \"  3000K .......... .......... .......... .......... .......... 37% 9.28M 1s\\n\",\n      \"  3050K .......... .......... .......... .......... .......... 38%  155M 1s\\n\",\n      \"  3100K .......... .......... .......... .......... .......... 39% 4.57M 1s\\n\",\n      \"  3150K .......... .......... .......... .......... .......... 39% 57.5M 1s\\n\",\n      \"  3200K .......... .......... .......... .......... .......... 40%  182M 1s\\n\",\n      \"  3250K .......... .......... .......... .......... .......... 40% 3.73M 1s\\n\",\n      \"  3300K .......... .......... .......... .......... .......... 41% 83.8M 1s\\n\",\n      \"  3350K .......... .......... .......... .......... .......... 42%  191M 1s\\n\",\n      \"  3400K .......... .......... .......... .......... .......... 42% 3.88M 1s\\n\",\n      \"  3450K .......... .......... 
.......... .......... .......... 43% 40.2M 1s\\n\",\n      \"  3500K .......... .......... .......... .......... .......... 43% 5.15M 1s\\n\",\n      \"  3550K .......... .......... .......... .......... .......... 44% 48.2M 1s\\n\",\n      \"  3600K .......... .......... .......... .......... .......... 45%  146M 1s\\n\",\n      \"  3650K .......... .......... .......... .......... .......... 45% 3.83M 1s\\n\",\n      \"  3700K .......... .......... .......... .......... .......... 46%  103M 1s\\n\",\n      \"  3750K .......... .......... .......... .......... .......... 47%  152M 1s\\n\",\n      \"  3800K .......... .......... .......... .......... .......... 47%  544M 1s\\n\",\n      \"  3850K .......... .......... .......... .......... .......... 48% 5.68M 0s\\n\",\n      \"  3900K .......... .......... .......... .......... .......... 48%  232M 0s\\n\",\n      \"  3950K .......... .......... .......... .......... .......... 49% 2.19M 0s\\n\",\n      \"  4000K .......... .......... .......... .......... .......... 50% 8.45M 0s\\n\",\n      \"  4050K .......... .......... .......... .......... .......... 50% 45.0M 0s\\n\",\n      \"  4100K .......... .......... .......... .......... .......... 51% 4.58M 0s\\n\",\n      \"  4150K .......... .......... .......... .......... .......... 52%  117M 0s\\n\",\n      \"  4200K .......... .......... .......... .......... .......... 52% 19.5M 0s\\n\",\n      \"  4250K .......... .......... .......... .......... .......... 53%  102M 0s\\n\",\n      \"  4300K .......... .......... .......... .......... .......... 53% 2.69M 0s\\n\",\n      \"  4350K .......... .......... .......... .......... .......... 54% 83.6M 0s\\n\",\n      \"  4400K .......... .......... .......... .......... .......... 55%  121M 0s\\n\",\n      \"  4450K .......... .......... .......... .......... .......... 55% 9.85M 0s\\n\",\n      \"  4500K .......... .......... .......... .......... .......... 56%  102M 0s\\n\",\n      \"  4550K .......... 
.......... .......... .......... .......... 57%  261M 0s\\n\",\n      \"  4600K .......... .......... .......... .......... .......... 57% 1.84M 0s\\n\",\n      \"  4650K .......... .......... .......... .......... .......... 58% 6.32M 0s\\n\",\n      \"  4700K .......... .......... .......... .......... .......... 58% 49.2M 0s\\n\",\n      \"  4750K .......... .......... .......... .......... .......... 59% 10.8M 0s\\n\",\n      \"  4800K .......... .......... .......... .......... .......... 60% 5.01M 0s\\n\",\n      \"  4850K .......... .......... .......... .......... .......... 60%  271M 0s\\n\",\n      \"  4900K .......... .......... .......... .......... .......... 61%  115M 0s\\n\",\n      \"  4950K .......... .......... .......... .......... .......... 61% 5.14M 0s\\n\",\n      \"  5000K .......... .......... .......... .......... .......... 62% 50.3M 0s\\n\",\n      \"  5050K .......... .......... .......... .......... .......... 63% 3.50M 0s\\n\",\n      \"  5100K .......... .......... .......... .......... .......... 63%  160M 0s\\n\",\n      \"  5150K .......... .......... .......... .......... .......... 64% 15.1M 0s\\n\",\n      \"  5200K .......... .......... .......... .......... .......... 65%  306M 0s\\n\",\n      \"  5250K .......... .......... .......... .......... .......... 65%  202M 0s\\n\",\n      \"  5300K .......... .......... .......... .......... .......... 66%  164M 0s\\n\",\n      \"  5350K .......... .......... .......... .......... .......... 66% 7.69M 0s\\n\",\n      \"  5400K .......... .......... .......... .......... .......... 67% 8.07M 0s\\n\",\n      \"  5450K .......... .......... .......... .......... .......... 68% 75.0M 0s\\n\",\n      \"  5500K .......... .......... .......... .......... .......... 68% 5.82M 0s\\n\",\n      \"  5550K .......... .......... .......... .......... .......... 69% 4.58M 0s\\n\",\n      \"  5600K .......... .......... .......... .......... .......... 
70% 6.70M 0s\\n\",\n      \"  5650K .......... .......... .......... .......... .......... 70% 34.4M 0s\\n\",\n      \"  5700K .......... .......... .......... .......... .......... 71%  281M 0s\\n\",\n      \"  5750K .......... .......... .......... .......... .......... 71% 11.8M 0s\\n\",\n      \"  5800K .......... .......... .......... .......... .......... 72% 65.4M 0s\\n\",\n      \"  5850K .......... .......... .......... .......... .......... 73% 54.6M 0s\\n\",\n      \"  5900K .......... .......... .......... .......... .......... 73% 2.49M 0s\\n\",\n      \"  5950K .......... .......... .......... .......... .......... 74% 94.0M 0s\\n\",\n      \"  6000K .......... .......... .......... .......... .......... 74%  307M 0s\\n\",\n      \"  6050K .......... .......... .......... .......... .......... 75%  263M 0s\\n\",\n      \"  6100K .......... .......... .......... .......... .......... 76%  288M 0s\\n\",\n      \"  6150K .......... .......... .......... .......... .......... 76% 8.37M 0s\\n\",\n      \"  6200K .......... .......... .......... .......... .......... 77% 3.78M 0s\\n\",\n      \"  6250K .......... .......... .......... .......... .......... 78% 98.7M 0s\\n\",\n      \"  6300K .......... .......... .......... .......... .......... 78% 2.62M 0s\\n\",\n      \"  6350K .......... .......... .......... .......... .......... 79%  157M 0s\\n\",\n      \"  6400K .......... .......... .......... .......... .......... 79%  424M 0s\\n\",\n      \"  6450K .......... .......... .......... .......... .......... 80% 3.23M 0s\\n\",\n      \"  6500K .......... .......... .......... .......... .......... 81% 30.9M 0s\\n\",\n      \"  6550K .......... .......... .......... .......... .......... 81%  452M 0s\\n\",\n      \"  6600K .......... .......... .......... .......... .......... 82% 8.21M 0s\\n\",\n      \"  6650K .......... .......... .......... .......... .......... 83% 5.23M 0s\\n\",\n      \"  6700K .......... .......... .......... .......... 
.......... 83% 9.57M 0s\\n\",\n      \"  6750K .......... .......... .......... .......... .......... 84% 3.61M 0s\\n\",\n      \"  6800K .......... .......... .......... .......... .......... 84% 93.1M 0s\\n\",\n      \"  6850K .......... .......... .......... .......... .......... 85% 4.97M 0s\\n\",\n      \"  6900K .......... .......... .......... .......... .......... 86% 41.2M 0s\\n\",\n      \"  6950K .......... .......... .......... .......... .......... 86%  494M 0s\\n\",\n      \"  7000K .......... .......... .......... .......... .......... 87% 5.51M 0s\\n\",\n      \"  7050K .......... .......... .......... .......... .......... 87%  158M 0s\\n\",\n      \"  7100K .......... .......... .......... .......... .......... 88% 5.97M 0s\\n\",\n      \"  7150K .......... .......... .......... .......... .......... 89% 79.3M 0s\\n\",\n      \"  7200K .......... .......... .......... .......... .......... 89% 65.0M 0s\\n\",\n      \"  7250K .......... .......... .......... .......... .......... 90% 4.07M 0s\\n\",\n      \"  7300K .......... .......... .......... .......... .......... 91% 89.6M 0s\\n\",\n      \"  7350K .......... .......... .......... .......... .......... 91%  149M 0s\\n\",\n      \"  7400K .......... .......... .......... .......... .......... 92% 10.1M 0s\\n\",\n      \"  7450K .......... .......... .......... .......... .......... 92% 73.1M 0s\\n\",\n      \"  7500K .......... .......... .......... .......... .......... 93% 51.8M 0s\\n\",\n      \"  7550K .......... .......... .......... .......... .......... 94% 15.4M 0s\\n\",\n      \"  7600K .......... .......... .......... .......... .......... 94% 2.93M 0s\\n\",\n      \"  7650K .......... .......... .......... .......... .......... 95%  101M 0s\\n\",\n      \"  7700K .......... .......... .......... .......... .......... 96%  120M 0s\\n\",\n      \"  7750K .......... .......... .......... .......... .......... 96%  133M 0s\\n\",\n      \"  7800K .......... .......... .......... 
.......... .......... 97% 49.0M 0s\\n\",\n      \"  7850K .......... .......... .......... .......... .......... 97%  314M 0s\\n\",\n      \"  7900K .......... .......... .......... .......... .......... 98%  117M 0s\\n\",\n      \"  7950K .......... .......... .......... .......... .......... 99% 9.48M 0s\\n\",\n      \"  8000K .......... .......... .......... .......... .......... 99% 2.76M 0s\\n\",\n      \"  8050K .......... ........                                   100%  223M=0.9s\\n\",\n      \"\\n\",\n      \"2025-03-07 19:27:08 (9.10 MB/s) - 'green_tripdata_2019-10.csv.gz' saved [8262584/8262584]\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"id\": \"57fb14bf-f7f2-45a9-b918-d64203e5d802\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"id\": \"2b8b3ac1-e3fb-4713-9ccb-7c0fbfe4c017\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"C:\\\\Users\\\\alexe\\\\AppData\\\\Local\\\\Temp\\\\ipykernel_3424\\\\2667354967.py:1: DtypeWarning: Columns (3) have mixed types. 
Specify dtype option on import or set low_memory=False.\\n\",\n      \"  df = pd.read_csv('green_tripdata_2019-10.csv.gz')\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"df = pd.read_csv('green_tripdata_2019-10.csv.gz')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"id\": \"a0e8ab41-1520-46b1-b8fa-a3fedf170896\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>VendorID</th>\\n\",\n       \"      <th>lpep_pickup_datetime</th>\\n\",\n       \"      <th>lpep_dropoff_datetime</th>\\n\",\n       \"      <th>store_and_fwd_flag</th>\\n\",\n       \"      <th>RatecodeID</th>\\n\",\n       \"      <th>PULocationID</th>\\n\",\n       \"      <th>DOLocationID</th>\\n\",\n       \"      <th>passenger_count</th>\\n\",\n       \"      <th>trip_distance</th>\\n\",\n       \"      <th>fare_amount</th>\\n\",\n       \"      <th>extra</th>\\n\",\n       \"      <th>mta_tax</th>\\n\",\n       \"      <th>tip_amount</th>\\n\",\n       \"      <th>tolls_amount</th>\\n\",\n       \"      <th>ehail_fee</th>\\n\",\n       \"      <th>improvement_surcharge</th>\\n\",\n       \"      <th>total_amount</th>\\n\",\n       \"      <th>payment_type</th>\\n\",\n       \"      <th>trip_type</th>\\n\",\n       \"      <th>congestion_surcharge</th>\\n\",\n       \"    
</tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2.0</td>\\n\",\n       \"      <td>2019-10-01 00:26:02</td>\\n\",\n       \"      <td>2019-10-01 00:39:58</td>\\n\",\n       \"      <td>N</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>112</td>\\n\",\n       \"      <td>196</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>5.88</td>\\n\",\n       \"      <td>18.0</td>\\n\",\n       \"      <td>0.50</td>\\n\",\n       \"      <td>0.5</td>\\n\",\n       \"      <td>0.00</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>0.3</td>\\n\",\n       \"      <td>19.30</td>\\n\",\n       \"      <td>2.0</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>2019-10-01 00:18:11</td>\\n\",\n       \"      <td>2019-10-01 00:22:38</td>\\n\",\n       \"      <td>N</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>43</td>\\n\",\n       \"      <td>263</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>0.80</td>\\n\",\n       \"      <td>5.0</td>\\n\",\n       \"      <td>3.25</td>\\n\",\n       \"      <td>0.5</td>\\n\",\n       \"      <td>0.00</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>0.3</td>\\n\",\n       \"      <td>9.05</td>\\n\",\n       \"      <td>2.0</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>2019-10-01 00:09:31</td>\\n\",\n       \"      <td>2019-10-01 00:24:47</td>\\n\",\n       \"      <td>N</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      
<td>255</td>\\n\",\n       \"      <td>228</td>\\n\",\n       \"      <td>2.0</td>\\n\",\n       \"      <td>7.50</td>\\n\",\n       \"      <td>21.5</td>\\n\",\n       \"      <td>0.50</td>\\n\",\n       \"      <td>0.5</td>\\n\",\n       \"      <td>0.00</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>0.3</td>\\n\",\n       \"      <td>22.80</td>\\n\",\n       \"      <td>2.0</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>2019-10-01 00:37:40</td>\\n\",\n       \"      <td>2019-10-01 00:41:49</td>\\n\",\n       \"      <td>N</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>181</td>\\n\",\n       \"      <td>181</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>0.90</td>\\n\",\n       \"      <td>5.5</td>\\n\",\n       \"      <td>0.50</td>\\n\",\n       \"      <td>0.5</td>\\n\",\n       \"      <td>0.00</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>0.3</td>\\n\",\n       \"      <td>6.80</td>\\n\",\n       \"      <td>2.0</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2.0</td>\\n\",\n       \"      <td>2019-10-01 00:08:13</td>\\n\",\n       \"      <td>2019-10-01 00:17:56</td>\\n\",\n       \"      <td>N</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>97</td>\\n\",\n       \"      <td>188</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>2.52</td>\\n\",\n       \"      <td>10.0</td>\\n\",\n       \"      <td>0.50</td>\\n\",\n       \"      <td>0.5</td>\\n\",\n       \"      <td>2.26</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      
<td>0.3</td>\\n\",\n       \"      <td>13.56</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>1.0</td>\\n\",\n       \"      <td>0.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag  \\\\\\n\",\n       \"0       2.0  2019-10-01 00:26:02   2019-10-01 00:39:58                  N   \\n\",\n       \"1       1.0  2019-10-01 00:18:11   2019-10-01 00:22:38                  N   \\n\",\n       \"2       1.0  2019-10-01 00:09:31   2019-10-01 00:24:47                  N   \\n\",\n       \"3       1.0  2019-10-01 00:37:40   2019-10-01 00:41:49                  N   \\n\",\n       \"4       2.0  2019-10-01 00:08:13   2019-10-01 00:17:56                  N   \\n\",\n       \"\\n\",\n       \"   RatecodeID  PULocationID  DOLocationID  passenger_count  trip_distance  \\\\\\n\",\n       \"0         1.0           112           196              1.0           5.88   \\n\",\n       \"1         1.0            43           263              1.0           0.80   \\n\",\n       \"2         1.0           255           228              2.0           7.50   \\n\",\n       \"3         1.0           181           181              1.0           0.90   \\n\",\n       \"4         1.0            97           188              1.0           2.52   \\n\",\n       \"\\n\",\n       \"   fare_amount  extra  mta_tax  tip_amount  tolls_amount  ehail_fee  \\\\\\n\",\n       \"0         18.0   0.50      0.5        0.00           0.0        NaN   \\n\",\n       \"1          5.0   3.25      0.5        0.00           0.0        NaN   \\n\",\n       \"2         21.5   0.50      0.5        0.00           0.0        NaN   \\n\",\n       \"3          5.5   0.50      0.5        0.00           0.0        NaN   \\n\",\n       \"4         10.0   0.50      0.5        2.26           0.0        NaN   \\n\",\n       \"\\n\",\n       
\"   improvement_surcharge  total_amount  payment_type  trip_type  \\\\\\n\",\n       \"0                    0.3         19.30           2.0        1.0   \\n\",\n       \"1                    0.3          9.05           2.0        1.0   \\n\",\n       \"2                    0.3         22.80           2.0        1.0   \\n\",\n       \"3                    0.3          6.80           2.0        1.0   \\n\",\n       \"4                    0.3         13.56           1.0        1.0   \\n\",\n       \"\\n\",\n       \"   congestion_surcharge  \\n\",\n       \"0                   0.0  \\n\",\n       \"1                   0.0  \\n\",\n       \"2                   0.0  \\n\",\n       \"3                   0.0  \\n\",\n       \"4                   0.0  \"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"id\": \"d085b583-1609-41a9-a222-ff6ca495ee27\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"columns = [\\n\",\n    \"    'lpep_pickup_datetime',\\n\",\n    \"    'lpep_dropoff_datetime',\\n\",\n    \"    'PULocationID',\\n\",\n    \"    'DOLocationID',\\n\",\n    \"    'passenger_count',\\n\",\n    \"    'trip_distance',\\n\",\n    \"    'tip_amount'\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"id\": \"66e9f47c-9284-4760-8011-3a8f48aaa49f\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"df = df[columns]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"id\": \"7ae3f843-d428-43d2-9e47-7f9fb43acbad\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"from time import time\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 14,\n   \"id\": \"f1ca1ac1-176c-4ccc-aa11-7e1cb5659d39\",\n   \"metadata\": {},\n   
\"outputs\": [],\n   \"source\": [\n    \"from tqdm.auto import tqdm\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"id\": \"0b3da4e1-2f1c-400f-bb67-82734c1193f4\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"messages = df.to_dict(orient='records')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"id\": \"3bdc95d8-64e1-4819-a885-996813b4bf94\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"476386\"\n      ]\n     },\n     \"execution_count\": 13,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"len(messages)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 15,\n   \"id\": \"d6f15929-e928-464d-afc1-690343f4f780\",\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"application/vnd.jupyter.widget-view+json\": {\n       \"model_id\": \"4dffdeb2a0064e1d9bd02dff9f9c49f0\",\n       \"version_major\": 2,\n       \"version_minor\": 0\n      },\n      \"text/plain\": [\n       \"  0%|          | 0/476386 [00:00<?, ?it/s]\"\n      ]\n     },\n     \"metadata\": {},\n     \"output_type\": \"display_data\"\n    }\n   ],\n   \"source\": [\n    \"topic_name = 'green-trips'\\n\",\n    \"\\n\",\n    \"for message in tqdm(messages):\\n\",\n    \"    producer.send(topic_name, value=message)\\n\",\n    \"\\n\",\n    \"producer.flush()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"id\": \"2409953c-e9dd-403d-a0d1-d8b883c23ef5\",\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": 
\"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.12.3\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 5\n}\n"
  },
  {
    "path": "cohorts/2025/06-streaming/homework.md",
    "content": "# Homework\n\nIn this homework, we're going to learn about streaming with PyFlink.\n\nInstead of Kafka, we will use Redpanda, which is a drop-in\nreplacement for Kafka. It implements the same interface, \nso we can use the Kafka library for Python for communicating\nwith it, as well as use the Kafka connector in PyFlink.\n\nFor this homework we will be using the Taxi data:\n- Green 2019-10 data from [here](https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz)\n\n\n## Setup\n\nWe need:\n\n- Redpanda\n- Flink Job Manager\n- Flink Task Manager\n- Postgres\n\nIt's the same setup as in the [pyflink module](../../../06-streaming/pyflink/), so go there and start docker-compose:\n\n```bash\ncd ../../../06-streaming/pyflink/\ndocker-compose up\n```\n\n(Add `-d` if you want to run in detached mode)\n\nVisit http://localhost:8081 to see the Flink Job Manager.\n\nConnect to Postgres with pgcli, pgAdmin, [DBeaver](https://dbeaver.io/) or any other tool.\n\nThe connection credentials are:\n\n- Username `postgres`\n- Password `postgres`\n- Database `postgres`\n- Host `localhost`\n- Port `5432`\n\nWith pgcli, you'll need to run this to connect:\n\n```bash\npgcli -h localhost -p 5432 -u postgres -d postgres\n```\n\nRun these queries to create the Postgres landing zone for the first events and windows:\n\n```sql \nCREATE TABLE processed_events (\n    test_data INTEGER,\n    event_timestamp TIMESTAMP\n);\n\nCREATE TABLE processed_events_aggregated (\n    event_hour TIMESTAMP,\n    test_data INTEGER,\n    num_hits INTEGER \n);\n```\n\n## Question 1: Redpanda version\n\nNow let's find out the version of Redpanda.\n\nFor that, check the output of the command `rpk help` _inside the container_. The name of the container is `redpanda-1`.\n\nFind out what you need to execute based on the `help` output.\n\nWhat's the version, based on the output of the command you executed? (copy the entire version)\n\n\n## Question 2: 
Creating a topic\n\nBefore we can send data to the Redpanda server, we\nneed to create a topic. We also do this with the `rpk`\ncommand we used previously for figuring out the version of \nRedpanda.\n\nRead the output of `help` and based on it, create a topic with name `green-trips` \n\nWhat's the output of the command for creating a topic? Include the entire output in your answer.\n\n\n## Question 3. Connecting to the Kafka server\n\nWe need to make sure we can connect to the server, so\nlater we can send some data to its topics\n\nFirst, let's install the Kafka client library (up to you if you\nwant to have a separate virtual environment for that)\n\n```bash\npip install kafka-python\n```\n\nYou can start a Jupyter notebook in your solution folder or\ncreate a script\n\nLet's try to connect to our server:\n\n```python\nimport json\n\nfrom kafka import KafkaProducer\n\ndef json_serializer(data):\n    return json.dumps(data).encode('utf-8')\n\nserver = 'localhost:9092'\n\nproducer = KafkaProducer(\n    bootstrap_servers=[server],\n    value_serializer=json_serializer\n)\n\nproducer.bootstrap_connected()\n```\n\nProvided that you can connect to the server, what's the output\nof the last command?\n\n## Question 4: Sending the Trip Data\n\nNow we need to send the data to the `green-trips` topic\n\nRead the data, and keep only these columns:\n\n* `'lpep_pickup_datetime',`\n* `'lpep_dropoff_datetime',`\n* `'PULocationID',`\n* `'DOLocationID',`\n* `'passenger_count',`\n* `'trip_distance',`\n* `'tip_amount'`\n\nNow send all the data by running this code for each row (`message`) in the dataset:\n\n```python\nproducer.send(topic_name, value=message)\n```\n\nIn this case, `message`\nis a dictionary.\n\nAfter sending all the messages, flush the data:\n\n```python\nproducer.flush()\n```\n\nUse `from time import time` to see the total time \n\n```python\nfrom time import time\n\nt0 = time()\n\n# ... 
your code\n\nt1 = time()\ntook = t1 - t0\n```\n\nHow much time did it take to send the entire dataset and flush? \n\n\n## Question 5: Build a Sessionization Window (2 points)\n\nNow we have the data in the Kafka stream. It's time to process it.\n\n* Copy `aggregation_job.py` and rename it to `session_job.py`\n* Have it read from `green-trips` fixing the schema\n* Use a [session window](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/) with a gap of 5 minutes\n* Use `lpep_dropoff_datetime` time as your watermark with a 5 second tolerance\n* Which pickup and drop off locations have the longest unbroken streak of taxi trips?\n\n\n## Submitting the solutions\n\n- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw6\n- Deadline: See the website\n\n"
  },
  {
    "path": "cohorts/2025/README.md",
    "content": "## Data Engineering Zoomcamp 2025 Cohort\n\n* [Pre-launch Q&A stream](https://www.youtube.com/watch?v=DPnAOu2csYA)\n* [Launch stream with course overview](https://www.youtube.com/watch?v=X8cEEwi8DTM)\n* [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html)\n* [Course Playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n* [Cohort-specific playlist: only 2025 Live videos](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJZdpLpRHp7dg6EOx828q6y)\n\n\n[**Module 1: Introduction & Prerequisites**](01-docker-terraform/)\n\n* [Homework](01-docker-terraform/homework.md)\n\n\n[**Module 2: Workflow Orchestration**](02-workflow-orchestration)\n\n* [Homework](02-workflow-orchestration/homework.md)\n* Office hours\n\n[**Workshop 1: Data Ingestion**](workshops/dlt/README.md)\n\n* Workshop with dlt\n* [Homework](workshops/dlt/README.md)\n\n\n[**Module 3: Data Warehouse**](03-data-warehouse)\n\n* [Homework](03-data-warehouse/homework.md)\n\n\n[**Module 4: Analytics Engineering**](04-analytics-engineering/)\n\n* [Homework](04-analytics-engineering/homework.md)\n\n\n[**Module 5: Batch processing**](05-batch/)\n\n* [Homework](05-batch/homework.md)\n\n\n[**Module 6: Stream Processing**](06-streaming)\n\n* [Homework](06-streaming/homework.md)\n\n\n[**Project**](project.md)\n\nMore information [here](project.md)\n"
  },
  {
    "path": "cohorts/2025/project.md",
    "content": "## Course Project\n\nThe goal of this project is to apply everything we learned\nin this course and build an end-to-end data pipeline.\n\nYou will have two attempts to submit your project. If you don't have \ntime to submit your project by the end of attempt #1 (you started the \ncourse late, you have vacation plans, life/work got in the way, etc.)\nor you fail your first attempt, \nthen you will have a second chance to submit your project as attempt\n#2. \n\nThere are only two attempts.\n\nRemember that to pass the project, you must evaluate 3 peers. If you don't do that,\nyour project can't be considered complete.\n\nTo find the projects assigned to you, use the peer review assignments link \nand find your hash in the first column. You will see three rows: you need to evaluate \neach of these projects. For each project, you need to submit the form once,\nso in total, you will make three submissions. \n\n\n### Submitting\n\n#### Project Attempt #1\n\n* Project: https://courses.datatalks.club/de-zoomcamp-2025/project/project1\n* Review: https://courses.datatalks.club/de-zoomcamp-2025/project/project1/eval\n\n#### Project Attempt #2\n\n* Project: https://courses.datatalks.club/de-zoomcamp-2025/project/project2\n* Review: https://courses.datatalks.club/de-zoomcamp-2025/project/project2/eval\n\n> **Important**: update your \"Certificate name\" here: https://courses.datatalks.club/de-zoomcamp-2025/enrollment -\nthis is what we will use when generating certificates for you.\n\n### Evaluation criteria\n\nSee [here](../../projects/README.md)\n"
  },
  {
    "path": "cohorts/2025/workshops/dlt/README.md",
    "content": "# Data ingestion with dlt\n\nHomework: [dlt_homework.md](dlt_homework.md)\n\n🎥 **Watch the workshop video**\n\n[![Watch the workshop video](https://markdown-videos-api.jorgenkh.no/youtube/pgJWP_xqO1g)](https://www.youtube.com/watch?v=pgJWP_xqO1g \"Watch the workshop video\")\n\nWelcome to this hands-on workshop, where you'll learn to build efficient and scalable data ingestion pipelines.\n\n### **What will you learn in this workshop?**  \n\nIn this workshop, you’ll learn the core skills required to build and manage data pipelines:  \n- **How to build robust, scalable, and self-maintaining pipelines**.  \n- **Best practices**, like built-in data governance, for ensuring clean and reliable data flows.  \n- **Incremental loading techniques** to refresh data quickly and cost-effectively.  \n- **How to build a Data Lake** with dlt.\n\nBy the end of this workshop, you'll be able to build data pipelines like a senior data engineer — quickly, concisely, and with best practices baked in.\n\n\n--- \n\n## 📂 Navigation & Resources\n\n- Workshop:\n  - [Workshop content](data_ingestion_workshop.md).\n  - [Workshop Colab Notebook](https://colab.research.google.com/drive/1FiAHNFenM8RyptyTPtDTfqPCi5W6KX_V?usp=sharing).\n- Homework:\n  - [Homework Markdown](dlt_homework.md).\n  - [Homework Colab Notebook](https://colab.research.google.com/drive/1plqdl33K_HkVx0E0nGJrrkEUssStQsW7).\n- 🌐 [Official dlt Documentation](https://dlthub.com/docs/intro).\n- 💬 Join our [Slack Community](https://dlthub.com/community).\n\n---\n\n## 📖 Course overview\nThis workshop is structured into three key parts:\n\n1️⃣ **[Extracting Data](data_ingestion_workshop.md#extracting-data)** – Learn scalable data extraction techniques.  \n2️⃣ **[Normalizing Data](data_ingestion_workshop.md#normalizing-data)** – Clean and structure data before loading.  \n3️⃣ **[Loading & Incremental Updates](data_ingestion_workshop.md#loading-data)** – Efficiently load and update data.  
\n\n📌 **Find the full course file here**: [Course File](data_ingestion_workshop.md)  \n\n---\n\n## 👩‍🏫 Teacher\n\nWelcome to the DataTalks.Club Data Engineering Zoomcamp data ingestion workshop!\n\nI'm Violetta Mishechkina, Solutions Engineer at dltHub. 👋\n- I’ve been working in the data field since 2018, with a background in machine learning.\n- I started as a Data Scientist, training ML models and neural networks.\n- Over time, I realized that in production, hitting the lowest RMSE isn’t as important as model size, infrastructure, and data quality - so I transitioned into MLOps.\n- A year ago, I joined dltHub’s Customer Success team and discovered dlt, a Python library that automates 90% of tedious data engineering tasks.\n- Now, I work closely with customers and partners to help them integrate and optimize dlt in production.\n- I also collaborate with our development team as the voice of the customer, ensuring our product meets real-world data engineering needs.\n- My experience across ML, MLOps, and data engineering gives me a practical, hands-on perspective on solving data challenges.\n\n---\n\n## Homework\n\n- [Homework Markdown](dlt_homework.md).\n- [Homework Colab Notebook](https://colab.research.google.com/drive/1plqdl33K_HkVx0E0nGJrrkEUssStQsW7).\n\n--- \n## Next steps\n\nAs you are learning the various concepts of data engineering, \nconsider creating a portfolio project that will further your own knowledge.\n\nBy demonstrating the ability to deliver end to end, you will have an easier time finding your first role. 
\nThis will help regardless of whether your hiring manager reviews your project, largely because you will have a better \nunderstanding and will be able to talk the talk.\n\nHere are some example projects that others did with dlt:\n- Serverless dlt-dbt on cloud functions: [Article](https://docs.getdbt.com/blog/serverless-dlt-dbt-stack)\n- Bird finder: [Part 1](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-i), [Part 2](https://publish.obsidian.md/lough-on-data/blogs/bird-finder-via-dlt-ii)\n- Event ingestion on GCP: [Article and repo](https://dlthub.com/docs/blog/streaming-pubsub-json-gcp)\n- Event ingestion on AWS: [Article and repo](https://dlthub.com/docs/blog/dlt-aws-taktile-blog)\n- Or see one of the many demos created by our working students: [Hacker news](https://dlthub.com/docs/blog/hacker-news-gpt-4-dashboard-demo), \n[GA4 events](https://dlthub.com/docs/blog/ga4-internal-dashboard-demo), \n[an E-Commerce](https://dlthub.com/docs/blog/postgresql-bigquery-metabase-demo), \n[Google Sheets](https://dlthub.com/docs/blog/google-sheets-to-data-warehouse-pipeline), \n[Motherduck](https://dlthub.com/docs/blog/dlt-motherduck-demo), \n[MongoDB + Holistics](https://dlthub.com/docs/blog/MongoDB-dlt-Holistics), \n[Deepnote](https://dlthub.com/docs/blog/deepnote-women-wellness-violence-tends), \n[Prefect](https://dlthub.com/docs/blog/dlt-prefect),\n[PowerBI vs GoodData vs Metabase](https://dlthub.com/docs/blog/semantic-modeling-tools-comparison),\n[Dagster](https://dlthub.com/docs/blog/dlt-dagster),\n[Ingesting events via GCP webhooks](https://dlthub.com/docs/blog/dlt-webhooks-on-cloud-functions-for-event-capture),\n[SAP to Snowflake replication](https://dlthub.com/docs/blog/sap-hana-to-snowflake-demo-blog),\n[Read emails and send summary to Slack with AI and Kestra](https://dlthub.com/docs/blog/dlt-kestra-demo-blog),\n[Mode +dlt capabilities](https://dlthub.com/docs/blog/dlt-mode-blog),\n[dbt on cloud 
functions](https://dlthub.com/docs/blog/dlt-dbt-runner-on-cloud-functions)\n- If you want to use dlt in your project, [check this list of public APIs](https://dlthub.com/docs/blog/practice-api-sources)\n\n\nIf you create a personal project, consider submitting it to our blog - we will be happy to showcase it. Just drop us a line in the dlt Slack.\n\n\n\n## **💛 If you enjoy dlt, support us!**  \n\n* ⭐ **Give us a [GitHub Star](https://github.com/dlt-hub/dlt)!**  \n* 💬 **Join our [Slack Community](https://dlthub.com/community)!**  \n* 🚀 **Let’s build great data pipelines together!**  \n\n---\n\n# Community notes\n\nDid you take notes? You can share them by creating a PR to this file!\n\n* [Ingest Data to GCS by dlt from peatwan](https://github.com/peatwan/de-zoomcamp/tree/main/workshop/dlt/homework/load_to_gcs)\n* Add your notes above this line"
  },
  {
    "path": "cohorts/2025/workshops/dlt/data_ingestion_workshop.md",
    "content": "# Data ingestion with dlt\n\n* Sign up: https://lu.ma/quyfn4q8 (optional) \n* Homework: [dlt_homework.md](dlt_homework.md)\n\n## **What is data ingestion?**  \nData ingestion is the process of **extracting** data from a source, transporting it to a suitable environment, and preparing it for use. This often includes **normalizing**, **cleaning**, and **adding metadata**.\n\n---\n\n### **“A wild dataset magically appears!”**  \n\nIn many data science teams, data seems to appear out of nowhere, because an engineer loads it.  \n\nFor example, the well-known **NYC Taxi dataset** looks well-structured and ready to use, making it easy to query and analyze. However, not all datasets arrive in such a clean format.\n\n- **Well-structured data** (with an explicit schema) can be used immediately.  \n  - Examples: Parquet, Avro, or database tables where data types and structures are predefined.  \n- **Unstructured or weakly typed data** (without a defined schema) often needs cleaning and formatting first.  \n  - Examples: CSV, JSON, where fields might be inconsistent, nested or missing key details.  \n\n💡 **What is a schema?**  \nA schema defines the expected format and structure of data, including field names, data types, and relationships.  \n\n---\n\n### **Be the Magician! 😎**  \n\nSince you're here to learn data engineering, **you** will be the one making datasets magically appear!  \n\nTo build effective pipelines, you need to master:  \n\n✅ **Extracting** data from various sources (APIs, databases, files).  \n✅ **Normalizing** data by transforming, cleaning, and defining schemas.  \n✅ **Loading** data where it can be used (data warehouse, lake, or database).\n\n---\n\n### **Why are data pipelines so amazing?**  \n\nData pipelines are the backbone of modern data-driven organizations, transforming raw, scattered data into actionable insights. 
\nThey ensure data flows seamlessly from its source to its final destination, where it can drive decision-making, analytics, and innovation. \nBut pipelines don’t just move data — they enable an entire ecosystem of functionality that makes them indispensable.  \n\n![pipes](img/pipes.jpg)\n\n### **What makes data pipelines so essential?**  \n\n1. **Collect**:  \n   Data pipelines gather information from a variety of sources, such as databases, data streams, and applications. This ensures no data is overlooked.  \n   - Example: Retrieving sales data from an online store or capturing user activity logs from an app.  \n\n2. **Ingest**:  \n   The collected data flows into an event queue, where it’s organized and prepared for the next steps.  \n   - **Structured data** (like Parquet files or database tables) can be processed immediately.  \n   - **Unstructured data** (like CSV or JSON files) often needs cleaning and normalization.  \n   - Example: Cleaning a JSON response by standardizing its fields or formatting dates in a CSV file.  \n\n3. **Store**:  \n   Pipelines send the processed data to **data lakes**, **data warehouses**, or **data lakehouses** for efficient storage and easy access.  \n   - Example: Storing marketing campaign data in a data warehouse to analyze its performance.  \n\n4. **Compute**:  \n   Data is processed either in **batches** (large chunks) or as **streams** (real-time updates) to make it ready for analysis.  \n   - Example: Calculating monthly revenue or processing live stock market data.  \n\n5. **Consume**:  \n   Finally, the prepared data is delivered to users in forms they can act on:  \n   - **Dashboards** for executives and analysts.  \n   - **Self-service analytics tools** for teams exploring trends.  \n   - **Machine learning models** for predictions and automation.  \n\n---\n\n### **Why are data engineers so important in this process?**  \n\nData engineers are the architects behind these pipelines. 
They don’t just build pipelines—they make sure they’re reliable, efficient, and scalable. Beyond pipeline development, data engineers:  \n- **Optimize data storage** to keep costs low and performance high.  \n- **Ensure data quality and integrity**, addressing duplicates, inconsistencies, and missing values.  \n- **Implement governance** for secure, compliant, and well-managed data.  \n- **Adapt data architectures** to meet the changing needs of the organization.  \n\nUltimately, their role is to strategically manage the entire **data lifecycle**, from collection to consumption.\n\n---\n\n### **What will you learn in this workshop?**  \n\nIn this workshop, you’ll learn the core skills required to build and manage data pipelines:  \n- **How to build robust, scalable, and self-maintaining pipelines**.  \n- **Best practices**, like built-in data governance, for ensuring clean and reliable data flows.  \n- **Incremental loading techniques** to refresh data quickly and cost-effectively.  \n- **How to build a Data Lake** with dlt.\n\nBy the end, you’ll not only understand why data pipelines are amazing, but you’ll also know how to create them with best practices to power your organization’s data-driven success.🚀\n\n---\n## **Extracting data**\n\nMost of the data you’ll work with is stored behind an **API**, which is like a doorway to the data. Here are the most common types:  \n\n- **RESTful APIs**: Provide records of data from business applications.  \n  - Example: Getting a list of customers from a CRM system.  \n- **File-based APIs**: Return secure file paths to bulk data like JSON or Parquet files stored in buckets.  \n  - Example: Downloading monthly sales reports.  \n- **Database APIs**: Connect to databases like MongoDB or SQL, often returning data as JSON, the most common interchange format.  
\n\nAs an engineer, you will need to build pipelines that “just work”.\n\nSo here’s what you need to consider on extraction, to prevent the pipelines from breaking, and to keep them running smoothly:  \n\n1. **Hardware limits**: Be mindful of memory (RAM) and storage (disk space). Overloading these can crash your system.  \n2. **Network reliability**: Networks can fail! Always account for retries to make your pipelines more robust.  \n   - Tip: Use libraries like `dlt` that have built-in retry mechanisms.  \n3. **API rate limits**: APIs often restrict the number of requests you can make in a given time.  \n   - Tip: Check the API documentation to understand its limits (e.g., [Zendesk](https://developer.zendesk.com/api-reference/introduction/rate-limits/), [Shopify](https://shopify.dev/docs/api/usage/rate-limits)).  \n\nThere are even more challenges to consider when working with APIs — such as **pagination and authentication**. Let’s explore how to handle these effectively when working with **REST APIs**.\n\n### **Working with REST APIs**\n\nREST APIs (Representational State Transfer APIs) are one of the most common ways to extract data. They allow you to retrieve structured data using simple HTTP requests. However, working with APIs comes with its own challenges.\n\n#### **Common Challenges**\n\n![rest_api](img/Rest_API.png)\n\n#### **1. Rate limits**  \nMany APIs **limit the number of requests** you can make within a certain time frame to prevent overloading their servers. If you exceed this limit, the API may **reject your requests** temporarily or even block you for a period.  \n\nTo avoid hitting these limits, we can:  \n- **Monitor API rate limits** – Some APIs provide headers that tell you how many requests you have left.  \n- **Pause requests when needed** – If we're close to the limit, we wait before making more requests.  \n- **Implement automatic retries** – If a request fails due to rate limiting, we can wait and retry after some time.  
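
The wait-and-retry behaviour described in the list above can be sketched in a few lines. This is a minimal illustration, not how dlt or any particular API client implements it: `fetch_with_retry` and the `fetch` callable are hypothetical names, the API is simulated so the snippet runs without network access, and the `Retry-After` value is assumed to be given in seconds (it can also be an HTTP date, which is not handled here).

```python
import time


def fetch_with_retry(fetch, max_attempts=5, default_wait=1.0, sleep=time.sleep):
    '''Call fetch() until it stops returning HTTP 429 or attempts run out.

    fetch is a hypothetical zero-argument callable that returns a
    (status_code, headers, body) tuple; it stands in for a real HTTP call.
    '''
    for attempt in range(1, max_attempts + 1):
        status, headers, body = fetch()
        if status != 429:  # not rate limited, hand the result back
            return body
        if attempt == max_attempts:
            break
        # Prefer the server's Retry-After hint (assumed to be seconds)
        wait = float(headers.get('Retry-After', default_wait))
        sleep(wait)
    raise RuntimeError('still rate limited after %d attempts' % max_attempts)


# Simulated API: rejects the first two calls, then succeeds
responses = iter([
    (429, {'Retry-After': '2'}, None),
    (429, {}, None),
    (200, {}, [{'id': 1}]),
])
waits = []  # record the requested waits instead of actually sleeping
result = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

In a real pipeline, `fetch` would wrap something like `requests.get(...)` and return the status code, the response headers, and the parsed body.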
\n\n💡Some APIs provide a **retry-after** header, which tells you how long to wait before making another request. Always check the API documentation for best practices!\n\n---\n\n#### **2. Authentication**  \nMany APIs require an **API key or token** to access data securely. Without authentication, requests may be limited or denied.  \n\n🔐 **Types of Authentication in APIs:**  \n- **API Keys** – A simple token included in the request header or URL.  \n- **OAuth Tokens** – A more secure authentication method requiring user authorization.  \n- **Basic Authentication** – Using a username and password (less common today).  \n\n💡 Never share your API token publicly! Store it in environment variables or use a secure secrets manager.\n\n----\n#### **3. Pagination**\n\nMany APIs return data in **chunks (or pages)** rather than sending everything at once. This prevents **overloading the server** and improves performance, especially for large datasets. To retrieve **all the data**, we need to make multiple requests and keep track of pages until we reach the last one.\n\n📌 Example:\n\n>In this example, we’ll request data from an API that serves the **NYC taxi dataset**.\n\nFor these purposes we created an API that can serve the data you are already familiar with. 
The API returns **1,000 records per page**, and we must request multiple pages to retrieve the full dataset.\n\n```py\nimport requests\n\nBASE_API_URL = \"https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api\"\n\npage_number = 1\nwhile True:\n    params = {'page': page_number}\n    response = requests.get(BASE_API_URL, params=params)\n    response.raise_for_status()  # fail early on HTTP errors\n    page_data = response.json()\n\n    # an empty page means there is no more data\n    if not page_data:\n        break\n\n    print(page_data)\n    page_number += 1\n\n    # limit the number of pages for testing\n    if page_number > 2:\n        break\n```\nWhat happens here:\n- Starts at page 1 and makes a GET request to the API.\n- Retrieves JSON data and checks if the page contains records.\n- If data exists, prints it and moves to the next page.\n- If the page is empty, stops requesting more data.\n- Stops after two pages, because the example caps the page count for testing.\n\n💡 Different APIs handle pagination differently (some use offsets, cursors, or tokens instead of page numbers). Always check the API documentation for the correct method!\n\n---\n\n#### **4. Avoiding memory issues during extraction**  \n\nTo prevent your pipeline from crashing, you need to control memory usage.  \n\n#### **Challenges with memory**  \n- Many pipelines run on systems with limited memory, like serverless functions or shared clusters.  \n- If you try to load all the data into memory at once, it can crash the entire system.  \n- Even disk space can become an issue if you’re storing large amounts of data.  \n\n\n#### **The solution: streaming data**  \n\n**Streaming** means processing data in small chunks or events, rather than loading everything at once. 
This keeps memory usage low and ensures your pipeline remains efficient.\n\nAs a data engineer, you’ll use streaming to transfer data between buffers, such as:  \n- from APIs to local files;  \n- from Webhooks to event queues;  \n- from Event queues (like Kafka) to storage buckets.\n\n---\n\n### **Example of extracting data: Grabbing data from an API**\n\nIn this example, we’ll request data from an API that serves the **NYC taxi dataset**. For these purposes we created an API that can serve the data you are already familiar with.\n\n#### **API documentation**:  \n- **Data**: Comes in pages of 1,000 records.  \n- **Pagination**: When there’s no more data, the API returns an empty page.  \n- **Details**:  \n  - **Method**: GET  \n  - **URL**: `https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api`  \n  - **Parameters**:  \n    - `page`: Integer (page number), defaults to 1.  \n\nHere’s how we design our requester:  \n1. **Request page by page** until we hit an empty page. Since we don’t know how much data is behind the API, we must assume it could be as little as 1,000 records or as much as 10GB.\n2. **Use a generator** to handle this efficiently and avoid loading all data into memory.  
\n\n\n```py\nimport requests\n\nBASE_API_URL = \"https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api\"\n\ndef paginated_getter():\n    page_number = 1\n    while True:\n        params = {'page': page_number}\n        response = requests.get(BASE_API_URL, params=params)\n        response.raise_for_status()\n        page_json = response.json()\n        print(f'Got page {page_number} with {len(page_json)} records')\n\n        if page_json:\n            yield page_json\n            page_number += 1\n        else:\n            break\n\n\nfor page_data in paginated_getter():\n    print(page_data)\n```\n\nIn this approach to grabbing data from APIs, there are both pros and cons:  \n\n✅ Pros: **Easy memory management** since the API returns data in small pages or events.  \n❌ Cons: **Low throughput** because data transfer is limited by API constraints (rate limits, response time).\n\n\nTo simplify data extraction, use specialized tools that follow best practices like streaming — for example, [dlt (data load tool)](https://dlthub.com). It efficiently processes data while **keeping memory usage low** and **leveraging parallelism** for better performance.\n\n### **Extracting data with dlt**\n\nExtracting data from APIs manually requires handling\n- **pagination**,\n- **rate limits**,\n- **authentication**,\n- **errors**.\n\nInstead of writing custom scripts, **[dlt](https://dlthub.com/)** simplifies the process with a built-in **[REST API Client](https://dlthub.com/docs/general-usage/http/rest-client)**, making extraction **efficient, scalable, and reliable**.  \n\n---\n\n### **Why use dlt for extraction?**  \n\n✅ **Built-in REST API support** – Extract data from APIs with minimal code.  \n✅ **Automatic pagination handling** – No need to loop through pages manually.  \n✅ **Manages Rate Limits & Retries** – Prevents exceeding API limits and handles failures.  
\n✅ **Streaming support** – Extracts and processes data without loading everything into memory.  \n✅ **Seamless integration** – Works with **normalization and loading** in a single pipeline.  \n\n![dlt](img/dlt.png)\n\n### **Install dlt**\n\n[Install](https://dlthub.com/docs/reference/installation) dlt with DuckDB as destination:\n\n```shell\npip install dlt[duckdb]\n```\n\n### **Example of extracting data with dlt**  \n\nInstead of manually writing pagination logic, let’s use **dlt’s [`RESTClient` helper](https://dlthub.com/docs/general-usage/http/rest-client)** to extract NYC taxi ride data:  \n```py\nimport dlt\nfrom dlt.sources.helpers.rest_client import RESTClient\nfrom dlt.sources.helpers.rest_client.paginators import PageNumberPaginator\n\n\ndef paginated_getter():\n    client = RESTClient(\n        base_url=\"https://us-central1-dlthub-analytics.cloudfunctions.net\",\n        # Define pagination strategy - page-based pagination\n        paginator=PageNumberPaginator(   # <--- Pages are numbered (1, 2, 3, ...)\n            base_page=1,   # <--- Start from page 1\n            total_path=None    # <--- No total count of pages provided by API, pagination should stop when a page contains no result items\n        )\n    )\n\n    for page in client.paginate(\"data_engineering_zoomcamp_api\"):    # <--- API endpoint for retrieving taxi ride data\n        yield page   # remember about memory management and yield data\n\nfor page_data in paginated_getter():\n    print(page_data)\n```\n\n**How dlt simplifies API extraction:**  \n\n🔹 **No manual pagination** – dlt **automatically** fetches **all pages** of data.  \n🔹 **Low memory usage** – Streams data **chunk by chunk**, avoiding RAM overflows.  \n🔹 **Handles rate limits & retries** – Ensures requests are sent efficiently **without failures**.  \n🔹 **Flexible destination support** – Load extracted data into **databases, warehouses, or data lakes**.\n\n---\n\nWell, you’ve successfully **extracted** the data — great! 
🎉 But raw data isn’t always ready to use. Now, you need to **process**, **clean**, and **structure** it before it can be loaded into a data lake or data warehouse.\n\n\n## **Normalizing data**\n\nYou often hear that data professionals spend most of their time **“cleaning” data** — but what does that actually mean?  \n\nData cleaning typically involves two key steps:  \n\n1. **Normalizing data** – Structuring and standardizing data **without changing its meaning**.  \n2. **Filtering data for a specific use case** – Selecting or modifying data **in a way that changes its meaning** to fit the analysis.\n\n### **Data cleaning: more than just fixing errors**  \n\nA big part of **data cleaning** is actually **metadata work** — ensuring data is structured and standardized so it can be used effectively.  \n\n#### **Metadata tasks in data cleaning:**  \n\n✅ **Add types** – Convert strings to numbers, timestamps, etc.  \n✅ **Rename columns** – Ensure names follow a standard format (e.g., no special characters).  \n✅ **Flatten nested dictionaries** – Bring values from nested dictionaries into the top-level row.  \n✅ **Unnest lists/arrays** – Convert lists into **child tables** since they can’t be stored directly in a flat format.  \n\n👉 **We’ll look at a practical example next, as these concepts are easier to understand with real data.**\n\n---\n\n### **Why prepare data? Why not use JSON directly?**  \n\nWhile JSON is a great format for **data transfer**, it’s not ideal for analysis. Here’s why:  \n\n❌ **No enforced schema** – We don’t always know what fields exist in a JSON document.  \n❌ **Inconsistent data types** – A field like `age` might appear as `25`, `\"twenty five\"`, or `25.00`, which can break downstream applications.  \n❌ **Hard to process** – If we need to group data by day, we must manually convert date strings to timestamps.  
\n❌ **Memory-heavy** – JSON requires reading the entire file into memory, unlike databases or columnar formats that allow scanning just the necessary fields.  \n❌ **Slow for aggregation and search** – JSON is not optimized for quick lookups or aggregations like columnar formats (e.g., Parquet).  \n\n\nJSON is great for **data exchange** but **not for direct analytical use**. To make data useful, we need to **normalize it** — flattening, typing, and structuring it for efficiency.\n\n---\n\n### **Normalization example**  \n\nTo understand what we’re working with, let’s look at a sample record from our API:\n\n```py\nitem = page_data[0]\nitem\n```\nOutput:\n```json\n{'End_Lat': 40.742963,\n 'End_Lon': -73.980072,\n 'Fare_Amt': 45.0,\n 'Passenger_Count': 1,\n 'Payment_Type': 'Credit',\n 'Rate_Code': None,\n 'Start_Lat': 40.641525,\n 'Start_Lon': -73.787442,\n 'Tip_Amt': 9.0,\n 'Tolls_Amt': 4.15,\n 'Total_Amt': 58.15,\n 'Trip_Distance': 17.52,\n 'Trip_Dropoff_DateTime': '2009-06-14 23:48:00',\n 'Trip_Pickup_DateTime': '2009-06-14 23:23:00',\n 'mta_tax': None,\n 'store_and_forward': None,\n 'surcharge': 0.0,\n 'vendor_name': 'VTS'}\n```\n\nThe data we retrieved from the API has **already been processed and unnested**, meaning that any **nested structures** (like dictionaries and lists) have been flattened, making it easier to store and query in a database or a dataframe. However, let’s imagine we originally received the **raw data** in a more complex format.\n\n---\n\n### **How was this data processed?**  \n\nBefore reaching this format, the raw data likely contained **nested structures** that had to be **flattened and transformed**.  
\n\n1️⃣ **Flattened nested coordinates:**  \n   - Originally, the latitude and longitude values might have been nested like this:  \n     ```json\n     \"coordinates\": {\n         \"start\": {\"lat\": 40.641525, \"lon\": -73.787442},\n         \"end\": {\"lat\": 40.742963, \"lon\": -73.980072}\n     }\n     ```\n   - These were **flattened** into `Start_Lat`, `Start_Lon`, `End_Lat`, and `End_Lon`.  \n\n2️⃣ **Converted timestamps:**  \n   - Originally, timestamps might have been stored as Unix timestamps or separate date/time fields:  \n     ```json\n     \"Trip_Pickup\": {\"date\": \"2009-06-14\", \"time\": \"23:23:00\"}\n     ```\n   - Now, they are **formatted as ISO datetime strings**:  \n     ```json\n     \"Trip_Pickup_DateTime\": \"2009-06-14 23:23:00\"\n     ```\n\n3️⃣ **Unnested passenger & payment information:**  \n   - The original structure might have included a nested list for passengers:  \n     ```json\n     \"passengers\": [\n         {\"name\": \"John\", \"rating\": 4.9},\n         {\"name\": \"Jack\", \"rating\": 3.9}\n     ]\n     ```\n   - Since lists **cannot be stored directly in a database table**, they were likely **moved to a separate table**.\n\n💡 **However, real-world data is rarely this clean!** We often receive raw, nested, and inconsistent data. This is why the **normalization process** is so important—it **prepares** the data for efficient storage and analysis.  \n**[dlt (data load tool)](https://dlthub.com/docs/intro)** simplifies the **normalization process**, automatically transforming raw data into a **structured, clean format** that is ready for storage and analysis.\n\n---\n\n### **Normalizing data with dlt**  \n\n**Why use dlt for normalization?**  \n\n✅ **Automatically detects schema** – No need to define column types manually.  \n✅ **Flattens nested JSON** – Converts complex structures into table-ready formats.  \n✅ **Handles data type conversion** – Converts dates, numbers, and booleans correctly.  
\n✅ **Splits lists into child tables** – Ensures relational integrity for better analysis.  \n✅ **Schema evolution support** – Adapts to changes in data structure over time.  \n\n---\n\n### **Example**  \n\nLet's assume we extracted the following raw NYC taxi ride data, which contains **nested dictionaries** and **lists**:\n\n```py\ndata = [\n    {\n        \"vendor_name\": \"VTS\",\n        \"record_hash\": \"b00361a396177a9cb410ff61f20015ad\",\n        \"time\": {\n            \"pickup\": \"2009-06-14 23:23:00\",\n            \"dropoff\": \"2009-06-14 23:48:00\"\n        },\n        \"coordinates\": {\n            \"start\": {\"lon\": -73.787442, \"lat\": 40.641525},\n            \"end\": {\"lon\": -73.980072, \"lat\": 40.742963}\n        },\n        \"passengers\": [\n            {\"name\": \"John\", \"rating\": 4.9},\n            {\"name\": \"Jack\", \"rating\": 3.9}\n        ]\n    }\n]\n```\n\n### **How dlt normalizes this data automatically**  \n\nInstead of manually flattening fields and extracting nested lists, we can **load it directly into dlt**:\n\n```py\nimport dlt\n\n# Define a dlt pipeline with automatic normalization\npipeline = dlt.pipeline(\n    pipeline_name=\"ny_taxi_data\",\n    destination=\"duckdb\",\n    dataset_name=\"taxi_rides\",\n)\n\n# Run the pipeline with raw nested data\ninfo = pipeline.run(data, table_name=\"rides\", write_disposition=\"replace\")\n\n# Print the load summary\nprint(info)\n\nprint(pipeline.last_trace)\n```\n\n---\n\n### **What happens behind the scenes?**  \n\nAfter running this pipeline, dlt automatically **transforms the data** into the following **normalized structure**:  \n\n**Main table: `rides`**  \n\n```py\npipeline.dataset(dataset_type=\"default\").rides.df()\n```\n\n| vendor_name | record_hash                         | time__pickup              | time__dropoff             | coordinates__start__lon | coordinates__start__lat | coordinates__end__lon | coordinates__end__lat | _dlt_load_id      | _dlt_id        
|\n|-------------|------------------------------------|---------------------------|---------------------------|-------------------------|-------------------------|-----------------------|-----------------------|-------------------|---------------|\n| VTS         | b00361a396177a9cb410ff61f20015ad  | 2009-06-14 23:23:00+00:00 | 2009-06-14 23:48:00+00:00 | -73.787442              | 40.641525               | -73.980072            | 40.742963            | 1738604244.2625916 | k+bnoLuti245ag |\n  \n\nThis table **displays structured taxi ride data**, including **vendor details, timestamps, coordinates, and dlt metadata**. \n\n**Child Table: `rides_passengers`** \n\n```py\npipeline.dataset(dataset_type=\"default\").rides__passengers.df()\n```\n\n| name  | rating | _dlt_parent_id    | _dlt_list_idx | _dlt_id        |\n|-------|--------|------------------|--------------|---------------|\n| John  | 4.9    | k+bnoLuti245ag    | 0            | 8ppDh+8gQ7SSHg |\n| Jack  | 3.9    | k+bnoLuti245ag    | 1            | oQnWuvkgHhxlaA |\n\n\n✅ **Nested structures were flattened** into separate columns.  \n✅ **Lists were extracted into child tables**, preserving relationships.  \n✅ **Timestamps were converted to the correct format.**  \n\n---\n\n### **Why dlt makes normalization easy**  \n\n🔹  **No manual transformations needed** – Just load the raw data, and dlt does the rest!  \n🔹 **Database-ready format** – Ensures clean, structured tables for easy querying.  \n🔹 **Handles schema evolution** – Adapts to new fields automatically.  \n🔹 **Scales effortlessly** – Works for small datasets and enterprise-scale pipelines.  \n\n💡 With dlt, normalization happens automatically, so you can focus on insights instead of data wrangling.\n\n---\n\n## **Loading data**\n\nNow that we’ve covered **extracting** and **normalizing** data, the final step is **loading** the data **into a destination**. 
This is where the processed data is stored, making it ready for querying, analysis, or further transformations.\n\n\n### **How data loading happens without dlt**  \n\nWithout dlt, data engineers have to manually handle **schema validation, batch processing, error handling, and retries** for every destination. This process becomes especially complex when loading data into **data warehouses and data lakes**, where performance optimization, partitioning, and incremental updates are critical.\n\n### **Example: Loading data into a database without dlt**  \nA basic pipeline requires:  \n1. Setting up a database connection.  \n2. Creating tables and defining schemas.  \n3. Handling schema changes manually.  \n4. Writing queries to insert/update data.\n\n```py\nimport duckdb\n\n# 1. Create a connection to a file-backed DuckDB database (ny_taxi_manual.db)\nconn = duckdb.connect(\"ny_taxi_manual.db\")\n\n# 2. Create the rides Table\n# Since our dataset has nested structures, we must manually flatten it before inserting data.\nconn.execute(\"\"\"\nCREATE TABLE IF NOT EXISTS rides (\n    record_hash TEXT PRIMARY KEY,\n    vendor_name TEXT,\n    pickup_time TIMESTAMP,\n    dropoff_time TIMESTAMP,\n    start_lon DOUBLE,\n    start_lat DOUBLE,\n    end_lon DOUBLE,\n    end_lat DOUBLE\n);\n\"\"\")\n\n# 3. 
Insert Data Manually\n# Since JSON data has nested fields, we need to extract and transform them before inserting them into DuckDB.\ndata = [\n    {\n        \"vendor_name\": \"VTS\",\n        \"record_hash\": \"b00361a396177a9cb410ff61f20015ad\",\n        \"time\": {\n            \"pickup\": \"2009-06-14 23:23:00\",\n            \"dropoff\": \"2009-06-14 23:48:00\"\n        },\n        \"coordinates\": {\n            \"start\": {\"lon\": -73.787442, \"lat\": 40.641525},\n            \"end\": {\"lon\": -73.980072, \"lat\": 40.742963}\n        }\n    }\n]\n\n# Prepare data for insertion\nflattened_data = [\n    (\n        ride[\"record_hash\"],\n        ride[\"vendor_name\"],\n        ride[\"time\"][\"pickup\"],\n        ride[\"time\"][\"dropoff\"],\n        ride[\"coordinates\"][\"start\"][\"lon\"],\n        ride[\"coordinates\"][\"start\"][\"lat\"],\n        ride[\"coordinates\"][\"end\"][\"lon\"],\n        ride[\"coordinates\"][\"end\"][\"lat\"]\n    )\n    for ride in data\n]\n\n# Insert into DuckDB\nconn.executemany(\"\"\"\nINSERT INTO rides (record_hash, vendor_name, pickup_time, dropoff_time, start_lon, start_lat, end_lon, end_lat)\nVALUES (?, ?, ?, ?, ?, ?, ?, ?)\n\"\"\", flattened_data)\n\nprint(\"Data successfully loaded into DuckDB!\")\n\n\n# 4. Query Data in DuckDB\n# Now that the data is loaded, we can query it using DuckDB’s SQL engine.\ndf = conn.execute(\"SELECT * FROM rides\").df()\n\nconn.close()\n```\n\nProblems without dlt:\n\n❌ **Schema management is manual** – If the schema changes, you need to update table structures manually.  \n❌ **No automatic retries** – If the network fails, data may be lost.  \n❌ **No incremental loading** – Every run reloads everything, making it slow and expensive.  
\n❌ **More code to maintain** – A simple pipeline quickly becomes complex.\n\n---\n\n### **How dlt handles the load step automatically**  \n\nWith dlt, loading data **requires just a few lines of code**: schema inference, error handling, and incremental updates are all handled automatically!\n\n### **Why use dlt for loading?**  \n\n✅ **Supports multiple destinations** – Load data into **BigQuery, Redshift, Snowflake, Postgres, DuckDB, Parquet (S3, GCS)** and more.  \n✅ **Optimized for performance** – Uses **batch loading, parallelism, and streaming** for fast and scalable data transfer.  \n✅ **Schema-aware** – Ensures that **column names, data types, and structures match** the destination’s requirements.  \n✅ **Incremental loading** – Avoids unnecessary reloading by **only inserting new or updated records**.  \n✅ **Resilience & retries** – Automatically handles failures, ensuring data is loaded **without missing records**.\n\n![dlt](img/dlt.png)\n\n### **Example: Loading data into a database with dlt**\n\n\n\nTo use the full power of dlt, it is better to wrap our API client in the `@dlt.resource` decorator, which denotes a logical grouping of data within a data source, typically holding data of similar structure and origin:\n\n```py\nimport dlt\nfrom dlt.sources.helpers.rest_client import RESTClient\nfrom dlt.sources.helpers.rest_client.paginators import PageNumberPaginator\n\n\n# Define the API resource for NYC taxi data\n@dlt.resource(name=\"rides\")   # <--- The name of the resource (will be used as the table name)\ndef ny_taxi():\n    client = RESTClient(\n        base_url=\"https://us-central1-dlthub-analytics.cloudfunctions.net\",\n        paginator=PageNumberPaginator(\n            base_page=1,\n            total_path=None\n        )\n    )\n\n    for page in client.paginate(\"data_engineering_zoomcamp_api\"):    # <--- API endpoint for retrieving taxi ride data\n        yield page   # <--- yield data to manage memory\n\n\n# define new dlt pipeline\npipeline = 
dlt.pipeline(destination=\"duckdb\")\n\n# run the pipeline with the new resource\nload_info = pipeline.run(ny_taxi, write_disposition=\"replace\")\nprint(load_info)\n\n# explore loaded data\npipeline.dataset(dataset_type=\"default\").rides.df()\n```\n\n**Done!** The data is now stored in **DuckDB**, with schema managed automatically!\n\n---\n### **Incremental Loading**  \n\nIncremental loading allows us to update datasets by **loading only new or changed data**, instead of replacing the entire dataset. This makes pipelines **faster and more cost-effective** by reducing redundant data processing.  \n\n\n### **How does incremental loading work?**  \n\nIncremental loading works alongside two key concepts:  \n\n- **Incremental extraction** – Only extracts the new or modified data rather than retrieving everything again.  \n- **State tracking** – Keeps track of what has already been loaded, ensuring that only new data is processed.  \n\nIn dlt, **state** is stored in a **separate table** at the destination, allowing pipelines to track what has been processed.\n\n🔹 **Want to learn more?** You can read about incremental extraction and state management in the [dlt documentation](https://dlthub.com/docs).  \n\n---\n\n### **Incremental loading methods in dlt**  \n\ndlt provides two ways to load data incrementally:  \n\n#### **1. Append (adding new records)**  \n\n- Best for **immutable or stateless data**, such as taxi ride records.  \n- Each run **adds new records** without modifying previous data.  \n- Can also be used to create a **history of changes** (slowly changing dimensions).  \n\n**Example:**  \n- If taxi ride data is loaded daily, only **new rides** are added, rather than reloading the full history.  \n- If tracking changes in a list of vehicles, **each version** is stored as a new row for auditing.  \n\n---\n\n#### **2. Merge (updating existing records)**  \n\n- Best for **updating existing records** (stateful data).  
\n- Replaces old records with updated ones based on a **unique key**.  \n- Useful for tracking **status changes**, such as payment updates.  \n\n**Example:**  \n- A taxi ride's **payment status** could change from `\"booked\"` to `\"cancelled\"`, requiring an update.  \n- A **customer profile** might be updated with a new email or phone number.  \n\n---\n\n### **Choosing between Append and Merge**  \n\n| **Scenario**                      | **Use Append** | **Use Merge** |\n|-----------------------------------|--------------|--------------|\n| Immutable records (e.g., ride history) | ✅ Yes         | ❌ No        |\n| Tracking historical changes (slowly changing dimensions) | ✅ Yes         | ❌ No        |\n| Updating existing records (e.g., payment status) | ❌ No         | ✅ Yes        |\n| Keeping full change history       | ✅ Yes         | ❌ No        |\n\n\n### **Example: Incremental loading with dlt**\n\n**The goal**: download only trips made after June 15, 2009, skipping the old ones.\n\nUsing `dlt`, we set up an [incremental filter](https://dlthub.com/docs/general-usage/incremental-loading%23incremental-loading-with-a-cursor-field) to only fetch trips made after a certain date:\n\n```python\ncursor_date = dlt.sources.incremental(\"Trip_Dropoff_DateTime\", initial_value=\"2009-06-15\")\n```\n\nThis tells `dlt`:\n- **Start date**: June 15, 2009 (`initial_value`).\n- **Field to track**: `Trip_Dropoff_DateTime` (our timestamp).\n\nAs you run the pipeline repeatedly, `dlt` will keep track of the latest `Trip_Dropoff_DateTime` value processed. 
It will skip records older than this date in future runs.\n\nLet's make the data resource incremental using `dlt.sources.incremental`:\n\n```py\nimport dlt\nfrom dlt.sources.helpers.rest_client import RESTClient\nfrom dlt.sources.helpers.rest_client.paginators import PageNumberPaginator\n\n\n@dlt.resource(name=\"rides\", write_disposition=\"append\")\ndef ny_taxi(\n    cursor_date=dlt.sources.incremental(\n        \"Trip_Dropoff_DateTime\",   # <--- field to track, our timestamp\n        initial_value=\"2009-06-15\",   # <--- start date June 15, 2009\n        )\n    ):\n    client = RESTClient(\n        base_url=\"https://us-central1-dlthub-analytics.cloudfunctions.net\",\n        paginator=PageNumberPaginator(\n            base_page=1,\n            total_path=None\n        )\n    )\n\n    for page in client.paginate(\"data_engineering_zoomcamp_api\"):\n        yield page\n```\n\nFinally, we run our pipeline and load the fresh taxi rides data:\n\n```py\n# define new dlt pipeline\npipeline = dlt.pipeline(pipeline_name=\"ny_taxi\", destination=\"duckdb\", dataset_name=\"ny_taxi_data\")\n\n# run the pipeline with the new resource\nload_info = pipeline.run(ny_taxi)\nprint(pipeline.last_trace)\n```\n\n\nOnly 5325 rows passed the filter and were loaded into the `duckdb` destination. 
Let's take a look at the earliest date in the loaded data:\n\n```py\nwith pipeline.sql_client() as client:\n    res = client.execute_sql(\n            \"\"\"\n            SELECT\n            MIN(trip_dropoff_date_time)\n            FROM rides;\n            \"\"\"\n        )\n    print(res)\n```\n\nRun the same pipeline again.\n\n```py\n# define new dlt pipeline\npipeline = dlt.pipeline(pipeline_name=\"ny_taxi\", destination=\"duckdb\", dataset_name=\"ny_taxi_data\")\n\n\n# run the pipeline with the new resource\nload_info = pipeline.run(ny_taxi)\nprint(pipeline.last_trace)\n```\n\nThe pipeline will detect that there are **no new records** based on the `Trip_Dropoff_DateTime` field and the incremental cursor. As a result, **no new data will be loaded** into the destination:\n>0 load package(s) were loaded\n\n\n💡 **With dlt, incremental loading is simple, scalable, and automatic!**\n\n---\n\n### **Example: Loading data into a Data Warehouse (BigQuery)**  \nFirst, install the dependencies, define the source, then change the destination name and run the pipeline.\n\n```shell\npip install dlt[bigquery]\n```\n\nLet's use our NY Taxi API and load data from the source into destination.\n\n```py\nimport dlt\nfrom dlt.sources.helpers.rest_client import RESTClient\nfrom dlt.sources.helpers.rest_client.paginators import PageNumberPaginator\n\n\n@dlt.resource(name=\"rides\", write_disposition=\"replace\")\ndef ny_taxi():\n    client = RESTClient(\n        base_url=\"https://us-central1-dlthub-analytics.cloudfunctions.net\",\n        paginator=PageNumberPaginator(\n            base_page=1,\n            total_path=None\n        )\n    )\n\n    for page in client.paginate(\"data_engineering_zoomcamp_api\"):\n        yield page\n```\n\n\n**Choosing a destination**\n\nSwitching between  **data warehouses (BigQuery, Snowflake, Redshift)** or **data lakes (S3, Google Cloud Storage, Parquet files)**  in dlt is incredibly straightforward — simply modify the `destination` parameter in 
your pipeline configuration. \n\nFor example:\n\n```py\npipeline = dlt.pipeline(\n    pipeline_name='taxi_data',\n    destination='duckdb', # <--- to test pipeline locally\n    dataset_name='taxi_rides',\n)\n\npipeline = dlt.pipeline(\n    pipeline_name='taxi_data',\n    destination='bigquery', # <--- to run pipeline in production\n    dataset_name='taxi_rides',\n)\n```\n\nThis flexibility allows you to easily transition from local development to production-grade environments.\n\n> 💡 No need to rewrite your pipeline — dlt adapts automatically!\n\n**Set Credentials**  \n\nThe next logical step is to [set credentials](https://dlthub.com/docs/general-usage/credentials/) using **dlt's TOML providers** or **environment variables (ENVs)**.\n\n```py\nimport os\nfrom google.colab import userdata\n\nos.environ[\"DESTINATION__BIGQUERY__CREDENTIALS\"] = userdata.get('BIGQUERY_CREDENTIALS')\n```\n\nRun the pipeline:\n```py\npipeline = dlt.pipeline(\n    pipeline_name=\"taxi_data\",\n    destination=\"bigquery\",\n    dataset_name=\"taxi_rides\",\n    dev_mode=True,\n)\n\ninfo = pipeline.run(ny_taxi)\nprint(info)\n```\n\n💡 **What’s different?**  \n- **dlt automatically adapts the schema** to fit BigQuery.  \n- **Partitioning & clustering** can be applied for performance optimization.  \n- **Efficient batch loading** ensures scalability.\n\n---\n\n### **Example: Loading data into a Data Lake (Parquet on Local FS or S3)**  \n\n**Why use a Data Lake?**  \n- **Cost-effective storage** – Cheaper than traditional databases.   \n- **Optimized for big data processing** – Works seamlessly with Spark, Databricks, and Presto.  \n- **Easy scalability** – Store petabytes of data efficiently.  
\n\n\nThe `filesystem` destination enables you to load data into **files stored locally** or in **cloud storage** solutions, making it an excellent choice for lightweight testing, prototyping, or file-based workflows.\n\nBelow is an **example** demonstrating how to use the `filesystem` destination to load data in **Parquet** format:\n\n* Step 1: Set up a local bucket or cloud directory for storing files\n\n```py\nimport os\n\nos.environ[\"BUCKET_URL\"] = \"/content\"\n```\n\n* Step 2: Define the data source (above)\n* Step 3: Run the pipeline\n\n```py\nimport dlt\n\n\npipeline = dlt.pipeline(\n    pipeline_name='fs_pipeline',\n    destination='filesystem', # <--- change destination to 'filesystem'\n    dataset_name='fs_data',\n)\n\nload_info = pipeline.run(ny_taxi, loader_file_format=\"parquet\") # <--- choose a file format: parquet, csv or jsonl\nprint(load_info)\n```\n\nLook at the files:\n\n```shell\n! ls fs_data/rides\n```\n\nLook at the loaded data:\n\n```py\n# explore loaded data\npipeline.dataset(dataset_type=\"default\").rides.df()\n```\n\n#### **Table formats: [Delta tables & Iceberg](https://dlthub.com/docs/dlt-ecosystem/destinations/delta-iceberg)**\n\ndlt supports writing **Delta** and **Iceberg** tables when using the `filesystem` destination.\n\n**How it works:**\n\ndlt uses the `deltalake` and `pyiceberg` libraries to write Delta and Iceberg tables, respectively. One or multiple Parquet files are prepared during the extract and normalize steps. 
In the load step, these Parquet files are exposed as an Arrow data structure and fed into `deltalake` or `pyiceberg`.\n\n```shell\n !pip install \"dlt[pyiceberg]\"\n```\n\n```py\npipeline = dlt.pipeline(\n    pipeline_name='fs_pipeline',\n    destination='filesystem', # <--- change destination to 'filesystem'\n    dataset_name='fs_iceberg_data',\n)\n\nload_info = pipeline.run(\n    ny_taxi,\n    loader_file_format=\"parquet\",\n    table_format=\"iceberg\",  # <--- choose a table format: delta or iceberg\n)\nprint(load_info)\n```\n\n💡**Note:**\n\nOpen source version of dlt supports basic functionality for **iceberg**, but the dltHub team is currently working on an **extended** and **more powerful** integration with iceberg.\n\n[Join the waiting list to learn more about dlt+ and Iceberg.](https://info.dlthub.com/waiting-list)\n\n\n---\n\n## **What’s Next?**  \n\n- **Try loading data into different [destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/)** – Test Postgres, Snowflake, or Parquet.  \n- **Experiment with [incremental loading](https://dlthub.com/docs/general-usage/incremental-loading)** – Load only new records for better efficiency.  \n- **Explore dlt’s [schema evolution](https://dlthub.com/docs/general-usage/schema-evolution)** – Automatically adjust to data structure changes.  \n- **Join our [Slack community](https://dlthub.com/community)** to share your progress!  \n\n\nWith **dlt’s automated load step**, you get **effortless, scalable, and resilient data loading**—so you can focus on insights instead of pipeline maintenance. 🚀\n\n---\n\n### Extra homework 💻\n* [Data ingestion with DLT to Bigquery from Sara Sabater](https://github.com/saraisab/Data_Engineer/blob/main/courses/DE_zoomcamp/Homework/DLT-Workshop/extra_homework/Data_ingestion_with_DLT_to_bigquery.ipynb).\n"
  },
  {
    "path": "cohorts/2025/workshops/dlt/dlt_homework.md",
    "content": "Original file is located at\n    https://colab.research.google.com/drive/1plqdl33K_HkVx0E0nGJrrkEUssStQsW7\n\n# **Workshop \"Data Ingestion with dlt\": Homework**\n\n---\n\n## **Dataset & API**\n\nWe’ll use **NYC Taxi data** via the same custom API from the workshop:\n\n🔹 **Base API URL:**  \n```\nhttps://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api\n```\n🔹 **Data format:** Paginated JSON (1,000 records per page).  \n🔹 **API Pagination:** Stop when an empty page is returned.\n\n## **Question 1: dlt Version**\n\n1. **Install dlt**:\n\n```\n!pip install dlt[duckdb]\n```\n\n> Or choose a different bracket—`bigquery`, `redshift`, etc.—if you prefer another primary destination. For this assignment, we’ll still do a quick test with DuckDB.\n\n2. **Check** the version:\n\n```\n!dlt --version\n```\n\nor:\n\n```py\nimport dlt\nprint(\"dlt version:\", dlt.__version__)\n```\n\nProvide the **version** you see in the output.\n\n## **Question 2: Define & Run the Pipeline (NYC Taxi API)**\n\nUse dlt to extract all pages of data from the API.\n\nSteps:\n\n1️⃣ Use the `@dlt.resource` decorator to define the API source.\n\n2️⃣ Implement automatic pagination using dlt's built-in REST client.\n\n3️⃣ Load the extracted data into DuckDB for querying.\n\n```py\nimport dlt\nfrom dlt.sources.helpers.rest_client import RESTClient\nfrom dlt.sources.helpers.rest_client.paginators import PageNumberPaginator\n\n\n# your code is here\n\n\npipeline = dlt.pipeline(\n    pipeline_name=\"ny_taxi_pipeline\",\n    destination=\"duckdb\",\n    dataset_name=\"ny_taxi_data\"\n)\n```\n\nLoad the data into DuckDB to test:\n```py\nload_info = pipeline.run(ny_taxi)\nprint(load_info)\n```\nStart a connection to your database using native `duckdb` connection and look what tables were generated:\"\"\"\n\n```py\nimport duckdb\nfrom google.colab import data_table\ndata_table.enable_dataframe_formatter()\n\n# A database '<pipeline_name>.duckdb' was created in 
the working directory, so just connect to it\n\n# Connect to the DuckDB database\nconn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n\n# Set search path to the dataset\nconn.sql(f\"SET search_path = '{pipeline.dataset_name}'\")\n\n# Describe the dataset\nconn.sql(\"DESCRIBE\").df()\n\n```\n\nHow many tables were created?\n\n* 2\n* 4\n* 6\n* 8\n\n## **Question 3: Explore the loaded data**\n\nInspect the table `rides`:\n\n```py\ndf = pipeline.dataset(dataset_type=\"default\").rides.df()\ndf\n```\n\nWhat is the total number of records extracted?\n\n* 2500\n* 5000\n* 7500\n* 10000\n\n## **Question 4: Trip Duration Analysis**\n\nRun the SQL query below to:\n\n* Calculate the average trip duration in minutes.\n\n```py\nwith pipeline.sql_client() as client:\n    res = client.execute_sql(\n            \"\"\"\n            SELECT\n            AVG(date_diff('minute', trip_pickup_date_time, trip_dropoff_date_time))\n            FROM rides;\n            \"\"\"\n        )\n    # Print the query result\n    print(res)\n```\n\nWhat is the average trip duration?\n\n* 12.3049\n* 22.3049\n* 32.3049\n* 42.3049\n\n## **Submitting the solutions**\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/workshop1\n\n## **Solution**\n\nWe will publish the solution here after the deadline.\n"
  },
  {
    "path": "cohorts/2025/workshops/dynamic_load_dlt.py",
    "content": "import json\nimport os\nimport toml\nimport requests\nimport dlt\nfrom dlt.sources.filesystem import filesystem, read_parquet\nfrom google.cloud import storage\nimport io\nimport pyarrow.parquet as pq\n\n# Load the TOML file\n# the TOML file should follow below format:\n#[credentials]\n#project_id = \"your project id\"\n#private_key = \"your sevice account key\"\n#client_email = \"email\"\nconfig = toml.load(\"./.dlt/secrets.toml\")\n\n# Set environment variables\nos.environ[\"CREDENTIALS__PROJECT_ID\"] = config[\"credentials\"][\"project_id\"]\nos.environ[\"CREDENTIALS__PRIVATE_KEY\"] = config[\"credentials\"][\"private_key\"]\nos.environ[\"CREDENTIALS__CLIENT_EMAIL\"] = config[\"credentials\"][\"client_email\"]\n\n# Function to generate URLs based on user input for the date range and trip color\ndef generate_urls(color, start_year, end_year, start_month, end_month):\n    base_url = \"https://d37ci6vzurychx.cloudfront.net/trip-data/\"\n    urls = []\n\n# Generate the list of URLs based on the specified date range and color\n\n    for year in range(start_year, end_year + 1):\n        for month in range(start_month, end_month + 1):\n            # Format the month to ensure two digits\n            month_str = f\"{month:02d}\"\n            url = f\"{base_url}{color}_tripdata_{year}-{month_str}.parquet\"\n            #https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2020-01.parquet\n            urls.append(url)\n\n    return urls\n\n# User input for time range and trip color\ncolor = input(\"Enter color (green, yellow): \").lower()  \nstart_year = int(input(\"Enter the start year (e.g., 2019): \"))\nend_year = int(input(\"Enter the end year (e.g., 2022): \"))\nstart_month = int(input(\"Enter the start month (1-12): \"))\nend_month = int(input(\"Enter the end month (1-12): \"))\n\n# Generate URLs based on user input\nurls = generate_urls(color, start_year, end_year, start_month, end_month)\n\n\n# Debug: Print generated 
URLs\nprint(\"Generated URLs:\")\nfor url in urls:\n    print(url)\n\n\ndlt_method = input(\"Choose loading method: 1 for GCS -> Bigquery, 2 for Direct Web -> Bigquery: \")\n\nif dlt_method == \"1\":\n\n    # Initialize GCS client\n    storage_client = storage.Client.from_service_account_json(\"gcs.json\")\n    bucket_name = input(\"Enter the GCS bucket name: \")  # Replace with your GCS bucket name\n    bucket = storage_client.bucket(bucket_name)\n\n    # Download files and upload them to GCS\n    gcs_files = []\n    for url in urls:\n        file_name = url.split(\"/\")[-1]  # Extract the file name from the URL\n        gcs_blob = bucket.blob(file_name)\n\n        print(f\"Downloading {url} and uploading to GCS as {file_name}\")\n        response = requests.get(url)\n        gcs_blob.upload_from_string(response.content)\n        gcs_files.append(f\"gs://{bucket_name}/{file_name}\")\n\n    @dlt.resource(name=\"rides\", write_disposition=\"replace\")\n    def parquet_source():\n        # Use filesystem to load files from GCS and apply read_parquet transformation\n        files = filesystem(bucket_url=f\"gs://{bucket_name}/\", file_glob=\"*.parquet\")\n        reader = (files | read_parquet()).with_name(\"tripdata\")\n\n        # Iterate through the rows from the reader and yield them\n        row_count = 0\n        for row in reader:\n            row_count += 1\n            yield row\n        print(f\"Total rows yielded: {row_count}\")\n\nelif dlt_method == \"2\":\n    # Alternative method: Streaming Parquet files directly from the web\n    @dlt.resource(name=\"ny_taxi_dlt\", write_disposition=\"replace\")\n    def paginated_getter():\n        for url in urls:\n            try:\n                with requests.get(url, stream=True) as response:\n                    response.raise_for_status()\n                    buffer = io.BytesIO()\n                    for chunk in response.iter_content(chunk_size=1024 * 1024):  # 1MB chunks\n                        
buffer.write(chunk)\n                    buffer.seek(0)\n                    table = pq.read_table(buffer)\n                    print(f'Got data from {url} with {table.num_rows} records')\n                    if table.num_rows > 0:\n                        yield table\n            except Exception as e:\n                print(f\"Failed to fetch data from {url}: {e}\")\n\n# Create the pipeline\npipeline = dlt.pipeline(\n    pipeline_name=\"test_taxi\",\n    dataset_name=input(\"Enter the dataset name: \"),\n    destination=\"bigquery\"\n   # dev_mode=True\n)\n\n# Run the pipeline with either method\nif dlt_method == \"1\":\n    info = pipeline.run(parquet_source())\nelif dlt_method == \"2\":\n    info = pipeline.run(paginated_getter())\nelse:\n    print(\"Invalid selection\")\n    exit()\n\nprint(info)"
  },
  {
    "path": "cohorts/2026/01-docker-terraform/homework.md",
    "content": "# Module 1 Homework: Docker & SQL\n\nIn this homework we'll prepare the environment and practice\nDocker and SQL\n\nWhen submitting your homework, you will also need to include\na link to your GitHub repository or other public code-hosting\nsite.\n\nThis repository should contain the code for solving the homework.\n\nWhen your solution has SQL or shell commands and not code\n(e.g. python files) file format, include them directly in\nthe README file of your repository.\n\n\n## Question 1. Understanding Docker images\n\nRun docker with the `python:3.13` image. Use an entrypoint `bash` to interact with the container.\n\nWhat's the version of `pip` in the image?\n\n- 25.3\n- 24.3.1\n- 24.2.1\n- 23.3.1\n\n\n## Question 2. Understanding Docker networking and docker-compose\n\nGiven the following `docker-compose.yaml`, what is the `hostname` and `port` that pgadmin should use to connect to the postgres database?\n\n```yaml\nservices:\n  db:\n    container_name: postgres\n    image: postgres:17-alpine\n    environment:\n      POSTGRES_USER: 'postgres'\n      POSTGRES_PASSWORD: 'postgres'\n      POSTGRES_DB: 'ny_taxi'\n    ports:\n      - '5433:5432'\n    volumes:\n      - vol-pgdata:/var/lib/postgresql/data\n\n  pgadmin:\n    container_name: pgadmin\n    image: dpage/pgadmin4:latest\n    environment:\n      PGADMIN_DEFAULT_EMAIL: \"pgadmin@pgadmin.com\"\n      PGADMIN_DEFAULT_PASSWORD: \"pgadmin\"\n    ports:\n      - \"8080:80\"\n    volumes:\n      - vol-pgadmin_data:/var/lib/pgadmin\n\nvolumes:\n  vol-pgdata:\n    name: vol-pgdata\n  vol-pgadmin_data:\n    name: vol-pgadmin_data\n```\n\n- postgres:5433\n- localhost:5432\n- db:5433\n- postgres:5432\n- db:5432\n\nIf multiple answers are correct, select any \n\n\n## Prepare the Data\n\nDownload the green taxi trips data for November 2025:\n\n```bash\nwget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet\n```\n\nYou will also need the dataset with zones:\n\n```bash\nwget 
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv\n```\n\n## Question 3. Counting short trips\n\nFor the trips in November 2025 (`lpep_pickup_datetime` between '2025-11-01' and '2025-12-01', exclusive of the upper bound), how many trips had a `trip_distance` of less than or equal to 1 mile?\n\n- 7,853\n- 8,007\n- 8,254\n- 8,421\n\n\n## Question 4. Longest trip for each day\n\nWhich was the pickup day with the longest trip distance? Only consider trips with `trip_distance` less than 100 miles (to exclude data errors).\n\nUse the pickup time for your calculations.\n\n- 2025-11-14\n- 2025-11-20\n- 2025-11-23\n- 2025-11-25\n\n\n## Question 5. Biggest pickup zone\n\nWhich was the pickup zone with the largest `total_amount` (sum of all trips) on November 18th, 2025?\n\n- East Harlem North\n- East Harlem South\n- Morningside Heights\n- Forest Hills\n\n\n## Question 6. Largest tip\n\nFor the passengers picked up in the zone named \"East Harlem North\" in November 2025, which was the drop-off zone that had the largest tip?\n\nNote: it's `tip`, not `trip`. We need the name of the zone, not the ID.\n\n- JFK Airport\n- Yorkville West\n- East Harlem North\n- LaGuardia Airport\n\n\n## Terraform\n\nIn this section of the homework we'll prepare the environment by creating resources in GCP with Terraform.\n\nInstall Terraform on your GCP VM, laptop, or GitHub Codespace.\nCopy the files from the course repo\n[here](../../../01-docker-terraform/terraform/terraform) to your VM/Laptop/GitHub Codespace.\n\nModify the files as necessary to create a GCP Bucket and BigQuery Dataset.\n\n\n## Question 7. Terraform Workflow\n\nWhich of the following sequences, respectively, describes the workflow for:\n1. Downloading the provider plugins and setting up the backend\n2. Generating proposed changes and auto-executing the plan\n3. 
Removing all resources managed by Terraform\n\nAnswers:\n- terraform import, terraform apply -y, terraform destroy\n- terraform init, terraform plan -auto-apply, terraform rm\n- terraform init, terraform run -auto-approve, terraform destroy\n- terraform init, terraform apply -auto-approve, terraform destroy\n- terraform import, terraform apply -y, terraform rm\n\n\n## Submitting the solutions\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw1\n\n\n## Learning in Public\n\nWe encourage everyone to share what they learned. This is called \"learning in public\".\n\n### Why learn in public?\n\n- Accountability: Sharing your progress creates commitment and motivation to continue\n- Feedback: The community can provide valuable suggestions and corrections\n- Networking: You'll connect with like-minded people and potential collaborators\n- Documentation: Your posts become a learning journal you can reference later\n- Opportunities: Employers and clients often discover talent through public learning\n\nYou can read more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).\n\nDon't worry about being perfect. Everyone starts somewhere, and people love following genuine learning journeys!\n\n### Example post for LinkedIn\n\n```\n🚀 Week 1 of Data Engineering Zoomcamp by @DataTalksClub complete!\n\nJust finished Module 1 - Docker & Terraform. 
Learned how to:\n\n✅ Containerize applications with Docker and Docker Compose\n✅ Set up PostgreSQL databases and write SQL queries\n✅ Build data pipelines to ingest NYC taxi data\n✅ Provision cloud infrastructure with Terraform\n\nHere's my homework solution: <LINK>\n\nFollowing along with this amazing free course - who else is learning data engineering?\n\nYou can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n\n### Example post for Twitter/X\n\n\n```\n🐳 Module 1 of Data Engineering Zoomcamp done!\n\n- Docker containers\n- Postgres & SQL\n- Terraform & GCP\n- NYC taxi data pipeline\n\nMy solution: <LINK>\n\nFree course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n\n\n"
  },
  {
    "path": "cohorts/2026/02-workflow-orchestration/homework.md",
    "content": "## Module 2 Homework\n\nATTENTION: At the end of the submission form, you will be required to include a link to your GitHub repository or other public code-hosting site. This repository should contain your code for solving the homework. If your solution includes code that is not in file format, please include these directly in the README file of your repository.\n\n> In case you don't get one option exactly, select the closest one \n\nFor the homework, we'll be working with the _green_ taxi dataset located here:\n\n`https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/green/download`\n\nTo get a `wget`-able link, use this prefix (note that the link itself gives 404):\n\n`https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/`\n\n### Assignment\n\nSo far in the course, we processed data for the year 2019 and 2020. Your task is to extend the existing flows to include data for the year 2021.\n\n![homework datasets](../../../02-workflow-orchestration/images/homework.png)\n\nAs a hint, Kestra makes that process really easy:\n1. You can leverage the backfill functionality in the [scheduled flow](../../../02-workflow-orchestration/flows/09_gcp_taxi_scheduled.yaml) to backfill the data for the year 2021. Just make sure to select the time period for which data exists i.e. from `2021-01-01` to `2021-07-31`. Also, make sure to do the same for both `yellow` and `green` taxi data (select the right service in the `taxi` input).\n2. Alternatively, run the flow manually for each of the seven months of 2021 for both `yellow` and `green` taxi data. Challenge for you: find out how to loop over the combination of Year-Month and `taxi`-type using `ForEach` task which triggers the flow for each combination using a `Subflow` task.\n\n### Quiz Questions\n\nComplete the quiz shown below. 
It's a set of 6 multiple-choice questions to test your understanding of workflow orchestration, Kestra, and ETL pipelines.\n\n1) Within the execution for `Yellow` Taxi data for the year `2020` and month `12`: what is the uncompressed file size (i.e. the output file `yellow_tripdata_2020-12.csv` of the `extract` task)?\n- 128.3 MiB\n- 134.5 MiB\n- 364.7 MiB\n- 692.6 MiB\n\n2) What is the rendered value of the variable `file` when the input `taxi` is set to `green`, `year` is set to `2020`, and `month` is set to `04` during execution?\n- `{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv`\n- `green_tripdata_2020-04.csv`\n- `green_tripdata_04_2020.csv`\n- `green_tripdata_2020.csv`\n\n3) How many rows are there for the `Yellow` Taxi data for all CSV files in the year 2020?\n- 13,537,299\n- 24,648,499\n- 18,324,219\n- 29,430,127\n\n4) How many rows are there for the `Green` Taxi data for all CSV files in the year 2020?\n- 5,327,301\n- 936,199\n- 1,734,051\n- 1,342,034\n\n5) How many rows are there for the `Yellow` Taxi data for the March 2021 CSV file?\n- 1,428,092\n- 706,911\n- 1,925,152\n- 2,561,031\n\n6) How would you set the timezone to New York in a Schedule trigger?\n- Add a `timezone` property set to `EST` in the `Schedule` trigger configuration\n- Add a `timezone` property set to `America/New_York` in the `Schedule` trigger configuration\n- Add a `timezone` property set to `UTC-5` in the `Schedule` trigger configuration\n- Add a `location` property set to `New_York` in the `Schedule` trigger configuration\n\n## Submitting the solutions\n\n* Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw2\n* Check the link above to see the due date\n\n## Solution\n\nWill be added after the due date\n\n\n## Learning in Public\n\nWe encourage everyone to share what they learned. 
This is called \"learning in public\".\n\nRead more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).\n\n### Example post for LinkedIn\n\n```\n🚀 Week 2 of Data Engineering Zoomcamp by @DataTalksClub and @Will Russell complete!\n\nJust finished Module 2 - Workflow Orchestration with @Kestra. Learned how to:\n\n✅ Orchestrate data pipelines with Kestra flows\n✅ Use variables and expressions for dynamic workflows\n✅ Implement backfill for historical data\n✅ Schedule workflows with timezone support\n✅ Process NYC taxi data (Yellow & Green) for 2019-2021\n\nBuilt ETL pipelines that extract, transform, and load taxi trip data automatically!\n\nThanks to the @Kestra team for the great orchestration tool!\n\nHere's my homework solution: <LINK>\n\nFollowing along with this amazing free course - who else is learning data engineering?\n\nYou can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n\n### Example post for Twitter/X\n\n```\nModule 2 of DE Zoomcamp by @DataTalksClub @wrussell1999 done!\n\n- @kestra_io workflow orchestration\n- ETL pipelines for taxi data\n- Backfill & scheduling\n- Variables & dynamic flows\n\nMy solution: <LINK>\n\nJoin me here: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n"
  },
  {
    "path": "cohorts/2026/03-data-warehouse/DLT_upload_to_GCP.ipynb",
    "content": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"aC2QnhmKxpq1\"\n      },\n      \"source\": [\n        \"**Please set up your credentials JSON as GCP_CREDENTIALS secrets**\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"id\": \"UsUZobVduL7l\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"import os\\n\",\n        \"from google.colab import userdata\\n\",\n        \"\\n\",\n        \"os.environ[\\\"DESTINATION__CREDENTIALS\\\"] = userdata.get(\\\"GCP_CREDENTIALS\\\")\\n\",\n        \"os.environ[\\\"BUCKET_URL\\\"] = \\\"gs://your_bucket_url\\\"\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 1,\n      \"metadata\": {\n        \"id\": \"mPBzsEgyjsBo\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"# Install for production\\n\",\n        \"%%capture\\n\",\n        \"!pip install dlt[bigquery, gs]\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 1,\n      \"metadata\": {\n        \"id\": \"evdUsDNbkCTk\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"# Install for testing\\n\",\n        \"%%capture\\n\",\n        \"!pip install dlt[duckdb]\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": 2,\n      \"metadata\": {\n        \"id\": \"lYh7r1mTf4uo\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"import dlt\\n\",\n        \"import requests\\n\",\n        \"import pandas as pd\\n\",\n        \"from dlt.destinations import filesystem\\n\",\n        \"from io import BytesIO\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"76zT1PzAgs7A\"\n      },\n      \"source\": [\n        \"Ingesting parquet files to GCS.\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      
\"execution_count\": null,\n      \"metadata\": {\n        \"id\": \"xya0215jsnsb\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"# Define a dlt source to download and process Parquet files as resources\\n\",\n        \"@dlt.source(name=\\\"rides\\\")\\n\",\n        \"def download_parquet():\\n\",\n        \"    prefix = \\\"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata\\\"\\n\",\n        \"    for month in range(1, 7):\\n\",\n        \"        file_name = f\\\"yellow_tripdata_2024-0{month}.parquet\\\"\\n\",\n        \"        url = f\\\"{prefix}_2024-0{month}.parquet\\\"\\n\",\n        \"        response = requests.get(url)\\n\",\n        \"        response.raise_for_status()  # Stop early if the download failed\\n\",\n        \"\\n\",\n        \"        df = pd.read_parquet(BytesIO(response.content))\\n\",\n        \"\\n\",\n        \"        # Yield the dataframe as a dlt resource for ingestion\\n\",\n        \"        yield dlt.resource(df, name=file_name)\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"# Initialize the pipeline\\n\",\n        \"pipeline = dlt.pipeline(\\n\",\n        \"    pipeline_name=\\\"rides_pipeline\\\",\\n\",\n        \"    destination=filesystem(layout=\\\"{schema_name}/{table_name}.{ext}\\\"),\\n\",\n        \"    dataset_name=\\\"rides_dataset\\\",\\n\",\n        \")\\n\",\n        \"\\n\",\n        \"# Run the pipeline to write Parquet files to the filesystem destination (the bucket set in BUCKET_URL)\\n\",\n        \"load_info = pipeline.run(download_parquet(), loader_file_format=\\\"parquet\\\")\\n\",\n        \"\\n\",\n        \"# Print the results\\n\",\n        \"print(load_info)\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"S0310FT-gy_P\"\n      },\n      \"source\": [\n        \"Ingesting data into a database\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"1_3K97w1c2v2\",\n        \"outputId\": 
\"4b2d26bf-2814-46fa-f80d-7a2e17417a95\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"# Define a dlt resource to download and process Parquet files as single table\\n\",\n        \"@dlt.resource(name=\\\"rides\\\", write_disposition=\\\"replace\\\")\\n\",\n        \"def download_parquet():\\n\",\n        \"    prefix = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata'\\n\",\n        \"\\n\",\n        \"    for month in range(1, 7):\\n\",\n        \"        url = f\\\"{prefix}_2024-0{month}.parquet\\\"\\n\",\n        \"        response = requests.get(url)\\n\",\n        \"\\n\",\n        \"        df = pd.read_parquet(BytesIO(response.content))\\n\",\n        \"\\n\",\n        \"        yield df\\n\",\n        \"\\n\",\n        \"\\n\",\n        \"# Initialize the pipeline\\n\",\n        \"pipeline = dlt.pipeline(\\n\",\n        \"    pipeline_name=\\\"rides_pipeline\\\",\\n\",\n        \"    destination=\\\"duckdb\\\",  # Use DuckDB for testing\\n\",\n        \"    # destination=\\\"bigquery\\\",  # Use BigQuery for production\\n\",\n        \"    dataset_name=\\\"rides_dataset\\\",\\n\",\n        \")\\n\",\n        \"\\n\",\n        \"# Run the pipeline to load Parquet data into DuckDB\\n\",\n        \"info = pipeline.run(download_parquet)\\n\",\n        \"\\n\",\n        \"# Print the results\\n\",\n        \"print(info)\\n\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"gDcLjzLtooBV\",\n        \"outputId\": \"74ff2de7-2f2e-41b9-a681-3dc5887f6eed\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"import duckdb\\n\",\n        \"\\n\",\n        \"conn = duckdb.connect(f\\\"{pipeline.pipeline_name}.duckdb\\\")\\n\",\n        \"\\n\",\n        \"# Set search path to the dataset\\n\",\n        \"conn.sql(f\\\"SET search_path = 
'{pipeline.dataset_name}'\\\")\\n\",\n        \"\\n\",\n        \"# Describe the dataset to see loaded tables\\n\",\n        \"res = conn.sql(\\\"DESCRIBE\\\").df()\\n\",\n        \"print(res)\"\n      ]\n    },\n    {\n      \"cell_type\": \"code\",\n      \"execution_count\": null,\n      \"metadata\": {\n        \"colab\": {\n          \"base_uri\": \"https://localhost:8080/\"\n        },\n        \"id\": \"VVJy8JoerI2P\",\n        \"outputId\": \"3f8c7fee-a9ee-4fd4-ec75-153ca60bd36f\"\n      },\n      \"outputs\": [],\n      \"source\": [\n        \"# provide a resource name to query a table of that name\\n\",\n        \"with pipeline.sql_client() as client:\\n\",\n        \"    with client.execute_query(f\\\"SELECT count(1) FROM rides\\\") as cursor:\\n\",\n        \"        data = cursor.df()\\n\",\n        \"print(data)\"\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\": {\n      \"display_name\": \"Python 3\",\n      \"name\": \"python3\"\n    },\n    \"language_info\": {\n      \"name\": \"python\"\n    }\n  },\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "cohorts/2026/03-data-warehouse/homework.md",
    "content": "# Module 3 Homework: Data Warehousing & BigQuery\n\nIn this homework we'll practice working with BigQuery and Google Cloud Storage.\n\nWhen submitting your homework, you will also need to include\na link to your GitHub repository or other public code-hosting\nsite.\n\nThis repository should contain the code for solving the homework.\n\nWhen your solution has SQL or shell commands and not code\n(e.g. python files) file format, include them directly in\nthe README file of your repository.\n\n## Data\n\nFor this homework we will be using the Yellow Taxi Trip Records for January 2024 - June 2024 (not the entire year of data).\n\nParquet Files are available from the New York City Taxi Data found here:\n\nhttps://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page\n\n## Loading the data\n\nYou can use the following scripts to load the data into your GCS bucket:\n\n- Python script: [load_yellow_taxi_data.py](./load_yellow_taxi_data.py)\n- Jupyter notebook with DLT: [DLT_upload_to_GCP.ipynb](./DLT_upload_to_GCP.ipynb)\n\nYou will need to generate a Service Account with GCS Admin privileges or be authenticated with the Google SDK, and update the bucket name in the script.\n\nIf you are using orchestration tools such as Kestra, Mage, Airflow, or Prefect, do not load the data into BigQuery using the orchestrator.\n\nMake sure that all 6 files show in your GCS bucket before beginning.\n\nNote: You will need to use the PARQUET option when creating an external table.\n\n\n## BigQuery Setup\n\nCreate an external table using the Yellow Taxi Trip Records. \n\nCreate a (regular/materialized) table in BQ using the Yellow Taxi Trip Records (do not partition or cluster this table). \n\n\n\n## Question 1. Counting records\n\nWhat is count of records for the 2024 Yellow Taxi Data?\n- 65,623\n- 840,402\n- 20,332,093\n- 85,431,289\n\n\n## Question 2. 
Data read estimation\n\nWrite a query to count the number of distinct PULocationIDs for the entire dataset on both tables.\n\nWhat is the **estimated amount** of data that will be read when this query is executed on the external table and on the materialized table?\n\n- 18.82 MB for the External Table and 47.60 MB for the Materialized Table\n- 0 MB for the External Table and 155.12 MB for the Materialized Table\n- 2.14 GB for the External Table and 0 MB for the Materialized Table\n- 0 MB for the External Table and 0 MB for the Materialized Table\n\n## Question 3. Understanding columnar storage\n\nWrite a query to retrieve the PULocationID from the table (not the external table) in BigQuery. Now write a query to retrieve the PULocationID and DOLocationID from the same table.\n\nWhy is the estimated number of bytes different?\n- BigQuery is a columnar database, and it only scans the specific columns requested in the query. Querying two columns (PULocationID, DOLocationID) requires reading more data than querying one column (PULocationID), leading to a higher estimated number of bytes processed.\n- BigQuery duplicates data across multiple storage partitions, so selecting two columns instead of one requires scanning the table twice, doubling the estimated bytes processed.\n- BigQuery automatically caches the first queried column, so adding a second column increases processing time but does not affect the estimated bytes scanned.\n- When selecting multiple columns, BigQuery performs an implicit join operation between them, increasing the estimated bytes processed.\n\n## Question 4. Counting zero fare trips\n\nHow many records have a fare_amount of 0?\n- 128,210\n- 546,578\n- 20,188,016\n- 8,333\n\n## Question 5. 
Partitioning and clustering\n\nWhat is the best strategy to make an optimized table in BigQuery if your queries will always filter on tpep_dropoff_datetime and order the results by VendorID? (Create a new table with this strategy.)\n\n- Partition by tpep_dropoff_datetime and Cluster on VendorID\n- Cluster on tpep_dropoff_datetime and Cluster on VendorID\n- Cluster on tpep_dropoff_datetime and Partition by VendorID\n- Partition by tpep_dropoff_datetime and Partition by VendorID\n\n\n## Question 6. Partition benefits\n\nWrite a query to retrieve the distinct VendorIDs where tpep_dropoff_datetime is between\n2024-03-01 and 2024-03-15 (inclusive).\n\n\nUse the materialized table you created earlier in your `FROM` clause and note the estimated bytes. Now change the table in the `FROM` clause to the partitioned table you created for Question 5 and note the estimated bytes processed. What are these values?\n\n\nChoose the answer which most closely matches.\n\n\n- 12.47 MB for the non-partitioned table and 326.42 MB for the partitioned table\n- 310.24 MB for the non-partitioned table and 26.84 MB for the partitioned table\n- 5.87 MB for the non-partitioned table and 0 MB for the partitioned table\n- 310.31 MB for the non-partitioned table and 285.64 MB for the partitioned table\n\n\n## Question 7. External table storage\n\nWhere is the data stored in the External Table you created?\n\n- BigQuery\n- Container Registry\n- GCP Bucket\n- Bigtable\n\n## Question 8. Clustering best practices\n\nIt is best practice in BigQuery to always cluster your data:\n- True\n- False\n\n\n## Question 9. Understanding table scans\n\nNo points: Write a `SELECT count(*)` query on the materialized table you created. How many bytes does it estimate will be read? Why?\n\n\n## Submitting the solutions\n\nForm for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw3\n\n\n## Learning in Public\n\nWe encourage everyone to share what they learned. 
This is called \"learning in public\".\n\nRead more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).\n\n### Example post for LinkedIn\n\n```\n🚀 Week 3 of Data Engineering Zoomcamp by @DataTalksClub complete!\n\nJust finished Module 3 - Data Warehousing with BigQuery. Learned how to:\n\n✅ Create external tables from GCS bucket data\n✅ Build materialized tables in BigQuery\n✅ Partition and cluster tables for performance\n✅ Understand columnar storage and query optimization\n✅ Analyze NYC taxi data at scale\n\nWorking with 20M+ records and learning how partitioning reduces query costs!\n\nHere's my homework solution: <LINK>\n\nFollowing along with this amazing free course - who else is learning data engineering?\n\nYou can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n\n### Example post for Twitter/X\n\n```\n📊 Module 3 of Data Engineering Zoomcamp done!\n\n- BigQuery & GCS\n- External vs materialized tables\n- Partitioning & clustering\n- Query optimization\n\nMy solution: <LINK>\n\nFree course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n"
  },
  {
    "path": "cohorts/2026/03-data-warehouse/load_yellow_taxi_data.py",
    "content": "import os\nimport sys\nimport urllib.request\nfrom concurrent.futures import ThreadPoolExecutor\nfrom google.cloud import storage\nfrom google.api_core.exceptions import NotFound, Forbidden\nimport time\n\n\n# Change this to your bucket name\nBUCKET_NAME = \"dezoomcamp_hw3_2025\"\n\n# If you authenticated through the GCP SDK you can comment out these two lines\nCREDENTIALS_FILE = \"gcs.json\"\nclient = storage.Client.from_service_account_json(CREDENTIALS_FILE)\n# If commented initialize client with the following\n# client = storage.Client(project='zoomcamp-mod3-datawarehouse')\n\n\nBASE_URL = \"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-\"\nMONTHS = [f\"{i:02d}\" for i in range(1, 7)]\nDOWNLOAD_DIR = \".\"\n\nCHUNK_SIZE = 8 * 1024 * 1024\n\nos.makedirs(DOWNLOAD_DIR, exist_ok=True)\n\nbucket = client.bucket(BUCKET_NAME)\n\n\ndef download_file(month):\n    url = f\"{BASE_URL}{month}.parquet\"\n    file_path = os.path.join(DOWNLOAD_DIR, f\"yellow_tripdata_2024-{month}.parquet\")\n\n    try:\n        print(f\"Downloading {url}...\")\n        urllib.request.urlretrieve(url, file_path)\n        print(f\"Downloaded: {file_path}\")\n        return file_path\n    except Exception as e:\n        print(f\"Failed to download {url}: {e}\")\n        return None\n\n\ndef create_bucket(bucket_name):\n    try:\n        # Get bucket details\n        bucket = client.get_bucket(bucket_name)\n\n        # Check if the bucket belongs to the current project\n        project_bucket_ids = [bckt.id for bckt in client.list_buckets()]\n        if bucket_name in project_bucket_ids:\n            print(\n                f\"Bucket '{bucket_name}' exists and belongs to your project. 
Proceeding...\"\n            )\n        else:\n            print(\n                f\"A bucket with the name '{bucket_name}' already exists, but it does not belong to your project.\"\n            )\n            sys.exit(1)\n\n    except NotFound:\n        # If the bucket doesn't exist, create it\n        bucket = client.create_bucket(bucket_name)\n        print(f\"Created bucket '{bucket_name}'\")\n    except Forbidden:\n        # If the request is forbidden, it means the bucket exists but you don't have access to see details\n        print(\n            f\"A bucket with the name '{bucket_name}' exists, but it is not accessible. Bucket name is taken. Please try a different bucket name.\"\n        )\n        sys.exit(1)\n\n\ndef verify_gcs_upload(blob_name):\n    return storage.Blob(bucket=bucket, name=blob_name).exists(client)\n\n\ndef upload_to_gcs(file_path, max_retries=3):\n    # The bucket is created or verified once in the main block,\n    # so it is not re-checked in every upload thread.\n    blob_name = os.path.basename(file_path)\n    blob = bucket.blob(blob_name)\n    blob.chunk_size = CHUNK_SIZE\n\n    for attempt in range(max_retries):\n        try:\n            print(f\"Uploading {file_path} to {BUCKET_NAME} (Attempt {attempt + 1})...\")\n            blob.upload_from_filename(file_path)\n            print(f\"Uploaded: gs://{BUCKET_NAME}/{blob_name}\")\n\n            if verify_gcs_upload(blob_name):\n                print(f\"Verification successful for {blob_name}\")\n                return\n            else:\n                print(f\"Verification failed for {blob_name}, retrying...\")\n        except Exception as e:\n            print(f\"Failed to upload {file_path} to GCS: {e}\")\n\n        time.sleep(5)\n\n    print(f\"Giving up on {file_path} after {max_retries} attempts.\")\n\n\nif __name__ == \"__main__\":\n    create_bucket(BUCKET_NAME)\n\n    with ThreadPoolExecutor(max_workers=4) as executor:\n        file_paths = list(executor.map(download_file, MONTHS))\n\n    with ThreadPoolExecutor(max_workers=4) as executor:\n        
executor.map(upload_to_gcs, filter(None, file_paths))  # Remove None values\n\n    print(\"All files processed and verified.\")\n"
  },
  {
    "path": "cohorts/2026/04-analytics-engineering/homework.md",
    "content": "# Module 4 Homework: Analytics Engineering with dbt\n\nIn this homework, we'll use the dbt project in `04-analytics-engineering/taxi_rides_ny/` to transform NYC taxi data and answer questions by querying the models.\n\n## Setup\n\n1. Set up your dbt project following the [setup guide](../../../04-analytics-engineering/setup/)\n2. Load the Green and Yellow taxi data for 2019-2020 and FHV trip data for 2019 into your warehouse (use static tables from [dtc github](https://github.com/DataTalksClub/nyc-tlc-data/), don't use offical tables from tlc because some values change from time to time)\n3. Run `dbt build --target prod` to create all models and run tests\n\n> **Note:** By default, dbt uses the `dev` target. You must use `--target prod` to build the models in the production dataset, which is required for the homework queries below.\n\nAfter a successful build, you should have models like `fct_trips`, `dim_zones`, and `fct_monthly_zone_revenue` in your warehouse.\n\n---\n\n### Question 1. dbt Lineage and Execution\n\nGiven a dbt project with the following structure:\n\n```\nmodels/\n├── staging/\n│   ├── stg_green_tripdata.sql\n│   └── stg_yellow_tripdata.sql\n└── intermediate/\n    └── int_trips_unioned.sql (depends on stg_green_tripdata & stg_yellow_tripdata)\n```\n\nIf you run `dbt run --select int_trips_unioned`, what models will be built?\n\n- `stg_green_tripdata`, `stg_yellow_tripdata`, and `int_trips_unioned` (upstream dependencies)\n- Any model with upstream and downstream dependencies to `int_trips_unioned`\n- `int_trips_unioned` only\n- `int_trips_unioned`, `int_trips`, and `fct_trips` (downstream dependencies)\n\n---\n\n### Question 2. 
dbt Tests\n\nYou've configured a generic test like this in your `schema.yml`:\n\n```yaml\ncolumns:\n  - name: payment_type\n    data_tests:\n      - accepted_values:\n          arguments:\n            values: [1, 2, 3, 4, 5]\n            quote: false\n```\n\nYour model `fct_trips` has been running successfully for months. A new value `6` now appears in the source data.\n\nWhat happens when you run `dbt test --select fct_trips`?\n\n- dbt will skip the test because the model didn't change\n- dbt will fail the test, returning a non-zero exit code\n- dbt will pass the test with a warning about the new value\n- dbt will update the configuration to include the new value\n\n---\n\n### Question 3. Counting Records in `fct_monthly_zone_revenue`\n\nAfter running your dbt project, query the `fct_monthly_zone_revenue` model.\n\nWhat is the count of records in the `fct_monthly_zone_revenue` model?\n\n- 12,998\n- 14,120\n- 12,184\n- 15,421\n\n---\n\n### Question 4. Best Performing Zone for Green Taxis (2020)\n\nUsing the `fct_monthly_zone_revenue` table, find the pickup zone with the **highest total revenue** (`revenue_monthly_total_amount`) for **Green** taxi trips in 2020.\n\nWhich zone had the highest revenue?\n\n- East Harlem North\n- Morningside Heights\n- East Harlem South\n- Washington Heights South\n\n---\n\n### Question 5. Green Taxi Trip Counts (October 2019)\n\nUsing the `fct_monthly_zone_revenue` table, what is the **total number of trips** (`total_monthly_trips`) for Green taxis in October 2019?\n\n- 500,234\n- 350,891\n- 384,624\n- 421,509\n\n---\n\n### Question 6. Build a Staging Model for FHV Data\n\nCreate a staging model for the **For-Hire Vehicle (FHV)** trip data for 2019.\n\n1. Load the [FHV trip data for 2019](https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhv) into your data warehouse\n2. 
Create a staging model `stg_fhv_tripdata` with these requirements:\n   - Filter out records where `dispatching_base_num IS NULL`\n   - Rename fields to match your project's naming conventions (e.g., `PUlocationID` → `pickup_location_id`)\n\nWhat is the count of records in `stg_fhv_tripdata`?\n\n- 42,084,899\n- 43,244,693\n- 22,998,722\n- 44,112,187\n\n---\n\n## Submitting the solutions\n\n- Form for submitting: <https://courses.datatalks.club/de-zoomcamp-2026/homework/hw4>\n\n## Learning in Public\n\nWe encourage everyone to share what they learned. This is called \"learning in public\".\n\nRead more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).\n\n### Example post for LinkedIn\n\n```\n🚀 Week 4 of Data Engineering Zoomcamp by @DataTalksClub complete!\n\nJust finished Module 4 - Analytics Engineering with dbt. Learned how to:\n\n✅ Build transformation models with dbt\n✅ Create staging, intermediate, and fact tables\n✅ Write tests to ensure data quality\n✅ Understand lineage and model dependencies\n✅ Analyze revenue patterns across NYC zones\n\nTransforming raw data into analytics-ready models - the T in ELT!\n\nHere's my homework solution: <LINK>\n\nFollowing along with this amazing free course - who else is learning data engineering?\n\nYou can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n\n### Example post for Twitter/X\n\n```\n📈 Module 4 of Data Engineering Zoomcamp done!\n\n- Analytics Engineering with dbt\n- Transformation models & tests\n- Data lineage & dependencies\n- NYC taxi revenue analysis\n\nMy solution: <LINK>\n\nFree course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n"
  },
  {
    "path": "cohorts/2026/05-data-platforms/homework.md",
    "content": "# Module 5 Homework: Data Platforms with Bruin\n\nIn this homework, we'll use Bruin to build a complete data pipeline, from ingestion to reporting.\n\n## Setup\n\n1. Install Bruin CLI: `curl -LsSf https://getbruin.com/install/cli | sh`\n2. Initialize the zoomcamp template: `bruin init zoomcamp my-pipeline`\n3. Configure your `.bruin.yml` with a DuckDB connection\n4. Follow the tutorial in the [main module README](../../../05-data-platforms/)\n\nAfter completing the setup, you should have a working NYC taxi data pipeline.\n\n---\n\n### Question 1. Bruin Pipeline Structure\n\nIn a Bruin project, what are the required files/directories?\n\n- `bruin.yml` and `assets/`\n- `.bruin.yml` and `pipeline.yml` (assets can be anywhere)\n- `.bruin.yml` and `pipeline/` with `pipeline.yml` and `assets/`\n- `pipeline.yml` and `assets/` only\n\n---\n\n### Question 2. Materialization Strategies\n\nYou're building a pipeline that processes NYC taxi data organized by month based on `pickup_datetime`. Which incremental strategy is best for processing a specific interval period by deleting and inserting data for that time period?\n\n- `append` - always add new rows\n- `replace` - truncate and rebuild entirely\n- `time_interval` - incremental based on a time column\n- `view` - create a virtual table only\n\n---\n\n### Question 3. Pipeline Variables\n\nYou have the following variable defined in `pipeline.yml`:\n\n```yaml\nvariables:\n  taxi_types:\n    type: array\n    items:\n      type: string\n    default: [\"yellow\", \"green\"]\n```\n\nHow do you override this when running the pipeline to only process yellow taxis?\n\n- `bruin run --taxi-types yellow`\n- `bruin run --var taxi_types=yellow`\n- `bruin run --var 'taxi_types=[\"yellow\"]'`\n- `bruin run --set taxi_types=[\"yellow\"]`\n\n---\n\n### Question 4. Running with Dependencies\n\nYou've modified the `ingestion/trips.py` asset and want to run it plus all downstream assets. 
Which command should you use?\n\n- `bruin run ingestion.trips --all`\n- `bruin run ingestion/trips.py --downstream`\n- `bruin run pipeline/trips.py --recursive`\n- `bruin run --select ingestion.trips+`\n\n---\n\n### Question 5. Quality Checks\n\nYou want to ensure the `pickup_datetime` column in your trips table never has NULL values. Which quality check should you add to your asset definition?\n\n- `name: unique`\n- `name: not_null`\n- `name: positive`\n- `name: accepted_values, value: [not_null]`\n\n---\n\n### Question 6. Lineage and Dependencies\n\nAfter building your pipeline, you want to visualize the dependency graph between assets. Which Bruin command should you use?\n\n- `bruin graph`\n- `bruin dependencies`\n- `bruin lineage`\n- `bruin show`\n\n---\n\n### Question 7. First-Time Run\n\nYou're running a Bruin pipeline for the first time on a new DuckDB database. What flag should you use to ensure tables are created from scratch?\n\n- `--create`\n- `--init`\n- `--full-refresh`\n- `--truncate`\n\n---\n\n## Submitting the solutions\n\n- Form for submitting: <https://courses.datatalks.club/de-zoomcamp-2026/homework/hw5>\n\n## Learning in Public\n\nWe encourage everyone to share what they learned. This is called \"learning in public\".\n\nRead more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).\n\n### Example post for LinkedIn\n\n```\n🚀 Week 5 of Data Engineering Zoomcamp by @DataTalksClub complete!\n\nJust finished Module 5 - Data Platforms with Bruin. 
Learned how to:\n\n✅ Build end-to-end ELT pipelines with Bruin\n✅ Configure environments and connections\n✅ Use materialization strategies for incremental processing\n✅ Add data quality checks to ensure data integrity\n✅ Deploy pipelines from local to cloud (BigQuery)\n\nModern data platforms in a single CLI tool - no vendor lock-in!\n\nHere's my homework solution: <LINK>\n\nFollowing along with this amazing free course - who else is learning data engineering?\n\nYou can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n\n### Example post for Twitter/X\n\n```\n📊 Module 5 of Data Engineering Zoomcamp done!\n\n- Data Platforms with Bruin\n- End-to-end ELT pipelines\n- Data quality & lineage\n- Deployment to BigQuery\n\nMy solution: <LINK>\n\nFree course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n"
  },
  {
    "path": "cohorts/2026/06-batch/homework.md",
    "content": "# Module 6 Homework\n\nIn this homework we'll put what we learned about Spark in practice.\n\nFor this homework we will be using the Yellow 2025-11 data from the official website:\n\n```bash\nwget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet\n```\n\n\n## Question 1: Install Spark and PySpark\n\n- Install Spark\n- Run PySpark\n- Create a local spark session\n- Execute spark.version.\n\nWhat's the output?\n\n> [!NOTE]\n> To install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/06-batch/setup/)\n\n\n## Question 2: Yellow November 2025\n\nRead the November 2025 Yellow into a Spark Dataframe.\n\nRepartition the Dataframe to 4 partitions and save it to parquet.\n\nWhat is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? Select the answer which most closely matches.\n\n- 6MB\n- 25MB\n- 75MB\n- 100MB\n\n\n## Question 3: Count records\n\nHow many taxi trips were there on the 15th of November?\n\nConsider only trips that started on the 15th of November.\n\n- 62,610\n- 102,340\n- 162,604\n- 225,768\n\n\n## Question 4: Longest trip\n\nWhat is the length of the longest trip in the dataset in hours?\n\n- 22.7\n- 58.2\n- 90.6\n- 134.5\n\n\n## Question 5: User Interface\n\nSpark's User Interface which shows the application's dashboard runs on which local port?\n\n- 80\n- 443\n- 4040\n- 8080\n\n\n\n## Question 6: Least frequent pickup location zone\n\nLoad the zone lookup data into a temp view in Spark:\n\n```bash\nwget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv\n```\n\nUsing the zone lookup data and the Yellow November 2025 data, what is the name of the LEAST frequent pickup location Zone?\n\n- Governor's Island/Ellis Island/Liberty Island\n- Arden Heights\n- Rikers Island\n- Jamaica Bay\n\nIf multiple answers are correct, select any\n\n## Submitting the solutions\n\n- Form for submitting: 
https://courses.datatalks.club/de-zoomcamp-2026/homework/hw6\n- Deadline: See the website\n\n\n## Learning in Public\n\nWe encourage everyone to share what they learned. This is called \"learning in public\".\n\nRead more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).\n\n### Example post for LinkedIn\n\n```\n🚀 Week 6 of Data Engineering Zoomcamp by @DataTalksClub complete!\n\nJust finished Module 6 - Batch Processing with Spark. Learned how to:\n\n✅ Set up PySpark and create Spark sessions\n✅ Read and process Parquet files at scale\n✅ Repartition data for optimal performance\n✅ Analyze millions of taxi trips with DataFrames\n✅ Use Spark UI for monitoring jobs\n\nProcessing 4M+ taxi trips with Spark - distributed computing is powerful! 💪\n\nHere's my homework solution: <LINK>\n\nFollowing along with this amazing free course - who else is learning data engineering?\n\nYou can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n\n### Example post for Twitter/X\n\n```\n⚡ Module 6 of Data Engineering Zoomcamp done!\n\n- Batch processing with Spark 🔥\n- PySpark & DataFrames\n- Parquet file optimization\n- Spark UI on port 4040\n\nMy solution: <LINK>\n\nFree course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n"
  },
  {
    "path": "cohorts/2026/07-streaming/homework.md",
    "content": "# Homework\n\nIn this homework, we'll practice streaming with Kafka (Redpanda) and PyFlink.\n\nWe use Redpanda, a drop-in replacement for Kafka. It implements the same\nprotocol, so any Kafka client library works with it unchanged.\n\nFor this homework we will be using Green Taxi Trip data from October 2025:\n\n- [green_tripdata_2025-10.parquet](https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-10.parquet)\n\n\n## Setup\n\nWe'll use the same infrastructure from the [workshop](../../../07-streaming/workshop/).\n\nFollow the setup instructions: build the Docker image, start the services:\n\n```bash\ncd 07-streaming/workshop/\ndocker compose build\ndocker compose up -d\n```\n\nThis gives us:\n\n- Redpanda (Kafka-compatible broker) on `localhost:9092`\n- Flink Job Manager at http://localhost:8081\n- Flink Task Manager\n- PostgreSQL on `localhost:5432` (user: `postgres`, password: `postgres`)\n\nIf you previously ran the workshop and have old containers/volumes,\ndo a clean start:\n\n```bash\ndocker compose down -v\ndocker compose build\ndocker compose up -d\n```\n\nNote: the container names (like `workshop-redpanda-1`) assume the\ndirectory is called `workshop`. If you renamed it, adjust accordingly.\n\n\n## Question 1. Redpanda version\n\nRun `rpk version` inside the Redpanda container:\n\n```bash\ndocker exec -it workshop-redpanda-1 rpk version\n```\n\nWhat version of Redpanda are you running?\n\n\n## Question 2. 
Sending data to Redpanda\n\nCreate a topic called `green-trips`:\n\n```bash\ndocker exec -it workshop-redpanda-1 rpk topic create green-trips\n```\n\nNow write a producer to send the green taxi data to this topic.\n\nRead the parquet file and keep only these columns:\n\n- `lpep_pickup_datetime`\n- `lpep_dropoff_datetime`\n- `PULocationID`\n- `DOLocationID`\n- `passenger_count`\n- `trip_distance`\n- `tip_amount`\n- `total_amount`\n\nConvert each row to a dictionary and send it to the `green-trips` topic.\nYou'll need to handle the datetime columns - convert them to strings\nbefore serializing to JSON.\n\nMeasure the time it takes to send the entire dataset and flush:\n\n```python\nfrom time import time\n\nt0 = time()\n\n# send all rows ...\n\nproducer.flush()\n\nt1 = time()\nprint(f'took {(t1 - t0):.2f} seconds')\n```\n\nHow long did it take to send the data?\n\n- 10 seconds\n- 60 seconds\n- 120 seconds\n- 300 seconds\n\n\n## Question 3. Consumer - trip distance\n\nWrite a Kafka consumer that reads all messages from the `green-trips` topic\n(set `auto_offset_reset='earliest'`).\n\nCount how many trips have a `trip_distance` greater than 5.0 miles.\n\nHow many trips have `trip_distance` > 5?\n\n- 6506\n- 7506\n- 8506\n- 9506\n\n\n## Part 2: PyFlink (Questions 4-6)\n\nFor the PyFlink questions, you'll adapt the workshop code to work with\nthe green taxi data. 
The key differences from the workshop:\n\n- Topic name: `green-trips` (instead of `rides`)\n- Datetime columns use `lpep_` prefix (instead of `tpep_`)\n- You'll need to handle timestamps as strings (not epoch milliseconds)\n\nYou can convert string timestamps to Flink timestamps in your source DDL:\n\n```sql\nlpep_pickup_datetime VARCHAR,\nevent_timestamp AS TO_TIMESTAMP(lpep_pickup_datetime, 'yyyy-MM-dd HH:mm:ss'),\nWATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '5' SECOND\n```\n\nBefore running the Flink jobs, create the necessary PostgreSQL tables\nfor your results.\n\nImportant notes for the Flink jobs:\n\n- Place your job files in `workshop/src/job/` - this directory is\n  mounted into the Flink containers at `/opt/src/job/`\n- Submit jobs with:\n  `docker exec -it workshop-jobmanager-1 flink run -py /opt/src/job/your_job.py`\n- The `green-trips` topic has 1 partition, so set parallelism to 1\n  in your Flink jobs (`env.set_parallelism(1)`). With higher parallelism,\n  idle consumer subtasks prevent the watermark from advancing.\n- Flink streaming jobs run continuously. Let the job run for a minute\n  or two until results appear in PostgreSQL, then query the results.\n  You can cancel the job from the Flink UI at http://localhost:8081\n- If you sent data to the topic multiple times, delete and recreate\n  the topic to avoid duplicates:\n  `docker exec -it workshop-redpanda-1 rpk topic delete green-trips`\n\n\n## Question 4. 
Tumbling window - pickup location\n\nCreate a Flink job that reads from `green-trips` and uses a 5-minute\ntumbling window to count trips per `PULocationID`.\n\nWrite the results to a PostgreSQL table with columns:\n`window_start`, `PULocationID`, `num_trips`.\n\nAfter the job processes all data, query the results:\n\n```sql\nSELECT PULocationID, num_trips\nFROM <your_table>\nORDER BY num_trips DESC\nLIMIT 3;\n```\n\nWhich `PULocationID` had the most trips in a single 5-minute window?\n\n- 42\n- 74\n- 75\n- 166\n\n\n## Question 5. Session window - longest streak\n\nCreate another Flink job that uses a session window with a 5-minute gap\non `PULocationID`, using `lpep_pickup_datetime` as the event time\nwith a 5-second watermark tolerance.\n\nA session window groups events that arrive within 5 minutes of each other.\nWhen there's a gap of more than 5 minutes, the window closes.\n\nWrite the results to a PostgreSQL table and find the `PULocationID`\nwith the longest session (most trips in a single session).\n\nHow many trips were in the longest session?\n\n- 12\n- 31\n- 51\n- 81\n\n\n## Question 6. Tumbling window - largest tip\n\nCreate a Flink job that uses a 1-hour tumbling window to compute the\ntotal `tip_amount` per hour (across all locations).\n\nWhich hour had the highest total tip amount?\n\n- 2025-10-01 18:00:00\n- 2025-10-16 18:00:00\n- 2025-10-22 08:00:00\n- 2025-10-30 16:00:00\n\n\n## Submitting the solutions\n\n- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw7\n\n\n## Learning in public\n\nWe encourage everyone to share what they learned.\nRead more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).\n\n## Example post for LinkedIn\n\n```\nWeek 7 of Data Engineering Zoomcamp by @DataTalksClub complete!\n\nJust finished Module 7 - Streaming with PyFlink. 
Learned how to:\n\n- Set up Redpanda as a Kafka replacement\n- Build Kafka producers and consumers in Python\n- Create tumbling and session windows in Flink\n- Analyze real-time taxi trip data with stream processing\n\nHere's my homework solution: <LINK>\n\nYou can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n\n## Example post for Twitter/X\n\n```\nModule 7 of Data Engineering Zoomcamp done!\n\n- Kafka producers and consumers\n- PyFlink tumbling and session windows\n- Real-time taxi data analysis\n- Redpanda as Kafka replacement\n\nMy solution: <LINK>\n\nFree course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n"
  },
  {
    "path": "cohorts/2026/README.md",
    "content": "## Data Engineering Zoomcamp 2026 Cohort\n\n* [Pre-launch Q&A stream](https://www.youtube.com/watch?v=WB6b1lcguaA)\n* [Launch stream with course overview](https://www.youtube.com/watch?v=JgspdlKXS-w)\n* [FAQ](https://datatalks.club/faq/data-engineering-zoomcamp.html)\n* [Course Playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb)\n\n\n[**Module 1: Introduction & Prerequisites**](01-docker-terraform/)\n\n* [Homework](01-docker-terraform/homework.md)\n\n\n[**Module 2: Workflow Orchestration**](02-workflow-orchestration)\n\n* [Homework](02-workflow-orchestration/homework.md)\n* Office hours\n\n[**Workshop 1: Data Ingestion**](workshops/dlt/README.md)\n\n* Workshop with dlt\n* [Homework](workshops/dlt/README.md)\n\n[**Workshop 2: AI-Assisted Data Ingestion with dlt**](workshops/dlt.md)\n\n* [Workshop details and registration](workshops/dlt.md)\n\n\n[**Module 3: Data Warehouse**](03-data-warehouse)\n\n* [Homework](03-data-warehouse/homework.md)\n\n\n[**Module 4: Analytics Engineering**](04-analytics-engineering/)\n\n* [Homework](04-analytics-engineering/homework.md)\n\n\n[**Module 5: Data Platforms**](05-data-platforms/)\n\n* [Homework](05-data-platforms/homework.md)\n\n\n[**Module 6: Batch processing**](06-batch/)\n\n* [Homework](06-batch/homework.md)\n\n\n[**Module 7: Stream Processing**](07-streaming)\n\n* [Homework](07-streaming/homework.md)\n\n\n[**Project**](project.md)\n\nMore information [here](project.md)\n"
  },
  {
    "path": "cohorts/2026/project.md",
    "content": "## Course Project\n\nThe goal of this project is to apply everything we learned\nin this course and build an end-to-end data pipeline.\n\nYou will have two attempts to submit your project. If you don't have \ntime to submit your project by the end of attempt #1 (you started the \ncourse late, you have vacation plans, life/work got in the way, etc.)\nor you fail your first attempt, \nthen you will have a second chance to submit your project as attempt\n#2. \n\nThere are only two attempts.\n\nRemember that to pass the project, you must evaluate 3 peers. If you don't do that,\nyour project can't be considered complete.\n\nTo find the projects assigned to you, use the peer review assignments link \nand find your hash in the first column. You will see three rows: you need to evaluate \neach of these projects. For each project, you need to submit the form once,\nso in total, you will make three submissions. \n\n\n### Submitting\n\n#### Project Attempt #1\n\n* Project: https://courses.datatalks.club/de-zoomcamp-2026/project/project1\n* Review: https://courses.datatalks.club/de-zoomcamp-2026/project/project1/eval\n\n#### Project Attempt #2\n\n* Project: https://courses.datatalks.club/de-zoomcamp-2026/project/project2\n* Review: https://courses.datatalks.club/de-zoomcamp-2026/project/project2/eval\n\n> **Important**: update your \"Certificate name\" here: https://courses.datatalks.club/de-zoomcamp-2026/enrollment -\nthis is what we will use when generating certificates for you.\n\n### Evaluation criteria\n\nSee [here](../../projects/README.md)\n"
  },
  {
    "path": "cohorts/2026/workshops/dlt/README.md",
    "content": "# From APIs to Warehouses: AI-Assisted Data Ingestion with dlt\n\nWelcome to the **Data Engineering Zoomcamp 2026** workshop!\n\nIn this workshop, you'll use an AI-powered IDE to build a complete data pipeline. Using simple prompts, you can go from an API to a local data warehouse with [dlt](https://dlthub.com/docs) (data load tool). The AI handles the code generation. You focus on the results.\n\n## What You'll Build\n\nBy the end of this workshop, you will have:\n\n1. A working dlt pipeline that extracts data from the [Open Library API](https://openlibrary.org/developers/api)\n2. Normalized relational tables stored in DuckDB\n3. The ability to query, inspect, and visualize your data\n4. Experience using AI-assisted development for data engineering\n\n**No API key required!** The Open Library API is completely open and doesn't require authentication. You can start building immediately.\n\n---\n\n## Prerequisites\n\nBefore the workshop, make sure you have the following set up:\n\n### 1. Understand What dlt Does (Recommended for Beginners)\n\nIf you're unfamiliar with dlt and what the library does, we recommend reading through the included Jupyter notebook before the workshop.\n\n**[Open the notebook in Google Colab](https://colab.research.google.com/github/anair123/data-engineering-zoomcamp/blob/workshop/dlt_2026/cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb)**\n\nIt walks through dlt step by step:\n\n- What a dlt source and pipeline are\n- How data moves through Extract, Normalize, and Load\n- How to inspect the loaded data\n\nUnderstanding these concepts will help you know what the agent-generated code is actually doing.\n\n> You do not need to clone the repo to follow the workshop. The `dlt init` command scaffolds everything you need.\n\n### 2. An Agentic IDE\n\nYou'll need an AI-powered code editor that can understand context and generate code from natural language. 
We recommend:\n\n| IDE | Description |\n|-----|-------------|\n| [**Cursor**](https://cursor.sh) | VS Code fork with built-in AI assistance (recommended) |\n| [Windsurf](https://codeium.com/windsurf) | Alternative agentic IDE |\n| [VS Code + GitHub Copilot](https://github.com/features/copilot) | Works, but less integrated |\n\n### 3. Python 3.11+\n\n```bash\npython --version  # Should be 3.11 or higher\n```\n\n### 4. uv (Recommended) or pip\n\nWe use [uv](https://docs.astral.sh/uv/) for fast dependency management:\n\n```bash\n# Install uv (if you don't have it)\ncurl -LsSf https://astral.sh/uv/install.sh | sh\n```\n\n---\n\n## Workshop Instructions\n\n### Step 1: Create a New Project Folder\n\nCreate a fresh folder for your pipeline and open it in Cursor (or your preferred agentic IDE):\n\n```bash\nmkdir my-dlt-pipeline\ncd my-dlt-pipeline\n```\n\n### Step 2: Add the dlt MCP Server Config\n\nChoose the setup for your IDE:\n\nCursor - go to **Settings → Tools & MCP → New MCP Server** and add:\n\n```json\n{\n  \"mcpServers\": {\n    \"dlt\": {\n      \"command\": \"uv\",\n      \"args\": [\n        \"run\",\n        \"--with\",\n        \"dlt[duckdb]\",\n        \"--with\",\n        \"dlt-mcp[search]\",\n        \"python\",\n        \"-m\",\n        \"dlt_mcp\"\n      ]\n    }\n  }\n}\n```\n\nVS Code (Copilot) - create `.vscode/mcp.json` in your project folder:\n\n```json\n{\n  \"servers\": {\n    \"dlt\": {\n      \"command\": \"uv\",\n      \"args\": [\n        \"run\",\n        \"--with\",\n        \"dlt[duckdb]\",\n        \"--with\",\n        \"dlt-mcp[search]\",\n        \"python\",\n        \"-m\",\n        \"dlt_mcp\"\n      ]\n    }\n  }\n}\n```\n\nClaude Code - run in your terminal:\n\n```bash\nclaude mcp add dlt -- uv run --with \"dlt[duckdb]\" --with \"dlt-mcp[search]\" python -m dlt_mcp\n```\n\nThis enables the dlt MCP server, which gives the AI access to dlt documentation, code examples, and your pipeline metadata.\n\n### Step 3: Install dlt 
Workspace\n\n```bash\npip install \"dlt[workspace]\"\n```\n\n### Step 4: Initialize the dlt Project\n\n```bash\ndlt init dlthub:open_library duckdb\n```\n\nThis scaffolds the pipeline files and configuration for Open Library. You now have everything you need to start prompting.\n\n> 📖 **Reference:** [Open Library Workspace Instructions](https://dlthub.com/workspace/source/open-library)\n\n### Step 5: Prompt the Agent to Build and Run the Pipeline\n\nThis is where the magic happens. The `dlt init` command scaffolds sample prompts you can use. Here's an example to get started:\n\n```\nPlease generate a REST API Source for Open Library API, as specified in @open_library-docs.yaml\nStart with endpoint(s) books and skip incremental loading for now.\nPlace the code in open_library_pipeline.py and name the pipeline open_library_pipeline.\nIf the file exists, use it as a starting point.\nDo not add or modify any other files.\nUse @dlt rest api as a tutorial.\nAfter adding the endpoints, allow the user to run the pipeline with python open_library_pipeline.py and await further instructions.\n```\n\nFeel free to tweak the prompt based on your objective. The agent will:\n1. Generate the pipeline code\n2. Run the pipeline\n3. Load data into your local DuckDB database\n\nAll from a single prompt.\n\n### Step 6: Debug with the Agent\n\nIf there are any errors, paste them into the chat and let the AI resolve them. 
This is the power of AI-assisted development: you iterate quickly without getting stuck.\n\n### Step 7: Inspect Pipeline Data with the dlt Dashboard\n\nOnce your pipeline runs successfully, launch the dashboard to inspect your data and metadata:\n\n```bash\ndlt pipeline open_library_pipeline show\n```\n\nThis opens a web app where you can:\n- View pipeline state and run history\n- Explore schemas, tables, and columns\n- Query the loaded data\n- Debug any issues\n\n> 📖 **Reference:** [dlt Dashboard Documentation](https://dlthub.com/docs/general-usage/dashboard)\n\n### Step 8: Inspect the Pipeline via Chat\n\nWith the dlt MCP server configured, you can ask the AI about your pipeline directly:\n\n> \"What tables were created in the pipeline?\"  \n> \"Show me the schema for the books table.\"  \n> \"How many rows were loaded?\"\n\nThe agent has access to your pipeline metadata and can answer these questions.\n\n### Step 9 (Bonus): Build Visualizations with marimo + ibis\n\nTake your analysis further by creating interactive reports with [marimo](https://marimo.io/) notebooks and [ibis](https://ibis-project.org/).\n\nPrompt the agent to build a visualization:\n\n> \"Create a marimo notebook that visualizes the top 10 authors by book count. Use ibis for data access. 
Reference: https://dlthub.com/docs/general-usage/dataset-access/marimo\"\n\nBy providing the docs link, the agent will use the correct stack.\n\nRun your notebook:\n\n```bash\n# Edit mode (for development)\nmarimo edit your_notebook.py\n\n# Run mode (view the report)\nmarimo run your_notebook.py\n```\n\n> 📖 **Reference:** [Explore Data with marimo](https://dlthub.com/docs/general-usage/dataset-access/marimo)\n\n---\n\n## Homework\n\nYou've seen me do it, now it's your turn!\n\nSee [dlt_homework.md](dlt_homework.md) for instructions.\n\n---\n\n## Resources\n\n| Resource | Link |\n|----------|------|\n| dlt Documentation | [dlthub.com/docs](https://dlthub.com/docs) |\n| Open Library Workspace Guide | [dlthub.com/workspace/source/open-library](https://dlthub.com/workspace/source/open-library) |\n| dlt Dashboard Docs | [dlthub.com/docs/general-usage/dashboard](https://dlthub.com/docs/general-usage/dashboard) |\n| marimo + dlt Guide | [dlthub.com/docs/general-usage/dataset-access/marimo](https://dlthub.com/docs/general-usage/dataset-access/marimo) |\n| Open Library API | [openlibrary.org/developers/api](https://openlibrary.org/developers/api) |\n\n---\n\n*Workshop by [dltHub](https://dlthub.com) for the Data Engineering Zoomcamp 2026*\n"
  },
  {
    "path": "cohorts/2026/workshops/dlt/analysis.py",
    "content": "import marimo\n\n__generated_with = \"0.19.9\"\napp = marimo.App(width=\"medium\")\n\n\n@app.cell\ndef _():\n    import marimo as mo\n    import dlt\n    import ibis\n    import altair as alt\n    from dlt.helpers.marimo import render, load_package_viewer\n\n    return alt, dlt, ibis, load_package_viewer, mo, render\n\n\n@app.cell\ndef _(mo):\n    mo.md(r\"\"\"\n    # 📚 Open Library Harry Potter Books Analysis\n\n    This notebook analyzes Harry Potter-related books from the Open Library API using dlt's dataset interface.\n    \"\"\")\n    return\n\n\n@app.cell\ndef _(dlt):\n    # Access the pipeline and dataset using dlt's native interface\n    pipeline = dlt.attach(\"open_library_pipeline\")\n    dataset = pipeline.dataset()\n    # Get ibis connection for rich data exploration\n    ibis_con = dataset.ibis()\n    return (ibis_con,)\n\n\n@app.cell\nasync def _(load_package_viewer, render):\n    # Display the dlt package viewer widget\n    await render(load_package_viewer)\n    return\n\n\n@app.cell\ndef _(mo):\n    mo.md(r\"\"\"\n    ## 📊 Books by Author\n    \"\"\")\n    return\n\n\n@app.cell\ndef _(alt, ibis, ibis_con):\n    # Query for books by author (top 15) using ibis\n    author_table = ibis_con.table(\"books__author_name\")\n    author_query = (\n        author_table\n        .group_by(\"value\")\n        .agg(book_count=author_table.value.count())\n        .order_by(ibis.desc(\"book_count\"))\n        .limit(15)\n    )\n    author_df = author_query.to_pandas()\n    author_df = author_df.rename(columns={\"value\": \"author\"})\n\n    # Bar chart for authors\n    author_chart = alt.Chart(author_df).mark_bar(color=\"#6366f1\").encode(\n        x=alt.X(\"book_count:Q\", title=\"Number of Books\"),\n        y=alt.Y(\"author:N\", sort=\"-x\", title=\"Author\"),\n        tooltip=[\"author\", \"book_count\"]\n    ).properties(\n        title=\"Top 15 Authors by Number of Books\",\n        width=600,\n        height=400\n    )\n    author_chart\n    
return\n\n\n@app.cell\ndef _(mo):\n    mo.md(r\"\"\"\n    ## 📈 Books Published Per Year\n    \"\"\")\n    return\n\n\n@app.cell\ndef _(alt, ibis_con):\n    # Query for books by year using ibis\n    books_table = ibis_con.table(\"books\")\n    year_query = (\n        books_table\n        .filter((books_table.first_publish_year >= 1997) & (books_table.first_publish_year <= 2025))\n        .group_by(\"first_publish_year\")\n        .agg(books=books_table.first_publish_year.count())\n        .order_by(\"first_publish_year\")\n    )\n    year_df = year_query.to_pandas()\n    year_df = year_df.rename(columns={\"first_publish_year\": \"year\"})\n\n    # Line chart for publication years\n    year_chart = alt.Chart(year_df).mark_line(\n        point=True,\n        color=\"#10b981\"\n    ).encode(\n        x=alt.X(\"year:O\", title=\"Year\"),\n        y=alt.Y(\"books:Q\", title=\"Number of Books\"),\n        tooltip=[\"year\", \"books\"]\n    ).properties(\n        title=\"Harry Potter-Related Books Published Per Year (1997-2025)\",\n        width=700,\n        height=350\n    )\n    year_chart\n    return\n\n\n@app.cell\ndef _(mo):\n    mo.md(r\"\"\"\n    ## 🌍 Books by Language\n    \"\"\")\n    return\n\n\n@app.cell\ndef _(alt, ibis, ibis_con):\n    # Query for books by language using ibis\n    lang_table = ibis_con.table(\"books__language\")\n    lang_query = (\n        lang_table\n        .group_by(\"value\")\n        .agg(count=lang_table.value.count())\n        .order_by(ibis.desc(\"count\"))\n        .limit(10)\n    )\n    language_df = lang_query.to_pandas()\n\n    # Map language codes to full names\n    lang_map = {\n        'eng': 'English', 'ger': 'German', 'fre': 'French',\n        'spa': 'Spanish', 'ita': 'Italian', 'chi': 'Chinese',\n        'por': 'Portuguese', 'rus': 'Russian', 'kor': 'Korean', 'pol': 'Polish'\n    }\n    language_df[\"language\"] = language_df[\"value\"].map(lambda x: lang_map.get(x, x))\n\n    # Pie chart for languages\n    language_chart = 
alt.Chart(language_df).mark_arc(innerRadius=50).encode(\n        theta=alt.Theta(\"count:Q\", title=\"Count\"),\n        color=alt.Color(\"language:N\", title=\"Language\", scale=alt.Scale(scheme=\"tableau10\")),\n        tooltip=[\"language\", \"count\"]\n    ).properties(\n        title=\"Proportion of Books by Language (Top 10)\",\n        width=400,\n        height=400\n    )\n    language_chart\n    return\n\n\n@app.cell\ndef _(mo):\n    mo.md(r\"\"\"\n    ## 📋 Summary Statistics\n\n    Key insights from the Open Library Harry Potter books dataset.\n    \"\"\")\n    return\n\n\n@app.cell\ndef _(ibis_con, mo):\n    # Get summary stats using ibis\n    total_books = ibis_con.table(\"books\").count().to_pandas()\n    total_authors = ibis_con.table(\"books__author_name\").value.nunique().to_pandas()\n    total_languages = ibis_con.table(\"books__language\").value.nunique().to_pandas()\n\n    mo.md(f\"\"\"\n    | Metric | Value |\n    |--------|-------|\n    | **Total Books** | {total_books:,} |\n    | **Unique Authors** | {total_authors:,} |\n    | **Languages** | {total_languages} |\n    \"\"\")\n    return\n\n\n@app.cell\ndef _():\n    return\n\n\n@app.cell\ndef _():\n    return\n\n\nif __name__ == \"__main__\":\n    app.run()\n"
  },
  {
    "path": "cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"bPVVve29bu6Z\"\n   },\n   \"source\": [\n    \"# Building a Data Pipeline with dlt\\n\",\n    \"\\n\",\n    \"In this notebook, we will build a complete data pipeline from scratch using **dlt**.\\n\",\n    \"\\n\",\n    \"Our goal is simple:\\n\",\n    \"\\n\",\n    \"→ Fetch real data from an API  \\n\",\n    \"→ Turn it into clean relational tables  \\n\",\n    \"→ Load it into a database  \\n\",\n    \"→ Explore and analyze it  \\n\",\n    \"\\n\",\n    \"We will use the **Open Library API** as our data source and **DuckDB** as our database.\\n\",\n    \"\\n\",\n    \"Along the way, you will learn:\\n\",\n    \"\\n\",\n    \"- What a dlt source is  \\n\",\n    \"- What a dlt pipeline does  \\n\",\n    \"- How data moves through Extract → Normalize → Load  \\n\",\n    \"- How to inspect and explore the final dataset  \\n\",\n    \"\\n\",\n    \"By the end, you will understand not just how to run a pipeline, but what happens at each stage.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"u9eCv60qV5PS\"\n   },\n   \"source\": [\n    \"## 📦 Step 0: Install Dependencies\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"id\": \"Arp4d7KZNRTS\"\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"zsh:1: no matches found: dlt[duckdb]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# install dependencies first\\n\",\n    \"!pip -q install dlt[duckdb]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"x7VGYS5hWNKQ\"\n   },\n   \"source\": [\n    \"<p>In this notebook we will use:</p>\\n\",\n    \"\\n\",\n    \"<ul>\\n\",\n    \"  <li><strong>dlt</strong> to extract, normalize, and load data</li>\\n\",\n    \"  <li><strong>DuckDB</strong> as the destination database 
(runs locally inside Colab)</li>\\n\",\n    \"</ul>\\n\",\n    \"\\n\",\n    \"<p>\\n\",\n    \"  DuckDB is great for beginners because it requires no setup and no credentials.\\n\",\n    \"</p>\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"aQTSvnvnHWBd\"\n   },\n   \"source\": [\n    \"## 📚 Step 1: Import Libraries\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"YFQGLTECWkpn\"\n   },\n   \"source\": [\n    \"\\n\",\n    \"<p>In this cell we import the libraries we will use throughout the notebook:</p>\\n\",\n    \"\\n\",\n    \"<ul>\\n\",\n    \"  <li><strong>dlt</strong> is the main library for building and running the pipeline</li>\\n\",\n    \"  <li><strong>rest_api_source</strong> helps us define an API source using a simple configuration</li>\\n\",\n    \"  <li><strong>islice</strong> (from <code>itertools</code>) is a small Python helper for previewing only a few records</li>\\n\",\n    \"</ul>\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {\n    \"id\": \"Lm8AbbHBImjI\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import dlt\\n\",\n    \"from itertools import islice\\n\",\n    \"from dlt.sources.rest_api import rest_api_source\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"UFoBTwDVhzRL\"\n   },\n   \"source\": [\n    \"## 🔗 Step 2: Define the API Source (Open Library)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"VdKrEM-VXEY2\"\n   },\n   \"source\": [\n    \"<p>\\n\",\n    \"  In <strong>dlt</strong>, a <strong>source</strong> is the part of your pipeline that knows how to fetch data from somewhere.\\n\",\n    \"  In this notebook, our source fetches data from the <strong>Open Library Search API</strong>.\\n\",\n    \"</p>\\n\",\n    \"\\n\",\n    \"<p>\\n\",\n    \"  We define the source using 
<code>rest_api_source</code>, which lets us describe an API in a simple\\n\",\n    \"  Python dictionary instead of writing lots of request code.\\n\",\n    \"</p>\\n\",\n    \"\\n\",\n    \"<p>\\n\",\n    \"  📖 <strong>Open Library Search API docs:</strong><br>\\n\",\n    \"  <a href=\\\"https://openlibrary.org/dev/docs/api/search\\\" target=\\\"_blank\\\">\\n\",\n    \"    https://openlibrary.org/dev/docs/api/search\\n\",\n    \"  </a>\\n\",\n    \"</p>\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {\n    \"id\": \"hOxkEKy4Kaj4\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def openlibrary_source(query: str = \\\"harry potter\\\"):\\n\",\n    \"\\n\",\n    \"    return rest_api_source({\\n\",\n    \"        \\\"client\\\": {\\n\",\n    \"            \\\"base_url\\\": \\\"https://openlibrary.org\\\",\\n\",\n    \"        },\\n\",\n    \"        \\\"resource_defaults\\\": {\\n\",\n    \"            \\\"primary_key\\\": \\\"key\\\",\\n\",\n    \"            \\\"write_disposition\\\": \\\"replace\\\",\\n\",\n    \"        },\\n\",\n    \"        \\\"resources\\\": [\\n\",\n    \"            {\\n\",\n    \"                \\\"name\\\": \\\"books\\\",\\n\",\n    \"                \\\"endpoint\\\": {\\n\",\n    \"                    \\\"path\\\": \\\"search.json\\\",\\n\",\n    \"                    \\\"params\\\": {\\n\",\n    \"                        \\\"q\\\": query,\\n\",\n    \"                        \\\"limit\\\": 100,\\n\",\n    \"                    },\\n\",\n    \"                    \\\"data_selector\\\": \\\"docs\\\",\\n\",\n    \"                    \\\"paginator\\\": {\\n\",\n    \"                        \\\"type\\\": \\\"offset\\\",\\n\",\n    \"                        \\\"limit\\\": 100,\\n\",\n    \"                        \\\"offset_param\\\": \\\"offset\\\",\\n\",\n    \"                        \\\"limit_param\\\": \\\"limit\\\",\\n\",\n    \"                        \\\"total_path\\\": 
\\\"numFound\\\",\\n\",\n    \"                    },\\n\",\n    \"                },\\n\",\n    \"            },\\n\",\n    \"        ],\\n\",\n    \"    })\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"ntKAVaEGYFgw\"\n   },\n   \"source\": [\n    \"## 🔧 Step 3: Create the dlt Pipeline\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"id\": \"bxpFEetGh3lS\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"pipeline = dlt.pipeline(\\n\",\n    \"    pipeline_name=\\\"ol_demo\\\",\\n\",\n    \"    destination=\\\"duckdb\\\",\\n\",\n    \"    dataset_name=\\\"ol_data\\\",\\n\",\n    \"    progress=\\\"log\\\" # logs the pipeline run (Optiona)\\n\",\n    \")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"y7CJ9A2HXsFb\"\n   },\n   \"source\": [\n    \"## 🔍 Understanding the Pipeline\\n\",\n    \"\\n\",\n    \"At this point we have defined two key building blocks:\\n\",\n    \"\\n\",\n    \"- **The source** describes where the data comes from and how to fetch it from the API.  \\n\",\n    \"- **The pipeline** describes where the data should go (DuckDB) and keeps track of tables, schemas, and run history.  \\n\",\n    \"\\n\",\n    \"---\\n\",\n    \"\\n\",\n    \"Instead of running everything at once, we will now run the pipeline in three separate phases so you can clearly see what happens at each stage:\\n\",\n    \"\\n\",\n    \"1. **Extract**: download raw data from the API  \\n\",\n    \"2. **Normalize**: turn nested JSON into relational tables  \\n\",\n    \"3. 
**Load**: write those tables into DuckDB  \\n\",\n    \"\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"![ETL Diagram](./images/etl_diagram.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"pAYgUUJIw-c4\"\n   },\n   \"source\": [\n    \"Once these steps make sense, we will run the full workflow again using one command:\\n\",\n    \"\\n\",\n    \"```python\\n\",\n    \"pipeline.run(openlibrary_source())\\n\",\n    \"```\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"JsfcBA7McJMo\"\n   },\n   \"source\": [\n    \"## ⬇️ Step 4: Extract\\n\",\n    \"\\n\",\n    \"Now we run the first stage of the pipeline: **Extract**.\\n\",\n    \"\\n\",\n    \"Extract means:\\n\",\n    \"\\n\",\n    \"- dlt sends requests to the Open Library API\\n\",\n    \"- the raw JSON responses are downloaded\\n\",\n    \"- the results are stored in dlt’s local working folder\\n\",\n    \"\\n\",\n    \"At this stage, the data is **not** in DuckDB yet. 
We are just confirming that we successfully pulled data from the API.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 5,\n   \"metadata\": {\n    \"id\": \"yifCIPxSKJZ4\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"extract_info = pipeline.extract(openlibrary_source())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"NLRRVLnLcNgl\"\n   },\n   \"source\": [\n    \"---\\n\",\n    \"\\n\",\n    \"### What we will print\\n\",\n    \"\\n\",\n    \"After extraction, we will print a small summary showing:\\n\",\n    \"\\n\",\n    \"- which **resources** were extracted\\n\",\n    \"- which **tables** will be created later\\n\",\n    \"- how many rows were extracted per resource\\n\",\n    \"\\n\",\n    \"This helps confirm that the pipeline is working before we move on to normalization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {\n    \"colab\": {\n     \"base_uri\": \"https://localhost:8080/\"\n    },\n    \"id\": \"wtDasHRNNNN0\",\n    \"outputId\": \"51c71eeb-5435-40a1-8728-ea48c59bfd58\"\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Resources: ['books']\\n\",\n      \"Tables: ['books']\\n\",\n      \"Load ID: 1770907406.962898\\n\",\n      \"\\n\",\n      \"Resource: books\\n\",\n      \"rows extracted: 3756\\n\",\n      \"\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"load_id = extract_info.loads_ids[-1]\\n\",\n    \"m = extract_info.metrics[load_id][0]\\n\",\n    \"\\n\",\n    \"print(\\\"Resources:\\\", list(m[\\\"resource_metrics\\\"].keys()))\\n\",\n    \"print(\\\"Tables:\\\", list(m[\\\"table_metrics\\\"].keys()))\\n\",\n    \"print(\\\"Load ID:\\\", load_id)\\n\",\n    \"print()\\n\",\n    \"\\n\",\n    \"for resource, rm in m[\\\"resource_metrics\\\"].items():\\n\",\n    \"    print(f\\\"Resource: {resource}\\\")\\n\",\n    \"    print(f\\\"rows extracted: 
{rm.items_count}\\\")\\n\",\n    \"    print()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"f6MwYtznc3UX\"\n   },\n   \"source\": [\n    \"### What you should see after Extract\\n\",\n    \"\\n\",\n    \"In our case, Extract shows only **one resource and one table**:\\n\",\n    \"\\n\",\n    \"- **Resources:** `['books']`  \\n\",\n    \"- **Tables:** `['books']`\\n\",\n    \"\\n\",\n    \"That is expected.\\n\",\n    \"\\n\",\n    \"The `search` endpoint returns a list of book results, so dlt stores those rows in a single table called `books`. The interesting part comes next, because many fields inside each row are lists or nested objects. Those will turn into additional tables during **Normalize**.\\n\",\n    \"\\n\",\n    \"Example output:\\n\",\n    \"\\n\",\n    \"- **3756 rows extracted** means we pulled 3,756 search results (books); your count may differ  \\n\",\n    \"\\n\",\n    \"---\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"lQVLZMcyXWkm\"\n   },\n   \"source\": [\n    \"## 🔄 Step 5: Normalize\\n\",\n    \"\\n\",\n    \"Now we run **Normalize**. This is where dlt transforms raw JSON into a clean relational structure.\\n\",\n    \"\\n\",\n    \"During normalization, dlt does three key things:\\n\",\n    \"\\n\",\n    \"### 1. Adds Tracking Columns to the Main Table\\n\",\n    \"\\n\",\n    \"dlt adds tracking columns to the main table (`books`):\\n\",\n    \"- `_dlt_id`: A unique identifier for each row (child tables get one too)\\n\",\n    \"- `_dlt_load_id`: Links each row to the load job that created it\\n\",\n    \"\\n\",\n    \"### 2. Flattens Nested Data into Child Tables\\n\",\n    \"\\n\",\n    \"APIs often return nested JSON. 
For example, a book can have multiple authors (a list), multiple editions, and multiple identifiers.\\n\",\n    \"\\n\",\n    \"dlt flattens these nested structures into separate **child tables** with names like:\\n\",\n    \"- `books__author_name`\\n\",\n    \"- `books__author_key`\\n\",\n    \"- `books__language`\\n\",\n    \"\\n\",\n    \"Each child table has a `_dlt_parent_id` column that references `_dlt_id` in the parent table. This is how dlt maintains relationships.\\n\",\n    \"\\n\",\n    \"### 3. Creates Metadata Tables\\n\",\n    \"\\n\",\n    \"dlt also creates internal tables to track pipeline state:\\n\",\n    \"- `_dlt_loads`: Tracks load history (when data was loaded, status)\\n\",\n    \"- `_dlt_pipeline_state`: Stores pipeline state for incremental loading\\n\",\n    \"- `_dlt_version`: Tracks schema versions\\n\",\n    \"\\n\",\n    \"In the next cell, we will print a summary showing which tables were created.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {\n    \"id\": \"LCmiiG3tXXwh\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"normalize_info = pipeline.normalize()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"colab\": {\n     \"base_uri\": \"https://localhost:8080/\"\n    },\n    \"id\": \"-kNiY112Xvuk\",\n    \"outputId\": \"502bff6b-edb2-4bd8-a9e9-1f1b88f20c48\"\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"Load ID: 1770907406.962898\\n\",\n      \"\\n\",\n      \"Tables created/updated:\\n\",\n      \"  - books: 3756 rows\\n\",\n      \"  - books__author_key: 4600 rows\\n\",\n      \"  - books__author_name: 4600 rows\\n\",\n      \"  - books__ia: 3422 rows\\n\",\n      \"  - books__ia_collection: 2724 rows\\n\",\n      \"  - books__language: 3748 rows\\n\",\n      \"  - books__id_standard_ebooks: 12 rows\\n\",\n      \"  - books__id_librivox: 60 
rows\\n\",\n      \"  - books__id_project_gutenberg: 54 rows\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"load_id = normalize_info.loads_ids[-1]\\n\",\n    \"m = normalize_info.metrics[load_id][0]\\n\",\n    \"\\n\",\n    \"print(\\\"Load ID:\\\", load_id)\\n\",\n    \"print()\\n\",\n    \"\\n\",\n    \"print(\\\"Tables created/updated:\\\")\\n\",\n    \"for table_name, tm in m[\\\"table_metrics\\\"].items():\\n\",\n    \"    # skip dlt internal tables to keep it beginner-friendly\\n\",\n    \"    if table_name.startswith(\\\"_dlt\\\"):\\n\",\n    \"        continue\\n\",\n    \"    print(f\\\"  - {table_name}: {tm.items_count} rows\\\")\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"ctHuJ0yEdNaq\"\n   },\n   \"source\": [\n    \"### What happened during Normalize?\\n\",\n    \"\\n\",\n    \"After running `pipeline.normalize()`, we now see multiple tables instead of just one.\\n\",\n    \"\\n\",\n    \"Tables created/updated:\\n\",\n    \"\\n\",\n    \"- `books`\\n\",\n    \"- `books__author_key`\\n\",\n    \"- `books__author_name`\\n\",\n    \"- `books__ia`\\n\",\n    \"- `books__ia_collection`\\n\",\n    \"- `books__language`\\n\",\n    \"- `books__id_standard_ebooks`\\n\",\n    \"- `books__id_librivox`\\n\",\n    \"- `books__id_project_gutenberg`\\n\",\n    \"\\n\",\n    \"---\\n\",\n    \"\\n\",\n    \"### What does this mean?\\n\",\n    \"\\n\",\n    \"We started with **N book search results** in the `books` table.\\n\",\n    \"\\n\",\n    \"During normalization:\\n\",\n    \"\\n\",\n    \"- Each book may have **more than one author**, so those were split into:\\n\",\n    \"  - `books__author_name`\\n\",\n    \"  - `books__author_key`\\n\",\n    \"\\n\",\n    \"- Each book may list **several languages**, which became:\\n\",\n    \"  - `books__language`\\n\",\n    \"\\n\",\n    \"- Some books carry **external identifiers** (Standard Ebooks, LibriVox, Project Gutenberg), which became:\\n\",\n    \"  - `books__id_standard_ebooks`\\n\",\n    \"  - `books__id_librivox`\\n\",\n    \"  - `books__id_project_gutenberg`\\n\",\n    \"\\n\",\n    \"- The `ia_collection` field is a list, so it became:\\n\",\n    \"  - `books__ia_collection`\\n\",\n    \"\\n\",\n    \"- The `ia` field (Internet Archive IDs) is a list, so it became:\\n\",\n    \"  - 
`books__ia`\\n\",\n    \"\\n\",\n    \"This is the key moment in the pipeline.\\n\",\n    \"\\n\",\n    \"The data has been transformed from nested JSON into a **relational structure** with multiple linked tables. This makes it much easier to query and analyze.\\n\",\n    \"\\n\",\n    \"---\\n\",\n    \"\\n\",\n    \"### Schema Visualization\\n\",\n    \"\\n\",\n    \"dlt can render the schema as a visual diagram. Run the next cell to see the parent-child table relationships:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"\\n\",\n       \"<script src=\\\"//d3js.org/d3.v7.min.js\\\"></script>\\n\",\n       \"<script src=\\\"https://unpkg.com/@hpcc-js/wasm@2.20.0/dist/graphviz.umd.js\\\"></script>\\n\",\n       \"<script src=\\\"https://unpkg.com/d3-graphviz@5.6.0/build/d3-graphviz.js\\\"></script>\\n\",\n       \"\\n\",\n       \"<div id=\\\"graph\\\" style=\\\"width:100%;height:100vh;display:flex;justify-content:center;align-items:center;\\\"></div>\\n\",\n       \"<script>\\n\",\n       \"    d3.select(\\\"#graph\\\")\\n\",\n       \"      .graphviz({fit: true})\\n\",\n       \"      .renderDot(\\n\",\n       \"        `\\n\",\n       \"        digraph rest_api {\\n\",\n       \"    graph [fontname=\\\"helvetica\\\", fontcolor=\\\"{TABLE_BORDER_COLOR}\\\", rankdir=\\\"BT\\\", ranksep=5, layout=\\\"twopi\\\", root=\\\"_dlt_loads\\\"];\\n\",\n       \"    node [penwidth=0, margin=0, fontname=\\\"helvetica\\\"];\\n\",\n       \"    edge [fontname=\\\"helvetica\\\", fontcolor=\\\"{TABLE_BORDER_COLOR}\\\", color=\\\"{TABLE_BORDER_COLOR}\\\"];\\n\",\n       \"\\n\",\n       \"\\\"books\\\" [id=\\\"books\\\"; label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" 
bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>books</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">cover_edition_key</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">cover_i</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">ebook_access</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                
<table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">edition_count</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f5\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">first_publish_year</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f6\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">has_fulltext</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bool</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f7\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\"><B>key🔑</B></td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    
</tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f8\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">lending_edition_s</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f9\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">lending_identifier_s</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f10\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">public_scan_b</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bool</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f11\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \" 
                   <tr>\\n\",\n       \"                        <td align=\\\"left\\\">title</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f12\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_load_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f13\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f14\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">subtitle</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        
</tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"books__author_key\\\" [id=\\\"books__author_key\\\"; label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>books__author_key</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">value</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_parent_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td 
align=\\\"left\\\">_dlt_list_idx</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"books__author_name\\\" [id=\\\"books__author_name\\\"; label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>books__author_name</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">value</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"  
          <td align=\\\"left\\\" port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_parent_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_list_idx</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"books__ia\\\" [id=\\\"books__ia\\\"; label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td 
port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>books__ia</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">value</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_parent_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_list_idx</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" 
bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"books__ia_collection\\\" [id=\\\"books__ia_collection\\\"; label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>books__ia_collection</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">value</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_parent_id</td>\\n\",\n       \"                        <td 
align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_list_idx</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"books__language\\\" [id=\\\"books__language\\\"; label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>books__language</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" 
bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">value</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_parent_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_list_idx</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text 
<B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"books__id_standard_ebooks\\\" [id=\\\"books__id_standard_ebooks\\\"; label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>books__id_standard_ebooks</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">value</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_parent_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n      
 \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_list_idx</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"books__id_librivox\\\" [id=\\\"books__id_librivox\\\"; label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>books__id_librivox</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">value</td>\\n\",\n       \"                        <td 
align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_parent_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_list_idx</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"books__id_project_gutenberg\\\" 
[id=\\\"books__id_project_gutenberg\\\"; label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>books__id_project_gutenberg</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">value</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_parent_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_list_idx</td>\\n\",\n       \"                        <td 
align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"_dlt_version\\\" [id=\\\"_dlt_version\\\";tooltip=\\\"Created by DLT. Tracks schema updates\\\";label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>_dlt_version</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">version</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" 
port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">engine_version</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">inserted_at</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>timestamp <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">schema_name</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f5\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">version_hash</td>\\n\",\n       \"                        
<td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f6\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">schema</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"_dlt_loads\\\" [id=\\\"_dlt_loads\\\";tooltip=\\\"Created by DLT. Tracks completed loads\\\";label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>_dlt_loads</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">load_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" 
port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">schema_name</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">status</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">inserted_at</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>timestamp <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f5\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">schema_version_hash</td>\\n\",\n       \"                        <td 
align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"\\\"_dlt_pipeline_state\\\" [id=\\\"_dlt_pipeline_state\\\"; label=<\\n\",\n       \"    <table border=\\\"0\\\" color=\\\"#1c1c34\\\" cellborder=\\\"1\\\" cellspacing=\\\"0\\\" cellpadding=\\\"6\\\">\\n\",\n       \"                <tr>\\n\",\n       \"            <td port=\\\"p0\\\" bgcolor=\\\"#bbca06\\\">\\n\",\n       \"                <font color=\\\"#1c1c34\\\"><b>_dlt_pipeline_state</b></font>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"\\n\",\n       \"        <tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f1\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">version</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f2\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">engine_version</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>bigint <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f3\\\" 
bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">pipeline_name</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f4\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">state</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f5\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">created_at</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>timestamp <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f6\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">version_hash</td>\\n\",\n       \"                        <td 
align=\\\"right\\\"><font>text</font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f7\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_load_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr><tr>\\n\",\n       \"            <td align=\\\"left\\\" port=\\\"f8\\\" bgcolor=\\\"#e7e2dd\\\">\\n\",\n       \"                <table cellpadding=\\\"0\\\" cellspacing=\\\"0\\\" border=\\\"0\\\">\\n\",\n       \"                    <tr>\\n\",\n       \"                        <td align=\\\"left\\\">_dlt_id</td>\\n\",\n       \"                        <td align=\\\"right\\\"><font>text <B>NN</B></font></td>\\n\",\n       \"                    </tr>\\n\",\n       \"                </table>\\n\",\n       \"            </td>\\n\",\n       \"        </tr>\\n\",\n       \"    </table>\\n\",\n       \">];\\n\",\n       \"\\n\",\n       \"books:p0 -> _dlt_loads:p0 [style=invis]\\n\",\n       \"books:f12:_ -> _dlt_loads:f1:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"_dlt_pipeline_state:p0 -> _dlt_loads:p0 [style=invis]\\n\",\n       \"_dlt_pipeline_state:f7:_ -> _dlt_loads:f1:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"books__author_key:p0 -> books:p0 [style=invis]\\n\",\n       \"books__author_key:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", 
arrowhead=\\\"dot\\\"];\\n\",\n       \"books__author_name:p0 -> books:p0 [style=invis]\\n\",\n       \"books__author_name:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"books__ia:p0 -> books:p0 [style=invis]\\n\",\n       \"books__ia:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"books__ia_collection:p0 -> books:p0 [style=invis]\\n\",\n       \"books__ia_collection:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"books__language:p0 -> books:p0 [style=invis]\\n\",\n       \"books__language:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"books__id_standard_ebooks:p0 -> books:p0 [style=invis]\\n\",\n       \"books__id_standard_ebooks:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"books__id_librivox:p0 -> books:p0 [style=invis]\\n\",\n       \"books__id_librivox:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"books__id_project_gutenberg:p0 -> books:p0 [style=invis]\\n\",\n       \"books__id_project_gutenberg:f2:_ -> books:f13:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"_dlt_version:p0 -> _dlt_loads:p0 [style=invis]\\n\",\n       \"_dlt_version:f5:_ -> _dlt_loads:f5:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"_dlt_version:p0 -> _dlt_loads:p0 [style=invis]\\n\",\n       \"_dlt_version:f4:_ -> _dlt_loads:f2:_ [dir=both, penwidth=1, color=\\\"#1c1c34\\\", arrowtail=\\\"vee\\\", arrowhead=\\\"dot\\\"];\\n\",\n       \"}\\n\",\n       \"        `\\n\",\n       \"  
    );\\n\",\n       \"</script>\\n\"\n      ],\n      \"text/plain\": [\n       \"<dlt.Schema(name='rest_api', version=2, tables=['_dlt_version', '_dlt_loads', 'books', '_dlt_pipeline_state', 'books__author_key', 'books__author_name', 'books__ia', 'books__ia_collection', 'books__language', 'books__id_standard_ebooks', 'books__id_librivox', 'books__id_project_gutenberg'], version_hash='ZJIabaQJ9DAYgsR04wEVeXOgU80roBUfdvrR2YoBEyU=')>\"\n      ]\n     },\n     \"execution_count\": 13,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# Display schema \\n\",\n    \"pipeline.default_schema\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"lJ5QzSnYdidK\"\n   },\n   \"source\": [\n    \"## 📤 Step 6: Load\\n\",\n    \"\\n\",\n    \"Now we run the final stage of the pipeline: **Load**.\\n\",\n    \"\\n\",\n    \"Load means:\\n\",\n    \"\\n\",\n    \"- dlt creates tables in DuckDB (if they do not already exist)\\n\",\n    \"- the normalized rows are inserted into those tables\\n\",\n    \"- the pipeline records the load in its internal tracking tables\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {\n    \"id\": \"d9Xb67c5XfL5\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"load_info = pipeline.load()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"ehkz8lESGGdm\"\n   },\n   \"source\": [\n    \"\\n\",\n    \"After this step, the data is fully stored in the database and ready to query.\\n\",\n    \"\\n\",\n    \"At this point:\\n\",\n    \"\\n\",\n    \"- The `books` table contains our books\\n\",\n    \"- The related tables (such as `books__author_name` and `books__editions__docs`) contain the exploded nested data\\n\",\n    \"- Everything is now queryable using `pipeline.dataset()` or SQL\\n\",\n    \"\\n\",\n    \"This is the moment where the data officially moves from 
“pipeline processing” into a database you can explore.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"jBznxM00eCOF\"\n   },\n   \"source\": [\n    \"## 🚀 Step 7: Run the Full Pipeline\\n\",\n    \"\\n\",\n    \"Now that we have walked through each step individually, we can run the entire workflow using a single command:\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {\n    \"id\": \"YQLigkh-f7Ey\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"load_info = pipeline.run(openlibrary_source())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"SbLkA8W7eNPb\"\n   },\n   \"source\": [\n    \"<h3>What does <code>pipeline.run()</code> do?</h3>\\n\",\n    \"\\n\",\n    \"<p>\\n\",\n    \"  <code>pipeline.run()</code> simply combines the three steps we already executed manually:\\n\",\n    \"</p>\\n\",\n    \"\\n\",\n    \"<ol>\\n\",\n    \"  <li><strong>Extract</strong> – fetch data from the Open Library API</li>\\n\",\n    \"  <li><strong>Normalize</strong> – convert nested JSON into relational tables</li>\\n\",\n    \"  <li><strong>Load</strong> – write those tables into DuckDB</li>\\n\",\n    \"</ol>\\n\",\n    \"\\n\",\n    \"<p>In other words, this:</p>\\n\",\n    \"\\n\",\n    \"<pre><code>pipeline.run(source)</code></pre>\\n\",\n    \"\\n\",\n    \"<p>is equivalent to:</p>\\n\",\n    \"\\n\",\n    \"<pre><code>pipeline.extract(source)\\n\",\n    \"pipeline.normalize()\\n\",\n    \"pipeline.load()</code></pre>\\n\",\n    \"\\n\",\n    \"<p>\\n\",\n    \"  There is no hidden magic. 
It just runs the full ELT process in order.\\n\",\n    \"</p>\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"7ViMq6gIfJj_\"\n   },\n   \"source\": [\n    \"## 🔎 Step 8: Inspect the Loaded Data\\n\",\n    \"\\n\",\n    \"Now that the data is loaded into DuckDB, we can inspect it using `pipeline.dataset()`.\\n\",\n    \"\\n\",\n    \"This gives us a convenient Python interface for exploring the tables that dlt created, without writing SQL.\\n\",\n    \"\\n\",\n    \"---\\n\",\n    \"\\n\",\n    \"### List available tables\\n\",\n    \"\\n\",\n    \"First, let’s see what tables exist in the dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 12,\n   \"metadata\": {\n    \"id\": \"bmnrK1aVZXPO\"\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"ds = pipeline.dataset()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 13,\n   \"metadata\": {\n    \"colab\": {\n     \"base_uri\": \"https://localhost:8080/\"\n    },\n    \"id\": \"SV6J6AtBf0xq\",\n    \"outputId\": \"19ad26bf-f34a-4f8e-c30c-5acd3342c3c5\"\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"['books',\\n\",\n       \" 'books__author_key',\\n\",\n       \" 'books__author_name',\\n\",\n       \" 'books__ia',\\n\",\n       \" 'books__ia_collection',\\n\",\n       \" 'books__language',\\n\",\n       \" 'books__id_standard_ebooks',\\n\",\n       \" 'books__id_librivox',\\n\",\n       \" 'books__id_project_gutenberg',\\n\",\n       \" '_dlt_version',\\n\",\n       \" '_dlt_loads',\\n\",\n       \" '_dlt_pipeline_state']\"\n      ]\n     },\n     \"execution_count\": 13,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"ds.tables\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {\n    \"colab\": {\n     \"base_uri\": \"https://localhost:8080/\",\n     \"height\": 315\n    },\n    
\"id\": \"WLa4yN7lf1TF\",\n    \"outputId\": \"d2da841b-a8bf-461f-a011-eb1db644656f\"\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"application/vnd.google.colaboratory.intrinsic+json\": {\n       \"summary\": \"{\\n  \\\"name\\\": \\\"df\\\",\\n  \\\"rows\\\": 3756,\\n  \\\"fields\\\": [\\n    {\\n      \\\"column\\\": \\\"cover_edition_key\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"category\\\",\\n        \\\"num_unique_values\\\": 1192,\\n        \\\"samples\\\": [\\n          \\\"OL24951484M\\\",\\n          \\\"OL9131663M\\\",\\n          \\\"OL47198575M\\\"\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"cover_i\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"Int64\\\",\\n        \\\"num_unique_values\\\": 1288,\\n        \\\"samples\\\": [\\n          842156,\\n          10365881,\\n          3341732\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"ebook_access\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"category\\\",\\n        \\\"num_unique_values\\\": 5,\\n        \\\"samples\\\": [\\n          \\\"printdisabled\\\",\\n          \\\"unclassified\\\",\\n          \\\"no_ebook\\\"\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"edition_count\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"number\\\",\\n        \\\"std\\\": 108,\\n        \\\"min\\\": 0,\\n        \\\"max\\\": 3546,\\n        \\\"num_unique_values\\\": 62,\\n        \\\"samples\\\": [\\n          44,\\n          92,\\n          396\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": 
\\\"first_publish_year\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"Int64\\\",\\n        \\\"num_unique_values\\\": 127,\\n        \\\"samples\\\": [\\n          2008,\\n          1622,\\n          1962\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"has_fulltext\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"boolean\\\",\\n        \\\"num_unique_values\\\": 2,\\n        \\\"samples\\\": [\\n          false,\\n          true\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"key\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"string\\\",\\n        \\\"num_unique_values\\\": 3756,\\n        \\\"samples\\\": [\\n          \\\"/works/OL34662215W\\\",\\n          \\\"/works/OL39702699W\\\"\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"lending_edition_s\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"category\\\",\\n        \\\"num_unique_values\\\": 281,\\n        \\\"samples\\\": [\\n          \\\"OL45637056M\\\",\\n          \\\"OL26064272M\\\"\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"lending_identifier_s\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"category\\\",\\n        \\\"num_unique_values\\\": 281,\\n        \\\"samples\\\": [\\n          \\\"alicesadventures0000unse_v7d2\\\",\\n          \\\"harrypottermagic0000unse_n5w6\\\"\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"public_scan_b\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"boolean\\\",\\n 
       \\\"num_unique_values\\\": 2,\\n        \\\"samples\\\": [\\n          true,\\n          false\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"title\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"string\\\",\\n        \\\"num_unique_values\\\": 2984,\\n        \\\"samples\\\": [\\n          \\\"1000 Facts and Trivia about Marvel Cinematic Universe, Game of Thrones, Disney, Star Wars, Harry Potter 1\\\",\\n          \\\"The Unofficial Harry Potter Insults Handbook\\\"\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"_dlt_load_id\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"category\\\",\\n        \\\"num_unique_values\\\": 1,\\n        \\\"samples\\\": [\\n          \\\"1770819876.9353185\\\"\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"_dlt_id\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"string\\\",\\n        \\\"num_unique_values\\\": 3756,\\n        \\\"samples\\\": [\\n          \\\"ZN3UfCkWBXFxSw\\\"\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    },\\n    {\\n      \\\"column\\\": \\\"subtitle\\\",\\n      \\\"properties\\\": {\\n        \\\"dtype\\\": \\\"category\\\",\\n        \\\"num_unique_values\\\": 59,\\n        \\\"samples\\\": [\\n          \\\"Hogwarts Through the Years\\\"\\n        ],\\n        \\\"semantic_type\\\": \\\"\\\",\\n        \\\"description\\\": \\\"\\\"\\n      }\\n    }\\n  ]\\n}\",\n       \"type\": \"dataframe\",\n       \"variable_name\": \"df\"\n      },\n      \"text/html\": [\n       \"\\n\",\n       \"  <div id=\\\"df-78b9c6b5-669d-4905-9f29-f9885ad30a9d\\\" 
class=\\\"colab-df-container\\\">\\n\",\n       \"    <div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>cover_edition_key</th>\\n\",\n       \"      <th>cover_i</th>\\n\",\n       \"      <th>ebook_access</th>\\n\",\n       \"      <th>edition_count</th>\\n\",\n       \"      <th>first_publish_year</th>\\n\",\n       \"      <th>has_fulltext</th>\\n\",\n       \"      <th>key</th>\\n\",\n       \"      <th>lending_edition_s</th>\\n\",\n       \"      <th>lending_identifier_s</th>\\n\",\n       \"      <th>public_scan_b</th>\\n\",\n       \"      <th>title</th>\\n\",\n       \"      <th>_dlt_load_id</th>\\n\",\n       \"      <th>_dlt_id</th>\\n\",\n       \"      <th>subtitle</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>OL61027601M</td>\\n\",\n       \"      <td>15155833</td>\\n\",\n       \"      <td>borrowable</td>\\n\",\n       \"      <td>396</td>\\n\",\n       \"      <td>1997</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>/works/OL82563W</td>\\n\",\n       \"      <td>OL38565767M</td>\\n\",\n       \"      <td>harrypotterylapi0000rowl_q5r6</td>\\n\",\n       \"      <td>False</td>\\n\",\n       \"      <td>Harry Potter and the Philosopher's Stone</td>\\n\",\n       \"      <td>1770819876.9353185</td>\\n\",\n       \"      
<td>lGJrV2BS8Z9qJQ</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>OL26378158M</td>\\n\",\n       \"      <td>15158660</td>\\n\",\n       \"      <td>printdisabled</td>\\n\",\n       \"      <td>144</td>\\n\",\n       \"      <td>2007</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>/works/OL82586W</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>False</td>\\n\",\n       \"      <td>Harry Potter and the Deathly Hallows</td>\\n\",\n       \"      <td>1770819876.9353185</td>\\n\",\n       \"      <td>F9W0WQlLwgvsFw</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>OL26234270M</td>\\n\",\n       \"      <td>10580435</td>\\n\",\n       \"      <td>borrowable</td>\\n\",\n       \"      <td>278</td>\\n\",\n       \"      <td>1999</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>/works/OL82536W</td>\\n\",\n       \"      <td>OL48101764M</td>\\n\",\n       \"      <td>bdrc-W8LS66814</td>\\n\",\n       \"      <td>False</td>\\n\",\n       \"      <td>Harry Potter and the Prisoner of Azkaban</td>\\n\",\n       \"      <td>1770819876.9353185</td>\\n\",\n       \"      <td>kSdfO1XbBVAjmQ</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\\n\",\n       \"    <div class=\\\"colab-df-buttons\\\">\\n\",\n       \"\\n\",\n       \"  <div class=\\\"colab-df-container\\\">\\n\",\n       \"    <button class=\\\"colab-df-convert\\\" onclick=\\\"convertToInteractive('df-78b9c6b5-669d-4905-9f29-f9885ad30a9d')\\\"\\n\",\n       \"            title=\\\"Convert this dataframe to an interactive table.\\\"\\n\",\n       \"            style=\\\"display:none;\\\">\\n\",\n       \"\\n\",\n       \"  <svg 
xmlns=\\\"http://www.w3.org/2000/svg\\\" height=\\\"24px\\\" viewBox=\\\"0 -960 960 960\\\">\\n\",\n       \"    <path d=\\\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\\\"/>\\n\",\n       \"  </svg>\\n\",\n       \"    </button>\\n\",\n       \"\\n\",\n       \"  <style>\\n\",\n       \"    .colab-df-container {\\n\",\n       \"      display:flex;\\n\",\n       \"      gap: 12px;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .colab-df-convert {\\n\",\n       \"      background-color: #E8F0FE;\\n\",\n       \"      border: none;\\n\",\n       \"      border-radius: 50%;\\n\",\n       \"      cursor: pointer;\\n\",\n       \"      display: none;\\n\",\n       \"      fill: #1967D2;\\n\",\n       \"      height: 32px;\\n\",\n       \"      padding: 0 0 0 0;\\n\",\n       \"      width: 32px;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .colab-df-convert:hover {\\n\",\n       \"      background-color: #E2EBFA;\\n\",\n       \"      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\\n\",\n       \"      fill: #174EA6;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .colab-df-buttons div {\\n\",\n       \"      margin-bottom: 4px;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    [theme=dark] .colab-df-convert {\\n\",\n       \"      background-color: #3B4455;\\n\",\n       \"      fill: #D2E3FC;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    [theme=dark] .colab-df-convert:hover {\\n\",\n       \"      background-color: #434B5C;\\n\",\n       \"      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\\n\",\n       \"      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\\n\",\n       \"      fill: #FFFFFF;\\n\",\n       \"    }\\n\",\n       \"  </style>\\n\",\n       \"\\n\",\n       \"    <script>\\n\",\n       \"    
  const buttonEl =\\n\",\n       \"        document.querySelector('#df-78b9c6b5-669d-4905-9f29-f9885ad30a9d button.colab-df-convert');\\n\",\n       \"      buttonEl.style.display =\\n\",\n       \"        google.colab.kernel.accessAllowed ? 'block' : 'none';\\n\",\n       \"\\n\",\n       \"      async function convertToInteractive(key) {\\n\",\n       \"        const element = document.querySelector('#df-78b9c6b5-669d-4905-9f29-f9885ad30a9d');\\n\",\n       \"        const dataTable =\\n\",\n       \"          await google.colab.kernel.invokeFunction('convertToInteractive',\\n\",\n       \"                                                    [key], {});\\n\",\n       \"        if (!dataTable) return;\\n\",\n       \"\\n\",\n       \"        const docLinkHtml = 'Like what you see? Visit the ' +\\n\",\n       \"          '<a target=\\\"_blank\\\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\\n\",\n       \"          + ' to learn more about interactive tables.';\\n\",\n       \"        element.innerHTML = '';\\n\",\n       \"        dataTable['output_type'] = 'display_data';\\n\",\n       \"        await google.colab.output.renderOutput(dataTable, element);\\n\",\n       \"        const docLink = document.createElement('div');\\n\",\n       \"        docLink.innerHTML = docLinkHtml;\\n\",\n       \"        element.appendChild(docLink);\\n\",\n       \"      }\\n\",\n       \"    </script>\\n\",\n       \"  </div>\\n\",\n       \"\\n\",\n       \"\\n\",\n       \"    </div>\\n\",\n       \"  </div>\\n\"\n      ],\n      \"text/plain\": [\n       \"  cover_edition_key   cover_i   ebook_access  edition_count  \\\\\\n\",\n       \"0       OL61027601M  15155833     borrowable            396   \\n\",\n       \"1       OL26378158M  15158660  printdisabled            144   \\n\",\n       \"2       OL26234270M  10580435     borrowable            278   \\n\",\n       \"\\n\",\n       \"   first_publish_year  has_fulltext            
  key lending_edition_s  \\\\\\n\",\n       \"0                1997          True  /works/OL82563W       OL38565767M   \\n\",\n       \"1                2007          True  /works/OL82586W              None   \\n\",\n       \"2                1999          True  /works/OL82536W       OL48101764M   \\n\",\n       \"\\n\",\n       \"            lending_identifier_s  public_scan_b  \\\\\\n\",\n       \"0  harrypotterylapi0000rowl_q5r6          False   \\n\",\n       \"1                           None          False   \\n\",\n       \"2                 bdrc-W8LS66814          False   \\n\",\n       \"\\n\",\n       \"                                      title        _dlt_load_id  \\\\\\n\",\n       \"0  Harry Potter and the Philosopher's Stone  1770819876.9353185   \\n\",\n       \"1      Harry Potter and the Deathly Hallows  1770819876.9353185   \\n\",\n       \"2  Harry Potter and the Prisoner of Azkaban  1770819876.9353185   \\n\",\n       \"\\n\",\n       \"          _dlt_id subtitle  \\n\",\n       \"0  lGJrV2BS8Z9qJQ     None  \\n\",\n       \"1  F9W0WQlLwgvsFw     None  \\n\",\n       \"2  kSdfO1XbBVAjmQ     None  \"\n      ]\n     },\n     \"execution_count\": 17,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"df = ds.books.df()      # main table\\n\",\n    \"df.head(3)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"OWFqaH2wgCWR\"\n   },\n   \"source\": [\n    \"## 💡 Conclusion\\n\",\n    \"\\n\",\n    \"### What dlt handled for us\\n\",\n    \"\\n\",\n    \"✔ API requests  \\n\",\n    \"✔ JSON normalization  \\n\",\n    \"✔ Table creation  \\n\",\n    \"✔ Database loading  \\n\",\n    \"✔ Simple dataset inspection  \\n\",\n    \"\\n\",\n    \"---\\n\",\n    \"\\n\",\n    \"### But there are still friction points\\n\",\n    \"\\n\",\n    \"• Getting the REST API config exactly right  \\n\",\n    \"• Remembering paginator syntax  \\n\",\n    \"• Remembering how to 
inspect tables  \\n\",\n    \"• Debugging schema or pagination issues  \\n\",\n    \"• Writing Python or SQL to get insights  \\n\",\n    \"\\n\",\n    \"It works... but it still takes effort.\\n\",\n    \"\\n\",\n    \"---\\n\",\n    \"\\n\",\n    \"## 🚀 Next Up: LLM-Powered Workflows\\n\",\n    \"\\n\",\n    \"dlt now integrates LLMs directly into the workflow to make:\\n\",\n    \"\\n\",\n    \"• Pipeline runs easier  \\n\",\n    \"• Debugging faster  \\n\",\n    \"• Schema inspection simpler  \\n\",\n    \"• Data analysis more natural  \\n\",\n    \"\\n\",\n    \"Instead of writing glue code, you can use natural language.\\n\",\n    \"\\n\",\n    \"In the workshop, we will see what that looks like.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 17,\n   \"metadata\": {\n    \"id\": \"BweSVO3igErN\"\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"colab\": {\n   \"provenance\": [],\n   \"toc_visible\": true\n  },\n  \"kernelspec\": {\n   \"display_name\": \"Python 3 (ipykernel)\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.13.5\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 4\n}\n"
  },
  {
    "path": "cohorts/2026/workshops/dlt/dlt_homework.md",
    "content": "# Homework: Build Your Own dlt Pipeline\n\nYou've seen how to build a pipeline with a scaffolded source. Now it's your turn to do it from scratch with a **custom API**.\n\n## Workshop Content\n\n* [Workshop README](README.md)\n* [dlt Pipeline Overview Notebook (Google Colab)](https://colab.research.google.com/github/anair123/data-engineering-zoomcamp/blob/workshop/dlt_2026/cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb)\n* [Workshop registration page](https://luma.com/hzis1yzp)\n\n## The Challenge\n\nFor this homework, build a dlt pipeline that loads NYC taxi trip data from a custom API into DuckDB and then answer some questions using the loaded data.\n\n## Data Source\n\nYou'll be working with **NYC Yellow Taxi trip data** from a custom API (not available as a dlt scaffold). This dataset contains records of individual taxi trips in New York City.\n\n| Property | Value |\n|----------|-------|\n| Base URL | `https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api` |\n| Format | Paginated JSON |\n| Page Size | 1,000 records per page |\n| Pagination | Stop when an empty page is returned |\n\n## Setup Instructions\n\nSince this API is custom (not one of the scaffolds in dlt workspace), the setup is slightly different.\n\n### Step 1: Create a New Project (or Reuse Your Demo Project)\n\nIf you already created a project folder while following along with the workshop demo, you can reuse that folder. 
Otherwise, create a new one:\n\n```bash\nmkdir taxi-pipeline\ncd taxi-pipeline\n```\n\nOpen this folder in Cursor (or your preferred agentic IDE).\n\n### Step 2: Set Up the dlt MCP Server (If Not Already Done)\n\nChoose the setup for your IDE:\n\nCursor - go to **Settings → Tools & MCP → New MCP Server** and add:\n\n```json\n{\n  \"mcpServers\": {\n    \"dlt\": {\n      \"command\": \"uv\",\n      \"args\": [\n        \"run\",\n        \"--with\",\n        \"dlt[duckdb]\",\n        \"--with\",\n        \"dlt-mcp[search]\",\n        \"python\",\n        \"-m\",\n        \"dlt_mcp\"\n      ]\n    }\n  }\n}\n```\n\nVS Code (Copilot) - create `.vscode/mcp.json` in your project folder:\n\n```json\n{\n  \"servers\": {\n    \"dlt\": {\n      \"command\": \"uv\",\n      \"args\": [\n        \"run\",\n        \"--with\",\n        \"dlt[duckdb]\",\n        \"--with\",\n        \"dlt-mcp[search]\",\n        \"python\",\n        \"-m\",\n        \"dlt_mcp\"\n      ]\n    }\n  }\n}\n```\n\nClaude Code - run in your terminal:\n\n```bash\nclaude mcp add dlt -- uv run --with \"dlt[duckdb]\" --with \"dlt-mcp[search]\" python -m dlt_mcp\n```\n\nThis enables the dlt MCP server, giving the AI access to dlt documentation, code examples, and your pipeline metadata.\n\n### Step 3: Install dlt\n\n```bash\npip install \"dlt[workspace]\"\n```\n\n### Step 4: Initialize the Project\n\n```bash\ndlt init dlthub:taxi_pipeline duckdb\n```\n\nYou can name the project whatever you like. Since this API has no scaffold, the command will create:\n- The dlt project files\n- Cursor rules for AI assistance\n\n**But no YAML file with API metadata.** You will need to provide the API information yourself.\n\n### Step 5: Prompt the Agent\n\nNow use your AI assistant to build the pipeline. 
You'll need to provide the API details in your prompt since there's no scaffold.\n\nHere's an example to get you started:\n\n```\nBuild a REST API source for NYC taxi data.\n\nAPI details:\n- Base URL: https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api\n- Data format: Paginated JSON (1,000 records per page)\n- Pagination: Stop when an empty page is returned\n\nPlace the code in taxi_pipeline.py and name the pipeline taxi_pipeline.\nUse @dlt rest api as a tutorial.\n```\n\n### Step 6: Run and Debug\n\nRun your pipeline and iterate with the agent until it works:\n\n```bash\npython taxi_pipeline.py\n```\n\n---\n\n## Questions\n\nOnce your pipeline has run successfully, use the methods covered in the workshop to investigate the following:\n\n- **dlt Dashboard**: `dlt pipeline taxi_pipeline show`\n- **dlt MCP Server**: Ask the agent questions about your pipeline\n- **Marimo Notebook**: Build visualizations and run queries\n\nWe challenge you to try out the different methods explored in the workshop when answering these questions to see what works best for you. 
Feel free to share your thoughts on what worked (or didn't) in your submission!\n\n### Question 1: What is the start date and end date of the dataset?\n\n- 2009-01-01 to 2009-01-31\n- 2009-06-01 to 2009-07-01\n- 2024-01-01 to 2024-02-01\n- 2024-06-01 to 2024-07-01\n\n### Question 2: What proportion of trips are paid with credit card?\n\n- 16.66%\n- 26.66%\n- 36.66%\n- 46.66%\n\n### Question 3: What is the total amount of money generated in tips?\n\n- $4,063.41\n- $6,063.41\n- $8,063.41\n- $10,063.41\n\n\n### Resources\n\n| Resource | Link |\n|----------|------|\n| dlt Dashboard Docs | [dlthub.com/docs/general-usage/dashboard](https://dlthub.com/docs/general-usage/dashboard) |\n| marimo + dlt Guide | [dlthub.com/docs/general-usage/dataset-access/marimo](https://dlthub.com/docs/general-usage/dataset-access/marimo) |\n| dlt Documentation | [dlthub.com/docs](https://dlthub.com/docs) |\n\n---\n\n## Submitting the solutions\n\n- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/dlt\n- Deadline: See the website\n\n## Tips\n\n- The API returns paginated data. Make sure your pipeline handles pagination correctly.\n- If the agent gets stuck, paste the error into the chat and let it debug.\n- Use the dlt MCP server to ask questions about your pipeline metadata.\n\n\n## Learning in Public\n\nWe encourage everyone to share what they learned. This is called \"learning in public\".\n\nRead more about the benefits [here](https://alexeyondata.substack.com/p/benefits-of-learning-in-public-and).\n\n### Example post for LinkedIn\n\n```\n🚀 dlt Workshop of Data Engineering Zoomcamp by @DataTalksClub complete!\n\nJust finished the Data Ingestion workshop with @dltHub. 
Learned how to:\n\n✅ Build REST API data pipelines with dlt\n✅ Use AI-assisted development with dlt MCP Server\n✅ Load paginated API data into DuckDB\n✅ Inspect pipeline data with dlt Dashboard and marimo notebooks\n\nBuilt a full NYC taxi data pipeline from a custom API - AI-assisted data engineering is the future!\n\nHere's my homework solution: <LINK>\n\nFollowing along with this amazing free course - who else is learning data engineering?\n\nYou can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n\n### Example post for Twitter/X\n\n```\n🔄 dlt Workshop of Data Engineering Zoomcamp done!\n\n- REST API pipelines with @dltHub\n- AI-assisted pipeline building\n- DuckDB as local data warehouse\n- dlt Dashboard & marimo notebooks\n\nMy solution: <LINK>\n\nFree course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/\n```\n"
  },
  {
    "path": "cohorts/2026/workshops/dlt/open_library_pipeline.py",
    "content": "\"\"\"Pipeline to ingest data from the Open Library Search API.\"\"\"\n\nimport dlt\nfrom dlt.sources.rest_api import rest_api_source\n\n\ndef open_library_source(query: str = \"harry potter\"):\n    \"\"\"\n    Create a dlt source for the Open Library Search API.\n    \n    Args:\n        query: Search query string (default: \"harry potter\")\n    \"\"\"\n    return rest_api_source({\n        \"client\": {\n            \"base_url\": \"https://openlibrary.org\",\n        },\n        \"resource_defaults\": {\n            \"primary_key\": \"key\",\n            \"write_disposition\": \"replace\",\n        },\n        \"resources\": [\n            {\n                \"name\": \"books\",\n                \"endpoint\": {\n                    \"path\": \"search.json\",\n                    \"params\": {\n                        \"q\": query,\n                        \"limit\": 100,\n                    },\n                    \"data_selector\": \"docs\",\n                    \"paginator\": {\n                        \"type\": \"offset\",\n                        \"limit\": 100,\n                        \"offset_param\": \"offset\",\n                        \"limit_param\": \"limit\",\n                        \"total_path\": \"numFound\",\n                    },\n                },\n            },\n        ],\n    })\n\n\nif __name__ == \"__main__\":\n    pipeline = dlt.pipeline(\n        pipeline_name=\"open_library_pipeline\",\n        destination=\"duckdb\",\n        dataset_name=\"open_library_data\",\n        progress=\"log\",\n    )\n\n    # Load Harry Potter books from Open Library\n    load_info = pipeline.run(open_library_source(query=\"harry potter\"))\n    print(load_info)\n"
  },
  {
    "path": "cohorts/2026/workshops/dlt/pyproject.toml",
    "content": "[project]\nname = \"zoomcamp-workshop-prep\"\nversion = \"0.1.0\"\ndescription = \"Add your description here\"\nreadme = \"README.md\"\nrequires-python = \">=3.13\"\ndependencies = [\n    \"altair>=6.0.0\",\n    \"dlt[workspace]>=1.21.0\",\n    \"ibis-framework[duckdb]>=12.0.0\",\n    \"jupyterlab>=4.5.4\",\n    \"marimo>=0.19.9\",\n]\n"
  },
  {
    "path": "cohorts/2026/workshops/dlt.md",
    "content": "# From APIs to Warehouses: AI-Assisted Data Ingestion with dlt\n\n[Video](https://www.youtube.com/watch?v=5eMytPBgmVs)\n\nThis hands-on workshop focuses on building reliable data ingestion pipelines to data warehouses (for example, Snowflake) using dlt (data load tool), enhanced with LLMs, the dlt dashboard, and dlt MCP.\n\n## What you'll learn\n\nYou'll work through the key building blocks of a production-ready ingestion setup, including:\n\n- Extracting data from APIs, files, and databases\n- Normalizing data into consistent schemas\n- Writing data to a data warehouse (e.g. Snowflake)\n- Using LLMs to accelerate dlt pipeline development\n- Validating data and schema changes using the dlt dashboard and dlt MCP\n\nThe session is fully practical and code-driven. By the end of the workshop, you'll understand how to design maintainable, scalable ingestion pipelines and use AI and validation tools to build them faster and with confidence.\n\n## Materials\n\n* [Workshop instructions](dlt/README.md)\n* [dlt Pipeline Overview Notebook (Google Colab)](https://colab.research.google.com/github/anair123/data-engineering-zoomcamp/blob/workshop/dlt_2026/cohorts/2026/workshops/dlt/dlt_Pipeline_Overview.ipynb)\n* [Homework](dlt/dlt_homework.md)\n* [Homework submission form](https://courses.datatalks.club/de-zoomcamp-2026/homework/dlt)\n\n## About the Speaker\n\n**Aashish Nair** is a Data Engineer at dltHub and the creator of the famous _dlt deployment_ course, where he teaches best practices for running dlt pipelines in production.\n"
  },
  {
    "path": "learning-in-public.md",
    "content": "# Learning in public\n\nMost people learn in private: they consume content but don't tell\nanyone about it. There's nothing wrong with that.\n\nBut we want to encourage you to document your progress and\nshare it publicly on social media.\n\nIt helps you get noticed and can lead to:\n\n* Expanding your network: meeting new people and making new friends\n* Being invited to meetups, conferences and podcasts\n* Landing a job or getting clients\n* Many other good things\n\nHere's a more comprehensive read on why you should do it: https://github.com/readme/guides/publishing-your-work\n\n\n## Learning in Public for Zoomcamps\n\nWhen you submit your homework or project, you can also submit\nlearning in public posts:\n\n<img src=\"https://github.com/DataTalksClub/mlops-zoomcamp/raw/main/images/learning-in-public-links.png\" />\n\nYou can watch this video to see what your learning in public posts may look like:\n\n<a href=\"https://www.loom.com/share/710e3297487b409d94df0e8da1c984ce\" target=\"_blank\">\n    <img src=\"https://github.com/DataTalksClub/mlops-zoomcamp/raw/main/images/learning-in-public.png\" height=\"240\" />\n</a>\n\n## Daily Documentation\n\n- **Post Daily Diaries**: Document what you learn each day, including the challenges you faced and the methods you used to overcome them.\n- **Create Quick Videos**: Make short videos showcasing your work and upload them to GitHub.\n\nSend a PR if you want to suggest improvements for this document.\n"
  },
  {
    "path": "projects/README.md",
    "content": "## Course Project\n\n[🎥 Projects how-to (watch it!)](https://www.youtube.com/watch?v=BL0E8xO8OnE)\n\n\n### Objective\n\nThe goal of this project is to apply everything we have learned\nin this course to build an end-to-end data pipeline.\n\n### Problem statement\n\nDevelop a dashboard with two tiles by:\n\n* Selecting a dataset of interest (see [Datasets](#datasets))\n* Creating a pipeline for processing this dataset and putting it to a datalake\n* Creating a pipeline for moving the data from the lake to a data warehouse\n* Transforming the data in the data warehouse: prepare it for the dashboard\n* Building a dashboard to visualize the data\n\n\n## Data Pipeline \n\nThe pipeline could be **stream** or **batch**: this is the first thing you'll need to decide \n\n* **Stream**: If you want to consume data in real-time and put them to data lake\n* **Batch**: If you want to run things periodically (e.g. hourly/daily)\n\n## Technologies \n\nYou don't have to limit yourself to technologies covered in the course. You can use alternatives as well:\n\n* **Cloud**: AWS, GCP, Azure, ...\n* **Infrastructure as code (IaC)**: Terraform, Pulumi, Cloud Formation, ...\n* **Workflow orchestration**: Airflow, Prefect, Luigi, ...\n* **Data Warehouse**: BigQuery, Snowflake, Redshift, ...\n* **Batch processing**: Spark, Flink, AWS Batch, ...\n* **Stream processing**: Kafka, Pulsar, Kinesis, ...\n\nIf you use a tool that wasn't covered in the course, be sure to explain what that tool does.\n\nIf you're not certain about some tools, ask in Slack.\n\n## Dashboard\n\nYou can use any of the tools shown in the course (Looker Studio or Streamlit) or any other BI tool of your choice to build a dashboard. If you do use another tool, please specify and make sure that the dashboard is somehow accessible to your peers. 
\n\nYour dashboard should contain at least two tiles, we suggest you include:\n\n- 1 graph that shows the distribution of some categorical data \n- 1 graph that shows the distribution of the data across a temporal line\n\nEnsure that your graph is easy to understand by adding references and titles.\n \nExample dashboard: ![image](https://user-images.githubusercontent.com/4315804/159771458-b924d0c1-91d5-4a8a-8c34-f36c25c31a3c.png)\n\n\n## Peer reviewing\n\n> [!IMPORTANT]  \n> To evaluate the projects, we'll use peer reviewing. This is a great opportunity for you to learn from each other.\n> * To get points for your project, you need to evaluate 3 projects of your peers\n> * You get 3 extra points for each evaluation\n\n## Evaluation Criteria\n\n* Problem description\n    * 0 points: Problem is not described\n    * 2 points: Problem is described but shortly or not clearly \n    * 4 points: Problem is well described and it's clear what the problem the project solves\n* Cloud\n    * 0 points: Cloud is not used, things run only locally\n    * 2 points: The project is developed in the cloud\n    * 4 points: The project is developed in the cloud and IaC tools are used\n* Data ingestion (choose either batch or stream)\n    * Batch / Workflow orchestration\n        * 0 points: No workflow orchestration\n        * 2 points: Partial workflow orchestration: some steps are orchestrated, some run manually\n        * 4 points: End-to-end pipeline: multiple steps in the DAG, uploading data to data lake\n    * Stream\n        * 0 points: No streaming system (like Kafka, Pulsar, etc)\n        * 2 points: A simple pipeline with one consumer and one producer\n        * 4 points: Using consumer/producers and streaming technologies (like Kafka streaming, Spark streaming, Flink, etc)\n* Data warehouse\n    * 0 points: No DWH is used\n    * 2 points: Tables are created in DWH, but not optimized\n    * 4 points: Tables are partitioned and clustered in a way that makes sense for the 
upstream queries (with explanation)\n* Transformations (dbt, Spark, etc.)\n    * 0 points: No transformations\n    * 2 points: Simple SQL transformation (no dbt or similar tools)\n    * 4 points: Transformations are defined with dbt, Spark or similar technologies\n* Dashboard\n    * 0 points: No dashboard\n    * 2 points: A dashboard with 1 tile\n    * 4 points: A dashboard with 2 tiles\n* Reproducibility\n    * 0 points: No instructions on how to run the code at all\n    * 2 points: Some instructions are there, but they are not complete\n    * 4 points: Instructions are clear, it's easy to run the code, and the code works\n\n\n> [!NOTE]\n> It's highly recommended to create a new repository for your project (not inside an existing repo) with a meaningful title, such as\n> \"Quake Analytics Dashboard\" or \"Bike Data Insights\" and include as many details as possible in the README file. ChatGPT can assist you with this. Doing so will not only make it easier to showcase your project for potential job opportunities but also have it featured on the [Projects Gallery App](#projects-gallery).\n> If you leave the README file empty or with minimal details, there may be point deductions as per the [Evaluation Criteria](#evaluation-criteria).\n\n## Going the extra mile (Optional)\n\n> [!NOTE]\n> The following things are not covered in the course, are entirely optional and they will not be graded.\n\nHowever, implementing these could significantly enhance the quality of your project:\n\n* Add tests\n* Use make\n* Add a CI/CD pipeline\n\nIf you intend to include this project in your portfolio, adding these additional features will definitely help you to stand out from others.\n\n## Cheating and plagiarism\n\nPlagiarism in any form is not allowed. 
Examples of plagiarism:\n\n* Taking somebody else's notebooks and projects (in full or in part) and using them for the capstone project\n* Re-using your own projects (in full or in part) from other courses and bootcamps\n* Re-using your midterm project from ML Zoomcamp in the capstone\n* Re-using your ML Zoomcamp projects from previous iterations of the course\n\nViolating any of these rules will result in 0 points for this project.\n\n## Resources\n\n### Datasets\n\nRefer to the provided [datasets](datasets.md) for possible selection.\n\n### Helpful Links\n\n* [Unit Tests + CI for Airflow](https://www.astronomer.io/events/recaps/testing-airflow-to-bulletproof-your-code/)\n* [CI/CD for Airflow (with GitLab & GCP state file)](https://engineering.ripple.com/building-ci-cd-with-airflow-gitlab-and-terraform-in-gcp)\n* [CI/CD for Airflow (with GitHub and S3 state file)](https://programmaticponderings.com/2021/12/14/devops-for-dataops-building-a-ci-cd-pipeline-for-apache-airflow-dags/)\n* [CD for Terraform](https://medium.com/towards-data-science/git-actions-terraform-for-data-engineers-scientists-gcp-aws-azure-448dc7c60fcc)\n* [Spark + Airflow](https://medium.com/doubtnut/github-actions-airflow-for-automating-your-spark-pipeline-c9dff32686b)\n\n\n### Projects Gallery\n\nExplore a collection of projects completed by members of our community. The projects cover a wide range of topics and utilize different tools and techniques. Feel free to delve into any project and see how others have tackled real-world problems with data, structured their code, and presented their findings. It's a great resource to learn and get ideas for your own projects.\n\n[![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://datatalksclub-projects.streamlit.app/)\n\n### DE Zoomcamp 2023\n\n* [2023 Projects](../cohorts/2023/project.md)\n\n### DE Zoomcamp 2022\n\n* [2022 Projects](../cohorts/2022/project.md)\n"
  },
  {
    "path": "projects/datasets.md",
    "content": "## Datasets\n\nHere are some datasets that you could use for the project:\n\n\n* [Kaggle](https://www.kaggle.com/datasets)\n* [AWS datasets](https://registry.opendata.aws/)\n* [UK government open data](https://data.gov.uk/)\n* [Github archive](https://www.gharchive.org)\n* [Awesome public datasets](https://github.com/awesomedata/awesome-public-datasets)\n* [Million songs dataset](http://millionsongdataset.com)\n* [Some random datasets](https://components.one/datasets/)\n* [COVID Datasets](https://www.reddit.com/r/datasets/comments/n3ph2d/coronavirus_datsets/)\n* [Datasets from Azure](https://docs.microsoft.com/en-us/azure/azure-sql/public-data-sets)\n* [Datasets from BigQuery](https://cloud.google.com/bigquery/public-data/)\n* [Dataset search engine from Google](https://datasetsearch.research.google.com/)\n* [Public datasets offered by different GCP services](https://cloud.google.com/solutions/datasets)\n* [European statistics datasets](https://ec.europa.eu/eurostat/data/database)\n* [Datasets for streaming](https://github.com/ColinEberhardt/awesome-public-streaming-datasets)\n* [Dataset for Santander bicycle rentals in London](https://cycling.data.tfl.gov.uk/)\n* [Common crawl data](https://commoncrawl.org/) (copy of the internet)\n* [NASA's EarthData](https://search.earthdata.nasa.gov/search) (May require introductory geospatial analysis)\n* Collection Of Data Repositories\n  * [part 1](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-1.html) (from agriculture and finance to government)\n  * [part 2](https://www.kdnuggets.com/2022/04/complete-collection-data-repositories-part-2.html) (from healthcare to transportation)\n* [Data For Good by Meta](https://dataforgood.facebook.com/dfg/tools)\n\nPRs with more datasets are welcome!\n\nIt's not mandatory that you use a dataset from this list. You can use any dataset you want.\n"
  },
  {
    "path": "workshop-best-practices.md",
    "content": "# Workshop Best Practices\n\nPreferences and patterns learned from building the PyFlink streaming workshop.\n\n## Structure and Pacing\n\n- Introduce services one at a time, not all at once. Start with one container\n  (e.g., Redpanda), explain it, use it. Then add the next (PostgreSQL), etc.\n- Start with the simplest version that works (plain Python consumer), then\n  motivate the more complex tool (Flink) by showing what's missing.\n- Use `docker compose up <service> -d` to start services selectively during\n  the gradual buildup. `docker compose up --build -d` only when everything\n  is ready.\n\n## Data\n\n- Use real datasets, not fake test data. NYC taxi data\n  (`yellow_tripdata_YYYY-MM.parquet`) is a good go-to.\n- Limit to manageable sizes (e.g., first 1000 rows) for workshop speed.\n\n## Project Setup\n\n- Assume starting from scratch: `uv init -p 3.12` + `uv add <package>`.\n- Add dependencies gradually as they're needed in the narrative\n  (e.g., `uv add kafka-python pandas pyarrow` first, `uv add psycopg2-binary`\n  later when PostgreSQL is introduced).\n- Always note \"if you cloned the repo, run `uv sync` instead\" as a blockquote.\n\n## Code Delivery\n\n- Break large code blocks into small, focused blocks. Each block should do\n  one thing. Don't dump a full script in one block.\n- Pattern for code blocks: short intro line (what it does), then the code,\n  then the explanation of how it works below. Don't put detailed\n  explanations before the code - let the reader see the code first.\n- Keep imports local to each block - don't introduce all imports upfront.\n  Each block should only import what it uses.\n- Introduce functions and utilities where they're first used, not earlier.\n  For example, show `dataclasses.asdict()` in the block that calls it, not\n  in the block that defines the dataclass.\n- When introducing a function, show a test with sample data before using it\n  in the real code. 
For example, create a test binary string to verify a\n  deserializer, then pass it to the consumer.\n- Prefer named functions over inline lambdas. A named function is reusable,\n  testable, and easier to explain step by step. For example,\n  `value_deserializer=ride_deserializer` instead of\n  `value_deserializer=lambda m: json.loads(m.decode('utf-8'))`.\n- Extract repetitive logic into named functions. For example, row-to-object\n  conversion that appears in multiple places should be a function like\n  `ride_from_row(row)`.\n- Split one-liner functions into multiple lines. Each step (decode, parse,\n  construct) on its own line is easier to follow and explain.\n- Show the simple approach first, then improve it. For example, show a\n  generic `json_serializer` with manual `dataclasses.asdict()` calls, then\n  introduce a specialized `ride_serializer` that handles the conversion\n  internally. Let the student feel the friction before showing the fix.\n- Extract shared code (dataclasses, serializers, deserializers, converters)\n  into shared modules (e.g., `models.py`) so multiple scripts can import\n  from one place.\n- Reference the complete script at the end (e.g., \"> The complete script is\n  in `src/producers/producer.py`.\").\n- For infrastructure files that are long or complex (Dockerfile, YAML configs),\n  link to the file on GitHub and provide a short summary list of what it does.\n  Use `wget` to download from the GitHub repo instead of asking students to\n  type them.\n- Mention that students can run Python code in Jupyter notebooks\n  (`uv add jupyter`, `uv run jupyter lab`) as an alternative to .py scripts.\n  The small-block style maps naturally to notebook cells.\n- Flink jobs must remain as .py files (they're submitted to the cluster via\n  `docker compose exec`). Add a note explaining this distinction.\n\n## Formatting\n\n- No bold formatting (`**text**`) in README files. Use plain text.\n- No em dashes. 
Use hyphens with spaces (` - `) instead.\n- Use `python` not `python3`.\n- Use `docker compose` not `docker-compose`.\n- Use `uvx pgcli` not just `pgcli`.\n- Use `uv run python` not `python` for running scripts.\n\n## Naming\n\n- Use meaningful names that reflect purpose, not generic placeholders.\n  For example, `group_id='rides-console'` or `group_id='rides-to-postgres'`,\n  not `group_id='test-consumer-group'`.\n\n## Explanations\n\n- For complex configurations (like Redpanda's docker-compose command), explain\n  every parameter in a table or list.\n- Explain the \"why\" not just the \"what\" (why two Kafka addresses? why\n  checkpointing every 10 seconds? why watermarks?).\n- Use tables for parameter explanations and comparisons.\n- Include sample output for every command students will run.\n- Use `>` blockquotes for tips, notes about the repo, and common mistakes from\n  original workshops/streams.\n- For complex concepts (watermarks, task slots, parallelism), pull the\n  explanation out of bullet lists into its own multi-paragraph section. State\n  the value or syntax in the bullet, then explain the concept below in\n  separate paragraphs for easier reading.\n- Use lists for multi-point summaries instead of packing everything into one\n  long sentence.\n- When showing a development shortcut (like mounting local files into Docker),\n  add a note explaining how it works in production. Students benefit from\n  understanding real-world deployment patterns alongside the workshop setup.\n\n## Code Organization\n\n- Define the source (where you read from) before the sink (where you write\n  to) when presenting code blocks. 
Set up the consumer/reader first, then\n  the database connection or output destination.\n\n## Docker Compose\n\n- Don't use `container_name` or `hostname` - Docker Compose handles naming\n  automatically.\n- Don't use `extra_hosts` unless specifically needed.\n- Service names are automatically resolvable as hostnames within the Docker\n  network.\n- Prefer short service names (e.g., `redpanda` not `redpanda-1`).\n- Keep `restart: on-failure` only for services that need it (like databases).\n\n## Dependencies and Versions\n\n- Always use the latest stable versions of images and libraries.\n- Pin exact versions for Flink and its connectors (they must match).\n- Use `uv` for everything Python-related (package management, running scripts,\n  even installing Python itself inside Docker).\n- Prefer `COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/` in Dockerfiles\n  instead of `apt-get install`.\n\n## Workshop Header\n\n- Credit the original stream/video at the top with a link.\n- If the new video is not yet available, put \"TBA\" with a sign-up link\n  (e.g., Luma).\n- Include a brief description of what we'll build and prerequisites.\n\n## Workshop Flow Template\n\n1. Introduce the first component (message broker, database, etc.)\n2. Set up with Docker Compose (explain parameters)\n3. Create a simple producer/writer\n4. Create a simple consumer/reader\n5. Add a database, save data\n6. Show limitations of the simple approach\n7. Introduce the framework (Flink, Spark, etc.)\n8. Reproduce the simple case with the framework\n9. Do something the simple approach can't (aggregation, windowing)\n10. Explain advanced concepts (window types, offsets, etc.)\n11. Cleanup\n12. Q&A - questions and answers from the original stream. Include production\n    deployment topics here rather than as standalone sections.\n"
  }
]