Repository: ilya-galperin/SF-EvictionTracker
Branch: master
Commit: e3f0a03131eb
Files: 10
Total size: 60.3 KB

Directory structure:
gitextract_jbid5ch1/
├── README.md
├── airflow_installation.txt
└── dags/
    ├── full_load_dag.py
    ├── incremental_load_dag.py
    ├── operators/
    │   ├── s3_to_postgres_operator.py
    │   └── soda_to_s3_operator.py
    └── sql/
        ├── full_load.sql
        ├── incremental_load.sql
        ├── init_db_schema.sql
        └── trunc_target_tables.sql

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# SF-EvictionTracker

Tracking eviction trends in San Francisco across filing reasons, districts, neighborhoods, and demographics in the months following COVID-19. Data warehouse infrastructure is housed in the AWS ecosystem and uses Apache Airflow for orchestration with public-facing dashboards created using Metabase.

Questions? Feel free to reach me at ilya.glprn@gmail.com.

Public Dashboard Link: http://sf-evictiontracker-metabase.us-east-1.elasticbeanstalk.com/public/dashboard/f637e470-8ea9-4b03-af80-53988e5b6a9b

ARCHITECTURE:

![Architecture](https://i.imgur.com/s2gLBZt.png)

Data is sourced from San Francisco Open Data's API and CSVs containing San Francisco district and neighborhood aggregate census results. Airflow orchestrates its movement to an S3 bucket and into a data warehouse hosted in RDS. SQL scripts are then run to transform the data from its raw form through a staging schema and into production target tables. The presentation layer is created using Metabase, an open-source data visualization tool, and deployed using Elastic Beanstalk.

DATA MODEL:

Dimension Tables:
`dim_district`
`dim_neighborhood`
`dim_location`
`dim_reason`
`dim_date`
`br_reason_group`

Fact Tables:
`fact_evictions`

The data model is implemented using a star schema with a bridge table to accommodate any new permutations for the reason dimension. More information on bridge tables can be found here: https://www.kimballgroup.com/2012/02/design-tip-142-building-bridges/

![Model](https://i.imgur.com/uInBlzR.png)
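To make the bridge table concrete, here is a minimal Python sketch of the idea that `full_load.sql` implements in SQL: each distinct combination of flagged reasons gets one group key, the bridge maps that group key to the individual reason keys, and each fact row stores only the group key. The reason subset, the surrogate key values, and the sample evictions are made up for illustration, and group key order in the warehouse may differ from Python's string sort.

```python
# Illustrative only: a subset of the reason codes in dim_reason, with
# hypothetical surrogate keys (the warehouse assigns its own).
REASON_KEYS = {"Unknown": -1, "non_payment": 1, "breach": 2, "nuisance": 3}
REASON_ORDER = ["non_payment", "breach", "nuisance"]


def concat_reason(row):
    """Join the reason flags set on one eviction into a '|'-separated string."""
    active = [code for code in REASON_ORDER if row.get(code)]
    return "|".join(active) if active else "Unknown"


def build_bridge(rows):
    """Return (group_keys, bridge_rows) for the distinct reason combinations."""
    combos = sorted({concat_reason(r) for r in rows})
    group_keys = {combo: i for i, combo in enumerate(combos, start=1)}
    bridge = [
        (group_keys[combo], REASON_KEYS[code])
        for combo in combos
        for code in combo.split("|")
    ]
    return group_keys, bridge


# Made-up sample records; real rows come from raw.soda_evictions.
evictions = [
    {"eviction_id": "A1", "non_payment": True, "nuisance": True},
    {"eviction_id": "A2", "breach": True},
    {"eviction_id": "A3"},  # no reason flagged
    {"eviction_id": "A4", "non_payment": True, "nuisance": True},
]

group_keys, bridge = build_bridge(evictions)
# Each fact row carries a single reason_group_key, however many reasons apply.
fact_rows = [(e["eviction_id"], group_keys[concat_reason(e)]) for e in evictions]
```

A new combination of reasons only adds rows to the bridge; neither the fact table nor `dim_reason` needs a schema change.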

ETL FLOW:

General Overview -
- Evictions data is collected from the SODA API and moved into an S3 bucket
- Neighborhood/district census data is stored as a CSV in S3
- Once the API load to S3 is complete, data is loaded into a "raw" schema in RDS and then moves through a staging schema for processing
- ETL job execution is complete once data is moved from the staging schema into the final production tables

DAGs and Custom Airflow Operators -

![Ops](https://i.imgur.com/WTOUiGU.jpg)
![Dag](https://i.imgur.com/yJb3DKT.jpg)

There are 2 DAGs (Directed Acyclic Graphs) used for this project: a full load, which should be executed on initial setup, and an incremental load, which is scheduled to run daily and pull new data from the Socrata Open Data API.

The incremental load DAG uses XCom to pass the number of records returned by the API call (pushed under the key `obj_len`) to a ShortCircuitOperator, which skips downstream tasks if the API call produces no results.

The DAGs use two custom operators. They have been purpose-built for this project but are easily expandable to be used in other data pipelines.

1. soda_to_s3_operator: Queries the Socrata Open Data API using a SoQL string and uploads the results to an S3 bucket. Includes an optional check of the source data size that aborts the ETL if it exceeds a user-defined limit.
2. s3_to_postgres_operator: Collects data from a file hosted on AWS S3 and loads it into a Postgres table. Current version supports JSON and CSV source data types.
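The paging, size-check, and skip logic described above can be sketched in plain Python, independent of Airflow. This is a simplified illustration, not the operators' code: `fetch_page`, `fake_fetch_page`, `check_size`, and `should_continue` are hypothetical names, the fake data stands in for the API, and the size check here measures serialized JSON bytes rather than the in-memory object size that `soda_to_s3_operator` computes.

```python
import json

PAGE_SIZE = 10000  # same LIMIT the operator uses in its default SoQL query


def fetch_all(fetch_page, page_size=PAGE_SIZE):
    """Call fetch_page(offset, limit) repeatedly until an empty page comes back."""
    combined, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        combined.extend(page)
        offset += page_size  # move the window forward by one page
    return combined


def check_size(records, max_bytes):
    """Raise if the serialized payload exceeds max_bytes; otherwise return its size."""
    size = len(json.dumps(records).encode("utf-8"))
    if size > max_bytes:
        raise ValueError(f"payload is {size} bytes, limit is {max_bytes}")
    return size


def should_continue(record_count):
    """Short-circuit predicate: run downstream tasks only if rows arrived."""
    return record_count is not None and record_count > 0


# Hypothetical stand-in for the API: 25 records served 10 at a time.
FAKE_ROWS = [{"eviction_id": f"M{i}"} for i in range(25)]


def fake_fetch_page(offset, limit):
    return FAKE_ROWS[offset:offset + limit]


records = fetch_all(fake_fetch_page, page_size=10)
payload_size = check_size(records, max_bytes=1_000_000)
run_downstream = should_continue(len(records))
```

In the DAG, the equivalent of `should_continue` is the callable handed to the ShortCircuitOperator, with the record count arriving through XCom instead of a function argument.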

INFRASTRUCTURE:

This project is hosted in the AWS ecosystem and uses the following resources:

![EC2](https://i.imgur.com/jB2X1jI.png)

EC2 -
- t2.medium - dedicated resource for Airflow, managed by AWS Instance Scheduler to complete the daily DAG run and shut off after execution
- t2.small - used to host Metabase, always online

RDS -
- t2.small - hosts application database for Metabase and the data warehouse

Elastic Beanstalk is used to deploy the Metabase web application.

DASHBOARD:

The dashboard is publicly accessible here: http://sf-evictiontracker-metabase.us-east-1.elasticbeanstalk.com/public/dashboard/f637e470-8ea9-4b03-af80-53988e5b6a9b

Some example screengrabs below!

![Dash1](https://i.imgur.com/MZ325PT.jpg)
![Dash2](https://i.imgur.com/OeyOVp0.jpg)
![Dash3](https://i.imgur.com/v6Nwz9l.jpg)

================================================
FILE: airflow_installation.txt
================================================
STEP 1 - Launch EC2 Instance:
- t3.medium
- 12gb storage
- launch-wizard-3 security group to open TCP Port 8080
- associate elastic IP

STEP 2 - Install Postgres Server on EC2:
run:
sudo apt-get update
sudo apt-get install python-psycopg2
sudo apt-get install postgresql postgresql-contrib

Step 3 - Create OS User airflow
run:
sudo adduser airflow
sudo usermod -aG sudo airflow
su - airflow
Note: From here on, make sure you are logged in as airflow user.

Step 4 - Create Postgres Metadatabase and User Access
run:
sudo -u postgres psql
in postgres prompt:
CREATE USER airflow PASSWORD 'password';
CREATE DATABASE airflow;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO airflow;
\q

Step 5 - Change Postgres Connection Config
run:
sudo nano /etc/postgresql/10/main/pg_hba.conf
Change this line -
# IPv4 local connections:
host all all 127.0.0.1/32 md5
To this line -
# IPv4 local connections:
host all all 0.0.0.0/0 trust
run:
sudo nano /etc/postgresql/10/main/postgresql.conf
Change this line -
#listen_addresses = 'localhost' # what IP address(es) to listen on
To this line -
listen_addresses = '*' # what IP address(es) to listen on
restart postgres server:
sudo service postgresql restart

Step 6 - Install Airflow
run:
su - airflow
sudo apt-get install python3-pip
sudo python3 -m pip install apache-airflow[postgres,s3,aws]
run:
airflow initdb

Step 7 - Connect Airflow to Postgres
run:
nano /home/airflow/airflow/airflow.cfg
Change lines -
sql_alchemy_conn = postgresql+psycopg2://airflow:password@localhost:5432/airflow
executor = LocalExecutor
load_examples = False
run:
airflow initdb

Step 8 - Add DAGs:
mkdir /home/airflow/airflow/dags/
cd /home/airflow/airflow/dags/
touch tutorial.py
nano tutorial.py

Step 9 - Set up Airflow Webserver and Scheduler to start automatically
We are almost there. The final thing we need to do is to ensure airflow starts up when your ec2 instance starts.
sudo nano /etc/systemd/system/airflow-webserver.service
Paste the following into the file created above

[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service
Wants=postgresql.service
[Service]
EnvironmentFile=/etc/environment
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/local/bin/airflow webserver
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target

Next we will create the following file to enable scheduler service
sudo nano /etc/systemd/system/airflow-scheduler.service
Paste the following

[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service
Wants=postgresql.service
[Service]
EnvironmentFile=/etc/environment
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/local/bin/airflow scheduler
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target

Next enable these services and check their status
sudo systemctl enable airflow-webserver.service
sudo systemctl enable airflow-scheduler.service
sudo systemctl start airflow-scheduler
sudo systemctl start airflow-webserver

================================================
FILE: dags/full_load_dag.py
================================================
# echo "" > /home/airflow/airflow/dags/full_load_dag.py
# nano /home/airflow/airflow/dags/full_load_dag.py

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from operators.soda_to_s3_operator import SodaToS3Operator
from operators.s3_to_postgres_operator import S3ToPostgresOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

soda_headers = {
'keyId':'############', 'keySecret':'#################', 'Accept':'application/json' } default_args = { 'owner': 'airflow', 'depends_on_past': False, 'start_date': days_ago(2), 'email': ['airflow@example.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1, 'retry_delay': timedelta(seconds=30) } with DAG('eviction-tracker-full_load', default_args=default_args, description='Executes full load from SODA API to Production DW.', max_active_runs=1, schedule_interval=None) as dag: op1 = SodaToS3Operator( task_id='get_evictions_data', http_conn_id='API_Evictions', headers=soda_headers, s3_conn_id='S3_Evictions', s3_bucket='sf-evictionmeter', s3_directory='soda_jsons', size_check=True, max_bytes=500000000, dag=dag ) op2 = PostgresOperator( task_id='initialize_target_db', postgres_conn_id='RDS_Evictions', sql='sql/init_db_schema.sql', dag=dag ) op3 = S3ToPostgresOperator( task_id='load_evictions_data', s3_conn_id='S3_Evictions', s3_bucket='sf-evictionmeter', s3_prefix='soda_jsons/soda_evictions_import', source_data_type='json', postgres_conn_id='RDS_Evictions', schema='raw', table='soda_evictions', get_latest=True, dag=dag ) op4 = S3ToPostgresOperator( task_id='load_neighborhood_data', s3_conn_id='S3_Evictions', s3_bucket='sf-evictionmeter', s3_prefix='census_csv/sf_by_neighborhood', source_data_type='csv', header=True, postgres_conn_id='RDS_Evictions', schema='raw', table='neighborhood_data', get_latest=True, dag=dag ) op5 = S3ToPostgresOperator( task_id='load_district_data', s3_conn_id='S3_Evictions', s3_bucket='sf-evictionmeter', s3_prefix='census_csv/sf_by_district', source_data_type='csv', header=True, postgres_conn_id='RDS_Evictions', schema='raw', table='district_data', get_latest=True, dag=dag ) op6 = PostgresOperator( task_id='execute_full_load', postgres_conn_id='RDS_Evictions', sql='sql/full_load.sql', dag=dag ) op1 >> op2 >> (op3, op4, op5) >> op6 ================================================ FILE: dags/incremental_load_dag.py 
================================================ # echo "" > /home/airflow/airflow/dags/incremental_load_dag.py # nano /home/airflow/airflow/dags/incremental_load_dag.py from airflow import DAG from airflow.operators.postgres_operator import PostgresOperator from airflow.operators.python_operator import ShortCircuitOperator from operators.soda_to_s3_operator import SodaToS3Operator from operators.s3_to_postgres_operator import S3ToPostgresOperator from airflow.utils.dates import days_ago from datetime import timedelta soda_headers = { 'keyId':'############', 'keySecret':'#################', 'Accept':'application/json' } default_args = { 'owner': 'airflow', 'depends_on_past': False, 'start_date': days_ago(2), 'email': ['airflow@example.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1, 'retry_delay': timedelta(seconds=30) } def get_size(**context): val = context['ti'].xcom_pull(key='obj_len') return True if val > 0 else False with DAG('eviction-tracker-incremental_load', default_args=default_args, description='Executes incremental load from SODA API & S3-hosted csv''s into Production DW.', max_active_runs=1, schedule_interval=None) as dag: op1 = SodaToS3Operator( task_id='get_evictions_data', http_conn_id='API_Evictions', headers=soda_headers, days_ago=31, s3_conn_id='S3_Evictions', s3_bucket='sf-evictionmeter', s3_directory='soda_jsons', size_check=True, max_bytes=500000000, dag=dag ) op2 = ShortCircuitOperator( task_id='check_get_results', python_callable=get_size, provide_context=True, dag=dag ) op3 = PostgresOperator( task_id='truncate_target_tables', postgres_conn_id='RDS_Evictions', sql='sql/trunc_target_tables.sql', dag=dag ) op4 = S3ToPostgresOperator( task_id='load_evictions_data', s3_conn_id='S3_Evictions', s3_bucket='sf-evictionmeter', s3_prefix='soda_jsons/soda_evictions_import', source_data_type='json', postgres_conn_id='RDS_Evictions', schema='raw', table='soda_evictions', get_latest=True, dag=dag ) op5 = S3ToPostgresOperator( 
task_id='load_neighborhood_data', s3_conn_id='S3_Evictions', s3_bucket='sf-evictionmeter', s3_prefix='census_csv/sf_by_neighborhood', source_data_type='csv', header=True, postgres_conn_id='RDS_Evictions', schema='raw', table='neighborhood_data', get_latest=True, dag=dag ) op6 = S3ToPostgresOperator( task_id='load_district_data', s3_conn_id='S3_Evictions', s3_bucket='sf-evictionmeter', s3_prefix='census_csv/sf_by_district', source_data_type='csv', header=True, postgres_conn_id='RDS_Evictions', schema='raw', table='district_data', get_latest=True, dag=dag ) op7 = PostgresOperator( task_id='execute_incremental_load', postgres_conn_id='RDS_Evictions', sql='sql/incremental_load.sql', dag=dag ) op1 >> op2 >> op3 >> (op4, op5, op6) >> op7 ================================================ FILE: dags/operators/s3_to_postgres_operator.py ================================================ # echo "" > /home/airflow/airflow/dags/operators/s3_to_postgres_operator.py # nano /home/airflow/airflow/dags/operators/s3_to_postgres_operator.py from airflow.models.baseoperator import BaseOperator from airflow.utils.decorators import apply_defaults from airflow.hooks.S3_hook import S3Hook from airflow.hooks.postgres_hook import PostgresHook import json import io from contextlib import closing class S3ToPostgresOperator(BaseOperator): """ Collects data from a file hosted on AWS S3 and loads it into a Postgres table. Current version supports JSON and CSV sources but requires pre-defined data model. 
    :param s3_conn_id: S3 Connection ID
    :param s3_bucket: S3 Bucket Destination
    :param s3_prefix: S3 File Prefix
    :param source_data_type: S3 Source File data type
    :param header: Toggles ignore header for CSV source type
    :param postgres_conn_id: Postgres Connection ID
    :param schema: Postgres Target Schema
    :param table: Postgres Target Table
    :param get_latest: if True, pulls from last modified file in S3 path
    """

    @apply_defaults
    def __init__(self,
                 s3_conn_id=None,
                 s3_bucket=None,
                 s3_prefix='',
                 source_data_type='',
                 postgres_conn_id='postgres_default',
                 header=False,
                 schema='public',
                 table='raw_load',
                 get_latest=False,
                 *args,
                 **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.s3_conn_id = s3_conn_id
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix
        self.source_data_type = source_data_type
        self.postgres_conn_id = postgres_conn_id
        self.header = header
        self.schema = schema
        self.table = table
        self.get_latest = get_latest

    def execute(self, context):
        """
        Executes the operator.
        """
        s3_hook = S3Hook(self.s3_conn_id)
        s3_session = s3_hook.get_session()
        s3_client = s3_session.client('s3')

        if self.get_latest:
            # Pick the most recently modified object under the prefix.
            objects = s3_client.list_objects_v2(Bucket=self.s3_bucket, Prefix=self.s3_prefix)['Contents']
            latest = max(objects, key=lambda x: x['LastModified'])
            key = latest['Key']
        else:
            # Assumption: without get_latest, s3_prefix is treated as the full object key.
            # The original code left s3_obj undefined on this path.
            key = self.s3_prefix
        s3_obj = s3_client.get_object(Bucket=self.s3_bucket, Key=key)
        file_content = s3_obj['Body'].read().decode('utf-8')

        pg_hook = PostgresHook(self.postgres_conn_id)

        if self.source_data_type == 'json':
            print('inserting json object...')
            json_content = json.loads(file_content)
            schema = self.schema
            if isinstance(self.schema, tuple):
                schema = self.schema[0]
            table = self.table
            if isinstance(self.table, tuple):
                table = self.table[0]
            target_fields = ['raw_id','created_at','updated_at','eviction_id','address','city','state',
                'zip','file_date','non_payment','breach','nuisance','illegal_use','failure_to_sign_renewal',
                'access_denial','unapproved_subtenant','owner_move_in','demolition','capital_improvement',
'substantial_rehab','ellis_act_withdrawal','condo_conversion','roommate_same_unit', 'other_cause','late_payments','lead_remediation','development','good_samaritan_ends', 'constraints_date','supervisor_district','neighborhood'] target_fields = ','.join(target_fields) with closing(pg_hook.get_conn()) as conn: with closing(conn.cursor()) as cur: cur.executemany( f"""INSERT INTO {schema}.{table} ({target_fields}) VALUES( %(:id)s, %(:created_at)s, %(:updated_at)s, %(eviction_id)s, %(address)s, %(city)s, %(state)s, %(zip)s, %(file_date)s, %(non_payment)s, %(breach)s, %(nuisance)s, %(illegal_use)s, %(failure_to_sign_renewal)s, %(access_denial)s, %(unapproved_subtenant)s, %(owner_move_in)s, %(demolition)s, %(capital_improvement)s, %(substantial_rehab)s, %(ellis_act_withdrawal)s, %(condo_conversion)s, %(roommate_same_unit)s, %(other_cause)s, %(late_payments)s, %(lead_remediation)s, %(development)s, %(good_samaritan_ends)s, %(constraints_date)s, %(supervisor_district)s, %(neighborhood)s ); """,({ ':id': line[':id'], ':created_at': line[':created_at'], ':updated_at': line[':updated_at'], 'eviction_id': line['eviction_id'], 'address': line.get('address', None), 'city': line.get('city', None), 'state': line.get('state', None),'zip': line.get('zip', None),'file_date': line.get('file_date', None), 'non_payment': line.get('non_payment', None),'breach': line.get('breach', None), 'nuisance': line.get('nuisance', None),'illegal_use': line.get('illegal_use', None), 'failure_to_sign_renewal': line.get('failure_to_sign_renewal', None), 'access_denial': line.get('access_denial', None),'unapproved_subtenant': line.get('unapproved_subtenant', None), 'owner_move_in': line.get('owner_move_in', None),'demolition': line.get('demolition', None), 'capital_improvement': line.get('capital_improvement', None), 'substantial_rehab': line.get('substantial_rehab', None),'ellis_act_withdrawal': line.get('ellis_act_withdrawal', None), 'condo_conversion': line.get('condo_conversion', 
None),'roommate_same_unit': line.get('roommate_same_unit', None), 'other_cause': line.get('other_cause', None),'late_payments': line.get('late_payments', None), 'lead_remediation': line.get('lead_remediation', None),'development': line.get('development', None), 'good_samaritan_ends': line.get('good_samaritan_ends', None),'constraints_date': line.get('constraints_date', None), 'supervisor_district': line.get('supervisor_district', None),'neighborhood': line.get('neighborhood', None) } for line in json_content)) conn.commit() if self.source_data_type == 'csv': print('inserting csv...') file = io.StringIO(file_content) sql = "COPY %s FROM STDIN DELIMITER ','" if self.header == True: sql = "COPY %s FROM STDIN DELIMITER ',' CSV HEADER" schema = self.schema if isinstance(self.schema, tuple): schema = self.schema[0] table = self.table if isinstance(self.table, tuple): table = self.table[0] table = f'{schema}.{table}' with closing(pg_hook.get_conn()) as conn: with closing(conn.cursor()) as cur: cur.copy_expert(sql=sql % table, file=file) conn.commit() print('inserting complete...') ================================================ FILE: dags/operators/soda_to_s3_operator.py ================================================ # echo "" > /home/airflow/airflow/dags/operators/soda_to_s3_operator.py # nano /home/airflow/airflow/dags/operators/soda_to_s3_operator.py from airflow.models.baseoperator import BaseOperator from airflow.utils.decorators import apply_defaults from airflow.hooks.http_hook import HttpHook from airflow.hooks.S3_hook import S3Hook from datetime import datetime, timedelta import json import sys class SizeExceededError(Exception): """Raised when max file size is exceeded""" def __init__(self): self.message = 'Max file size exceeded' def __str__(self): return f'SizeExceededError, {self.message}' class SodaToS3Operator(BaseOperator): """ Queries the Socrata Open Data API using a SoQL string and uploads the results to an S3 bucket. 
:param endpoint: Optional API connection endpoint :param data: Custom Socrata SoQL string used to query API, overrides default get request :param days_ago: Restricts get request to updated/created records from specified date onward :param headers: Dictionary containing optional API connection keys (keyId, keySecret, Accept) :param s3_conn_id: S3 Connection ID :param s3_bucket: S3 Bucket Destination :param s3_directory: S3 Directory Destination :param method: Request type for API :param http_conn_id: SODA API Connection ID :param size_check: Boolean indicating whether to run a size check prior to upload to S3 :param max_bytes: Maximum number of bytes to allow for a single S3 upload """ @apply_defaults def __init__(self, endpoint=None, data=None, days_ago=None, headers=None, s3_conn_id=None, s3_bucket=None, s3_directory='', method='GET', http_conn_id='http_default', size_check=False, max_bytes=5000000000, *args, **kwargs) -> None: super().__init__(*args, **kwargs) self.endpoint = endpoint self.data = data self.days_ago = days_ago self.s3_conn_id = s3_conn_id self.s3_bucket = s3_bucket self.s3_directory = s3_directory self.headers = headers self.method = method self.http_conn_id = http_conn_id self.size_check = size_check self.max_bytes = max_bytes def get_size(self, obj, seen=None): """ Recursively finds size of object. """ size = sys.getsizeof(obj) if seen is None: seen = set() obj_id = id(obj) if obj_id in seen: return 0 seen.add(obj_id) if isinstance(obj, dict): size += sum([self.get_size(v, seen) for v in obj.values()]) size += sum([self.get_size(k, seen) for k in obj.keys()]) elif hasattr(obj, '__dict__'): size += self.get_size(obj.__dict__, seen) elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)): size += sum([self.get_size(i, seen) for i in obj]) return size def parse_metadata(self, header): """ Parses metadata from API response. 
""" try: metadata = { 'api-call-date': header['Date'], 'content-type': header['Content-Type'], 'source-last-modified': header['X-SODA2-Truth-Last-Modified'], 'fields': header['X-SODA2-Fields'], 'types': header['X-SODA2-Types'] } except KeyError: metadata = {'KeyError': 'Metadata missing from header, see error log.'} return metadata def execute(self, context): """ Executes the operator, including running a max filesize check if enabled. The SODA API maxes out the # of returned results so we use paging to query the endpoint multiple times and continuously move the offset. Metadata is parsed and saved separately in a /logs/ subfolder along with the JSON results from API call. """ soda = HttpHook(method=self.method, http_conn_id=self.http_conn_id) if self.data: soql_filter = self.data elif self.days_ago: current_dt = datetime.now() target_dt = current_dt - timedelta(self.days_ago) format_dt = target_dt.strftime("%Y-%m-%d") soql_filter = f"""$query=SELECT:*,* WHERE :created_at > '{format_dt}' OR :updated_at > '{format_dt}' ORDER BY :id LIMIT 10000""" else: soql_filter = """$query=SELECT:*,* ORDER BY :id LIMIT 10000""" print('getting... ' + soql_filter) #soql_filter = f"""$query=SELECT:*,* WHERE :created_at < '2020-04-01' ORDER BY :id LIMIT 10000""" offset, counter = 0, 1 combined = [] while True: soql_filter_offset = soql_filter + f' OFFSET {offset}' response = soda.run(endpoint=self.endpoint, data=soql_filter_offset, headers=self.headers) if response.status_code != 200: break captured = response.json() if len(captured) == 0: break combined.extend(captured) offset = 10000 * counter counter += 1 if self.size_check == True: print('actual size... ' + str(self.get_size(combined))) print('max size... 
' + str(self.max_bytes)) if self.get_size(combined) > self.max_bytes: raise SizeExceededError dest_s3 = S3Hook(self.s3_conn_id) body_obj = 'soda_evictions_import_' + datetime.now().strftime("%Y-%m-%dT%H%M%S") + '.json' metadata = self.parse_metadata(response.headers) meta_obj = 'logs/soda_evictions_import_log_' + datetime.now().strftime("%Y-%m-%dT%H%M%S") dest_s3.load_string(json.dumps(combined), key=self.s3_directory+'/'+body_obj, bucket_name=self.s3_bucket) dest_s3.load_string(json.dumps(metadata), key=self.s3_directory+'/'+meta_obj, bucket_name=self.s3_bucket) # XCom used to skip downstream tasks if body object size is 0 self.xcom_push(context=context, key='obj_len', value=len(combined)) ================================================ FILE: dags/sql/full_load.sql ================================================ -- echo "" > /home/airflow/airflow/dags/sql/full_load.sql -- nano /home/airflow/airflow/dags/sql/full_load.sql -- Populate District Dimension INSERT INTO staging.dim_district (district_key, district) SELECT -1, 'Unknown'; INSERT INTO staging.dim_district (district, population, households, percent_asian, percent_black, percent_white, percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, median_family_income, per_capita_income, percent_in_poverty) SELECT district, population::int, households::int, perc_asian::numeric as percent_asian, perc_black::numeric as percent_black, perc_white::numeric as percent_white, perc_nat_am::numeric as percent_native_am, perc_nat_pac::numeric as percent_pacific_isle, perc_other::numeric as percent_other_race, perc_latin::numeric as percent_latin, median_age::numeric, total_units::int, perc_owner_occupied::numeric as percent_owner_occupied, perc_renter_occupied::numeric as percent_renter_occupied, median_rent_as_perc_of_income::numeric, median_household_income::numeric, 
median_family_income::numeric, per_capita_income::numeric, perc_in_poverty::numeric as percent_in_poverty FROM raw.district_data; -- Populate Neighborhood Dimension INSERT INTO staging.dim_neighborhood (neighborhood_key, neighborhood) SELECT -1, 'Unknown'; INSERT INTO staging.dim_neighborhood (neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, median_family_income, per_capita_income, percent_in_poverty) SELECT acs_name as neighborhood, db_name as neighborhood_alt_name, population::int, households::int, perc_asian::numeric as percent_asian, perc_black::numeric as percent_black, perc_white::numeric as percent_white, perc_nat_am::numeric as percent_native_am, perc_nat_pac::numeric as percent_pacific_isle, perc_other::numeric as percent_other_race, perc_latin::numeric as percent_latin, median_age::numeric, total_units::int, perc_owner_occupied::numeric as percent_owner_occupied, perc_renter_occupied::numeric as percent_renter_occupied, median_rent_as_perc_of_income::numeric, median_household_income::numeric, median_family_income::numeric, per_capita_income::numeric, perc_in_poverty::numeric as percent_in_poverty FROM raw.neighborhood_data; -- Populate Location Dimension INSERT INTO staging.dim_location (location_key, city, state, zip_code) SELECT -1, 'Unknown', 'Unknown', 'Unknown'; INSERT INTO staging.dim_location (city, state, zip_code) SELECT DISTINCT COALESCE(city, 'Unknown') as city, COALESCE(state, 'Unknown') as state, COALESCE(zip, 'Unknown') as zip_code FROM raw.soda_evictions WHERE city IS NOT NULL OR state IS NOT NULL OR zip IS NOT NULL; -- Populate Reason Dimension INSERT INTO staging.dim_reason (reason_key, reason_code, reason_desc) VALUES (-1, 'Unknown', 'Unknown'); INSERT INTO staging.dim_reason 
(reason_code, reason_desc) VALUES ('non_payment', 'Non-Payment'), ('breach', 'Breach'), ('nuisance', 'Nuisance'), ('illegal_use', 'Illegal Use'), ('failure_to_sign_renewal', 'Failure to Sign Renewal'), ('access_denial', 'Access Denial'), ('unapproved_subtenant', 'Unapproved Subtenant'), ('owner_move_in', 'Owner Move-In'), ('demolition', 'Demolition'), ('capital_improvement', 'Capital Improvement'), ('substantial_rehab', 'Substantial Rehab'), ('ellis_act_withdrawal', 'Ellis Act Withdrawal'), ('condo_conversion', 'Condo Conversion'), ('roommate_same_unit', 'Roommate Same Unit'), ('other_cause', 'Other Cause'), ('late_payments', 'Late Payments'), ('lead_remediation', 'Lead Remediation'), ('development', 'Development'), ('good_samaritan_ends', 'Good Samaritan Ends'); -- Populate Reason Bridge Table SELECT ROW_NUMBER() OVER(ORDER BY concat_reason) as group_key, string_to_array(concat_reason, '|') as reason_array, concat_reason INTO TEMP tmp_reason_group FROM ( SELECT DISTINCT TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)) as concat_reason FROM ( SELECT eviction_id, CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END|| CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END|| CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END|| CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END|| CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END|| CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END|| CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END|| CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END|| CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END|| CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END|| CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END|| CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END|| CASE WHEN 
condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
            CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
            CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
            CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
            CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
            CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
            CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
            as concat_reason
        FROM raw.soda_evictions
    ) f1
) f2;

INSERT INTO staging.br_reason_group (reason_group_key, reason_key)
SELECT DISTINCT
    group_key as reason_group_key,
    reason_key
FROM (SELECT group_key, unnest(reason_array) unnested FROM tmp_reason_group) grp
JOIN staging.dim_reason r ON r.reason_code = grp.unnested;


-- Populate Date Dimension Table
INSERT INTO staging.dim_date (
    date_key, date, year, month, month_name, day, day_of_year, weekday_name,
    calendar_week, formatted_date, quartal, year_quartal, year_month,
    year_calendar_week, weekend, us_holiday, period,
    cw_start, cw_end, month_start, month_end)
SELECT
    -1, '1900-01-01', -1, -1, 'Unknown', -1, -1, 'Unknown',
    -1, 'Unknown', 'Unknown', 'Unknown', 'Unknown',
    'Unknown', 'Unknown', 'Unknown', 'Unknown',
    '1900-01-01', '1900-01-01', '1900-01-01', '1900-01-01';

INSERT INTO staging.dim_date (
    date_key, date, year, month, month_name, day, day_of_year, weekday_name,
    calendar_week, formatted_date, quartal, year_quartal, year_month,
    year_calendar_week, weekend, us_holiday, period,
    cw_start, cw_end, month_start, month_end)
SELECT
    TO_CHAR(datum, 'yyyymmdd')::int as date_key,
    datum as date,
    EXTRACT(YEAR FROM datum) as year,
    EXTRACT(MONTH FROM datum) as month,
    TO_CHAR(datum, 'TMMonth') as month_name,
    EXTRACT(DAY FROM datum) as day,
    EXTRACT(doy FROM datum) as day_of_year,
    TO_CHAR(datum, 'TMDay') as weekday_name,
    EXTRACT(week FROM datum) as calendar_week,
    TO_CHAR(datum, 'dd. mm. yyyy') as formatted_date,
    'Q' || TO_CHAR(datum, 'Q') as quartal,
    TO_CHAR(datum, 'yyyy/"Q"Q') as year_quartal,
    TO_CHAR(datum, 'yyyy/mm') as year_month,
    TO_CHAR(datum, 'iyyy/IW') as year_calendar_week,
    CASE WHEN EXTRACT(isodow FROM datum) IN (6, 7) THEN 'Weekend' ELSE 'Weekday' END as weekend,
    CASE WHEN TO_CHAR(datum, 'MMDD') IN ('0101', '0704', '1225', '1226') THEN 'Holiday' ELSE 'No holiday' END as us_holiday,
    CASE
        WHEN TO_CHAR(datum, 'MMDD') BETWEEN '0701' AND '0831' THEN 'Summer break'
        WHEN TO_CHAR(datum, 'MMDD') BETWEEN '1115' AND '1225' THEN 'Christmas season'
        WHEN TO_CHAR(datum, 'MMDD') > '1225' OR TO_CHAR(datum, 'MMDD') <= '0106' THEN 'Winter break'
        ELSE 'Normal'
    END as period,
    datum + (1 - EXTRACT(isodow FROM datum))::int as cw_start,
    datum + (7 - EXTRACT(isodow FROM datum))::int as cw_end,
    datum + (1 - EXTRACT(DAY FROM datum))::int as month_start,
    (datum + (1 - EXTRACT(DAY FROM datum))::int + '1 month'::interval)::date - '1 day'::interval as month_end
FROM (
    SELECT '1997-01-01'::date + SEQUENCE.DAY as datum
    FROM generate_series(0,10956) as SEQUENCE(DAY)
    GROUP BY SEQUENCE.DAY
) DQ;


-- Populate Evictions Fact Table
SELECT
    eviction_id,
    group_key as reason_group_key
INTO tmp_reason_facts
FROM (
    SELECT
        eviction_id,
        TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)) as concat_reason
    FROM (
        SELECT
            eviction_id,
            CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
            CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
            CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
            CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
            CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
            CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
            CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
            CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
            CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
            CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
            CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
            CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
            CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
            CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
            CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
            CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
            CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
            CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
            CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
            as concat_reason
        FROM raw.soda_evictions
    ) grp
) f_grp
JOIN tmp_reason_group t_grp ON f_grp.concat_reason = t_grp.concat_reason;

INSERT INTO staging.fact_evictions (
    eviction_key, district_key, neighborhood_key, location_key,
    reason_group_key, file_date_key, constraints_date_key, street_address)
SELECT
    f.eviction_id as eviction_key,
    COALESCE(d.district_key, -1) as district_key,
    COALESCE(n.neighborhood_key, -1) as neighborhood_key,
    COALESCE(l.location_key, -1) as location_key,
    reason_group_key,
    COALESCE(dt1.date_key, -1) as file_date_key,
    COALESCE(dt2.date_key, -1) as constraints_date_key,
    f.address as street_address
FROM raw.soda_evictions f
LEFT JOIN tmp_reason_facts r ON f.eviction_id = r.eviction_id
LEFT JOIN staging.dim_district d ON f.supervisor_district = d.district
LEFT JOIN staging.dim_neighborhood n ON f.neighborhood = n.neighborhood_alt_name
LEFT JOIN staging.dim_location l
    ON COALESCE(f.city, 'Unknown') = l.city
    AND COALESCE(f.state, 'Unknown') = l.state
    AND COALESCE(f.zip, 'Unknown') = l.zip_code
LEFT JOIN staging.dim_date dt1 ON f.file_date = dt1.date
LEFT JOIN staging.dim_date dt2 ON f.constraints_date = dt2.date;

DROP TABLE tmp_reason_group;
DROP TABLE tmp_reason_facts;


-- Migrate to Production Schema
INSERT INTO prod.dim_district (
    district_key, district, population, households,
    percent_asian, percent_black, percent_white, percent_native_am,
    percent_pacific_isle, percent_other_race, percent_latin,
    median_age, total_units, percent_owner_occupied, percent_renter_occupied,
    median_rent_as_perc_of_income, median_household_income, median_family_income,
    per_capita_income, percent_in_poverty)
SELECT
    district_key, district, population, households,
    percent_asian, percent_black, percent_white, percent_native_am,
    percent_pacific_isle, percent_other_race, percent_latin,
    median_age, total_units, percent_owner_occupied, percent_renter_occupied,
    median_rent_as_perc_of_income, median_household_income, median_family_income,
    per_capita_income, percent_in_poverty
FROM staging.dim_district;

INSERT INTO prod.dim_neighborhood (
    neighborhood_key, neighborhood, neighborhood_alt_name, population, households,
    percent_asian, percent_black, percent_white, percent_native_am,
    percent_pacific_isle, percent_other_race, percent_latin,
    median_age, total_units, percent_owner_occupied, percent_renter_occupied,
    median_rent_as_perc_of_income, median_household_income, median_family_income,
    per_capita_income, percent_in_poverty)
SELECT
    neighborhood_key, neighborhood, neighborhood_alt_name, population, households,
    percent_asian, percent_black, percent_white, percent_native_am,
    percent_pacific_isle, percent_other_race, percent_latin,
    median_age, total_units, percent_owner_occupied, percent_renter_occupied,
    median_rent_as_perc_of_income, median_household_income, median_family_income,
    per_capita_income, percent_in_poverty
FROM staging.dim_neighborhood;

INSERT INTO prod.dim_location (location_key, city, state, zip_code)
SELECT location_key, city, state, zip_code
FROM staging.dim_location;

INSERT INTO prod.dim_reason (reason_key, reason_code, reason_desc)
SELECT reason_key, reason_code, reason_desc
FROM staging.dim_reason;

INSERT INTO prod.br_reason_group (reason_group_key, reason_key)
SELECT reason_group_key, reason_key
FROM staging.br_reason_group;

INSERT INTO prod.dim_date (
    date_key, date, year, month, month_name, day, day_of_year, weekday_name,
    calendar_week, formatted_date, quartal, year_quartal, year_month,
    year_calendar_week, weekend, us_holiday, period,
    cw_start, cw_end, month_start, month_end)
SELECT
    date_key, date, year, month, month_name, day, day_of_year, weekday_name,
    calendar_week, formatted_date, quartal, year_quartal, year_month,
    year_calendar_week, weekend, us_holiday, period,
    cw_start, cw_end, month_start, month_end
FROM staging.dim_date;

INSERT INTO prod.fact_evictions (
    eviction_key, district_key, neighborhood_key, location_key,
    reason_group_key, file_date_key, constraints_date_key, street_address)
SELECT
    eviction_key, district_key, neighborhood_key, location_key,
    reason_group_key, file_date_key, constraints_date_key, street_address
FROM staging.fact_evictions;

================================================
FILE: dags/sql/incremental_load.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/incremental_load.sql
-- nano /home/airflow/airflow/dags/sql/incremental_load.sql

-- Populate District Dimension
INSERT INTO staging.dim_district (
    district, population, households,
    percent_asian, percent_black, percent_white, percent_native_am,
    percent_pacific_isle, percent_other_race, percent_latin,
    median_age, total_units, percent_owner_occupied, percent_renter_occupied,
    median_rent_as_perc_of_income, median_household_income, median_family_income,
    per_capita_income, percent_in_poverty)
SELECT
    district,
    population::int,
    households::int,
    perc_asian::numeric as percent_asian,
    perc_black::numeric as percent_black,
    perc_white::numeric as percent_white,
    perc_nat_am::numeric as percent_native_am,
    perc_nat_pac::numeric as percent_pacific_isle,
    perc_other::numeric as percent_other_race,
    perc_latin::numeric as percent_latin,
    median_age::numeric,
    total_units::int,
    perc_owner_occupied::numeric as percent_owner_occupied,
    perc_renter_occupied::numeric as percent_renter_occupied,
    median_rent_as_perc_of_income::numeric,
    median_household_income::numeric,
    median_family_income::numeric,
    per_capita_income::numeric,
    perc_in_poverty::numeric as percent_in_poverty
FROM raw.district_data
ON CONFLICT (district) DO UPDATE SET
    population = EXCLUDED.population,
    households = EXCLUDED.households,
    percent_asian = EXCLUDED.percent_asian,
    percent_black = EXCLUDED.percent_black,
    percent_white = EXCLUDED.percent_white,
    percent_native_am = EXCLUDED.percent_native_am,
    percent_pacific_isle = EXCLUDED.percent_pacific_isle,
    percent_other_race = EXCLUDED.percent_other_race,
    percent_latin = EXCLUDED.percent_latin,
    median_age = EXCLUDED.median_age,
    total_units = EXCLUDED.total_units,
    percent_owner_occupied = EXCLUDED.percent_owner_occupied,
    percent_renter_occupied = EXCLUDED.percent_renter_occupied,
    median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
    median_household_income = EXCLUDED.median_household_income,
    median_family_income = EXCLUDED.median_family_income,
    per_capita_income = EXCLUDED.per_capita_income,
    percent_in_poverty = EXCLUDED.percent_in_poverty;


-- Populate Neighborhood Dimension
INSERT INTO staging.dim_neighborhood (
    neighborhood, neighborhood_alt_name, population, households,
    percent_asian, percent_black, percent_white, percent_native_am,
    percent_pacific_isle, percent_other_race, percent_latin,
    median_age, total_units, percent_owner_occupied, percent_renter_occupied,
    median_rent_as_perc_of_income, median_household_income, median_family_income,
    per_capita_income, percent_in_poverty)
SELECT
    acs_name as neighborhood,
    db_name as neighborhood_alt_name,
    population::int,
    households::int,
    perc_asian::numeric as percent_asian,
    perc_black::numeric as percent_black,
    perc_white::numeric as percent_white,
    perc_nat_am::numeric as percent_native_am,
    perc_nat_pac::numeric as percent_pacific_isle,
    perc_other::numeric as percent_other_race,
    perc_latin::numeric as percent_latin,
    median_age::numeric,
    total_units::int,
    perc_owner_occupied::numeric as percent_owner_occupied,
    perc_renter_occupied::numeric as percent_renter_occupied,
    median_rent_as_perc_of_income::numeric,
    median_household_income::numeric,
    median_family_income::numeric,
    per_capita_income::numeric,
    perc_in_poverty::numeric as percent_in_poverty
FROM raw.neighborhood_data
ON CONFLICT (neighborhood) DO UPDATE SET
    neighborhood_alt_name = EXCLUDED.neighborhood_alt_name,
    population = EXCLUDED.population,
    households = EXCLUDED.households,
    percent_asian = EXCLUDED.percent_asian,
    percent_black = EXCLUDED.percent_black,
    percent_white = EXCLUDED.percent_white,
    percent_native_am = EXCLUDED.percent_native_am,
    percent_pacific_isle = EXCLUDED.percent_pacific_isle,
    percent_other_race = EXCLUDED.percent_other_race,
    percent_latin = EXCLUDED.percent_latin,
    median_age = EXCLUDED.median_age,
    total_units = EXCLUDED.total_units,
    percent_owner_occupied = EXCLUDED.percent_owner_occupied,
    percent_renter_occupied = EXCLUDED.percent_renter_occupied,
    median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
    median_household_income = EXCLUDED.median_household_income,
    median_family_income = EXCLUDED.median_family_income,
    per_capita_income = EXCLUDED.per_capita_income,
    percent_in_poverty = EXCLUDED.percent_in_poverty;


-- Populate Location Dimension
INSERT INTO staging.dim_location (city, state, zip_code)
SELECT
    se.city,
    se.state,
    se.zip_code
FROM (
    SELECT DISTINCT
        COALESCE(city, 'Unknown') as city,
        COALESCE(state, 'Unknown') as state,
        COALESCE(zip, 'Unknown') as zip_code
    FROM raw.soda_evictions
) se
LEFT JOIN staging.dim_location dl
    ON se.city = dl.city
    AND se.state = dl.state
    AND se.zip_code = dl.zip_code
WHERE dl.location_key IS NULL;


-- Populate Reason Bridge Table
SELECT DISTINCT
    reason_group_key,
    ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_existing_reason_groups
FROM staging.br_reason_group
GROUP BY reason_group_key;

SELECT
    concat_reason,
    ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_new_reason_groups
FROM (
    SELECT DISTINCT
        string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|') as concat_reason,
        unnest(string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|')) unnested_reason
    FROM (
        SELECT DISTINCT
            CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
            CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
            CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
            CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
            CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
            CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
            CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
            CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
            CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
            CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
            CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
            CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
            CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
            CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
            CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
            CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
            CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
            CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
            CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
            as concat_reason
        FROM raw.soda_evictions
    ) se1
    GROUP BY concat_reason
) se2
JOIN staging.dim_reason r ON se2.unnested_reason = r.reason_code
GROUP BY concat_reason;

INSERT INTO staging.br_reason_group (reason_group_key, reason_key)
SELECT
    final_grp.max_key + new_grp.tmp_group_key as reason_group_key,
    new_grp.reason_key as reason_key
FROM (
    SELECT DISTINCT
        ROW_NUMBER() OVER(ORDER BY concat_reason) as tmp_group_key,
        concat_reason,
        unnest(n.rk_array) as reason_key
    FROM tmp_new_reason_groups n
    LEFT JOIN tmp_existing_reason_groups e ON n.rk_array = e.rk_array
    WHERE e.reason_group_key IS NULL
) new_grp
LEFT JOIN (SELECT MAX(reason_group_key) max_key FROM staging.br_reason_group) final_grp ON 1=1
ORDER BY reason_group_key, reason_key;

DROP TABLE tmp_existing_reason_groups;
DROP TABLE tmp_new_reason_groups;


-- Populate Staging Fact Table
SELECT DISTINCT
    reason_group_key,
    ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_existing_reason_groups
FROM staging.br_reason_group
GROUP BY reason_group_key;

SELECT
    eviction_id,
    ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_fct_reason_groups
FROM (
    SELECT
        eviction_id,
        unnest(string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|')) as unnested_reason
    FROM (
        SELECT
            eviction_id,
            CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
            CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
            CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
            CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
            CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
            CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
            CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
            CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
            CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
            CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
            CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
            CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
            CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
            CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
            CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
            CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
            CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
            CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
            CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
            as concat_reason
        FROM raw.soda_evictions
    ) se1
) se2
JOIN staging.dim_reason r ON se2.unnested_reason = r.reason_code
GROUP BY se2.eviction_id;

SELECT
    eviction_id,
    reason_group_key
INTO tmp_reason_group_lookup
FROM tmp_fct_reason_groups f
JOIN tmp_existing_reason_groups d ON f.rk_array = d.rk_array;

TRUNCATE TABLE staging.fact_evictions;

INSERT INTO staging.fact_evictions (
    eviction_key, district_key, neighborhood_key, location_key,
    reason_group_key, file_date_key, constraints_date_key, street_address)
SELECT
    f.eviction_id as eviction_key,
    COALESCE(d.district_key, -1) as district_key,
    COALESCE(n.neighborhood_key, -1) as neighborhood_key,
    COALESCE(l.location_key, -1) as location_key,
    r.reason_group_key as reason_group_key,
    COALESCE(dt1.date_key, -1) as file_date_key,
    COALESCE(dt2.date_key, -1) as constraints_date_key,
    f.address as street_address
FROM raw.soda_evictions f
JOIN tmp_reason_group_lookup r ON f.eviction_id = r.eviction_id
LEFT JOIN staging.dim_district d ON f.supervisor_district = d.district
LEFT JOIN staging.dim_neighborhood n ON f.neighborhood = n.neighborhood_alt_name
LEFT JOIN staging.dim_location l
    ON COALESCE(f.city, 'Unknown') = l.city
    AND COALESCE(f.state, 'Unknown') = l.state
    AND COALESCE(f.zip, 'Unknown') = l.zip_code
LEFT JOIN staging.dim_date dt1 ON f.file_date = dt1.date
LEFT JOIN staging.dim_date dt2 ON f.constraints_date = dt2.date;

DROP TABLE tmp_existing_reason_groups;
DROP TABLE tmp_fct_reason_groups;
DROP TABLE tmp_reason_group_lookup;


-- Merge Into Production Schema
INSERT INTO prod.dim_district (
    district_key, district, population, households,
    percent_asian, percent_black, percent_white, percent_native_am,
    percent_pacific_isle, percent_other_race, percent_latin,
    median_age, total_units, percent_owner_occupied, percent_renter_occupied,
    median_rent_as_perc_of_income, median_household_income, median_family_income,
    per_capita_income, percent_in_poverty)
SELECT
    district_key, district, population, households,
    percent_asian, percent_black, percent_white, percent_native_am,
    percent_pacific_isle, percent_other_race, percent_latin,
    median_age, total_units, percent_owner_occupied, percent_renter_occupied,
    median_rent_as_perc_of_income, median_household_income, median_family_income,
    per_capita_income, percent_in_poverty
FROM staging.dim_district
ON CONFLICT(district_key) DO UPDATE SET
    district = EXCLUDED.district,
    population = EXCLUDED.population,
    households = EXCLUDED.households,
    percent_asian = EXCLUDED.percent_asian,
    percent_black = EXCLUDED.percent_black,
    percent_white = EXCLUDED.percent_white,
    percent_native_am = EXCLUDED.percent_native_am,
    percent_pacific_isle = EXCLUDED.percent_pacific_isle,
    percent_other_race = EXCLUDED.percent_other_race,
    percent_latin = EXCLUDED.percent_latin,
    median_age = EXCLUDED.median_age,
    total_units = EXCLUDED.total_units,
    percent_owner_occupied = EXCLUDED.percent_owner_occupied,
    percent_renter_occupied = EXCLUDED.percent_renter_occupied,
    median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
    median_household_income = EXCLUDED.median_household_income,
    median_family_income = EXCLUDED.median_family_income,
    per_capita_income = EXCLUDED.per_capita_income,
    percent_in_poverty = EXCLUDED.percent_in_poverty;

INSERT INTO prod.dim_neighborhood (
    neighborhood_key, neighborhood, neighborhood_alt_name, population, households,
    percent_asian, percent_black, percent_white, percent_native_am,
    percent_pacific_isle, percent_other_race, percent_latin,
    median_age, total_units, percent_owner_occupied, percent_renter_occupied,
    median_rent_as_perc_of_income, median_household_income, median_family_income,
    per_capita_income, percent_in_poverty)
SELECT
    neighborhood_key, neighborhood, neighborhood_alt_name, population, households,
    percent_asian, percent_black, percent_white, percent_native_am,
    percent_pacific_isle, percent_other_race, percent_latin,
    median_age, total_units, percent_owner_occupied, percent_renter_occupied,
    median_rent_as_perc_of_income, median_household_income, median_family_income,
    per_capita_income, percent_in_poverty
FROM staging.dim_neighborhood
ON CONFLICT (neighborhood_key) DO UPDATE SET
    neighborhood = EXCLUDED.neighborhood,
    neighborhood_alt_name = EXCLUDED.neighborhood_alt_name,
    population = EXCLUDED.population,
    households = EXCLUDED.households,
    percent_asian = EXCLUDED.percent_asian,
    percent_black = EXCLUDED.percent_black,
    percent_white = EXCLUDED.percent_white,
    percent_native_am = EXCLUDED.percent_native_am,
    percent_pacific_isle = EXCLUDED.percent_pacific_isle,
    percent_other_race = EXCLUDED.percent_other_race,
    percent_latin = EXCLUDED.percent_latin,
    median_age = EXCLUDED.median_age,
    total_units = EXCLUDED.total_units,
    percent_owner_occupied = EXCLUDED.percent_owner_occupied,
    percent_renter_occupied = EXCLUDED.percent_renter_occupied,
    median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
    median_household_income = EXCLUDED.median_household_income,
    median_family_income = EXCLUDED.median_family_income,
    per_capita_income = EXCLUDED.per_capita_income,
    percent_in_poverty = EXCLUDED.percent_in_poverty;

INSERT INTO prod.dim_location (location_key, city, state, zip_code)
SELECT location_key, city, state, zip_code
FROM staging.dim_location
ON CONFLICT (location_key) DO NOTHING;

INSERT INTO prod.br_reason_group (reason_group_key, reason_key)
SELECT
    stg.reason_group_key,
    stg.reason_key
FROM staging.br_reason_group stg
LEFT JOIN prod.br_reason_group prd
    ON stg.reason_group_key = prd.reason_group_key
    AND stg.reason_key = prd.reason_key
WHERE prd.reason_group_key IS NULL;

INSERT INTO prod.fact_evictions (
    eviction_key, district_key, neighborhood_key, location_key,
    reason_group_key, file_date_key, constraints_date_key, street_address)
SELECT
    eviction_key, district_key, neighborhood_key, location_key,
    reason_group_key, file_date_key, constraints_date_key, street_address
FROM staging.fact_evictions
ON CONFLICT (eviction_key) DO UPDATE SET
    district_key = EXCLUDED.district_key,
    neighborhood_key = EXCLUDED.neighborhood_key,
    location_key = EXCLUDED.location_key,
    reason_group_key = EXCLUDED.reason_group_key,
    file_date_key = EXCLUDED.file_date_key,
    constraints_date_key = EXCLUDED.constraints_date_key,
    street_address = EXCLUDED.street_address;

================================================
FILE: dags/sql/init_db_schema.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/init_db_schema.sql
-- nano /home/airflow/airflow/dags/sql/init_db_schema.sql

DROP SCHEMA IF EXISTS raw CASCADE;
DROP SCHEMA IF EXISTS staging CASCADE;
DROP SCHEMA IF EXISTS prod CASCADE;
CREATE SCHEMA raw;
CREATE SCHEMA staging;
CREATE SCHEMA prod;


-- Raw
CREATE UNLOGGED TABLE raw.soda_evictions (
    raw_id text,
    created_at timestamp,
    updated_at timestamp,
    eviction_id text,
    address text,
    city text,
    state text,
    zip text,
    file_date timestamp,
    non_payment boolean,
    breach boolean,
    nuisance boolean,
    illegal_use boolean,
    failure_to_sign_renewal boolean,
    access_denial boolean,
    unapproved_subtenant boolean,
    owner_move_in boolean,
    demolition boolean,
    capital_improvement boolean,
    substantial_rehab boolean,
    ellis_act_withdrawal boolean,
    condo_conversion boolean,
    roommate_same_unit boolean,
    other_cause boolean,
    late_payments boolean,
    lead_remediation boolean,
    development boolean,
    good_samaritan_ends boolean,
    constraints_date timestamp,
    supervisor_district text,
    neighborhood text
);

CREATE UNLOGGED TABLE raw.district_data (
    district text,
    population text,
    households text,
    perc_asian text,
    perc_black text,
    perc_white text,
    perc_nat_am text,
    perc_nat_pac text,
    perc_other text,
    perc_latin text,
    median_age text,
    total_units text,
    perc_owner_occupied text,
    perc_renter_occupied text,
    median_rent_as_perc_of_income text,
    median_household_income text,
    median_family_income text,
    per_capita_income text,
    perc_in_poverty text
);
CREATE UNIQUE INDEX district_name_uniq_idx ON raw.district_data (district);

CREATE UNLOGGED TABLE raw.neighborhood_data (
    acs_name text,
    db_name text,
    population text,
    households text,
    perc_asian text,
    perc_black text,
    perc_white text,
    perc_nat_am text,
    perc_nat_pac text,
    perc_other text,
    perc_latin text,
    median_age text,
    total_units text,
    perc_owner_occupied text,
    perc_renter_occupied text,
    median_rent_as_perc_of_income text,
    median_household_income text,
    median_family_income text,
    per_capita_income text,
    perc_in_poverty text
);
CREATE UNIQUE INDEX neighborhood_name_uniq_idx ON raw.neighborhood_data (acs_name);


-- Staging
CREATE TABLE staging.dim_district (
    district_key serial PRIMARY KEY,
    district text,
    population integer,
    households integer,
    percent_asian numeric,
    percent_black numeric,
    percent_white numeric,
    percent_native_am numeric,
    percent_pacific_isle numeric,
    percent_other_race numeric,
    percent_latin numeric,
    median_age numeric,
    total_units integer,
    percent_owner_occupied numeric,
    percent_renter_occupied numeric,
    median_rent_as_perc_of_income numeric,
    median_household_income numeric,
    median_family_income numeric,
    per_capita_income numeric,
    percent_in_poverty numeric
);
CREATE UNIQUE INDEX district_name_uniq_idx ON staging.dim_district (district);

CREATE TABLE staging.dim_neighborhood (
    neighborhood_key serial PRIMARY KEY,
    neighborhood text,
    neighborhood_alt_name text,
    population integer,
    households integer,
    percent_asian numeric,
    percent_black numeric,
    percent_white numeric,
    percent_native_am numeric,
    percent_pacific_isle numeric,
    percent_other_race numeric,
    percent_latin numeric,
    median_age numeric,
    total_units integer,
    percent_owner_occupied numeric,
    percent_renter_occupied numeric,
    median_rent_as_perc_of_income numeric,
    median_household_income numeric,
    median_family_income numeric,
    per_capita_income numeric,
    percent_in_poverty numeric
);
CREATE UNIQUE INDEX neighborhood_name_uniq_idx ON staging.dim_neighborhood (neighborhood);

CREATE TABLE staging.dim_location (
    location_key serial PRIMARY KEY,
    city text,
    state text,
    zip_code text
);

CREATE TABLE staging.dim_reason (
    reason_key serial PRIMARY KEY,
    reason_code text,
    reason_desc text
);

CREATE TABLE staging.br_reason_group (
    reason_group_key int,
    reason_key int
);
CREATE INDEX reason_group_key_idx ON staging.br_reason_group (reason_group_key);
CREATE INDEX reason_key_idx ON staging.br_reason_group (reason_key);

CREATE TABLE staging.dim_date (
    date_key int PRIMARY KEY,
    date date,
    year int,
    month int,
    month_name text,
    day int,
    day_of_year int,
    weekday_name text,
    calendar_week int,
    formatted_date text,
    quartal text,
    year_quartal text,
    year_month text,
    year_calendar_week text,
    weekend text,
    us_holiday text,
    period text,
    cw_start date,
    cw_end date,
    month_start date,
    month_end date
);

CREATE TABLE staging.fact_evictions (
    eviction_key text PRIMARY KEY,
    location_key int,
    district_key int,
    neighborhood_key int,
    reason_group_key int,
    file_date_key int,
    constraints_date_key int,
    street_address text
);


-- Prod
CREATE TABLE prod.dim_district (
    district_key serial PRIMARY KEY,
    district text,
    population integer,
    households integer,
    percent_asian numeric,
    percent_black numeric,
    percent_white numeric,
    percent_native_am numeric,
    percent_pacific_isle numeric,
    percent_other_race numeric,
    percent_latin numeric,
    median_age numeric,
    total_units integer,
    percent_owner_occupied numeric,
    percent_renter_occupied numeric,
    median_rent_as_perc_of_income numeric,
    median_household_income numeric,
    median_family_income numeric,
    per_capita_income numeric,
    percent_in_poverty numeric
);

CREATE TABLE prod.dim_neighborhood (
    neighborhood_key serial PRIMARY KEY,
    neighborhood text,
    neighborhood_alt_name text,
    population integer,
    households integer,
    percent_asian numeric,
    percent_black numeric,
    percent_white numeric,
    percent_native_am numeric,
    percent_pacific_isle numeric,
    percent_other_race numeric,
    percent_latin numeric,
    median_age numeric,
    total_units integer,
    percent_owner_occupied numeric,
    percent_renter_occupied numeric,
    median_rent_as_perc_of_income numeric,
    median_household_income numeric,
    median_family_income numeric,
    per_capita_income numeric,
    percent_in_poverty numeric
);

CREATE TABLE prod.dim_location (
    location_key serial PRIMARY KEY,
    city text,
    state text,
    zip_code text
);

CREATE TABLE prod.dim_reason (
    reason_key serial PRIMARY KEY,
    reason_code text,
    reason_desc text
);

CREATE TABLE prod.br_reason_group (
    reason_group_key int,
    reason_key int
);
CREATE INDEX reason_group_key_idx ON prod.br_reason_group (reason_group_key);
CREATE INDEX reason_key_idx ON prod.br_reason_group (reason_key);

CREATE TABLE prod.dim_date (
    date_key int PRIMARY KEY,
    date date,
    year int,
    month int,
    month_name text,
    day int,
    day_of_year int,
    weekday_name text,
    calendar_week int,
    formatted_date text,
    quartal text,
    year_quartal text,
    year_month text,
    year_calendar_week text,
    weekend text,
    us_holiday text,
    period text,
    cw_start date,
    cw_end date,
    month_start date,
    month_end date
);

CREATE TABLE prod.fact_evictions (
    eviction_key text PRIMARY KEY,
    location_key int,
    district_key int,
    neighborhood_key int,
    reason_group_key int,
    file_date_key int,
    constraints_date_key int,
    street_address text
);

================================================
FILE: dags/sql/trunc_target_tables.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/trunc_target_tables.sql
-- nano /home/airflow/airflow/dags/sql/trunc_target_tables.sql

TRUNCATE TABLE raw.soda_evictions;
TRUNCATE TABLE raw.district_data;
TRUNCATE TABLE raw.neighborhood_data;