Repository: ilya-galperin/SF-EvictionTracker
Branch: master
Commit: e3f0a03131eb
Files: 10
Total size: 60.3 KB
Directory structure:
gitextract_jbid5ch1/
├── README.md
├── airflow_installation.txt
└── dags/
    ├── full_load_dag.py
    ├── incremental_load_dag.py
    ├── operators/
    │   ├── s3_to_postgres_operator.py
    │   └── soda_to_s3_operator.py
    └── sql/
        ├── full_load.sql
        ├── incremental_load.sql
        ├── init_db_schema.sql
        └── trunc_target_tables.sql
================================================
FILE CONTENTS
================================================
================================================
FILE: README.md
================================================
# SF-EvictionTracker
Tracking eviction trends in San Francisco across filing reasons, districts, neighborhoods, and demographics in the months following COVID-19. Data warehouse infrastructure is housed in the AWS ecosystem and uses Apache Airflow for orchestration with public-facing dashboards created using Metabase.
Questions? Feel free to reach me at ilya.glprn@gmail.com.
Public Dashboard Link: http://sf-evictiontracker-metabase.us-east-1.elasticbeanstalk.com/public/dashboard/f637e470-8ea9-4b03-af80-53988e5b6a9b
<h3>ARCHITECTURE:</h3>

Data is sourced from San Francisco Open Data's API and CSVs containing San Francisco district and neighborhood aggregate census results. Airflow orchestrates its movement to an S3 bucket and into a data warehouse hosted in RDS. SQL scripts are then run to transform the data from its raw form through a staging schema and into production target tables. The presentation layer is created using Metabase, an open-source data visualization tool, and deployed using Elastic Beanstalk.
<h3>DATA MODEL:</h3>
Dimension Tables:
- `dim_district`
- `dim_neighborhood`
- `dim_location`
- `dim_reason`
- `dim_date`
- `br_reason_group`

Fact Tables:
- `fact_evictions`

The data model is implemented using a star schema with a bridge table to accommodate any new permutations for the reason dimension. More information on bridge tables can be found here: https://www.kimballgroup.com/2012/02/design-tip-142-building-bridges/
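
To make the bridge table idea concrete, here is a minimal Python sketch of how distinct combinations of reason flags could be turned into group keys and bridge rows. It mirrors the approach in `full_load.sql` only loosely: the helper names `reason_combo` and `build_bridge` are hypothetical, only four of the reason codes are listed, and the key order comes from Python's string sort, which may differ from the database's collation.

```python
from typing import Dict, List, Tuple

# Subset of the reason flag columns, for brevity (assumption: the real list is longer).
REASON_CODES = ["non_payment", "breach", "nuisance", "owner_move_in"]


def reason_combo(record: Dict[str, str]) -> str:
    """Join the reason codes flagged 'true' with '|', or return 'Unknown'."""
    flagged = [code for code in REASON_CODES if record.get(code) == "true"]
    return "|".join(flagged) if flagged else "Unknown"


def build_bridge(records: List[Dict[str, str]]) -> Tuple[Dict[str, int], List[Tuple[int, str]]]:
    """Assign one group key per distinct combination and emit (group_key, reason_code) rows."""
    combos = sorted({reason_combo(r) for r in records})
    group_keys = {combo: i for i, combo in enumerate(combos, start=1)}
    bridge_rows = [
        (key, code)
        for combo, key in group_keys.items()
        for code in combo.split("|")
    ]
    return group_keys, bridge_rows


records = [
    {"eviction_id": "M1", "non_payment": "true"},
    {"eviction_id": "M2", "breach": "true", "nuisance": "true"},
    {"eviction_id": "M3", "non_payment": "true"},
    {"eviction_id": "M4"},  # no reason flagged
]
group_keys, bridge_rows = build_bridge(records)
print(group_keys)
print(bridge_rows)
```

Each fact row then stores a single group key, while the bridge rows expand that key into one row per reason, so a new combination of reasons only adds rows rather than columns.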

<h3>ETL FLOW:</h3>
General Overview -
- Evictions data is collected from the SODA API and moved into an S3 Bucket
- Neighborhood/district census data is stored as a CSV in S3
- Once the API load to S3 is complete, data is moved into RDS into a "raw" schema and moves through a staging schema for processing
- ETL job execution is complete once data is moved from the staging schema into the final production tables
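
The ordering of the steps above can be sketched as a small dependency graph. This is an illustration only, using the task ids from the full load DAG and Python's standard `graphlib` module (Python 3.9+) rather than Airflow itself.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
full_load = {
    "get_evictions_data": set(),
    "initialize_target_db": {"get_evictions_data"},
    "load_evictions_data": {"initialize_target_db"},
    "load_neighborhood_data": {"initialize_target_db"},
    "load_district_data": {"initialize_target_db"},
    "execute_full_load": {
        "load_evictions_data",
        "load_neighborhood_data",
        "load_district_data",
    },
}

# static_order() yields one valid execution order for the graph.
order = list(TopologicalSorter(full_load).static_order())
print(order)
```

The three load tasks have no dependencies on each other, so they may appear in any order between the schema initialization and the final SQL transformation.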
DAGs and Custom Airflow Operators -


There are 2 DAGs (Directed Acyclic Graphs) used for this project - <b>full load</b> which should be executed on initial setup and <b>incremental load</b> which is scheduled to run daily and pull new data from the Socrata Open Data API.
The incremental load DAG uses XCom to pass the number of records returned by the API call task to a ShortCircuitOperator, which skips downstream tasks if the API call produces no results.
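
A minimal sketch of the short-circuit check, assuming the API task pushes its value under the XCom key `obj_len` and has the task id `get_evictions_data`. `FakeTaskInstance` and `has_new_records` are hypothetical helpers that exist only so the example runs without Airflow installed.

```python
from typing import Optional


def has_new_records(record_count: Optional[int]) -> bool:
    """Return True only when the upstream task reported at least one record."""
    # xcom_pull returns None when nothing was pushed; treat that as "no data".
    return record_count is not None and record_count > 0


class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance, for illustration only."""

    def __init__(self, pushed):
        self._pushed = pushed

    def xcom_pull(self, task_ids=None, key=None):
        return self._pushed.get((task_ids, key))


def get_size(**context) -> bool:
    """Callable for the ShortCircuitOperator: continue only if records were loaded."""
    val = context["ti"].xcom_pull(task_ids="get_evictions_data", key="obj_len")
    return has_new_records(val)


ti_with_data = FakeTaskInstance({("get_evictions_data", "obj_len"): 42})
ti_without_data = FakeTaskInstance({("get_evictions_data", "obj_len"): 0})
print(get_size(ti=ti_with_data))     # True
print(get_size(ti=ti_without_data))  # False
```

When the callable returns False, the ShortCircuitOperator marks the downstream tasks as skipped, so the truncate and load steps never touch the warehouse on days with no new data.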
The DAGs use two custom operators. They have been purpose-built for this project but can be extended for use in other data pipelines.
1. soda_to_s3_operator: Queries the Socrata Open Data API using a SoQL string and uploads the results to an S3 bucket. Includes an optional check that aborts the ETL if the source data size exceeds a user-defined limit.
2. s3_to_postgres_operator: Collects data from a file hosted on AWS S3 and loads it into a Postgres table. Current version supports JSON and CSV source data types.
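
The paging approach used when querying the API can be sketched as follows. This is a simplified illustration, not the operator itself: `fetch_all` and `make_fake_api` are hypothetical names, and the fake API just slices an in-memory list instead of calling the SODA endpoint.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 10000  # matches the LIMIT in the operator's default SoQL query


def fetch_all(fetch_page: Callable[[int], List[Dict]], page_size: int = PAGE_SIZE) -> List[Dict]:
    """Call fetch_page(offset) repeatedly until it returns an empty page."""
    combined: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page:
            break
        combined.extend(page)
        offset += page_size
    return combined


def make_fake_api(rows: List[Dict], page_size: int) -> Callable[[int], List[Dict]]:
    """Hypothetical stand-in for the API: returns one page of rows per call."""
    def fetch_page(offset: int) -> List[Dict]:
        return rows[offset:offset + page_size]
    return fetch_page


rows = [{"eviction_id": f"M{i}"} for i in range(25)]
result = fetch_all(make_fake_api(rows, page_size=10), page_size=10)
print(len(result))  # 25
```

Stopping on the first empty page means one extra request per run, but it avoids needing to know the total row count in advance.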
<h3>INFRASTRUCTURE:</h3>
This project is hosted in the AWS ecosystem and uses the following resources:

EC2 -
- t2.medium - dedicated resource for Airflow, managed by AWS Instance Scheduler to complete the daily DAG run and shut off after execution
- t2.small - used to host Metabase, always online
RDS -
- t2.small - hosts application database for Metabase and the data warehouse
Elastic Beanstalk is used to deploy the Metabase web application.
<h3>DASHBOARD:</h3>
The dashboard is publicly accessible here: http://sf-evictiontracker-metabase.us-east-1.elasticbeanstalk.com/public/dashboard/f637e470-8ea9-4b03-af80-53988e5b6a9b
Some example screengrabs below!



================================================
FILE: airflow_installation.txt
================================================
STEP 1 - Launch EC2 Instance:
- t3.medium
- 12gb storage
- launch-wizard-3 security group to open TCP Port 8080
- associate elastic IP
STEP 2 - Install Postgres Server on EC2:
run:
sudo apt-get update
sudo apt-get install python-psycopg2
sudo apt-get install postgresql postgresql-contrib
Step 3 - Create OS User airflow
run:
sudo adduser airflow
sudo usermod -aG sudo airflow
su - airflow
Note: From here on, make sure you are logged in as airflow user.
Step 4 - Create Postgres Metadatabase and User Access
run:
sudo -u postgres psql
in postgres prompt:
CREATE USER airflow PASSWORD 'password';
CREATE DATABASE airflow;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO airflow;
\q
Step 5 - Change Postgres Connection Config
run:
sudo nano /etc/postgresql/10/main/pg_hba.conf
Change this line -
# IPv4 local connections:
host all all 127.0.0.1/32 md5
To this line -
# IPv4 local connections:
host all all 0.0.0.0/0 trust
run:
sudo nano /etc/postgresql/10/main/postgresql.conf
Change this line -
#listen_addresses = 'localhost' # what IP address(es) to listen on
To this line -
listen_addresses = '*' # what IP address(es) to listen on
restart postgres server:
sudo service postgresql restart
Step 6 - Install Airflow
run:
su - airflow
sudo apt-get install python3-pip
sudo python3 -m pip install apache-airflow[postgres,s3,aws]
run:
airflow initdb
Step 7 - Connect Airflow to Postgres
run:
nano /home/airflow/airflow/airflow.cfg
Change lines -
sql_alchemy_conn = postgresql+psycopg2://airflow:password@localhost:5432/airflow
executor = LocalExecutor
load_examples = False
run:
airflow initdb
Step 8 - Add DAGs:
mkdir /home/airflow/airflow/dags/
cd /home/airflow/airflow/dags/
touch tutorial.py
nano tutorial.py
Step 9 - Set up Airflow Webserver and Scheduler to start automatically
The final step is to ensure Airflow starts up when your EC2 instance starts.
sudo nano /etc/systemd/system/airflow-webserver.service
Paste the following into the file created above
[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service
Wants=postgresql.service
[Service]
EnvironmentFile=/etc/environment
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/local/bin/airflow webserver
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
Next we will create the following file to enable scheduler service
sudo nano /etc/systemd/system/airflow-scheduler.service
Paste the following
[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service
Wants=postgresql.service
[Service]
EnvironmentFile=/etc/environment
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/local/bin/airflow scheduler
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
Next enable and start these services, then check their status
sudo systemctl enable airflow-webserver.service
sudo systemctl enable airflow-scheduler.service
sudo systemctl start airflow-scheduler
sudo systemctl start airflow-webserver
sudo systemctl status airflow-scheduler
sudo systemctl status airflow-webserver
================================================
FILE: dags/full_load_dag.py
================================================
# echo "" > /home/airflow/airflow/dags/full_load_dag.py
# nano /home/airflow/airflow/dags/full_load_dag.py
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from operators.soda_to_s3_operator import SodaToS3Operator
from operators.s3_to_postgres_operator import S3ToPostgresOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

soda_headers = {
    'keyId': '############',
    'keySecret': '#################',
    'Accept': 'application/json'
}

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=30)
}

with DAG('eviction-tracker-full_load',
         default_args=default_args,
         description='Executes full load from SODA API to Production DW.',
         max_active_runs=1,
         schedule_interval=None) as dag:

    # Pull the full evictions dataset from the SODA API into S3
    op1 = SodaToS3Operator(
        task_id='get_evictions_data',
        http_conn_id='API_Evictions',
        headers=soda_headers,
        s3_conn_id='S3_Evictions',
        s3_bucket='sf-evictionmeter',
        s3_directory='soda_jsons',
        size_check=True,
        max_bytes=500000000,
        dag=dag
    )

    # Create the warehouse schemas and tables
    op2 = PostgresOperator(
        task_id='initialize_target_db',
        postgres_conn_id='RDS_Evictions',
        sql='sql/init_db_schema.sql',
        dag=dag
    )

    # Load the latest evictions JSON from S3 into the raw schema
    op3 = S3ToPostgresOperator(
        task_id='load_evictions_data',
        s3_conn_id='S3_Evictions',
        s3_bucket='sf-evictionmeter',
        s3_prefix='soda_jsons/soda_evictions_import',
        source_data_type='json',
        postgres_conn_id='RDS_Evictions',
        schema='raw',
        table='soda_evictions',
        get_latest=True,
        dag=dag
    )

    # Load neighborhood census CSV from S3 into the raw schema
    op4 = S3ToPostgresOperator(
        task_id='load_neighborhood_data',
        s3_conn_id='S3_Evictions',
        s3_bucket='sf-evictionmeter',
        s3_prefix='census_csv/sf_by_neighborhood',
        source_data_type='csv',
        header=True,
        postgres_conn_id='RDS_Evictions',
        schema='raw',
        table='neighborhood_data',
        get_latest=True,
        dag=dag
    )

    # Load district census CSV from S3 into the raw schema
    op5 = S3ToPostgresOperator(
        task_id='load_district_data',
        s3_conn_id='S3_Evictions',
        s3_bucket='sf-evictionmeter',
        s3_prefix='census_csv/sf_by_district',
        source_data_type='csv',
        header=True,
        postgres_conn_id='RDS_Evictions',
        schema='raw',
        table='district_data',
        get_latest=True,
        dag=dag
    )

    # Transform raw data through staging into the production tables
    op6 = PostgresOperator(
        task_id='execute_full_load',
        postgres_conn_id='RDS_Evictions',
        sql='sql/full_load.sql',
        dag=dag
    )

    op1 >> op2 >> (op3, op4, op5) >> op6
================================================
FILE: dags/incremental_load_dag.py
================================================
# echo "" > /home/airflow/airflow/dags/incremental_load_dag.py
# nano /home/airflow/airflow/dags/incremental_load_dag.py
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import ShortCircuitOperator
from operators.soda_to_s3_operator import SodaToS3Operator
from operators.s3_to_postgres_operator import S3ToPostgresOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

soda_headers = {
    'keyId': '############',
    'keySecret': '#################',
    'Accept': 'application/json'
}

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=30)
}

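def get_size(**context):
    # Pull the record count pushed by the API task; None means nothing was pushed.
    val = context['ti'].xcom_pull(task_ids='get_evictions_data', key='obj_len')
    return val is not None and val > 0
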
with DAG('eviction-tracker-incremental_load',
         default_args=default_args,
         description='Executes incremental load from SODA API & S3-hosted CSVs into Production DW.',
         max_active_runs=1,
         schedule_interval=None) as dag:

    # Pull records created or updated in the last 31 days from the SODA API into S3
    op1 = SodaToS3Operator(
        task_id='get_evictions_data',
        http_conn_id='API_Evictions',
        headers=soda_headers,
        days_ago=31,
        s3_conn_id='S3_Evictions',
        s3_bucket='sf-evictionmeter',
        s3_directory='soda_jsons',
        size_check=True,
        max_bytes=500000000,
        dag=dag
    )

    # Skip all downstream tasks if the API call returned no records
    op2 = ShortCircuitOperator(
        task_id='check_get_results',
        python_callable=get_size,
        provide_context=True,
        dag=dag
    )

    op3 = PostgresOperator(
        task_id='truncate_target_tables',
        postgres_conn_id='RDS_Evictions',
        sql='sql/trunc_target_tables.sql',
        dag=dag
    )

    op4 = S3ToPostgresOperator(
        task_id='load_evictions_data',
        s3_conn_id='S3_Evictions',
        s3_bucket='sf-evictionmeter',
        s3_prefix='soda_jsons/soda_evictions_import',
        source_data_type='json',
        postgres_conn_id='RDS_Evictions',
        schema='raw',
        table='soda_evictions',
        get_latest=True,
        dag=dag
    )

    op5 = S3ToPostgresOperator(
        task_id='load_neighborhood_data',
        s3_conn_id='S3_Evictions',
        s3_bucket='sf-evictionmeter',
        s3_prefix='census_csv/sf_by_neighborhood',
        source_data_type='csv',
        header=True,
        postgres_conn_id='RDS_Evictions',
        schema='raw',
        table='neighborhood_data',
        get_latest=True,
        dag=dag
    )

    op6 = S3ToPostgresOperator(
        task_id='load_district_data',
        s3_conn_id='S3_Evictions',
        s3_bucket='sf-evictionmeter',
        s3_prefix='census_csv/sf_by_district',
        source_data_type='csv',
        header=True,
        postgres_conn_id='RDS_Evictions',
        schema='raw',
        table='district_data',
        get_latest=True,
        dag=dag
    )

    op7 = PostgresOperator(
        task_id='execute_incremental_load',
        postgres_conn_id='RDS_Evictions',
        sql='sql/incremental_load.sql',
        dag=dag
    )

    op1 >> op2 >> op3 >> (op4, op5, op6) >> op7
================================================
FILE: dags/operators/s3_to_postgres_operator.py
================================================
# echo "" > /home/airflow/airflow/dags/operators/s3_to_postgres_operator.py
# nano /home/airflow/airflow/dags/operators/s3_to_postgres_operator.py
from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.hooks.S3_hook import S3Hook
from airflow.hooks.postgres_hook import PostgresHook
import json
import io
from contextlib import closing


class S3ToPostgresOperator(BaseOperator):
    """
    Collects data from a file hosted on AWS S3 and loads it into a Postgres table.
    Current version supports JSON and CSV sources but requires pre-defined data model.

    :param s3_conn_id: S3 Connection ID
    :param s3_bucket: S3 Bucket Destination
    :param s3_prefix: S3 File Prefix
    :param source_data_type: S3 Source File data type
    :param header: Toggles ignore header for CSV source type
    :param postgres_conn_id: Postgres Connection ID
    :param schema: Postgres Target Schema
    :param table: Postgres Target Table
    :param get_latest: if True, pulls from last modified file in S3 path
    """

    @apply_defaults
    def __init__(self,
                 s3_conn_id=None,
                 s3_bucket=None,
                 s3_prefix='',
                 source_data_type='',
                 postgres_conn_id='postgres_default',
                 header=False,
                 schema='public',
                 table='raw_load',
                 get_latest=False,
                 *args,
                 **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.s3_conn_id = s3_conn_id
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix
        self.source_data_type = source_data_type
        self.postgres_conn_id = postgres_conn_id
        self.header = header
        self.schema = schema
        self.table = table
        self.get_latest = get_latest

    def execute(self, context):
        """
        Executes the operator.
        """
        s3_hook = S3Hook(self.s3_conn_id)
        s3_session = s3_hook.get_session()
        s3_client = s3_session.client('s3')

        if self.get_latest:
            # Pick the most recently modified object under the prefix
            objects = s3_client.list_objects_v2(Bucket=self.s3_bucket, Prefix=self.s3_prefix)['Contents']
            latest = max(objects, key=lambda x: x['LastModified'])
            s3_key = latest['Key']
        else:
            # Assumption: without get_latest, s3_prefix is treated as the full object key
            s3_key = self.s3_prefix

        s3_obj = s3_client.get_object(Bucket=self.s3_bucket, Key=s3_key)
        file_content = s3_obj['Body'].read().decode('utf-8')

        pg_hook = PostgresHook(self.postgres_conn_id)

        if self.source_data_type == 'json':
            print('inserting json object...')
            json_content = json.loads(file_content)

            schema = self.schema
            if isinstance(self.schema, tuple):
                schema = self.schema[0]
            table = self.table
            if isinstance(self.table, tuple):
                table = self.table[0]

            target_fields = ['raw_id','created_at','updated_at','eviction_id','address','city','state',
                             'zip','file_date','non_payment','breach','nuisance','illegal_use','failure_to_sign_renewal',
                             'access_denial','unapproved_subtenant','owner_move_in','demolition','capital_improvement',
                             'substantial_rehab','ellis_act_withdrawal','condo_conversion','roommate_same_unit',
                             'other_cause','late_payments','lead_remediation','development','good_samaritan_ends',
                             'constraints_date','supervisor_district','neighborhood']
            target_fields = ','.join(target_fields)

            with closing(pg_hook.get_conn()) as conn:
                with closing(conn.cursor()) as cur:
                    # System fields (:id, :created_at, :updated_at) are always present;
                    # all other fields fall back to NULL when missing from a record.
                    cur.executemany(
                        f"""INSERT INTO {schema}.{table} ({target_fields})
                        VALUES(
                        %(:id)s, %(:created_at)s, %(:updated_at)s, %(eviction_id)s, %(address)s, %(city)s, %(state)s, %(zip)s,
                        %(file_date)s, %(non_payment)s, %(breach)s, %(nuisance)s, %(illegal_use)s, %(failure_to_sign_renewal)s,
                        %(access_denial)s, %(unapproved_subtenant)s, %(owner_move_in)s, %(demolition)s, %(capital_improvement)s,
                        %(substantial_rehab)s, %(ellis_act_withdrawal)s, %(condo_conversion)s, %(roommate_same_unit)s,
                        %(other_cause)s, %(late_payments)s, %(lead_remediation)s, %(development)s, %(good_samaritan_ends)s,
                        %(constraints_date)s, %(supervisor_district)s, %(neighborhood)s
                        );
                        """, ({
                            ':id': line[':id'], ':created_at': line[':created_at'], ':updated_at': line[':updated_at'],
                            'eviction_id': line['eviction_id'], 'address': line.get('address', None), 'city': line.get('city', None),
                            'state': line.get('state', None),'zip': line.get('zip', None),'file_date': line.get('file_date', None),
                            'non_payment': line.get('non_payment', None),'breach': line.get('breach', None),
                            'nuisance': line.get('nuisance', None),'illegal_use': line.get('illegal_use', None),
                            'failure_to_sign_renewal': line.get('failure_to_sign_renewal', None),
                            'access_denial': line.get('access_denial', None),'unapproved_subtenant': line.get('unapproved_subtenant', None),
                            'owner_move_in': line.get('owner_move_in', None),'demolition': line.get('demolition', None),
                            'capital_improvement': line.get('capital_improvement', None),
                            'substantial_rehab': line.get('substantial_rehab', None),'ellis_act_withdrawal': line.get('ellis_act_withdrawal', None),
                            'condo_conversion': line.get('condo_conversion', None),'roommate_same_unit': line.get('roommate_same_unit', None),
                            'other_cause': line.get('other_cause', None),'late_payments': line.get('late_payments', None),
                            'lead_remediation': line.get('lead_remediation', None),'development': line.get('development', None),
                            'good_samaritan_ends': line.get('good_samaritan_ends', None),'constraints_date': line.get('constraints_date', None),
                            'supervisor_district': line.get('supervisor_district', None),'neighborhood': line.get('neighborhood', None)
                        } for line in json_content))
                conn.commit()

        if self.source_data_type == 'csv':
            print('inserting csv...')
            file = io.StringIO(file_content)

            sql = "COPY %s FROM STDIN DELIMITER ','"
            if self.header:
                sql = "COPY %s FROM STDIN DELIMITER ',' CSV HEADER"

            schema = self.schema
            if isinstance(self.schema, tuple):
                schema = self.schema[0]
            table = self.table
            if isinstance(self.table, tuple):
                table = self.table[0]
            table = f'{schema}.{table}'

            with closing(pg_hook.get_conn()) as conn:
                with closing(conn.cursor()) as cur:
                    cur.copy_expert(sql=sql % table, file=file)
                conn.commit()

        print('inserting complete...')
================================================
FILE: dags/operators/soda_to_s3_operator.py
================================================
# echo "" > /home/airflow/airflow/dags/operators/soda_to_s3_operator.py
# nano /home/airflow/airflow/dags/operators/soda_to_s3_operator.py
from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.hooks.http_hook import HttpHook
from airflow.hooks.S3_hook import S3Hook
from datetime import datetime, timedelta
import json
import sys


class SizeExceededError(Exception):
    """Raised when max file size is exceeded"""

    def __init__(self):
        self.message = 'Max file size exceeded'
        super().__init__(self.message)

    def __str__(self):
        return f'SizeExceededError, {self.message}'


class SodaToS3Operator(BaseOperator):
    """
    Queries the Socrata Open Data API using a SoQL string and uploads the results to an S3 bucket.

    :param endpoint: Optional API connection endpoint
    :param data: Custom Socrata SoQL string used to query API, overrides default get request
    :param days_ago: Restricts get request to updated/created records from specified date onward
    :param headers: Dictionary containing optional API connection keys (keyId, keySecret, Accept)
    :param s3_conn_id: S3 Connection ID
    :param s3_bucket: S3 Bucket Destination
    :param s3_directory: S3 Directory Destination
    :param method: Request type for API
    :param http_conn_id: SODA API Connection ID
    :param size_check: Boolean indicating whether to run a size check prior to upload to S3
    :param max_bytes: Maximum number of bytes to allow for a single S3 upload
    """

    @apply_defaults
    def __init__(self,
                 endpoint=None,
                 data=None,
                 days_ago=None,
                 headers=None,
                 s3_conn_id=None,
                 s3_bucket=None,
                 s3_directory='',
                 method='GET',
                 http_conn_id='http_default',
                 size_check=False,
                 max_bytes=5000000000,
                 *args,
                 **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.endpoint = endpoint
        self.data = data
        self.days_ago = days_ago
        self.s3_conn_id = s3_conn_id
        self.s3_bucket = s3_bucket
        self.s3_directory = s3_directory
        self.headers = headers
        self.method = method
        self.http_conn_id = http_conn_id
        self.size_check = size_check
        self.max_bytes = max_bytes

    def get_size(self, obj, seen=None):
        """
        Recursively finds size of object.
        """
        size = sys.getsizeof(obj)
        if seen is None:
            seen = set()
        obj_id = id(obj)
        # Skip objects already counted to avoid double counting and cycles
        if obj_id in seen:
            return 0
        seen.add(obj_id)
        if isinstance(obj, dict):
            size += sum([self.get_size(v, seen) for v in obj.values()])
            size += sum([self.get_size(k, seen) for k in obj.keys()])
        elif hasattr(obj, '__dict__'):
            size += self.get_size(obj.__dict__, seen)
        elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
            size += sum([self.get_size(i, seen) for i in obj])
        return size

    def parse_metadata(self, header):
        """
        Parses metadata from API response.
        """
        try:
            metadata = {
                'api-call-date': header['Date'],
                'content-type': header['Content-Type'],
                'source-last-modified': header['X-SODA2-Truth-Last-Modified'],
                'fields': header['X-SODA2-Fields'],
                'types': header['X-SODA2-Types']
            }
        except KeyError:
            metadata = {'KeyError': 'Metadata missing from header, see error log.'}
        return metadata

    def execute(self, context):
        """
        Executes the operator, including running a max filesize check if enabled.
        The SODA API maxes out the # of returned results so we use paging to query
        the endpoint multiple times and continuously move the offset.
        Metadata is parsed and saved separately in a /logs/ subfolder along with
        the JSON results from API call.
        """
        soda = HttpHook(method=self.method, http_conn_id=self.http_conn_id)

        if self.data:
            soql_filter = self.data
        elif self.days_ago:
            current_dt = datetime.now()
            target_dt = current_dt - timedelta(self.days_ago)
            format_dt = target_dt.strftime("%Y-%m-%d")
            soql_filter = f"""$query=SELECT:*,* WHERE :created_at > '{format_dt}' OR :updated_at > '{format_dt}'
                ORDER BY :id LIMIT 10000"""
        else:
            soql_filter = """$query=SELECT:*,* ORDER BY :id LIMIT 10000"""
        print('getting... ' + soql_filter)
        #soql_filter = f"""$query=SELECT:*,* WHERE :created_at < '2020-04-01' ORDER BY :id LIMIT 10000"""

        # Page through the results 10000 rows at a time until an empty page is returned
        offset, counter = 0, 1
        combined = []
        while True:
            soql_filter_offset = soql_filter + f' OFFSET {offset}'
            response = soda.run(endpoint=self.endpoint, data=soql_filter_offset, headers=self.headers)
            if response.status_code != 200:
                break
            captured = response.json()
            if len(captured) == 0:
                break
            combined.extend(captured)
            offset = 10000 * counter
            counter += 1

        if self.size_check:
            actual_size = self.get_size(combined)
            print('actual size... ' + str(actual_size))
            print('max size... ' + str(self.max_bytes))
            if actual_size > self.max_bytes:
                raise SizeExceededError

        dest_s3 = S3Hook(self.s3_conn_id)
        body_obj = 'soda_evictions_import_' + datetime.now().strftime("%Y-%m-%dT%H%M%S") + '.json'
        metadata = self.parse_metadata(response.headers)
        meta_obj = 'logs/soda_evictions_import_log_' + datetime.now().strftime("%Y-%m-%dT%H%M%S")
        dest_s3.load_string(json.dumps(combined), key=self.s3_directory+'/'+body_obj, bucket_name=self.s3_bucket)
        dest_s3.load_string(json.dumps(metadata), key=self.s3_directory+'/'+meta_obj, bucket_name=self.s3_bucket)

        # XCom used to skip downstream tasks if body object size is 0
        self.xcom_push(context=context, key='obj_len', value=len(combined))
================================================
FILE: dags/sql/full_load.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/full_load.sql
-- nano /home/airflow/airflow/dags/sql/full_load.sql
-- Populate District Dimension
INSERT INTO staging.dim_district (district_key, district)
SELECT -1, 'Unknown';
INSERT INTO staging.dim_district (district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty)
SELECT
district,
population::int,
households::int,
perc_asian::numeric as percent_asian,
perc_black::numeric as percent_black,
perc_white::numeric as percent_white,
perc_nat_am::numeric as percent_native_am,
perc_nat_pac::numeric as percent_pacific_isle,
perc_other::numeric as percent_other_race,
perc_latin::numeric as percent_latin,
median_age::numeric,
total_units::int,
perc_owner_occupied::numeric as percent_owner_occupied,
perc_renter_occupied::numeric as percent_renter_occupied,
median_rent_as_perc_of_income::numeric,
median_household_income::numeric,
median_family_income::numeric,
per_capita_income::numeric,
perc_in_poverty::numeric as percent_in_poverty
FROM raw.district_data;
-- Populate Neighborhood Dimension
INSERT INTO staging.dim_neighborhood (neighborhood_key, neighborhood)
SELECT -1, 'Unknown';
INSERT INTO staging.dim_neighborhood (neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white,
percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty)
SELECT
acs_name as neighborhood,
db_name as neighborhood_alt_name,
population::int,
households::int,
perc_asian::numeric as percent_asian,
perc_black::numeric as percent_black,
perc_white::numeric as percent_white,
perc_nat_am::numeric as percent_native_am,
perc_nat_pac::numeric as percent_pacific_isle,
perc_other::numeric as percent_other_race,
perc_latin::numeric as percent_latin,
median_age::numeric,
total_units::int,
perc_owner_occupied::numeric as percent_owner_occupied,
perc_renter_occupied::numeric as percent_renter_occupied,
median_rent_as_perc_of_income::numeric,
median_household_income::numeric,
median_family_income::numeric,
per_capita_income::numeric,
perc_in_poverty::numeric as percent_in_poverty
FROM raw.neighborhood_data;
-- Populate Location Dimension
INSERT INTO staging.dim_location (location_key, city, state, zip_code)
SELECT -1, 'Unknown', 'Unknown', 'Unknown';
INSERT INTO staging.dim_location (city, state, zip_code)
SELECT DISTINCT
COALESCE(city, 'Unknown') as city,
COALESCE(state, 'Unknown') as state,
COALESCE(zip, 'Unknown') as zip_code
FROM raw.soda_evictions
WHERE
city IS NOT NULL OR state IS NOT NULL OR zip IS NOT NULL;
-- Populate Reason Dimension
INSERT INTO staging.dim_reason (reason_key, reason_code, reason_desc)
VALUES (-1, 'Unknown', 'Unknown');
INSERT INTO staging.dim_reason (reason_code, reason_desc)
VALUES ('non_payment', 'Non-Payment'),
('breach', 'Breach'),
('nuisance', 'Nuisance'),
('illegal_use', 'Illegal Use'),
('failure_to_sign_renewal', 'Failure to Sign Renewal'),
('access_denial', 'Access Denial'),
('unapproved_subtenant', 'Unapproved Subtenant'),
('owner_move_in', 'Owner Move-In'),
('demolition', 'Demolition'),
('capital_improvement', 'Capital Improvement'),
('substantial_rehab', 'Substantial Rehab'),
('ellis_act_withdrawal', 'Ellis Act Withdrawal'),
('condo_conversion', 'Condo Conversion'),
('roommate_same_unit', 'Roommate Same Unit'),
('other_cause', 'Other Cause'),
('late_payments', 'Late Payments'),
('lead_remediation', 'Lead Remediation'),
('development', 'Development'),
('good_samaritan_ends', 'Good Samaritan Ends');
-- Populate Reason Bridge Table
SELECT
ROW_NUMBER() OVER(ORDER BY concat_reason) as group_key,
string_to_array(concat_reason, '|') as reason_array,
concat_reason
INTO TEMP tmp_reason_group
FROM (
SELECT DISTINCT
TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)) as concat_reason
FROM (
SELECT
eviction_id,
CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
as concat_reason
FROM raw.soda_evictions
) f1
) f2;
INSERT INTO staging.br_reason_group (reason_group_key, reason_key)
SELECT DISTINCT
group_key as reason_group_key,
reason_key
FROM (SELECT group_key, unnest(reason_array) unnested FROM tmp_reason_group) grp
JOIN staging.dim_reason r ON r.reason_code = grp.unnested;
-- Populate Date Dimension Table
INSERT INTO staging.dim_date (date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week,
formatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday,
period, cw_start, cw_end, month_start, month_end)
SELECT -1, '1900-01-01', -1, -1, 'Unknown', -1, -1, 'Unknown', -1, 'Unknown', 'Unknown', 'Unknown', 'Unknown',
'Unknown', 'Unknown', 'Unknown', 'Unknown', '1900-01-01', '1900-01-01', '1900-01-01', '1900-01-01';
INSERT INTO staging.dim_date (date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week,
formatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday,
period, cw_start, cw_end, month_start, month_end)
SELECT
TO_CHAR(datum, 'yyyymmdd')::int as date_key,
datum as date,
EXTRACT(YEAR FROM datum) as year,
EXTRACT(MONTH FROM datum) as month,
TO_CHAR(datum, 'TMMonth') as month_name,
EXTRACT(DAY FROM datum) as day,
EXTRACT(doy FROM datum) as day_of_year,
TO_CHAR(datum, 'TMDay') as weekday_name,
EXTRACT(week FROM datum) as calendar_week,
TO_CHAR(datum, 'dd. mm. yyyy') as formatted_date,
'Q' || TO_CHAR(datum, 'Q') as quartal,
TO_CHAR(datum, 'yyyy/"Q"Q') as year_quartal,
TO_CHAR(datum, 'yyyy/mm') as year_month,
TO_CHAR(datum, 'iyyy/IW') as year_calendar_week,
CASE WHEN EXTRACT(isodow FROM datum) IN (6, 7) THEN 'Weekend' ELSE 'Weekday' END as weekend,
CASE WHEN TO_CHAR(datum, 'MMDD') IN ('0101', '0704', '1225', '1226') THEN 'Holiday' ELSE 'No holiday' END
as us_holiday,
CASE WHEN TO_CHAR(datum, 'MMDD') BETWEEN '0701' AND '0831' THEN 'Summer break'
WHEN TO_CHAR(datum, 'MMDD') BETWEEN '1115' AND '1225' THEN 'Christmas season'
WHEN TO_CHAR(datum, 'MMDD') > '1225' OR TO_CHAR(datum, 'MMDD') <= '0106' THEN 'Winter break'
ELSE 'Normal' END
as period,
datum + (1 - EXTRACT(isodow FROM datum))::int as cw_start,
datum + (7 - EXTRACT(isodow FROM datum))::int as cw_end,
datum + (1 - EXTRACT(DAY FROM datum))::int as month_start,
(datum + (1 - EXTRACT(DAY FROM datum))::int + '1 month'::interval)::date - '1 day'::interval as month_end
FROM (
SELECT '1997-01-01'::date + SEQUENCE.DAY as datum
FROM generate_series(0,10956) as SEQUENCE(DAY)
GROUP BY SEQUENCE.DAY
) DQ;
-- Populate Evictions Fact Table
SELECT
eviction_id,
group_key as reason_group_key
INTO tmp_reason_facts
FROM (
SELECT
eviction_id,
TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)) as concat_reason
FROM (
SELECT
eviction_id,
CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
as concat_reason
FROM raw.soda_evictions
) grp
) f_grp
JOIN tmp_reason_group t_grp ON f_grp.concat_reason = t_grp.concat_reason;
INSERT INTO staging.fact_evictions (eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key,
constraints_date_key, street_address)
SELECT
f.eviction_id as eviction_key,
COALESCE(d.district_key, -1) as district_key,
COALESCE(n.neighborhood_key, -1) as neighborhood_key,
COALESCE(l.location_key, -1) as location_key,
reason_group_key,
COALESCE(dt1.date_key, -1) as file_date_key,
COALESCE(dt2.date_key, -1) as constraints_date_key,
f.address as street_address
FROM raw.soda_evictions f
LEFT JOIN tmp_reason_facts r ON f.eviction_id = r.eviction_id
LEFT JOIN staging.dim_district d ON f.supervisor_district = d.district
LEFT JOIN staging.dim_neighborhood n ON f.neighborhood = n.neighborhood_alt_name
LEFT JOIN staging.dim_location l
ON COALESCE(f.city, 'Unknown') = l.city
AND COALESCE(f.state, 'Unknown') = l.state
AND COALESCE(f.zip, 'Unknown') = l.zip_code
LEFT JOIN staging.dim_date dt1 ON f.file_date = dt1.date
LEFT JOIN staging.dim_date dt2 ON f.constraints_date = dt2.date;
DROP TABLE tmp_reason_group;
DROP TABLE tmp_reason_facts;
-- Migrate to Production Schema
INSERT INTO prod.dim_district
(district_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty)
SELECT
district_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty
FROM staging.dim_district;
INSERT INTO prod.dim_neighborhood
(neighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white,
percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty)
SELECT
neighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white,
percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty
FROM staging.dim_neighborhood;
INSERT INTO prod.dim_location (location_key, city, state, zip_code)
SELECT location_key, city, state, zip_code
FROM staging.dim_location;
INSERT INTO prod.dim_reason (reason_key, reason_code, reason_desc)
SELECT reason_key, reason_code, reason_desc
FROM staging.dim_reason;
INSERT INTO prod.br_reason_group (reason_group_key, reason_key)
SELECT reason_group_key, reason_key
FROM staging.br_reason_group;
INSERT INTO prod.dim_date
(date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week,
formatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday,
period, cw_start, cw_end, month_start, month_end)
SELECT
date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week,
formatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday, period,
cw_start, cw_end, month_start, month_end
FROM staging.dim_date;
INSERT INTO prod.fact_evictions
(eviction_key, district_key, neighborhood_key, location_key, reason_group_key,
file_date_key, constraints_date_key, street_address)
SELECT
eviction_key, district_key, neighborhood_key, location_key, reason_group_key,
file_date_key, constraints_date_key, street_address
FROM staging.fact_evictions;
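-- Optional post-load check (illustrative sketch, not part of the original pipeline).
-- It only reads from staging and prod and changes nothing. Assuming the prod tables
-- were empty before this script ran, the two counts should match.
SELECT
(SELECT COUNT(*) FROM staging.fact_evictions) AS staging_rows,
(SELECT COUNT(*) FROM prod.fact_evictions) AS prod_rows;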
================================================
FILE: dags/sql/incremental_load.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/incremental_load.sql
-- nano /home/airflow/airflow/dags/sql/incremental_load.sql
-- Populate District Dimension
INSERT INTO staging.dim_district
(district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty)
SELECT
district,
population::int,
households::int,
perc_asian::numeric as percent_asian,
perc_black::numeric as percent_black,
perc_white::numeric as percent_white,
perc_nat_am::numeric as percent_native_am,
perc_nat_pac::numeric as percent_pacific_isle,
perc_other::numeric as percent_other_race,
perc_latin::numeric as percent_latin,
median_age::numeric,
total_units::int,
perc_owner_occupied::numeric as percent_owner_occupied,
perc_renter_occupied::numeric as percent_renter_occupied,
median_rent_as_perc_of_income::numeric,
median_household_income::numeric,
median_family_income::numeric,
per_capita_income::numeric,
perc_in_poverty::numeric as percent_in_poverty
FROM raw.district_data
ON CONFLICT (district) DO UPDATE SET
population = EXCLUDED.population,
households = EXCLUDED.households,
percent_asian = EXCLUDED.percent_asian,
percent_black = EXCLUDED.percent_black,
percent_white = EXCLUDED.percent_white,
percent_native_am = EXCLUDED.percent_native_am,
percent_pacific_isle = EXCLUDED.percent_pacific_isle,
percent_other_race = EXCLUDED.percent_other_race,
percent_latin = EXCLUDED.percent_latin,
median_age = EXCLUDED.median_age,
total_units = EXCLUDED.total_units,
percent_owner_occupied = EXCLUDED.percent_owner_occupied,
percent_renter_occupied = EXCLUDED.percent_renter_occupied,
median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
median_household_income = EXCLUDED.median_household_income,
median_family_income = EXCLUDED.median_family_income,
per_capita_income = EXCLUDED.per_capita_income,
percent_in_poverty = EXCLUDED.percent_in_poverty;
-- Populate Neighborhood Dimension
INSERT INTO staging.dim_neighborhood
(neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white,
percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty)
SELECT
acs_name as neighborhood,
db_name as neighborhood_alt_name,
population::int,
households::int,
perc_asian::numeric as percent_asian,
perc_black::numeric as percent_black,
perc_white::numeric as percent_white,
perc_nat_am::numeric as percent_native_am,
perc_nat_pac::numeric as percent_pacific_isle,
perc_other::numeric as percent_other_race,
perc_latin::numeric as percent_latin,
median_age::numeric,
total_units::int,
perc_owner_occupied::numeric as percent_owner_occupied,
perc_renter_occupied::numeric as percent_renter_occupied,
median_rent_as_perc_of_income::numeric,
median_household_income::numeric,
median_family_income::numeric,
per_capita_income::numeric,
perc_in_poverty::numeric as percent_in_poverty
FROM raw.neighborhood_data
ON CONFLICT (neighborhood) DO UPDATE SET
neighborhood_alt_name = EXCLUDED.neighborhood_alt_name,
population = EXCLUDED.population,
households = EXCLUDED.households,
percent_asian = EXCLUDED.percent_asian,
percent_black = EXCLUDED.percent_black,
percent_white = EXCLUDED.percent_white,
percent_native_am = EXCLUDED.percent_native_am,
percent_pacific_isle = EXCLUDED.percent_pacific_isle,
percent_other_race = EXCLUDED.percent_other_race,
percent_latin = EXCLUDED.percent_latin,
median_age = EXCLUDED.median_age,
total_units = EXCLUDED.total_units,
percent_owner_occupied = EXCLUDED.percent_owner_occupied,
percent_renter_occupied = EXCLUDED.percent_renter_occupied,
median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
median_household_income = EXCLUDED.median_household_income,
median_family_income = EXCLUDED.median_family_income,
per_capita_income = EXCLUDED.per_capita_income,
percent_in_poverty = EXCLUDED.percent_in_poverty;
-- Populate Location Dimension
INSERT INTO staging.dim_location (city, state, zip_code)
SELECT
se.city,
se.state,
se.zip_code
FROM (
SELECT DISTINCT
COALESCE(city, 'Unknown') as city,
COALESCE(state, 'Unknown') as state,
COALESCE(zip, 'Unknown') as zip_code
FROM raw.soda_evictions
) se
LEFT JOIN staging.dim_location dl
ON se.city = dl.city
AND se.state = dl.state
AND se.zip_code = dl.zip_code
WHERE
dl.location_key IS NULL;
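-- Populate Reason Bridge Table
-- A reason group is identified by the sorted array of its reason_keys.
-- 1. Collect the existing groups from the bridge table.
-- 2. Derive candidate groups from the reason flags in the raw data.
-- 3. Insert only candidates whose array matches no existing group, under new keys.
-- (GROUP BY already yields one row per reason_group_key, so DISTINCT is not needed.)
SELECT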
reason_group_key,
ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_existing_reason_groups
FROM staging.br_reason_group
GROUP BY reason_group_key;
SELECT
concat_reason,
ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_new_reason_groups
FROM (
SELECT DISTINCT
string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|') as concat_reason,
unnest(string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|')) unnested_reason
FROM (
SELECT DISTINCT
CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
as concat_reason
FROM raw.soda_evictions
) se1
GROUP BY concat_reason
) se2
JOIN staging.dim_reason r ON se2.unnested_reason = r.reason_code
GROUP BY concat_reason;
INSERT INTO staging.br_reason_group (reason_group_key, reason_key)
SELECT
final_grp.max_key + new_grp.tmp_group_key as reason_group_key,
new_grp.reason_key as reason_key
FROM (
SELECT DISTINCT
ROW_NUMBER() OVER(ORDER BY concat_reason) as tmp_group_key,
concat_reason,
unnest(n.rk_array) as reason_key
FROM tmp_new_reason_groups n
LEFT JOIN tmp_existing_reason_groups e ON n.rk_array = e.rk_array
WHERE e.reason_group_key IS NULL
) new_grp
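-- COALESCE guards against an empty bridge table, where MAX() would return NULL
-- and every new reason_group_key would be NULL as well.
LEFT JOIN (SELECT COALESCE(MAX(reason_group_key), 0) max_key FROM staging.br_reason_group) final_grp ON 1=1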
ORDER BY reason_group_key, reason_key;
DROP TABLE tmp_existing_reason_groups;
DROP TABLE tmp_new_reason_groups;
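-- Populate Staging Fact Table
-- tmp_existing_reason_groups is rebuilt here because it was dropped above and the
-- bridge table may now contain newly inserted groups. Each eviction is then mapped
-- to its reason_group_key by matching sorted reason_key arrays.
SELECT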
reason_group_key,
ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_existing_reason_groups
FROM staging.br_reason_group
GROUP BY reason_group_key;
SELECT
eviction_id,
ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_fct_reason_groups
FROM (
SELECT
eviction_id,
unnest(string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|')) as unnested_reason
FROM (
SELECT
eviction_id,
CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
as concat_reason
FROM raw.soda_evictions
) se1
) se2
JOIN staging.dim_reason r ON se2.unnested_reason = r.reason_code
GROUP BY se2.eviction_id;
SELECT
eviction_id,
reason_group_key
INTO TEMP tmp_reason_group_lookup
FROM tmp_fct_reason_groups f
JOIN tmp_existing_reason_groups d ON f.rk_array = d.rk_array;
TRUNCATE TABLE staging.fact_evictions;
INSERT INTO staging.fact_evictions
(eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, constraints_date_key, street_address)
SELECT
f.eviction_id as eviction_key,
COALESCE(d.district_key, -1) as district_key,
COALESCE(n.neighborhood_key, -1) as neighborhood_key,
COALESCE(l.location_key, -1) as location_key,
r.reason_group_key as reason_group_key,
COALESCE(dt1.date_key, -1) as file_date_key,
COALESCE(dt2.date_key, -1) as constraints_date_key,
f.address as street_address
FROM raw.soda_evictions f
JOIN tmp_reason_group_lookup r ON f.eviction_id = r.eviction_id
LEFT JOIN staging.dim_district d ON f.supervisor_district = d.district
LEFT JOIN staging.dim_neighborhood n ON f.neighborhood = n.neighborhood_alt_name
LEFT JOIN staging.dim_location l
ON COALESCE(f.city, 'Unknown') = l.city
AND COALESCE(f.state, 'Unknown') = l.state
AND COALESCE(f.zip, 'Unknown') = l.zip_code
LEFT JOIN staging.dim_date dt1 ON f.file_date = dt1.date
LEFT JOIN staging.dim_date dt2 ON f.constraints_date = dt2.date;
DROP TABLE tmp_existing_reason_groups;
DROP TABLE tmp_fct_reason_groups;
DROP TABLE tmp_reason_group_lookup;
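-- Merge Into Production Schema
-- Dimensions and facts are upserted on their keys. prod.dim_reason and prod.dim_date
-- are not touched here; they are populated by the full load.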
INSERT INTO prod.dim_district
(district_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty)
SELECT
district_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty
FROM staging.dim_district
ON CONFLICT(district_key) DO UPDATE SET
district = EXCLUDED.district,
population = EXCLUDED.population,
households = EXCLUDED.households,
percent_asian = EXCLUDED.percent_asian,
percent_black = EXCLUDED.percent_black,
percent_white = EXCLUDED.percent_white,
percent_native_am = EXCLUDED.percent_native_am,
percent_pacific_isle = EXCLUDED.percent_pacific_isle,
percent_other_race = EXCLUDED.percent_other_race,
percent_latin = EXCLUDED.percent_latin,
median_age = EXCLUDED.median_age,
total_units = EXCLUDED.total_units,
percent_owner_occupied = EXCLUDED.percent_owner_occupied,
percent_renter_occupied = EXCLUDED.percent_renter_occupied,
median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
median_household_income = EXCLUDED.median_household_income,
median_family_income = EXCLUDED.median_family_income,
per_capita_income = EXCLUDED.per_capita_income,
percent_in_poverty = EXCLUDED.percent_in_poverty;
INSERT INTO prod.dim_neighborhood
(neighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white,
percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty)
SELECT
neighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white,
percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units,
percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income,
median_family_income, per_capita_income, percent_in_poverty
FROM staging.dim_neighborhood
ON CONFLICT (neighborhood_key) DO UPDATE SET
neighborhood = EXCLUDED.neighborhood,
neighborhood_alt_name = EXCLUDED.neighborhood_alt_name,
population = EXCLUDED.population,
households = EXCLUDED.households,
percent_asian = EXCLUDED.percent_asian,
percent_black = EXCLUDED.percent_black,
percent_white = EXCLUDED.percent_white,
percent_native_am = EXCLUDED.percent_native_am,
percent_pacific_isle = EXCLUDED.percent_pacific_isle,
percent_other_race = EXCLUDED.percent_other_race,
percent_latin = EXCLUDED.percent_latin,
median_age = EXCLUDED.median_age,
total_units = EXCLUDED.total_units,
percent_owner_occupied = EXCLUDED.percent_owner_occupied,
percent_renter_occupied = EXCLUDED.percent_renter_occupied,
median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
median_household_income = EXCLUDED.median_household_income,
median_family_income = EXCLUDED.median_family_income,
per_capita_income = EXCLUDED.per_capita_income,
percent_in_poverty = EXCLUDED.percent_in_poverty;
INSERT INTO prod.dim_location (location_key, city, state, zip_code)
SELECT location_key, city, state, zip_code
FROM staging.dim_location
ON CONFLICT (location_key) DO NOTHING;
INSERT INTO prod.br_reason_group (reason_group_key, reason_key)
SELECT stg.reason_group_key, stg.reason_key
FROM staging.br_reason_group stg
LEFT JOIN prod.br_reason_group prd
ON stg.reason_group_key = prd.reason_group_key
AND stg.reason_key = prd.reason_key
WHERE
prd.reason_group_key IS NULL;
INSERT INTO prod.fact_evictions
(eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, constraints_date_key, street_address)
SELECT eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, constraints_date_key, street_address
FROM staging.fact_evictions
ON CONFLICT (eviction_key) DO UPDATE SET
district_key = EXCLUDED.district_key,
neighborhood_key = EXCLUDED.neighborhood_key,
location_key = EXCLUDED.location_key,
reason_group_key = EXCLUDED.reason_group_key,
file_date_key = EXCLUDED.file_date_key,
constraints_date_key = EXCLUDED.constraints_date_key,
street_address = EXCLUDED.street_address;
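-- Optional post-load check (illustrative sketch, not part of the original pipeline).
-- It only reads from prod and changes nothing. It counts fact rows whose
-- reason_group_key has no matching row in the bridge table; a non-zero count would
-- suggest that the bridge and fact merges above are out of sync.
SELECT COUNT(*) AS facts_missing_reason_group
FROM prod.fact_evictions f
WHERE NOT EXISTS (
SELECT 1
FROM prod.br_reason_group b
WHERE b.reason_group_key = f.reason_group_key
);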
================================================
FILE: dags/sql/init_db_schema.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/init_db_schema.sql
-- nano /home/airflow/airflow/dags/sql/init_db_schema.sql
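-- WARNING: this script drops and recreates the raw, staging, and prod schemas,
-- destroying every table and all data they contain.
DROP SCHEMA IF EXISTS raw CASCADE;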
DROP SCHEMA IF EXISTS staging CASCADE;
DROP SCHEMA IF EXISTS prod CASCADE;
CREATE SCHEMA raw;
CREATE SCHEMA staging;
CREATE SCHEMA prod;
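-- Raw
-- Raw tables are UNLOGGED: writes skip the write-ahead log, which speeds up bulk
-- loads, but the tables are emptied after a crash. That trade-off suits landing
-- data that can be reloaded from S3.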
CREATE UNLOGGED TABLE raw.soda_evictions (
raw_id text,
created_at timestamp,
updated_at timestamp,
eviction_id text,
address text,
city text,
state text,
zip text,
file_date timestamp,
non_payment boolean,
breach boolean,
nuisance boolean,
illegal_use boolean,
failure_to_sign_renewal boolean,
access_denial boolean,
unapproved_subtenant boolean,
owner_move_in boolean,
demolition boolean,
capital_improvement boolean,
substantial_rehab boolean,
ellis_act_withdrawal boolean,
condo_conversion boolean,
roommate_same_unit boolean,
other_cause boolean,
late_payments boolean,
lead_remediation boolean,
development boolean,
good_samaritan_ends boolean,
constraints_date timestamp,
supervisor_district text,
neighborhood text
);
CREATE UNLOGGED TABLE raw.district_data (
district text,
population text,
households text,
perc_asian text,
perc_black text,
perc_white text,
perc_nat_am text,
perc_nat_pac text,
perc_other text,
perc_latin text,
median_age text,
total_units text,
perc_owner_occupied text,
perc_renter_occupied text,
median_rent_as_perc_of_income text,
median_household_income text,
median_family_income text,
per_capita_income text,
perc_in_poverty text
);
CREATE UNIQUE INDEX district_name_uniq_idx ON raw.district_data (district);
CREATE UNLOGGED TABLE raw.neighborhood_data (
acs_name text,
db_name text,
population text,
households text,
perc_asian text,
perc_black text,
perc_white text,
perc_nat_am text,
perc_nat_pac text,
perc_other text,
perc_latin text,
median_age text,
total_units text,
perc_owner_occupied text,
perc_renter_occupied text,
median_rent_as_perc_of_income text,
median_household_income text,
median_family_income text,
per_capita_income text,
perc_in_poverty text
);
CREATE UNIQUE INDEX neighborhood_name_uniq_idx ON raw.neighborhood_data (acs_name);
-- Staging
CREATE TABLE staging.dim_district (
district_key serial PRIMARY KEY,
district text,
population integer,
households integer,
percent_asian numeric,
percent_black numeric,
percent_white numeric,
percent_native_am numeric,
percent_pacific_isle numeric,
percent_other_race numeric,
percent_latin numeric,
median_age numeric,
total_units integer,
percent_owner_occupied numeric,
percent_renter_occupied numeric,
median_rent_as_perc_of_income numeric,
median_household_income numeric,
median_family_income numeric,
per_capita_income numeric,
percent_in_poverty numeric
);
CREATE UNIQUE INDEX district_name_uniq_idx ON staging.dim_district (district);
CREATE TABLE staging.dim_neighborhood (
neighborhood_key serial PRIMARY KEY,
neighborhood text,
neighborhood_alt_name text,
population integer,
households integer,
percent_asian numeric,
percent_black numeric,
percent_white numeric,
percent_native_am numeric,
percent_pacific_isle numeric,
percent_other_race numeric,
percent_latin numeric,
median_age numeric,
total_units integer,
percent_owner_occupied numeric,
percent_renter_occupied numeric,
median_rent_as_perc_of_income numeric,
median_household_income numeric,
median_family_income numeric,
per_capita_income numeric,
percent_in_poverty numeric
);
CREATE UNIQUE INDEX neighborhood_name_uniq_idx ON staging.dim_neighborhood (neighborhood);
CREATE TABLE staging.dim_location (
location_key serial PRIMARY KEY,
city text,
state text,
zip_code text
);
CREATE TABLE staging.dim_reason (
reason_key serial PRIMARY KEY,
reason_code text,
reason_desc text
);
CREATE TABLE staging.br_reason_group (
reason_group_key int,
reason_key int
);
CREATE INDEX reason_group_key_idx ON staging.br_reason_group (reason_group_key);
CREATE INDEX reason_key_idx ON staging.br_reason_group (reason_key);
CREATE TABLE staging.dim_date (
date_key int PRIMARY KEY,
date date,
year int,
month int,
month_name text,
day int,
day_of_year int,
weekday_name text,
calendar_week int,
formatted_date text,
quartal text,
year_quartal text,
year_month text,
year_calendar_week text,
weekend text,
us_holiday text,
period text,
cw_start date,
cw_end date,
month_start date,
month_end date
);
CREATE TABLE staging.fact_evictions (
eviction_key text PRIMARY KEY,
location_key int,
district_key int,
neighborhood_key int,
reason_group_key int,
file_date_key int,
constraints_date_key int,
street_address text
);
-- Prod
CREATE TABLE prod.dim_district (
district_key serial PRIMARY KEY,
district text,
population integer,
households integer,
percent_asian numeric,
percent_black numeric,
percent_white numeric,
percent_native_am numeric,
percent_pacific_isle numeric,
percent_other_race numeric,
percent_latin numeric,
median_age numeric,
total_units integer,
percent_owner_occupied numeric,
percent_renter_occupied numeric,
median_rent_as_perc_of_income numeric,
median_household_income numeric,
median_family_income numeric,
per_capita_income numeric,
percent_in_poverty numeric
);
CREATE TABLE prod.dim_neighborhood (
neighborhood_key serial PRIMARY KEY,
neighborhood text,
neighborhood_alt_name text,
population integer,
households integer,
percent_asian numeric,
percent_black numeric,
percent_white numeric,
percent_native_am numeric,
percent_pacific_isle numeric,
percent_other_race numeric,
percent_latin numeric,
median_age numeric,
total_units integer,
percent_owner_occupied numeric,
percent_renter_occupied numeric,
median_rent_as_perc_of_income numeric,
median_household_income numeric,
median_family_income numeric,
per_capita_income numeric,
percent_in_poverty numeric
);
CREATE TABLE prod.dim_location (
location_key serial PRIMARY KEY,
city text,
state text,
zip_code text
);
CREATE TABLE prod.dim_reason (
reason_key serial PRIMARY KEY,
reason_code text,
reason_desc text
);
CREATE TABLE prod.br_reason_group (
reason_group_key int,
reason_key int
);
CREATE INDEX reason_group_key_idx ON prod.br_reason_group (reason_group_key);
CREATE INDEX reason_key_idx ON prod.br_reason_group (reason_key);
CREATE TABLE prod.dim_date (
date_key int PRIMARY KEY,
date date,
year int,
month int,
month_name text,
day int,
day_of_year int,
weekday_name text,
calendar_week int,
formatted_date text,
quartal text,
year_quartal text,
year_month text,
year_calendar_week text,
weekend text,
us_holiday text,
period text,
cw_start date,
cw_end date,
month_start date,
month_end date
);
CREATE TABLE prod.fact_evictions (
eviction_key text PRIMARY KEY,
location_key int,
district_key int,
neighborhood_key int,
reason_group_key int,
file_date_key int,
constraints_date_key int,
street_address text
);
================================================
FILE: dags/sql/trunc_target_tables.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/trunc_target_tables.sql
-- nano /home/airflow/airflow/dags/sql/trunc_target_tables.sql
TRUNCATE TABLE raw.soda_evictions;
TRUNCATE TABLE raw.district_data;
TRUNCATE TABLE raw.neighborhood_data;