Repository: ilya-galperin/SF-EvictionTracker
Branch: master
Commit: e3f0a03131eb
Files: 10
Total size: 60.3 KB

Directory structure:
gitextract_jbid5ch1/

├── README.md
├── airflow_installation.txt
└── dags/
    ├── full_load_dag.py
    ├── incremental_load_dag.py
    ├── operators/
    │   ├── s3_to_postgres_operator.py
    │   └── soda_to_s3_operator.py
    └── sql/
        ├── full_load.sql
        ├── incremental_load.sql
        ├── init_db_schema.sql
        └── trunc_target_tables.sql

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# SF-EvictionTracker

Tracking eviction trends in San Francisco across filing reasons, districts, neighborhoods, and demographics in the months following COVID-19. Data warehouse infrastructure is housed in the AWS ecosystem and uses Apache Airflow for orchestration with public-facing dashboards created using Metabase. 

Questions? Feel free to reach me at ilya.glprn@gmail.com.

Public Dashboard Link: http://sf-evictiontracker-metabase.us-east-1.elasticbeanstalk.com/public/dashboard/f637e470-8ea9-4b03-af80-53988e5b6a9b


<h3>ARCHITECTURE:</h3>

![Architecture](https://i.imgur.com/s2gLBZt.png)
Data is sourced from San Francisco Open Data's API and CSVs containing San Francisco district and neighborhood aggregate census results. Airflow orchestrates its movement to an S3 bucket and into a data warehouse hosted in RDS. SQL scripts are then run to transform the data from its raw form through a staging schema and into production target tables. The presentation layer is created using Metabase, an open-source data visualization tool, and deployed using Elastic Beanstalk.

<h3>DATA MODEL:</h3>

Dimension Tables:
`dim_district`
`dim_neighborhood`
`dim_location`
`dim_reason`
`dim_date`
`br_reason_group`

Fact Tables:
`fact_evictions`

The data model is implemented using a star schema with a bridge table to accommodate any new permutations for the reason dimension. More information on bridge tables can be found here: https://www.kimballgroup.com/2012/02/design-tip-142-building-bridges/
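
As a rough illustration of how a bridge table handles multi-valued reasons (a sketch only, not the project's SQL): each distinct combination of reason flags becomes one reason group, and the bridge holds one row per (group, reason) pair. The sample records and the helper names `reason_group` and `build_bridge` are hypothetical, and only a subset of the reason codes is used.

```python
# Hypothetical sketch: sample records and helper names are illustrative only.
REASON_CODES = ["non_payment", "breach", "nuisance", "owner_move_in"]  # subset of dim_reason


def reason_group(record):
    """Return the pipe-delimited combination of reasons flagged on a record."""
    flagged = [code for code in REASON_CODES if record.get(code) == "true"]
    return "|".join(flagged) if flagged else "Unknown"


def build_bridge(records):
    """Assign a key to each distinct reason combination and emit bridge rows."""
    groups = sorted({reason_group(r) for r in records})
    group_keys = {group: key for key, group in enumerate(groups, start=1)}
    bridge = [
        (group_keys[group], code)
        for group in groups
        for code in group.split("|")
    ]
    return group_keys, bridge


records = [
    {"eviction_id": "A1", "non_payment": "true"},
    {"eviction_id": "A2", "breach": "true", "nuisance": "true"},
    {"eviction_id": "A3", "breach": "true", "nuisance": "true"},
    {"eviction_id": "A4"},
]
group_keys, bridge = build_bridge(records)
```

Records A2 and A3 share one group, so a new combination of reasons only adds rows to the bridge rather than columns to the fact table.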

![Model](https://i.imgur.com/uInBlzR.png)
<h3>ETL FLOW:</h3>

General Overview - 
- Evictions data is collected from the SODA API and moved into an S3 Bucket
- Neighborhood/district census data is stored as a CSV in S3
- Once the API load to S3 is complete, data is loaded into a "raw" schema in RDS and then moves through a staging schema for processing
- ETL job execution is complete once data is moved from the staging schema into the final production tables

DAGs and Custom Airflow Operators -

![Ops](https://i.imgur.com/WTOUiGU.jpg)
![Dag](https://i.imgur.com/yJb3DKT.jpg)

There are 2 DAGs (Directed Acyclic Graphs) used for this project - <b>full load</b> which should be executed on initial setup and <b>incremental load</b> which is scheduled to run daily and pull new data from the Socrata Open Data API.

The incremental load DAG uses XCom to pass the number of records returned by the API call to a ShortCircuitOperator, which skips the downstream tasks if the call produces no results.
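
The short-circuit decision can be sketched without Airflow as follows. `FakeTaskInstance` is a hypothetical stand-in for Airflow's task instance, used only so the example runs on its own, and `has_new_records` is an illustrative name; the `obj_len` key matches the one pushed by the API task.

```python
# Hypothetical stand-in for Airflow's TaskInstance so the sketch runs standalone.
class FakeTaskInstance:
    def __init__(self):
        self._xcoms = {}

    def xcom_push(self, key, value):
        self._xcoms[key] = value

    def xcom_pull(self, key):
        return self._xcoms.get(key)


def has_new_records(**context):
    """Return True only when the upstream task reported at least one record."""
    count = context["ti"].xcom_pull(key="obj_len")
    # A missing XCom (None) is treated like an empty load.
    return bool(count) and count > 0


ti = FakeTaskInstance()
skip_case = has_new_records(ti=ti)       # nothing pushed yet
ti.xcom_push(key="obj_len", value=0)
empty_case = has_new_records(ti=ti)      # API returned no rows
ti.xcom_push(key="obj_len", value=42)
load_case = has_new_records(ti=ti)       # rows available, continue
```

When the callable returns a falsy value, ShortCircuitOperator skips the tasks downstream of it, so the truncate and load steps never run on an empty pull.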

The DAGs use two custom operators. They have been purpose-built for this project but can easily be extended for use in other data pipelines.

1. soda_to_s3_operator: Queries the Socrata Open Data API using a SoQL string and uploads the results to an S3 bucket. Includes an optional check of the source data size that aborts the ETL if the size exceeds a user-defined limit.

2. s3_to_postgres_operator: Collects data from a file hosted on AWS S3 and loads it into a Postgres table. Current version supports JSON and CSV source data types.
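
The paging approach used when querying the SODA API can be sketched as below. This is a simplified illustration under stated assumptions: `fetch_page` is a hypothetical callable standing in for the HTTP request, and the page size of 3 is chosen only to keep the example small (the operator itself uses a `LIMIT` of 10000).

```python
def fetch_all(fetch_page, page_size):
    """Collect every row by requesting pages until an empty page comes back."""
    combined = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        if not page:
            break
        combined.extend(page)
        offset += page_size
    return combined


# Hypothetical in-memory "API" with 7 rows, used in place of a real endpoint.
ROWS = [{"eviction_id": f"M{i}"} for i in range(7)]
calls = []


def fake_fetch_page(limit, offset):
    calls.append(offset)  # record each requested offset for inspection
    return ROWS[offset:offset + limit]


result = fetch_all(fake_fetch_page, page_size=3)
```

The loop needs one extra request that returns an empty page before it knows the data is exhausted, which is why a stable `ORDER BY` in the query matters for consistent pages.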


<h3>INFRASTRUCTURE:</h3>

This project is hosted in the AWS ecosystem and uses the following resources:

![EC2](https://i.imgur.com/jB2X1jI.png)

EC2 -
- t2.medium - dedicated resource for Airflow, managed by AWS Instance Scheduler to complete the daily DAG run and shut off after execution 
- t2.small - used to host Metabase, always online

RDS -
- t2.small - hosts application database for Metabase and the data warehouse

Elastic Beanstalk is used to deploy the Metabase web application.


<h3>DASHBOARD:</h3>

The dashboard is publicly accessible here: http://sf-evictiontracker-metabase.us-east-1.elasticbeanstalk.com/public/dashboard/f637e470-8ea9-4b03-af80-53988e5b6a9b

Some example screengrabs below!

![Dash1](https://i.imgur.com/MZ325PT.jpg)
![Dash2](https://i.imgur.com/OeyOVp0.jpg)
![Dash3](https://i.imgur.com/v6Nwz9l.jpg)


================================================
FILE: airflow_installation.txt
================================================

STEP 1 - Launch EC2 Instance:
- t3.medium
- 12gb storage
- launch-wizard-3 security group to open TCP Port 8080
- associate elastic IP 


STEP 2 - Install Postgres Server on EC2:
run:
sudo apt-get update
sudo apt-get install python-psycopg2
sudo apt-get install postgresql postgresql-contrib


Step 3 - Create OS User airflow
run:
sudo adduser airflow
sudo usermod -aG sudo airflow
su - airflow

Note: From here on, make sure you are logged in as airflow user.


Step 4 - Create Postgres Metadatabase and User Access
run: 
sudo -u postgres psql

in postgres prompt: 
CREATE USER airflow PASSWORD 'password';
CREATE DATABASE airflow;
GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;
\q 


Step 5 - Change Postgres Connection Config
run:
sudo nano /etc/postgresql/10/main/pg_hba.conf

Change this line -
# IPv4 local connections:
host    all             all             127.0.0.1/32         md5
To this line - 
# IPv4 local connections:
host    all             all             0.0.0.0/0            trust

run:
sudo nano /etc/postgresql/10/main/postgresql.conf

Change this line - 
#listen_addresses = 'localhost' # what IP address(es) to listen on
To this line -
listen_addresses = '*' # what IP address(es) to listen on

restart postgres server:
sudo service postgresql restart


Step 6 - Install Airflow

run:
su - airflow
sudo apt-get install python3-pip
sudo python3 -m pip install apache-airflow[postgres,s3,aws]

run:
airflow initdb


Step 7 - Connect Airflow to Postgres

run:
nano /home/airflow/airflow/airflow.cfg

Change lines -
sql_alchemy_conn = postgresql+psycopg2://airflow:password@localhost:5432/airflow
executor = LocalExecutor
load_examples = False

run:
airflow initdb


Step 8 - Add DAGs:
mkdir /home/airflow/airflow/dags/
cd /home/airflow/airflow/dags/
touch tutorial.py
nano tutorial.py


Step 9 - Set up Airflow Webserver and Scheduler to start automatically
We are almost there. The final thing we need to do is to ensure Airflow starts up when your EC2 instance starts.

sudo nano /etc/systemd/system/airflow-webserver.service

Paste the following into the file created above

[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service
Wants=postgresql.service
[Service]
EnvironmentFile=/etc/environment
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/local/bin/airflow webserver
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target

Next we will create the following file to enable scheduler service

sudo nano /etc/systemd/system/airflow-scheduler.service

Paste the following

[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service
Wants=postgresql.service
[Service]
EnvironmentFile=/etc/environment
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/local/bin/airflow scheduler
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target

Next enable and start these services, then check their status

sudo systemctl enable airflow-webserver.service
sudo systemctl enable airflow-scheduler.service
sudo systemctl start airflow-scheduler
sudo systemctl start airflow-webserver
sudo systemctl status airflow-scheduler
sudo systemctl status airflow-webserver


================================================
FILE: dags/full_load_dag.py
================================================
# echo "" > /home/airflow/airflow/dags/full_load_dag.py
# nano /home/airflow/airflow/dags/full_load_dag.py

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from operators.soda_to_s3_operator import SodaToS3Operator
from operators.s3_to_postgres_operator import S3ToPostgresOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

soda_headers = {
    'keyId':'############',
    'keySecret':'#################',
    'Accept':'application/json'
}

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=30)
}

with DAG('eviction-tracker-full_load',
		default_args=default_args,
		description='Executes full load from SODA API to Production DW.',
		max_active_runs=1,
		schedule_interval=None) as dag:
 
	op1 = SodaToS3Operator(
		task_id='get_evictions_data',
		http_conn_id='API_Evictions',
		headers=soda_headers,
		s3_conn_id='S3_Evictions',
		s3_bucket='sf-evictionmeter',
		s3_directory='soda_jsons',
		size_check=True,
		max_bytes=500000000,
		dag=dag
	)
	
	op2 = PostgresOperator(
		task_id='initialize_target_db',
		postgres_conn_id='RDS_Evictions',
		sql='sql/init_db_schema.sql',
		dag=dag
	)
	
	op3 = S3ToPostgresOperator(
		task_id='load_evictions_data',
		s3_conn_id='S3_Evictions',
		s3_bucket='sf-evictionmeter',
		s3_prefix='soda_jsons/soda_evictions_import',
		source_data_type='json',
		postgres_conn_id='RDS_Evictions',
		schema='raw',
		table='soda_evictions',
		get_latest=True,
		dag=dag
	)
	
	op4 = S3ToPostgresOperator(
		task_id='load_neighborhood_data',
		s3_conn_id='S3_Evictions',
		s3_bucket='sf-evictionmeter',
		s3_prefix='census_csv/sf_by_neighborhood',
		source_data_type='csv',
		header=True,
		postgres_conn_id='RDS_Evictions',
		schema='raw',
		table='neighborhood_data',
		get_latest=True,
		dag=dag
	)
	
	op5 = S3ToPostgresOperator(
		task_id='load_district_data',
		s3_conn_id='S3_Evictions',
		s3_bucket='sf-evictionmeter',
		s3_prefix='census_csv/sf_by_district',
		source_data_type='csv',
		header=True,
		postgres_conn_id='RDS_Evictions',
		schema='raw',
		table='district_data',
		get_latest=True,
		dag=dag
	)
	
	op6 = PostgresOperator(
		task_id='execute_full_load',
		postgres_conn_id='RDS_Evictions',
		sql='sql/full_load.sql',
		dag=dag
	)
	
	op1 >> op2 >> (op3, op4, op5) >> op6


================================================
FILE: dags/incremental_load_dag.py
================================================
# echo "" > /home/airflow/airflow/dags/incremental_load_dag.py
# nano /home/airflow/airflow/dags/incremental_load_dag.py

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import ShortCircuitOperator
from operators.soda_to_s3_operator import SodaToS3Operator
from operators.s3_to_postgres_operator import S3ToPostgresOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

soda_headers = {
    'keyId':'############',
    'keySecret':'#################',
    'Accept':'application/json'
}

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=30)
}

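def get_size(**context):
	# Pull the record count pushed by the API task; a missing value counts as an empty load
	val = context['ti'].xcom_pull(task_ids='get_evictions_data', key='obj_len')
	return bool(val) and val > 0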
	

with DAG('eviction-tracker-incremental_load',
	default_args=default_args,
	description='Executes incremental load from SODA API & S3-hosted CSVs into Production DW.',
	max_active_runs=1,
	schedule_interval=None) as dag:
 
	op1 = SodaToS3Operator(
		task_id='get_evictions_data',
		http_conn_id='API_Evictions',
		headers=soda_headers,
		days_ago=31,
		s3_conn_id='S3_Evictions',
		s3_bucket='sf-evictionmeter',
		s3_directory='soda_jsons',
		size_check=True,
		max_bytes=500000000,
		dag=dag
	)
	
	op2 = ShortCircuitOperator(
		task_id='check_get_results',
		python_callable=get_size,
		provide_context=True,
		dag=dag
	)
	
	op3 = PostgresOperator(
		task_id='truncate_target_tables',
		postgres_conn_id='RDS_Evictions',
		sql='sql/trunc_target_tables.sql',
		dag=dag
	)
	
	op4 = S3ToPostgresOperator(
		task_id='load_evictions_data',
		s3_conn_id='S3_Evictions',
		s3_bucket='sf-evictionmeter',
		s3_prefix='soda_jsons/soda_evictions_import',
		source_data_type='json',
		postgres_conn_id='RDS_Evictions',
		schema='raw',
		table='soda_evictions',
		get_latest=True,
		dag=dag
	)
	
	op5 = S3ToPostgresOperator(
		task_id='load_neighborhood_data',
		s3_conn_id='S3_Evictions',
		s3_bucket='sf-evictionmeter',
		s3_prefix='census_csv/sf_by_neighborhood',
		source_data_type='csv',
		header=True,
		postgres_conn_id='RDS_Evictions',
		schema='raw',
		table='neighborhood_data',
		get_latest=True,
		dag=dag
	)
	
	op6 = S3ToPostgresOperator(
		task_id='load_district_data',
		s3_conn_id='S3_Evictions',
		s3_bucket='sf-evictionmeter',
		s3_prefix='census_csv/sf_by_district',
		source_data_type='csv',
		header=True,
		postgres_conn_id='RDS_Evictions',
		schema='raw',
		table='district_data',
		get_latest=True,
		dag=dag
	)
	
	op7 = PostgresOperator(
		task_id='execute_incremental_load',
		postgres_conn_id='RDS_Evictions',
		sql='sql/incremental_load.sql',
		dag=dag
	)
	
	op1 >> op2 >> op3 >> (op4, op5, op6) >> op7


================================================
FILE: dags/operators/s3_to_postgres_operator.py
================================================
# echo "" > /home/airflow/airflow/dags/operators/s3_to_postgres_operator.py
# nano /home/airflow/airflow/dags/operators/s3_to_postgres_operator.py

from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.hooks.S3_hook import S3Hook
from airflow.hooks.postgres_hook import PostgresHook

import json
import io
from contextlib import closing


class S3ToPostgresOperator(BaseOperator):
	""" 
	Collects data from a file hosted on AWS S3 and loads it into a Postgres table. 
	Current version supports JSON and CSV sources but requires pre-defined data model.
	
	:param s3_conn_id:			S3 Connection ID
	:param s3_bucket:			S3 Source Bucket
	:param s3_prefix:			S3 File Prefix
	:param source_data_type:		S3 Source File data type
	:param header:				Toggles ignore header for CSV source type 
	:param postgres_conn_id: 		Postgres Connection ID
	:param schema:				Postgres Target Schema
	:param table:				Postgres Target Table
	:param get_latest:			if True, pulls from last modified file in S3 path
	"""
	
	@apply_defaults
	def __init__(self,
		s3_conn_id=None,
		s3_bucket=None,
		s3_prefix='',
		source_data_type='',
		postgres_conn_id='postgres_default',
		header=False,
		schema='public',
		table='raw_load',
		get_latest=False,		
		*args, 
		**kwargs) -> None: 
		
		super().__init__(*args, **kwargs)
		
		self.s3_conn_id = s3_conn_id
		self.s3_bucket = s3_bucket
		self.s3_prefix = s3_prefix
		self.source_data_type = source_data_type
		self.postgres_conn_id = postgres_conn_id
		self.header = header
		self.schema = schema
		self.table = table
		self.get_latest = get_latest
	
	
	def execute(self, context):
		"""
		Executes the operator.
		"""
		s3_hook = S3Hook(self.s3_conn_id)
		s3_session = s3_hook.get_session()
		s3_client = s3_session.client('s3')
		
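		if self.get_latest:
			# Pick the most recently modified object under the prefix
			objects = s3_client.list_objects_v2(Bucket=self.s3_bucket, Prefix=self.s3_prefix)['Contents']
			latest = max(objects, key=lambda x: x['LastModified'])
			s3_obj = s3_client.get_object(Bucket=self.s3_bucket, Key=latest['Key'])
		else:
			# Assumption: without get_latest, s3_prefix is treated as the full object key
			s3_obj = s3_client.get_object(Bucket=self.s3_bucket, Key=self.s3_prefix)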
			
		file_content = s3_obj['Body'].read().decode('utf-8')
		
		pg_hook = PostgresHook(self.postgres_conn_id)
			
		if self.source_data_type == 'json':
			
			print('inserting json object...')
	
			json_content = json.loads(file_content)		
				
			schema = self.schema
			if isinstance(self.schema, tuple):
				schema = self.schema[0]
			
			table = self.table	
			if isinstance(self.table, tuple):
				table = self.table[0]	
		
			target_fields = ['raw_id','created_at','updated_at','eviction_id','address','city','state',
					'zip','file_date','non_payment','breach','nuisance','illegal_use','failure_to_sign_renewal',
					'access_denial','unapproved_subtenant','owner_move_in','demolition','capital_improvement',
					'substantial_rehab','ellis_act_withdrawal','condo_conversion','roommate_same_unit',
					'other_cause','late_payments','lead_remediation','development','good_samaritan_ends',
					'constraints_date','supervisor_district','neighborhood']
			target_fields = ','.join(target_fields)
			
			with closing(pg_hook.get_conn()) as conn:
				with closing(conn.cursor()) as cur:
						cur.executemany(
							f"""INSERT INTO {schema}.{table} ({target_fields})
							VALUES(
							%(:id)s, %(:created_at)s, %(:updated_at)s, %(eviction_id)s, %(address)s, %(city)s, %(state)s, %(zip)s,
							%(file_date)s, %(non_payment)s, %(breach)s, %(nuisance)s, %(illegal_use)s, %(failure_to_sign_renewal)s,
							%(access_denial)s, %(unapproved_subtenant)s, %(owner_move_in)s, %(demolition)s, %(capital_improvement)s,
							%(substantial_rehab)s, %(ellis_act_withdrawal)s, %(condo_conversion)s, %(roommate_same_unit)s,
							%(other_cause)s, %(late_payments)s, %(lead_remediation)s, %(development)s, %(good_samaritan_ends)s,
							%(constraints_date)s, %(supervisor_district)s, %(neighborhood)s
								);
							""",({
							':id': line[':id'], ':created_at': line[':created_at'], ':updated_at': line[':updated_at'],
							'eviction_id': line['eviction_id'], 'address': line.get('address', None), 'city': line.get('city', None),
							'state': line.get('state', None),'zip': line.get('zip', None),'file_date': line.get('file_date', None),
							'non_payment': line.get('non_payment', None),'breach': line.get('breach', None),
							'nuisance': line.get('nuisance', None),'illegal_use': line.get('illegal_use', None),
							'failure_to_sign_renewal': line.get('failure_to_sign_renewal', None),
							'access_denial': line.get('access_denial', None),'unapproved_subtenant': line.get('unapproved_subtenant', None),
							'owner_move_in': line.get('owner_move_in', None),'demolition': line.get('demolition', None),
							'capital_improvement': line.get('capital_improvement', None),
							'substantial_rehab': line.get('substantial_rehab', None),'ellis_act_withdrawal': line.get('ellis_act_withdrawal', None),
							'condo_conversion': line.get('condo_conversion', None),'roommate_same_unit': line.get('roommate_same_unit', None),
							'other_cause': line.get('other_cause', None),'late_payments': line.get('late_payments', None),
							'lead_remediation': line.get('lead_remediation', None),'development': line.get('development', None),
							'good_samaritan_ends': line.get('good_samaritan_ends', None),'constraints_date': line.get('constraints_date', None),
							'supervisor_district': line.get('supervisor_district', None),'neighborhood': line.get('neighborhood', None)
							 } for line in json_content))
						conn.commit()		
			
			
		if self.source_data_type == 'csv':
			
			print('inserting csv...')

			file = io.StringIO(file_content)
			
			sql = "COPY %s FROM STDIN DELIMITER ','"
			if self.header == True:
				sql = "COPY %s FROM STDIN DELIMITER ',' CSV HEADER"
			
			schema = self.schema
			if isinstance(self.schema, tuple):
				schema = self.schema[0]
			
			table = self.table	
			if isinstance(self.table, tuple):
				table = self.table[0]	
				
			table = f'{schema}.{table}'	
			
			with closing(pg_hook.get_conn()) as conn:
				with closing(conn.cursor()) as cur:
					cur.copy_expert(sql=sql % table, file=file)
					conn.commit()
		
		print('inserting complete...')


================================================
FILE: dags/operators/soda_to_s3_operator.py
================================================
# echo "" > /home/airflow/airflow/dags/operators/soda_to_s3_operator.py
# nano /home/airflow/airflow/dags/operators/soda_to_s3_operator.py

from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.hooks.http_hook import HttpHook
from airflow.hooks.S3_hook import S3Hook

from datetime import datetime, timedelta
import json
import sys


class SizeExceededError(Exception):
	"""Raised when max file size is exceeded"""
	def __init__(self):
		self.message = 'Max file size exceeded'

	def __str__(self):
		return f'SizeExceededError, {self.message}'


class SodaToS3Operator(BaseOperator):
	""" 
	Queries the Socrata Open Data API using a SoQL string and uploads the results to an S3 bucket.
	
	:param endpoint:		Optional API connection endpoint
	:param data:			Custom Socrata SoQL string used to query API, overrides default get request
	:param days_ago:		Restricts get request to updated/created records from specified date onward
	:param headers:			Dictionary containing optional API connection keys (keyId, keySecret, Accept)
	:param s3_conn_id:		S3 Connection ID
	:param s3_bucket:		S3 Bucket Destination
	:param s3_directory:		S3 Directory Destination
	:param method:			Request type for API
	:param http_conn_id:		SODA API Connection ID
	:param size_check:		Boolean indicating whether to run a size check prior to upload to S3
	:param max_bytes:		Maximum number of bytes to allow for a single S3 upload		
	"""
	
	@apply_defaults
	def __init__(self,
		endpoint=None,
		data=None,
		days_ago=None,
		headers=None,
		s3_conn_id=None,
		s3_bucket=None,
		s3_directory='',
		method='GET',
		http_conn_id='http_default',
		size_check=False,
		max_bytes=5000000000,
		*args, 
		**kwargs) -> None: 
		
		super().__init__(*args, **kwargs)
		
		self.endpoint = endpoint
		self.data = data
		self.days_ago = days_ago
		self.s3_conn_id = s3_conn_id
		self.s3_bucket = s3_bucket
		self.s3_directory = s3_directory
		self.headers = headers
		self.method = method
		self.http_conn_id = http_conn_id
		self.size_check = size_check
		self.max_bytes = max_bytes
	
	
	def get_size(self, obj, seen=None):
		"""
		Recursively finds size of object.
		"""
		
		size = sys.getsizeof(obj)
		if seen is None:
			seen = set()
		obj_id = id(obj)
		if obj_id in seen:
			return 0
		seen.add(obj_id)
		if isinstance(obj, dict):
			size += sum([self.get_size(v, seen) for v in obj.values()])
			size += sum([self.get_size(k, seen) for k in obj.keys()])
		elif hasattr(obj, '__dict__'):
			size += self.get_size(obj.__dict__, seen)
		elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
			size += sum([self.get_size(i, seen) for i in obj])
		return size
	
	
	def parse_metadata(self, header):
		"""
		Parses metadata from API response.
		"""
		
		try:
			metadata = {
				'api-call-date': header['Date'],
				'content-type': header['Content-Type'],
				'source-last-modified': header['X-SODA2-Truth-Last-Modified'],
				'fields': header['X-SODA2-Fields'],
				'types': header['X-SODA2-Types']
			}
		except KeyError:
			metadata = {'KeyError': 'Metadata missing from header, see error log.'}
    
		return metadata

	
	def execute(self, context):
		"""
		Executes the operator, including running a max filesize check if enabled. 
		
		The SODA API maxes out the # of returned results so we use paging to query
		the endpoint multiple times and continuously move the offset. 
		
		Metadata is parsed and saved separately in a /logs/ subfolder along with 
		the JSON results from API call.
		"""
		
		soda = HttpHook(method=self.method, http_conn_id=self.http_conn_id)
		
		if self.data:
			soql_filter = self.data
		elif self.days_ago:
			current_dt = datetime.now()
			target_dt = current_dt - timedelta(self.days_ago)
			format_dt = target_dt.strftime("%Y-%m-%d")
			soql_filter = f"""$query=SELECT:*,* WHERE :created_at > '{format_dt}' OR :updated_at > '{format_dt}'
							   ORDER BY :id LIMIT 10000"""
		else:
			soql_filter = """$query=SELECT:*,* ORDER BY :id LIMIT 10000"""
		
		print('getting... ' + soql_filter)
		
		#soql_filter = f"""$query=SELECT:*,* WHERE :created_at < '2020-04-01' ORDER BY :id LIMIT 10000"""
		
		offset, counter = 0, 1
		combined = []
		while True:
			soql_filter_offset = soql_filter + f' OFFSET {offset}'
			response = soda.run(endpoint=self.endpoint, data=soql_filter_offset, headers=self.headers)
			if response.status_code != 200:
				break
			captured = response.json()
			if len(captured) == 0:
				break
			combined.extend(captured)
			offset = 10000 * counter
			counter += 1

		if self.size_check:
			# Compute the recursive size once and reuse it for logging and the check
			actual_size = self.get_size(combined)
			print('actual size... ' + str(actual_size))
			print('max size... ' + str(self.max_bytes))
			if actual_size > self.max_bytes:
				raise SizeExceededError
		
		dest_s3 = S3Hook(self.s3_conn_id)
		
		body_obj = 'soda_evictions_import_' + datetime.now().strftime("%Y-%m-%dT%H%M%S") + '.json'
		
		metadata = self.parse_metadata(response.headers)
		meta_obj = 'logs/soda_evictions_import_log_' + datetime.now().strftime("%Y-%m-%dT%H%M%S")
		
		dest_s3.load_string(json.dumps(combined), key=self.s3_directory+'/'+body_obj, bucket_name=self.s3_bucket)
		dest_s3.load_string(json.dumps(metadata), key=self.s3_directory+'/'+meta_obj, bucket_name=self.s3_bucket)
		
		# XCom used to skip downstream tasks if body object size is 0
		self.xcom_push(context=context, key='obj_len', value=len(combined))


================================================
FILE: dags/sql/full_load.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/full_load.sql
-- nano /home/airflow/airflow/dags/sql/full_load.sql

-- Populate District Dimension

INSERT INTO staging.dim_district (district_key, district)
SELECT -1, 'Unknown';

INSERT INTO staging.dim_district (district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
				percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
				percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
				median_family_income, per_capita_income, percent_in_poverty)
SELECT 
	district,
	population::int,
	households::int,
	perc_asian::numeric as percent_asian,
	perc_black::numeric as percent_black,
	perc_white::numeric as percent_white,
	perc_nat_am::numeric as percent_native_am,
	perc_nat_pac::numeric as percent_pacific_isle,
	perc_other::numeric as percent_other_race,
	perc_latin::numeric as percent_latin,
	median_age::numeric,
	total_units::int,
	perc_owner_occupied::numeric as percent_owner_occupied,
	perc_renter_occupied::numeric as percent_renter_occupied,
	median_rent_as_perc_of_income::numeric,
	median_household_income::numeric,
	median_family_income::numeric,
	per_capita_income::numeric,
	perc_in_poverty::numeric as percent_in_poverty
FROM raw.district_data;	


-- Populate Neighborhood Dimension

INSERT INTO staging.dim_neighborhood (neighborhood_key, neighborhood)
SELECT -1, 'Unknown';

INSERT INTO staging.dim_neighborhood (neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, 
				percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
				percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
				median_family_income, per_capita_income, percent_in_poverty)
SELECT 
	acs_name as neighborhood,
	db_name as neighborhood_alt_name,
	population::int,
	households::int,
	perc_asian::numeric as percent_asian,
	perc_black::numeric as percent_black,
	perc_white::numeric as percent_white,
	perc_nat_am::numeric as percent_native_am,
	perc_nat_pac::numeric as percent_pacific_isle,
	perc_other::numeric as percent_other_race,
	perc_latin::numeric as percent_latin,
	median_age::numeric,
	total_units::int,
	perc_owner_occupied::numeric as percent_owner_occupied,
	perc_renter_occupied::numeric as percent_renter_occupied,
	median_rent_as_perc_of_income::numeric,
	median_household_income::numeric,
	median_family_income::numeric,
	per_capita_income::numeric,
	perc_in_poverty::numeric as percent_in_poverty
FROM raw.neighborhood_data;	


-- Populate Location Dimension

INSERT INTO staging.dim_location (location_key, city, state, zip_code)
SELECT -1, 'Unknown', 'Unknown', 'Unknown';

INSERT INTO staging.dim_location (city, state, zip_code)
SELECT DISTINCT
	COALESCE(city, 'Unknown') as city,
	COALESCE(state, 'Unknown') as state,
	COALESCE(zip, 'Unknown') as zip_code
FROM raw.soda_evictions
WHERE 
 city IS NOT NULL OR state IS NOT NULL OR zip IS NOT NULL;


-- Populate Reason Dimension

INSERT INTO staging.dim_reason (reason_key, reason_code, reason_desc)
VALUES (-1, 'Unknown', 'Unknown');

INSERT INTO staging.dim_reason (reason_code, reason_desc)
VALUES 	('non_payment', 'Non-Payment'),
	('breach', 'Breach'),
	('nuisance', 'Nuisance'),
	('illegal_use', 'Illegal Use'),
	('failure_to_sign_renewal', 'Failure to Sign Renewal'),
	('access_denial', 'Access Denial'),
	('unapproved_subtenant', 'Unapproved Subtenant'),
	('owner_move_in', 'Owner Move-In'),
	('demolition', 'Demolition'),
	('capital_improvement', 'Capital Improvement'),
	('substantial_rehab', 'Substantial Rehab'),
	('ellis_act_withdrawal', 'Ellis Act Withdrawal'),
	('condo_conversion', 'Condo Conversion'),
	('roommate_same_unit', 'Roommate Same Unit'),
	('other_cause', 'Other Cause'),
	('late_payments', 'Late Payments'),
	('lead_remediation', 'Lead Remediation'),
	('development', 'Development'),
	('good_samaritan_ends', 'Good Samaritan Ends');

	
-- Populate Reason Bridge Table

SELECT 
	ROW_NUMBER() OVER(ORDER BY concat_reason) as group_key,
	string_to_array(concat_reason, '|') as reason_array,
	concat_reason
INTO TEMP tmp_reason_group
FROM (
	SELECT DISTINCT
		TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)) as concat_reason
	FROM (
		SELECT
			eviction_id,
			CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
			CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
			CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
			CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
			CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
			CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
			CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
			CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
			CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
			CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
			CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
			CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
			CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
			CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
			CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
			CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
			CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
			CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
			CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
				as concat_reason
		FROM raw.soda_evictions
		) f1
	) f2;

INSERT INTO staging.br_reason_group (reason_group_key, reason_key)
SELECT DISTINCT
	group_key as reason_group_key,
	reason_key
FROM (SELECT group_key, unnest(reason_array) unnested FROM tmp_reason_group) grp
JOIN staging.dim_reason r ON r.reason_code = grp.unnested;


-- Populate Date Dimension Table

INSERT INTO staging.dim_date (date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week, 
				formatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday,
				period, cw_start, cw_end, month_start, month_end)
SELECT -1, '1900-01-01', -1, -1, 'Unknown', -1, -1, 'Unknown', -1, 'Unknown', 'Unknown', 'Unknown', 'Unknown',
		'Unknown', 'Unknown', 'Unknown', 'Unknown', '1900-01-01', '1900-01-01', '1900-01-01', '1900-01-01';
		

INSERT INTO staging.dim_date (date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week, 
				formatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday,
				period, cw_start, cw_end, month_start, month_end)
SELECT
	TO_CHAR(datum, 'yyyymmdd')::int as date_key,
	datum as date,
	EXTRACT(YEAR FROM datum) as year,
	EXTRACT(MONTH FROM datum) as month,
	TO_CHAR(datum, 'TMMonth') as month_name,
	EXTRACT(DAY FROM datum) as day,
	EXTRACT(doy FROM datum) as day_of_year,
	TO_CHAR(datum, 'TMDay') as weekday_name,
	EXTRACT(week FROM datum) as calendar_week,
	TO_CHAR(datum, 'dd. mm. yyyy') as formatted_date,
	'Q' || TO_CHAR(datum, 'Q') as quartal,
	TO_CHAR(datum, 'yyyy/"Q"Q') as year_quartal,
	TO_CHAR(datum, 'yyyy/mm') as year_month,
	TO_CHAR(datum, 'iyyy/IW') as year_calendar_week,
	CASE WHEN EXTRACT(isodow FROM datum) IN (6, 7) THEN 'Weekend' ELSE 'Weekday' END as weekend,
	-- Fixed-date approximation: floating holidays such as Thanksgiving are not flagged
	CASE WHEN TO_CHAR(datum, 'MMDD') IN ('0101', '0704', '1225', '1226') THEN 'Holiday' ELSE 'No holiday' END
			as us_holiday,
	CASE WHEN TO_CHAR(datum, 'MMDD') BETWEEN '0701' AND '0831' THEN 'Summer break'
	     WHEN TO_CHAR(datum, 'MMDD') BETWEEN '1115' AND '1225' THEN 'Christmas season'
	     WHEN TO_CHAR(datum, 'MMDD') > '1225' OR TO_CHAR(datum, 'MMDD') <= '0106' THEN 'Winter break'
		 ELSE 'Normal' END
			as period,
	datum + (1 - EXTRACT(isodow FROM datum))::int as cw_start,
	datum + (7 - EXTRACT(isodow FROM datum))::int as cw_end,
	datum + (1 - EXTRACT(DAY FROM datum))::int as month_start,
	(datum + (1 - EXTRACT(DAY FROM datum))::int + '1 month'::interval)::date - '1 day'::interval as month_end
FROM (
	-- 10,957 consecutive days: 1997-01-01 through 2026-12-31
	SELECT '1997-01-01'::date + SEQUENCE.DAY as datum
	FROM generate_series(0,10956) as SEQUENCE(DAY)
     ) DQ;


-- Populate Evictions Fact Table

-- Rebuild each eviction's pipe-delimited reason string and match it to its reason group
SELECT 
	eviction_id,
	group_key as reason_group_key
INTO TEMP tmp_reason_facts
FROM (
	SELECT 
		eviction_id,
		TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)) as concat_reason
	FROM (
		SELECT
			eviction_id,
			CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
			CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
			CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
			CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
			CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
			CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
			CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
			CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
			CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
			CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
			CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
			CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
			CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
			CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
			CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
			CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
			CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
			CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
			CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
				as concat_reason
		FROM raw.soda_evictions
		) grp
	) f_grp
JOIN tmp_reason_group t_grp ON f_grp.concat_reason = t_grp.concat_reason;	


INSERT INTO staging.fact_evictions (eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, 
									constraints_date_key, street_address)
SELECT 
	f.eviction_id as eviction_key,
	COALESCE(d.district_key, -1) as district_key,
	COALESCE(n.neighborhood_key, -1) as neighborhood_key,
	COALESCE(l.location_key, -1) as location_key,
	reason_group_key,
	COALESCE(dt1.date_key, -1) as file_date_key,
	COALESCE(dt2.date_key, -1) as constraints_date_key,
	f.address as street_address
FROM raw.soda_evictions f
LEFT JOIN tmp_reason_facts r ON f.eviction_id = r.eviction_id
LEFT JOIN staging.dim_district d ON f.supervisor_district = d.district
LEFT JOIN staging.dim_neighborhood n ON f.neighborhood = n.neighborhood_alt_name
LEFT JOIN staging.dim_location l 
	ON COALESCE(f.city, 'Unknown') = l.city
	AND COALESCE(f.state, 'Unknown') = l.state
	AND COALESCE(f.zip, 'Unknown') = l.zip_code
LEFT JOIN staging.dim_date dt1 ON f.file_date = dt1.date
LEFT JOIN staging.dim_date dt2 ON f.constraints_date = dt2.date;

DROP TABLE tmp_reason_group;
DROP TABLE tmp_reason_facts;

		     
-- Migrate to Production Schema

INSERT INTO prod.dim_district 
	(district_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
	percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
	percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
	median_family_income, per_capita_income, percent_in_poverty)
SELECT 
	district_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
	percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
	percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
	median_family_income, per_capita_income, percent_in_poverty
FROM staging.dim_district;

INSERT INTO prod.dim_neighborhood
	(neighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, 
	percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
	percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
	median_family_income, per_capita_income, percent_in_poverty)
SELECT
	neighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, 
	percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
	percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
	median_family_income, per_capita_income, percent_in_poverty
FROM staging.dim_neighborhood;

INSERT INTO prod.dim_location (location_key, city, state, zip_code)
SELECT location_key, city, state, zip_code
FROM staging.dim_location;

INSERT INTO prod.dim_reason (reason_key, reason_code, reason_desc)
SELECT reason_key, reason_code, reason_desc
FROM staging.dim_reason;

INSERT INTO prod.br_reason_group (reason_group_key, reason_key)
SELECT reason_group_key, reason_key
FROM staging.br_reason_group;

INSERT INTO prod.dim_date 
	(date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week, 
	formatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday,
	period, cw_start, cw_end, month_start, month_end)
SELECT 
	date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week, 
	formatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday, period, 
	cw_start, cw_end, month_start, month_end
FROM staging.dim_date;

INSERT INTO prod.fact_evictions 
	(eviction_key, district_key, neighborhood_key, location_key, reason_group_key, 
	file_date_key, constraints_date_key, street_address)
SELECT 
	eviction_key, district_key, neighborhood_key, location_key, reason_group_key, 
	file_date_key, constraints_date_key, street_address
FROM staging.fact_evictions;


================================================
FILE: dags/sql/incremental_load.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/incremental_load.sql
-- nano /home/airflow/airflow/dags/sql/incremental_load.sql

-- Populate District Dimension

INSERT INTO staging.dim_district
	(district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
	percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
	percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
	median_family_income, per_capita_income, percent_in_poverty)
SELECT 
	district,
	population::int,
	households::int,
	perc_asian::numeric as percent_asian,
	perc_black::numeric as percent_black,
	perc_white::numeric as percent_white,
	perc_nat_am::numeric as percent_native_am,
	perc_nat_pac::numeric as percent_pacific_isle,
	perc_other::numeric as percent_other_race,
	perc_latin::numeric as percent_latin,
	median_age::numeric,
	total_units::int,
	perc_owner_occupied::numeric as percent_owner_occupied,
	perc_renter_occupied::numeric as percent_renter_occupied,
	median_rent_as_perc_of_income::numeric,
	median_household_income::numeric,
	median_family_income::numeric, 			
	per_capita_income::numeric,
	perc_in_poverty::numeric as percent_in_poverty
FROM raw.district_data
	ON CONFLICT (district) DO UPDATE SET 
		population = EXCLUDED.population,
		households = EXCLUDED.households,
		percent_asian = EXCLUDED.percent_asian,
		percent_black = EXCLUDED.percent_black,
		percent_white = EXCLUDED.percent_white,
		percent_native_am = EXCLUDED.percent_native_am,
		percent_pacific_isle = EXCLUDED.percent_pacific_isle,
		percent_other_race = EXCLUDED.percent_other_race,
		percent_latin = EXCLUDED.percent_latin,
		median_age = EXCLUDED.median_age,
		total_units = EXCLUDED.total_units,
		percent_owner_occupied = EXCLUDED.percent_owner_occupied,
		percent_renter_occupied = EXCLUDED.percent_renter_occupied,
		median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
		median_household_income = EXCLUDED.median_household_income,
		median_family_income = EXCLUDED.median_family_income,
		per_capita_income = EXCLUDED.per_capita_income,
		percent_in_poverty = EXCLUDED.percent_in_poverty;


-- Populate Neighborhood Dimension

INSERT INTO staging.dim_neighborhood
	(neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, 
	percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
	percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
	median_family_income, per_capita_income, percent_in_poverty)
SELECT 
	acs_name as neighborhood,
	db_name as neighborhood_alt_name,
	population::int,
	households::int,
	perc_asian::numeric as percent_asian,
	perc_black::numeric as percent_black,
	perc_white::numeric as percent_white,
	perc_nat_am::numeric as percent_native_am,
	perc_nat_pac::numeric as percent_pacific_isle,
	perc_other::numeric as percent_other_race,
	perc_latin::numeric as percent_latin,
	median_age::numeric,
	total_units::int,
	perc_owner_occupied::numeric as percent_owner_occupied,
	perc_renter_occupied::numeric as percent_renter_occupied,
	median_rent_as_perc_of_income::numeric,
	median_household_income::numeric,
	median_family_income::numeric,
	per_capita_income::numeric,
	perc_in_poverty::numeric as percent_in_poverty
FROM raw.neighborhood_data
	ON CONFLICT (neighborhood) DO UPDATE SET
		neighborhood_alt_name = EXCLUDED.neighborhood_alt_name,
		population = EXCLUDED.population,
		households = EXCLUDED.households,
		percent_asian = EXCLUDED.percent_asian,
		percent_black = EXCLUDED.percent_black,
		percent_white = EXCLUDED.percent_white,
		percent_native_am = EXCLUDED.percent_native_am,
		percent_pacific_isle = EXCLUDED.percent_pacific_isle,
		percent_other_race = EXCLUDED.percent_other_race,
		percent_latin = EXCLUDED.percent_latin,
		median_age = EXCLUDED.median_age,
		total_units = EXCLUDED.total_units,
		percent_owner_occupied = EXCLUDED.percent_owner_occupied,
		percent_renter_occupied = EXCLUDED.percent_renter_occupied,
		median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
		median_household_income = EXCLUDED.median_household_income,
		median_family_income = EXCLUDED.median_family_income,
		per_capita_income = EXCLUDED.per_capita_income,
		percent_in_poverty = EXCLUDED.percent_in_poverty;


-- Populate Location Dimension

INSERT INTO staging.dim_location (city, state, zip_code)
SELECT 
	se.city,
	se.state,
	se.zip_code
FROM (
	SELECT DISTINCT
		COALESCE(city, 'Unknown') as city,
		COALESCE(state, 'Unknown') as state,
		COALESCE(zip, 'Unknown') as zip_code
	FROM raw.soda_evictions
	) se
LEFT JOIN staging.dim_location dl 
	ON se.city = dl.city
	AND se.state = dl.state
	AND se.zip_code = dl.zip_code
WHERE 
	dl.location_key IS NULL;
	
	
-- Populate Reason Bridge Table

-- Represent each existing reason group as a sorted array of reason keys for set comparison
SELECT
	reason_group_key,
	ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_existing_reason_groups
FROM staging.br_reason_group
GROUP BY reason_group_key;

SELECT 
	concat_reason,
	ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_new_reason_groups
FROM (
	SELECT DISTINCT
		string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|') as concat_reason,
		unnest(string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|')) unnested_reason
	FROM (
		SELECT DISTINCT
			CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
			CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
			CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
			CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
			CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
			CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
			CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
			CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
			CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
			CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
			CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
			CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
			CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
			CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
			CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
			CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
			CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
			CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
			CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
				as concat_reason
		FROM raw.soda_evictions
		) se1
	GROUP BY concat_reason
	) se2
JOIN staging.dim_reason r ON se2.unnested_reason = r.reason_code
GROUP BY concat_reason; 

-- Assign keys to unseen reason combinations by offsetting past the current maximum group key
INSERT INTO staging.br_reason_group (reason_group_key, reason_key)
SELECT
	COALESCE(final_grp.max_key, 0) + new_grp.tmp_group_key as reason_group_key,
	new_grp.reason_key as reason_key
FROM (
	SELECT DISTINCT
		ROW_NUMBER() OVER(ORDER BY concat_reason) as tmp_group_key,
		concat_reason,
		unnest(n.rk_array) as reason_key
	FROM tmp_new_reason_groups n
	LEFT JOIN tmp_existing_reason_groups e ON n.rk_array = e.rk_array
	WHERE e.reason_group_key IS NULL
	) new_grp
LEFT JOIN (SELECT MAX(reason_group_key) max_key FROM staging.br_reason_group) final_grp ON 1=1
ORDER BY reason_group_key, reason_key;

DROP TABLE tmp_existing_reason_groups;
DROP TABLE tmp_new_reason_groups;

					    	    
-- Populate Staging Fact Table

-- Rebuild the group lookup so it includes any reason groups added above
SELECT
	reason_group_key,
	ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_existing_reason_groups
FROM staging.br_reason_group
GROUP BY reason_group_key;
					    
SELECT 
	eviction_id,
	ARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array
INTO TEMP tmp_fct_reason_groups	
FROM (
	SELECT 
		eviction_id,
		unnest(string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|')) as unnested_reason
	FROM (
		SELECT 
			eviction_id,
			CASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||
			CASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||
			CASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||
			CASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||
			CASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||
			CASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||
			CASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||
			CASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||
			CASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||
			CASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||
			CASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||
			CASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||
			CASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||
			CASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||
			CASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||
			CASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||
			CASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||
			CASE WHEN development = 'true' THEN 'development|' ELSE '' END||
			CASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END
				as concat_reason
		FROM raw.soda_evictions
		) se1
	) se2
JOIN staging.dim_reason r ON se2.unnested_reason = r.reason_code	
GROUP BY se2.eviction_id; 
					    
SELECT
	eviction_id, 
	reason_group_key
INTO TEMP tmp_reason_group_lookup
FROM tmp_fct_reason_groups f
JOIN tmp_existing_reason_groups d ON f.rk_array = d.rk_array;			    


TRUNCATE TABLE staging.fact_evictions;
					    
INSERT INTO staging.fact_evictions 
	(eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, constraints_date_key, street_address)
SELECT
	f.eviction_id as eviction_key,
	COALESCE(d.district_key, -1) as district_key,
	COALESCE(n.neighborhood_key, -1) as neighborhood_key,
	COALESCE(l.location_key, -1) as location_key,
	r.reason_group_key as reason_group_key,
	COALESCE(dt1.date_key, -1) as file_date_key,
	COALESCE(dt2.date_key, -1) as constraints_date_key,
	f.address as street_address
FROM raw.soda_evictions f
JOIN tmp_reason_group_lookup r ON f.eviction_id = r.eviction_id
LEFT JOIN staging.dim_district d ON f.supervisor_district = d.district
LEFT JOIN staging.dim_neighborhood n ON f.neighborhood = n.neighborhood_alt_name
LEFT JOIN staging.dim_location l 
	ON COALESCE(f.city, 'Unknown') = l.city
	AND COALESCE(f.state, 'Unknown') = l.state
	AND COALESCE(f.zip, 'Unknown') = l.zip_code
LEFT JOIN staging.dim_date dt1 ON f.file_date = dt1.date
LEFT JOIN staging.dim_date dt2 ON f.constraints_date = dt2.date;

DROP TABLE tmp_existing_reason_groups;
DROP TABLE tmp_fct_reason_groups;
DROP TABLE tmp_reason_group_lookup;
					    
					    
-- Merge Into Production Schema

INSERT INTO prod.dim_district
	(district_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
	percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
	percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
	median_family_income, per_capita_income, percent_in_poverty)
SELECT 	
	district_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,
	percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
	percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
	median_family_income, per_capita_income, percent_in_poverty
FROM staging.dim_district	
	ON CONFLICT(district_key) DO UPDATE SET
		district = EXCLUDED.district,
		population = EXCLUDED.population,
		households = EXCLUDED.households,
		percent_asian = EXCLUDED.percent_asian,
		percent_black = EXCLUDED.percent_black,
		percent_white = EXCLUDED.percent_white,
		percent_native_am = EXCLUDED.percent_native_am,
		percent_pacific_isle = EXCLUDED.percent_pacific_isle,
		percent_other_race = EXCLUDED.percent_other_race,
		percent_latin = EXCLUDED.percent_latin,
		median_age = EXCLUDED.median_age,
		total_units = EXCLUDED.total_units,
		percent_owner_occupied = EXCLUDED.percent_owner_occupied,
		percent_renter_occupied = EXCLUDED.percent_renter_occupied,
		median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
		median_household_income = EXCLUDED.median_household_income,
		median_family_income = EXCLUDED.median_family_income,
		per_capita_income = EXCLUDED.per_capita_income,
		percent_in_poverty = EXCLUDED.percent_in_poverty;


INSERT INTO prod.dim_neighborhood
	(neighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, 
	percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
	percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
	median_family_income, per_capita_income, percent_in_poverty)
SELECT 
	neighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, 
	percent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, 
	percent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, 
	median_family_income, per_capita_income, percent_in_poverty
FROM staging.dim_neighborhood
	ON CONFLICT (neighborhood_key) DO UPDATE SET
		neighborhood = EXCLUDED.neighborhood,
		neighborhood_alt_name = EXCLUDED.neighborhood_alt_name,
		population = EXCLUDED.population,
		households = EXCLUDED.households,
		percent_asian = EXCLUDED.percent_asian,
		percent_black = EXCLUDED.percent_black,
		percent_white = EXCLUDED.percent_white,
		percent_native_am = EXCLUDED.percent_native_am,
		percent_pacific_isle = EXCLUDED.percent_pacific_isle,
		percent_other_race = EXCLUDED.percent_other_race,
		percent_latin = EXCLUDED.percent_latin,
		median_age = EXCLUDED.median_age,
		total_units = EXCLUDED.total_units,
		percent_owner_occupied = EXCLUDED.percent_owner_occupied,
		percent_renter_occupied = EXCLUDED.percent_renter_occupied,
		median_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,
		median_household_income = EXCLUDED.median_household_income,
		median_family_income = EXCLUDED.median_family_income,
		per_capita_income = EXCLUDED.per_capita_income,
		percent_in_poverty = EXCLUDED.percent_in_poverty;


INSERT INTO prod.dim_location (location_key, city, state, zip_code)
SELECT location_key, city, state, zip_code
FROM staging.dim_location
	ON CONFLICT (location_key) DO NOTHING;	

					    
INSERT INTO prod.br_reason_group (reason_group_key, reason_key)
SELECT stg.reason_group_key, stg.reason_key 
FROM staging.br_reason_group stg
LEFT JOIN prod.br_reason_group prd 
	ON stg.reason_group_key = prd.reason_group_key 
	AND stg.reason_key = prd.reason_key
WHERE 
	prd.reason_group_key IS NULL;
					    
					    
INSERT INTO prod.fact_evictions 
	(eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, constraints_date_key, street_address)
SELECT eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, constraints_date_key, street_address
FROM staging.fact_evictions 
	ON CONFLICT (eviction_key) DO UPDATE SET 
		district_key = EXCLUDED.district_key,
		neighborhood_key = EXCLUDED.neighborhood_key,
		location_key = EXCLUDED.location_key,
		reason_group_key = EXCLUDED.reason_group_key,
		file_date_key = EXCLUDED.file_date_key,
		constraints_date_key = EXCLUDED.constraints_date_key,
		street_address = EXCLUDED.street_address;


================================================
FILE: dags/sql/init_db_schema.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/init_db_schema.sql
-- nano /home/airflow/airflow/dags/sql/init_db_schema.sql

DROP SCHEMA IF EXISTS raw CASCADE;
DROP SCHEMA IF EXISTS staging CASCADE;
DROP SCHEMA IF EXISTS prod CASCADE;

CREATE SCHEMA raw;
CREATE SCHEMA staging;
CREATE SCHEMA prod;


-- Raw
CREATE UNLOGGED TABLE raw.soda_evictions (
	raw_id text,
	created_at timestamp,
	updated_at timestamp,
	eviction_id text,
	address text,
	city text,
	state text,
	zip text,
	file_date timestamp,
	non_payment boolean,
	breach boolean,
	nuisance boolean,
	illegal_use boolean,
	failure_to_sign_renewal boolean,
	access_denial boolean,
	unapproved_subtenant boolean,
	owner_move_in boolean,
	demolition boolean,
	capital_improvement boolean,
	substantial_rehab boolean,
	ellis_act_withdrawal boolean,
	condo_conversion boolean,
	roommate_same_unit boolean,
	other_cause boolean,
	late_payments boolean,
	lead_remediation boolean,
	development boolean,
	good_samaritan_ends boolean,
	constraints_date timestamp,
	supervisor_district text,
	neighborhood text
);

CREATE UNLOGGED TABLE raw.district_data (
	district text,
	population text,
	households text,
	perc_asian text,
	perc_black text,
	perc_white text,
	perc_nat_am text,
	perc_nat_pac text,
	perc_other text,
	perc_latin text,
	median_age text,
	total_units text,
	perc_owner_occupied text,
	perc_renter_occupied text,
	median_rent_as_perc_of_income text,
	median_household_income text,
	median_family_income text,
	per_capita_income text,
	perc_in_poverty text
);
CREATE UNIQUE INDEX district_name_uniq_idx ON raw.district_data (district);

CREATE UNLOGGED TABLE raw.neighborhood_data (
	acs_name text,
	db_name text,
	population text,
	households text,
	perc_asian text,
	perc_black text,
	perc_white text,
	perc_nat_am text,
	perc_nat_pac text,
	perc_other text,
	perc_latin text,
	median_age text,
	total_units text,
	perc_owner_occupied text,
	perc_renter_occupied text,
	median_rent_as_perc_of_income text,
	median_household_income text,
	median_family_income text,
	per_capita_income text,
	perc_in_poverty text
);
CREATE UNIQUE INDEX neighborhood_name_uniq_idx ON raw.neighborhood_data (acs_name);

-- Staging
CREATE TABLE staging.dim_district (
	district_key serial PRIMARY KEY,
	district text,
	population integer,
	households integer,
	percent_asian numeric,
	percent_black numeric,
	percent_white numeric,
	percent_native_am numeric,
	percent_pacific_isle numeric,
	percent_other_race numeric,
	percent_latin numeric,
	median_age numeric,
	total_units integer,
	percent_owner_occupied numeric,
	percent_renter_occupied numeric,
	median_rent_as_perc_of_income numeric,
	median_household_income numeric,
	median_family_income numeric,
	per_capita_income numeric,
	percent_in_poverty numeric
);
CREATE UNIQUE INDEX district_name_uniq_idx ON staging.dim_district (district);

CREATE TABLE staging.dim_neighborhood (
	neighborhood_key serial PRIMARY KEY,
	neighborhood text,
	neighborhood_alt_name text,
	population integer,
	households integer,
	percent_asian numeric,
	percent_black numeric,
	percent_white numeric,
	percent_native_am numeric,
	percent_pacific_isle numeric,
	percent_other_race numeric,
	percent_latin numeric,
	median_age numeric,
	total_units integer,
	percent_owner_occupied numeric,
	percent_renter_occupied numeric,
	median_rent_as_perc_of_income numeric,
	median_household_income numeric,
	median_family_income numeric,
	per_capita_income numeric,
	percent_in_poverty numeric
);
CREATE UNIQUE INDEX neighborhood_name_uniq_idx ON staging.dim_neighborhood (neighborhood);

CREATE TABLE staging.dim_location (
	location_key serial PRIMARY KEY,
	city text,
	state text,
	zip_code text
);

CREATE TABLE staging.dim_reason (
	reason_key serial PRIMARY KEY,
	reason_code text,
	reason_desc text
);

CREATE TABLE staging.br_reason_group (
	reason_group_key int,
	reason_key int
);	
CREATE INDEX reason_group_key_idx ON staging.br_reason_group (reason_group_key);
CREATE INDEX reason_key_idx ON staging.br_reason_group (reason_key);

CREATE TABLE staging.dim_date (
	date_key int PRIMARY KEY,
	date date,
	year int,
	month int,
	month_name text,
	day int,
	day_of_year int,
	weekday_name text,
	calendar_week int,
	formatted_date text,
	quartal text,
	year_quartal text,
	year_month text,
	year_calendar_week text,
	weekend text,
	us_holiday text,
	period text,
	cw_start date,
	cw_end date,
	month_start date,
	month_end date
);

CREATE TABLE staging.fact_evictions (
	eviction_key text PRIMARY KEY,
	location_key int,
	district_key int,
	neighborhood_key int,
	reason_group_key int,
	file_date_key int,
	constraints_date_key int,
	street_address text
);


-- Prod
CREATE TABLE prod.dim_district (
	district_key serial PRIMARY KEY,
	district text,
	population integer,
	households integer,
	percent_asian numeric,
	percent_black numeric,
	percent_white numeric,
	percent_native_am numeric,
	percent_pacific_isle numeric,
	percent_other_race numeric,
	percent_latin numeric,
	median_age numeric,
	total_units integer,
	percent_owner_occupied numeric,
	percent_renter_occupied numeric,
	median_rent_as_perc_of_income numeric,
	median_household_income numeric,
	median_family_income numeric,
	per_capita_income numeric,
	percent_in_poverty numeric
);

CREATE TABLE prod.dim_neighborhood (
	neighborhood_key serial PRIMARY KEY,
	neighborhood text,
	neighborhood_alt_name text,
	population integer,
	households integer,
	percent_asian numeric,
	percent_black numeric,
	percent_white numeric,
	percent_native_am numeric,
	percent_pacific_isle numeric,
	percent_other_race numeric,
	percent_latin numeric,
	median_age numeric,
	total_units integer,
	percent_owner_occupied numeric,
	percent_renter_occupied numeric,
	median_rent_as_perc_of_income numeric,
	median_household_income numeric,
	median_family_income numeric,
	per_capita_income numeric,
	percent_in_poverty numeric
);

CREATE TABLE prod.dim_location (
	location_key serial PRIMARY KEY,
	city text,
	state text,
	zip_code text
);

CREATE TABLE prod.dim_reason (
	reason_key serial PRIMARY KEY,
	reason_code text,
	reason_desc text
);

CREATE TABLE prod.br_reason_group (
	reason_group_key int,
	reason_key int
);	
CREATE INDEX reason_group_key_idx ON prod.br_reason_group (reason_group_key);
CREATE INDEX reason_key_idx ON prod.br_reason_group (reason_key);

CREATE TABLE prod.dim_date (
	date_key int PRIMARY KEY,
	date date,
	year int,
	month int,
	month_name text,
	day int,
	day_of_year int,
	weekday_name text,
	calendar_week int,
	formatted_date text,
	quartal text,
	year_quartal text,
	year_month text,
	year_calendar_week text,
	weekend text,
	us_holiday text,
	period text,
	cw_start date,
	cw_end date,
	month_start date,
	month_end date
);

CREATE TABLE prod.fact_evictions (
	eviction_key text PRIMARY KEY,
	location_key int,
	district_key int,
	neighborhood_key int,
	reason_group_key int,
	file_date_key int,
	constraints_date_key int,
	street_address text
);


================================================
FILE: dags/sql/trunc_target_tables.sql
================================================
-- echo "" > /home/airflow/airflow/dags/sql/trunc_target_tables.sql
-- nano /home/airflow/airflow/dags/sql/trunc_target_tables.sql
TRUNCATE TABLE raw.soda_evictions;
TRUNCATE TABLE raw.district_data;
TRUNCATE TABLE raw.neighborhood_data;
