[
  {
    "path": "README.md",
    "content": "# SF-EvictionTracker\n\nTracking eviction trends in San Francisco across filing reasons, districts, neighborhoods, and demographics in the months following COVID-19. Data warehouse infrastructure is housed in the AWS ecosystem and uses Apache Airflow for orchestration with public-facing dashboards created using Metabase. \n\nQuestions? Feel free to reach me at ilya.glprn@gmail.com.\n\nPublic Dashboard Link: http://sf-evictiontracker-metabase.us-east-1.elasticbeanstalk.com/public/dashboard/f637e470-8ea9-4b03-af80-53988e5b6a9b\n\n\n<h3>ARCHITECTURE:</h3>\n\n![Architecture](https://i.imgur.com/s2gLBZt.png)\nData is sourced from San Francisco Open Data's API and CSV files containing San Francisco district and neighborhood aggregate census results. Airflow orchestrates its movement to an S3 bucket and into a data warehouse hosted in RDS. SQL scripts are then run to transform the data from its raw form through a staging schema and into production target tables. The presentation layer is created using Metabase, an open-source data visualization tool, and deployed using Elastic Beanstalk. \n\n<h3>DATA MODEL:</h3>\n\nDimension Tables:\n`dim_district`\n`dim_neighborhood`\n`dim_location`\n`dim_reason`\n`dim_date`\n`br_reason_group`\n\nFact Tables:\n`fact_evictions`\n\nThe data model is implemented using a star schema with a bridge table to accommodate any new permutations for the reason dimension. 
More information on bridge tables can be found here: https://www.kimballgroup.com/2012/02/design-tip-142-building-bridges/\n\n![Model](https://i.imgur.com/uInBlzR.png)\n<h3>ETL FLOW:</h3>\n\nGeneral Overview - \n- Evictions data is collected from the SODA API and moved into an S3 Bucket\n- Neighborhood/district census data is stored as a CSV in S3\n- Once the API load to S3 is complete, data is loaded into a \"raw\" schema in RDS and then moves through a staging schema for processing\n- ETL job execution is complete once data is moved from the staging schema into the final production tables\n\nDAGs and Custom Airflow Operators -\n\n![Ops](https://i.imgur.com/WTOUiGU.jpg)\n![Dag](https://i.imgur.com/yJb3DKT.jpg)\n\nThere are 2 DAGs (Directed Acyclic Graphs) used for this project - <b>full load</b> which should be executed on initial setup and <b>incremental load</b> which is scheduled to run daily and pull new data from the Socrata Open Data API.\n\nThe incremental load DAG uses XCom to pass the number of records returned by the API call task to a ShortCircuitOperator, which skips downstream tasks if the API call produces no results. \n\nThe DAGs use two custom operators. They have been purpose-built for this project but can easily be extended for use in other data pipelines.\n\n1. soda_to_s3_operator: Queries the Socrata Open Data API using a SoQL string and uploads the results to an S3 bucket. Includes an optional check of the source data size that aborts the ETL if the size exceeds a user-defined limit.\n\n2. s3_to_postgres_operator: Collects data from a file hosted on AWS S3 and loads it into a Postgres table. 
The current version supports JSON and CSV source data types.\n\n\n<h3>INFRASTRUCTURE:</h3>\n\nThis project is hosted in the AWS ecosystem and uses the following resources:\n\n![EC2](https://i.imgur.com/jB2X1jI.png)\n\nEC2 -\n- t2.medium - dedicated resource for Airflow, managed by AWS Instance Scheduler to complete the daily DAG run and shut off after execution \n- t2.small - used to host Metabase, always online\n\nRDS -\n- t2.small - hosts application database for Metabase and the data warehouse\n\nElastic Beanstalk is used to deploy the Metabase web application.\n\n\n<h3>DASHBOARD:</h3>\n\nThe dashboard is publicly accessible here: http://sf-evictiontracker-metabase.us-east-1.elasticbeanstalk.com/public/dashboard/f637e470-8ea9-4b03-af80-53988e5b6a9b\n\nSome example screengrabs below!\n\n![Dash1](https://i.imgur.com/MZ325PT.jpg)\n![Dash2](https://i.imgur.com/OeyOVp0.jpg)\n![Dash3](https://i.imgur.com/v6Nwz9l.jpg)\n"
  },
  {
    "path": "airflow_installation.txt",
    "content": "\nStep 1 - Launch EC2 Instance:\n- t3.medium\n- 12 GB storage\n- launch-wizard-3 security group to open TCP Port 8080\n- associate elastic IP \n\n\nStep 2 - Install Postgres Server on EC2:\nrun:\nsudo apt-get update\nsudo apt-get install python-psycopg2\nsudo apt-get install postgresql postgresql-contrib\n\n\nStep 3 - Create OS User airflow\nrun:\nsudo adduser airflow\nsudo usermod -aG sudo airflow\nsu - airflow\n\nNote: From here on, make sure you are logged in as the airflow user.\n\n\nStep 4 - Create Postgres Metadata Database and User Access\nrun: \nsudo -u postgres psql\n\nin postgres prompt: \nCREATE USER airflow PASSWORD 'password';\nCREATE DATABASE airflow;\nGRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO airflow;\n\\q \n\n\nStep 5 - Change Postgres Connection Config\nrun:\nsudo nano /etc/postgresql/10/main/pg_hba.conf\n\nChange this line -\n# IPv4 local connections:\nhost    all             all             127.0.0.1/32         md5\nTo this line - \n# IPv4 local connections:\nhost    all             all             0.0.0.0/0            trust\n\nrun:\nsudo nano /etc/postgresql/10/main/postgresql.conf\n\nChange this line - \n#listen_addresses = 'localhost' # what IP address(es) to listen on\nTo this line -\nlisten_addresses = '*' # what IP address(es) to listen on\n\nrestart postgres server:\nsudo service postgresql restart\n\n\nStep 6 - Install Airflow\n\nrun:\nsu - airflow\nsudo apt-get install python3-pip\nsudo python3 -m pip install apache-airflow[postgres,s3,aws]\n\nrun:\nairflow initdb\n\n\nStep 7 - Connect Airflow to Postgres\n\nrun:\nnano /home/airflow/airflow/airflow.cfg\n\nChange lines -\nsql_alchemy_conn = postgresql+psycopg2://airflow:password@localhost:5432/airflow\nexecutor = LocalExecutor\nload_examples = False\n\nrun:\nairflow initdb\n\n\nStep 8 - Add DAGs:\nmkdir /home/airflow/airflow/dags/\ncd /home/airflow/airflow/dags/\ntouch tutorial.py\nnano tutorial.py\n\n\nStep 9 - Set up Airflow Webserver and Scheduler to start 
automatically\nWe are almost there. The final thing we need to do is to ensure airflow starts up when your ec2 instance starts.\n\nsudo nano /etc/systemd/system/airflow-webserver.service\n\nPaste the following into the file created above\n\n[Unit]\nDescription=Airflow webserver daemon\nAfter=network.target postgresql.service\nWants=postgresql.service\n[Service]\nEnvironmentFile=/etc/environment\nUser=airflow\nGroup=airflow\nType=simple\nExecStart= /usr/local/bin/airflow webserver\nRestart=on-failure\nRestartSec=5s\nPrivateTmp=true\n[Install]\nWantedBy=multi-user.target\n\nNext we will create the following file to enable scheduler service\n\nsudo nano /etc/systemd/system/airflow-scheduler.service\n\nPaste the following\n\n[Unit]\nDescription=Airflow scheduler daemon\nAfter=network.target postgresql.service\nWants=postgresql.service\n[Service]\nEnvironmentFile=/etc/environment\nUser=airflow\nGroup=airflow\nType=simple\nExecStart=/usr/local/bin/airflow scheduler\nRestart=always\nRestartSec=5s\n[Install]\nWantedBy=multi-user.target\n\nNext enable these services and check their status\n\nsudo systemctl enable airflow-webserver.service\nsudo systemctl enable airflow-scheduler.service\nsudo systemctl start airflow-scheduler\nsudo systemctl start airflow-webserver\n"
  },
  {
    "path": "dags/full_load_dag.py",
    "content": "# echo \"\" > /home/airflow/airflow/dags/full_load_dag.py\n# nano /home/airflow/airflow/dags/full_load_dag.py\n\nfrom airflow import DAG\nfrom airflow.operators.postgres_operator import PostgresOperator\nfrom operators.soda_to_s3_operator import SodaToS3Operator\nfrom operators.s3_to_postgres_operator import S3ToPostgresOperator\nfrom airflow.utils.dates import days_ago\nfrom datetime import timedelta\n\nsoda_headers = {\n    'keyId':'############',\n    'keySecret':'#################',\n    'Accept':'application/json'\n}\n\ndefault_args = {\n    'owner': 'airflow',\n    'depends_on_past': False,\n    'start_date': days_ago(2),\n    'email': ['airflow@example.com'],\n    'email_on_failure': False,\n    'email_on_retry': False,\n    'retries': 1,\n    'retry_delay': timedelta(seconds=30)\n}\n\nwith DAG('eviction-tracker-full_load',\n\t\tdefault_args=default_args,\n\t\tdescription='Executes full load from SODA API to Production DW.',\n\t\tmax_active_runs=1,\n\t\tschedule_interval=None) as dag:\n \n\top1 = SodaToS3Operator(\n\t\ttask_id='get_evictions_data',\n\t\thttp_conn_id='API_Evictions',\n\t\theaders=soda_headers,\n\t\ts3_conn_id='S3_Evictions',\n\t\ts3_bucket='sf-evictionmeter',\n\t\ts3_directory='soda_jsons',\n\t\tsize_check=True,\n\t\tmax_bytes=500000000,\n\t\tdag=dag\n\t)\n\t\n\top2 = PostgresOperator(\n\t\ttask_id='initialize_target_db',\n\t\tpostgres_conn_id='RDS_Evictions',\n\t\tsql='sql/init_db_schema.sql',\n\t\tdag=dag\n\t)\n\t\n\top3 = S3ToPostgresOperator(\n\t\ttask_id='load_evictions_data',\n\t\ts3_conn_id='S3_Evictions',\n\t\ts3_bucket='sf-evictionmeter',\n\t\ts3_prefix='soda_jsons/soda_evictions_import',\n\t\tsource_data_type='json',\n\t\tpostgres_conn_id='RDS_Evictions',\n\t\tschema='raw',\n\t\ttable='soda_evictions',\n\t\tget_latest=True,\n\t\tdag=dag\n\t)\n\t\n\top4 = 
S3ToPostgresOperator(\n\t\ttask_id='load_neighborhood_data',\n\t\ts3_conn_id='S3_Evictions',\n\t\ts3_bucket='sf-evictionmeter',\n\t\ts3_prefix='census_csv/sf_by_neighborhood',\n\t\tsource_data_type='csv',\n\t\theader=True,\n\t\tpostgres_conn_id='RDS_Evictions',\n\t\tschema='raw',\n\t\ttable='neighborhood_data',\n\t\tget_latest=True,\n\t\tdag=dag\n\t)\n\t\n\top5 = S3ToPostgresOperator(\n\t\ttask_id='load_district_data',\n\t\ts3_conn_id='S3_Evictions',\n\t\ts3_bucket='sf-evictionmeter',\n\t\ts3_prefix='census_csv/sf_by_district',\n\t\tsource_data_type='csv',\n\t\theader=True,\n\t\tpostgres_conn_id='RDS_Evictions',\n\t\tschema='raw',\n\t\ttable='district_data',\n\t\tget_latest=True,\n\t\tdag=dag\n\t)\n\t\n\top6 = PostgresOperator(\n\t\ttask_id='execute_full_load',\n\t\tpostgres_conn_id='RDS_Evictions',\n\t\tsql='sql/full_load.sql',\n\t\tdag=dag\n\t)\n\t\n\top1 >> op2 >> (op3, op4, op5) >> op6\n"
  },
  {
    "path": "dags/incremental_load_dag.py",
    "content": "# echo \"\" > /home/airflow/airflow/dags/incremental_load_dag.py\n# nano /home/airflow/airflow/dags/incremental_load_dag.py\n\nfrom airflow import DAG\nfrom airflow.operators.postgres_operator import PostgresOperator\nfrom airflow.operators.python_operator import ShortCircuitOperator\nfrom operators.soda_to_s3_operator import SodaToS3Operator\nfrom operators.s3_to_postgres_operator import S3ToPostgresOperator\nfrom airflow.utils.dates import days_ago\nfrom datetime import timedelta\n\nsoda_headers = {\n    'keyId':'############',\n    'keySecret':'#################',\n    'Accept':'application/json'\n}\n\ndefault_args = {\n    'owner': 'airflow',\n    'depends_on_past': False,\n    'start_date': days_ago(2),\n    'email': ['airflow@example.com'],\n    'email_on_failure': False,\n    'email_on_retry': False,\n    'retries': 1,\n    'retry_delay': timedelta(seconds=30)\n}\n\ndef get_size(**context):\n\t# Record count pushed to XCom by the get_evictions_data task (None if nothing was pushed)\n\tval = context['ti'].xcom_pull(task_ids='get_evictions_data', key='obj_len')\n\treturn val is not None and val > 0\n\t\n\nwith DAG('eviction-tracker-incremental_load',\n\tdefault_args=default_args,\n\tdescription='Executes incremental load from SODA API & S3-hosted CSVs into Production DW.',\n\tmax_active_runs=1,\n\tschedule_interval=None) as dag:\n \n\top1 = SodaToS3Operator(\n\t\ttask_id='get_evictions_data',\n\t\thttp_conn_id='API_Evictions',\n\t\theaders=soda_headers,\n\t\tdays_ago=31,\n\t\ts3_conn_id='S3_Evictions',\n\t\ts3_bucket='sf-evictionmeter',\n\t\ts3_directory='soda_jsons',\n\t\tsize_check=True,\n\t\tmax_bytes=500000000,\n\t\tdag=dag\n\t)\n\t\n\top2 = ShortCircuitOperator(\n\t\ttask_id='check_get_results',\n\t\tpython_callable=get_size,\n\t\tprovide_context=True,\n\t\tdag=dag\n\t)\n\t\n\top3 = PostgresOperator(\n\t\ttask_id='truncate_target_tables',\n\t\tpostgres_conn_id='RDS_Evictions',\n\t\tsql='sql/trunc_target_tables.sql',\n\t\tdag=dag\n\t)\n\t\n\top4 = 
S3ToPostgresOperator(\n\t\ttask_id='load_evictions_data',\n\t\ts3_conn_id='S3_Evictions',\n\t\ts3_bucket='sf-evictionmeter',\n\t\ts3_prefix='soda_jsons/soda_evictions_import',\n\t\tsource_data_type='json',\n\t\tpostgres_conn_id='RDS_Evictions',\n\t\tschema='raw',\n\t\ttable='soda_evictions',\n\t\tget_latest=True,\n\t\tdag=dag\n\t)\n\t\n\top5 = S3ToPostgresOperator(\n\t\ttask_id='load_neighborhood_data',\n\t\ts3_conn_id='S3_Evictions',\n\t\ts3_bucket='sf-evictionmeter',\n\t\ts3_prefix='census_csv/sf_by_neighborhood',\n\t\tsource_data_type='csv',\n\t\theader=True,\n\t\tpostgres_conn_id='RDS_Evictions',\n\t\tschema='raw',\n\t\ttable='neighborhood_data',\n\t\tget_latest=True,\n\t\tdag=dag\n\t)\n\t\n\top6 = S3ToPostgresOperator(\n\t\ttask_id='load_district_data',\n\t\ts3_conn_id='S3_Evictions',\n\t\ts3_bucket='sf-evictionmeter',\n\t\ts3_prefix='census_csv/sf_by_district',\n\t\tsource_data_type='csv',\n\t\theader=True,\n\t\tpostgres_conn_id='RDS_Evictions',\n\t\tschema='raw',\n\t\ttable='district_data',\n\t\tget_latest=True,\n\t\tdag=dag\n\t)\n\t\n\top7 = PostgresOperator(\n\t\ttask_id='execute_incremental_load',\n\t\tpostgres_conn_id='RDS_Evictions',\n\t\tsql='sql/incremental_load.sql',\n\t\tdag=dag\n\t)\n\t\n\top1 >> op2 >> op3 >> (op4, op5, op6) >> op7\n"
  },
  {
    "path": "dags/operators/s3_to_postgres_operator.py",
    "content": "# echo \"\" > /home/airflow/airflow/dags/operators/s3_to_postgres_operator.py\n# nano /home/airflow/airflow/dags/operators/s3_to_postgres_operator.py\n\nfrom airflow.models.baseoperator import BaseOperator\nfrom airflow.utils.decorators import apply_defaults\nfrom airflow.hooks.S3_hook import S3Hook\nfrom airflow.hooks.postgres_hook import PostgresHook\n\nimport json\nimport io\nfrom contextlib import closing\n\n\nclass S3ToPostgresOperator(BaseOperator):\n\t\"\"\" \n\tCollects data from a file hosted on AWS S3 and loads it into a Postgres table. \n\tCurrent version supports JSON and CSV sources but requires a pre-defined data model.\n\t\n\t:param s3_conn_id:\t\t\tS3 Connection ID\n\t:param s3_bucket:\t\t\tS3 Source Bucket\n\t:param s3_prefix:\t\t\tS3 File Prefix\n\t:param source_data_type:\t\tS3 Source File data type ('json' or 'csv')\n\t:param header:\t\t\t\tif True, skips the header row of a CSV source\n\t:param postgres_conn_id: \t\tPostgres Connection ID\n\t:param schema:\t\t\t\tPostgres Target Schema\n\t:param table:\t\t\t\tPostgres Target Table\n\t:param get_latest:\t\t\tif True, pulls from last modified file in S3 path\n\t\"\"\"\n\t\n\t@apply_defaults\n\tdef __init__(self,\n\t\ts3_conn_id=None,\n\t\ts3_bucket=None,\n\t\ts3_prefix='',\n\t\tsource_data_type='',\n\t\tpostgres_conn_id='postgres_default',\n\t\theader=False,\n\t\tschema='public',\n\t\ttable='raw_load',\n\t\tget_latest=False,\t\t\n\t\t*args, \n\t\t**kwargs) -> None: \n\t\t\n\t\tsuper().__init__(*args, **kwargs)\n\t\t\n\t\tself.s3_conn_id = s3_conn_id\n\t\tself.s3_bucket = s3_bucket\n\t\tself.s3_prefix = s3_prefix\n\t\tself.source_data_type = source_data_type\n\t\tself.postgres_conn_id = postgres_conn_id\n\t\tself.header = header\n\t\tself.schema = schema\n\t\tself.table = table\n\t\tself.get_latest = get_latest\n\t\n\t\n\tdef execute(self, context):\n\t\t\"\"\"\n\t\tExecutes the operator.\n\t\t\"\"\"\n\t\ts3_hook = S3Hook(self.s3_conn_id)\n\t\ts3_session = s3_hook.get_session()\n\t\ts3_client = 
s3_session.client('s3')\n\t\t\n\t\tif self.get_latest:\n\t\t\t# Pick the most recently modified object under the prefix\n\t\t\tobjects = s3_client.list_objects_v2(Bucket=self.s3_bucket, Prefix=self.s3_prefix)['Contents']\n\t\t\tlatest = max(objects, key=lambda x: x['LastModified'])\n\t\t\ts3_obj = s3_client.get_object(Bucket=self.s3_bucket, Key=latest['Key'])\n\t\telse:\n\t\t\t# Assumption: without get_latest, s3_prefix is treated as the full object key\n\t\t\ts3_obj = s3_client.get_object(Bucket=self.s3_bucket, Key=self.s3_prefix)\n\t\t\t\n\t\tfile_content = s3_obj['Body'].read().decode('utf-8')\n\t\t\n\t\tpg_hook = PostgresHook(self.postgres_conn_id)\n\t\t\t\n\t\tif self.source_data_type == 'json':\n\t\t\t\n\t\t\tprint('inserting json object...')\n\t\n\t\t\tjson_content = json.loads(file_content)\t\t\n\t\t\t\t\n\t\t\tschema = self.schema\n\t\t\tif isinstance(self.schema, tuple):\n\t\t\t\tschema = self.schema[0]\n\t\t\t\n\t\t\ttable = self.table\t\n\t\t\tif isinstance(self.table, tuple):\n\t\t\t\ttable = self.table[0]\t\n\t\t\n\t\t\ttarget_fields = ['raw_id','created_at','updated_at','eviction_id','address','city','state',\n\t\t\t\t\t'zip','file_date','non_payment','breach','nuisance','illegal_use','failure_to_sign_renewal',\n\t\t\t\t\t'access_denial','unapproved_subtenant','owner_move_in','demolition','capital_improvement',\n\t\t\t\t\t'substantial_rehab','ellis_act_withdrawal','condo_conversion','roommate_same_unit',\n\t\t\t\t\t'other_cause','late_payments','lead_remediation','development','good_samaritan_ends',\n\t\t\t\t\t'constraints_date','supervisor_district','neighborhood']\n\t\t\ttarget_fields = ','.join(target_fields)\n\t\t\t\n\t\t\twith closing(pg_hook.get_conn()) as conn:\n\t\t\t\twith closing(conn.cursor()) as cur:\n\t\t\t\t\t\tcur.executemany(\n\t\t\t\t\t\t\tf\"\"\"INSERT INTO {schema}.{table} ({target_fields})\n\t\t\t\t\t\t\tVALUES(\n\t\t\t\t\t\t\t%(:id)s, %(:created_at)s, %(:updated_at)s, %(eviction_id)s, %(address)s, %(city)s, %(state)s, %(zip)s,\n\t\t\t\t\t\t\t%(file_date)s, %(non_payment)s, %(breach)s, %(nuisance)s, %(illegal_use)s, %(failure_to_sign_renewal)s,\n\t\t\t\t\t\t\t%(access_denial)s, %(unapproved_subtenant)s, %(owner_move_in)s, %(demolition)s, 
%(capital_improvement)s,\n\t\t\t\t\t\t\t%(substantial_rehab)s, %(ellis_act_withdrawal)s, %(condo_conversion)s, %(roommate_same_unit)s,\n\t\t\t\t\t\t\t%(other_cause)s, %(late_payments)s, %(lead_remediation)s, %(development)s, %(good_samaritan_ends)s,\n\t\t\t\t\t\t\t%(constraints_date)s, %(supervisor_district)s, %(neighborhood)s\n\t\t\t\t\t\t\t\t);\n\t\t\t\t\t\t\t\"\"\",({\n\t\t\t\t\t\t\t':id': line[':id'], ':created_at': line[':created_at'], ':updated_at': line[':updated_at'],\n\t\t\t\t\t\t\t'eviction_id': line['eviction_id'], 'address': line.get('address', None), 'city': line.get('city', None),\n\t\t\t\t\t\t\t'state': line.get('state', None),'zip': line.get('zip', None),'file_date': line.get('file_date', None),\n\t\t\t\t\t\t\t'non_payment': line.get('non_payment', None),'breach': line.get('breach', None),\n\t\t\t\t\t\t\t'nuisance': line.get('nuisance', None),'illegal_use': line.get('illegal_use', None),\n\t\t\t\t\t\t\t'failure_to_sign_renewal': line.get('failure_to_sign_renewal', None),\n\t\t\t\t\t\t\t'access_denial': line.get('access_denial', None),'unapproved_subtenant': line.get('unapproved_subtenant', None),\n\t\t\t\t\t\t\t'owner_move_in': line.get('owner_move_in', None),'demolition': line.get('demolition', None),\n\t\t\t\t\t\t\t'capital_improvement': line.get('capital_improvement', None),\n\t\t\t\t\t\t\t'substantial_rehab': line.get('substantial_rehab', None),'ellis_act_withdrawal': line.get('ellis_act_withdrawal', None),\n\t\t\t\t\t\t\t'condo_conversion': line.get('condo_conversion', None),'roommate_same_unit': line.get('roommate_same_unit', None),\n\t\t\t\t\t\t\t'other_cause': line.get('other_cause', None),'late_payments': line.get('late_payments', None),\n\t\t\t\t\t\t\t'lead_remediation': line.get('lead_remediation', None),'development': line.get('development', None),\n\t\t\t\t\t\t\t'good_samaritan_ends': line.get('good_samaritan_ends', None),'constraints_date': line.get('constraints_date', None),\n\t\t\t\t\t\t\t'supervisor_district': 
line.get('supervisor_district', None),'neighborhood': line.get('neighborhood', None)\n\t\t\t\t\t\t\t } for line in json_content))\n\t\t\t\t\t\tconn.commit()\t\t\n\t\t\t\n\t\t\t\n\t\tif self.source_data_type == 'csv':\n\t\t\t\n\t\t\tprint('inserting csv...')\n\n\t\t\tfile = io.StringIO(file_content)\n\t\t\t\n\t\t\tsql = \"COPY %s FROM STDIN DELIMITER ','\"\n\t\t\tif self.header == True:\n\t\t\t\tsql = \"COPY %s FROM STDIN DELIMITER ',' CSV HEADER\"\n\t\t\t\n\t\t\tschema = self.schema\n\t\t\tif isinstance(self.schema, tuple):\n\t\t\t\tschema = self.schema[0]\n\t\t\t\n\t\t\ttable = self.table\t\n\t\t\tif isinstance(self.table, tuple):\n\t\t\t\ttable = self.table[0]\t\n\t\t\t\t\n\t\t\ttable = f'{schema}.{table}'\t\n\t\t\t\n\t\t\twith closing(pg_hook.get_conn()) as conn:\n\t\t\t\twith closing(conn.cursor()) as cur:\n\t\t\t\t\tcur.copy_expert(sql=sql % table, file=file)\n\t\t\t\t\tconn.commit()\n\t\t\n\t\tprint('inserting complete...')\n"
  },
  {
    "path": "dags/operators/soda_to_s3_operator.py",
    "content": "# echo \"\" > /home/airflow/airflow/dags/operators/soda_to_s3_operator.py\n# nano /home/airflow/airflow/dags/operators/soda_to_s3_operator.py\n\nfrom airflow.models.baseoperator import BaseOperator\nfrom airflow.utils.decorators import apply_defaults\nfrom airflow.hooks.http_hook import HttpHook\nfrom airflow.hooks.S3_hook import S3Hook\n\nfrom datetime import datetime, timedelta\nimport json\nimport sys\n\n\nclass SizeExceededError(Exception):\n\t\"\"\"Raised when max file size is exceeded\"\"\"\n\tdef __init__(self):\n\t\tself.message = 'Max file size exceeded'\n\n\tdef __str__(self):\n\t\treturn f'SizeExceededError, {self.message}'\n\n\nclass SodaToS3Operator(BaseOperator):\n\t\"\"\" \n\tQueries the Socrata Open Data API using a SoQL string and uploads the results to an S3 bucket.\n\t\n\t:param endpoint:\t\tOptional API connection endpoint\n\t:param data:\t\t\tCustom Socrata SoQL string used to query API, overrides default get request\n\t:param days_ago:\t\tRestricts get request to updated/created records from specified date onward\n\t:param headers:\t\t\tDictionary containing optional API connection keys (keyId, keySecret, Accept)\n\t:param s3_conn_id:\t\tS3 Connection ID\n\t:param s3_bucket:\t\tS3 Bucket Destination\n\t:param s3_directory:\t\tS3 Directory Destination\n\t:param method:\t\t\tRequest type for API\n\t:param http_conn_id:\t\tSODA API Connection ID\n\t:param size_check:\t\tBoolean indicating whether to run a size check prior to upload to S3\n\t:param max_bytes:\t\tMaximum number of bytes to allow for a single S3 upload\t\t\n\t\"\"\"\n\t\n\t@apply_defaults\n\tdef __init__(self,\n\t\tendpoint=None,\n\t\tdata=None,\n\t\tdays_ago=None,\n\t\theaders=None,\n\t\ts3_conn_id=None,\n\t\ts3_bucket=None,\n\t\ts3_directory='',\n\t\tmethod='GET',\n\t\thttp_conn_id='http_default',\n\t\tsize_check=False,\n\t\tmax_bytes=5000000000,\n\t\t*args, \n\t\t**kwargs) -> None: \n\t\t\n\t\tsuper().__init__(*args, **kwargs)\n\t\t\n\t\tself.endpoint = 
endpoint\n\t\tself.data = data\n\t\tself.days_ago = days_ago\n\t\tself.s3_conn_id = s3_conn_id\n\t\tself.s3_bucket = s3_bucket\n\t\tself.s3_directory = s3_directory\n\t\tself.headers = headers\n\t\tself.method = method\n\t\tself.http_conn_id = http_conn_id\n\t\tself.size_check = size_check\n\t\tself.max_bytes = max_bytes\n\t\n\t\n\tdef get_size(self, obj, seen=None):\n\t\t\"\"\"\n\t\tRecursively finds size of object.\n\t\t\"\"\"\n\t\t\n\t\tsize = sys.getsizeof(obj)\n\t\tif seen is None:\n\t\t\tseen = set()\n\t\tobj_id = id(obj)\n\t\tif obj_id in seen:\n\t\t\treturn 0\n\t\tseen.add(obj_id)\n\t\tif isinstance(obj, dict):\n\t\t\tsize += sum([self.get_size(v, seen) for v in obj.values()])\n\t\t\tsize += sum([self.get_size(k, seen) for k in obj.keys()])\n\t\telif hasattr(obj, '__dict__'):\n\t\t\tsize += self.get_size(obj.__dict__, seen)\n\t\telif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):\n\t\t\tsize += sum([self.get_size(i, seen) for i in obj])\n\t\treturn size\n\t\n\t\n\tdef parse_metadata(self, header):\n\t\t\"\"\"\n\t\tParses metadata from API response.\n\t\t\"\"\"\n\t\t\n\t\ttry:\n\t\t\tmetadata = {\n\t\t\t\t'api-call-date': header['Date'],\n\t\t\t\t'content-type': header['Content-Type'],\n\t\t\t\t'source-last-modified': header['X-SODA2-Truth-Last-Modified'],\n\t\t\t\t'fields': header['X-SODA2-Fields'],\n\t\t\t\t'types': header['X-SODA2-Types']\n\t\t\t}\n\t\texcept KeyError:\n\t\t\tmetadata = {'KeyError': 'Metadata missing from header, see error log.'}\n    \n\t\treturn metadata\n\n\t\n\tdef execute(self, context):\n\t\t\"\"\"\n\t\tExecutes the operator, including running a max filesize check if enabled. \n\t\t\n\t\tThe SODA API maxes out the # of returned results so we use paging to query\n\t\tthe endpoint multiple times and continuously move the offset. 
\n\t\t\n\t\tMetadata is parsed and saved separately in a /logs/ subfolder along with \n\t\tthe JSON results from API call.\n\t\t\"\"\"\n\t\t\n\t\tsoda = HttpHook(method=self.method, http_conn_id=self.http_conn_id)\n\t\t\n\t\tif self.data:\n\t\t\tsoql_filter = self.data\n\t\telif self.days_ago:\n\t\t\tcurrent_dt = datetime.now()\n\t\t\ttarget_dt = current_dt - timedelta(self.days_ago)\n\t\t\tformat_dt = target_dt.strftime(\"%Y-%m-%d\")\n\t\t\tsoql_filter = f\"\"\"$query=SELECT:*,* WHERE :created_at > '{format_dt}' OR :updated_at > '{format_dt}'\n\t\t\t\t\t\t\t   ORDER BY :id LIMIT 10000\"\"\"\n\t\telse:\n\t\t\tsoql_filter = \"\"\"$query=SELECT:*,* ORDER BY :id LIMIT 10000\"\"\"\n\t\t\n\t\tprint('getting... ' + soql_filter)\n\t\t\n\t\t#soql_filter = f\"\"\"$query=SELECT:*,* WHERE :created_at < '2020-04-01' ORDER BY :id LIMIT 10000\"\"\"\n\t\t\n\t\toffset, counter = 0, 1\n\t\tcombined = []\n\t\twhile True:\n\t\t\tsoql_filter_offset = soql_filter + f' OFFSET {offset}'\n\t\t\tresponse = soda.run(endpoint=self.endpoint, data=soql_filter_offset, headers=self.headers)\n\t\t\tif response.status_code != 200:\n\t\t\t\tbreak\n\t\t\tcaptured = response.json()\n\t\t\tif len(captured) == 0:\n\t\t\t\tbreak\n\t\t\tcombined.extend(captured)\n\t\t\toffset = 10000 * counter\n\t\t\tcounter += 1\n\n\t\tif self.size_check == True:\n\t\t\tprint('actual size... ' + str(self.get_size(combined)))\n\t\t\tprint('max size... 
' + str(self.max_bytes))\n\t\t\tif self.get_size(combined) > self.max_bytes:\n\t\t\t\traise SizeExceededError\n\t\t\n\t\tdest_s3 = S3Hook(self.s3_conn_id)\n\t\t\n\t\tbody_obj = 'soda_evictions_import_' + datetime.now().strftime(\"%Y-%m-%dT%H%M%S\") + '.json'\n\t\t\n\t\tmetadata = self.parse_metadata(response.headers)\n\t\tmeta_obj = 'logs/soda_evictions_import_log_' + datetime.now().strftime(\"%Y-%m-%dT%H%M%S\")\n\t\t\n\t\tdest_s3.load_string(json.dumps(combined), key=self.s3_directory+'/'+body_obj, bucket_name=self.s3_bucket)\n\t\tdest_s3.load_string(json.dumps(metadata), key=self.s3_directory+'/'+meta_obj, bucket_name=self.s3_bucket)\n\t\t\n\t\t# XCom used to skip downstream tasks if body object size is 0\n\t\tself.xcom_push(context=context, key='obj_len', value=len(combined))\n"
  },
  {
    "path": "dags/sql/full_load.sql",
    "content": "-- echo \"\" > /home/airflow/airflow/dags/sql/full_load.sql\n-- nano /home/airflow/airflow/dags/sql/full_load.sql\n\n-- Populate District Dimension\n\nINSERT INTO staging.dim_district (district_key, district)\nSELECT -1, 'Unknown';\n\nINSERT INTO staging.dim_district (district, population, households, percent_asian, percent_black, percent_white, percent_native_am,\n\t\t\t\tpercent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\t\t\t\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\t\t\t\tmedian_family_income, per_capita_income, percent_in_poverty)\nSELECT \n\tdistrict,\n\tpopulation::int,\n\thouseholds::int,\n\tperc_asian::numeric as percent_asian,\n\tperc_black::numeric as percent_black,\n\tperc_white::numeric as percent_white,\n\tperc_nat_am::numeric as percent_native_am,\n\tperc_nat_pac::numeric as percent_pacific_isle,\n\tperc_other::numeric as percent_other_race,\n\tperc_latin::numeric as percent_latin,\n\tmedian_age::numeric,\n\ttotal_units::int,\n\tperc_owner_occupied::numeric as percent_owner_occupied,\n\tperc_renter_occupied::numeric as percent_renter_occupied,\n\tmedian_rent_as_perc_of_income::numeric,\n\tmedian_household_income::numeric,\n\tmedian_family_income::numeric,\n\tper_capita_income::numeric,\n\tperc_in_poverty::numeric as percent_in_poverty\nFROM raw.district_data;\t\n\n\n-- Populate Neighborhood Dimension\n\nINSERT INTO staging.dim_neighborhood (neighborhood_key, neighborhood)\nSELECT -1, 'Unknown';\n\nINSERT INTO staging.dim_neighborhood (neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, \n\t\t\t\tpercent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\t\t\t\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\t\t\t\tmedian_family_income, per_capita_income, 
percent_in_poverty)\nSELECT \n\tacs_name as neighborhood,\n\tdb_name as neighborhood_alt_name,\n\tpopulation::int,\n\thouseholds::int,\n\tperc_asian::numeric as percent_asian,\n\tperc_black::numeric as percent_black,\n\tperc_white::numeric as percent_white,\n\tperc_nat_am::numeric as percent_native_am,\n\tperc_nat_pac::numeric as percent_pacific_isle,\n\tperc_other::numeric as percent_other_race,\n\tperc_latin::numeric as percent_latin,\n\tmedian_age::numeric,\n\ttotal_units::int,\n\tperc_owner_occupied::numeric as percent_owner_occupied,\n\tperc_renter_occupied::numeric as percent_renter_occupied,\n\tmedian_rent_as_perc_of_income::numeric,\n\tmedian_household_income::numeric,\n\tmedian_family_income::numeric,\n\tper_capita_income::numeric,\n\tperc_in_poverty::numeric as percent_in_poverty\nFROM raw.neighborhood_data;\t\n\n\n-- Populate Location Dimension\n\nINSERT INTO staging.dim_location (location_key, city, state, zip_code)\nSELECT -1, 'Unknown', 'Unknown', 'Unknown';\n\nINSERT INTO staging.dim_location (city, state, zip_code)\nSELECT DISTINCT\n\tCOALESCE(city, 'Unknown') as city,\n\tCOALESCE(state, 'Unknown') as state,\n\tCOALESCE(zip, 'Unknown') as zip_code\nFROM raw.soda_evictions\nWHERE \n city IS NOT NULL OR state IS NOT NULL OR zip IS NOT NULL;\n\n\n-- Populate Reason Dimension\n\nINSERT INTO staging.dim_reason (reason_key, reason_code, reason_desc)\nVALUES (-1, 'Unknown', 'Unknown');\n\nINSERT INTO staging.dim_reason (reason_code, reason_desc)\nVALUES \t('non_payment', 'Non-Payment'),\n\t('breach', 'Breach'),\n\t('nuisance', 'Nuisance'),\n\t('illegal_use', 'Illegal Use'),\n\t('failure_to_sign_renewal', 'Failure to Sign Renewal'),\n\t('access_denial', 'Access Denial'),\n\t('unapproved_subtenant', 'Unapproved Subtenant'),\n\t('owner_move_in', 'Owner Move-In'),\n\t('demolition', 'Demolition'),\n\t('capital_improvement', 'Capital Improvement'),\n\t('substantial_rehab', 'Substantial Rehab'),\n\t('ellis_act_withdrawal', 'Ellis Act 
Withdrawal'),\n\t('condo_conversion', 'Condo Conversion'),\n\t('roommate_same_unit', 'Roommate Same Unit'),\n\t('other_cause', 'Other Cause'),\n\t('late_payments', 'Late Payments'),\n\t('lead_remediation', 'Lead Remediation'),\n\t('development', 'Development'),\n\t('good_samaritan_ends', 'Good Samaritan Ends');\n\n\t\n-- Populate Reason Bridge Table\n\nSELECT \n\tROW_NUMBER() OVER(ORDER BY concat_reason) as group_key,\n\tstring_to_array(concat_reason, '|') as reason_array,\n\tconcat_reason\nINTO TEMP tmp_reason_group\nFROM (\n\tSELECT DISTINCT\n\t\tTRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)) as concat_reason\n\tFROM (\n\t\tSELECT\n\t\t\teviction_id,\n\t\t\tCASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||\n\t\t\tCASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||\n\t\t\tCASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||\n\t\t\tCASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||\n\t\t\tCASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||\n\t\t\tCASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||\n\t\t\tCASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||\n\t\t\tCASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||\n\t\t\tCASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||\n\t\t\tCASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||\n\t\t\tCASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||\n\t\t\tCASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||\n\t\t\tCASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||\n\t\t\tCASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||\n\t\t\tCASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||\n\t\t\tCASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||\n\t\t\tCASE WHEN lead_remediation 
= 'true' THEN 'lead_remediation|' ELSE '' END||\n\t\t\tCASE WHEN development = 'true' THEN 'development|' ELSE '' END||\n\t\t\tCASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END\n\t\t\t\tas concat_reason\n\t\tFROM raw.soda_evictions\n\t\t) f1\n\t) f2;\n\nINSERT INTO staging.br_reason_group (reason_group_key, reason_key)\nSELECT DISTINCT\n\tgroup_key as reason_group_key,\n\treason_key\nFROM (SELECT group_key, unnest(reason_array) unnested FROM tmp_reason_group) grp\nJOIN staging.dim_Reason r ON r.reason_code = grp.unnested;\t\n\n\n-- Populate Date Dimension Table\n\nINSERT INTO staging.dim_date (date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week, \n\t\t\t\tformatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday,\n\t\t\t\tperiod, cw_start, cw_end, month_start, month_end)\nSELECT -1, '1900-01-01', -1, -1, 'Unknown', -1, -1, 'Unknown', -1, 'Unknown', 'Unknown', 'Unknown', 'Unknown',\n\t\t'Unknown', 'Unknown', 'Unknown', 'Unknown', '1900-01-01', '1900-01-01', '1900-01-01', '1900-01-01';\n\t\t\n\nINSERT INTO staging.dim_date (date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week, \n\t\t\t\tformatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday,\n\t\t\t\tperiod, cw_start, cw_end, month_start, month_end)\nSELECT\n\tTO_CHAR(datum, 'yyyymmdd')::int as date_key,\n\tdatum as date,\n\tEXTRACT(YEAR FROM datum) as year,\n\tEXTRACT(MONTH FROM datum) as month,\n\tTO_CHAR(datum, 'TMMonth') as month_name,\n\tEXTRACT(DAY FROM datum) as day,\n\tEXTRACT(doy FROM datum) as day_of_year,\n\tTO_CHAR(datum, 'TMDay') as weekday_name,\n\tEXTRACT(week FROM datum) as calendar_week,\n\tTO_CHAR(datum, 'dd. mm. 
yyyy') as formatted_date,\n\t'Q' || TO_CHAR(datum, 'Q') as quartal,\n\tTO_CHAR(datum, 'yyyy/\"Q\"Q') as year_quartal,\n\tTO_CHAR(datum, 'yyyy/mm') as year_month,\n\tTO_CHAR(datum, 'iyyy/IW') as year_calendar_week,\n\tCASE WHEN EXTRACT(isodow FROM datum) IN (6, 7) THEN 'Weekend' ELSE 'Weekday' END as weekend,\n\tCASE WHEN TO_CHAR(datum, 'MMDD') IN ('0101', '0704', '1225', '1226') THEN 'Holiday' ELSE 'No holiday' END\n\t\t\tas us_holiday,\n\tCASE WHEN TO_CHAR(datum, 'MMDD') BETWEEN '0701' AND '0831' THEN 'Summer break'\n\t     WHEN TO_CHAR(datum, 'MMDD') BETWEEN '1115' AND '1225' THEN 'Christmas season'\n\t     WHEN TO_CHAR(datum, 'MMDD') > '1225' OR TO_CHAR(datum, 'MMDD') <= '0106' THEN 'Winter break'\n\t\t ELSE 'Normal' END\n\t\t\tas period,\n\tdatum + (1 - EXTRACT(isodow FROM datum))::int as cw_start,\n\tdatum + (7 - EXTRACT(isodow FROM datum))::int as cw_end,\n\tdatum + (1 - EXTRACT(DAY FROM datum))::int as month_start,\n\t(datum + (1 - EXTRACT(DAY FROM datum))::int + '1 month'::interval)::date - '1 day'::interval as month_end\nFROM (\n\tSELECT '1997-01-01'::date + SEQUENCE.DAY as datum\n\tFROM generate_series(0,10956) as SEQUENCE(DAY)\n\tGROUP BY SEQUENCE.DAY\n     ) DQ;\n\n\n-- Populate Evictions Fact Table\n-- tmp_reason_facts is created as a TEMP table (like tmp_reason_group) so an aborted run does not leave it behind\n\nSELECT \n\teviction_id,\n\tgroup_key as reason_group_key\nINTO TEMP tmp_reason_facts\nFROM (\n\tSELECT \n\t\teviction_id,\n\t\tTRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)) as concat_reason\n\tFROM (\n\t\tSELECT\n\t\t\teviction_id,\n\t\t\tCASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||\n\t\t\tCASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||\n\t\t\tCASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||\n\t\t\tCASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||\n\t\t\tCASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||\n\t\t\tCASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||\n\t\t\tCASE WHEN unapproved_subtenant = 'true' 
THEN 'unapproved_subtenant|' ELSE '' END||\n\t\t\tCASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||\n\t\t\tCASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||\n\t\t\tCASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||\n\t\t\tCASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||\n\t\t\tCASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||\n\t\t\tCASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||\n\t\t\tCASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||\n\t\t\tCASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||\n\t\t\tCASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||\n\t\t\tCASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||\n\t\t\tCASE WHEN development = 'true' THEN 'development|' ELSE '' END||\n\t\t\tCASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END\n\t\t\t\tas concat_reason\n\t\tFROM raw.soda_evictions\n\t\t) grp\n\t) f_grp\nJOIN tmp_reason_group t_grp ON f_grp.concat_reason = t_grp.concat_reason;\t\n\n\nINSERT INTO staging.fact_evictions (eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, \n\t\t\t\t\t\t\t\t\tconstraints_date_key, street_address)\nSELECT \n\tf.eviction_id as eviction_key,\n\tCOALESCE(d.district_key, -1) as district_key,\n\tCOALESCE(n.neighborhood_key, -1) as neighborhood_key,\n\tCOALESCE(l.location_key, -1) as location_key,\n\treason_group_key,\n\tCOALESCE(dt1.date_key, -1) as file_date_key,\n\tCOALESCE(dt2.date_key, -1) as constraints_date_key,\n\tf.address as street_address\nFROM raw.soda_evictions f\nLEFT JOIN tmp_reason_facts r ON f.eviction_id = r.eviction_id\nLEFT JOIN staging.dim_district d ON f.supervisor_district = d.district\nLEFT JOIN staging.dim_neighborhood n ON f.neighborhood = n.neighborhood_alt_name\nLEFT JOIN staging.dim_location l \n\tON 
COALESCE(f.city, 'Unknown') = l.city\n\tAND COALESCE(f.state, 'Unknown') = l.state\n\tAND COALESCE(f.zip, 'Unknown') = l.zip_code\nLEFT JOIN staging.dim_date dt1 ON f.file_date = dt1.date\nLEFT JOIN staging.dim_date dt2 ON f.constraints_date = dt2.date;\n\nDROP TABLE tmp_reason_group;\nDROP TABLE tmp_reason_facts;\n\n\t\t     \n-- Migrate to Production Schema\n\nINSERT INTO prod.dim_district \n\t(district_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,\n\tpercent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\tmedian_family_income, per_capita_income, percent_in_poverty)\nSELECT \n\tdistrict_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,\n\tpercent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\tmedian_family_income, per_capita_income, percent_in_poverty\nFROM staging.dim_district;\n\nINSERT INTO prod.dim_neighborhood\n\t(neighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, \n\tpercent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\tmedian_family_income, per_capita_income, percent_in_poverty)\nSELECT\n\tneighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, \n\tpercent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\tmedian_family_income, 
per_capita_income, percent_in_poverty\nFROM staging.dim_neighborhood;\n\nINSERT INTO prod.dim_location (location_key, city, state, zip_code)\nSELECT location_key, city, state, zip_code\nFROM staging.dim_location;\n\nINSERT INTO prod.dim_reason (reason_key, reason_code, reason_desc)\nSELECT reason_key, reason_code, reason_desc\nFROM staging.dim_reason;\n\nINSERT INTO prod.br_reason_group (reason_group_key, reason_key)\nSELECT reason_group_key, reason_key\nFROM staging.br_reason_group;\n\nINSERT INTO prod.dim_date \n\t(date_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week, \n\tformatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday,\n\tperiod, cw_start, cw_end, month_start, month_end)\nSELECT \n\tdate_key, date, year, month, month_name, day, day_of_year, weekday_name, calendar_week, \n\tformatted_date, quartal, year_quartal, year_month, year_calendar_week, weekend, us_holiday, period, \n\tcw_start, cw_end, month_start, month_end\nFROM staging.dim_date;\n\nINSERT INTO prod.fact_evictions \n\t(eviction_key, district_key, neighborhood_key, location_key, reason_group_key, \n\tfile_date_key, constraints_date_key, street_address)\nSELECT \n\teviction_key, district_key, neighborhood_key, location_key, reason_group_key, \n\tfile_date_key, constraints_date_key, street_address\nFROM staging.fact_evictions;\n"
  },
  {
    "path": "dags/sql/incremental_load.sql",
    "content": "-- echo \"\" > /home/airflow/airflow/dags/sql/incremental_load.sql\n-- nano /home/airflow/airflow/dags/sql/incremental_load.sql\n\n-- Populate District Dimension\n\nINSERT INTO staging.dim_district\n\t(district, population, households, percent_asian, percent_black, percent_white, percent_native_am,\n\tpercent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\tmedian_family_income, per_capita_income, percent_in_poverty)\nSELECT \n\tdistrict,\n\tpopulation::int,\n\thouseholds::int,\n\tperc_asian::numeric as percent_asian,\n\tperc_black::numeric as percent_black,\n\tperc_white::numeric as percent_white,\n\tperc_nat_am::numeric as percent_native_am,\n\tperc_nat_pac::numeric as percent_pacific_isle,\n\tperc_other::numeric as percent_other_race,\n\tperc_latin::numeric as percent_latin,\n\tmedian_age::numeric,\n\ttotal_units::int,\n\tperc_owner_occupied::numeric as percent_owner_occupied,\n\tperc_renter_occupied::numeric as percent_renter_occupied,\n\tmedian_rent_as_perc_of_income::numeric,\n\tmedian_household_income::numeric,\n\tmedian_family_income::numeric, \t\t\t\n\tper_capita_income::numeric,\n\tperc_in_poverty::numeric as percent_in_poverty\nFROM raw.district_data\n\tON CONFLICT (district) DO UPDATE SET \n\t\tpopulation = EXCLUDED.population,\n\t\thouseholds = EXCLUDED.households,\n\t\tpercent_asian = EXCLUDED.percent_asian,\n\t\tpercent_black = EXCLUDED.percent_black,\n\t\tpercent_white = EXCLUDED.percent_white,\n\t\tpercent_native_am = EXCLUDED.percent_native_am,\n\t\tpercent_pacific_isle = EXCLUDED.percent_pacific_isle,\n\t\tpercent_other_race = EXCLUDED.percent_other_race,\n\t\tpercent_latin = EXCLUDED.percent_latin,\n\t\tmedian_age = EXCLUDED.median_age,\n\t\ttotal_units = EXCLUDED.total_units,\n\t\tpercent_owner_occupied = EXCLUDED.percent_owner_occupied,\n\t\tpercent_renter_occupied = 
EXCLUDED.percent_renter_occupied,\n\t\tmedian_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,\n\t\tmedian_household_income = EXCLUDED.median_household_income,\n\t\tmedian_family_income = EXCLUDED.median_family_income,\n\t\tper_capita_income = EXCLUDED.per_capita_income,\n\t\tpercent_in_poverty = EXCLUDED.percent_in_poverty;\n\n\n-- Populate Neighborhood Dimension\n\nINSERT INTO staging.dim_neighborhood\n\t(neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, \n\tpercent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\tmedian_family_income, per_capita_income, percent_in_poverty)\nSELECT \n\tacs_name as neighborhood,\n\tdb_name as neighborhood_alt_name,\n\tpopulation::int,\n\thouseholds::int,\n\tperc_asian::numeric as percent_asian,\n\tperc_black::numeric as percent_black,\n\tperc_white::numeric as percent_white,\n\tperc_nat_am::numeric as percent_native_am,\n\tperc_nat_pac::numeric as percent_pacific_isle,\n\tperc_other::numeric as percent_other_race,\n\tperc_latin::numeric as percent_latin,\n\tmedian_age::numeric,\n\ttotal_units::int,\n\tperc_owner_occupied::numeric as percent_owner_occupied,\n\tperc_renter_occupied::numeric as percent_renter_occupied,\n\tmedian_rent_as_perc_of_income::numeric,\n\tmedian_household_income::numeric,\n\tmedian_family_income::numeric,\n\tper_capita_income::numeric,\n\tperc_in_poverty::numeric as percent_in_poverty\nFROM raw.neighborhood_data\n\tON CONFLICT (neighborhood) DO UPDATE SET\n\t\tneighborhood_alt_name = EXCLUDED.neighborhood_alt_name,\n\t\tpopulation = EXCLUDED.population,\n\t\thouseholds = EXCLUDED.households,\n\t\tpercent_asian = EXCLUDED.percent_asian,\n\t\tpercent_black = EXCLUDED.percent_black,\n\t\tpercent_white = EXCLUDED.percent_white,\n\t\tpercent_native_am = 
EXCLUDED.percent_native_am,\n\t\tpercent_pacific_isle = EXCLUDED.percent_pacific_isle,\n\t\tpercent_other_race = EXCLUDED.percent_other_race,\n\t\tpercent_latin = EXCLUDED.percent_latin,\n\t\tmedian_age = EXCLUDED.median_age,\n\t\ttotal_units = EXCLUDED.total_units,\n\t\tpercent_owner_occupied = EXCLUDED.percent_owner_occupied,\n\t\tpercent_renter_occupied = EXCLUDED.percent_renter_occupied,\n\t\tmedian_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,\n\t\tmedian_household_income = EXCLUDED.median_household_income,\n\t\tmedian_family_income = EXCLUDED.median_family_income,\n\t\tper_capita_income = EXCLUDED.per_capita_income,\n\t\tpercent_in_poverty = EXCLUDED.percent_in_poverty;\n\n\n-- Populate Location Dimension\n\nINSERT INTO staging.dim_location (city, state, zip_code)\nSELECT \n\tse.city,\n\tse.state,\n\tse.zip_code\nFROM (\n\tSELECT DISTINCT\n\t\tCOALESCE(city, 'Unknown') as city,\n\t\tCOALESCE(state, 'Unknown') as state,\n\t\tCOALESCE(zip, 'Unknown') as zip_code\n\tFROM raw.soda_evictions\n\t) se\nLEFT JOIN staging.dim_location dl \n\tON se.city = dl.city\n\tAND se.state = dl.state\n\tAND se.zip_code = dl.zip_code\nWHERE \n\tdl.location_key IS NULL;\n\t\n\t\n-- Populate Reason Bridge Table\n\nSELECT DISTINCT\n\treason_group_key,\n\tARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array\nINTO TEMP tmp_existing_reason_groups\nFROM staging.br_reason_group\nGROUP BY reason_group_key; \n\nSELECT \n\tconcat_reason,\n\tARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array\nINTO TEMP tmp_new_reason_groups\nFROM (\n\tSELECT DISTINCT\n\t\tstring_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|') as concat_reason,\n\t\tunnest(string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|')) unnested_reason\n\tFROM (\n\t\tSELECT DISTINCT\n\t\t\tCASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||\n\t\t\tCASE WHEN breach = 'true' 
THEN 'breach|' ELSE '' END||\n\t\t\tCASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||\n\t\t\tCASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||\n\t\t\tCASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||\n\t\t\tCASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||\n\t\t\tCASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||\n\t\t\tCASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||\n\t\t\tCASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||\n\t\t\tCASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||\n\t\t\tCASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||\n\t\t\tCASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||\n\t\t\tCASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||\n\t\t\tCASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' ELSE '' END||\n\t\t\tCASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||\n\t\t\tCASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||\n\t\t\tCASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||\n\t\t\tCASE WHEN development = 'true' THEN 'development|' ELSE '' END||\n\t\t\tCASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END\n\t\t\t\tas concat_reason\n\t\tFROM raw.soda_evictions\n\t\t) se1\n\tGROUP BY concat_reason\n\t) se2\nJOIN staging.dim_reason r ON se2.unnested_reason = r.reason_code\nGROUP BY concat_reason; \n\nINSERT INTO staging.br_reason_group (reason_group_key, reason_key)\nSELECT\n\tfinal_grp.max_key + new_grp.tmp_group_key as reason_group_key,\n\tnew_grp.reason_key as reason_key\nFROM (\n\tSELECT DISTINCT\n\t\tROW_NUMBER() OVER(ORDER BY concat_reason) as tmp_group_key,\n\t\tconcat_reason,\n\t\tunnest(n.rk_array) as reason_key\n\tFROM tmp_new_reason_groups n\n\tLEFT JOIN tmp_existing_reason_groups e ON 
n.rk_array = e.rk_array\n\tWHERE e.reason_group_key IS NULL\n\t) new_grp\nLEFT JOIN (SELECT MAX(reason_group_key) max_key FROM staging.br_reason_group) final_grp ON 1=1\nORDER BY reason_group_key, reason_key;\n\nDROP TABLE tmp_existing_reason_groups;\nDROP TABLE tmp_new_reason_groups;\n\n\t\t\t\t\t    \t    \n-- Populate Staging Fact Table\n\nSELECT DISTINCT\n\treason_group_key,\n\tARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array\nINTO TEMP tmp_existing_reason_groups\nFROM staging.br_reason_group\nGROUP BY reason_group_key; \n\t\t\t\t\t    \nSELECT \n\teviction_id,\n\tARRAY_AGG(reason_key ORDER BY reason_key ASC) as rk_array\nINTO TEMP tmp_fct_reason_groups\t\nFROM (\n\tSELECT \n\t\teviction_id,\n\t\tunnest(string_to_array(TRIM(TRAILING '|' FROM (CASE WHEN concat_reason = '' THEN 'Unknown' ELSE concat_reason END)), '|')) as unnested_reason\n\tFROM (\n\t\tSELECT \n\t\t\teviction_id,\n\t\t\tCASE WHEN non_payment = 'true' THEN 'non_payment|' ELSE '' END||\n\t\t\tCASE WHEN breach = 'true' THEN 'breach|' ELSE '' END||\n\t\t\tCASE WHEN nuisance = 'true' THEN 'nuisance|' ELSE '' END||\n\t\t\tCASE WHEN illegal_use = 'true' THEN 'illegal_use|' ELSE '' END||\n\t\t\tCASE WHEN failure_to_sign_renewal = 'true' THEN 'failure_to_sign_renewal|' ELSE '' END||\n\t\t\tCASE WHEN access_denial = 'true' THEN 'access_denial|' ELSE '' END||\n\t\t\tCASE WHEN unapproved_subtenant = 'true' THEN 'unapproved_subtenant|' ELSE '' END||\n\t\t\tCASE WHEN owner_move_in = 'true' THEN 'owner_move_in|' ELSE '' END||\n\t\t\tCASE WHEN demolition = 'true' THEN 'demolition|' ELSE '' END||\n\t\t\tCASE WHEN capital_improvement = 'true' THEN 'capital_improvement|' ELSE '' END||\n\t\t\tCASE WHEN substantial_rehab = 'true' THEN 'substantial_rehab|' ELSE '' END||\n\t\t\tCASE WHEN ellis_act_withdrawal = 'true' THEN 'ellis_act_withdrawal|' ELSE '' END||\n\t\t\tCASE WHEN condo_conversion = 'true' THEN 'condo_conversion|' ELSE '' END||\n\t\t\tCASE WHEN roommate_same_unit = 'true' THEN 'roommate_same_unit|' 
ELSE '' END||\n\t\t\tCASE WHEN other_cause = 'true' THEN 'other_cause|' ELSE '' END||\n\t\t\tCASE WHEN late_payments = 'true' THEN 'late_payments|' ELSE '' END||\n\t\t\tCASE WHEN lead_remediation = 'true' THEN 'lead_remediation|' ELSE '' END||\n\t\t\tCASE WHEN development = 'true' THEN 'development|' ELSE '' END||\n\t\t\tCASE WHEN good_samaritan_ends = 'true' THEN 'good_samaritan_ends|' ELSE '' END\n\t\t\t\tas concat_reason\n\t\tFROM raw.soda_evictions\n\t\t) se1\n\t) se2\nJOIN staging.dim_reason r ON se2.unnested_reason = r.reason_code\t\nGROUP BY se2.eviction_id; \n\t\t\t\t\t    \n-- Created as a TEMP table, like the other working tables, so an aborted run does not leave it behind\nSELECT\n\teviction_id, \n\treason_group_key\nINTO TEMP tmp_reason_group_lookup\nFROM tmp_fct_reason_groups f\nJOIN tmp_existing_reason_groups d ON f.rk_array = d.rk_array;\t\t\t    \n\n\nTRUNCATE TABLE staging.fact_evictions;\n\t\t\t\t\t    \nINSERT INTO staging.fact_evictions \n\t(eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, constraints_date_key, street_address)\nSELECT\n\tf.eviction_id as eviction_key,\n\tCOALESCE(d.district_key, -1) as district_key,\n\tCOALESCE(n.neighborhood_key, -1) as neighborhood_key,\n\tCOALESCE(l.location_key, -1) as location_key,\n\tr.reason_group_key as reason_group_key,\n\tCOALESCE(dt1.date_key, -1) as file_date_key,\n\tCOALESCE(dt2.date_key, -1) as constraints_date_key,\n\tf.address as street_address\nFROM raw.soda_evictions f\nJOIN tmp_reason_group_lookup r ON f.eviction_id = r.eviction_id\nLEFT JOIN staging.dim_district d ON f.supervisor_district = d.district\nLEFT JOIN staging.dim_neighborhood n ON f.neighborhood = n.neighborhood_alt_name\nLEFT JOIN staging.dim_location l \n\tON COALESCE(f.city, 'Unknown') = l.city\n\tAND COALESCE(f.state, 'Unknown') = l.state\n\tAND COALESCE(f.zip, 'Unknown') = l.zip_code\nLEFT JOIN staging.dim_date dt1 ON f.file_date = dt1.date\nLEFT JOIN staging.dim_date dt2 ON f.constraints_date = dt2.date;\n\nDROP TABLE tmp_existing_reason_groups;\nDROP TABLE tmp_fct_reason_groups;\nDROP 
TABLE tmp_reason_group_lookup;\n\t\t\t\t\t    \n\t\t\t\t\t    \n-- Merge Into Production Schema\n\nINSERT INTO prod.dim_district\n\t(district_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,\n\tpercent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\tmedian_family_income, per_capita_income, percent_in_poverty)\nSELECT \t\n\tdistrict_key, district, population, households, percent_asian, percent_black, percent_white, percent_native_am,\n\tpercent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\tmedian_family_income, per_capita_income, percent_in_poverty\nFROM staging.dim_district\t\n\tON CONFLICT(district_key) DO UPDATE SET\n\t\tdistrict = EXCLUDED.district,\n\t\tpopulation = EXCLUDED.population,\n\t\thouseholds = EXCLUDED.households,\n\t\tpercent_asian = EXCLUDED.percent_asian,\n\t\tpercent_black = EXCLUDED.percent_black,\n\t\tpercent_white = EXCLUDED.percent_white,\n\t\tpercent_native_am = EXCLUDED.percent_native_am,\n\t\tpercent_pacific_isle = EXCLUDED.percent_pacific_isle,\n\t\tpercent_other_race = EXCLUDED.percent_other_race,\n\t\tpercent_latin = EXCLUDED.percent_latin,\n\t\tmedian_age = EXCLUDED.median_age,\n\t\ttotal_units = EXCLUDED.total_units,\n\t\tpercent_owner_occupied = EXCLUDED.percent_owner_occupied,\n\t\tpercent_renter_occupied = EXCLUDED.percent_renter_occupied,\n\t\tmedian_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,\n\t\tmedian_household_income = EXCLUDED.median_household_income,\n\t\tmedian_family_income = EXCLUDED.median_family_income,\n\t\tper_capita_income = EXCLUDED.per_capita_income,\n\t\tpercent_in_poverty = EXCLUDED.percent_in_poverty;\n\n\nINSERT INTO prod.dim_neighborhood\n\t(neighborhood_key, 
neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, \n\tpercent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\tmedian_family_income, per_capita_income, percent_in_poverty)\nSELECT \n\tneighborhood_key, neighborhood, neighborhood_alt_name, population, households, percent_asian, percent_black, percent_white, \n\tpercent_native_am, percent_pacific_isle, percent_other_race, percent_latin, median_age, total_units, \n\tpercent_owner_occupied, percent_renter_occupied, median_rent_as_perc_of_income, median_household_income, \n\tmedian_family_income, per_capita_income, percent_in_poverty\nFROM staging.dim_neighborhood\n\tON CONFLICT (neighborhood_key) DO UPDATE SET\n\t\tneighborhood = EXCLUDED.neighborhood,\n\t\tneighborhood_alt_name = EXCLUDED.neighborhood_alt_name,\n\t\tpopulation = EXCLUDED.population,\n\t\thouseholds = EXCLUDED.households,\n\t\tpercent_asian = EXCLUDED.percent_asian,\n\t\tpercent_black = EXCLUDED.percent_black,\n\t\tpercent_white = EXCLUDED.percent_white,\n\t\tpercent_native_am = EXCLUDED.percent_native_am,\n\t\tpercent_pacific_isle = EXCLUDED.percent_pacific_isle,\n\t\tpercent_other_race = EXCLUDED.percent_other_race,\n\t\tpercent_latin = EXCLUDED.percent_latin,\n\t\tmedian_age = EXCLUDED.median_age,\n\t\ttotal_units = EXCLUDED.total_units,\n\t\tpercent_owner_occupied = EXCLUDED.percent_owner_occupied,\n\t\tpercent_renter_occupied = EXCLUDED.percent_renter_occupied,\n\t\tmedian_rent_as_perc_of_income = EXCLUDED.median_rent_as_perc_of_income,\n\t\tmedian_household_income = EXCLUDED.median_household_income,\n\t\tmedian_family_income = EXCLUDED.median_family_income,\n\t\tper_capita_income = EXCLUDED.per_capita_income,\n\t\tpercent_in_poverty = EXCLUDED.percent_in_poverty;\n\n\nINSERT INTO prod.dim_location (location_key, city, state, 
zip_code)\nSELECT location_key, city, state, zip_code\nFROM staging.dim_location\n\tON CONFLICT (location_key) DO NOTHING;\t\n\n\t\t\t\t\t    \nINSERT INTO prod.br_reason_group (reason_group_key, reason_key)\nSELECT stg.reason_group_key, stg.reason_key \nFROM staging.br_reason_group stg\nLEFT JOIN prod.br_reason_group prd \n\tON stg.reason_group_key = prd.reason_group_key \n\tAND stg.reason_key = prd.reason_key\nWHERE \n\tprd.reason_group_key IS NULL;\n\t\t\t\t\t    \n\t\t\t\t\t    \nINSERT INTO prod.fact_evictions \n\t(eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, constraints_date_key, street_address)\nSELECT eviction_key, district_key, neighborhood_key, location_key, reason_group_key, file_date_key, constraints_date_key, street_address\nFROM staging.fact_evictions \n\tON CONFLICT (eviction_key) DO UPDATE SET \n\t\tdistrict_key = EXCLUDED.district_key,\n\t\tneighborhood_key = EXCLUDED.neighborhood_key,\n\t\tlocation_key = EXCLUDED.location_key,\n\t\treason_group_key = EXCLUDED.reason_group_key,\n\t\tfile_date_key = EXCLUDED.file_date_key,\n\t\tconstraints_date_key = EXCLUDED.constraints_date_key,\n\t\tstreet_address = EXCLUDED.street_address;\n"
  },
  {
    "path": "dags/sql/init_db_schema.sql",
    "content": "-- echo \"\" > /home/airflow/airflow/dags/sql/init_db_schema.sql\n-- nano /home/airflow/airflow/dags/sql/init_db_schema.sql\n\nDROP SCHEMA IF EXISTS raw CASCADE;\nDROP SCHEMA IF EXISTS staging CASCADE;\nDROP SCHEMA IF EXISTS prod CASCADE;\n\nCREATE SCHEMA raw;\nCREATE SCHEMA staging;\nCREATE SCHEMA prod;\n\n\n-- Raw\nCREATE UNLOGGED TABLE raw.soda_evictions (\n\traw_id text,\n\tcreated_at timestamp,\n\tupdated_at timestamp,\n\teviction_id text,\n\taddress text,\n\tcity text,\n\tstate text,\n\tzip text,\n\tfile_date timestamp,\n\tnon_payment boolean,\n\tbreach boolean,\n\tnuisance boolean,\n\tillegal_use boolean,\n\tfailure_to_sign_renewal boolean,\n\taccess_denial boolean,\n\tunapproved_subtenant boolean,\n\towner_move_in boolean,\n\tdemolition boolean,\n\tcapital_improvement boolean,\n\tsubstantial_rehab boolean,\n\tellis_act_withdrawal boolean,\n\tcondo_conversion boolean,\n\troommate_same_unit boolean,\n\tother_cause boolean,\n\tlate_payments boolean,\n\tlead_remediation boolean,\n\tdevelopment boolean,\n\tgood_samaritan_ends boolean,\n\tconstraints_date timestamp,\n\tsupervisor_district text,\n\tneighborhood text\n);\n\nCREATE UNLOGGED TABLE raw.district_data (\n\tdistrict text,\n\tpopulation text,\n\thouseholds text,\n\tperc_asian text,\n\tperc_black text,\n\tperc_white text,\n\tperc_nat_am text,\n\tperc_nat_pac text,\n\tperc_other text,\n\tperc_latin text,\n\tmedian_age text,\n\ttotal_units text,\n\tperc_owner_occupied text,\n\tperc_renter_occupied text,\n\tmedian_rent_as_perc_of_income text,\n\tmedian_household_income text,\n\tmedian_family_income text,\n\tper_capita_income text,\n\tperc_in_poverty text\n);\nCREATE UNIQUE INDEX district_name_uniq_idx ON raw.district_data (district);\n\nCREATE UNLOGGED TABLE raw.neighborhood_data (\n\tacs_name text,\n\tdb_name text,\n\tpopulation text,\n\thouseholds text,\n\tperc_asian text,\n\tperc_black text,\n\tperc_white text,\n\tperc_nat_am text,\n\tperc_nat_pac text,\n\tperc_other text,\n\tperc_latin 
text,\n\tmedian_age text,\n\ttotal_units text,\n\tperc_owner_occupied text,\n\tperc_renter_occupied text,\n\tmedian_rent_as_perc_of_income text,\n\tmedian_household_income text,\n\tmedian_family_income text,\n\tper_capita_income text,\n\tperc_in_poverty text\n);\nCREATE UNIQUE INDEX neighborhood_name_uniq_idx ON raw.neighborhood_data (acs_name);\n\n-- Staging\nCREATE TABLE staging.dim_district (\n\tdistrict_key serial PRIMARY KEY,\n\tdistrict text,\n\tpopulation integer,\n\thouseholds integer,\n\tpercent_asian numeric,\n\tpercent_black numeric,\n\tpercent_white numeric,\n\tpercent_native_am numeric,\n\tpercent_pacific_isle numeric,\n\tpercent_other_race numeric,\n\tpercent_latin numeric,\n\tmedian_age numeric,\n\ttotal_units integer,\n\tpercent_owner_occupied numeric,\n\tpercent_renter_occupied numeric,\n\tmedian_rent_as_perc_of_income numeric,\n\tmedian_household_income numeric,\n\tmedian_family_income numeric,\n\tper_capita_income numeric,\n\tpercent_in_poverty numeric\n);\nCREATE UNIQUE INDEX district_name_uniq_idx ON staging.dim_district (district);\n\nCREATE TABLE staging.dim_neighborhood (\n\tneighborhood_key serial PRIMARY KEY,\n\tneighborhood text,\n\tneighborhood_alt_name text,\n\tpopulation integer,\n\thouseholds integer,\n\tpercent_asian numeric,\n\tpercent_black numeric,\n\tpercent_white numeric,\n\tpercent_native_am numeric,\n\tpercent_pacific_isle numeric,\n\tpercent_other_race numeric,\n\tpercent_latin numeric,\n\tmedian_age numeric,\n\ttotal_units integer,\n\tpercent_owner_occupied numeric,\n\tpercent_renter_occupied numeric,\n\tmedian_rent_as_perc_of_income numeric,\n\tmedian_household_income numeric,\n\tmedian_family_income numeric,\n\tper_capita_income numeric,\n\tpercent_in_poverty numeric\n);\nCREATE UNIQUE INDEX neighborhood_name_uniq_idx ON staging.dim_neighborhood (neighborhood);\n\nCREATE TABLE staging.dim_location (\n\tlocation_key serial PRIMARY KEY,\n\tcity text,\n\tstate text,\n\tzip_code text\n);\n\nCREATE TABLE staging.dim_reason 
(\n\treason_key serial PRIMARY KEY,\n\treason_code text,\n\treason_desc text\n);\n\nCREATE TABLE staging.br_reason_group (\n\treason_group_key int,\n\treason_key int\n);\t\nCREATE INDEX reason_group_key_idx ON staging.br_reason_group (reason_group_key);\nCREATE INDEX reason_key_idx ON staging.br_reason_group (reason_key);\n\nCREATE TABLE staging.dim_date (\n\tdate_key int PRIMARY KEY,\n\tdate date,\n\tyear int,\n\tmonth int,\n\tmonth_name text,\n\tday int,\n\tday_of_year int,\n\tweekday_name text,\n\tcalendar_week int,\n\tformatted_date text,\n\tquartal text,\n\tyear_quartal text,\n\tyear_month text,\n\tyear_calendar_week text,\n\tweekend text,\n\tus_holiday text,\n\tperiod text,\n\tcw_start date,\n\tcw_end date,\n\tmonth_start date,\n\tmonth_end date\n);\n\nCREATE TABLE staging.fact_evictions (\n\teviction_key text PRIMARY KEY,\n\tlocation_key int,\n\tdistrict_key int,\n\tneighborhood_key int,\n\treason_group_key int,\n\tfile_date_key int,\n\tconstraints_date_key int,\n\tstreet_address text\n);\n\n\n-- Prod\nCREATE TABLE prod.dim_district (\n\tdistrict_key serial PRIMARY KEY,\n\tdistrict text,\n\tpopulation integer,\n\thouseholds integer,\n\tpercent_asian numeric,\n\tpercent_black numeric,\n\tpercent_white numeric,\n\tpercent_native_am numeric,\n\tpercent_pacific_isle numeric,\n\tpercent_other_race numeric,\n\tpercent_latin numeric,\n\tmedian_age numeric,\n\ttotal_units integer,\n\tpercent_owner_occupied numeric,\n\tpercent_renter_occupied numeric,\n\tmedian_rent_as_perc_of_income numeric,\n\tmedian_household_income numeric,\n\tmedian_family_income numeric,\n\tper_capita_income numeric,\n\tpercent_in_poverty numeric\n);\n\nCREATE TABLE prod.dim_neighborhood (\n\tneighborhood_key serial PRIMARY KEY,\n\tneighborhood text,\n\tneighborhood_alt_name text,\n\tpopulation integer,\n\thouseholds integer,\n\tpercent_asian numeric,\n\tpercent_black numeric,\n\tpercent_white numeric,\n\tpercent_native_am numeric,\n\tpercent_pacific_isle numeric,\n\tpercent_other_race 
numeric,\n\tpercent_latin numeric,\n\tmedian_age numeric,\n\ttotal_units integer,\n\tpercent_owner_occupied numeric,\n\tpercent_renter_occupied numeric,\n\tmedian_rent_as_perc_of_income numeric,\n\tmedian_household_income numeric,\n\tmedian_family_income numeric,\n\tper_capita_income numeric,\n\tpercent_in_poverty numeric\n);\n\nCREATE TABLE prod.dim_location (\n\tlocation_key serial PRIMARY KEY,\n\tcity text,\n\tstate text,\n\tzip_code text\n);\n\nCREATE TABLE prod.dim_reason (\n\treason_key serial PRIMARY KEY,\n\treason_code text,\n\treason_desc text\n);\n\nCREATE TABLE prod.br_reason_group (\n\treason_group_key int,\n\treason_key int\n);\t\nCREATE INDEX reason_group_key_idx ON prod.br_reason_group (reason_group_key);\nCREATE INDEX reason_key_idx ON prod.br_reason_group (reason_key);\n\nCREATE TABLE prod.dim_date (\n\tdate_key int PRIMARY KEY,\n\tdate date,\n\tyear int,\n\tmonth int,\n\tmonth_name text,\n\tday int,\n\tday_of_year int,\n\tweekday_name text,\n\tcalendar_week int,\n\tformatted_date text,\n\tquartal text,\n\tyear_quartal text,\n\tyear_month text,\n\tyear_calendar_week text,\n\tweekend text,\n\tus_holiday text,\n\tperiod text,\n\tcw_start date,\n\tcw_end date,\n\tmonth_start date,\n\tmonth_end date\n);\n\nCREATE TABLE prod.fact_evictions (\n\teviction_key text PRIMARY KEY,\n\tlocation_key int,\n\tdistrict_key int,\n\tneighborhood_key int,\n\treason_group_key int,\n\tfile_date_key int,\n\tconstraints_date_key int,\n\tstreet_address text\n);\n"
  },
  {
    "path": "dags/sql/trunc_target_tables.sql",
    "content": "-- echo \"\" > /home/airflow/airflow/dags/sql/trunc_target_tables.sql\n-- nano /home/airflow/airflow/dags/sql/trunc_target_tables.sql\nTRUNCATE TABLE raw.soda_evictions;\nTRUNCATE TABLE raw.district_data;\nTRUNCATE TABLE raw.neighborhood_data;\n"
  }
]