Repository: SocialDataSci/Geospatial_Data_with_Python
Branch: master
Commit: 5753316c20c7
Files: 3
Total size: 56.8 KB
Directory structure:
gitextract_395ygo8c/
├── .gitignore
├── Intro to Geospatial Data with Python.ipynb
└── README.md
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# Created by .ignore support plugin (hsz.mobi)
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Data
data/MN_PCA_Cedar_Lake
data/Results
data/MetCouncil_Lakes_Rivers/
data/MetroGIS_Tax_Parcels_2014/
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# dotenv
.env
# virtualenv
.venv/
venv/
ENV/
# Spyder project settings
.spyderproject
# Rope project settings
.ropeproject
### JetBrains template
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and Webstorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839
# User-specific stuff:
.idea/workspace.xml
.idea/tasks.xml
# Sensitive or high-churn files:
.idea/dataSources/
.idea/dataSources.ids
.idea/dataSources.xml
.idea/dataSources.local.xml
.idea/sqlDataSources.xml
.idea/dynamic.xml
.idea/uiDesigner.xml
# Gradle:
.idea/gradle.xml
.idea/libraries
# Mongo Explorer plugin:
.idea/mongoSettings.xml
## File-based project format:
*.iws
## Plugin-specific files:
# IntelliJ
/out/
# mpeltonen/sbt-idea plugin
.idea_modules/
# JIRA plugin
atlassian-ide-plugin.xml
# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties
================================================
FILE: Intro to Geospatial Data with Python.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-19T12:42:19.676901",
"start_time": "2017-01-19T12:42:19.665880"
}
},
"source": [
"# Intro to Geospatial Data using Python\n",
"\n",
"## Disclaimer\n",
"* Not a GIS expert, just someone with the drive to self learn.\n",
"* Going to be talking mostly about Shape (.shp) files but other formats exist (geojson, raster, etc)\n",
"* Using Python 3.5, no guarantees everything works on 2.7\n",
"\n",
"## Background\n",
"* What is GeoSpatial Data?\n",
"* Types of GeoSpatial Data\n",
"* What kinds of GeoSpatial Data is available?\n",
"* Where to get data?\n",
"\n",
"## Technical\n",
"* Getting Set Up\n",
"* Reading in Data\n",
"* Exploring Data\n",
" * Fields\n",
" * Profiling\n",
" * Visualization\n",
"* Filtering\n",
" * Data Attributes\n",
" * Geodesic Features\n",
"* Geodesic transformations\n",
" * Units of Measure\n",
" * Projections\n",
"* Geodesic Calculations\n",
" * Centroid\n",
" * Distance between points\n",
"* Joins/merges\n",
" * Joining tabular data\n",
" * Joining on geodesic features\n",
"* Creating new data\n",
" * New Fields\n",
" * New Shapes\n",
"* Writing Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is GeoSpatial Data?\n",
"The word geospatial is used to indicate that data that has a geographic component to it. This means that the records in a dataset have locational information tied to them such as geographic data in the form of coordinates, address, city, or ZIP code. GIS data is a form of geospatial data. Other geospatial data can originate from GPS data, satellite imagery, and geotagging. [1]\n",
"\n",
"\n",
"\n",
"[1]: https://www.gislounge.com/difference-gis-geospatial/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Types of Geospatial Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vector/Polygon Data\n",
"* A representation of the world using points, lines, and polygons. \n",
"* Vector models are useful for storing data that has discrete boundaries, such as country borders, land parcels, and streets.\n",
"* Common formats are Shape Files, GeoJSON, KML (Keyhole Markup Language)\n",
"* Often used by data scientists to calculate additional variables (distance to water in this example) or weight attributes based on area/density.\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Raster Data\n",
"* Rasters are digital aerial photographs, imagery from satellites, digital pictures, or even scanned maps\n",
"* Common formats are .JPG, .TIF, .GIF or similar format\n",
"* Can help answer fuzzy questions like \"how many fields were planted in county X vs left fallow?\" \n",
" * This ends up being an image recognition type problem as you are trying to planted vs fallow by coloration.\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tablular Files\n",
"* Numeric data is statistical data which includes a geographical component \n",
"* Joined with vector files so the data can be queried and displayed as a layer on a map in a GIS. \n",
"* The most common type of numeric data is demographic data from the US Census.\n",
"* Unique Identifiers (Hydrology Number, State, Metropolitan Statistical Area ID, Lat/Long, etc)\n",
"* Typically what most data scientists & statisticians work with, columns of attributes/characteristics that describe an customer/town/entity\n",
""
]
},
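{
"cell_type": "markdown",
"metadata": {},
"source": [
"**As a quick sketch of the join idea, two tables sharing a unique identifier can be merged with pandas. The identifiers and values below are made up for illustration:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# hypothetical demographic table keyed by a county FIPS code\n",
"demographics = pd.DataFrame({'FIPS': ['27053', '27123'],\n",
"                             'population': [1200000, 540000]})\n",
"\n",
"# hypothetical lookup of county names keyed by the same identifier\n",
"names = pd.DataFrame({'FIPS': ['27053', '27123'],\n",
"                      'county': ['Hennepin', 'Ramsey']})\n",
"\n",
"# join the two tables on the shared unique identifier\n",
"demographics.merge(names, on='FIPS')"
]
},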
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-19T09:49:28.537005",
"start_time": "2017-01-19T09:49:28.533503"
}
},
"source": [
"## What kinds of GeoSpatial Data is available?\n",
"\n",
"### Government\n",
"* Local (MetCouncil, Minneapolis School District)\n",
"* State (DNR, MN PCA, Hennepin County)\n",
"* Federal (Census Bureau, NASA)\n",
"\n",
"### Private\n",
"* Energy (Xcel Engery, Centerpoint, etc)\n",
"* Technology (Google, Uber, etc)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-19T09:50:06.244765",
"start_time": "2017-01-19T09:50:06.242263"
}
},
"source": [
"## Where can I get Minnesota Geospatial Data?\n",
"[](https://gisdata.mn.gov/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting Set Up"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start with Anaconda 3.5\n",
"Anaconda is the de facto industry standard for Python Scientific Computing. Without it users are left to manage dependencies, find and compile low level C libraries and generally in for a huge headache. Added bonus is that you don't need administrator privledges to install if you install only for the local user.\n",
"\n",
"### Download Here\n",
"[](https://www.continuum.io/downloads)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## Install Additional Libraries\n",
"\n",
"Conda is a packaging tool and installer that aims to do more than what pip does; handle library dependencies outside of the Python packages as well as the Python packages themselves.\n",
"\n",
"## What is Conda Forge?\n",
"\n",
"conda-forge is a github organization containing repositories of conda recipes. Each repository automatically builds its own recipe in a clean and repeatable way on Windows, Linux and OSX. \n",
"\n",
"Extremely valuable as you don't have to find and compile dependencies (which isn't fun on Linux/OSX and is a *NIGHTMARE* on Windows).\n",
"\n",
"**Enable conda-forge repositories by running following line in Terminal/CMD/Shell of your choice.**"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-20T14:17:34.768982",
"start_time": "2017-01-20T14:17:32.349269"
},
"collapsed": false
},
"source": [
"`conda config --add channels conda-forge`"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-19T10:51:48.195559",
"start_time": "2017-01-19T10:51:48.170043"
}
},
"source": [
"## Python Geospatial Libraries Covered in Notebook\n",
"\n",
"* [geopandas][] - Working with spatial data is fun again!\n",
"* [shapely][] - For geometry handling\n",
"* [rtree][] - For efficiently querying spatial data\n",
"* [pyshp][] - For reading and writing shapefiles (in _pure_ Python)\n",
"* [pyproj][] - For conversions between projections\n",
"* [fiona][] - For making it _easy_ to read/write geospatial data formats\n",
"* [ogr/gdal][] - For reading, writing, and transforming geospatial data formats\n",
"* [geopy][] - For geolocating and things like that\n",
"* [pysal][] - Spatial econometrics, exploratory spatial and spatio-temporal data analysis, spatial clustering (and more)\n",
"* [descartes][] - For plotting geometries in matplotlib\n",
"\n",
"[pandas]: http://pandas.pydata.org/\n",
"[geopandas]: https://github.com/kjordahl/geopandas\n",
"[shapely]: https://pypi.python.org/pypi/Shapely\n",
"[rtree]: http://toblerity.github.io/rtree/\n",
"[geopy]: https://code.google.com/p/geopy/\n",
"[ogr/gdal]: https://pypi.python.org/pypi/GDAL/\n",
"[fiona]: http://toblerity.github.io/fiona/\n",
"[pysal]: http://pysal.org\n",
"[pyproj]: https://code.google.com/p/pyproj/\n",
"[pyshp]: https://code.google.com/p/pyshp/\n",
"[descartes]: https://pypi.python.org/pypi/descartes\n",
"\n",
"### [Exhaustive List Here](https://github.com/SpatialPython/spatial_python/blob/master/packages.md)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Install packages by running each line in Terminal/CMD/Shell of your choice**"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"start_time": "2017-01-20T20:23:39.792Z"
},
"collapsed": false,
"scrolled": true
},
"source": [
"`conda install geopandas`\n",
"\n",
"`conda install rtree`\n",
"\n",
"`conda install pyshp`\n",
"\n",
"`conda install pyproj`\n",
"\n",
"`conda install geopy`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install Additional GitHub Packages\n",
"Conda should be your go-to installer but some of the really specific libraries you need to install with Pip\n",
"\n",
"* [pandas-profiling][] - Generates profile reports from a pandas DataFrame\n",
"* [geoplotlib][] - For visualizing geographical data and making maps\n",
"* [missingno][] - Provides a small toolset of flexible and easy-to-use missing data visualizations \n",
"\n",
"[pandas-profiling]: https://github.com/JosPolfliet/pandas-profiling\n",
"[geoplotlib]: https://github.com/andrea-cuttone/geoplotlib\n",
"[missingno]: https://github.com/ResidentMario/missingno\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-19T11:23:57.048240",
"start_time": "2017-01-19T11:23:57.041215"
}
},
"source": [
"**Install packages by running each line in Terminal/CMD/Shell of your choice**"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"start_time": "2017-01-20T20:32:15.868Z"
},
"collapsed": false,
"scrolled": true
},
"source": [
"`pip install https://github.com/JosPolfliet/pandas-profiling/archive/master.zip`\n",
"\n",
"`pip install https://github.com/andrea-cuttone/geoplotlib/archive/master.zip`\n",
"\n",
"`pip install https://github.com/ResidentMario/missingno/archive/master.zip`"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-19T16:07:14.566043",
"start_time": "2017-01-19T16:07:14.208632"
}
},
"source": [
"# Data Set\n",
"## MetroGIS Tax Parcels 2014\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-19T16:13:53.151312",
"start_time": "2017-01-19T16:13:53.147810"
}
},
"source": [
"## Extract Data from zipfile\n",
"\n",
"**Warning: This will use about 1GB of free space.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:09:49.226515",
"start_time": "2017-02-08T09:09:42.931561"
},
"collapsed": false
},
"outputs": [],
"source": [
"from zipfile import ZipFile\n",
"\n",
"file_list = ['./data/data.zip']\n",
"\n",
"for archive in file_list:\n",
" zfile = ZipFile(archive)\n",
" zfile.extractall('./data/')"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-19T16:15:22.364231",
"start_time": "2017-01-19T16:15:22.361231"
}
},
"source": [
"## [Review Tax Parcel Meta Data](./data/MetroGIS_Tax_Parcels_2014/metadata/metadata.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Review Tax Field Descriptions](./data/MetroGIS_Tax_Parcels_2014/metadata/MetroGIS_Regional_Parcels_Attributes_2014.pdf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reading in Data\n",
"\n",
"**The geopandas library can read almost any vector-based spatial data format including ESRI shapefile, GeoJSON files and more**\n",
"\n",
"**Run the cell below to import the shapefile data into a GeoDataFrame which will behave just like a regular Pandas DataFrame.**\n",
"\n",
"**When working with a ShapeFile its really a series of files.**` .cpg, .dbf, .shp, .shx, .prj, .shp.xml` ** each of these files contains different information and work together. Lucky for us **`geopandas`** is knows this and as long as you point it to one of the files the rest will get read in.**\n",
"\n",
"**NOTE: This might take a minute or two.**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:11:33.979172",
"start_time": "2017-02-08T09:09:49.230018"
},
"collapsed": false
},
"outputs": [],
"source": [
"import geopandas as gpd\n",
"\n",
"shp_file = './data/MetroGIS_Tax_Parcels_2014/Parcels2014Hennepin.dbf'\n",
"\n",
"hennepin = gpd.read_file(shp_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now is probably a good time to just import some commonly used libraries for data analysis.**\n",
"\n",
"**Note **`%matplotlib inline`** is not a library but a Jupyter Notebook magic function that allows us to plot inside the notebook.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:11:34.385765",
"start_time": "2017-02-08T09:11:33.982674"
},
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's get a feel for the dataset.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:11:36.261747",
"start_time": "2017-02-08T09:11:34.387767"
},
"collapsed": false
},
"outputs": [],
"source": [
"hennepin.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Yikes, thats a lot of memory being used just to hold it, let alone do any type of analysis on.**\n",
"\n",
"**A alert observer might point out that this is already a lot smaller than the 800MB file we started with. This is because of all the numeric data is stored as text data in the Shapefile. Each character takes up 8bytes, numberical data takes up significantly less.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's create an index so we can traverse the data faster**\n",
"\n",
"**Let's check to make sure that all of the **`PIN` **values are unique and let's make sure to count NULL values in that list.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:11:36.649840",
"start_time": "2017-02-08T09:11:36.264750"
},
"collapsed": false
},
"outputs": [],
"source": [
"hennepin['PIN'].nunique(dropna=True) / len(hennepin['PIN'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-20T15:58:44.182339",
"start_time": "2017-01-20T15:58:42.626274"
},
"collapsed": false
},
"source": [
"**Great they are all unique. Now to set the index.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:11:38.683278",
"start_time": "2017-02-08T09:11:36.651845"
},
"collapsed": true
},
"outputs": [],
"source": [
"hennepin = hennepin.set_index(['PIN']).sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**A lot of the fields are object type, which defaults to** `str` **on import.** `str` **is a not efficient, it contains a ton of additional methods like len(), replace(), etc. All of which we don't really care about right now. Additionally, they take up more space in memory as noted above.**\n",
"\n",
"**Let's treat any column that has a lot of repeat values as** `category` **type.** `category` **type basically just creates a dictionary of words to numbers. A good example of this is** `GREEN_ACRE` **column.**\n",
"\n",
"**Let's inspect the unique values in the column.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:11:38.718311",
"start_time": "2017-02-08T09:11:38.688282"
},
"collapsed": false
},
"outputs": [],
"source": [
"list(hennepin['GREEN_ACRE'].unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's find how many **`object`** columns there are.**\n",
"\n",
"**The code below creates a list of all the column names that are **`object`** type.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:11:39.455386",
"start_time": "2017-02-08T09:11:38.720312"
},
"collapsed": false
},
"outputs": [],
"source": [
"column_list = list(hennepin.select_dtypes(include=['object']).columns.values)\n",
"# how many are there?\n",
"len(column_list)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**I don't really want to go through each of the 54 columns so let's create a function that looks for columns that the number of unique values is less than 20% of the total row count. Data that has this property is said to have 'low cardinality'.**\n",
"\n",
"**If we find a column that has low cardinality, let's convert them to** `category` **type.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:11:39.465393",
"start_time": "2017-02-08T09:11:39.457888"
},
"collapsed": true
},
"outputs": [],
"source": [
"# convert columns with strings to 'categorical' type they have low cardinality.\n",
"def convert_to_categorical(df, cols):\n",
" for col in cols:\n",
" # get number of unique values\n",
" unique_vals = len(df[col].unique())\n",
" # calculate the ratio of unique values to total number of rows\n",
" unique_ratio = unique_vals / len(df)\n",
" if unique_ratio <= 0.2:\n",
" df[col] = df[col].astype('category')\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:12:00.234756",
"start_time": "2017-02-08T09:11:39.467895"
},
"collapsed": false
},
"outputs": [],
"source": [
"hennepin = convert_to_categorical(hennepin, column_list)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Oh no! An error!**\n",
"\n",
"**Looks like my logic for converting broke at ** `geometry` **column since** `object` **in this case wasn't referring to a ** `str`** but a **`geom_type`**. This **`geom_type`** is a key difference between a **`Pandas DataFrame`** and a **`geopandas GeoDataFrame`**. Let's run it again without that column.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:12:52.733292",
"start_time": "2017-02-08T09:12:52.103781"
},
"collapsed": false
},
"outputs": [],
"source": [
"column_list = list(hennepin.select_dtypes(include=['object']).columns.values)\n",
"# remove the value from the list of columns we are converting\n",
"column_list.remove('geometry')\n",
"# run the function\n",
"hennepin = convert_to_categorical(hennepin, column_list)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's see what that did to our memory usage.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:12:54.449323",
"start_time": "2017-02-08T09:12:52.735294"
},
"collapsed": false
},
"outputs": [],
"source": [
"hennepin.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Wow that more than halved our memory consumption!**"
]
},
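{
"cell_type": "markdown",
"metadata": {},
"source": [
"**To see why** `category` **type saves so much space, here is a small self-contained demo with made-up values. A repeated string column shrinks dramatically once each distinct value is stored only once and the rows just hold small integer codes.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# a low-cardinality column: two distinct strings repeated many times\n",
"s = pd.Series(['HOMESTEAD', 'NON-HOMESTEAD'] * 10000)\n",
"\n",
"# compare memory usage as str objects vs as a category\n",
"print(s.memory_usage(deep=True), s.astype('category').memory_usage(deep=True))"
]
},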
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Profiling\n",
"**Let's get some summary stats on our data.**\n",
"\n",
"**Note: This takes a while to run, but gives you a great amount of summary information over **`.describe()`**.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:09.573381",
"start_time": "2017-02-08T09:12:54.452327"
},
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"import pandas_profiling\n",
"\n",
"# skip object columns since the will be things like street address or geometry (high cardinality)\n",
"pandas_profiling.ProfileReport(hennepin.select_dtypes(exclude=['object']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**This gives us a ton of information. In particular I like to look at the varibles that are highly correlated. If we are building a regression model we don't want to include multiple columns that capture (nearly) the same information.**\n",
"\n",
"** Looks like some columns are just constant. Let's remove them.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:09.773774",
"start_time": "2017-02-08T09:14:09.575882"
},
"collapsed": true
},
"outputs": [],
"source": [
"# remove columns\n",
"drop_cols = ['AGPRE_ENRD', 'AGPRE_EXPD', 'COUNTY_ID', 'DWELL_TYPE', 'LANDMARK', 'MULTI_USES', \n",
" 'NUM_UNITS', 'OWNER_MORE', 'OWN_ADD_L1', 'OWN_ADD_L2', 'OWN_ADD_L3', 'PARC_CODE', \n",
" 'PREFIXTYPE', 'PREFIX_DIR', 'STREETTYPE', 'SUFFIX_DIR', 'ZIP4']\n",
"hennepin = hennepin.drop(drop_cols, axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Missing Data?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:20.512968",
"start_time": "2017-02-08T09:14:09.776777"
},
"collapsed": false
},
"outputs": [],
"source": [
"import missingno as msno\n",
"\n",
"msno.matrix(hennepin)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**OK looks like some columns are barely populated. Let's remove them if they are less than 1% populated.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:20.523477",
"start_time": "2017-02-08T09:14:20.515469"
},
"collapsed": true
},
"outputs": [],
"source": [
"def delete_near_null(df, cols, populated_threshold=0.01):\n",
" \"\"\"iterate through columns and remove columns with almost null columns\"\"\"\n",
" for col in cols:\n",
" non_null_rows = df[col].count()\n",
" total_rows = len(df[col])\n",
" populated_ratio = non_null_rows/total_rows\n",
" if populated_ratio <= populated_threshold:\n",
" del df[col]\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:21.417227",
"start_time": "2017-02-08T09:14:20.525983"
},
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"# get the full list of columns to check\n",
"column_list = list(hennepin.columns.values)\n",
"\n",
"# again I don't want to mess with the geometry column\n",
"column_list.remove('geometry')\n",
"# run the function\n",
"hennepin = delete_near_null(hennepin, column_list)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What does our dataset look like now?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:21.425736",
"start_time": "2017-02-08T09:14:21.419734"
},
"collapsed": false
},
"outputs": [],
"source": [
"hennepin.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Filtering on Geodesic Features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**The dataset may have **`Point`** information. **\n",
"\n",
"**These are sometimes there to help break up condos into individually owned units and tax each seperately. For ease of use and sanity I'm going to exclude them.**\n",
"\n",
"**You can also have the following** `geom_types`\n",
"* Points / Multi-Points\n",
"* Lines / Multi-Lines\n",
"* Polygons / Multi-Polygons"
]
},
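{
"cell_type": "markdown",
"metadata": {},
"source": [
"**A minimal sketch of how** `geom_type` **behaves, using** `shapely` **directly (the same geometry objects a GeoDataFrame holds). The coordinates are arbitrary:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from shapely.geometry import Point, Polygon\n",
"\n",
"# a point and a triangle built from raw coordinates\n",
"pt = Point(0, 0)\n",
"tri = Polygon([(0, 0), (1, 0), (1, 1)])\n",
"\n",
"# geom_type is just a string, which is why we can filter on it\n",
"print(pt.geom_type, tri.geom_type)"
]
},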
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:22.647028",
"start_time": "2017-02-08T09:14:21.427737"
},
"collapsed": false
},
"outputs": [],
"source": [
"# filter out points\n",
"hennepin = hennepin[hennepin['geometry'].geom_type != 'Point']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Geospatial Joins and Searchs are computationally expensive. **\n",
"\n",
"**If you can it helps to break up your data into pieces and process them iteratively. For this example we will just do Minneapolis, but you could write a loop to do each city at a time.**"
]
},
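{
"cell_type": "markdown",
"metadata": {},
"source": [
"**A sketch of what such a per-city loop might look like. The processing step shown is a placeholder; in practice you would swap in whatever expensive spatial work you need to do for each city.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# process each city's parcels separately to keep each spatial job small\n",
"results = {}\n",
"for city_name, city_df in hennepin.groupby('CITY'):\n",
"    # placeholder: replace with your real per-city processing\n",
"    results[city_name] = city_df.shape"
]
},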
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:22.708957",
"start_time": "2017-02-08T09:14:22.649033"
},
"collapsed": false
},
"outputs": [],
"source": [
"# filter to minneapolis\n",
"mpls = hennepin[hennepin['CITY'] == 'MINNEAPOLIS']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:22.716970",
"start_time": "2017-02-08T09:14:22.710458"
},
"collapsed": false
},
"outputs": [],
"source": [
"mpls.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Great we have 25% of the data we started with, that will help speed things along.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Joins / Merges\n",
"\n",
"**Ok enough about taxes. Let's find properties that are adjacent to Lake Calhoun!**\n",
"\n",
"**Read in the MetCount Lakes & Rivers Open Water Features shape file**"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-20T16:52:27.186115",
"start_time": "2017-01-20T16:52:27.182112"
}
},
"source": [
"## [Review Lakes & Rivers Meta Data](./data/MetCouncil_Lakes_Rivers/metadata/metadata.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:24.359452",
"start_time": "2017-02-08T09:14:22.720979"
},
"collapsed": false
},
"outputs": [],
"source": [
"shp_file = './data/MetCouncil_Lakes_Rivers/LakesAndRivers.dbf'\n",
"\n",
"water_df = gpd.read_file(shp_file)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:24.374462",
"start_time": "2017-02-08T09:14:24.361454"
},
"collapsed": false
},
"outputs": [],
"source": [
"water_df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Always a good habit to set the index**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:24.412919",
"start_time": "2017-02-08T09:14:24.377464"
},
"collapsed": false
},
"outputs": [],
"source": [
"water_df = water_df.set_index(['OWF_ID']).sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's plot what we've got. This is a much smaller file than the tax parcels so it will be pretty fast.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:38.691765",
"start_time": "2017-02-08T09:14:24.415921"
},
"collapsed": false
},
"outputs": [],
"source": [
"water_df.plot()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's take a look at the fill rate of data.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:39.249762",
"start_time": "2017-02-08T09:14:38.694268"
},
"collapsed": false
},
"outputs": [],
"source": [
"msno.matrix(water_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Hmm lots of water features don't have names.**\n",
"\n",
"**Don't forget the importance of context when doing your analysis. Talking with MetCouncil I found that these can be seasonal wetlands, retention ponds and other small water features.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Geospatial Transformations"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-20T16:10:20.623318",
"start_time": "2017-01-20T16:10:20.616313"
}
},
"source": [
"**Remember when I said that geospatial calculations are expensive? **\n",
"\n",
"**Well they still are. So again, split your work up and loop it if you need to do multiples.**\n",
"\n",
"**Let's filter the data down to any lakes that are named 'Cedar'**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:39.280799",
"start_time": "2017-02-08T09:14:39.251763"
},
"collapsed": false
},
"outputs": [],
"source": [
"# cedar lake\n",
"cedar_lake = water_df[water_df['NAME_DNR'] == 'Cedar']\n",
"cedar_lake"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**There are two lakes named Cedar in the dataset, which one?**\n",
"\n",
"**We can look at a given lake's shape to see if it's what we are expecting.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:39.302314",
"start_time": "2017-02-08T09:14:39.282300"
},
"collapsed": false
},
"outputs": [],
"source": [
"cedar_lake['geometry'].iloc[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:39.346383",
"start_time": "2017-02-08T09:14:39.305316"
},
"collapsed": false
},
"outputs": [],
"source": [
"cedar_lake['geometry'].iloc[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's filter our set to use the first lake. In reality you'd want to use the OWF_ID to ensure you are choosing the correct lake.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:39.377219",
"start_time": "2017-02-08T09:14:39.348887"
},
"collapsed": false
},
"outputs": [],
"source": [
"cedar_lake = cedar_lake.iloc[[0]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:39.435753",
"start_time": "2017-02-08T09:14:39.380224"
},
"collapsed": false
},
"outputs": [],
"source": [
"cedar_lake"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**If we want to get properties next to Cedar Lake, you'd want to know that around many lakes in the city there is a public trail, so we need to expand our search beyond just what touches the lake.**\n",
"\n",
"**We can accomplish this by 'buffering' or making the shape bigger in all directions. Let's do 100m for good measure.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:39.718098",
"start_time": "2017-02-08T09:14:39.438922"
},
"collapsed": false
},
"outputs": [],
"source": [
"buffered_cedar_lake = cedar_lake.buffer(100)\n",
"ax = cedar_lake.plot(color='red')\n",
"\n",
"buffered_cedar_lake.plot(ax=ax, color='green')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**For now we don't really care about any of the other attributes attached to this lake, we just care about its shape, so let's just get the **`Polygon`** to see where their are overlaps with properties.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:39.725104",
"start_time": "2017-02-08T09:14:39.721099"
},
"collapsed": false
},
"outputs": [],
"source": [
"buffered_cedar_poly = buffered_cedar_lake.iloc[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now this next section is a bit complicated but is super fast. Using the **`rtree` **library you can quickly narrow down your search.**\n",
"\n",
"**First step is to set the spatial index of the geospatial dataset you wish to search.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:51.817901",
"start_time": "2017-02-08T09:14:39.727610"
},
"collapsed": true
},
"outputs": [],
"source": [
"spatial_index = mpls.sindex"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Next I want you to envision how complicated it would be to determine if a weird shape is overlapping another weird shape.**\n",
"\n",
"**Wouldn't it be easier to just filter out the vast majority by seeing if a point is inside a rectangle?**\n",
"\n",
"**Yes, yes it is.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**A Bounding Box makes the smallest possible rectangle that completely encloses your polygon.**\n",
"\n",
"**It is just a tuple of the minx, miny, maxx, maxy**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:51.827398",
"start_time": "2017-02-08T09:14:51.820385"
},
"collapsed": false
},
"outputs": [],
"source": [
"cedar_bb = buffered_cedar_poly.bounds\n",
"cedar_bb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**This next part isn't essential to understand the code because honestly its a mess, but rather to see that the bounding box is the smallest rectangle that the buffered lake fits in.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:52.107841",
"start_time": "2017-02-08T09:14:51.829900"
},
"collapsed": false
},
"outputs": [],
"source": [
"from shapely.geometry import box\n",
"from descartes import PolygonPatch\n",
"\n",
"# convert the bounding box tuple into a polygon\n",
"cedar_box = box(cedar_bb[0], cedar_bb[1], cedar_bb[2], cedar_bb[3])\n",
"\n",
"# plot the rectangle\n",
"fig = plt.figure() \n",
"ax = fig.gca() \n",
"ax.add_patch(PolygonPatch(cedar_box))\n",
"\n",
"# ensure we aren't distorting the image\n",
"ax.axis('scaled')\n",
"\n",
"# plot the buffered cedar lake polygon on top\n",
"buffered_cedar_lake.plot(ax=ax)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now we find all of the index values of the Minneapolis Tax Parcels that intersect with this bounding box.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:52.116337",
"start_time": "2017-02-08T09:14:52.110330"
},
"collapsed": true
},
"outputs": [],
"source": [
"possible_matches_index = list(spatial_index.intersection(cedar_bb))"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2017-01-20T16:42:03.405938",
"start_time": "2017-01-20T16:42:03.397928"
}
},
"source": [
"**Now we can select those parcels by their index values**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:52.173899",
"start_time": "2017-02-08T09:14:52.122845"
},
"collapsed": false
},
"outputs": [],
"source": [
"possible_matches = mpls.iloc[possible_matches_index]\n",
"possible_matches.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now that was fairly quick and we've gone from 100k+ records down to a handful.**\n",
"\n",
"**Next we look at those that actually touch that 100m buffer around Cedar Lake**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:52.301515",
"start_time": "2017-02-08T09:14:52.176901"
},
"collapsed": false
},
"outputs": [],
"source": [
"precise_matches = possible_matches[possible_matches.intersects(buffered_cedar_poly)]\n",
"precise_matches.shape"
]
},
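  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The two steps above (a coarse bounding-box filter via the spatial index, then a precise **`intersects()`** test on the survivors) are a common pattern, so they can be wrapped in a small helper. This is just a sketch of the same logic, not a new API:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def features_near(geom, gdf, sindex):\n",
    "    # coarse pass: cheap rectangle test against the rtree index\n",
    "    candidates = gdf.iloc[list(sindex.intersection(geom.bounds))]\n",
    "    # precise pass: exact geometry intersection on the survivors\n",
    "    return candidates[candidates.intersects(geom)]\n",
    "\n",
    "# same result as the two separate steps above\n",
    "features_near(buffered_cedar_poly, mpls, spatial_index).shape"
   ]
  },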
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Also fairly fast and look we are left with 208 parcels that are within 100m of Cedar Lake!**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:53.608561",
"start_time": "2017-02-08T09:14:52.308019"
},
"collapsed": false
},
"outputs": [],
"source": [
"precise_matches.plot()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's make a new dataframe that is only these matches. Let's add a column specifies these are close to Cedar Lake**\n",
"\n",
"\n",
"**NOTE: A better way would be to use HUC's (Hydrological Unit Code) so you don't run into the collision of multiple Cedar Lakes.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:53.620070",
"start_time": "2017-02-08T09:14:53.611063"
},
"collapsed": false
},
"outputs": [],
"source": [
"# .copy() ensures we aren't mucking up the mpls dataframe since precise_matches is just a slice of it.\n",
"cedar_lake_parcels = precise_matches.copy()\n",
"cedar_lake_parcels['LAKE_NAME'] = 'Cedar Lake'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Joining Tablular Data to Geospatial Data\n",
"\n",
"**Now let's append some Lake Quality to these parcels surrounding Cedar Lake!**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Review Lake Quality Meta Data](./data/MN_PCA_Cedar_Lake/metadata.txt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:54.059436",
"start_time": "2017-02-08T09:14:53.623072"
},
"collapsed": false
},
"outputs": [],
"source": [
"txt_file = './data/MN_PCA_Cedar_Lake/cedar_lake_qual.csv'\n",
"lake_qual = pd.read_csv(txt_file, parse_dates=['sampleDate', 'analysisDate'], na_values='(null)')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:54.157503",
"start_time": "2017-02-08T09:14:54.064439"
},
"collapsed": false
},
"outputs": [],
"source": [
"lake_qual.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:14:55.222287",
"start_time": "2017-02-08T09:14:54.162006"
},
"collapsed": false
},
"outputs": [],
"source": [
"msno.matrix(lake_qual)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:05.027226",
"start_time": "2017-02-08T09:14:55.224776"
},
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"pandas_profiling.ProfileReport(lake_qual)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Hmm for some reason result is being treated as a categorical variable.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:05.273333",
"start_time": "2017-02-08T09:15:05.029227"
},
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"lake_qual['result'] = lake_qual['result'].astype('float')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Ahh, ok there are a few non-numeric values. Let's replace them with numbers.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:30.745241",
"start_time": "2017-02-08T09:15:30.714737"
},
"collapsed": true
},
"outputs": [],
"source": [
"lake_qual['result'] = lake_qual['result'].astype('str').replace('3.MED ALGAE', '3').replace('3.FAIR', '3')\n",
"lake_qual['result'] = lake_qual['result'].astype('float')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We only need 2014 data. Let's filter to that first**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:30.777312",
"start_time": "2017-02-08T09:15:30.753250"
},
"collapsed": false
},
"outputs": [],
"source": [
"lake_qual = lake_qual[lake_qual['sampleDate'].dt.year == 2014]\n",
"lake_qual.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Wait a second... 984 rows for one year??!**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:30.834207",
"start_time": "2017-02-08T09:15:30.789825"
},
"collapsed": false
},
"outputs": [],
"source": [
"lake_qual.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Ahh, looks like data needs to be pivoted out.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:30.905264",
"start_time": "2017-02-08T09:15:30.836713"
},
"collapsed": false
},
"outputs": [],
"source": [
"lake_qual_pivot = lake_qual.pivot_table(values='result', index='sampleDate', columns='parameter')\n",
"lake_qual_pivot.head()"
]
},
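  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**If **`pivot_table`** is new to you, here is a tiny standalone example on made-up data (not from the lake set): long-format rows become one row per date and one column per parameter, with values averaged when a date has duplicates.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "toy = pd.DataFrame({'sampleDate': ['2014-06-01', '2014-06-01', '2014-06-08'],\n",
    "                    'parameter': ['Secchi', 'Phosphorus', 'Secchi'],\n",
    "                    'result': [2.1, 0.03, 1.8]})\n",
    "# one row per date, one column per parameter\n",
    "toy.pivot_table(values='result', index='sampleDate', columns='parameter')"
   ]
  },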
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Always good to look at the meta data and see if you should add anything to the set. In this case I want to put the lat/long**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:31.867779",
"start_time": "2017-02-08T09:15:30.907766"
},
"collapsed": true
},
"outputs": [],
"source": [
"lake_qual_pivot['latitude'] = 44.961482\n",
"lake_qual_pivot['longitude'] = -93.32013"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**We will take the easy way out here and just create a column **`LAKE_NAME`** to join on. In reality you should use something like a 10digit HUC (Hydrological Unit Code) to insure that you don't have duplicate lake names ie (Cedar)**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:31.911809",
"start_time": "2017-02-08T09:15:31.871781"
},
"collapsed": true
},
"outputs": [],
"source": [
"lake_qual_pivot['LAKE_NAME'] = 'Cedar Lake'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:32.008785",
"start_time": "2017-02-08T09:15:31.916312"
},
"collapsed": false
},
"outputs": [],
"source": [
"result = cedar_lake_parcels.reset_index().merge(lake_qual_pivot, on='LAKE_NAME')\n",
"result.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**This inflated our data since there is an entry for roughly every week in 2014.**\n",
"\n",
"**Lot of NaN columns, this is due to how each city collects and reports data around tax parcels**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:32.019293",
"start_time": "2017-02-08T09:15:32.012288"
},
"collapsed": false
},
"outputs": [],
"source": [
"result.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:32.073330",
"start_time": "2017-02-08T09:15:32.022796"
},
"collapsed": true
},
"outputs": [],
"source": [
"result = result.dropna(axis=1, how='all')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:32.104352",
"start_time": "2017-02-08T09:15:32.078834"
},
"collapsed": false
},
"outputs": [],
"source": [
"result.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Geospatial Calculations\n",
"\n",
"**Some lakes have multiple monitoring stations. You might want to use the data that is closest to the parcel.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Let's just do a simple calculation from the center of each parcel to the Lat/Long of the Monitoring Site**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:32.492666",
"start_time": "2017-02-08T09:15:32.107354"
},
"collapsed": false
},
"outputs": [],
"source": [
"result['Parcel_Centroid'] = result.centroid\n",
"result['Parcel_Centroid'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Weird, those don't look like any type of GPS coords I've seen.**\n",
"\n",
"**GIS files use different 'datums' to set the origin of their coordinate systems. We will need to convert this.**\n",
"\n",
"**This [GIS.stackexchange](http://gis.stackexchange.com/a/722) goes over the concept in detail. And yes... there is a Stack Overflow equivalent for GIS...**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:32.502179",
"start_time": "2017-02-08T09:15:32.495168"
},
"collapsed": false
},
"outputs": [],
"source": [
"# what is the starting projection\n",
"result.crs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Convert the projection to Normal Lat/Long**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:32.985098",
"start_time": "2017-02-08T09:15:32.505682"
},
"collapsed": true
},
"outputs": [],
"source": [
"result = result.to_crs({'init': 'epsg:4326'})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:33.031501",
"start_time": "2017-02-08T09:15:32.988090"
},
"collapsed": false
},
"outputs": [],
"source": [
"result['Parcel_Centroid'] = result.centroid\n",
"result['Parcel_Centroid'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**So the problem here is that these centroids are still **`shapley.geometry.Point`** objects. We need to get them into tuples for **`geopy` ** to work with.\n",
"\n",
"**Below is a **`lambda`** function they aren't really known for being super easy to read, but they are really handy. You can read the bottom one as: For each Point, get its x and y values and return them in a tuple.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:33.135459",
"start_time": "2017-02-08T09:15:33.034486"
},
"collapsed": false
},
"outputs": [],
"source": [
"result['Parcel_Centroid'] = result['Parcel_Centroid'].apply(lambda p: tuple([p.y, p.x]))\n",
"result['Parcel_Centroid'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Right now monitoring station coords are in into columns, need to convert that into a tuple.**\n",
"\n",
"**The **`zip()`** function is similar to taking the left and right side of a jacket and 'zipping' them together. First thing on the left goes with the first thing on the right and on down the line.**"
]
},
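  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**For example, zipping two short lists of made-up coordinates pairs them up element by element:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lats = [44.96, 44.95]\n",
    "lons = [-93.32, -93.31]\n",
    "# -> [(44.96, -93.32), (44.95, -93.31)]\n",
    "list(zip(lats, lons))"
   ]
  },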
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:33.158936",
"start_time": "2017-02-08T09:15:33.137963"
},
"collapsed": false
},
"outputs": [],
"source": [
"from shapely.geometry import Point\n",
"\n",
"result['station_coords'] = tuple(zip(result['latitude'], result['longitude']))\n",
"# remove the old columns\n",
"drop_cols = ['latitude', 'longitude']\n",
"result = result.drop(drop_cols, axis=1)\n",
"\n",
"result['station_coords'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**So now you have two Lat/Long Points. The key here is not to use euclidean distance when trying to find out how far part they are.**\n",
"\n",
"**Recall that the further North or South you go the more the Longitude lines converge towards the pole. If you use euclidean distance your results will only be accurate at the equator!**"
]
},
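  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A rough back-of-the-envelope check of this (assuming a spherical Earth, where one degree of longitude spans about 111.32 km at the equator): one degree of longitude covers less and less ground as you move away from the equator.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# approximate ground distance of 1 degree of longitude at several latitudes\n",
    "for lat in (0, 45, 60):\n",
    "    print(lat, round(111.32 * math.cos(math.radians(lat)), 1), 'km')"
   ]
  },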
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:33.829523",
"start_time": "2017-02-08T09:15:33.160937"
},
"collapsed": false
},
"outputs": [],
"source": [
"from geopy.distance import vincenty\n",
"\n",
"def calc_dist(row):\n",
" dist_to_station = vincenty(row['Parcel_Centroid'], row['station_coords']).km\n",
" return dist_to_station\n",
"\n",
"result['dist_to_station'] = result.apply(calc_dist, axis=1)\n",
"\n",
"result['dist_to_station'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now we just want the data were its the closest to the point.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:33.847562",
"start_time": "2017-02-08T09:15:33.833527"
},
"collapsed": false
},
"outputs": [],
"source": [
"# group by the PIN key, which would be duplicated for each station and take the minimum\n",
"def get_min_rows(df, grpby, aggcol):\n",
" min_values = df.groupby(grpby)[aggcol].transform(min)\n",
" return df[df[aggcol] == min_values] \n",
"\n",
"min_parcels = get_min_rows(result, 'PIN', 'dist_to_station')\n",
"min_parcels.shape"
]
},
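  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**If **`transform`** is unfamiliar: unlike **`agg`**, it returns one value per row, so the result lines up with the original frame and can be used as a boolean mask. A toy example on made-up data:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "toy = pd.DataFrame({'PIN': ['A', 'A', 'B'], 'dist': [1.0, 0.5, 2.0]})\n",
    "# keep each PIN's row with the smallest dist\n",
    "toy[toy['dist'] == toy.groupby('PIN')['dist'].transform(min)]"
   ]
  },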
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Writing Data Out\n",
"\n",
"**GeoDataFrames can be exported to many different standard formats.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2017-02-08T09:15:38.922414",
"start_time": "2017-02-08T09:15:33.851547"
},
"collapsed": false
},
"outputs": [],
"source": [
"# Shape files complain about certain data types\n",
"min_parcels['Parcel_Centroid'] = min_parcels['Parcel_Centroid'].astype('str')\n",
"min_parcels['station_coords'] = min_parcels['station_coords'].astype('str')\n",
"\n",
"file_name = './data/results'\n",
"\n",
"min_parcels.to_file(file_name)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
================================================
FILE: README.md
================================================

# Introduction to Geospatial Data with Python
Data comes in all shapes and sizes, and government data is often geospatial in nature. Data science programs & tutorials often ignore how to work with this rich data to make room for more advanced topics. Our MinneMUDAC competition heavily utilized geospatial data, but it was processed to give students a more familiar format. As good scientists, though, we should use primary sources of information as often as possible.

## Why use this Notebook?
Use this Notebook to get a basic understanding of how to read, write, and query geospatial data, perform geospatial calculations, and join data sets together. Along the way you will see some tips for preprocessing data for analysis and some tricks to ensure you are computing efficiently. This Notebook is focused on Minnesota tax shapefiles, MetCouncil water features, and MN PCA lake quality attributes, all of which were the focus of our [Dive Into Water (Data)](http://minneanalytics.org/minnemudac/) Competition. It is meant to give you real data, real code, and a real problem to work through.
*Social Data Science hopes you take what you learn here and use it to improve the world around you!*
## [Open the Notebook with Jupyter to get started!](Intro%20to%20Geospatial%20Data%20with%20Python.ipynb)
## [Watch the YouTube Talk](https://www.youtube.com/watch?v=qvHXRuGPHl0)