Repository: GreenBuildingRegistry/usaddress-scourgify
Branch: dev
Commit: 931308b1fc64
Files: 22
Total size: 111.0 KB

Directory structure:
gitextract_ipci6wv1/

├── .coveragerc
├── .gitignore
├── .isort.cfg
├── CHANGELOG.rst
├── LICENSE
├── README.rst
├── requirements/
│   ├── base.txt
│   └── dev.txt
├── scourgify/
│   ├── __init__.py
│   ├── address_constants.py
│   ├── cleaning.py
│   ├── exceptions.py
│   ├── normalize.py
│   ├── tests/
│   │   ├── __init__.py
│   │   ├── config/
│   │   │   ├── __init__.py
│   │   │   └── address_constants.yaml
│   │   ├── test_address_normalization.py
│   │   └── test_cleaning.py
│   └── validations.py
├── setup.cfg
├── setup.py
└── tox.ini

================================================
FILE CONTENTS
================================================

================================================
FILE: .coveragerc
================================================
[paths]
source =
    scourgify

[run]
source =
    scourgify
omit =
    *tox*
    setup.py
    *test*

[report]
;sort = Cover
sort = Name
skip_covered = True
show_missing = True

================================================
FILE: .gitignore
================================================
# Created by .ignore support plugin (hsz.mobi)
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.pytest_cache/

# pycharm
.idea


================================================
FILE: .isort.cfg
================================================
[settings]
line_length=79
multi_line_output=3
include_trailing_comma=1
known_standard_library=typing
known_django=django
known_thirdparty=peewee
import_heading_stdlib=Imports from Standard Library
import_heading_thirdparty=Imports from Third Party Modules
import_heading_django=Imports from Django
import_heading_firstparty=Local Imports
sections=FUTURE,STDLIB,DJANGO,THIRDPARTY,FIRSTPARTY,LOCALFOLDER
not_skip = __init__.py

# for additional settings see:
#    https://github.com/timothycrosley/isort/wiki/isort-Settings


================================================
FILE: CHANGELOG.rst
================================================
Changelog
=========
0.2.3 [2020-05-06]
------------------
* Valid OccupancyType bug fix for OccupancyType that is already valid abbreviation

0.2.1 [2020-05-06]
------------------
* Corrected for late OccupancyType additions and allowed # OccupancyType to pass through

0.2.0 [2020-05-06]
------------------
* Potentially breaking change: non-standard unit numbers are now converted to a default.
  This is based on a real-life incident; the original
  behavior of allowing non-standard unit types to pass through resulted
  in an address validation service also allowing the address to pass
  through even though no unit should have existed on the home.

0.1.3 [2018-09-09]
------------------
* python 3.7.0 compatibility

0.1.0 [2018-02-16]
------------------
* OpenSource release


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Green Building Registry

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

================================================
FILE: README.rst
================================================
usaddress-scourgify
===================

A Python3.x library for cleaning/normalizing US addresses following USPS pub 28 and RESO guidelines.



Documentation
-------------
Use

``normalize_address_record()``

or

``get_geocoder_normalized_addr()``

or

``NormalizeAddress().normalize()``

to standardize your addresses. (Note: usaddress-scourgify does not make any attempts at address validation.)

Both functions, and the class init, take an address string or a dict-like object and return an address dict with all field values in uppercase, mapped to the keys address_line_1, address_line_2, city, state, and postal_code.


.. code-block:: python


        from scourgify import normalize_address_record, NormalizeAddress

        normalize_address_record('123 southwest Main street, Boring, or, 97203')
        
        normalize_address_record({
            'address_line_1': '123 southwest Main street',
            'address_line_2': 'unit 2',
            'city': 'Boring',
            'state': 'or',
            'postal_code': '97203'
        })

        NormalizeAddress('123 southwest Main street, Boring, or, 97203').normalize()

expected output


.. code-block:: python

       {
            'address_line_1': '123 SW MAIN ST',
            'address_line_2': 'UNIT 2',
            'city': 'BORING',
            'state': 'OR',
            'postal_code': '97203'
        }


By default, the output style abbreviates all pre or post directionals, street types, and occupancy types.
Alternatively, if you would like to receive your output with full-word directionals and street types, you can use the `long_hand` parameter.

.. code-block:: python


        from scourgify import normalize_address_record, NormalizeAddress

        normalize_address_record('123 southwest Main street, Boring, or, 97203', long_hand=True)

        normalize_address_record({
            'address_line_1': '123 southwest Main street',
            'address_line_2': 'unit 2',
            'city': 'Boring',
            'state': 'or',
            'postal_code': '97203'
        }, long_hand=True)

        NormalizeAddress('123 southwest Main street, Boring, or, 97203', long_hand=True).normalize()

expected output


.. code-block:: python

       {
            'address_line_1': '123 SOUTHWEST MAIN STREET',
            'address_line_2': 'UNIT 2',
            'city': 'BORING',
            'state': 'OR',
            'postal_code': '97203'
        }

normalize_address_record() uses the included processing functions to remove unacceptable special characters, extra spaces, and predictable abnormal character sub-strings and phrases. It also abbreviates directional indicators and street types according to the abbreviation mappings found in address_constants. If applicable, line 2 address elements (e.g., Apt, Unit) are separated from line 1 inputs and standard occupancy type abbreviations are applied.

You may supply additional processing functions as a list of callables passed to the addtl_funcs parameter. Each additional function should take a string address and return a tuple of strings (line1, line2).
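
As a rough illustration of the callable shape described above, the following sketch defines one extra processing function. The function name and the rule it applies (moving a trailing ``REAR`` designator to line 2) are hypothetical and not part of the library; only the string-in, ``(line1, line2)``-out contract comes from the description above.

```python
from typing import Tuple


def split_rear_designator(address: str) -> Tuple[str, str]:
    """Hypothetical extra step: move a trailing 'REAR' to line 2."""
    line1 = address.strip()
    line2 = ''
    suffix = ' REAR'
    if line1.upper().endswith(suffix):
        # Drop the designator and any comma or space left behind.
        line1 = line1[:-len(suffix)].rstrip(' ,')
        line2 = 'REAR'
    return line1, line2


# A function like this would presumably be passed as
# addtl_funcs=[split_rear_designator]; that call is not executed here.
example = split_rear_designator('123 Main St, rear')
```

Returning an empty string for line 2 when nothing matches keeps the return type uniform, which is an assumption of this sketch rather than a documented requirement.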

Postal codes are normalized to US zip or zip+4 and zero padded as applicable, e.g. `2129 => 02129`, `02129-44 => 02129-0044`, `021290044 => 02129-0044`.
However, postal codes that cannot be effectively normalized, such as those with an invalid length or invalid characters, will raise AddressValidationError, e.g. `12345678901`, `02129-`, or `02129-0044-123`.
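
The padding rules above can be sketched in a few lines. This is a minimal, standalone approximation written for illustration, not the library's implementation: the function name is hypothetical, it raises ``ValueError`` rather than AddressValidationError, and it assumes an unhyphenated zip+4 is always exactly nine digits.

```python
import re

# Accepts 1-5 digits, exactly 9 digits, or 1-5 digits, a hyphen, 1-4 digits.
_ZIP_PATTERN = re.compile(r'\d{1,5}|\d{9}|\d{1,5}-\d{1,4}')


def pad_postal_code_sketch(postal_code: str) -> str:
    """Hypothetical helper: zero-pad a US zip or zip+4 string."""
    code = postal_code.strip()
    if not _ZIP_PATTERN.fullmatch(code):
        raise ValueError('cannot normalize postal code: %r' % postal_code)
    if '-' in code:
        zip5, plus4 = code.split('-')
    elif len(code) == 9:
        zip5, plus4 = code[:5], code[5:]
    else:
        zip5, plus4 = code, ''
    zip5 = zip5.zfill(5)
    # Only append the +4 part when one was supplied.
    return '{}-{}'.format(zip5, plus4.zfill(4)) if plus4 else zip5
```

With this sketch, the three valid inputs listed above pad as shown there, and the three invalid inputs are rejected.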

Alternatively, you may extend the `NormalizeAddress` class to customize the normalization behavior by overriding any of the class's methods.

If your address is in the form of a dict that does not use the keys address_line_1, address_line_2, city, state, and postal_code, you must supply a key map to the addr_map parameter in the format {standard_key: custom_key}:


.. code-block:: python

        {
            'address_line_1': 'Line1',
            'address_line_2': 'Line2',
            'city': 'City',
            'state': 'State',
            'postal_code': 'Zip'
        }
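
To make the direction of the mapping concrete, the sketch below re-keys a custom record using a ``{standard_key: custom_key}`` map. The helper ``to_standard_keys`` is hypothetical and only illustrates what the map expresses; how the library applies addr_map internally is not shown in this section.

```python
from typing import Any, Dict, Mapping

# Map in the documented {standard_key: custom_key} format.
ADDR_MAP = {
    'address_line_1': 'Line1',
    'address_line_2': 'Line2',
    'city': 'City',
    'state': 'State',
    'postal_code': 'Zip',
}


def to_standard_keys(
    record: Mapping[str, Any], addr_map: Mapping[str, str]
) -> Dict[str, Any]:
    """Hypothetical helper: look up each custom key under its standard name."""
    return {
        standard: record.get(custom)
        for standard, custom in addr_map.items()
    }


custom_record = {
    'Line1': '123 southwest Main street',
    'Line2': 'unit 2',
    'City': 'Boring',
    'State': 'or',
    'Zip': '97203',
}
standard_record = to_standard_keys(custom_record, ADDR_MAP)
```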


You can also customize the address constants used by setting up an `address_constants.yaml` config file.
Allowed keys are::
            DIRECTIONAL_REPLACEMENTS
            OCCUPANCY_TYPE_ABBREVIATIONS
            STATE_ABBREVIATIONS
            STREET_TYPE_ABBREVIATIONS
            KNOWN_ODDITIES
            PROBLEM_ST_TYPE_ABBRVS

You may also use the key `insertion_method` with a value of `update` or `replace` to indicate whether you would like to insert your values into the existing constants or replace them. If `insertion_method` is not present, `update` is assumed.


.. code-block:: yaml

        insertion_method: update
        KNOWN_ODDITIES:
            'developed by HOST': ''
            ', UN ': ' UNIT '

        OCCUPANCY_TYPE_ABBREVIATIONS:
            'UN': 'UNIT'


get_geocoder_normalized_addr() uses geocoder.google to parse your address into a standard dict. No additional cleaning is performed, so if your address contains any stray or non-conforming elements (e.g., 8888 NE KILLINGSWORTH ST, UN C, PORTLAND, OR 97008), no result will be returned.
Since geocoder accepts an address string, if your address is in dict format and your keys do not match the standard key set (address_line_1, address_line_2, city, state, postal_code), you will need to supply a list of the address-related keys within your dict, in the order of address string composition.
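
The idea of an ordered key list can be illustrated as follows. The helper and the custom key names are hypothetical, and the comma-joined format is an assumption made for this sketch; the library's own string composition is not shown in this section.

```python
from typing import Mapping, Sequence


def compose_address_string(
    record: Mapping[str, str], keys: Sequence[str]
) -> str:
    """Hypothetical helper: join address parts in the given key order."""
    parts = [str(record[key]).strip() for key in keys if record.get(key)]
    return ', '.join(parts)


record = {
    'Zip': '97008',
    'Street': '8888 NE Killingsworth St',
    'St': 'OR',
    'Town': 'Portland',
}
# Keys listed in the order the address string should be composed.
ordered_keys = ['Street', 'Town', 'St', 'Zip']
address = compose_address_string(record, ordered_keys)
```

The order of the list, not the order of the dict, decides how the string reads, which is why the keys must be supplied in composition order.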

Installation
------------
Requires Python3.x.

``pip install usaddress-scourgify``

To use a custom constants yaml, set the ADDRESS_CONFIG_DIR environment variable to the full path of the directory containing your address_constants.yaml file:

``export ADDRESS_CONFIG_DIR=/path/to/your/config_dir``
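
If exporting the variable in the shell is not convenient, it can also be set from Python. Because ``scourgify/address_constants.py`` calls ``set_address_constants()`` at module level, the variable most likely needs to be in place before scourgify is first imported. The path below is a placeholder, and this is a sketch under that assumption rather than documented behavior.

```python
import os

# Placeholder path: point this at the directory holding
# your address_constants.yaml file.
os.environ.setdefault('ADDRESS_CONFIG_DIR', '/path/to/your/config_dir')

# Import scourgify only after the variable is set, for example:
# from scourgify import normalize_address_record
```

``setdefault`` leaves an already exported value untouched, so a shell-level setting still takes precedence.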

To use get_geocoder_normalized_addr, set the GOOGLE_API_KEY environment variable:

``export GOOGLE_API_KEY=your_google_api_key``

Contributing
------------
Create a new branch to hold your change; no pull requests submitted directly to dev or master will be approved. Please include a comment explaining the issue your pull request solves. Make sure all appropriate test and tox updates are included and that all tests are passing.

License
-------
usaddress-scourgify is released under the terms of the MIT license. Full details in LICENSE file.

Changelog
---------
usaddress-scourgify was developed for use in the greenbuildingregistry project.
For a full changelog see `CHANGELOG.rst <https://github.com/GreenBuildingRegistry/usaddress-scourgify/blob/master/CHANGELOG.rst>`_.


================================================
FILE: requirements/base.txt
================================================
usaddress>=0.5.9
geocoder>=1.22.6
yaml-config>=0.1.2
typing>=3.6.1; python_version<'3.6'


================================================
FILE: requirements/dev.txt
================================================
-r base.txt
coverage>=6.2
flake8>=3.0.4
frosted>=1.4.1
isort>=4.2.5
pep8>=1.7.0
pylama>=7.3.3
pylint>=1.6.4
tox>=2.7.0



================================================
FILE: scourgify/__init__.py
================================================
#!/usr/bin/env python
# encoding: utf-8
"""
copyright (c) 2016  Earth Advantage.
All rights reserved
"""

# Local Imports
from scourgify.normalize import (
    get_geocoder_normalized_addr,
    normalize_address_record,
    NormalizeAddress
)


================================================
FILE: scourgify/address_constants.py
================================================
#!/usr/bin/env python
# encoding: utf-8
"""
copyright (c) 2016-2017 Earth Advantage.
All rights reserved
..codeauthor::Fable Turas <fable@rainsoftware.tech>

"""
# Imports from Third Party Modules
from yamlconf import Config, ConfigError

KNOWN_ODDITIES = {}
ABNORMAL_OCCUPANCY_ABBRVS = {}

PROBLEM_ST_TYPE_ABBRVS = {
    'CT': 'COURT'
}

AMBIGUOUS_DIRECTIONALS = {
    'NORTH-WEST': 'NW',
    'NORTH-EAST': 'NE',
    'SOUTH-WEST': 'SW',
    'SOUTH-EAST': 'SE'
}

DIRECTIONAL_REPLACEMENTS = {
    'EAST': 'E',
    'WEST': 'W',
    'NORTH': 'N',
    'SOUTH': 'S',
    'NORTHEAST': 'NE',
    'NORTHWEST': 'NW',
    'SOUTHEAST': 'SE',
    'SOUTHWEST': 'SW'
}

LONGHAND_DIRECTIONALS = {v: k for k, v in DIRECTIONAL_REPLACEMENTS.items()}

CITY_ABBREVIATIONS = LONGHAND_DIRECTIONALS.copy()
CITY_ABBRS = {
    'ST': 'SAINT',
    'MT': 'MOUNT',
    'FT': 'FORT',
    'VA': 'VIRGINIA'
}
CITY_ABBREVIATIONS.update(CITY_ABBRS)
STREET_TYPE_ABBREVIATIONS = {
    'ALLEE': 'ALY',
    'ALLEY': 'ALY',
    'ALLY': 'ALY',
    'ALY': 'ALY',
    'ANEX': 'ANX',
    'ANNEX': 'ANX',
    'ANNX': 'ANX',
    'ANX': 'ANX',
    'ARC': 'ARC',
    'ARCADE': 'ARC',
    'AV': 'AVE',
    'AVE': 'AVE',
    'AVEN': 'AVE',
    'AVENU': 'AVE',
    'AVENUE': 'AVE',
    'AVN': 'AVE',
    'AVNUE': 'AVE',
    'BAYOO': 'BYU',
    'BAYOU': 'BYU',
    'BCH': 'BCH',
    'BEACH': 'BCH',
    'BEND': 'BND',
    'BND': 'BND',
    'BLF': 'BLF',
    'BLUF': 'BLF',
    'BLUFF': 'BLF',
    'BLUFFS': 'BLFS',
    'BOT': 'BTM',
    'BOTTM': 'BTM',
    'BOTTOM': 'BTM',
    'BTM': 'BTM',
    'BLVD': 'BLVD',
    'BOUL': 'BLVD',
    'BOULEVARD': 'BLVD',
    'BOULV': 'BLVD',
    'BR': 'BR',
    'BRANCH': 'BR',
    'BRNCH': 'BR',
    'BRDGE': 'BRG',
    'BRG': 'BRG',
    'BRIDGE': 'BRG',
    'BRK': 'BRK',
    'BROOK': 'BRK',
    'BROOKS': 'BRKS',
    'BURG': 'BG',
    'BURGS': 'BGS',
    'BYP': 'BYP',
    'BYPA': 'BYP',
    'BYPAS': 'BYP',
    'BYPASS': 'BYP',
    'BYPS': 'BYP',
    'CAMP': 'CP',
    'CMP': 'CP',
    'CP': 'CP',
    'CANYN': 'CYN',
    'CANYON': 'CYN',
    'CNYN': 'CYN',
    'CYN': 'CYN',
    'CAPE': 'CPE',
    'CPE': 'CPE',
    'CAUSEWAY': 'CSWY',
    'CAUSWAY': 'CSWY',
    'CSWY': 'CSWY',
    'CEN': 'CTR',
    'CENT': 'CTR',
    'CENTER': 'CTR',
    'CENTR': 'CTR',
    'CENTRE': 'CTR',
    'CNTER': 'CTR',
    'CNTR': 'CTR',
    'CTR': 'CTR',
    'CENTERS': 'CTRS',
    'CIR': 'CIR',
    'CIRC': 'CIR',
    'CIRCL': 'CIR',
    'CIRCLE': 'CIR',
    'CRCL': 'CIR',
    'CRCLE': 'CIR',
    'CIRCLES': 'CIRS',
    'CLF': 'CLF',
    'CLIFF': 'CLF',
    'CLFS': 'CLFS',
    'CLIFFS': 'CLFS',
    'CLB': 'CLB',
    'CLUB': 'CLB',
    'COMMON': 'CMN',
    'COR': 'COR',
    'CORNER': 'COR',
    'CORNERS': 'CORS',
    'CORS': 'CORS',
    'COURSE': 'CRSE',
    'CRSE': 'CRSE',
    'COURT': 'CT',
    'CRT': 'CT',
    'CT': 'CT',
    'COURTS': 'CTS',
    'COVE': 'CV',
    'CV': 'CV',
    'COVES': 'CVS',
    'CK': 'CRK',
    'CR': 'CRK',
    'CREEK': 'CRK',
    'CRK': 'CRK',
    'CRECENT': 'CRES',
    'CRES': 'CRES',
    'CRESCENT': 'CRES',
    'CRESENT': 'CRES',
    'CRSCNT': 'CRES',
    'CRSENT': 'CRES',
    'CRSNT': 'CRES',
    'CREST': 'CRST',
    'CROSSING': 'XING',
    'CRSSING': 'XING',
    'CRSSNG': 'XING',
    'XING': 'XING',
    'CROSSROAD': 'XRD',
    'CURVE': 'CURV',
    'DALE': 'DL',
    'DL': 'DL',
    'DAM': 'DM',
    'DM': 'DM',
    'DIV': 'DV',
    'DIVIDE': 'DV',
    'DV': 'DV',
    'DVD': 'DV',
    'DR': 'DR',
    'DRIV': 'DR',
    'DRIVE': 'DR',
    'DRV': 'DR',
    'DRIVES': 'DRS',
    'EST': 'EST',
    'ESTATE': 'EST',
    'ESTATES': 'ESTS',
    'ESTS': 'ESTS',
    'EXP': 'EXPY',
    'EXPR': 'EXPY',
    'EXPRESS': 'EXPY',
    'EXPRESSWAY': 'EXPY',
    'EXPW': 'EXPY',
    'EXPY': 'EXPY',
    'EXT': 'EXT',
    'EXTENSION': 'EXT',
    'EXTN': 'EXT',
    'EXTNSN': 'EXT',
    'EXTENSIONS': 'EXTS',
    'EXTS': 'EXTS',
    'FALL': 'FALL',
    'FALLS': 'FLS',
    'FLS': 'FLS',
    'FERRY': 'FRY',
    'FRRY': 'FRY',
    'FRY': 'FRY',
    'FIELD': 'FLD',
    'FLD': 'FLD',
    'FIELDS': 'FLDS',
    'FLDS': 'FLDS',
    'FLAT': 'FLT',
    'FLT': 'FLT',
    'FLATS': 'FLTS',
    'FLTS': 'FLTS',
    'FORD': 'FRD',
    'FRD': 'FRD',
    'FORDS': 'FRDS',
    'FOREST': 'FRST',
    'FORESTS': 'FRST',
    'FRST': 'FRST',
    'FORG': 'FRG',
    'FORGE': 'FRG',
    'FRG': 'FRG',
    'FORGES': 'FRGS',
    'FORK': 'FRK',
    'FRK': 'FRK',
    'FORKS': 'FRKS',
    'FRKS': 'FRKS',
    'FORT': 'FT',
    'FRT': 'FT',
    'FT': 'FT',
    'FREEWAY': 'FWY',
    'FREEWY': 'FWY',
    'FRWAY': 'FWY',
    'FRWY': 'FWY',
    'FWY': 'FWY',
    'GARDEN': 'GDN',
    'GARDN': 'GDN',
    'GDN': 'GDN',
    'GRDEN': 'GDN',
    'GRDN': 'GDN',
    'GARDENS': 'GDNS',
    'GDNS': 'GDNS',
    'GRDNS': 'GDNS',
    'GATEWAY': 'GTWY',
    'GATEWY': 'GTWY',
    'GATWAY': 'GTWY',
    'GTWAY': 'GTWY',
    'GTWY': 'GTWY',
    'GLEN': 'GLN',
    'GLN': 'GLN',
    'GLENS': 'GLNS',
    'GREEN': 'GRN',
    'GRN': 'GRN',
    'GREENS': 'GRNS',
    'GROV': 'GRV',
    'GROVE': 'GRV',
    'GRV': 'GRV',
    'GROVES': 'GRVS',
    'HARB': 'HBR',
    'HARBOR': 'HBR',
    'HARBR': 'HBR',
    'HBR': 'HBR',
    'HRBOR': 'HBR',
    'HARBORS': 'HBRS',
    'HAVEN': 'HVN',
    'HAVN': 'HVN',
    'HVN': 'HVN',
    'HEIGHT': 'HTS',
    'HEIGHTS': 'HTS',
    'HGTS': 'HTS',
    'HT': 'HTS',
    'HTS': 'HTS',
    'HIGHWAY': 'HWY',
    'HIGHWY': 'HWY',
    'HIWAY': 'HWY',
    'HIWY': 'HWY',
    'HWAY': 'HWY',
    'HWY': 'HWY',
    'HILL': 'HL',
    'HL': 'HL',
    'HILLS': 'HLS',
    'HLS': 'HLS',
    'HLLW': 'HOLW',
    'HOLLOW': 'HOLW',
    'HOLLOWS': 'HOLW',
    'HOLW': 'HOLW',
    'HOLWS': 'HOLW',
    'INLET': 'INLT',
    'INLT': 'INLT',
    'IS': 'IS',
    'ISLAND': 'IS',
    'ISLND': 'IS',
    'ISLANDS': 'ISS',
    'ISLNDS': 'ISS',
    'ISS': 'ISS',
    'ISLE': 'ISLE',
    'ISLES': 'ISLE',
    'JCT': 'JCT',
    'JCTION': 'JCT',
    'JCTN': 'JCT',
    'JUNCTION': 'JCT',
    'JUNCTN': 'JCT',
    'JUNCTON': 'JCT',
    'JCTNS': 'JCTS',
    'JCTS': 'JCTS',
    'JUNCTIONS': 'JCTS',
    'KEY': 'KY',
    'KY': 'KY',
    'KEYS': 'KYS',
    'KYS': 'KYS',
    'KNL': 'KNL',
    'KNOL': 'KNL',
    'KNOLL': 'KNL',
    'KNLS': 'KNLS',
    'KNOLLS': 'KNLS',
    'LAKE': 'LK',
    'LK': 'LK',
    'LAKES': 'LKS',
    'LKS': 'LKS',
    'LAND': 'LAND',
    'LANDING': 'LNDG',
    'LNDG': 'LNDG',
    'LNDNG': 'LNDG',
    'LA': 'LN',
    'LANE': 'LN',
    'LANES': 'LN',
    'LN': 'LN',
    'LGT': 'LGT',
    'LIGHT': 'LGT',
    'LIGHTS': 'LGTS',
    'LF': 'LF',
    'LOAF': 'LF',
    'LCK': 'LCK',
    'LOCK': 'LCK',
    'LCKS': 'LCKS',
    'LOCKS': 'LCKS',
    'LDG': 'LDG',
    'LDGE': 'LDG',
    'LODG': 'LDG',
    'LODGE': 'LDG',
    'LOOP': 'LOOP',
    'LOOPS': 'LOOP',
    'MALL': 'MALL',
    'MANOR': 'MNR',
    'MNR': 'MNR',
    'MANORS': 'MNRS',
    'MNRS': 'MNRS',
    'MDW': 'MDW',
    'MEADOW': 'MDW',
    'MDWS': 'MDWS',
    'MEADOWS': 'MDWS',
    'MEDOWS': 'MDWS',
    'MEWS': 'MEWS',
    'MILL': 'ML',
    'ML': 'ML',
    'MILLS': 'MLS',
    'MLS': 'MLS',
    'MISSION': 'MSN',
    'MISSN': 'MSN',
    'MSN': 'MSN',
    'MSSN': 'MSN',
    'MOTORWAY': 'MTWY',
    'MNT': 'MT',
    'MOUNT': 'MT',
    'MT': 'MT',
    'MNTAIN': 'MTN',
    'MNTN': 'MTN',
    'MOUNTAIN': 'MTN',
    'MOUNTIN': 'MTN',
    'MTIN': 'MTN',
    'MTN': 'MTN',
    'MNTNS': 'MTNS',
    'MOUNTAINS': 'MTNS',
    'NCK': 'NCK',
    'NECK': 'NCK',
    'ORCH': 'ORCH',
    'ORCHARD': 'ORCH',
    'ORCHRD': 'ORCH',
    'OVAL': 'OVAL',
    'OVL': 'OVAL',
    'OVERPASS': 'OPAS',
    'PARK': 'PARK',
    'PK': 'PARK',
    'PRK': 'PARK',
    'PARKS': 'PARK',
    'PARKWAY': 'PKWY',
    'PARKWY': 'PKWY',
    'PKWAY': 'PKWY',
    'PKWY': 'PKWY',
    'PKY': 'PKWY',
    'PARKWAYS': 'PKWY',
    'PKWYS': 'PKWY',
    'PASS': 'PASS',
    'PASSAGE': 'PSGE',
    'PATH': 'PATH',
    'PATHS': 'PATH',
    'PIKE': 'PIKE',
    'PIKES': 'PIKE',
    'PINE': 'PNE',
    'PINES': 'PNES',
    'PNES': 'PNES',
    'PL': 'PL',
    'PLACE': 'PL',
    'PLAIN': 'PLN',
    'PLN': 'PLN',
    'PLAINES': 'PLNS',
    'PLAINS': 'PLNS',
    'PLNS': 'PLNS',
    'PLAZA': 'PLZ',
    'PLZ': 'PLZ',
    'PLZA': 'PLZ',
    'POINT': 'PT',
    'PT': 'PT',
    'POINTS': 'PTS',
    'PTS': 'PTS',
    'PORT': 'PRT',
    'PRT': 'PRT',
    'PORTS': 'PRTS',
    'PRTS': 'PRTS',
    'PR': 'PR',
    'PRAIRIE': 'PR',
    'PRARIE': 'PR',
    'PRR': 'PR',
    'RAD': 'RADL',
    'RADIAL': 'RADL',
    'RADIEL': 'RADL',
    'RADL': 'RADL',
    'RAMP': 'RAMP',
    'RANCH': 'RNCH',
    'RANCHES': 'RNCH',
    'RNCH': 'RNCH',
    'RNCHS': 'RNCH',
    'RAPID': 'RPD',
    'RPD': 'RPD',
    'RAPIDS': 'RPDS',
    'RPDS': 'RPDS',
    'REST': 'RST',
    'RST': 'RST',
    'RDG': 'RDG',
    'RDGE': 'RDG',
    'RIDGE': 'RDG',
    'RDGS': 'RDGS',
    'RIDGES': 'RDGS',
    'RIV': 'RIV',
    'RIVER': 'RIV',
    'RIVR': 'RIV',
    'RVR': 'RIV',
    'RD': 'RD',
    'ROAD': 'RD',
    'RDS': 'RDS',
    'ROADS': 'RDS',
    'ROUTE': 'RTE',
    'ROW': 'ROW',
    'RUE': 'RUE',
    'RUN': 'RUN',
    'SHL': 'SHL',
    'SHOAL': 'SHL',
    'SHLS': 'SHLS',
    'SHOALS': 'SHLS',
    'SHOAR': 'SHR',
    'SHORE': 'SHR',
    'SHR': 'SHR',
    'SHOARS': 'SHRS',
    'SHORES': 'SHRS',
    'SHRS': 'SHRS',
    'SKYWAY': 'SKWY',
    'SPG': 'SPG',
    'SPNG': 'SPG',
    'SPRING': 'SPG',
    'SPRNG': 'SPG',
    'SPGS': 'SPGS',
    'SPNGS': 'SPGS',
    'SPRINGS': 'SPGS',
    'SPRNGS': 'SPGS',
    'SPUR': 'SPUR',
    'SPURS': 'SPUR',
    'SQ': 'SQ',
    'SQR': 'SQ',
    'SQRE': 'SQ',
    'SQU': 'SQ',
    'SQUARE': 'SQ',
    'SQRS': 'SQS',
    'SQUARES': 'SQS',
    'STA': 'STA',
    'STATION': 'STA',
    'STATN': 'STA',
    'STN': 'STA',
    'STRA': 'STRA',
    'STRAV': 'STRA',
    'STRAVE': 'STRA',
    'STRAVEN': 'STRA',
    'STRAVENUE': 'STRA',
    'STRAVN': 'STRA',
    'STRVN': 'STRA',
    'STRVNUE': 'STRA',
    'STREAM': 'STRM',
    'STREME': 'STRM',
    'STRM': 'STRM',
    'ST': 'ST',
    'STR': 'ST',
    'STREET': 'ST',
    'STRT': 'ST',
    'STREETS': 'STS',
    'SMT': 'SMT',
    'SUMIT': 'SMT',
    'SUMITT': 'SMT',
    'SUMMIT': 'SMT',
    'TER': 'TER',
    'TERR': 'TER',
    'TERRACE': 'TER',
    'THROUGHWAY': 'TRWY',
    'TRACE': 'TRCE',
    'TRACES': 'TRCE',
    'TRCE': 'TRCE',
    'TRACK': 'TRAK',
    'TRACKS': 'TRAK',
    'TRAK': 'TRAK',
    'TRK': 'TRAK',
    'TRKS': 'TRAK',
    'TRAFFICWAY': 'TRFY',
    'TRFY': 'TRFY',
    'TR': 'TRL',
    'TRAIL': 'TRL',
    'TRAILS': 'TRL',
    'TRL': 'TRL',
    'TRLS': 'TRL',
    'TUNEL': 'TUNL',
    'TUNL': 'TUNL',
    'TUNLS': 'TUNL',
    'TUNNEL': 'TUNL',
    'TUNNELS': 'TUNL',
    'TUNNL': 'TUNL',
    'TPK': 'TPKE',
    'TPKE': 'TPKE',
    'TRNPK': 'TPKE',
    'TRPK': 'TPKE',
    'TURNPIKE': 'TPKE',
    'TURNPK': 'TPKE',
    'UNDERPASS': 'UPAS',
    'UN': 'UN',
    'UNION': 'UN',
    'UNIONS': 'UNS',
    'VALLEY': 'VLY',
    'VALLY': 'VLY',
    'VLLY': 'VLY',
    'VLY': 'VLY',
    'VALLEYS': 'VLYS',
    'VLYS': 'VLYS',
    'VDCT': 'VIA',
    'VIA': 'VIA',
    'VIADCT': 'VIA',
    'VIADUCT': 'VIA',
    'VIEW': 'VW',
    'VW': 'VW',
    'VIEWS': 'VWS',
    'VWS': 'VWS',
    'VILL': 'VLG',
    'VILLAG': 'VLG',
    'VILLAGE': 'VLG',
    'VILLG': 'VLG',
    'VILLIAGE': 'VLG',
    'VLG': 'VLG',
    'VILLAGES': 'VLGS',
    'VLGS': 'VLGS',
    'VILLE': 'VL',
    'VL': 'VL',
    'VIS': 'VIS',
    'VIST': 'VIS',
    'VISTA': 'VIS',
    'VST': 'VIS',
    'VSTA': 'VIS',
    'WALK': 'WALK',
    'WALKS': 'WALK',
    'WALL': 'WALL',
    'WAY': 'WAY',
    'WY': 'WAY',
    'WAYS': 'WAYS',
    'WELL': 'WL',
    'WELLS': 'WLS',
    'WLS': 'WLS'
}

OCCUPANCY_TYPE_ABBREVIATIONS = {
    'APARTMENT': 'APT',
    'BUILDING': 'BLDG',
    'BASEMENT': 'BSMT',
    'DEPARTMENT': 'DEPT',
    'FLOOR': 'FL',
    'FRONT': 'FRNT',
    'HANGER': 'HNGR',
    'KEY': 'KEY',
    'LOBBY': 'LBBY',
    'LOT': 'LOT',
    'LOWER': 'LOWR',
    'OFFICE': 'OFC',
    'PENTHOUSE': 'PH',
    'PIER': 'PIER',
    'REAR': 'REAR',
    'ROOM': 'RM',
    'SIDE': 'SIDE',
    'SLIP': 'SLIP',
    'SPACE': 'SPC',
    'STOP': 'STOP',
    'SUITE': 'STE',
    'TRAILER': 'TRLR',
    'UNIT': 'UNIT',
    'UPPER': 'UPPER',
    '#': '#'
}
LONGHAND_STREET_TYPES = {
    'ALY': 'ALLEY',
    'ANX': 'ANNEX',
    'ARC': 'ARCADE',
    'AVE': 'AVENUE',
    'BYU': 'BAYOU',
    'BCH': 'BEACH',
    'BND': 'BEND',
    'BLF': 'BLUFF',
    'BLFS': 'BLUFFS',
    'BTM': 'BOTTOM',
    'BLVD': 'BOULEVARD',
    'BR': 'BRANCH',
    'BRG': 'BRIDGE',
    'BRK': 'BROOK',
    'BRKS': 'BROOKS',
    'BGS': 'BURGS',
    'BYP': 'BYPASS',
    'CP': 'CAMP',
    'CYN': 'CANYON',
    'CPE': 'CAPE',
    'CSWY': 'CAUSEWAY',
    'CTR': 'CENTER',
    'CTRS': 'CENTERS',
    'CIR': 'CIRCLE',
    'CIRS': 'CIRCLES',
    'CLF': 'CLIFF',
    'CLFS': 'CLIFFS',
    'CMN': 'COMMON',
    'COR': 'CORNER',
    'CORS': 'CORNERS',
    'CRSE': 'COURSE',
    'CT': 'COURT',
    'CTS': 'COURTS',
    'CVS': 'COVES',
    'CRK': 'CREEK',
    'CRES': 'CRESCENT',
    'CRST': 'CREST',
    'XING': 'CROSSING',
    'XRD': 'CROSSROAD',
    'CURV': 'CURVE',
    'DL': 'DALE',
    'DM': 'DAM',
    'DV': 'DIVIDE',
    'DR': 'DRIVE',
    'DRS': 'DRIVES',
    'EST': 'ESTATE',
    'ESTS': 'ESTATES',
    'EXPY': 'EXPRESSWAY',
    'EXT': 'EXTENSION',
    'EXTS': 'EXTENSIONS',
    'FALL': 'FALL',
    'FLS': 'FALLS',
    'FRY': 'FERRY',
    'FLD': 'FIELD',
    'FLDS': 'FIELDS',
    'FLT': 'FLAT',
    'FLTS': 'FLATS',
    'FRD': 'FORD',
    'FRDS': 'FORDS',
    'FRST': 'FORESTS',
    'FRG': 'FORGE',
    'FRGS': 'FORGES',
    'FRK': 'FORK',
    'FRKS': 'FORKS',
    'FT': 'FORT',
    'FWY': 'FREEWAY',
    'GDN': 'GARDEN',
    'GDNS': 'GARDENS',
    'GTWY': 'GATEWAY',
    'GLN': 'GLEN',
    'GLNS': 'GLENS',
    'GRNS': 'GREENS',
    'GRV': 'GROVE',
    'GRVS': 'GROVES',
    'HBR': 'HARBOR',
    'HBRS': 'HARBORS',
    'HVN': 'HAVEN',
    'HTS': 'HEIGHTS',
    'HWY': 'HIGHWAY',
    'HL': 'HILL',
    'HLS': 'HILLS',
    'HOLW': 'HOLLOW',
    'INLT': 'INLET',
    'IS': 'ISLAND',
    'ISS': 'ISLANDS',
    'ISLE': 'ISLE',
    'JCT': 'JUNCTION',
    'JCTS': 'JUNCTIONS',
    'KY': 'KEY',
    'KYS': 'KEYS',
    'KNL': 'KNOLL',
    'KNLS': 'KNOLLS',
    'LK': 'LAKE',
    'LKS': 'LAKES',
    'LAND': 'LAND',
    'LNDG': 'LANDING',
    'LN': 'LANE',
    'LGT': 'LIGHT',
    'LGTS': 'LIGHTS',
    'LF': 'LOAF',
    'LCK': 'LOCK',
    'LCKS': 'LOCKS',
    'LDG': 'LODGE',
    'LOOP': 'LOOP',
    'MALL': 'MALL',
    'MNR': 'MANOR',
    'MNRS': 'MANORS',
    'MDW': 'MEADOW',
    'MDWS': 'MEADOWS',
    'MEWS': 'MEWS',
    'ML': 'MILL',
    'MLS': 'MILLS',
    'MSN': 'MISSION',
    'MTWY': 'MOTORWAY',
    'MT': 'MOUNT',
    'MTN': 'MOUNTAIN',
    'MTNS': 'MOUNTAINS',
    'NCK': 'NECK',
    'ORCH': 'ORCHARD',
    'OVAL': 'OVAL',
    'OPAS': 'OVERPASS',
    'PARK': 'PARKS',
    'PKWY': 'PARKWAY',
    'PASS': 'PASS',
    'PSGE': 'PASSAGE',
    'PATH': 'PATHS',
    'PIKE': 'PIKES',
    'PNE': 'PINE',
    'PNES': 'PINES',
    'PL': 'PLACE',
    'PLN': 'PLAIN',
    'PLNS': 'PLAINS',
    'PLZ': 'PLAZA',
    'PT': 'POINT',
    'PTS': 'POINTS',
    'PRT': 'PORT',
    'PRTS': 'PORTS',
    'PR': 'PRAIRIE',
    'RADL': 'RADIAL',
    'RAMP': 'RAMP',
    'RNCH': 'RANCH',
    'RPD': 'RAPID',
    'RPDS': 'RAPIDS',
    'RST': 'REST',
    'RDG': 'RIDGE',
    'RDGS': 'RIDGES',
    'RIV': 'RIVER',
    'RD': 'ROAD',
    'RDS': 'ROADS',
    'RTE': 'ROUTE',
    'ROW': 'ROW',
    'RUE': 'RUE',
    'RUN': 'RUN',
    'SHL': 'SHOAL',
    'SHLS': 'SHOALS',
    'SHR': 'SHORE',
    'SHRS': 'SHORES',
    'SKWY': 'SKYWAY',
    'SPG': 'SPRING',
    'SPGS': 'SPRINGS',
    'SPUR': 'SPURS',
    'SQ': 'SQUARE',
    'SQS': 'SQUARES',
    'STA': 'STATION',
    'STRA': 'STRAVENUE',
    'STRM': 'STREAM',
    'ST': 'STREET',
    'STS': 'STREETS',
    'SMT': 'SUMMIT',
    'TER': 'TERRACE',
    'TRWY': 'THROUGHWAY',
    'TRCE': 'TRACE',
    'TRAK': 'TRACK',
    'TRFY': 'TRAFFICWAY',
    'TRL': 'TRAIL',
    'TUNL': 'TUNNEL',
    'TPKE': 'TURNPIKE',
    'UPAS': 'UNDERPASS',
    'UN': 'UNION',
    'UNS': 'UNIONS',
    'VLY': 'VALLEY',
    'VLYS': 'VALLEYS',
    'VIA': 'VIADUCT',
    'VW': 'VIEW',
    'VWS': 'VIEWS',
    'VLG': 'VILLAGE',
    'VLGS': 'VILLAGES',
    'VL': 'VILLE',
    'VIS': 'VISTA',
    'WALK': 'WALK',
    'WALL': 'WALL',
    'WAY': 'WAY',
    'WL': 'WELL',
    'WLS': 'WELLS'
}
STATE_ABBREVIATIONS = {
    'ALABAMA': 'AL',
    'ALA': 'AL',
    'ALASKA': 'AK',
    'ALAS': 'AK',
    'ARIZONA': 'AZ',
    'ARIZ': 'AZ',
    'ARKANSAS': 'AR',
    'ARK': 'AR',
    'CALIFORNIA': 'CA',
    'CALIF': 'CA',
    'CAL': 'CA',
    'COLORADO': 'CO',
    'COLO': 'CO',
    'COL': 'CO',
    'CONNECTICUT': 'CT',
    'CONN': 'CT',
    'DELAWARE': 'DE',
    'DEL': 'DE',
    'DISTRICT OF COLUMBIA': 'DC',
    'FLORIDA': 'FL',
    'FLA': 'FL',
    'FLOR': 'FL',
    'GEORGIA': 'GA',
    'GA': 'GA',
    'HAWAII': 'HI',
    'IDAHO': 'ID',
    'IDA': 'ID',
    'ILLINOIS': 'IL',
    'ILL': 'IL',
    'INDIANA': 'IN',
    'IND': 'IN',
    'IOWA': 'IA',
    'KANSAS': 'KS',
    'KANS': 'KS',
    'KAN': 'KS',
    'KENTUCKY': 'KY',
    'KEN': 'KY',
    'KENT': 'KY',
    'LOUISIANA': 'LA',
    'MAINE': 'ME',
    'MARYLAND': 'MD',
    'MASSACHUSETTS': 'MA',
    'MASS': 'MA',
    'MICHIGAN': 'MI',
    'MICH': 'MI',
    'MINNESOTA': 'MN',
    'MINN': 'MN',
    'MISSISSIPPI': 'MS',
    'MISS': 'MS',
    'MISSOURI': 'MO',
    'MONTANA': 'MT',
    'MONT': 'MT',
    'NEBRASKA': 'NE',
    'NEBR': 'NE',
    'NEB': 'NE',
    'NEVADA': 'NV',
    'NEV': 'NV',
    'NEW HAMPSHIRE': 'NH',
    'NEW JERSEY': 'NJ',
    'NEW MEXICO': 'NM',
    'N MEX': 'NM',
    'NEW M': 'NM',
    'NEW YORK': 'NY',
    'NORTH CAROLINA': 'NC',
    'NORTH DAKOTA': 'ND',
    'N DAK': 'ND',
    'OHIO': 'OH',
    'OKLAHOMA': 'OK',
    'OKLA': 'OK',
    'OREGON': 'OR',
    'OREG': 'OR',
    'ORE': 'OR',
    'PENNSYLVANIA': 'PA',
    'PENN': 'PA',
    'RHODE ISLAND': 'RI',
    'SOUTH CAROLINA': 'SC',
    'SOUTH DAKOTA': 'SD',
    'S DAK': 'SD',
    'TENNESSEE': 'TN',
    'TENN': 'TN',
    'TEXAS': 'TX',
    'TEX': 'TX',
    'UTAH': 'UT',
    'VERMONT': 'VT',
    'VIRGINIA': 'VA',
    'WASHINGTON': 'WA',
    'WASH': 'WA',
    'WEST VIRGINIA': 'WV',
    'W VA': 'WV',
    'WISCONSIN': 'WI',
    'WIS': 'WI',
    'WISC': 'WI',
    'WYOMING': 'WY',
    'WYO': 'WY'
}

ADDRESS_KEYS = (
    'address_line_1', 'address_line_2', 'city', 'state', 'postal_code'
)


class NormalizationConfig(Config):
    """Config class for GBR"""
    # pylint: disable=too-few-public-methods
    default_file = 'address_constants.yaml'

    def __init__(self, config_file=None, config_dir=None, section=None):
        super(NormalizationConfig, self).__init__(
            config_file=config_file, config_dir=config_dir, section=section,
            env_prefix='ADDRESS_CONFIG'
        )


def set_address_constants():
    config = NormalizationConfig()
    if config:
        addr_constants = (
            'DIRECTIONAL_REPLACEMENTS',
            'OCCUPANCY_TYPE_ABBREVIATIONS',
            'STATE_ABBREVIATIONS',
            'STREET_TYPE_ABBREVIATIONS',
            'KNOWN_ODDITIES',
            'PROBLEM_ST_TYPE_ABBRVS',
            'LONGHAND_DIRECTIONALS',
            'LONGHAND_STREET_TYPES',
        )
        insertion_method = config.get('insertion_method', default='update')
        update = ('update', 'insert')
        replace = ('replace', 'overwrite')
        if insertion_method not in update + replace:
            msg = "'{}' is not a valid option for 'insertion_method'".format(
                insertion_method
            )
            raise ConfigError(msg)
        globals()['ADDRESS_KEYS'] = config.get(
            'ADDRESS_KEYS', default=globals()['ADDRESS_KEYS']
        )
        for key in addr_constants:
            new_vals = config.get(key, default={})
            if key == 'OCCUPANCY_TYPE_ABBREVIATIONS' and new_vals:
                org_keys = OCCUPANCY_TYPE_ABBREVIATIONS.keys()
                new_keys = new_vals.keys()
                globals()['ABNORMAL_OCCUPANCY_ABBRVS'] = (
                    set(new_keys) - set(org_keys)
                )
            if new_vals and insertion_method in update:
                # pass the mapping directly; ** unpacking fails on non-str keys
                globals()[key].update(new_vals)
            elif new_vals and insertion_method in replace:
                globals()[key] = new_vals


set_address_constants()


================================================
FILE: scourgify/cleaning.py
================================================
#!/usr/bin/env python
# encoding: utf-8
"""
copyright (c) 2016-2017 Earth Advantage.
All rights reserved
.. codeauthor:: Fable Turas <fable@rainsoftware.tech>

Provides functions for cleaning address strings before and after parsing.
"""

# Imports from Standard Library
import re
import unicodedata
from typing import Any, Optional, Sequence, Union

# Imports from Third Party Modules
import usaddress

# Local Imports
from scourgify.address_constants import (
    KNOWN_ODDITIES,
    OCCUPANCY_TYPE_ABBREVIATIONS,
    PROBLEM_ST_TYPE_ABBRVS,
    AMBIGUOUS_DIRECTIONALS
)

# Setup

# Constants
# '#', '&', hyphens, periods (in decimals) and '/' are acceptable address
# components: ord('#'), ord('&'), ord('-'), ord('.') and ord('/')
ALLOWED_CHARS = [35, 38, 45, 46, 47]

# Don't remove ',', '(' or ')' in PRE_CLEAN
PRECLEAN_EXCLUDE = [40, 41, 44]
EXCLUDE_ALL = ALLOWED_CHARS + PRECLEAN_EXCLUDE

STRIP_CHAR_CATS = (
    'M', 'S', 'C', 'Nl', 'No', 'Pc', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'
)
STRIP_PUNC_CATS = ('Z', 'Pd')
STRIP_ALL_CATS = STRIP_CHAR_CATS + STRIP_PUNC_CATS

# Data Structure Definitions

# Private Functions


# Public Classes and Functions

def pre_clean_addr_str(addr_str, state=None):
    # type: (str, Optional[str]) -> str
    """Remove any known undesirable sub-strings and special characters.

    Cleaning should be enacted on an addr_str to remove known characters
    and phrases that might prevent usaddress from successfully parsing.
    Follows USPS pub 28 guidelines for undesirable special characters.
    Non-address phrases or character sets known to occur in raw addresses
    should be added to address_constants.KNOWN_ODDITIES.

    Some characters are left behind to potentially assist in second chance
    processing of unparseable addresses and should be further cleaned
    post_processing. (see post_clean_addr_str).

    :param addr_str: raw address string
    :type addr_str: str
    :param state: optional string containing normalized state data.
    :type state: str
    :return: cleaned string
    :rtype: str
    """
    # replace any easily handled, undesirable sub-strings
    if any(oddity in addr_str for oddity in KNOWN_ODDITIES.keys()):
        for key, replacement in KNOWN_ODDITIES.items():      # pragma: no cover
            addr_str = addr_str.replace(key, replacement)

    # remove non-decimal point period chars.
    if '.' in addr_str:                                      # pragma: no cover
        addr_str = clean_period_char(addr_str)

    addr_str = pre_clean_directionals(addr_str)

    # remove special characters per USPS pub 28, except & which impacts
    # intersection addresses, and - which impacts range addresses and zipcodes.
    # ',', '(' and ')' are also left for potential use in additional line 2
    # processing functions
    addr_str = clean_upper(
        addr_str, exclude=EXCLUDE_ALL, removal_cats=STRIP_CHAR_CATS
    )

    # to prevent any potential confusion between CT = COURT v CT = Connecticut,
    # clean_ambiguous_street_types is only applied when a state is supplied
    # and that state is not itself one of the problem abbreviations (ie. CT).
    if state and state not in PROBLEM_ST_TYPE_ABBRVS.keys():
        addr_str = clean_ambiguous_street_types(addr_str)

    return addr_str


def clean_ambiguous_street_types(addr_str):
    # type: (str) -> str
    """Clean street type abbreviations treated ambiguously by usaddress.

    Some two char street type abbreviations (ie. CT) are treated as StateName
    by usaddress when address lines are parsed in isolation. To correct this,
    known problem abbreviations are converted to their whole word equivalent.

    :param addr_str: string containing address street and occupancy data
        without city and state.
    :type addr_str: str | None
    :return: original or cleaned addr_str
    :rtype: str | None
    """
    if addr_str:
        split_addr = addr_str.split()
        for key in PROBLEM_ST_TYPE_ABBRVS:
            if key in split_addr:
                split_addr[split_addr.index(key)] = PROBLEM_ST_TYPE_ABBRVS[key]
                addr_str = ' '.join(split_addr)
                break
    return addr_str


def post_clean_addr_str(addr_str):
    # type: (Optional[str]) -> Optional[str]
    """Remove any special chars or extra white space remaining post-processing.

    :param addr_str: post-processing address string.
    :type addr_str: str | None
    :return: str set to uppercase, extra white space and special chars removed,
        or the original value if addr_str is empty.
    :rtype: str | None
    """
    if addr_str:
        addr_str = clean_upper(
            addr_str, exclude=ALLOWED_CHARS, removal_cats=STRIP_CHAR_CATS
        )
    return addr_str


def _parse_occupancy(addr_line_2):
    occupancy = None
    if addr_line_2:
        parsed = None
        # first try usaddress parsing labels
        try:
            parsed = usaddress.tag(addr_line_2)
        except usaddress.RepeatedLabelError:
            pass
        if parsed:
            occupancy = parsed[0].get('OccupancyIdentifier')
    return occupancy


def strip_occupancy_type(addr_line_2):
    # type: (str) -> str
    """Strip occupancy type (ie apt, unit, etc) from addr_line_2 string

    :param addr_line_2: address line 2 string that may contain type
    :type addr_line_2: str
    :return: occupancy identifier with the occupancy type removed
    :rtype: str
    """
    occupancy = None
    if addr_line_2:
        addr_line_2 = addr_line_2.replace('#', '').strip().upper()
        occupancy = _parse_occupancy(addr_line_2)

        # if that doesn't work, clean abbrevs and try again
        if not occupancy:
            parts = str(addr_line_2).split()
            for p in parts:
                if p in OCCUPANCY_TYPE_ABBREVIATIONS:
                    addr_line_2 = addr_line_2.replace(
                        p, OCCUPANCY_TYPE_ABBREVIATIONS[p]
                    )
            occupancy = _parse_occupancy(addr_line_2)

            # if that doesn't work, dissect it manually
            if not occupancy:
                occupancy = addr_line_2
                types = (
                    list(OCCUPANCY_TYPE_ABBREVIATIONS.keys())
                    + list(OCCUPANCY_TYPE_ABBREVIATIONS.values())
                )
                if parts and len(parts) > 1:
                    ids = [p for p in parts if p not in types]
                    occupancy = ' '.join(ids)

    return occupancy


def clean_upper(text,                           # type: Any
                exclude=None,                   # type: Optional[Sequence[int]]
                removal_cats=STRIP_CHAR_CATS,   # type: Optional[Sequence[str]]
                strip_spaces=False              # type: Optional[bool]
                ):
    # type: (...) -> str
    """
    Return text as upper case unicode string and remove unwanted characters.
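    Defaults to removing STRIP_CHAR_CATS (marks, symbols, control characters
    and most punctuation); runs of whitespace are collapsed to single spaces.
    :param text: text to clean
    :type text: str
    :param exclude: sequence of char ordinals to retain even if their
        category is listed in removal_cats
    :type exclude: Sequence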
    :param removal_cats: sequence of strings identifying unicodedata categories
        (or startswith) of characters to be removed from text
    :type removal_cats: Sequence
    :param strip_spaces: Bool to indicate whether to leave or remove all
        spaces. Default is False (leaves single spaces)
    :type strip_spaces: bool
    :return: cleaned uppercase unicode string
    :rtype: str
    """
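    exclude = exclude or []
    # str.startswith requires a tuple of prefixes (a list raises TypeError)
    removal_cats = tuple(removal_cats or ())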
    # coerce ints etc to str
    if not isinstance(text, str):  # pragma: no cover
        text = str(text)
    # catch and convert fractions
    text = unicodedata.normalize('NFKD', text)
    text = text.translate({8260: '/'})

    # evaluate string without commas (,) or ampersand (&) to determine if
    # further processing is necessary
    alnum_text = text.translate({44: None, 38: None})

    # remove unwanted non-alphanumeric characters and convert all dash type
    # characters to hyphen
    if not alnum_text.replace(' ', '').isalnum():
        for char in text:
            if (unicodedata.category(char).startswith(removal_cats)
                    and ord(char) not in exclude):
                text = text.translate({ord(char): None})
            elif unicodedata.category(char).startswith('Pd'):
                text = text.translate({ord(char): '-'})
    join_char = ' '
    if strip_spaces:
        join_char = ''
    # remove extra spaces and convert to uppercase
    return join_char.join(text.split()).upper()


def clean_period_char(text):
    """Remove all period characters that are not decimal points.

    :param text: string text to clean
    :type text: str
    :return: cleaned string
    :rtype: str
    """
    period_pattern = re.compile(r'\.(?!\d)')
    return re.sub(period_pattern, '', text)


def pre_clean_directionals(text):
    """
    Replaces any ambiguous directionals (ie south-west) with their
    standard abbreviation (ie SW). This helps ensure the directionals are
    correctly identified during the usaddress tagging, rather than being
    identified as part of the street name.
    Directionals misrepresented as two words (ie south west) are not cleaned
    because directional named streets (ie West St) do exist with
    pre-directionals (ie S West St).
    """
    for direction, abbr in AMBIGUOUS_DIRECTIONALS.items():
        text = text.upper().replace(direction, abbr)
    return text


================================================
FILE: scourgify/exceptions.py
================================================
#!/usr/bin/env python
# encoding: utf-8
"""
copyright (c) 2016-2017 Earth Advantage.
All rights reserved
.. codeauthor:: Fable Turas <fable@rainsoftware.tech>

Custom errors pertaining to address normalization.
"""


# Private Functions


# Public Classes and Functions

class AddressNormalizationError(Exception):
    """Indicates error during normalization"""
    TITLE = None
    MESSAGE = None

    def __init__(self, error=None, title=None, *args):
        self.error = error or self.MESSAGE
        self.title = title or self.TITLE
        args = (error, title) + args
        super(AddressNormalizationError, self).__init__(*args)

    def __str__(self):
        msg = "{}: {}".format(self.title, self.error)
        if len(self.args) > 2:
            msg = "{}, {}".format(
                msg, ', '.join(str(a) for a in self.args[2:])
            )
        return msg


class AmbiguousAddressError(AddressNormalizationError):
    """Indicates an error from ambiguous addresses or address parts."""
    MESSAGE = "This address contains ambiguous elements."
    TITLE = "AMBIGUOUS ADDRESS"


class UnParseableAddressError(AddressNormalizationError):
    """Indicates an error from addresses that cannot be parsed."""
    MESSAGE = "Unable to break this address into its component parts"
    TITLE = "UNPARSEABLE ADDRESS"


class IncompleteAddressError(AddressNormalizationError):
    """Indicates error from addresses that don't have enough data to index."""
    MESSAGE = "This address is missing one or more required elements"
    TITLE = "INCOMPLETE ADDRESS"


class AddressValidationError(AddressNormalizationError):
    """Indicates address elements that don't meet format standards."""
    MESSAGE = "Address contains invalid formatting"
    TITLE = "ADDRESS FORMAT VALIDATION"


================================================
FILE: scourgify/normalize.py
================================================
#!/usr/bin/env python
# encoding: utf-8
"""
copyright (c) 2016  Earth Advantage.
All rights reserved
.. codeauthor:: Fable Turas <fable@rainsoftware.tech>

Provides functions to normalize address per USPS pub 28 and/or RESO standards.
"""
from __future__ import annotations

# TODO: Find why # with no street gets through
# form_normalization = {
#     'jurisdiction_property_id': 'TST123',
#     'address_line_1': '123',
#     'city': 'Portland',
#     'state': 'OR',
#     'postal_code': '97212'
# }

# Imports from Standard Library

from string import Template
from collections import OrderedDict  # noqa # pylint: disable=unused-import
from typing import (  # noqa # pylint: disable=unused-import
    Callable,
    Mapping,
    MutableMapping,
    Optional,
    Sequence,
    Union,
)

# Imports from Third Party Modules
import geocoder
import usaddress

# Local Imports
from scourgify.address_constants import (
    ABNORMAL_OCCUPANCY_ABBRVS,
    ADDRESS_KEYS,
    CITY_ABBREVIATIONS,
    DIRECTIONAL_REPLACEMENTS,
    LONGHAND_DIRECTIONALS,
    LONGHAND_STREET_TYPES,
    OCCUPANCY_TYPE_ABBREVIATIONS,
    STATE_ABBREVIATIONS,
    STREET_TYPE_ABBREVIATIONS,
)
from scourgify.cleaning import (
    clean_upper,
    post_clean_addr_str,
    pre_clean_addr_str,
    strip_occupancy_type,
)
from scourgify.exceptions import (
    AddressNormalizationError,
    AmbiguousAddressError,
    UnParseableAddressError,
)
from scourgify.validations import (
    validate_address_components,
    validate_parens_groups_parsed,
    validate_us_postal_code_format,
)

# Setup

# Constants

LINE1_USADDRESS_LABELS = (
    'AddressNumber',
    'StreetName',
    'AddressNumberPrefix',
    'AddressNumberSuffix',
    'StreetNamePreDirectional',
    'StreetNamePostDirectional',
    'StreetNamePreModifier',
    'StreetNamePostType',
    'StreetNamePreType',
    'IntersectionSeparator',
    'SecondStreetNamePreDirectional',
    'SecondStreetNamePostDirectional',
    'SecondStreetNamePreModifier',
    'SecondStreetNamePostType',
    'SecondStreetNamePreType',
    'LandmarkName',
    'CornerOf',
    'BuildingName',
)
LINE2_USADDRESS_LABELS = (
    'OccupancyType',
    'OccupancyIdentifier',
    'SubaddressIdentifier',
    'SubaddressType',
)

LAST_LINE_LABELS = (
    'PlaceName',
    'StateName',
    'ZipCode',
)

AMBIGUOUS_LABELS = (
    'Recipient',
    'USPSBoxType',
    'USPSBoxID',
    'USPSBoxGroupType',
    'USPSBoxGroupID',
    'NotAddress'
)

STRIP_CHAR_CATS = (
    'M', 'S', 'C', 'Nl', 'No', 'Pc', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'
)
STRIP_PUNC_CATS = ('Z', 'Pd')
STRIP_ALL_CATS = STRIP_CHAR_CATS + STRIP_PUNC_CATS


# Private Functions

# Public Classes and Functions

def normalize_address_record(address: str | dict,
                             addr_map: Optional[dict] = None,
                             addtl_funcs: Optional[Sequence[Callable]] = None,
                             strict: bool = True,
                             long_hand: bool = False) -> dict:
    """Normalize an address according to USPS pub. 28 standards.

    Takes an address string, or a dict-like with standard address fields
    (address_line_1, address_line_2, city, state, postal_code), removes
    unacceptable special characters, extra spaces, predictable abnormal
    character sub-strings and phrases, abbreviates directional indicators
    and street types.  If applicable, line 2 address elements (ie: Apt, Unit)
    are separated from line 1 inputs.

    addr_map, if used, must be in the format {standard_key: custom_key} based
    on the standard address keys cited above.

    Returns an address dict with all field values in uppercase format.

    :param address: str or dict-like object containing details of a single
        address.
    :type address: str | Mapping[str, str]
    :param addr_map: mapping of standard address fields to custom key names
    :type addr_map: Mapping[str, str]
    :param addtl_funcs: optional sequence of funcs that take string for further
        processing and return line1 and line2 strings
    :type addtl_funcs: Sequence[Callable[str, (str, str)]]
    :param strict: bool indicating strict handling of address component parts
        (city, state and postal_code, vs city and state OR postal_code)
    :param long_hand: bool indicating whether to use long hand versions of
        directionals and street types in the output.
    :return: address dict containing parsed and normalized address values.
    :rtype: Mapping[str, str]
    """
    if isinstance(address, str):
        return normalize_addr_str(
            address, addtl_funcs=addtl_funcs, long_hand=long_hand
        )
    else:
        return normalize_addr_dict(
            address, addr_map=addr_map, addtl_funcs=addtl_funcs,
            strict=strict, long_hand=long_hand
        )


def normalize_addr_str(addr_str: str, line2: Optional[str] = None,
                       city: Optional[str] = None,
                       state: Optional[str] = None,
                       zipcode: Optional[str] = None,
                       addtl_funcs: Optional[Sequence[Callable]] = None,
                       long_hand: bool = False) -> dict:
    """Normalize a complete or partial address string.

    :param addr_str: str containing address data.
    :type addr_str: str
    :param line2: optional str containing occupancy or sub-address data
        (eg: Unit, Apt, Lot).
    :type line2: str
    :param city: optional str city name that does not need to be parsed from
        addr_str.
    :type city: str
    :param state: optional str state name that does not need to be parsed from
        addr_str.
    :type state: str
    :param zipcode: optional str postal code that does not need to be parsed
        from addr_str.
    :type zipcode: str
    :param addtl_funcs: optional sequence of funcs that take string for further
        processing and return line1 and line2 strings
    :type addtl_funcs: Sequence[Callable[str, (str, str)]]
    :param long_hand: bool indicating whether to use long hand versions of
        directionals and street types in the output.
    :return: address dict with uppercase parsed and normalized address values.
    :rtype: Mapping[str, str]
    """
    # get address parsed into usaddress components.
    error = None
    parsed_addr = None
    addr_str = pre_clean_addr_str(addr_str, normalize_state(state))
    try:
        parsed_addr = parse_address_string(addr_str)
    except (usaddress.RepeatedLabelError, AmbiguousAddressError) as err:
        error = err
        if not line2 and addtl_funcs:
            for func in addtl_funcs:
                try:
                    line1, line2 = func(addr_str)
                    error = False
                    # send refactored line_1 and line_2 back through processing
                    return normalize_addr_str(
                        line1, line2=line2, city=city,
                        state=state, zipcode=zipcode, long_hand=long_hand
                    )
                except ValueError:
                    # try a different additional processing function
                    pass

    if parsed_addr and not parsed_addr.get('StreetName'):
        addr_dict = dict(
            address_line_1=addr_str, address_line_2=line2, city=city,
            state=state, postal_code=zipcode
        )
        full_addr = format_address_record(addr_dict)
        try:
            parsed_addr = parse_address_string(full_addr)
        except (usaddress.RepeatedLabelError, AmbiguousAddressError) as err:
            parsed_addr = None
            error = err

    if parsed_addr:
        parsed_addr = normalize_address_components(
            parsed_addr, long_hand=long_hand
        )
        zipcode = get_parsed_values(
            parsed_addr, zipcode, 'ZipCode', addr_str
        )
        city = get_parsed_values(
            parsed_addr, city, 'PlaceName', addr_str
        )
        state = get_parsed_values(
            parsed_addr, state, 'StateName', addr_str
        )
        state = normalize_state(state)

        # assumes if line2 is passed in that it need not be parsed from
        # addr_str. Primarily used to allow advanced processing of otherwise
        # unparsable addresses.
        line2 = line2 if line2 else get_normalized_line_segment(
            parsed_addr, LINE2_USADDRESS_LABELS
        )
        line2 = post_clean_addr_str(line2)
        # line 1 is fully post cleaned in get_normalized_line_segment.
        line1 = get_normalized_line_segment(
            parsed_addr, LINE1_USADDRESS_LABELS
        )
        validate_parens_groups_parsed(line1)
    else:
        # line1 is set to addr_str so complete dict can be passed to error.
        line1 = addr_str

    addr_rec = OrderedDict(
        address_line_1=line1, address_line_2=line2, city=city,
        state=state, postal_code=zipcode
    )
    if error:
        raise UnParseableAddressError(None, None, addr_rec)
    else:
        return addr_rec


def normalize_addr_dict(addr_dict: dict, addr_map: Optional[dict] = None,
                        addtl_funcs: Optional[Sequence[Callable]] = None,
                        strict: bool = True, long_hand: bool = False) -> dict:
    """Normalize an address from dict or dict-like object.

    Assumes addr_dict will have standard address related keys (address_line_1,
    address_line_2, city, state, postal_code), unless addr_map is provided.

    addr_map, if used, must be in the format {standard_key: custom_key} based
    on the standard address keys cited above.

    :param addr_dict: mapping containing address keys and values.
    :type addr_dict: Mapping
    :param addr_map: mapping of standard address fields to custom key names
    :type addr_map: Mapping[str, str]
    :param addtl_funcs: optional sequence of funcs that take string for further
        processing and return line1 and line2 strings
    :type addtl_funcs: Sequence[Callable[str, (str, str)]]
    :param strict: bool indicating strict handling of address component parts
        (city, state and postal_code, vs city and state OR postal_code)
    :param long_hand: bool indicating whether to use long hand versions of
        directionals and street types in the output.
    :return: address dict with normalized, uppercase address values.
    :rtype: Mapping[str, str]
    """
    if addr_map:
        addr_dict = {key: addr_dict.get(val) for key, val in addr_map.items()}
    addr_dict = validate_address_components(addr_dict, strict=strict)

    # line 1 and line 2 elements are combined to ensure consistent processing
    # whether the line 2 elements are pre-parsed or included in line 1
    addr_str = get_addr_line_str(addr_dict, comma_separate=True)
    postal_code = addr_dict.get('postal_code')
    zipcode = validate_us_postal_code_format(
        postal_code, addr_dict
    ) if postal_code else None
    city = addr_dict.get('city')
    state = addr_dict.get('state')
    try:
        address = normalize_addr_str(
            addr_str, city=city, state=state, zipcode=zipcode,
            addtl_funcs=addtl_funcs, long_hand=long_hand
        )
    except AddressNormalizationError:
        addr_str = get_addr_line_str(
            addr_dict, comma_separate=True, addr_parts=ADDRESS_KEYS
        )
        address = normalize_addr_str(
            addr_str, city=city, state=state, zipcode=zipcode,
            addtl_funcs=addtl_funcs, long_hand=long_hand
        )
    return address


def parse_address_string(addr_str: str) -> dict:
    """Separate an address string into its component parts per usaddress.

    Attempts to parse addr_str into its component parts, using usaddress.

    If usaddress identifies the address type as Ambiguous or the resulting
    OrderedDict includes any keys from AMBIGUOUS_LABELS that would constitute
    ambiguous address in the SEED/GBR use case (ie: Recipient) then
    an AmbiguousAddressError is raised.

    :param addr_str: str address to be processed.
    :type addr_str: str
    :return: usaddress OrderedDict
    :rtype: MutableMapping
    """
    parsed_results = usaddress.tag(addr_str)
    parsed_addr = parsed_results[0]
    # if the address is parseable but some form of ambiguity is found that
    # may result in data corruption, AmbiguousAddressError is raised.
    if (parsed_results[1] == 'Ambiguous' or
            any(key in AMBIGUOUS_LABELS for key in parsed_addr.keys())):
        raise AmbiguousAddressError()
    parsed_addr = handle_abnormal_occupancy(parsed_addr, addr_str)
    return parsed_addr


def handle_abnormal_occupancy(parsed_addr: OrderedDict,
                              addr_str: str) -> OrderedDict:
    """Handle abnormal occupancy abbreviations that are parsed as street type.

    Evaluates addresses with an Occupancy or Subaddress identifier whose type
    may be parsed into StreetNamePostType and swaps the StreetNamePostType tag
    for the OccupancyType tag if necessary.

    For example: Portland Maps uses 'UN' as an abbreviation for 'Unit' which
    usaddress parses to 'StreetNamePostType' since 'UN' is correctly an
    abbreviation for 'Union' street type.
        123 MAIN UN => 123 MAIN UN
        123 MAIN UN A => 123 MAIN, UNIT A
        123 MAIN UN UN A => 123 MAIN UN, UNIT A

    :param parsed_addr: address parsed into usaddress components
    :type parsed_addr: OrderedDict
    :param addr_str: Original address string
    :type addr_str: str
    :return: parsed address
    :rtype: OrderedDict
    """
    occupancy_id_key = None
    occupancy_type_key = 'OccupancyType'
    street_type_key = 'StreetNamePostType'
    occupancy_type_keys = (occupancy_type_key, 'SubaddressType')
    occupancy_identifier_keys = ('OccupancyIdentifier', 'SubaddressIdentifier')
    street_type = parsed_addr.get(street_type_key)
    if street_type in ABNORMAL_OCCUPANCY_ABBRVS:
        occupancy_type = None
        occupancy = None
        for key in occupancy_type_keys:
            try:
                occupancy_type = parsed_addr[key]
                break
            except KeyError:
                pass
        for key in occupancy_identifier_keys:
            try:
                occupancy = parsed_addr[key]
                occupancy_id_key = key
                break
            except KeyError:
                # try the next identifier key rather than giving up
                pass
        if occupancy and not occupancy_type:
            if street_type in occupancy:
                occupancy = occupancy.replace(street_type, '').strip()
                del parsed_addr[occupancy_id_key]
            else:
                line2 = "{} {}".format(street_type, occupancy)
                addr_str = addr_str.replace(line2, '')
                parsed_addr = parse_address_string(addr_str)
            parsed_addr.update({occupancy_type_key: street_type})
            parsed_addr.update({occupancy_id_key: occupancy})
    return parsed_addr


def get_parsed_values(parsed_addr: OrderedDict, orig_val: str,
                      val_label: str, orig_addr_str: str) -> str | None:
    """Get valid values from parsed_addr corresponding to val_label.

    Retrieves the value from parsed_addr corresponding to the label supplied
    in val_label and reconciles it with orig_val.
    If only one of the two contains a non-null value, that value is returned.
    If both are supplied and match, the single matching string is returned.
    If both values are empty, None is returned.
    If both are supplied and are not equal, an AmbiguousAddressError is
    raised. This provides a check against misidentified address
    components when known values are available. (For example when a city is
    supplied from the address dict or record being normalized, but usaddress
    identifies extra information stored in address_line_1 as a PlaceName.)

    :param parsed_addr: address parsed into ordereddict per usaddress.
    :type parsed_addr: Mapping
    :param orig_val: related value passed in from incoming data source.
    :type orig_val: str
    :param val_label: label to locate in parsed_addr
    :type val_label: str
    :param orig_addr_str: address string to pass to error, if applicable.
    :type orig_addr_str: str
    :return: str | None
    """
    val_from_parse = parsed_addr.get(val_label)
    orig_val = post_clean_addr_str(orig_val)
    val_from_parse = post_clean_addr_str(val_from_parse)
    # discard empty strings as well as None so blank inputs are not compared
    non_null_val_set = {val for val in (orig_val, val_from_parse) if val}
    if len(non_null_val_set) > 1:
        msg = (
            f'Parsed {val_label} does not align with submitted value: '
            f'Parsed: {val_from_parse}. Original: {orig_val}'
        )
        raise AmbiguousAddressError(None, msg, orig_addr_str)
    else:
        return non_null_val_set.pop() if non_null_val_set else None


def normalize_address_components(parsed_addr: OrderedDict,
                                 long_hand: bool = False) -> OrderedDict:
    """Normalize parsed sections of address as appropriate.

    Processes parsed address through subsets of normalization rules.

    :param parsed_addr: address parsed into ordereddict per usaddress.
    :param long_hand: bool indicating whether to use long hand versions of
        directional and street type in the output.
    :return: parsed_addr with normalization processing applied to elements.
    :rtype: OrderedDict
    """
    parsed_addr = normalize_numbered_streets(parsed_addr)
    parsed_addr = normalize_directionals(parsed_addr, long_hand=long_hand)
    parsed_addr = normalize_street_types(parsed_addr, long_hand=long_hand)
    parsed_addr = normalize_occupancy_type(parsed_addr)
    return parsed_addr


def normalize_numbered_streets(parsed_addr: OrderedDict) -> OrderedDict:
    """Change numbered street names to include missing original identifiers.

    :param parsed_addr: address parsed into ordereddict per usaddress.
    :type parsed_addr: Mapping
    :return: parsed_addr with ordinal identifiers appended to numbered streets.
    :rtype: dict
    """
    street_tags = ['StreetName', 'SecondStreetName']
    for tag in street_tags:
        post_type_tag = '{}PostType'.format(tag)
        # limits updates to numbered street names that include a post street
        # type, since an ordinal indicator would be inappropriate for some
        # numbered streets (ie. County Road 97).
        if tag in parsed_addr.keys() and post_type_tag in parsed_addr.keys():
            try:
                cardinal = int(parsed_addr[tag])
                ord_indicator = get_ordinal_indicator(cardinal)
                parsed_addr[tag] = '{}{}'.format(cardinal, ord_indicator)
            except ValueError:
                pass
    return parsed_addr


def normalize_directionals(parsed_addr: OrderedDict,
                           long_hand: bool = False) -> OrderedDict:
    """Change directional notations to standard abbreviations.

    :param parsed_addr: address parsed into ordereddict per usaddress.
    :type parsed_addr: Mapping
    :param long_hand: bool indicating whether to use long hand versions of
        directionals in the output.
    :return: parsed_addr with directionals updated to abbreviated format.
    :rtype: dict
    """
    # get the directional related keys from the current address.
    found_directional_tags = (
        tag for tag in parsed_addr.keys() if 'Directional' in tag
    )
    found_directional_tags = list(found_directional_tags)
    for found in found_directional_tags:
        # get the original directional related value per key.
        dir_str = parsed_addr[found]
        # remove spaces, punctuation, hyphens etc so two part directions
        # conform to a single word standard. Convert to upper case
        dir_str = clean_upper(
            dir_str, exclude=[], removal_cats=STRIP_ALL_CATS, strip_spaces=True
        )
        if dir_str in DIRECTIONAL_REPLACEMENTS.keys():
            dir_str = DIRECTIONAL_REPLACEMENTS[dir_str]
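        if long_hand:
            # fall back to the current value if no long hand version exists,
            # rather than raising a KeyError.
            dir_str = LONGHAND_DIRECTIONALS.get(dir_str, dir_str)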
        parsed_addr[found] = dir_str

    return parsed_addr


def normalize_street_types(parsed_addr: OrderedDict,
                           long_hand=False) -> OrderedDict:
    """Change street types to accepted abbreviated format.

    No change is made if street types do not conform to common usages per
    USPS pub 28 appendix C.

    :param parsed_addr: address parsed into ordereddict per usaddress.
    :type parsed_addr: Mapping
    :param long_hand: bool indicating whether to use long hand versions of
        street types in the output.
    :type long_hand: bool
    :return: parsed_addr with street types updated to abbreviated format, or
        to long hand format if long_hand is True.
    :rtype: dict
    """
    # get the *Street*Type keys from the current parsed address.
    found_type_tags = (
        tag for tag in parsed_addr.keys() if 'Street' in tag and 'Type' in tag
    )
    for found in found_type_tags:
        street_type = parsed_addr[found]
        # lookup the appropriate abbrev for the street type found per key.
        type_abbr = STREET_TYPE_ABBREVIATIONS.get(street_type)
        # update the street type only if a new abbreviation is found.
        if type_abbr:
            street_type = type_abbr
        if long_hand:
            # leave non-conforming street types unchanged, per the docstring,
            # rather than raising a KeyError.
            street_type = LONGHAND_STREET_TYPES.get(street_type, street_type)
        parsed_addr[found] = street_type
    return parsed_addr


def normalize_occupancy_type(parsed_addr: OrderedDict,
                             default=None) -> OrderedDict:
    """Change occupancy types to accepted abbreviated format.

    If there is an occupancy and it does not conform to one of the
    OCCUPANCY_TYPE_ABBREVIATIONS, occupancy is changed to the generic 'UNIT'.
    OCCUPANCY_TYPE_ABBREVIATIONS contains common abbreviations per
    USPS pub 28 appendix C; however, OCCUPANCY_TYPE_ABBREVIATIONS can be
    customized to allow alternate abbreviations to pass through (see README).

    :param parsed_addr: address parsed into ordereddict per usaddress.
    :type parsed_addr: Mapping
    :param default: default abbreviation to use for types that fall outside the
        standard abbreviations. Default is 'UNIT'.
    :type default: str | None
    :return: parsed_addr with occupancy types updated to abbreviated format.
    :rtype: dict
    """
    default = default if default is not None else 'UNIT'
    occupancy_type_label = 'OccupancyType'
    occupancy_type = parsed_addr.pop(occupancy_type_label, None)
    occupancy_type_abbr = (
        occupancy_type
        if occupancy_type in OCCUPANCY_TYPE_ABBREVIATIONS.values()
        else OCCUPANCY_TYPE_ABBREVIATIONS.get(occupancy_type)
    )
    occupancy_id = parsed_addr.get('OccupancyIdentifier')
    if ((occupancy_id and not occupancy_id.startswith('#'))
            and not occupancy_type_abbr):
        occupancy_type_abbr = default
    if occupancy_type_abbr:
        parsed_list = list(parsed_addr.items())
        try:
            index = parsed_list.index(('OccupancyIdentifier', occupancy_id))
        except ValueError:
            msg = (
                'Address has an occupancy type (ie: Apt, Unit, etc) '
                'but no occupancy identifier (ie: 101, A, etc)'
            )
            raise AddressNormalizationError(msg)
        parsed_list.insert(index, (occupancy_type_label, occupancy_type_abbr))
        parsed_addr = OrderedDict(parsed_list)
    return parsed_addr


def normalize_state(state: str | None) -> str | None:
    """Change state string to accepted abbreviated format.

    :param state: string containing state name or abbreviation.
    :type state: str | None
    :return: 2 char state abbreviation, or original state string if not found
        in state names or standard long abbreviations.
    :rtype: str | None
    """
    if state:
        state_abbrv = STATE_ABBREVIATIONS.get(state.upper())
        if state_abbrv:
            state = state_abbrv
    return state


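def normalize_city(city: str) -> str:
    """Replace parts of a city name found in CITY_ABBREVIATIONS.

    The city is split on whitespace and each part is looked up, with periods
    removed, in CITY_ABBREVIATIONS. Parts without a match are left unchanged.

    :param city: string containing city name.
    :type city: str
    :return: city name with any matched parts replaced.
    :rtype: str
    """
    parts = city.split()
    for i, part in enumerate(parts):
        replacement = CITY_ABBREVIATIONS.get(part.replace('.', ''))
        if replacement:
            parts[i] = replacement
    return ' '.join(parts)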


def get_normalized_line_segment(parsed_addr: OrderedDict,
                                line_labels: [str]) -> str:
    """

    :param parsed_addr: address parsed into ordereddict per usaddress.
    :param line_labels: tuple of str labels of all the potential keys related
        to the desired address segment (ie address_line_1 or address_line_2).
    :return: s/r joined values from parsed_addr corresponding to given labels.
    """
    line_elems = [
        elem for key, elem in parsed_addr.items() if key in line_labels
    ]
    line_str = ' '.join(line_elems) if line_elems else None
    return post_clean_addr_str(line_str)


def get_addr_line_str(addr_dict: dict, addr_parts: [str] = None,
                      comma_separate: bool = False) -> str:
    """Get address 'line' elements as a single string.

    Combines 'address_line_1' and 'address_line_2' elements as a single string
    to ensure no data is lost and line_2 can be processed according to a
    standard set of rules.

    :param addr_dict: dict containing keys 'address_line_1', 'address_line_2'.
    :type addr_dict: Mapping
    :param addr_parts: optional sequence of address elements
    :type addr_parts: list | tuple
    :param comma_separate: optional boolean to separate dict values by comma,
        useful for dealing with potentially ambiguous line 2 segments
    :type comma_separate: bool
    :return: string combining 'address_line_1' & 'address_line_2' values.
    :rtype: str
    """
    if not addr_parts:
        addr_parts = ['address_line_1', 'address_line_2']
    if not isinstance(addr_parts, (list, tuple)):
        raise TypeError('addr_parts must be a list or tuple')
    separator = ', ' if comma_separate else ' '
    addr_str = separator.join(
        str(addr_dict[elem]) for elem in addr_parts if addr_dict.get(elem)
    )
    return addr_str


def format_address_record(address: dict) -> str:
    # type AddressRecord -> str
    """Format AddressRecord as string."""
    address_template = Template('$address')
    address = dict(address)
    addr_parts = [
        str(address[field]) for field in ADDRESS_KEYS if address.get(field)
    ]
    return address_template.safe_substitute(address=', '.join(addr_parts))


def get_geocoder_normalized_addr(address: dict | str,
                                 addr_keys: [str] = ADDRESS_KEYS) -> dict:
    """Get geocoder normalized address parsed to dict with addr_keys.

    :param address: string or dict-like containing address data
    :param addr_keys: optional list of address keys. standard list of keys will
        be used if not supplied
    :return: dict containing geocoder address result
    """
    address_line_2 = None
    geo_addr_dict = {}
    if not isinstance(address, str):
        address_line_2 = address.get('address_line_2')
        address = get_addr_line_str(address, addr_parts=addr_keys)
    geo_resp = geocoder.google(address)
    if geo_resp.ok and geo_resp.housenumber:
        line2 = geo_resp.subpremise or address_line_2
        geo_addr_dict = {
            'address_line_1':
                ' '.join([geo_resp.housenumber, geo_resp.street]),
            'address_line_2': strip_occupancy_type(line2),
            'city': geo_resp.city,
            'state': geo_resp.state,
            'postal_code': geo_resp.postal
        }
        for key, value in geo_addr_dict.items():
            geo_addr_dict[key] = value.upper() if value else None
    return geo_addr_dict


def get_ordinal_indicator(number: int) -> str:
    """Get the ordinal indicator suffix applicable to the supplied number.

    Ordinal numbers are words representing position or rank in a sequential
    order (1st, 2nd, 3rd, etc).
    Ordinal indicators are the suffix characters (st, nd, rd, th) that, when
    applied to a numeral (int), denote that it is an ordinal number.

    :param number: int
    :type number: int
    :return: ordinal indicator appropriate to the number supplied.
    :rtype: str
    """
    str_num = str(number)
    digits = len(str_num)
    if str_num[-1] == '1' and not (digits >= 2 and str_num[-2:] == '11'):
        return 'st'
    elif str_num[-1] == '2' and not (digits >= 2 and str_num[-2:] == '12'):
        return 'nd'
    elif str_num[-1] == '3' and not (digits >= 2 and str_num[-2:] == '13'):
        return 'rd'
    else:
        return 'th'


class NormalizeAddress(object):
    """Normalize an address according to USPS pub. 28 standards.

    Instantiates with an address string, or a dict-like with standard address
    fields (address_line_1, address_line_2, city, state, postal_code), removes
    unacceptable special characters, extra spaces, predictable abnormal
    character sub-strings and phrases, abbreviates directional indicators
    and street types.  If applicable, line 2 address elements (ie: Apt, Unit)
    are separated from line 1 inputs.

    addr_map, if used, must be in the format {standard_key: custom_key} based
    on standard address keys cited above.

    Returns an address dict with all field values in uppercase format.

    :param address: str or dict-like object containing details of a single
        address.
    :param addr_map: mapping of standard address fields to custom key names
    :param addtl_funcs: optional sequence of funcs that take string for further
        processing and return line1 and line2 strings
    :type addtl_funcs: Sequence[Callable[str, (str, str)]]
    :param strict: bool indicating strict handling of the address components
        city, state and postal_code, vs city and state OR postal_code.
        Defaults to True.
    :param long_hand: bool indicating whether to use long hand versions of
        directionals and street types in the output.
    :return: address dict containing parsed and normalized address values.
    """
    parse_address_string = staticmethod(parse_address_string)
    pre_clean_addr_str = staticmethod(pre_clean_addr_str)
    post_clean_addr_str = staticmethod(post_clean_addr_str)
    format_address_record = staticmethod(format_address_record)
    normalize_address_components = staticmethod(normalize_address_components)
    get_parsed_values = staticmethod(get_parsed_values)

    def __init__(self, address, addr_map=None, addtl_funcs=None,
                 strict=None, long_hand=False):
        self.address = address
        self.addtl_funcs = addtl_funcs
        self.strict = True if strict is None else strict
        self.long_hand = long_hand
        if addr_map and not isinstance(self.address, str):
            self.address = {
                key: self.address.get(val) for key, val in addr_map.items()
            }

    @staticmethod
    def get_normalized_line_1(parsed_addr, line_labels=LINE1_USADDRESS_LABELS):
        return get_normalized_line_segment(parsed_addr, line_labels)

    @staticmethod
    def get_normalized_line_2(parsed_addr, line_labels=LINE2_USADDRESS_LABELS):
        return get_normalized_line_segment(parsed_addr, line_labels)

    def normalize(self):
        if isinstance(self.address, str):
            return self.normalize_addr_str(
                self.address, long_hand=self.long_hand
            )
        else:
            return self.normalize_addr_dict()

    def normalize_addr_str(self, addr_str,  # type: str
                           line2=None,  # type: Optional[str]
                           city=None,  # type: Optional[str]
                           state=None,  # type: Optional[str]
                           zipcode=None,  # type: Optional[str]
                           long_hand=False
                           ):  # noqa
        # get address parsed into usaddress components.
        error = None
        parsed_addr = None
        addr_str = self.pre_clean_addr_str(addr_str, normalize_state(state))
        try:
            parsed_addr = self.parse_address_string(addr_str)
        except (usaddress.RepeatedLabelError, AmbiguousAddressError) as err:
            error = err
            if not line2 and self.addtl_funcs:
                for func in self.addtl_funcs:
                    try:
                        line1, line2 = func(addr_str)
                        error = False
                        # send refactored line_1 and line_2 back through
                        # processing
                        return self.normalize_addr_str(
                            line1, line2=line2,
                            city=city, state=state, zipcode=zipcode,
                            long_hand=long_hand
                        )
                    except ValueError:
                        # try a different additional processing function
                        pass

        if parsed_addr and not parsed_addr.get('StreetName'):
            addr_dict = dict(
                address_line_1=addr_str, address_line_2=line2, city=city,
                state=state, postal_code=zipcode
            )
            full_addr = self.format_address_record(addr_dict)
            try:
                parsed_addr = self.parse_address_string(full_addr)
            except (usaddress.RepeatedLabelError,
                    AmbiguousAddressError) as err:
                parsed_addr = None
                error = err

        if parsed_addr:
            parsed_addr = self.normalize_address_components(
                parsed_addr, long_hand=long_hand
            )
            zipcode = self.get_parsed_values(
                parsed_addr, zipcode, 'ZipCode', addr_str
            )
            city = self.normalize_city(parsed_addr, addr_str, city)
            state = self.get_parsed_values(
                parsed_addr, state, 'StateName', addr_str
            )
            state = normalize_state(state)

            # assumes if line2 is passed in that it need not be parsed from
            # addr_str. Primarily used to allow advanced processing of
            # otherwise unparsable addresses.
            line2 = line2 if line2 else self.get_normalized_line_2(parsed_addr)
            line2 = self.post_clean_addr_str(line2)
            # line 1 is fully post cleaned in get_normalized_line_segment.
            line1 = self.get_normalized_line_1(parsed_addr)
            validate_parens_groups_parsed(line1)
        else:
            # line1 is set to addr_str so complete dict can be passed to error.
            line1 = addr_str

        addr_rec = OrderedDict(
            address_line_1=line1, address_line_2=line2, city=city,
            state=state, postal_code=zipcode
        )
        if error:
            raise UnParseableAddressError(None, None, addr_rec)
        else:
            return addr_rec

    def normalize_addr_dict(self):
        addr_dict = validate_address_components(
            self.address, strict=self.strict
        )

        # line 1 and line 2 elements are combined to ensure consistent
        # processing whether the line 2 elements are pre-parsed or
        # included in line 1
        addr_str = get_addr_line_str(addr_dict, comma_separate=True)
        postal_code = addr_dict.get('postal_code')
        zipcode = validate_us_postal_code_format(
            postal_code, addr_dict
        ) if postal_code else None
        city = addr_dict.get('city')
        state = addr_dict.get('state')
        try:
            address = self.normalize_addr_str(
                addr_str, city=city, state=state,
                zipcode=zipcode, long_hand=self.long_hand
            )
        except AddressNormalizationError:
            addr_str = get_addr_line_str(
                addr_dict, comma_separate=True, addr_parts=ADDRESS_KEYS
            )
            address = self.normalize_addr_str(
                addr_str, city=city, state=state,
                zipcode=zipcode, long_hand=self.long_hand
            )
        return address

    def normalize_city(self, parsed_addr, addr_str, city=None):
        return self.get_parsed_values(
            parsed_addr, city, 'PlaceName', addr_str
        )


================================================
FILE: scourgify/tests/__init__.py
================================================


================================================
FILE: scourgify/tests/config/__init__.py
================================================


================================================
FILE: scourgify/tests/config/address_constants.yaml
================================================

KNOWN_ODDITIES:
    'developed by HOST': ''
    ', UN ': ' UNIT '

OCCUPANCY_TYPE_ABBREVIATIONS:
    'UN': 'UNIT'

================================================
FILE: scourgify/tests/test_address_normalization.py
================================================
#!/usr/bin/env python
# encoding: utf-8
"""
copyright (c) 2016-2019 Earth Advantage.
All rights reserved

Unit tests for scourgify.
"""

# Imports from Standard Library
from collections import OrderedDict
from unittest import TestCase, mock

# Imports from Third Party Modules
from yamlconf import ConfigError

# Local Imports
from scourgify import address_constants
from scourgify.cleaning import (
    clean_ambiguous_street_types,
    clean_period_char,
    post_clean_addr_str,
    pre_clean_addr_str,
)
from scourgify.exceptions import (
    AddressNormalizationError,
    AddressValidationError,
    AmbiguousAddressError,
    IncompleteAddressError,
    UnParseableAddressError,
)
from scourgify.normalize import (
    get_addr_line_str,
    get_geocoder_normalized_addr,
    get_normalized_line_segment,
    get_ordinal_indicator,
    get_parsed_values,
    normalize_addr_dict,
    normalize_addr_str,
    normalize_address_record,
    normalize_directionals,
    normalize_numbered_streets,
    normalize_occupancy_type,
    normalize_state,
    normalize_street_types,
    parse_address_string,
    NormalizeAddress
)
from scourgify.validations import (
    validate_address_components,
    validate_parens_groups_parsed,
    validate_us_postal_code_format,
)

# Constants
SERVICE = 'GBR Test Normalization'
# Helper Functions & Classes


# Tests
class TestAddressNormalization(TestCase):
    """Unit tests for scourgify"""
    # pylint:disable=too-many-arguments

    def setUp(self):
        """setUp"""
        self.expected = dict(
            address_line_1='123 NOWHERE ST',
            address_line_2='STE 0',
            city='BORING',
            state='OR',
            postal_code='97009'
        )
        self.address_dict = dict(
            address_line_1='123 Nowhere St',
            address_line_2='Suite 0',
            city='Boring',
            state='OR',
            postal_code='97009'
        )

        self.ordinal_addr = dict(
            address_line_1='4333 NE 113th',
            city='Boring',
            state='OR',
            postal_code='97009'
        )
        self.ordinal_expected = dict(
            address_line_1='4333 NE 113TH',
            address_line_2=None,
            city='BORING',
            state='OR',
            postal_code='97009'
        )
        self.parseable_addr_str = '123 Nowhere Street Suite 0 Boring OR 97009'
        self.parsed_addr = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetName', 'NOWHERE'),
            ('StreetNamePostType', 'STREET'),
            ('OccupancyType', 'SUITE'),
            ('OccupancyIdentifier', '0'),
            ('PlaceName', 'BORING'),
            ('StateName', 'OR'),
            ('ZipCode', '97009')
        ])
        self.hash_tag = '999 Nowhere Street # 12 Boring OR 97009'
        self.hash_expected = dict(
            address_line_1='999 NOWHERE ST',
            address_line_2='# 12',
            city='BORING',
            state='OR',
            postal_code='97009'
        )
        self.unparesable_addr_str = '6000 SW 1000TH AVE  (BLDG  A5 RIGHT)'

        self.direction_expected = dict(
            address_line_1='123 SW NOWHERE ST',
            address_line_2='STE 0',
            city='BORING',
            state='OR',
            postal_code='97009'
        )
        self.long_hand_expected = dict(
            address_line_1='123 SOUTHWEST NOWHERE STREET',
            address_line_2='STE 0',
            city='BORING',
            state='OR',
            postal_code='97009'
        )
        self.abnormal_direction = dict(
            address_line_1='123 South-West Nowhere St',
            address_line_2='Suite 0',
            city='Boring',
            state='OR',
            postal_code='97009'
        )

    def test_normalize_address_record(self):
        """Test normalize_address_record function."""
        result = normalize_address_record(self.parseable_addr_str)
        self.assertDictEqual(self.expected, result)

        result = normalize_address_record(self.address_dict)
        self.assertDictEqual(self.expected, result)

        result = normalize_address_record(self.ordinal_addr)
        self.assertDictEqual(self.ordinal_expected, result)

        result = normalize_address_record(self.hash_tag)
        self.assertDictEqual(self.hash_expected, result)

        result = normalize_address_record(self.abnormal_direction)
        self.assertDictEqual(self.direction_expected, result)

        result = normalize_address_record(
            self.abnormal_direction, long_hand=True
        )
        self.assertDictEqual(self.long_hand_expected, result)

    def test_normalize_class(self):
        """Test normalize_address_record function."""
        result = NormalizeAddress(self.parseable_addr_str).normalize()
        self.assertDictEqual(self.expected, result)

        result = NormalizeAddress(self.address_dict).normalize()
        self.assertDictEqual(self.expected, result)

        result = NormalizeAddress(self.ordinal_addr).normalize()
        self.assertDictEqual(self.ordinal_expected, result)

        result = NormalizeAddress(self.hash_tag).normalize()
        self.assertDictEqual(self.hash_expected, result)

        result = NormalizeAddress(self.abnormal_direction).normalize()
        self.assertDictEqual(self.direction_expected, result)

        result = NormalizeAddress(
            self.abnormal_direction, long_hand=True
        ).normalize()
        self.assertDictEqual(self.long_hand_expected, result)

    def test_normalize_addr_str(self):
        """Test normalize_addr_str function."""
        result = normalize_addr_str(self.parseable_addr_str)
        self.assertDictEqual(self.expected, result)

        broken_line1 = '6000 SW 1000TH AVE '
        broken_line2 = '(BLDG  A1 RIGHT)'
        result = normalize_addr_str(
            broken_line1, line2=broken_line2,
            city='Portland', state='OR', zipcode='97203'
        )
        expected = {
            'address_line_1': '6000 SW 1000TH AVE',
            'address_line_2': 'BLDG A1 RIGHT',
            'state': 'OR', 'city': 'PORTLAND',
            'postal_code': '97203'
        }
        self.assertDictEqual(expected, result)

        def addtl_test_func(addr_str):
            if 'BLDG A1' in addr_str:
                return '123 NOWHERE STREET', 'BLDG A1 RIGHT'
            else:
                raise ValueError

        addtl_processing = '123 Nowhere Street (BLDG A1 RIGHT)'
        expected = {
            'address_line_1': '123 NOWHERE ST',
            'address_line_2': 'BLDG A1 RIGHT',
            'state': 'OR', 'city': 'PORTLAND',
            'postal_code': '97203'
        }
        result = normalize_addr_str(
            addtl_processing, city='Portland', state='OR', zipcode='97203',
            addtl_funcs=[addtl_test_func]
        )
        self.assertDictEqual(expected, result)

        self.assertRaises(
            UnParseableAddressError,
            normalize_addr_str,
            self.unparesable_addr_str,
            city='Portland', state='OR', zipcode='97203',
            addtl_funcs=[addtl_test_func]
        )

    def test_normalize_addr_dict(self):
        """Test normalize_addr_dict function."""
        result = normalize_addr_dict(self.address_dict)
        self.assertDictEqual(self.expected, result)

        alternate_dict = dict(
            address1='123 Nowhere St',
            address2='Suite 0',
            city='Boring',
            state='OR',
            zip='97009'
        )
        dict_map = {
            'address_line_1': 'address1',
            'address_line_2': 'address2',
            'city': 'city',
            'state': 'state',
            'postal_code': 'zip'
        }
        result = normalize_addr_dict(alternate_dict, addr_map=dict_map)
        self.assertDictEqual(self.expected, result)

    def test_parse_address_string(self):
        """Test parse_address_string function."""
        result = parse_address_string(self.parseable_addr_str)
        self.assertIsInstance(result, OrderedDict)

        ambig_addr_str = 'AWBREY VILLAGE'
        with self.assertRaises(AmbiguousAddressError):
            parse_address_string(ambig_addr_str)

    def test_normalize_occupancies(self):
        """Test normalize_addr_dict function with handling for occupancy
        type oddities.  This is based on a real life incident; the original
        behavior to allow non-standard unit types to pass through resulted
        in an address validation service also allowing the address to pass
        through even though no unit should have existed on the home.
        """
        dict_map = {
            'address_line_1': 'address1',
            'address_line_2': 'address2',
            'city': 'city',
            'state': 'state',
            'postal_code': 'zip'
        }

        weird_unit = dict(
            address1='123 Nowhere St',
            address2='Ave 345',
            city='Boring',
            state='OR',
            zip='97009'
        )
        expected = dict(
            address_line_1='123 NOWHERE ST',
            address_line_2='UNIT 345',
            city='BORING',
            state='OR',
            postal_code='97009'
        )
        result = normalize_addr_dict(weird_unit, addr_map=dict_map)
        self.assertDictEqual(expected, result)

        late_unit_add = dict(
            address1='123 Nowhere St',
            address2='345',
            city='Boring',
            state='OR',
            zip='97009'
        )
        result = normalize_addr_dict(late_unit_add, addr_map=dict_map)
        self.assertDictEqual(expected, result)

        expected = dict(
            address_line_1='123 NOWHERE ST',
            address_line_2='# 345',
            city='BORING',
            state='OR',
            postal_code='97009'
        )

        hashtag_unit = dict(
            address1='123 Nowhere St',
            address2='# 345',
            city='Boring',
            state='OR',
            zip='97009'
        )
        result = normalize_addr_dict(hashtag_unit, addr_map=dict_map)
        self.assertDictEqual(expected, result)

        hashtag_unit = dict(
            address1='123 Nowhere St',
            address2='#345',
            city='Boring',
            state='OR',
            zip='97009'
        )
        result = normalize_addr_dict(hashtag_unit, addr_map=dict_map)
        self.assertDictEqual(expected, result)

        expected = dict(
            address_line_1='123 NOWHERE ST',
            address_line_2='APT 345',
            city='BORING',
            state='OR',
            postal_code='97009'
        )

        abbreviation = dict(
            address1='123 Nowhere St',
            address2='Apt 345',
            city='Boring',
            state='OR',
            zip='97009'
        )
        result = normalize_addr_dict(abbreviation, addr_map=dict_map)
        self.assertDictEqual(expected, result)

        full_name = dict(
            address1='123 Nowhere St',
            address2='Apartment 345',
            city='Boring',
            state='OR',
            zip='97009'
        )
        result = normalize_addr_dict(full_name, addr_map=dict_map)
        self.assertDictEqual(expected, result)


class TestAddressNormalizationUtils(TestCase):
    """Unit tests for scourgify utils"""

    def setUp(self):

        self.address_dict = dict(
            address_line_1='123 Nowhere St',
            address_line_2='Suite 0',
            city='Boring',
            state='OR',
            postal_code='97009'
        )
        self.parseable_addr = '123 Nowhere Street Suite 0 Boring OR 97009'
        self.parsed_addr = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetName', 'NOWHERE'),
            ('StreetNamePostType', 'STREET'),
            ('OccupancyType', 'SUITE'),
            ('OccupancyIdentifier', '0'),
            ('PlaceName', 'BORING'),
            ('StateName', 'OR'),
            ('ZipCode', '97009')
        ])

        self.unparesable_addr = '6000 SW 1000TH AVE  (BLDG  A1 RIGHT)'

        self.unparesable_addr_dict = OrderedDict([
            ('AddressNumber', '6000'),
            ('StreetNamePreDirectional', 'SW'),
            ('StreetName', '1000TH'),
            ('StreetNamePostType', 'AVE'),
            ('SubaddressType', 'BLDG'),
            ('SubaddressIdentifier', 'A1'),
            ('SubaddressType', 'RIGHT')
        ])

    def test_get_parsed_values(self):
        """Test get_parsed_values function."""
        expected = 'BORING'
        result = get_parsed_values(self.parsed_addr, 'Boring',
                                   'PlaceName', self.parseable_addr)
        self.assertEqual(expected, result)

        expected = 'ONE VALUE PRESENT'
        result = get_parsed_values(self.parsed_addr, 'One Value Present',
                                   'LandmarkName', self.parseable_addr)
        self.assertEqual(expected, result)

        result = get_parsed_values(self.parsed_addr, None,
                                   'LandmarkName', self.parseable_addr)
        self.assertIsNone(result)

        with self.assertRaises(AmbiguousAddressError):
            get_parsed_values(self.parsed_addr, 'UnMatched City',
                              'PlaceName', self.parseable_addr)

    def test_get_norm_line_segment(self):
        """Test get normalized_line_segment function."""
        result = get_normalized_line_segment(self.parsed_addr,
                                             ['StreetName', 'AddressNumber'])
        expected = '{} {}'.format(self.parsed_addr['AddressNumber'],
                                  self.parsed_addr['StreetName'])
        self.assertEqual(expected, result)

        result = get_normalized_line_segment(
            self.parsed_addr,
            ['StreetName', 'StreetNamePostType', 'IntersectionSeparator']
        )
        expected = '{} {}'.format(self.parsed_addr['StreetName'],
                                  self.parsed_addr['StreetNamePostType'])
        self.assertEqual(expected, result)

    def test_normalize_numbered_streets(self):
        """Test normalize_numbered_streets function."""
        numbered_addr = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetName', '100'),
            ('StreetNamePostType', 'STREET')
        ])
        county_road = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetNamePreType', 'COUNTY ROAD'),
            ('StreetName', '100')
        ])
        string_addr = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetName', '91st'),
            ('StreetNamePostType', 'STREET')
        ])

        expected = '{}{}'.format(
            numbered_addr['StreetName'], 'th'
        )
        result = normalize_numbered_streets(numbered_addr)
        self.assertEqual(expected, result['StreetName'])

        result = normalize_numbered_streets(county_road)
        self.assertDictEqual(county_road, result)

        result = normalize_numbered_streets(string_addr)
        self.assertDictEqual(string_addr, result)

    def test_normalize_directionals(self):
        """Test normalize_directionals function."""
        unabbr_directional = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetNamePreDirectional', 'South West', ),
            ('StreetName', '100'),
            ('StreetNamePostType', 'STREET')
        ])
        abbrev_directional = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetNamePreDirectional', 'SW'),
            ('StreetNamePreType', 'COUNTY ROAD'),
            ('StreetName', '100')
        ])
        no_directional = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetName', '91st'),
            ('StreetNamePostType', 'STREET')
        ])

        expected = 'SW'
        result = normalize_directionals(unabbr_directional)
        self.assertEqual(expected, result['StreetNamePreDirectional'])

        result = normalize_directionals(abbrev_directional)
        self.assertDictEqual(abbrev_directional, result)

        result = normalize_directionals(no_directional)
        self.assertDictEqual(no_directional, result)

        expected = 'SOUTHWEST'
        result = normalize_directionals(abbrev_directional, long_hand=True)
        self.assertEqual(expected, result['StreetNamePreDirectional'])

    def test_normalize_street_types(self):
        """Test normalize_street_types function."""
        unabbr_type = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetNamePreDirectional', 'SW', ),
            ('StreetName', 'MAIN'),
            ('StreetNamePostType', 'STREET')
        ])
        abbrev_type = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetNamePreDirectional', 'SW', ),
            ('StreetName', 'MAIN'),
            ('StreetNamePostType', 'AVE')
        ])
        typo_type = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetNamePreDirectional', 'SW', ),
            ('StreetName', 'MAIN'),
            ('StreetNamePostType', 'STROET')
        ])
        no_type = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetNamePreDirectional', 'SW', ),
            ('StreetName', 'MAIN'),
        ])

        expected = 'ST'
        result = normalize_street_types(unabbr_type)
        self.assertEqual(expected, result['StreetNamePostType'])

        result = normalize_street_types(abbrev_type)
        self.assertDictEqual(abbrev_type, result)

        result = normalize_street_types(typo_type)
        self.assertDictEqual(typo_type, result)

        result = normalize_street_types(no_type)
        self.assertDictEqual(no_type, result)

        expected = 'AVENUE'
        result = normalize_street_types(abbrev_type, long_hand=True)
        self.assertEqual(expected, result['StreetNamePostType'])

    def test_normalize_occupancy_type(self):
        """Test normalize_occupancy_type function."""
        expected = 'STE'
        result = normalize_occupancy_type(self.parsed_addr)
        self.assertEqual(expected, result['OccupancyType'])

    def test_normalize_state(self):
        """Test normalize_state function"""
        state = 'ore'
        expected = 'OR'
        result = normalize_state(state)
        self.assertEqual(expected, result)

        state = 'oregano'
        expected = state
        result = normalize_state(state)
        self.assertEqual(expected, result)

        self.assertIsNone(normalize_state(None))

    def test_pre_clean_addr_str(self):
        """Test pre_clean_addr_str function"""
        odd_addr = '123 Nowhere    Street, Suite 0; @Boring OR 97009'
        # we're leaving commas in the pre-clean until norm can be revisited
        expected = '123 Nowhere Street, Suite 0 Boring OR 97009'.upper()
        # expected = '123 Nowhere Street Suite 0 Boring OR 97009'.upper()
        result = pre_clean_addr_str(odd_addr)
        self.assertEqual(expected, result)

    def test_post_clean_addr_str(self):
        """Test post_clean_addr_str function."""
        addr_str = '(100-104) SW NO   WHERE st'
        expected = '100-104 SW NO WHERE ST'
        result = post_clean_addr_str(addr_str)
        self.assertEqual(expected, result)

        self.assertIsNone(post_clean_addr_str(None))
        self.assertEqual('', post_clean_addr_str(''))

    def test_validate_address(self):
        """Test validate_address_components function."""
        expected = self.address_dict
        result = validate_address_components(self.address_dict)
        self.assertEqual(expected, result)

        minus_line1 = dict(
            address_line_1=None,
            address_line_2='Suite 0',
            city='Boring',
            state='OR',
            postal_code='97009'
        )
        with self.assertRaises(IncompleteAddressError):
            validate_address_components(minus_line1)

        minus_zip = dict(
            address_line_1='123 NoWhere St',
            address_line_2='Suite 0',
            city='Boring',
            state='OR',
            postal_code=None
        )
        with self.assertRaises(IncompleteAddressError):
            validate_address_components(minus_zip)

        minus_city_state = dict(
            address_line_1='123 NoWhere St',
            address_line_2='Suite 0',
            city=None,
            state=None,
            postal_code='97009'
        )

        with self.assertRaises(IncompleteAddressError):
            validate_address_components(minus_city_state)

        minus_city_state_zip = dict(
            address_line_1='123 NoWhere St',
            address_line_2='Suite 0',
            city=None,
            state=None,
            postal_code=None
        )
        with self.assertRaises(IncompleteAddressError):
            validate_address_components(minus_city_state_zip)

    def test_validate_postal_code(self):
        """Test validate_us_postal_code_format"""

        with self.assertRaises(AddressValidationError):
            zip_five = 'AAAAA'
            validate_us_postal_code_format(zip_five, self.address_dict)

        with self.assertRaises(AddressValidationError):
            zip_five = '97219-AAAA'
            validate_us_postal_code_format(zip_five, self.address_dict)

        with self.assertRaises(AddressValidationError):
            zip_plus = '97219-000100'
            validate_us_postal_code_format(zip_plus, self.address_dict)

        with self.assertRaises(AddressValidationError):
            zip_plus = '97219-0001-00'
            validate_us_postal_code_format(zip_plus, self.address_dict)

        with self.assertRaises(AddressValidationError):
            zip_five = '9721900'
            validate_us_postal_code_format(zip_five, self.address_dict)

        zip_five = '972'
        expected = '00972'
        result = validate_us_postal_code_format(zip_five, self.address_dict)
        self.assertEqual(expected, result)

        zip_plus = '97219-00'
        expected = '97219-0000'
        result = validate_us_postal_code_format(zip_plus, self.address_dict)
        self.assertEqual(expected, result)

        zip_plus = '972-0001'
        expected = '00972-0001'
        result = validate_us_postal_code_format(zip_plus, self.address_dict)
        self.assertEqual(expected, result)

        zip_plus = '972190001'
        expected = '97219-0001'
        result = validate_us_postal_code_format(zip_plus, self.address_dict)
        self.assertEqual(expected, result)

        expected = '97219'
        result = validate_us_postal_code_format(expected, self.address_dict)
        self.assertEqual(expected, result)

    def test_get_addr_line_str(self):
        """Test get_addr_line_str function."""
        expected = '{} {}'.format(
            self.address_dict['address_line_1'],
            self.address_dict['address_line_2']
        )
        result = get_addr_line_str(self.address_dict)
        self.assertEqual(expected, result)

        no_line_2 = {
            'address_line_1': 'address line 1'
        }
        expected = no_line_2['address_line_1']
        result = get_addr_line_str(no_line_2)
        self.assertEqual(expected, result)

        empty_line_2 = {
            'address_line_1': 'address line 1',
            'address_line_2': None
        }
        expected = empty_line_2['address_line_1']
        result = get_addr_line_str(empty_line_2)
        self.assertEqual(expected, result)

        with self.assertRaises(TypeError):
            get_addr_line_str(self.address_dict, addr_parts='line1')

    @mock.patch(
        'scourgify.normalize.geocoder'
    )
    def test_get_geocoder_normalized_addr(self, mock_geocoder):
        """Test get_geocoder_normalized_addr"""
        geo_addr = mock.MagicMock()
        geo_addr.ok = True
        geo_addr.housenumber = '1234'
        geo_addr.street = "Main"
        geo_addr.subpremise = ''
        geo_addr.city = 'Boring'
        geo_addr.state = 'OR'
        geo_addr.postal = '97000'

        mock_geocoder.google.return_value = geo_addr

        address = {
            'address_line_1': '1234 Main',
            'address_line_2': None,
            'city': 'Boring',
            'state': 'OR',
            'postal_code': '97000'
        }
        addr_str_return_value = "1234 Main Boring OR 97000"
        get_geocoder_normalized_addr(address)
        mock_geocoder.google.assert_called_with(addr_str_return_value)

    def test_get_ordinal_indicator(self):
        """Test get_ordinal_indicator"""
        result = get_ordinal_indicator(11)
        expected = 'th'
        self.assertEqual(expected, result)

        result = get_ordinal_indicator(112)
        self.assertEqual(expected, result)

        result = get_ordinal_indicator(3113)
        self.assertEqual(expected, result)

        result = get_ordinal_indicator(1)
        expected = 'st'
        self.assertEqual(expected, result)

        result = get_ordinal_indicator(22)
        expected = 'nd'
        self.assertEqual(expected, result)

        result = get_ordinal_indicator(31243)
        expected = 'rd'
        self.assertEqual(expected, result)

    def test_clean_period_char(self):
        """Test clean_period_char"""
        text = "49.5 blah.blah."
        expected = "49.5 blahblah"
        result = clean_period_char(text)
        self.assertEqual(expected, result)

    def test_validate_parens_group_parsed(self):
        """Test validate_parens_groups_parsed"""
        broken_line1 = '6000 SW 1000TH AVE'
        result = validate_parens_groups_parsed(broken_line1)
        self.assertEqual(broken_line1, result)

        bad_addr = '10000 NE 8TH (ROW HOUSE)'
        with self.assertRaises(AmbiguousAddressError):
            validate_parens_groups_parsed(bad_addr)

    def test_clean_ambiguous_street_types(self):
        """Test clean_ambiguous_street_types"""
        problem_addr = "1234 BROKEN CT"
        expected = "1234 BROKEN COURT"
        result = clean_ambiguous_street_types(problem_addr)
        self.assertEqual(expected, result)

        normal_addr = "1234 NORMAL ST"
        result = clean_ambiguous_street_types(normal_addr)
        self.assertEqual(normal_addr, result)

    def test_address_normalization_error(self):
        error_msg = 'Error Message'
        error_title = 'ERROR TITLE'
        addtl_args = 'Addition info'
        expected = "{}: {}, {}".format(error_title, error_msg, addtl_args)
        error = AddressNormalizationError(error_msg, error_title, addtl_args)
        self.assertEqual(expected, str(error))

    @mock.patch.object(address_constants.NormalizationConfig, 'get')
    def test_set_constants(self, mock_config_get):
        new_addr_keys = ['new keys']
        new_problem_st = {
            "PS": 'STREET'
        }
        mock_config_get.side_effect = (
            'update', new_addr_keys,
            None, None, None, None, None,
            new_problem_st, None, None,
        )
        address_constants.set_address_constants()
        self.assertEqual(address_constants.ADDRESS_KEYS, new_addr_keys)
        self.assertIn("PS", address_constants.PROBLEM_ST_TYPE_ABBRVS.keys())

        mock_config_get.side_effect = (
            'replace', new_addr_keys,
            None, None, None, None, None,
            new_problem_st, None, None,
        )
        address_constants.set_address_constants()
        self.assertEqual(address_constants.ADDRESS_KEYS, new_addr_keys)
        self.assertDictEqual(
            new_problem_st, address_constants.PROBLEM_ST_TYPE_ABBRVS
        )

        mock_config_get.side_effect = (
            'invalid', new_addr_keys,
            None, None, None, None, None,
            new_problem_st, None, None,
        )
        self.assertRaises(
            ConfigError, address_constants.set_address_constants
        )

    def test_handle_abnormal_occupancy(self):
        addr_str = '123 SW MAIN UN'
        expected = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetNamePreDirectional', 'SW'),
            ('StreetName', 'MAIN'),
            ('StreetNamePostType', 'UN'),
        ])
        result = parse_address_string(addr_str)
        self.assertEqual(expected, result)

        addr_str = '123 SW MAIN UN A'
        expected = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetNamePreDirectional', 'SW'),
            ('StreetName', 'MAIN'),
            ('OccupancyType', 'UN'),
            ('OccupancyIdentifier', 'A')
        ])
        result = parse_address_string(addr_str)
        self.assertEqual(expected, result)

        addr_str = '123 SW MAIN UN, UN A'
        expected = OrderedDict([
            ('AddressNumber', '123'),
            ('StreetNamePreDirectional', 'SW'),
            ('StreetName', 'MAIN'),
            ('StreetNamePostType', 'UN'),
            ('OccupancyType', 'UN'),
            ('OccupancyIdentifier', 'A')
        ])
        result = parse_address_string(addr_str)
        self.assertEqual(expected, result)


================================================
FILE: scourgify/tests/test_cleaning.py
================================================
#!/usr/bin/env python
# encoding: utf-8
"""
copyright (c) 2016-2019 Earth Advantage.
All rights reserved
"""

# Imports from Standard Library
from unittest import TestCase

# Local Imports
from scourgify.cleaning import strip_occupancy_type


class CleaningTests(TestCase):

    def test_strip_occupancy_type(self):
        expected = '33'

        line2 = 'Unit 33'
        result = strip_occupancy_type(line2)
        self.assertEqual(result, expected)

        line2 = 'Apartment 33'
        result = strip_occupancy_type(line2)
        self.assertEqual(result, expected)

        line2 = 'Unit #33'
        result = strip_occupancy_type(line2)
        self.assertEqual(result, expected)

        line2 = 'Building 3 Unit 33'
        result = strip_occupancy_type(line2)
        self.assertEqual(result, expected)

        line2 = 'Building 3 UN 33'
        result = strip_occupancy_type(line2)
        self.assertEqual(result, expected)

        line2 = '33'
        result = strip_occupancy_type(line2)
        self.assertEqual(result, expected)


================================================
FILE: scourgify/validations.py
================================================
#!/usr/bin/env python
# encoding: utf-8
"""
copyright (c) 2016-2017 Earth Advantage.
All rights reserved
..codeauthor::Fable Turas <fable@rainsoftware.tech>

Validation functions for address components, US postal codes, and
parenthesis groups left unparsed in address line 1.
"""
# Imports from Standard Library
import re
from typing import Mapping, Union

# Local Imports
# Public Classes and Functions
from scourgify.cleaning import post_clean_addr_str
from scourgify.exceptions import (
    AddressValidationError,
    AmbiguousAddressError,
    IncompleteAddressError,
)

# Setup

# Constants

# Data Structure Definitions

# Private Functions


def _get_substrings_with_regex(string, pattern):
    # type: (str, str) -> list
    """Get all substrings matching a regex pattern.

    :param string: string to search for substrings
    :type string: str
    :param pattern: regex pattern
    :type pattern: str
    :return: list of non-overlapping matches (the captured group when the
        pattern defines one); empty list if nothing matches
    :rtype: list
    """
    # pattern is required: re.compile(None) raises TypeError.
    return re.findall(pattern, string)


# Public Functions
def validate_address_components(address_dict, strict=True):
    # type: (Mapping[str, str], bool) -> Mapping[str, str]
    """Validate non-null values for minimally viable address elements.

    All addresses must have an address_line_1. When strict is True they
    must also have a city, state, and postal_code; otherwise a postal_code
    or a city and state is sufficient.

    :param address_dict: dict containing address components having keys
        'address_line_1', 'postal_code', 'city', 'state'
    :type address_dict: Mapping
    :param strict: if True, require city, state, and postal_code; if False,
        require either postal_code or both city and state
    :type strict: bool
    :raises IncompleteAddressError: if a required component is missing
    :return: address_dict if no errors are raised.
    :rtype: Mapping
    """
    locality = (
        address_dict.get('postal_code') and
        address_dict.get('city') and address_dict.get('state')
        if strict else
        address_dict.get('postal_code') or
        (address_dict.get('city') and address_dict.get('state'))
    )
    if not address_dict.get('address_line_1'):
        msg = 'Address records must include Line 1 data.'
        raise IncompleteAddressError(msg)
    elif not locality:
        msg = (
            'Address records must contain a city, state, and postal_code.'
            if strict else
            'Address records must contain a city and state, or a postal_code'
        )
        raise IncompleteAddressError(msg)
    return address_dict


def validate_us_postal_code_format(postal_code, address):
    # type: (str, Union[str, Mapping]) -> str
    """Validate postal code conforms to US five-digit Zip or Zip+4 standards.

    :param postal_code: string containing US postal code data.
    :type postal_code: str
    :param address: dict or string containing original address.
    :type address: dict | str
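    :return: postal code zero-padded to five-digit Zip or Zip+4 format
    :rtype: str
    :raises AddressValidationError: if postal_code cannot be formatted as a
        five-digit Zip or Zip+4 code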
    """
    error = None
    msg = (
        'US Postal Codes must conform to five-digit Zip or Zip+4 standards.'
    )
    postal_code = post_clean_addr_str(postal_code)
    plus_four_code = postal_code.split('-')
    for code in plus_four_code:
        try:
            int(code)
        except ValueError:
            error = True
    if not error:
        if '-' in postal_code:
            if len(postal_code.replace('-', '')) > 9:
                error = True
            elif len(plus_four_code) != 2:
                error = True
            else:
                postal_code = '-'.join([
                    plus_four_code[0].zfill(5), plus_four_code[1].zfill(4)
                ])
        elif len(postal_code) == 9:
            postal_code = '-'.join([postal_code[:5], postal_code[5:]])
        elif len(postal_code) > 5:
            error = True
        else:
            postal_code = postal_code.zfill(5)

    if error:
        raise AddressValidationError(msg, None, address)
    else:
        return postal_code


def validate_parens_groups_parsed(line1):
    # type: (str) -> str
    """Validate any parenthesis segments have been successfully parsed.

    Assumes any parenthesis segments in original address string are either
    line 2 or ambiguous address elements.  If any parenthesis segment remains
    in line1 after all other address processing has been applied,
    AmbiguousAddressError is raised.

    :param line1: processed line1 address string portion
    :type line1: str
    :return: line1 address string
    :rtype: str
    """
    parenthesis_groups = _get_substrings_with_regex(line1, r'\((.+?)\)')
    if parenthesis_groups:
        raise AmbiguousAddressError(None, None, line1)
    else:
        return line1


================================================
FILE: setup.cfg
================================================
[metadata]
name=usaddress-scourgify
version=0.6.0
description=Clean US addresses following USPS pub 28 and RESO guidelines
author=Fable Turas
author_email=fable@rainsoftware.tech
maintainer=GreenBuildingRegistry
maintainer_email=admin@greenbuildingregistry.com
keywords= usaddress, normalization, address
url=https://github.com/GreenBuildingRegistry/usaddress-scourgify
classifiers =
	Development Status :: 5 - Production/Stable
	Intended Audience :: Developers
	Operating System :: OS Independent
	Programming Language :: Python :: 3.5
	Programming Language :: Python :: 3.6
	Programming Language :: Python :: 3.7
	Programming Language :: Python :: 3.8
[options]
python_requires = >=3.5
packages = find:
include_package_data = True
zip_safe = False
install_requires =
	geocoder>=1.22.6
	usaddress>=0.5.9
	yaml-config>=0.1.2
[bdist_wheel]
python-tag = py3


================================================
FILE: setup.py
================================================
#!/usr/bin/env python

from setuptools import setup


setup()


================================================
FILE: tox.ini
================================================
[tox]
envlist = py35,py36,py37,py38

[testenv]
setenv =
    ADDRESS_CONFIG_DIR = {toxinidir}/scourgify/tests/config
deps=
    -rrequirements/dev.txt
    pytest
    pytest-cov
    pytest-xdist
    testfixtures>=5.1.1

commands =
    pytest --cov=. --cov-report= --cov-append -s
    flake8 scourgify

[flake8]
exclude=__init__.py
SYMBOL INDEX (80 symbols across 7 files)

FILE: scourgify/address_constants.py
  class NormalizationConfig (line 902) | class NormalizationConfig(Config):
    method __init__ (line 907) | def __init__(self, config_file=None, config_dir=None, section=None):
  function set_address_constants (line 914) | def set_address_constants():

FILE: scourgify/cleaning.py
  function pre_clean_addr_str (line 51) | def pre_clean_addr_str(addr_str, state=None):
  function clean_ambiguous_street_types (line 99) | def clean_ambiguous_street_types(addr_str):
  function post_clean_addr_str (line 123) | def post_clean_addr_str(addr_str):
  function _parse_occupancy (line 141) | def _parse_occupancy(addr_line_2):
  function strip_occupancy_type (line 155) | def strip_occupancy_type(addr_line_2):
  function clean_upper (line 194) | def clean_upper(text,                           # type: Any
  function clean_period_char (line 244) | def clean_period_char(text):
  function pre_clean_directionals (line 256) | def pre_clean_directionals(text):

FILE: scourgify/exceptions.py
  class AddressNormalizationError (line 17) | class AddressNormalizationError(Exception):
    method __init__ (line 22) | def __init__(self, error=None, title=None, *args):
    method __str__ (line 28) | def __str__(self):
  class AmbiguousAddressError (line 37) | class AmbiguousAddressError(AddressNormalizationError):
  class UnParseableAddressError (line 43) | class UnParseableAddressError(AddressNormalizationError):
  class IncompleteAddressError (line 49) | class IncompleteAddressError(AddressNormalizationError):
  class AddressValidationError (line 55) | class AddressValidationError(AddressNormalizationError):

FILE: scourgify/normalize.py
  function normalize_address_record (line 125) | def normalize_address_record(address: str | dict, addr_map: dict = None,
  function normalize_addr_str (line 169) | def normalize_addr_str(addr_str: str, line2: str = None, city: str = None,
  function normalize_addr_dict (line 272) | def normalize_addr_dict(addr_dict: dict, addr_map: dict = None,
  function parse_address_string (line 326) | def parse_address_string(addr_str: str) -> dict:
  function handle_abnormal_occupancy (line 352) | def handle_abnormal_occupancy(parsed_addr: OrderedDict,
  function get_parsed_values (line 409) | def get_parsed_values(parsed_addr: OrderedDict, orig_val: str,
  function normalize_address_components (line 449) | def normalize_address_components(parsed_addr: OrderedDict,
  function normalize_numbered_streets (line 468) | def normalize_numbered_streets(parsed_addr: OrderedDict) -> OrderedDict:
  function normalize_directionals (line 491) | def normalize_directionals(parsed_addr: OrderedDict,
  function normalize_street_types (line 524) | def normalize_street_types(parsed_addr: OrderedDict,
  function normalize_occupancy_type (line 555) | def normalize_occupancy_type(parsed_addr: OrderedDict,
  function normalize_state (line 599) | def normalize_state(state: str | None) -> str | None:
  function normalize_city (line 615) | def normalize_city(city: str):
  function get_normalized_line_segment (line 624) | def get_normalized_line_segment(parsed_addr: OrderedDict,
  function get_addr_line_str (line 640) | def get_addr_line_str(addr_dict: dict, addr_parts: [str] = None,
  function format_address_record (line 669) | def format_address_record(address: dict) -> str:
  function get_geocoder_normalized_addr (line 680) | def get_geocoder_normalized_addr(address: dict | str,
  function get_ordinal_indicator (line 710) | def get_ordinal_indicator(number: int) -> str:
  class NormalizeAddress (line 735) | class NormalizeAddress(object):
    method __init__ (line 769) | def __init__(self, address, addr_map=None, addtl_funcs=None,
    method get_normalized_line_1 (line 781) | def get_normalized_line_1(parsed_addr, line_labels=LINE1_USADDRESS_LAB...
    method get_normalized_line_2 (line 785) | def get_normalized_line_2(parsed_addr, line_labels=LINE2_USADDRESS_LAB...
    method normalize (line 788) | def normalize(self):
    method normalize_addr_str (line 796) | def normalize_addr_str(self, addr_str,  # type: str
    method normalize_addr_dict (line 874) | def normalize_addr_dict(self):
    method normalize_city (line 904) | def normalize_city(self, parsed_addr, addr_str, city=None):
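
The `get_ordinal_indicator` entry listed above is exercised by tests that expect 'th' for 11, 112 and 3113, 'st' for 1, 'nd' for 22 and 'rd' for 31243. A minimal sketch of logic that satisfies those expectations is shown below. The name `ordinal_suffix` is hypothetical, and this is not necessarily how scourgify/normalize.py implements it.

```python
def ordinal_suffix(number: int) -> str:
    """Return the English ordinal suffix for a non-negative integer.

    Hypothetical stand-in for get_ordinal_indicator; assumes number >= 0.
    """
    # Numbers whose last two digits fall in 10..20 always take 'th'
    # (11th, 112th, 3113th).
    if 10 <= number % 100 <= 20:
        return 'th'
    # Otherwise the last digit decides; anything but 1, 2, 3 takes 'th'.
    return {1: 'st', 2: 'nd', 3: 'rd'}.get(number % 10, 'th')


print('100' + ordinal_suffix(100))  # prints "100th"
```

Checking the last two digits before the last digit is what keeps 11, 12 and 13 from being rendered as "11st", "12nd" and "13rd".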

FILE: scourgify/tests/test_address_normalization.py
  class TestAddressNormalization (line 61) | class TestAddressNormalization(TestCase):
    method setUp (line 65) | def setUp(self):
    method test_normalize_address_record (line 138) | def test_normalize_address_record(self):
    method test_normalize_class (line 160) | def test_normalize_class(self):
    method test_normalize_addr_str (line 182) | def test_normalize_addr_str(self):
    method test_normalize_addr_dict (line 229) | def test_normalize_addr_dict(self):
    method test_parse_address_string (line 251) | def test_parse_address_string(self):
    method test_normalize_occupancies (line 260) | def test_normalize_occupancies(self):
  class TestAddressNormalizationUtils (line 359) | class TestAddressNormalizationUtils(TestCase):
    method setUp (line 362) | def setUp(self):
    method test_get_parsed_values (line 395) | def test_get_parsed_values(self):
    method test_get_norm_line_segment (line 415) | def test_get_norm_line_segment(self):
    method test_normalize_numbered_streets (line 431) | def test_normalize_numbered_streets(self):
    method test_normalize_directionals (line 461) | def test_normalize_directionals(self):
    method test_normalize_street_types (line 495) | def test_normalize_street_types(self):
    method test_normalize_occupancy_type (line 538) | def test_normalize_occupancy_type(self):
    method test_normalize_state (line 544) | def test_normalize_state(self):
    method test_pre_clean_addr_str (line 558) | def test_pre_clean_addr_str(self):
    method test_post_clean_addr_str (line 567) | def test_post_clean_addr_str(self):
    method test_validate_address (line 577) | def test_validate_address(self):
    method test_validate_postal_code (line 624) | def test_validate_postal_code(self):
    method test_get_addr_line_str (line 671) | def test_get_addr_line_str(self):
    method test_get_geocoder_normalized_addr (line 701) | def test_get_geocoder_normalized_addr(self, mock_geocoder):
    method test_get_ordinal_indicator (line 725) | def test_get_ordinal_indicator(self):
    method test_clean_period_char (line 749) | def test_clean_period_char(self):
    method test_validate_parens_group_parsed (line 756) | def test_validate_parens_group_parsed(self):
    method test_clean_ambiguous_street_types (line 766) | def test_clean_ambiguous_street_types(self):
    method test_address_normalization_error (line 777) | def test_address_normalization_error(self):
    method test_set_constants (line 786) | def test_set_constants(self, mock_config_get):
    method test_handle_abnormal_occupancy (line 820) | def test_handle_abnormal_occupancy(self):

FILE: scourgify/tests/test_cleaning.py
  class CleaningTests (line 15) | class CleaningTests(TestCase):
    method test_strip_occupancy_type (line 17) | def test_strip_occupancy_type(self):

FILE: scourgify/validations.py
  function _get_substrings_with_regex (line 32) | def _get_substrings_with_regex(string, pattern=None):
  function validate_address_components (line 49) | def validate_address_components(address_dict, strict=True):
  function validate_us_postal_code_format (line 84) | def validate_us_postal_code_format(postal_code, address):
  function validate_parens_groups_parsed (line 129) | def validate_parens_groups_parsed(line1):
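
The postal code handling in `validate_us_postal_code_format` can be illustrated with a standalone sketch that does not import scourgify. The function name `normalize_us_zip` is hypothetical. It raises `ValueError` where the library raises `AddressValidationError`, uses `str.isdigit` in place of `int()`, and checks the length of each hyphen-separated part, which is slightly stricter than the combined-length check in scourgify/validations.py.

```python
def normalize_us_zip(postal_code: str) -> str:
    """Zero-pad a ZIP or ZIP+4 string; raise ValueError if malformed.

    Hypothetical sketch, not the scourgify implementation.
    """
    error = ValueError('not a ZIP or ZIP+4 code: {!r}'.format(postal_code))
    parts = postal_code.strip().split('-')
    # Every hyphen-separated part must be non-empty and digits only.
    if not all(part.isdigit() for part in parts):
        raise error
    if len(parts) == 2:
        zip5, plus4 = parts
        if len(zip5) > 5 or len(plus4) > 4:
            raise error
        return '{}-{}'.format(zip5.zfill(5), plus4.zfill(4))
    if len(parts) > 2:
        raise error
    digits = parts[0]
    if len(digits) == 9:
        # Nine bare digits are read as ZIP+4 without the hyphen.
        return '{}-{}'.format(digits[:5], digits[5:])
    if len(digits) > 5:
        raise error
    return digits.zfill(5)


print(normalize_us_zip('972'))        # prints "00972"
print(normalize_us_zip('972190001'))  # prints "97219-0001"
```

Zero-padding on the left restores codes that lost leading zeros, for example after passing through a spreadsheet that stored them as integers.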

About this extraction

This page contains the full source code of the GreenBuildingRegistry/usaddress-scourgify GitHub repository, extracted and formatted as plain text. The extraction includes 22 files (111.0 KB), approximately 29.9k tokens, and a symbol index with 80 extracted functions, classes, methods, constants, and types.

